sreArchived
13 messages
Tom de Vriesover 5 years ago
Hi, I'm curious to find out how you're handling non-urgent alerts coming in from your infrastructure. We've set up PagerDuty for all critical alerts that come in (e.g. service down, degraded performance), but we don't really have a proper process for handling non-urgent things, like "disk usage is over 80%". The latter would be something we'd need to act on, but not right now. Any suggestions or examples you're happy with on how you're handling this? We mainly use PagerDuty/Datadog/Slack for our monitoring and alerting.
Abel Luckover 5 years ago
I'm using an exporter that monitors a service. It doesn't have a boolean up/down metric to track the service's health state; rather, it has a failed scrape counter. Every failed scrape increments the counter.
What would an alert look like that uses that metric to tell me if the service is UP or DOWN?
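One way to reason about the question above, as a rough sketch rather than a definitive answer: treat the service as DOWN whenever the failed-scrape counter has increased within a recent window. If the exporter is scraped by Prometheus, that idea is usually written with the `increase()` function over a range and compared against zero. The Python below mirrors the same logic on a plain list of samples. The function names, the `(timestamp, value)` sample format, and the 5-minute default window are all assumptions for illustration, not part of any exporter's API.

```python
from typing import List, Tuple

# Hypothetical sample format: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def counter_increase(samples: List[Sample], now: float, window_s: float) -> float:
    """Sum the increase of a counter over the last `window_s` seconds.

    Samples must be sorted by timestamp. A drop in value is treated as a
    counter reset (e.g. the exporter restarted), so counting restarts at zero.
    """
    recent = [value for ts, value in samples if now - window_s <= ts <= now]
    total = 0.0
    for prev, cur in zip(recent, recent[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def service_is_down(samples: List[Sample], now: float, window_s: float = 300.0) -> bool:
    """DOWN if at least one failed scrape was recorded inside the window."""
    return counter_increase(samples, now, window_s) > 0


# Counter flat for the whole window: no new failures, so the service counts as UP.
healthy = [(0.0, 3.0), (60.0, 3.0), (120.0, 3.0)]
# Counter climbing inside the window: failures are still happening, so DOWN.
failing = [(0.0, 3.0), (60.0, 4.0), (120.0, 5.0)]

print(service_is_down(healthy, now=120.0))
print(service_is_down(failing, now=120.0))
```

The window length is the main tuning knob: a short window clears the alert soon after scrapes succeed again, while a longer one is less likely to miss a single failure. Note that this only sees failures the exporter itself reports; if the exporter is unreachable, the counter stops moving and this check alone would not notice.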
shamil.kashmeriover 5 years ago
Hi all, wondering if anyone has any experience with external metrics from CloudWatch via awslabs/k8s-cloudwatch-adapter. I'm experiencing what seem to be weird RBAC issues: the adapter sets up seemingly fine, and I was able to deploy my custom metric, but I'm seeing a bunch of permissions issues in the logs.
I0717 03:51:59.474073 1 request.go:947] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"externalmetrics.metrics.aws is forbidden: User \"system:serviceaccount:custom-metrics:k8s-cloudwatch-adapter\" cannot list resource \"externalmetrics\" in API group \"metrics.aws\" at the cluster scope","reason":"Forbidden","details":{"group":"metrics.aws","kind":"externalmetrics"},"code":403}
E0717 03:51:59.474831 1 reflector.go:125] github.com/awslabs/k8s-cloudwatch-adapter/pkg/client/informers/externalversions/factory.go:114: Failed to list *v1alpha1.ExternalMetric: externalmetrics.metrics.aws is forbidden: User "system:serviceaccount:custom-metrics:k8s-cloudwatch-adapter" cannot list resource "externalmetrics" in API group "metrics.aws" at the cluster scope
I0717 03:52:00.234254 1 authorization.go:73] Forbidden: "/", Reason: ""
btaiover 5 years ago
I have been getting random 1-minute connection timeout alerts from my uptime service (uptimerobot) for a few of my sites. My ELB logs show no record of those uptime checks, nor are there connection errors in the ELB metrics. Assuming the problem isn't on the uptime service's side, is there anything else I'm missing that I should check? (Note: I still continue to get uptime requests for the other sites hosted behind the ELB during that time.)
Joe Nilandover 5 years ago
Can anyone recommend a SaaS that can combine CloudWatch metrics with other sources? Datadog seems to be one.
Erik Osterman (Cloud Posse)over 5 years ago
a little bit underwhelming. doesn't handle monitoring configuration, only agent configuration.
Eric Bergover 5 years ago
We're at the point where we need to set up something like PagerDuty. I've heard OpsGenie mentioned here and we are an Atlassian Cloud shop, but I've used PD in the past. We're a small shop at this point (< 20 devs/ops people) and we'll start with just one or two rotas.
Sorry if this has been discussed before, but any input or suggestions to help make the choice would be appreciated.
Zachover 5 years ago
For that size I think OpsGenie would probably be a good fit for you
Zachover 5 years ago(edited)
We’re using it as well, used to be on PagerDuty. PD is the ‘industry standard’ workhorse for alerting but IMO it’s showing its age. And it isn’t priced very well for smaller companies
btaiover 5 years ago(edited)
We use VictorOps here, pricing is roughly similar to OpsGenie iirc. If you use Jira, then OpsGenie is an Atlassian product, so I imagine the integration to automatically create actionable Jira tickets would be pretty good
Erik Osterman (Cloud Posse)over 5 years ago
Aha! Now that's what I was looking for...