sre
erik 12 months ago
archived the channel
JoeH almost 2 years ago
How do you know how much availability an improvement has added? Are you supposed to track the SLIs and adjust? It seems like infrequent high severity events would make most SLOs just guesses.
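For intuition, availability over a window is one minus the downtime fraction, which is why a single rare, long incident dominates the number (and why short windows make SLO estimates noisy). A toy calculation with hypothetical figures:

```python
def estimated_availability(window_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the window the service was up."""
    return 1.0 - downtime_minutes / window_minutes

# One 90-minute outage in a 30-day window:
window = 30 * 24 * 60  # 43200 minutes
print(round(estimated_availability(window, 90), 5))  # 0.99792, ~99.79%
```

Over a 90-day window the same outage weighs a third as much, which is one reason to track SLIs continuously rather than back-solving availability from incident history.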
Pawel Rein almost 2 years ago (edited)
What open source tools would you use to build globally distributed synthetic monitoring (multi-step API testing)? Names that show up in my research so far: k6, Artillery? Grafana Cloud for the distributed part? Anything for doing DIY globally distributed on AWS?
jaysun about 2 years ago
What are people's thoughts on honeycomb.io vs Datadog? Datadog's costs (shocker) and the complexity around its costs continue to be a turn-off, and Honeycomb's vision (observability 2.0) was an interesting read. How do the two compare in practice?
Joe Perez over 2 years ago
Anyone using Backstage? And any guidance on upgrading Postgres versions?
Arivu over 2 years ago
@Shahar Glazner Hi, I need a dashboard for the alerts triggered from a specific resource group, monthly or weekly.
Andy over 2 years ago
Hi all. Does anyone use the Datadog APM? We're finding it useful, but the billing is hurting us (you get billed per EC2 instance). In our EKS cluster we use c5.2xlarge spot instances to try to keep the overall number of servers down. Just wondering if others have this issue and have any solutions.
Shahar Glazner over 2 years ago
what help do you need?
Arivu over 2 years ago
I need assistance creating an alert dashboard.
jonjitsu over 2 years ago
Any recommendations for monitoring Hadoop clusters and workloads?
Sean over 2 years ago (edited)
Any opinions on SigNoz (the open-source version) as an alternative to "build your own" or Datadog? https://signoz.io/
It's up to 13k stars on GitHub, so popularity is clearly growing.
jonjitsu about 3 years ago
Any suggestions on methods of monitoring logs for a specific sequence of events and firing alerts based on them?
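In the absence of a dedicated tool, the core of this is a small state machine over the log stream: advance on the next expected event, reset when the window expires. A minimal sketch (the event names and window are hypothetical):

```python
class SequenceDetector:
    """Fire a callback when `pattern` events occur in order within `window_s` seconds."""

    def __init__(self, pattern, window_s, on_match):
        self.pattern = pattern      # ordered list of expected event names
        self.window_s = window_s
        self.on_match = on_match
        self.progress = []          # timestamps of the matched prefix so far

    def feed(self, event, ts):
        # The whole sequence must complete within window_s of its first event;
        # otherwise matching restarts from scratch.
        if self.progress and ts - self.progress[0] > self.window_s:
            self.progress = []
        if event == self.pattern[len(self.progress)]:
            self.progress.append(ts)
            if len(self.progress) == len(self.pattern):
                self.on_match(list(self.progress))
                self.progress = []

# Usage: alert if two failed logins are followed by a password reset within 60s.
hits = []
d = SequenceDetector(["login_failed", "login_failed", "password_reset"], 60, hits.append)
for i, ev in enumerate(["login_failed", "noise", "login_failed", "password_reset"]):
    d.feed(ev, ts=i)
print(len(hits))  # 1
```

In practice you would feed this from a log tailer; the same shape also maps onto alerting rules in systems that support multi-condition correlation.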
Ronak Jain about 3 years ago
Hi guys,
Can anyone help me with this deployment process?
/etc/systemd/system/gunicorn.socket:
[Unit]
Description=gunicorn socket
[Socket]
ListenStream=/run/gunicorn.sock
[Install]
WantedBy=sockets.target

/etc/systemd/system/gunicorn.service:
[Unit]
Description=Gunicorn service
Requires=gunicorn.socket
After=network.target
[Service]
User=ubuntu
Group=www-data
WorkingDirectory=/home/ubuntu/PocEnvironment
Environment="PATH=/PocEnvironment/env/bin"
ExecStart=/usr/bin/gunicorn \
  --access-logfile - \
  --workers 3 \
  --bind unix:/run/gunicorn.sock \
  PocEnvironment.src.runner:application
[Install]
WantedBy=multi-user.target

/etc/nginx/sites-enabled:
server {
    listen 80;
    server_name {IP Address};
    location /static/ {
        root /home/ubuntu/PocEnvironment;
    }
    location / {
        include proxy_params;
        proxy_pass http://unix:/run/gunicorn.sock;
        # proxy_pass http://0.0.0.0:8000;
    }
}

Applied these commands:
sudo systemctl daemon-reload
sudo systemctl restart gunicorn
sudo systemctl status gunicorn
Niv Weiss about 3 years ago
Hey, we're using EKS Fargate and monitoring it with CloudWatch in the meantime.
1. Which metrics do you monitor?
2. Are you using any observability tools other than CloudWatch that work well with EKS Fargate nodes?
Ruan Arcega over 3 years ago
Hi there
Is there a Kafka channel?
mimo over 3 years ago
Just added an Elasticsearch dashboard after I deployed the Prometheus Operator with Elasticsearch on my k8s cluster, but it doesn't seem to work. What did I miss?
sheldonh over 3 years ago
I wish Datadog had put more effort into the core features of an incident tool.
Their incident tool is refreshingly simple, but missing some key things for it to be viable.
Was hoping to get folks on it as a "simple version of OpsGenie", but it's missing:
• A dedicated app/alerting (at least for incident stuff)
• Easy Slack integration. If I open an incident in a channel, I can't have all updates piped through. I have to have it create a dedicated channel, and at my place that's not possible.
• Links to other things in Datadog don't automatically prettify.
• No escalation policy/team schedule for handling.
So many things missing. It seems like it would be a really nice experience being in a single place if it weren't just a barebones way to organize a chat.
Andy almost 4 years ago
Hi all. Do any teams monitor the status of their infrastructure independently of their website/service?
I'm asking in terms of separating the SLA responsibility:
1. The uptime of infrastructure is the DevOps team's responsibility
2. The uptime of the website/service is the Development team's responsibility
Or do teams tend to say the uptime of the website/service is a shared responsibility between DevOps and Developers?
erik about 4 years ago
has renamed the channel from "monitoring" to "sre"
Monika Sadlok over 4 years ago
Hi, what is the best solution to monitor an ECS Fargate cluster and tasks using Prometheus and Grafana?
Chris Picht over 4 years ago
Anyone have experience with both Elastic.co and Datadog? Am I crazy, or is Datadog actually cheaper for most use cases?
bradym over 4 years ago
Curious if anyone's using New Relic for monitoring/logging/alerting? Thoughts about it?
Pierre-Yves over 4 years ago
Hi, if I have two redundant Thanos servers with a Thanos store writing to a common block storage, are the data written twice? And stored twice?
Each Thanos store has a cached list of the data written, but this cache is not shared.
Partha over 4 years ago
Hi all,
report.CRITICAL: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on
Please help with this Elasticsearch problem.
btai over 4 years ago (edited)
What other vendored infra (Kubernetes) monitoring solutions are people using, not named Datadog?
Erik Osterman (Cloud Posse) almost 5 years ago
Pierre-Yves almost 5 years ago (edited)
Hello, in Prometheus server: I didn't manage to move a job with azure_sd_config to the file_sd directory; it seems that this target type is only for static config, and not for service discovery jobs.
Is there a way to move Prometheus jobs out of prometheus.yml? (To keep things organized and not in a single file.)
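For what it's worth: as far as I know, file_sd_configs only externalizes target lists, not whole scrape jobs, so it can't wrap azure_sd_config, and the job definitions themselves have to stay in prometheus.yml. A sketch of the pattern it does support (paths are illustrative):

```yaml
# prometheus.yml -- the job definition stays here; only its targets move out
scrape_configs:
  - job_name: externally-managed-targets
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json   # re-read on change, no reload needed
```

Each matched file holds entries like `[{"targets": ["host:9100"], "labels": {"env": "prod"}}]`.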
Pierre-Yves almost 5 years ago (edited)
I was looking for examples of Prometheus Alertmanager rules, and found that many are listed here:
https://awesome-prometheus-alerts.grep.to/rules
uselessuseofcat almost 5 years ago
Hi, is there any way to archive New Relic logs to an S3 bucket so I have them saved after the 30 days of retention?
Andrew Nazarov almost 5 years ago
Has anybody tried this service: https://www.netdata.cloud/? I didn't get the trick; no prices found.
bradym almost 5 years ago (edited)
We're currently testing out an ELK stack deployed via AWS Elasticsearch and I'm having a heck of a time understanding what permissions I'd need to give engineers for them to do things like create saved searches, create visualizations and notebooks. Anyone know a good reference for this? Maybe I'm just missing it somehow, but I've not been able to find anything like this in the documentation. Not sure if this is the best place to ask this, if there's somewhere better please let me know.
Eric Berg almost 5 years ago
Regarding custom metrics (we're an AWS/k8s/Datadog shop): I'm trying to get ahead of my developers on the issue of custom metrics and how to represent ratios of successful or failed requests/events. For example, we have a routine for which we want to track success/failure as well as latency.
One approach is to have a single metric for all of these events and add a tag for result where the values are success and fail. Another approach is to have discrete metrics for the success and failure counts... and maybe another one for the total number of requests.
I'd rather have separate metrics for success and failure, and one for the total number of requests.
Thanks for any input you have on this.
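The trade-off can be sketched without any vendor SDK: with a single tagged counter the total is derived at query time, while discrete metrics need the total maintained (or computed) as its own series. A toy illustration with hypothetical metric names:

```python
from collections import Counter

# Option A: one metric, tagged by result; the total is the sum over tags.
tagged = Counter()
def track_tagged(success: bool):
    tagged[("myapp.requests", "result:success" if success else "result:fail")] += 1

# Option B: discrete metrics; the total is its own series.
discrete = Counter()
def track_discrete(success: bool):
    discrete["myapp.requests.success" if success else "myapp.requests.fail"] += 1
    discrete["myapp.requests.total"] += 1

for ok in [True, True, False, True]:
    track_tagged(ok)
    track_discrete(ok)

total_a = sum(tagged.values())              # derived: 4
total_b = discrete["myapp.requests.total"]  # maintained: 4
success_rate = tagged[("myapp.requests", "result:success")] / total_a  # 0.75
```

Worth noting for the billing angle: in Datadog, each unique metric-name/tag-value combination counts as a distinct custom metric, so a result tag with two values costs roughly the same as two discrete metrics; the tagged form mainly buys you guaranteed-consistent totals and simpler queries.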
Patrick Jahns almost 5 years ago (edited)
Are you aware of any other JSON logging format standard besides the Elastic Common Schema (https://www.elastic.co/what-is/ecs)? I've been searching a bit but haven't found anything more vendor-neutral so far.
Also, the OpenTelemetry spec regarding this aspect is, from my point of view, quite open: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#log-and-event-record-definition
btai almost 5 years ago
I've always struggled with this, but my use case is kind of special, so I'm curious whether anyone has run into it. We have a ton of Kubernetes deployments in our prod cluster (maybe 15-20k). We run deployments nightly, where we will have thousands of deployments of new pods. When this happens I get a ton of alerts for replica pods going down and unavailable deployment replicas detected. I believe this is somewhat normal as the pods get rotated. I wish I didn't have to resolve all the alerts, but at the same time I don't want to disable alerting during deployment time either. Anyone have a good workaround for this? (I'm testing out Datadog currently.)
Gareth about 5 years ago
Good afternoon all,
Can anybody recommend a way of filtering unwanted (or wanted) lines from logs on an EC2 instance within AWS using the unified CloudWatch agent?
As far as I'm aware, there isn't an ability to filter before ingestion into CloudWatch.
I believe AWS's recommendation is to filter the log into another log and then consume the filtered log. So, before I have to write something for CentOS and Windows, I wonder if anybody can recommend an app that could be used to transform/filter the logs?
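For what it's worth, more recent versions of the unified CloudWatch agent gained in-agent filtering: a collect_list entry can carry include/exclude regex filters, so matching lines are dropped before ingestion. A sketch (file path, log group name, and expressions are illustrative):

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/app.log",
            "log_group_name": "app",
            "filters": [
              { "type": "exclude", "expression": "DEBUG|healthcheck" }
            ]
          }
        ]
      }
    }
  }
}
```

If upgrading the agent isn't an option, the filter-into-another-log approach described above remains the fallback.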
Shtrull about 5 years ago (edited)
HELP
I have the following prom query (to reduce the results while testing, I limited it to a specific reader_id):
irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
And here is a cleaned-up result (I have manually removed the ns, container, svc, etc.):
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 108
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 163
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 77
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 121
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 86
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 131
Before k8s they had one "pod", so if they did sum by (branch_id) they got the right results; but because the pods are dynamic, they now get the results doubled.
And since each pod queries its database at slightly different times, the values are off by a bit.
Is there an elegant way to first run avg on both pods, and then run sum by branch_id?
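One way to get what's being asked for is to aggregate the pod dimension away with an inner avg, then sum per branch. A sketch (grouping labels assumed from the output above; if multiple readers share a channel, add reader_id to the inner grouping):

```promql
sum by (branch_id) (
  avg by (branch_id, channel) (
    irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
  )
)
```

The inner avg collapses the two pods per channel into one series; the outer sum then adds the channels per branch.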
Pierre Humberdroz about 5 years ago (edited)
Does someone know of a helm chart for a good/updated query exporter?
https://github.com/albertodonato/query-exporter
https://github.com/free/sql_exporter
https://github.com/justwatchcom/sql_exporter
Joan Porta about 5 years ago
Hi guys! In k8s, I want to use the OpenTelemetry Collector to gather logs. In the cluster I have multiple apps. Is it possible to avoid needing an OpenTelemetry agent sidecar in each app pod, and just use the DaemonSet? I don't want the extra overhead of putting a sidecar on every app pod.
Kareem about 5 years ago
Has anyone had success trying out Datadog Real User Monitoring (RUM)? Considering it and just curious about anybody's experiences. Also open to alternatives for tracking user events and behavior, more so for troubleshooting client-side interactions rather than analytics.
btai about 5 years ago
Anyone have suggestions for a good Postgres monitoring system (inefficient SQL, debugging IOPS spikes) that can run completely on-prem?
joshmyers about 5 years ago
Anyone tried the AWS Grafana/Prometheus services? Thoughts?
Alex Jurkiewicz about 5 years ago
Do you use AWS? Upload your images to an ECR registry and use its built-in security scan. (The registry can be unused otherwise.)
zeid about 5 years ago
I'd like to be able to scan a registry on new images, approve them, and also configure admission controllers to only allow scanned images
zeid about 5 years ago
Anyone have any experience with https://snyk.io/? I'm looking at security/monitoring products that we can use throughout the dev lifecycle.
Tim Birkett about 5 years ago (edited)
Hi - not sure that this belongs in here, but it is loosely related to observability (logging). Currently running an EFK stack, and I configured fluent(-bit|d) to partition logs by namespace (i.e. an index per namespace like fluentd-<namespace>-YYYY.MM.DD). This is great: it means no field type collisions between teams, teams can search and visualise (I'm from the UK) based on their own indexes (making queries kinder to Elasticsearch), and we can create finer-grained Curator configurations (keep fewer noisy namespace logs around).
The things that have been a bit annoying:
• Having to add another index pattern to Kibana every time a new namespace pops up
• Having to regularly refresh field lists on index patterns as log fields evolve over time. It was okay with 4 or 5 index patterns, but is now a bit tedious with 40+. Developers forget to do it and have issues searching and visualising new logs.
Today I've spent a day having a bit of a hack, and have a script that:
1. Keeps Kibana index patterns in sync with the indexes in Elasticsearch based on a prefix
2. Updates index pattern field lists based on the presence of an environment variable
It's over at https://github.com/devopsmakers/kibana-index-pattern-creator - the script works well and has a DRY_RUN mode. My next step is to get an image up on Dockerhub and a Helm chart up and running to deploy it to Kubernetes as a CronJob (or 2) - oh, and rewrite the README. Hopefully it comes in handy to others.