sre
erik 12 months ago
archived the channel
JoeH almost 2 years ago
How do you know how much availability an improvement has added? Are you supposed to track the SLIs and adjust? It seems like infrequent high severity events would make most SLOs just guesses.
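For intuition, availability over a window is one minus the downtime fraction, which is why a single rare, long incident dominates the number (and why short windows make SLO estimates noisy). A toy calculation with hypothetical figures:

```python
def estimated_availability(window_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the window the service was up."""
    return 1.0 - downtime_minutes / window_minutes

# One 90-minute outage in a 30-day window:
window = 30 * 24 * 60  # 43200 minutes
print(round(estimated_availability(window, 90), 5))  # 0.99792, ~99.79%
```

Over a 90-day window the same outage weighs a third as much, which is one reason to track SLIs continuously rather than back-solving availability from incident history.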
Pawel Rein almost 2 years ago (edited)
What open source tools would you use to build globally distributed synthetic monitoring (multi-step API testing)? Names that show up in my research so far: k6, Artillery? Grafana Cloud for the distributed part? Anything for doing DIY globally distributed on AWS?
jaysun about 2 years ago
What are people's thoughts on honeycomb.io vs Datadog? Datadog's costs (shocker) and the complexity around its costs continue to be a turn-off, and Honeycomb's vision (observability 2.0) was an interesting read. How do the two compare in practice?
Joe Perez over 2 years ago
Anyone using Backstage? And any guidance on upgrading Postgres versions?
Arivu over 2 years ago
@Shahar Glazner Hi, I need a dashboard for the alerts triggered from a specific resource group, monthly or weekly.
Andy over 2 years ago
Hi all. Does anyone use the Datadog APM? We're finding it useful, but the billing is hurting us (you get billed per EC2 instance). In our EKS cluster we use c5.2xlarge spot instances to try to keep the overall number of servers down. Just wondering if others have this issue and have any solutions.
Shahar Glazner over 2 years ago
what help do you need?
Arivu over 2 years ago
I need assistance creating an alert dashboard.
jonjitsu over 2 years ago
Any recommendations for monitoring Hadoop clusters and workloads?
Sean over 2 years ago (edited)
Any opinions on SigNoz (the open-source version) as an alternative to "build your own" or Datadog? https://signoz.io/
It's up to 13k stars on GitHub, so popularity is clearly growing.
jonjitsu about 3 years ago
Any suggestions on methods of monitoring logs for a specific sequence of events and firing alerts based on them?
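In the absence of a dedicated tool, the core of this is a small state machine over the log stream: advance on the next expected event, reset when the window expires. A minimal sketch (the event names and window are hypothetical):

```python
class SequenceDetector:
    """Fire a callback when `pattern` events occur in order within `window_s` seconds."""

    def __init__(self, pattern, window_s, on_match):
        self.pattern = pattern      # ordered list of expected event names
        self.window_s = window_s
        self.on_match = on_match
        self.progress = []          # timestamps of the matched prefix so far

    def feed(self, event, ts):
        # The whole sequence must complete within window_s of its first event;
        # otherwise matching restarts from scratch.
        if self.progress and ts - self.progress[0] > self.window_s:
            self.progress = []
        if event == self.pattern[len(self.progress)]:
            self.progress.append(ts)
            if len(self.progress) == len(self.pattern):
                self.on_match(list(self.progress))
                self.progress = []

# Usage: alert if two failed logins are followed by a password reset within 60s.
hits = []
d = SequenceDetector(["login_failed", "login_failed", "password_reset"], 60, hits.append)
for i, ev in enumerate(["login_failed", "noise", "login_failed", "password_reset"]):
    d.feed(ev, ts=i)
print(len(hits))  # 1
```

In practice you would feed this from a log tailer; the same shape also maps onto alerting rules in systems that support multi-condition correlation.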
Ronak Jain about 3 years ago
Hi guys,
Can anyone help me with this deployment process?
/etc/systemd/system/gunicorn.socket:
[Unit]
Description=gunicorn socket
[Socket]
ListenStream=/run/gunicorn.sock
[Install]
WantedBy=sockets.target

/etc/systemd/system/gunicorn.service:
[Unit]
Description=Gunicorn service
Requires=gunicorn.socket
After=network.target
[Service]
User=ubuntu
Group=www-data
WorkingDirectory=/home/ubuntu/PocEnvironment
Environment="PATH=/PocEnvironment/env/bin"
ExecStart=/usr/bin/gunicorn \
  --access-logfile - \
  --workers 3 \
  --bind unix:/run/gunicorn.sock \
  PocEnvironment.src.runner:application
[Install]
WantedBy=multi-user.target

/etc/nginx/sites-enabled:
server {
    listen 80;
    server_name {IP Address};
    location /static/ {
        root /home/ubuntu/PocEnvironment;
    }
    location / {
        include proxy_params;
        proxy_pass http://unix:/run/gunicorn.sock;
        # proxy_pass http://0.0.0.0:8000;
    }
}

Applied these commands:
sudo systemctl daemon-reload
sudo systemctl restart gunicorn
sudo systemctl status gunicorn
Niv Weiss about 3 years ago
Hey, we're using EKS Fargate and monitoring it with CloudWatch in the meantime.
1. Which metrics do you monitor?
2. Are you using any observability tools other than CloudWatch that work well with EKS Fargate nodes?
Ruan Arcega over 3 years ago
Hi there
Is there a Kafka channel?
mimo over 3 years ago
Just added an Elasticsearch dashboard after I deployed the Prometheus Operator with Elasticsearch on my k8s cluster, but it doesn't seem to work. What did I miss?
sheldonh over 3 years ago
I wish Datadog had put more effort into the core features of an incident tool.
Their incident tool is refreshingly simple, but missing some key things for it to be viable.
Was hoping to get folks on it as a "simple version of OpsGenie", but it's missing:
• A dedicated app/alerting (at least for incident stuff)
• Easy Slack integration. If I open an incident in a channel, I can't have all updates piped through. I have to have it create a dedicated channel, and at my place that's not possible.
• Links to other things in Datadog don't automatically prettify.
• No escalation policy/team schedule for handling.
So many things missing. It seems like it would be a really nice experience being in a single place if it weren't just a barebones way to organize a chat.
Andy almost 4 years ago
Hi all. Do any teams monitor the status of their infrastructure independently of their website/service?
I'm asking in terms of separating the SLA responsibility:
1. The uptime of infrastructure is the DevOps team's responsibility
2. The uptime of the website/service is the Development team's responsibility
Or do teams tend to say the uptime of the website/service is a shared responsibility between DevOps and Developers?
erik about 4 years ago
has renamed the channel from "monitoring" to "sre"
Monika Sadlok over 4 years ago
Hi, what is the best solution to monitor an ECS Fargate cluster and tasks using Prometheus and Grafana?
Chris Picht over 4 years ago
Anyone have experience with both Elastic.co and Datadog? Am I crazy, or is Datadog actually cheaper for most use cases?
bradym over 4 years ago
Curious if anyone's using New Relic for monitoring/logging/alerting? Thoughts about it?
Pierre-Yves over 4 years ago
Hi, if I have two redundant Thanos servers with a Thanos store writing to a common block storage, are the data written twice? And stored twice?
Each Thanos store has a cached list of the data written, but this cache is not shared.
Partha over 4 years ago
Hi all,
report.CRITICAL: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on
Please help with this Elasticsearch problem.
btai over 4 years ago (edited)
What other vendored infra (Kubernetes) monitoring solutions are people using, not named Datadog?
Erik Osterman (Cloud Posse) almost 5 years ago
Pierre-Yves almost 5 years ago (edited)
Hello, in Prometheus server: I didn't manage to move a job with azure_sd_config to the file_sd directory; it seems that this target type is only for static config, and not for service discovery jobs.
Is there a way to move Prometheus jobs out of prometheus.yml? (To keep things organized and not in a single file.)
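For what it's worth: as far as I know, file_sd_configs only externalizes target lists, not whole scrape jobs, so it can't wrap azure_sd_config, and the job definitions themselves have to stay in prometheus.yml. A sketch of the pattern it does support (paths are illustrative):

```yaml
# prometheus.yml -- the job definition stays here; only its targets move out
scrape_configs:
  - job_name: externally-managed-targets
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json   # re-read on change, no reload needed
```

Each matched file holds entries like `[{"targets": ["host:9100"], "labels": {"env": "prod"}}]`.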
Pierre-Yves almost 5 years ago (edited)
I was looking for examples of Prometheus Alertmanager rules, and found that many are listed here:
https://awesome-prometheus-alerts.grep.to/rules
uselessuseofcat almost 5 years ago
Hi, is there any way to archive New Relic logs to an S3 bucket so I have them saved after the 30 days of retention?
Andrew Nazarov almost 5 years ago
Has anybody tried this service: https://www.netdata.cloud/? I didn't get the trick; no prices found.
bradym almost 5 years ago (edited)
We're currently testing out an ELK stack deployed via AWS Elasticsearch and I'm having a heck of a time understanding what permissions I'd need to give engineers for them to do things like create saved searches, create visualizations and notebooks. Anyone know a good reference for this? Maybe I'm just missing it somehow, but I've not been able to find anything like this in the documentation. Not sure if this is the best place to ask this, if there's somewhere better please let me know.
Eric Berg almost 5 years ago
Regarding custom metrics (we're an AWS/k8s/Datadog shop): I'm trying to get ahead of my developers on the issue of custom metrics and how to represent ratios of successful or failed requests/events. For example, we have a routine for which we want to track success/failure as well as latency.
One approach is to have a single metric for all of these events and add a tag for result where the values are success and fail. Another approach is to have discrete metrics for the success and failure counts... and maybe another one for the total number of requests.
I'd rather have separate metrics for success and failure, and one for the total number of requests.
Thanks for any input you have on this.
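The trade-off can be sketched without any vendor SDK: with a single tagged counter the total is derived at query time, while discrete metrics need the total maintained (or computed) as its own series. A toy illustration with hypothetical metric names:

```python
from collections import Counter

# Option A: one metric, tagged by result; the total is the sum over tags.
tagged = Counter()
def track_tagged(success: bool):
    tagged[("myapp.requests", "result:success" if success else "result:fail")] += 1

# Option B: discrete metrics; the total is its own series.
discrete = Counter()
def track_discrete(success: bool):
    discrete["myapp.requests.success" if success else "myapp.requests.fail"] += 1
    discrete["myapp.requests.total"] += 1

for ok in [True, True, False, True]:
    track_tagged(ok)
    track_discrete(ok)

total_a = sum(tagged.values())              # derived: 4
total_b = discrete["myapp.requests.total"]  # maintained: 4
success_rate = tagged[("myapp.requests", "result:success")] / total_a  # 0.75
```

Worth noting for the billing angle: in Datadog, each unique metric-name/tag-value combination counts as a distinct custom metric, so a result tag with two values costs roughly the same as two discrete metrics; the tagged form mainly buys you guaranteed-consistent totals and simpler queries.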
Patrick Jahns almost 5 years ago (edited)
Are you aware of any other JSON logging format standard besides the Elastic Common Schema (https://www.elastic.co/what-is/ecs)? I've been searching a bit but haven't found anything more vendor-neutral so far.
Also, the OpenTelemetry spec regarding this aspect is, from my point of view, quite open: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#log-and-event-record-definition
btai almost 5 years ago
I've always struggled with this, but my use case is kind of special, so I'm curious whether anyone has run into it. We have a ton of Kubernetes deployments in our prod cluster (maybe 15-20k). We run deployments nightly, where we will have thousands of deployments of new pods. When this happens I get a ton of alerts for replica pods going down and unavailable deployment replicas detected. I believe this is somewhat normal as the pods get rotated. I wish I didn't have to resolve all the alerts, but at the same time I don't want to disable alerting during deployment time either. Anyone have a good workaround for this? (I'm testing out Datadog currently.)
Gareth about 5 years ago
Good afternoon all,
Can anybody recommend a way of filtering unwanted (or wanted) lines from logs on an EC2 instance within AWS using the unified CloudWatch agent?
As far as I'm aware, there isn't an ability to filter before ingestion into CloudWatch.
I believe AWS's recommendation is to filter the log into another log and then consume the filtered log. So, before I have to write something for CentOS and Windows, I wonder if anybody can recommend an app that could be used to transform/filter the logs?
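For what it's worth, more recent versions of the unified CloudWatch agent gained in-agent filtering: a collect_list entry can carry include/exclude regex filters, so matching lines are dropped before ingestion. A sketch (file path, log group name, and expressions are illustrative):

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/app.log",
            "log_group_name": "app",
            "filters": [
              { "type": "exclude", "expression": "DEBUG|healthcheck" }
            ]
          }
        ]
      }
    }
  }
}
```

If upgrading the agent isn't an option, the filter-into-another-log approach described above remains the fallback.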
Shtrull about 5 years ago (edited)
HELP
I have the following prom query (to reduce the results while testing, I limited it to a specific reader_id):
irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
And here is a cleaned-up result (I have manually removed the ns, container, svc, etc.):
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 108
{branch_id="3689", chain_id="3390", channel="37", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 163
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 77
{branch_id="3689", chain_id="3390", channel="38", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 121
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.236:8188", pod="my-super-pod-67cd5b7d5f-pn754", reader_id="10000" } = 86
{branch_id="3689", chain_id="3390", channel="39", instance="10.4.2.40:8188", pod="my-super-pod-67cd5b7d5f-dwmkw", reader_id="10000" } = 131
Before k8s they had one "pod", so if they did sum by (branch_id) they got the right results; but because the pods are dynamic, they now get the results doubled.
And since each pod queries its database at slightly different times, the values are off by a bit.
Is there an elegant way to first run avg on both pods, and then run sum by branch_id?
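One way to get what's being asked for is to aggregate the pod dimension away with an inner avg, then sum per branch. A sketch (grouping labels assumed from the output above; if multiple readers share a channel, add reader_id to the inner grouping):

```promql
sum by (branch_id) (
  avg by (branch_id, channel) (
    irate(nexite_reader_all_packets_per_channel_total{reader_id="10000"}[1m])
  )
)
```

The inner avg collapses the two pods per channel into one series; the outer sum then adds the channels per branch.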
Pierre Humberdroz about 5 years ago (edited)
Does someone know of a helm chart for a good/updated query exporter?
https://github.com/albertodonato/query-exporter
https://github.com/free/sql_exporter
https://github.com/justwatchcom/sql_exporter
Joan Porta about 5 years ago
Hi guys! In k8s, I want to use the OpenTelemetry Collector to gather logs. In the cluster I have multiple apps. Is it possible to avoid needing an OpenTelemetry agent sidecar in each app pod, and just use the DaemonSet? I don't want the extra overhead of putting a sidecar on every app pod.
Kareem about 5 years ago
Has anyone had success trying out Datadog Real User Monitoring (RUM)? Considering it and just curious about anybody's experiences. Also open to alternatives for tracking user events and behavior, more so for troubleshooting client-side interactions rather than analytics.
btai about 5 years ago
Anyone have suggestions for a good Postgres monitoring system (inefficient SQL, debugging IOPS spikes) that can run completely on-prem?
joshmyers about 5 years ago
Anyone tried the AWS Grafana/Prometheus services? Thoughts?
Alex Jurkiewicz about 5 years ago
Do you use AWS? Upload your images to an ECR registry and use its built-in security scan. (The registry can be unused otherwise.)
zeid about 5 years ago
I'd like to be able to scan a registry on new images, approve them, and also configure admission controllers to only allow scanned images
zeid about 5 years ago
Anyone have any experience with https://snyk.io/? I'm looking at security/monitoring products that we can use throughout the dev lifecycle.
Tim Birkett about 5 years ago (edited)
Hi - not sure that this belongs in here, but it is loosely related to observability (logging). Currently running an EFK stack, and I configured fluent(-bit|d) to partition logs by namespace (i.e. an index per namespace like fluentd-<namespace>-YYYY.MM.DD). This is great: it means no field type collisions between teams, teams can search and visualise (I'm from the UK) based on their own indexes (making queries kinder to Elasticsearch), and we can create finer-grained Curator configurations (keep fewer noisy namespace logs around).
The things that have been a bit annoying:
• Having to add another index pattern to Kibana every time a new namespace pops up
• Having to regularly refresh field lists on index patterns as log fields evolve over time. It was okay with 4 or 5 index patterns, but is now a bit tedious with 40+. Developers forget to do it and have issues searching and visualising new logs.
Today I've spent a day having a bit of a hack, and have a script that:
1. Keeps Kibana index patterns in sync with the indexes in Elasticsearch based on a prefix
2. Updates index pattern field lists based on the presence of an environment variable
It's over at https://github.com/devopsmakers/kibana-index-pattern-creator - the script works well and has a DRY_RUN mode. My next step is to get an image up on Dockerhub and a Helm chart up and running to deploy it to Kubernetes as a CronJob (or 2) - oh, and rewrite the README. Hopefully it comes in handy to others.