How are y’all scaling prometheus-server when a single instance can’t handle the volume of metrics (and starts OOMing)?
We are adopting the Prometheus Operator and debating between:
• a) “One Prometheus per namespace”: we have >100 namespaces and a few dozen teams, so this would let us report per namespace and isolate the impact of a problem to that namespace.
• b) “Functional sharding”: Prometheus shard X scrapes all pods of Services A, B and C, while shard Y scrapes pods of Services D, E and F.
• c) “Automatic sharding”: targets are assigned to Prometheus shards based on their addresses.
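To make option c) concrete, here is a minimal Python sketch of address-based sharding. It assumes the split is done the way Prometheus' `hashmod` relabel action works (to my understanding: MD5 the source label value, here `__address__`, read the last 8 bytes of the digest as a big-endian uint64, then take it modulo the shard count), with each shard keeping only targets whose result equals its own index. The function names and the sample addresses are hypothetical, for illustration only.

```python
import hashlib


def hashmod_shard(address: str, modulus: int) -> int:
    """Approximate Prometheus' `hashmod` relabel action (assumption):
    MD5 of the label value, last 8 digest bytes as a big-endian uint64, mod N."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(addresses, shard, total_shards):
    """Targets a given shard would keep (like `action: keep` on a matching hash)."""
    return [a for a in addresses if hashmod_shard(a, total_shards) == shard]


if __name__ == "__main__":
    # Hypothetical pod addresses; real ones come from service discovery.
    addrs = [f"10.0.{i // 250}.{i % 250}:8080" for i in range(1000)]
    total = 3
    for s in range(total):
        print(f"shard {s}: {len(targets_for_shard(addrs, s, total))} targets")
```

Note the trade-off this implies: every target lands on exactly one shard, but each shard ends up holding a slice of every team's series, so you lose the per-namespace isolation of option a) and need a global query layer across shards. Also, because the assignment is a plain modulo, changing the shard count reshuffles most targets between shards.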