kubernetes
Nat G.7 days ago
Hey folks. Anyone here using Claude Code for building on K8s and wishing you could have it validate changes against your cluster without submitting a PR/waiting for CI? This tutorial walks through using Signadot to give Claude (or any agent) a live sandbox to test and fix its work before the PR.
What's your current setup for AI-driven debugging? https://www.signadot.com/docs/tutorials/rapid-local-debugging-claude-code
Antarr Byrd2 months ago
Are there any tools I can use to block certain countries? Not looking to use AWS or Cloudflare.
Jonathan4 months ago
(Cross-posting from #terraform since this is K8s-focused)
Hey folks, I built a new Kubernetes Terraform provider that might be interesting to you.
It solves a long-standing Terraform limitation: you can't create a cluster and deploy to it in the same apply. Providers are configured at the root, before resources exist, so you can't use a cluster's endpoint as provider config.
Most people work around this with two separate applies, some use null_resource hacks, others split everything into multiple stacks. After being frustrated by this for many years, I realized the only solution was to build a provider that sidesteps the whole problem with inline connections.
Example:
resource "k8sconnect_object" "app" {
cluster = {
host = aws_eks_cluster.main.endpoint
token = data.aws_eks_cluster_auth.main.token
}
yaml_body = file("app.yaml")
}Create cluster → deploy workloads → single apply. No provider configuration needed.
Building with Server-Side Apply from the ground up (rather than bolting it on) opened doors to fix other persistent community issues with existing providers.
• Accurate diffs - Server-side apply dry-run projections show actual changes, not client-side guesses
• YAML + validation - K8s strict schema validation catches typos at plan time
• CRD+CR same apply - Auto-retry handles eventual consistency (no more time_sleep)
• Patch resources - Modify EKS/GKE defaults without taking ownership
• Non-destructive waits - Timeouts don't force resource recreation
300+ tests, runnable examples for everything.
GitHub: https://github.com/jmorris0x0/terraform-provider-k8sconnect
Registry: https://registry.terraform.io/providers/jmorris0x0/k8sconnect/latest
Would love feedback if you've hit this pain point.
DevOpsNinja4 months ago
Hey everyone, we just shipped something and would love honest feedback from the community.
What we built: Kunobi is a new platform that brings Kubernetes cluster management and GitOps workflows into a single, extensible system, so teams don’t have to juggle Lens, K9s, and GitOps CLIs to stay in control. We make it easier to use Flux and Argo by enabling seamless interaction with GitOps tools. We’ve focused on addressing pain points we’ve faced ourselves - tools that are slow, memory-heavy, or just not built for scale.
Key features include:
• Kubernetes resource discovery
• Full RBAC compliance
• Multi-cluster support
• Fast keyboard navigation
• Helm release history
• Helm values and manifest diffing
• Flux resource tree visualisation
Here’s a short demo video for clarity.
Current state: It’s rough and in beta, but fully functional. We’ve been using it internally for a few months.
What we’re looking for:
• Feedback on whether this actually solves a real problem for you
• What features/integrations matter most
• Any concerns or questions about the approach
Happy to answer questions about how it works, architecture decisions, or anything else. https://kunobi.ninja — download the beta here.
Nat G.4 months ago
Hi folks! I want to share one of our recent articles that tackles a problem many of us deal with every day: the shared staging environment.
If your team is constantly running into a "blame game" because someone's change broke the main test environment, you know how frustrating and slow that is. The old model of duplicating your stack is too expensive, but sharing it leads to painful bottlenecks. A better way to test is by using isolation to give every developer a personal, instant sandbox for their pull request. This eliminates the contention, speeds up testing dramatically, and removes a huge source of friction.
Check out the full analysis here: https://www.signadot.com/blog/the-staging-bottleneck-why-your-engineering-team-is-slow-and-how-to-fix-it
Nat G.4 months ago
Hey folks 👋
The conversation around the impact of AI on developer velocity is shifting. Signadot’s CEO, Arjun Iyer, had a chat with David Cramer (Sentry) at Tech Week SF last week, and it sparked an idea: Is AI-generated code secretly creating a massive bottleneck for our validation infrastructure?
It feels like we're getting great at code generation, but the real tax on developer velocity is becoming code validation—long CI queues, flaky staging envs, etc. We put together some thoughts on how Sentry's evolution points to a future where pre-production testing platforms become critical.
Curious to hear how others are thinking about this challenge.
https://www.signadot.com/blog/what-sentrys-evolution-taught-me-about-the-future-of-development-velocity
DevOpsNinja5 months ago
Kubernetes shouldn’t feel like juggling five tools just to ship one change.
Platform teams managing Kubernetes at scale are drowning in tool sprawl. Lens for visibility, k9s for speed, Flux CLI or ArgoCD for GitOps — but no single workflow that works for everyone. The result? Onboarding bottlenecks, visibility gaps, and constant context-switching.
The Solution
We are working on a new platform - called Kunobi - that unifies Kubernetes and GitOps management in one place. It keeps the speed and flexibility of the CLI, but adds just enough visualization so you don’t need to rebuild the entire mental model in your head every time.
Key Features
• GitOps-native control – FluxCD + ArgoCD with one-click sync, rollback, and drift detection.
• Unified topology view – Real-time map of deployments, pods, and secrets.
• Actionable resource management – Live statuses with quick actions (logs, shell).
• Collaborative, efficient UI – Keyboard speed plus GUI for teamwork.
If these pains resonate, we’d love your feedback—help us push Kunobi further before we launch more widely.
Join the Beta Program
We’re opening up early access to 100 platform teams. Beta testers receive:
• Full Professional access — free during beta
• Direct influence on roadmap & features
• Priority support via our Reddit community
• Early access to new GitOps & AI capabilities
Send a request here https://kunobi.ninja
DevOpsNinja5 months ago
hey folks, curious how you balance cli vs ui in your teams. seniors can fly through kubectl and k9s, but juniors end up lost without some visibility layer like lens or argocd.
do you push everyone through the cli grind, or do you let people lean on guis for day-to-day work?
Nat G.5 months ago
Hey everyone, if you're working with SQS-based microservices, you know how tricky it can be to test a new consumer without messing up the shared environment.
We wrote a detailed guide on how we solve this challenge by using isolated testing environments. It's a hands-on walkthrough showing how to test new SQS logic safely and quickly. Hope it's a useful resource for your team.
Tutorial link: https://www.signadot.com/blog/testing-microservices-with-rabbitmq-using-signadot-sandboxes
sheldonh6 months ago
What's the go-to for folks when you have a Helm chart with multiple app roles in it?
I'd like to use something strongly typed and get away from YAML templating, because we aren't using Helm except to render the YAML.
I'd prefer an SDK/strongly typed approach, but if Helm is still the best way, do you have a way to better organize things to reduce complexity? Subcharts?
Right now "external", "internal", "scheduler", etc. all result in too many permutations for helm-unittest for my comfort.
I did Pulumi in the past and rendered, but that's a big shift for the team, so I'm open to "no SDK for you, do this with Helm instead" or any variation.
Providing this as a service to everyone on the app teams and thinking through my options to clean it up as I take maintainership of it. :-)
Tech7 months ago
If you guys need help with Kubernetes please ping me. Not selling any service. I want to volunteer.
akhan4u7 months ago
Hi all, anyone running Cassandra on K8s? I have to check the resource requirements for it. Every morning the Cassandra deployment dies because of OOM. I'd like to know the preferred defaults/gotchas.
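For context, Cassandra OOM kills on K8s usually come down to the JVM heap plus off-heap usage exceeding the container memory limit, because the default heap sizing is derived from the node's memory rather than the container's. A minimal sketch of the relevant knobs, assuming a StatefulSet on the official cassandra image (all values are illustrative starting points, not recommendations for any particular workload):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra                  # hypothetical name
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:4.1
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              memory: 8Gi          # memory limit = request so node overcommit can't trigger surprise OOM kills
          env:
            - name: MAX_HEAP_SIZE  # keep the JVM heap well under the container limit
              value: 4G            # the rest is needed for off-heap memtables, caches, etc.
            - name: HEAP_NEWSIZE
              value: 800M

If the deployment comes from a chart or operator, the same heap and resource settings usually exist as values; the common gotcha is that the auto-sized heap is calculated from node memory, which matches the nightly OOM pattern described above.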
OliverS9 months ago
Hi all, has anyone configured NGINX Gateway Fabric to support WebSocket connections? HTTPS is working (via HTTPRoute), but for WSS on the same port we're getting an error:
Error during WebSocket handshake: Invalid status line
And I can't find much about this.
When we try to connect to the container directly over websocket from another container it works so it seems to be a gateway config issue.
Michael Koroteev9 months ago
Hello everyone! Small question -
how are you scaling workloads based on HTTP requests? I wanted to use the KedaHTTP add-on but it seems it's not ready yet for production use.
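For what it's worth, a common alternative while the HTTP add-on matures is scaling on a request-rate metric you already collect. A minimal sketch using KEDA's Prometheus scaler (assumes an in-cluster Prometheus at the address below and an ingress request metric; all names are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-api-scaler                 # hypothetical name
spec:
  scaleTargetRef:
    name: web-api                      # hypothetical Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090    # assumption: Prometheus reachable here
        query: sum(rate(nginx_ingress_controller_requests{service="web-api"}[2m]))
        threshold: "100"               # target requests/sec per replica

The trade-off versus the HTTP add-on is that this scales on observed traffic rather than interposing a proxy, so scaling to zero isn't practical without something buffering requests in front.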
Sean Turner10 months ago
Hey all, wondering if you've ever run into an issue with Karpenter having trouble terminating Nodes? Namely, we'll see Pods stuck in a Terminating state for days. Annoyingly, these Pods come online for a moment (e.g. a Prefect DAG starts execution) and start doing stuff before they die.
Seems that the kubelet is dying on the Node, perhaps when the node comes online, but we're not sure yet. The problem seems to occur when the Node is new, and we see some Datadog metrics showing maxed-out CPU / memory right when the Node comes online. Most Pods have CPU / memory limits, but not really on the daemonsets.
None of us (team of 3) noticed this issue prior to an annual Cluster Upgrade where we updated a number of components:
EKS 1.28 --> 1.32
Karpenter 0.37 --> 1.27
AL2 --> AL2023
Considering turning on the EKS Node Monitor and Karpenter auto repair. But this doesn't seem to solve the underlying issue or prevent DAGs from coming online and promptly failing. Which is a problem.
akhan4u11 months ago
Hi guys! I'm looking into setting up an ALB with nginx-ingress in EKS. Need your inputs. I want to deploy something like this:
• ALB + nginx-ingress + cert-manager + external-dns + Let's Encrypt - is it possible to do this? I'm able to set up ClassicLB + nginx + cert-manager + external-dns + Let's Encrypt, but I want to get rid of the Classic LB.
• I've come across that you can set up ALB + nginx + ACM; here Let's Encrypt with cert-manager is not possible, so something like pre-generating certs and importing them into ACM should work?
I hope my understanding is correct, please feel free to correct me.
SimonelDavid12 months ago
Hi guys! I have a little challenge with RBAC in Kubernetes. I have a request to grant, in a ClusterRole, read-only permission to everything in the cluster except for Secrets. I tried something like:
default_rule1 = {
  api_groups = ["*"]
  resources  = ["*"]
  verbs      = ["get", "list", "watch"]
}
default_rule2 = {
  api_groups = [""]
  resources  = ["secrets"]
  verbs      = [""]
}
But this doesn't work, unfortunately; it only works if I make an explicit list of all the resources and grant those 3 verbs on each of them, and that would be a real pain. I know that there is no explicit deny, and also that once I've granted access to Secrets in default_rule1 it cannot be overridden in default_rule2. Do you have any ideas for workarounds? I tried searching the internet but nothing is really helpful, and I am trying to avoid a lot of code and the addition of another external tool for RBAC. Thank you!
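One partial workaround worth noting (a common pattern, not something from this thread): since RBAC has no deny rules, the options are generating an explicit resource list or leaning on the built-in aggregated view ClusterRole, which already grants get/list/watch on most namespaced resources while deliberately excluding Secrets. A minimal sketch of binding it cluster-wide, with a hypothetical group name:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-read-only-no-secrets   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                           # built-in aggregated role: read-only, does not include Secrets
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: readonly-users               # hypothetical group from your identity provider

Cluster-scoped resources (nodes, namespaces, CRDs, etc.) aren't covered by view, so they would still need a small extra ClusterRole listing them explicitly.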
Stefabout 1 year ago
long shot, how do I get Istio to accept a k8s CA signed certificate for a specific service
Azarabout 1 year ago
Hi folks, how do we preserve the source IP with Istio <> Azure LB without setting externalTrafficPolicy to Local?
Michaelabout 1 year ago
Does Cloud Posse have a common library for Helm charts that is similar to
context.tf for Terraform? I'm curious how something like that would be implemented, perhaps comparable to New Relic's approach: https://github.com/newrelic/helm-charts/tree/master/library/common-library
akhan4u about 1 year ago
Hi all, are there any recommended ways to authenticate to AWS services from non-EKS clusters? Does IRSA work here, or is there some other way to authenticate to AWS? I'd prefer an IAM role over IAM user credentials. The non-EKS k8s is a k3s cluster.
IKabout 1 year ago
does anyone have the name of the tool that can rewrite manifests with private registry URLs? e.g. if someone deploys a chart from charts.external-secrets.io, the webhook updates it with my-private-repo.git.io, for example
pom about 1 year ago
Hey - we last implemented a test library for our k8s/OCP clusters in like 2019, it's a bunch of python script/ansible modules which perform assertions using the k8s api and sometimes the cloud provider apis. We generally run the tests on a schedule a few times a day on our prod clusters, and run them before and after upgrades (less useful than the monitoring we have, but still handy for more niche cases that monitoring can't cover). Seems an old fashioned way to do it but it works.
Time's come to update it, maybe reimplement with some other tech.
What are some other approaches people have seen or are using? Any recommendations welcome!
oscarabout 1 year ago
Hi, quick question: any ideas on how people manage updating database/cache configs for APIs, and similar for URLs, etc.? Getting tired of writing ConfigMaps; has anyone automated that?
venkataabout 1 year ago(edited)
I notice the k3s installer is primarily just a bash script that does everything. Does anyone here use k3s in prod? If so, how did you deploy it? Their official bash script? Or did you roll your own automation that created all the necessary configs/files (e.g. systemd units)?
mikoover 1 year ago
Hello, has anyone here managed to deploy Apache Kafka using the Strimzi operator in AWS EKS? I've managed to deploy a cluster, but I need to expose the consumer's port to the outside world and I can't find an example I could follow.
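In case it helps anyone searching later, Strimzi handles external access through the listeners section of the Kafka custom resource; a loadbalancer-type listener makes the operator create per-broker plus bootstrap LoadBalancer Services on EKS. A rough sketch, with the cluster name and sizes purely illustrative:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                 # hypothetical name
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain                # in-cluster clients
        port: 9092
        type: internal
        tls: false
      - name: external             # clients outside the cluster
        port: 9094
        type: loadbalancer         # operator provisions LoadBalancer Services per broker + bootstrap
        tls: true
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi

Clients then point at the external bootstrap address reported under the Kafka resource's status.listeners, and with tls: true they also need the cluster CA certificate from the <cluster-name>-cluster-ca-cert Secret. Newer Strimzi releases move to KRaft with KafkaNodePool resources, but the listener block itself looks the same.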
B
Branko Petricover 1 year ago
Hi guys. I am launching KubeWhisper - an AI CLI kubectl assistant - tomorrow. The waitlist is open until the end of day today. Sign up if you would like to try it for free. I'd appreciate any feedback, comments, concerns, etc... 🙂
Waitlist: https://brankopetric.com/kubewhisper
tokaover 1 year ago
Anyone using CAST AI to scale node pools and use spot instances? Looking for something open-sourced, but there's not quite anything like it.
mikoover 1 year ago(edited)
RESOLVED
Hey guys, I hope everyone is having a nice day. Has anyone here used the cloudnative-pg operator to run your Postgres cluster before? I have a running cluster, but I can't for the life of me figure out how to properly expose it for outside use (for a few days now); I am able to connect to it using kubectl port-forwarding. I am running on AWS EKS and here is my config. It's the bare minimum, and it even manages to create the load balancer with the external IP, but I just can't connect to it; even using telnet I'm not able to ping it:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3
  storage:
    size: 1Gi
  managed:
    services:
      disabledDefaultServices: ["ro", "r"]
      additional:
        - selectorType: rw
          serviceTemplate:
            metadata:
              name: cluster-example-rw-lb
              annotations: # this is the fix!
                service.beta.kubernetes.io/aws-load-balancer-type: external
                service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
                service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
            spec:
              type: LoadBalancer
Alex Atkinson over 1 year ago
Does anyone have any good resources on the considerations between having an ArgoCD instance per cluster vs a single ArgoCD orchestration instance for multiple clusters?
I found this vid. https://www.youtube.com/watch?v=bj9qLpomlrs
RBover 1 year ago
noob question: what are the benefits of using the 1Password SCIM?
Alex Atkinsonover 1 year ago
Do folks like the one LB to many services, or one LB per service pattern in K8s?
Generally I've always liked one LB per service everywhere in PaaS-land, but with K8s there may be issues/limitations with dynamically launching LBs (metal or otherwise).
Chris Pichtover 1 year ago
Anyone know of someone who is available for some freelance work with EKS & Bitnami's Sealed Secrets? I'm having difficulty pulling images from my GitLab Container Registry because I can't seem to get the correct value into the secret for containerd. Will gladly pay for the assistance.
mikoover 1 year ago(edited)
Hey guys, I am running a StatefulSet PostgresDB in AWS EKS, because it's stateless I am using NodePort which doesn't expose the service to the public, but my colleagues would like access to it for ease of development, any suggestions for me (config in the reply)?
Should I simply switch to ELB?
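If the goal is just developer access from inside the VPC rather than the public internet, an internal NLB via the AWS Load Balancer Controller is a common middle ground between NodePort and a public ELB. A rough sketch, assuming the controller is installed and the selector matches the Postgres pods:

apiVersion: v1
kind: Service
metadata:
  name: postgres-dev-access        # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal    # keep it VPC-only
spec:
  type: LoadBalancer
  selector:
    app: postgres                  # assumption: adjust to the StatefulSet's pod labels
  ports:
    - port: 5432
      targetPort: 5432

If it really has to be reachable from outside the VPC, switching the scheme to internet-facing plus restricting loadBalancerSourceRanges is the usual compromise, though kubectl port-forward or a bastion remains the safer default for a database.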
enriqueover 1 year ago
still need help debugging why some of my pods are being killed (exit 137)
rohitover 1 year ago
Does anyone know if there's an open sourced project that helps with creating, managing, verifying licenses for products we create?
enriqueover 1 year ago
anyone here want to help someone drowning in failure deploying a Helm chart for Mage AI?
rohitover 1 year ago
Does anyone use cosign to sign their images and artifacts? How do y'all deal with rotating those keys? Our concern is that if a customer is using our public key to verify the images running in their Kubernetes clusters and we rotate the signing keys, they will have to rotate as well...
Adnanover 1 year ago
I would like to set a lifetime for pods. What are you using to achieve this?
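For the simple cases, this is built into the API: activeDeadlineSeconds on the pod spec (or on a Job) marks the Pod as Failed once the deadline passes and its containers are killed. A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: short-lived-task           # hypothetical name
spec:
  activeDeadlineSeconds: 3600      # pod is terminated and marked Failed after 1 hour
  restartPolicy: Never
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "while true; do date; sleep 60; done"]

For pods owned by a Deployment this fits less well (the ReplicaSet simply replaces them), so recycling long-running pods is usually done with a Job/CronJob or a cleanup controller such as Kyverno's cleanup policies.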
jaysunover 1 year ago
hi there, how are folks handling argocd deployments nowadays? im thinking about revamping our setup. right now we’re:
• hub spoke model using argocd app of apps pattern
• bootstrap the argocd namespace and helm deployment in terraform on the hub cluster
• point to our “app of apps” repository which uses helmfile
• let argocd manage itself
• janky CICD workflow script to add new child clusters under management
I think we want to continue letting app of apps manage our “addons” using helmfile, but wondering if there’s been any improvements in the initial bootstrap of argocd itself ( for the hub ) and the argocd cluster add portion for child clusters (perhaps via the argocd terraform provider?)
akhan4uover 1 year ago
hey everyone, I wanted to know how you approach k8s upgrades. We are self-hosting our K8s clusters (kubespray-managed) in different DCs. I want to know how you make sure the controllers/operators in k8s don't have any breaking changes between, say, 1.x and 2.x. Do you surf through the changelog and look for keywords like breaking/dropped/removed? I want to know if there is some automated way or a tool to compare version changelogs and summarise the breaking changes. We already check for deprecated APIs using kubedd, Polaris and others; however, this controller version change review is manual and error-prone.
IK over 1 year ago
Hey all. Thoughts on openshift for cluster management? Vs something like rancher. Are many folks running k8s on openshift? I can see the value-add, unsure of dollars at this stage but given majority of our footprint is the public cloud, using openshift instead of say AKS or EKS seems a little counter-intuitive
Adnanover 1 year ago
Hello everybody,
I had an issue few days ago during which an nginx deployment had a spike in 504 timeouts while trying to proxy the requests to the upstream php-fpm.
The issue lasted for 1h20min and resolved by itself. I was unable to find those requests reaching the upstream php-fpm pods.
I suspect that a node went away but the endpoints were not cleaned up. Unfortunately, I don't really have much evidence for it.
Anyone ever had similar issues where you had a large number of 504 between two services but you could not find any logs that would indicate those requests actually reached the other side?
lorenover 1 year ago
I got another noob question about K8S... I was testing a cluster update, that happened to cycle the nodes in the node group. I'm using EKS on AWS, with just a single node, but have three AZs available to the node group. There's a StatefulSet (deployed with a helm chart) using a persistent volume claim, which is backed by an EBS volume. The EBS volume is of course tied to a single AZ. So, when the node group updated, it seems it didn't entirely account for the zonal attributes, and cycled through 3 different nodes before it finally created one in the correct AZ that could meet the zonal restrictions of the persistent volume claim and get all the pods back online. Due to the zonal issues, the update took about an hour. The error when getting the other nodes going was
"1 node(s) had volume node affinity conflict." So basically, any pointers on how to handle this kind of constraint? Is there an adjustment to the helm chart, or some config option I can pass in, to adjust the setup somehow to be more zonal-friendly? Or is there a Kubernetes design pattern around this? I tried just googling, but didn't seem to get much in the way of good suggestions... I don't really want to have a node-per-AZ always running...Hamzaover 1 year ago(edited)
Hamza over 1 year ago (edited)
Hi, I'm using the Postgres-HA chart from Bitnami, and after using it for a while we decided that we don't really need HA; a single-pod DB is enough without the pg_pool and the issues it comes with. Right now we're planning to migrate to the normal Postgres chart from Bitnami, and I would like to know how to have the database's data persist even if we switch to the new chart.
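A hedged note on the usual approach (worth verifying against the exact chart versions in use): the bitnami/postgresql chart can reuse a pre-existing PVC via primary.persistence.existingClaim, so one path is to retain (or snapshot/clone) the data volume from the postgresql-ha release and point the new single-instance release at it. A values sketch, where every name is an assumption about your setup:

auth:
  username: app_user                     # assumption: must match the role already present in the data
  existingSecret: postgres-credentials   # assumption: credentials carried over from the old release
primary:
  persistence:
    enabled: true
    existingClaim: data-postgres-ha-0    # assumption: the PVC retained from the postgresql-ha release

The gotchas are that the on-disk PGDATA and Postgres major version must match what the new chart's image expects, and the old PVC has to survive the uninstall (check its reclaim policy), so a pg_dump/pg_restore into a fresh release is the safer route if there's any doubt.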