kubernetes
Nat G.7 days ago
Hey folks. Anyone here using Claude Code for building on K8s and wishing you could have it validate changes against your cluster without submitting a PR/waiting for CI? This tutorial walks through using Signadot to give Claude (or any agent) a live sandbox to test and fix its work before the PR.
What's your current setup for AI-driven debugging? https://www.signadot.com/docs/tutorials/rapid-local-debugging-claude-code
Antarr Byrd2 months ago
Are there any tools I can use to block certain countries? Not looking to use AWS or Cloudflare.
Jonathan4 months ago
(Cross-posting from #terraform since this is K8s-focused)
Hey folks, I built a new Kubernetes Terraform provider that might be interesting to you.
It solves a long-standing Terraform limitation: you can't create a cluster and deploy to it in the same apply. Providers are configured at the root, before resources exist, so you can't use a cluster's endpoint as provider config.
Most people work around this with two separate applies, some use null_resource hacks, others split everything into multiple stacks. After being frustrated by this for many years, I realized the only solution was to build a provider that sidesteps the whole problem with inline connections.
Example:
resource "k8sconnect_object" "app" {
cluster = {
host = aws_eks_cluster.main.endpoint
token = data.aws_eks_cluster_auth.main.token
}
yaml_body = file("app.yaml")
}Create cluster → deploy workloads → single apply. No provider configuration needed.
Building with Server-Side Apply from the ground up (rather than bolting it on) opened doors to fix other persistent community issues with existing providers.
• Accurate diffs - Server-side apply dry-run projections show actual changes, not client-side guesses
• YAML + validation - K8s strict schema validation catches typos at plan time
• CRD+CR same apply - Auto-retry handles eventual consistency (no more time_sleep)
• Patch resources - Modify EKS/GKE defaults without taking ownership
• Non-destructive waits - Timeouts don't force resource recreation
300+ tests, runnable examples for everything.
GitHub: https://github.com/jmorris0x0/terraform-provider-k8sconnect
Registry: https://registry.terraform.io/providers/jmorris0x0/k8sconnect/latest
Would love feedback if you've hit this pain point.
DevOpsNinja4 months ago
Hey everyone, we just shipped something and would love honest feedback from the community.
What we built: Kunobi is a new platform that brings Kubernetes cluster management and GitOps workflows into a single, extensible system, so teams don’t have to juggle Lens, K9s, and GitOps CLIs to stay in control. We make it easier to use Flux and Argo by enabling seamless interaction with GitOps tools. We’ve focused on addressing pain points we’ve faced ourselves - tools that are slow, memory-heavy, or just not built for scale.
Key features include:
• Kubernetes resource discovery
• Full RBAC compliance
• Multi-cluster support
• Fast keyboard navigation
• Helm release history
• Helm values and manifest diffing
• Flux resource tree visualisation
Here’s a short demo video for clarity.
Current state: It’s rough and in beta, but fully functional. We’ve been using it internally for a few months.
What we’re looking for:
• Feedback on whether this actually solves a real problem for you
• What features/integrations matter most
• Any concerns or questions about the approach
Happy to answer questions about how it works, architecture decisions, or anything else. https://kunobi.ninja — download the beta here.
Nat G.4 months ago
Hi folks! I want to share one of our recent articles that tackles a problem many of us deal with every day: the shared staging environment.
If your team is constantly running into a "blame game" because someone's change broke the main test environment, you know how frustrating and slow that is. The old model of duplicating your stack is too expensive, but sharing it leads to painful bottlenecks. A better way to test is by using isolation to give every developer a personal, instant sandbox for their pull request. This eliminates the contention, speeds up testing dramatically, and removes a huge source of friction.
Check out the full analysis here: https://www.signadot.com/blog/the-staging-bottleneck-why-your-engineering-team-is-slow-and-how-to-fix-it
Nat G.4 months ago
Hey folks 👋
The conversation around the impact of AI on developer velocity is shifting. Signadot’s CEO, Arjun Iyer, had a chat with David Cramer (Sentry) at Tech Week SF last week, and it sparked an idea: Is AI-generated code secretly creating a massive bottleneck for our validation infrastructure?
It feels like we're getting great at code generation, but the real tax on developer velocity is becoming code validation—long CI queues, flaky staging envs, etc. We put together some thoughts on how Sentry's evolution points to a future where pre-production testing platforms become critical.
Curious to hear how others are thinking about this challenge.
https://www.signadot.com/blog/what-sentrys-evolution-taught-me-about-the-future-of-development-velocity
DevOpsNinja5 months ago
Kubernetes shouldn’t feel like juggling five tools just to ship one change.
Platform teams managing Kubernetes at scale are drowning in tool sprawl. Lens for visibility, k9s for speed, Flux CLI or ArgoCD for GitOps — but no single workflow that works for everyone. The result? Onboarding bottlenecks, visibility gaps, and constant context-switching.
The Solution
We are working on a new platform - called Kunobi - that unifies Kubernetes and GitOps management in one place. It keeps the speed and flexibility of the CLI, but adds just enough visualization so you don’t need to rebuild the entire mental model in your head every time.
Key Features
• GitOps-native control – FluxCD + ArgoCD with one-click sync, rollback, and drift detection.
• Unified topology view – Real-time map of deployments, pods, and secrets.
• Actionable resource management – Live statuses with quick actions (logs, shell).
• Collaborative, efficient UI – Keyboard speed plus GUI for teamwork.
If these pains resonate, we’d love your feedback—help us push Kunobi further before we launch more widely.
Join the Beta Program
We’re opening up early access to 100 platform teams. Beta testers receive:
• Full Professional access — free during beta
• Direct influence on roadmap & features
• Priority support via our Reddit community
• Early access to new GitOps & AI capabilities
Send a request here https://kunobi.ninja
DevOpsNinja5 months ago
hey folks, curious how you balance cli vs ui in your teams. seniors can fly through kubectl and k9s, but juniors end up lost without some visibility layer like lens or argocd.
do you push everyone through the cli grind, or do you let people lean on guis for day-to-day work?
Nat G.5 months ago
Hey everyone, if you're working with SQS-based microservices, you know how tricky it can be to test a new consumer without messing up the shared environment.
We wrote a detailed guide on how we solve this challenge by using isolated testing environments. It's a hands-on walkthrough showing how to test new SQS logic safely and quickly. Hope it's a useful resource for your team.
Tutorial link: https://www.signadot.com/blog/testing-microservices-with-rabbitmq-using-signadot-sandboxes
sheldonh6 months ago
What's the go-to for folks when you have a Helm chart with multiple app roles in it?
I'd like to use something strongly typed and get away from YAML templating, because we aren't using Helm except to render the YAML.
I'd prefer an SDK/strongly typed approach, but if Helm is still the best way, do you have a way to better organize things to reduce complexity? Subcharts?
Right now "external", "internal", "scheduler", etc. all result in too many permutations for helm-unittest for my comfort.
I did Pulumi in the past and rendered, but that's a big shift for the team, so I'm open to "no SDK for you, do this with Helm instead" or any variation.
Providing this as a service to everyone on the app teams and thinking through my options to clean it up as I take maintainership of it. :-)
Tech7 months ago
If you guys need help with Kubernetes please ping me. Not selling any service. I want to volunteer.
akhan4u7 months ago
Hi all, anyone running Cassandra on K8s? I have to check the resource requirements for it. Every morning the Cassandra deployment dies because of OOM. I'd like to know the preferred defaults/gotchas.
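For context, Cassandra OOM kills on K8s usually come down to the JVM heap plus off-heap usage exceeding the container memory limit, because the default heap sizing is derived from the node's memory rather than the container's. A minimal sketch of the relevant knobs, assuming a StatefulSet on the official cassandra image (all values are illustrative starting points, not recommendations for any particular workload):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra                  # hypothetical name
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:4.1
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              memory: 8Gi          # memory limit = request so node overcommit can't trigger surprise OOM kills
          env:
            - name: MAX_HEAP_SIZE  # keep the JVM heap well under the container limit
              value: 4G            # the rest is needed for off-heap memtables, caches, etc.
            - name: HEAP_NEWSIZE
              value: 800M

If the deployment comes from a chart or operator, the same heap and resource settings usually exist as values; the common gotcha is that the auto-sized heap is calculated from node memory, which matches the nightly OOM pattern described above.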
OliverS9 months ago
Hi all, has anyone configured NGINX Gateway Fabric to support WebSocket connections? HTTPS is working (via HTTPRoute), but for WSS on the same port we're getting an error:
Error during WebSocket handshake: Invalid status line
And I can't find much about this.
When we try to connect to the container directly over websocket from another container it works so it seems to be a gateway config issue.
Michael Koroteev9 months ago
Hello everyone! Small question -
how are you scaling workloads based on HTTP requests? I wanted to use the KedaHTTP add-on but it seems it's not ready yet for production use.
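For what it's worth, a common alternative while the HTTP add-on matures is scaling on a request-rate metric you already collect. A minimal sketch using KEDA's Prometheus scaler (assumes an in-cluster Prometheus at the address below and an ingress request metric; all names are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-api-scaler                 # hypothetical name
spec:
  scaleTargetRef:
    name: web-api                      # hypothetical Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090    # assumption: Prometheus reachable here
        query: sum(rate(nginx_ingress_controller_requests{service="web-api"}[2m]))
        threshold: "100"               # target requests/sec per replica

The trade-off versus the HTTP add-on is that this scales on observed traffic rather than interposing a proxy, so scaling to zero isn't practical without something buffering requests in front.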
Sean Turner10 months ago
Hey all, wondering if you've ever run into an issue with Karpenter having trouble terminating Nodes? Namely, we'll see Pods stuck in a Terminating state for days. Annoyingly, these Pods come online for a moment (e.g. a Prefect DAG starts execution) and start doing stuff before they die.
Seems that the kubelet is dying on the Node, perhaps when the node comes online, but we're not sure yet. The problem seems to occur when the Node is new, and we see some Datadog metrics showing maxed-out CPU / memory right when the Node comes online. Most Pods have CPU / memory limits, but not really on the daemonsets.
None of us (team of 3) noticed this issue prior to an annual Cluster Upgrade where we updated a number of components:
EKS 1.28 --> 1.32
Karpenter 0.37 --> 1.27
AL2 --> AL2023
Considering turning on the EKS Node Monitor and Karpenter auto repair. But this doesn't seem to solve the underlying issue or prevent DAGs from coming online and promptly failing. Which is a problem.
akhan4u11 months ago
Hi guys! I'm looking into setting up an ALB with nginx-ingress in EKS. Need your inputs. I want to deploy something like this:
• ALB + nginx-ingress + cert-manager + external-dns + Let's Encrypt - is it possible to do this? I'm able to set up ClassicLB + nginx + cert-manager + external-dns + Let's Encrypt, but I want to get rid of the Classic LB.
• I've come across that you can set up ALB + nginx + ACM; here Let's Encrypt with cert-manager is not possible, so something like pre-generating certs and importing them into ACM should work?
I hope my understanding is correct, please feel free to correct me.
SimonelDavid12 months ago
Hi guys! I have a little challenge with RBAC in Kubernetes. I have a request to grant, in a ClusterRole, read-only permission to everything in the cluster except for Secrets. I tried something like:
default_rule1 = {
  api_groups = ["*"]
  resources  = ["*"]
  verbs      = ["get", "list", "watch"]
}
default_rule2 = {
  api_groups = [""]
  resources  = ["secrets"]
  verbs      = [""]
}
But this doesn't work, unfortunately; it only works if I make an explicit list of all the resources and grant those 3 verbs on each of them, and that would be a real pain. I know that there is no explicit deny, and also that once I've granted access to Secrets in default_rule1 it cannot be overridden in default_rule2. Do you have any ideas for workarounds? I tried searching the internet but nothing is really helpful, and I am trying to avoid a lot of code and the addition of another external tool for RBAC. Thank you!
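One partial workaround worth noting (a common pattern, not something from this thread): since RBAC has no deny rules, the options are generating an explicit resource list or leaning on the built-in aggregated view ClusterRole, which already grants get/list/watch on most namespaced resources while deliberately excluding Secrets. A minimal sketch of binding it cluster-wide, with a hypothetical group name:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-read-only-no-secrets   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                           # built-in aggregated role: read-only, does not include Secrets
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: readonly-users               # hypothetical group from your identity provider

Cluster-scoped resources (nodes, namespaces, CRDs, etc.) aren't covered by view, so they would still need a small extra ClusterRole listing them explicitly.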
Stefabout 1 year ago
long shot, how do I get Istio to accept a k8s CA signed certificate for a specific service
Azarabout 1 year ago
Hi folks, how do we preserve the source IP with Istio <> Azure LB without setting externalTrafficPolicy to Local?
Michaelabout 1 year ago
Does Cloud Posse have a common library for Helm charts that is similar to
context.tf for Terraform? I'm curious how something like that would be implemented, perhaps comparable to New Relic's approach: https://github.com/newrelic/helm-charts/tree/master/library/common-library
akhan4u about 1 year ago
Hi all, are there any recommended ways to authenticate to AWS services from non-EKS clusters? Does IRSA work here, or is there some other way to authenticate to AWS? I'd prefer an IAM role over IAM user credentials. The non-EKS k8s is a k3s cluster.
IKabout 1 year ago
does anyone have the name of the tool that can rewrite manifests with private registry URLs? e.g. if someone deploys a chart from charts.external-secrets.io, the webhook updates it with my-private-repo.git.io, for example
pom about 1 year ago
Hey - we last implemented a test library for our k8s/OCP clusters in like 2019, it's a bunch of python script/ansible modules which perform assertions using the k8s api and sometimes the cloud provider apis. We generally run the tests on a schedule a few times a day on our prod clusters, and run them before and after upgrades (less useful than the monitoring we have, but still handy for more niche cases that monitoring can't cover). Seems an old fashioned way to do it but it works.
Time's come to update it, maybe reimplement with some other tech.
What are some other approaches people have seen or are using? Any recommendations welcome!
oscarabout 1 year ago
Hi, quick question: any ideas on how people manage updating database/cache configs for APIs, and similar for URLs, etc.? Getting tired of writing ConfigMaps; has anyone automated that?
venkataabout 1 year ago(edited)
I notice the k3s installer is primarily just a bash script that does everything. Does anyone here use k3s in prod? If so, how did you deploy it? Their official bash script? Or did you roll your own automation that created all the necessary configs/files (e.g. systemd units)?
mikoover 1 year ago
Hello, has anyone here managed to deploy Apache Kafka using the Strimzi operator in AWS EKS? I've managed to deploy a cluster, but I need to expose the consumer's port to the outside world and I can't find an example I could follow.
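In case it helps anyone searching later, Strimzi handles external access through the listeners section of the Kafka custom resource; a loadbalancer-type listener makes the operator create per-broker plus bootstrap LoadBalancer Services on EKS. A rough sketch, with the cluster name and sizes purely illustrative:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                 # hypothetical name
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain                # in-cluster clients
        port: 9092
        type: internal
        tls: false
      - name: external             # clients outside the cluster
        port: 9094
        type: loadbalancer         # operator provisions LoadBalancer Services per broker + bootstrap
        tls: true
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi

Clients then point at the external bootstrap address reported under the Kafka resource's status.listeners, and with tls: true they also need the cluster CA certificate from the <cluster-name>-cluster-ca-cert Secret. Newer Strimzi releases move to KRaft with KafkaNodePool resources, but the listener block itself looks the same.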
B
Branko Petricover 1 year ago
Hi guys. I am launching KubeWhisper - an AI CLI kubectl assistant - tomorrow. The waitlist is open until the end of day today. Sign up if you would like to try it for free. I'd appreciate any feedback, comments, concerns, etc... 🙂
Waitlist: https://brankopetric.com/kubewhisper
tokaover 1 year ago
Anyone using CAST AI to scale node pools and use spot instances? Looking for something open-sourced, but there's not quite anything like it.
mikoover 1 year ago(edited)
RESOLVED
Hey guys, I hope everyone is having a nice day. Has anyone here used the cloudnative-pg operator to run your Postgres cluster before? I have a running cluster, but I can't for the life of me figure out how to properly expose it for outside use (for a few days now); I am able to connect to it using kubectl port-forwarding. I am running on AWS EKS and here is my config. It's the bare minimum, and it even manages to create the load balancer with the external IP, but I just can't connect to it; even using telnet I'm not able to ping it:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3
  storage:
    size: 1Gi
  managed:
    services:
      disabledDefaultServices: ["ro", "r"]
      additional:
        - selectorType: rw
          serviceTemplate:
            metadata:
              name: cluster-example-rw-lb
              annotations: # this is the fix!
                service.beta.kubernetes.io/aws-load-balancer-type: external
                service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
                service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
            spec:
              type: LoadBalancer
Alex Atkinson over 1 year ago
Does anyone have any good resources on the considerations between having an ArgoCD instance per cluster vs a single ArgoCD orchestration instance for multiple clusters?
I found this vid. https://www.youtube.com/watch?v=bj9qLpomlrs
RBover 1 year ago
noob question: what are the benefits of using the 1Password SCIM?
Alex Atkinsonover 1 year ago
Do folks like the one LB to many services, or one LB per service pattern in K8s?
Generally I've always liked one LB per service everywhere in PaaS-land, but with K8s there may be issues/limitations with dynamically launching LBs (metal or otherwise).
Chris Pichtover 1 year ago
Anyone know of someone who is available for some freelance work with EKS & Bitnami's Sealed Secrets? I'm having difficulty pulling images from my GitLab Container Registry because I can't seem to get the correct value into the secret for containerd. Will gladly pay for the assistance.
mikoover 1 year ago(edited)
Hey guys, I am running a StatefulSet PostgresDB in AWS EKS, because it's stateless I am using NodePort which doesn't expose the service to the public, but my colleagues would like access to it for ease of development, any suggestions for me (config in the reply)?
Should I simply switch to ELB?
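If the goal is just developer access from inside the VPC rather than the public internet, an internal NLB via the AWS Load Balancer Controller is a common middle ground between NodePort and a public ELB. A rough sketch, assuming the controller is installed and the selector matches the Postgres pods:

apiVersion: v1
kind: Service
metadata:
  name: postgres-dev-access        # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal    # keep it VPC-only
spec:
  type: LoadBalancer
  selector:
    app: postgres                  # assumption: adjust to the StatefulSet's pod labels
  ports:
    - port: 5432
      targetPort: 5432

If it really has to be reachable from outside the VPC, switching the scheme to internet-facing plus restricting loadBalancerSourceRanges is the usual compromise, though kubectl port-forward or a bastion remains the safer default for a database.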
enriqueover 1 year ago
still need help debugging why some of my pods are being killed (exit 137)
rohitover 1 year ago
Does anyone know if there's an open sourced project that helps with creating, managing, verifying licenses for products we create?
enriqueover 1 year ago
anyone here want to help someone drowning in failure deploying a Helm chart for Mage AI?
rohitover 1 year ago
Does anyone use cosign to sign their images and artifacts? How do y'all deal with rotating those keys? Our concern is that if a customer is using our public key to verify the images running in their Kubernetes clusters and we rotate the signing keys, they will have to rotate as well...
Adnanover 1 year ago
I would like to set a lifetime for pods. What are you using to achieve this?
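For the simple cases, this is built into the API: activeDeadlineSeconds on the pod spec (or on a Job) marks the Pod as Failed once the deadline passes and its containers are killed. A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: short-lived-task           # hypothetical name
spec:
  activeDeadlineSeconds: 3600      # pod is terminated and marked Failed after 1 hour
  restartPolicy: Never
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "while true; do date; sleep 60; done"]

For pods owned by a Deployment this fits less well (the ReplicaSet simply replaces them), so recycling long-running pods is usually done with a Job/CronJob or a cleanup controller such as Kyverno's cleanup policies.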
jaysunover 1 year ago
hi there, how are folks handling argocd deployments nowadays? im thinking about revamping our setup. right now we’re:
• hub spoke model using argocd app of apps pattern
• bootstrap the argocd namespace and helm deployment in terraform on the hub cluster
• point to our “app of apps” repository which uses helmfile
• let argocd manage itself
• janky CICD workflow script to add new child clusters under management
I think we want to continue letting app of apps manage our “addons” using helmfile, but wondering if there’s been any improvements in the initial bootstrap of argocd itself ( for the hub ) and the argocd cluster add portion for child clusters (perhaps via the argocd terraform provider?)
akhan4uover 1 year ago
hey everyone, I wanted to know how you approach k8s upgrades. We are self-hosting our K8s clusters (kubespray-managed) in different DCs. I want to know how you make sure the controllers/operators in k8s don't have any breaking changes between, say, 1.x and 2.x. Do you surf through the changelog and look for keywords like breaking/dropped/removed? I want to know if there is some automated way or a tool to compare version changelogs and summarise the breaking changes. We already check for deprecated APIs using kubedd, Polaris and others; however, this controller version change review is manual and error-prone.
IK over 1 year ago
Hey all. Thoughts on openshift for cluster management? Vs something like rancher. Are many folks running k8s on openshift? I can see the value-add, unsure of dollars at this stage but given majority of our footprint is the public cloud, using openshift instead of say AKS or EKS seems a little counter-intuitive
Adnanover 1 year ago
Hello everybody,
I had an issue few days ago during which an nginx deployment had a spike in 504 timeouts while trying to proxy the requests to the upstream php-fpm.
The issue lasted for 1h20min and resolved by itself. I was unable to find those requests reaching the upstream php-fpm pods.
I suspect that a node went away but the endpoints were not cleaned up. Unfortunately, I don't really have much evidence for it.
Anyone ever had similar issues where you had a large number of 504 between two services but you could not find any logs that would indicate those requests actually reached the other side?
lorenover 1 year ago
I got another noob question about K8S... I was testing a cluster update, that happened to cycle the nodes in the node group. I'm using EKS on AWS, with just a single node, but have three AZs available to the node group. There's a StatefulSet (deployed with a helm chart) using a persistent volume claim, which is backed by an EBS volume. The EBS volume is of course tied to a single AZ. So, when the node group updated, it seems it didn't entirely account for the zonal attributes, and cycled through 3 different nodes before it finally created one in the correct AZ that could meet the zonal restrictions of the persistent volume claim and get all the pods back online. Due to the zonal issues, the update took about an hour. The error when getting the other nodes going was
"1 node(s) had volume node affinity conflict." So basically, any pointers on how to handle this kind of constraint? Is there an adjustment to the helm chart, or some config option I can pass in, to adjust the setup somehow to be more zonal-friendly? Or is there a Kubernetes design pattern around this? I tried just googling, but didn't seem to get much in the way of good suggestions... I don't really want to have a node-per-AZ always running...Hamzaover 1 year ago(edited)
Hamza over 1 year ago (edited)
Hi, I'm using the Postgres-HA chart from Bitnami, and after using it for a while we decided that we don't really need HA; a single-pod DB is enough without the pg_pool and the issues it comes with. Right now we're planning to migrate to the normal Postgres chart from Bitnami, and I would like to know how to have the database's data persist even if we switch to the new chart.
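A hedged note on the usual approach (worth verifying against the exact chart versions in use): the bitnami/postgresql chart can reuse a pre-existing PVC via primary.persistence.existingClaim, so one path is to retain (or snapshot/clone) the data volume from the postgresql-ha release and point the new single-instance release at it. A values sketch, where every name is an assumption about your setup:

auth:
  username: app_user                     # assumption: must match the role already present in the data
  existingSecret: postgres-credentials   # assumption: credentials carried over from the old release
primary:
  persistence:
    enabled: true
    existingClaim: data-postgres-ha-0    # assumption: the PVC retained from the postgresql-ha release

The gotchas are that the on-disk PGDATA and Postgres major version must match what the new chart's image expects, and the old PVC has to survive the uninstall (check its reclaim policy), so a pg_dump/pg_restore into a fresh release is the safer route if there's any doubt.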