Sean Turner · 10 months ago
Hey all, wondering if you've ever run into an issue with Karpenter having trouble terminating Nodes? Namely, we'll see Pods stuck in a Terminating state for days. Annoyingly, these Pods come online for a moment (e.g. a Prefect DAG starts execution) and start doing stuff before they die.
Seems that the kubelet is dying on the Node, perhaps when the Node comes online, but we're not sure yet. The problem seems to occur when the Node is new, and we see some Datadog metrics showing maxed-out CPU / memory right when the Node comes online. Most Pods have CPU / memory limits, but the DaemonSets mostly don't.
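Rough sketch of the direction we're leaning for the DaemonSets, i.e. putting requests/limits on the agent containers so a brand-new Node can't get starved right at boot. Everything here (name, image, numbers) is a placeholder, not one of our actual manifests:
```yaml
# Sketch only: placeholder DaemonSet showing requests/limits on the agent
# container. Name, image, and numbers are illustrative, not real values.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-node-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-node-agent
  template:
    metadata:
      labels:
        app: example-node-agent
    spec:
      containers:
        - name: agent
          image: example.com/node-agent:1.0.0
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
```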
None of us (team of 3) noticed this issue prior to an annual cluster upgrade where we updated a number of components:
EKS 1.28 --> 1.32
Karpenter 0.37 --> 1.27
AL2 --> AL2023
Considering turning on the EKS node monitoring agent and Karpenter auto repair, but this doesn't seem like it would solve the underlying issue or prevent DAGs from coming online and promptly failing, which is a problem.
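For reference, our understanding (not verified yet) is that the auto repair side is a feature gate in the Karpenter Helm values, roughly like the sketch below, and that it relies on the eks-node-monitoring-agent add-on being installed so Nodes actually report health conditions Karpenter can act on:
```yaml
# Sketch of the Karpenter Helm values change we're considering; not a
# verified config. nodeRepair is a feature gate in recent chart versions
# and only repairs Nodes whose health conditions come from the EKS node
# monitoring agent add-on.
settings:
  featureGates:
    nodeRepair: true
```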