When Your Cluster Won’t Sit Still: The Hidden Cost of Kubernetes Autonomy During Incidents
I’ve spent the better part of the last few years on the receiving end of Kubernetes pages, both as an operator and as someone building tooling for platform teams. The pattern I’ve seen, across very different organizations, is almost always the same: the hardest part of a Kubernetes incident isn’t fixing the problem. It’s getting the cluster to hold still long enough to understand it.
Here’s what I mean.
It’s 2 a.m. An alert fires. You pull up the dashboards and start trying to figure out what’s wrong. Latency is up, a couple of pods are crashlooping, customers are noticing. And while you’re squinting at logs, your cluster is busy doing its job.
The Horizontal Pod Autoscaler sees the CPU spike and adds replicas. Argo CD notices the new replica count doesn’t match Git and quietly rolls it back. The node recycler, blissfully unaware that anything is wrong, drains a node because it’s Tuesday and that’s what it does on Tuesdays. The Vertical Pod Autoscaler picks this exact moment to update resource requests and evict a pod for the trouble.
None of these systems are misbehaving. They’re doing exactly what you asked them to do. The problem is they’re doing it while you’re trying to understand a broken system, and every autonomous action changes the very state you’re trying to reason about.
You’re not debugging a system. You’re debugging a system that keeps moving.
The Autonomy Paradox
Kubernetes was built on a great idea: tell the cluster what you want and let it figure out how to get there. Self-healing, autoscaling, continuous reconciliation: that’s the whole point. In normal operations, this autonomy is genuinely valuable. It absorbs routine chaos so humans don’t have to.
Incidents aren’t normal operations.
During an active incident, that same autonomy turns into a liability. Every automated controller modifying cluster state is introducing new variables into an already confusing environment. The more sophisticated your platform, the more things are quietly moving at once, and the trend in Kubernetes tooling is firmly toward more automation, not less. Managed services like AKS, EKS, and GKE are also pulling more of the control plane out of your hands, which means autonomous behavior is increasingly happening at layers you have less visibility into and less ability to pause.
Take a quick inventory of what’s running on a typical production cluster. HPA is re-evaluating metrics every 15 seconds by default and adjusting replicas. VPA is recalculating resource requests and willing to evict pods to apply them. Argo CD or Flux is reconciling live state against Git, ready to resync as soon as it detects drift. The cluster autoscaler is adding and removing nodes based on pending pod pressure. A node recycler is rotating nodes on a schedule to roll out patched node images (new AMIs, if you’re on AWS). If you’re on AKS, Azure’s node auto-provisioner is making its own scaling decisions on top of all of that.
During an incident, every one of those systems is still running. They don’t know there’s an incident. They don’t know you’re trying to hold the cluster still long enough to understand what’s wrong.
There’s an awkward irony here. The teams with the most sophisticated, automated Kubernetes setups are often the ones with the hardest time holding their clusters still during a real incident. More automation in normal operations means more interference in abnormal ones.
What This Actually Looks Like
Most of the time in a Kubernetes incident isn’t spent fixing anything. It’s spent figuring out what’s wrong: correlating logs, tracing requests, checking events, comparing current state to what you’d expect. In my experience, the diagnostic phase routinely eats up well over half of total incident duration, and it gets dramatically harder when the cluster keeps mutating underneath you.
Here’s the kind of thing that happens.
You find a pod that looks suspicious. You start digging into its logs, its resource usage, its recent events. While you’re reading, HPA scales the deployment up by two replicas because CPU spiked when traffic hit the degraded pod. Now you have three pods instead of one, the logs you were reading are diluted across multiple instances, and the CPU signal that looked like a clue has been partially masked because the new replicas absorbed the load.
Then Argo CD, if it’s running automated sync with self-heal and the manifest in Git pins the replica count, notices the live count doesn’t match and resyncs back down to the original number, undoing the HPA change. Now you’re back to one pod, the CPU spike returns, and HPA is winding up to scale again. Your cluster is oscillating between two states, and every observation you make is potentially stale by the time you make the next one.
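To make the oscillation concrete, here is a toy simulation in Python. It’s a sketch, not a model of any real cluster. The scaling rule is the one from the Kubernetes HPA documentation (desired replicas = ceil(current replicas × current metric / target metric)); everything else is an assumption for illustration: the load and target numbers are made up, the GitOps controller is assumed to reset the replica count to whatever Git says, the two controllers take strict turns, and HPA’s tolerance and stabilization windows are ignored.

```python
import math


def hpa_desired(current_replicas: int, cpu_per_pod: float, target_cpu: float) -> int:
    """HPA scaling rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return max(1, math.ceil(current_replicas * cpu_per_pod / target_cpu))


def simulate(ticks: int, total_load: float = 240.0, target_cpu: float = 80.0,
             git_replicas: int = 1) -> list[int]:
    """Alternate an HPA step and a GitOps self-heal step.

    Returns the replica count observed after each tick.
    All default values are hypothetical.
    """
    replicas = git_replicas
    history = []
    for tick in range(ticks):
        if tick % 2 == 0:
            # HPA tick: the load is spread evenly across the current pods.
            cpu_per_pod = total_load / replicas
            replicas = hpa_desired(replicas, cpu_per_pod, target_cpu)
        else:
            # GitOps tick: live state has drifted from Git, so it is reset.
            replicas = git_replicas
        history.append(replicas)
    return history


print(simulate(6))  # [3, 1, 3, 1, 3, 1]
```

Each controller is correct on its own terms, and the replica count never settles. Whoever is reading logs sees a different topology every time they look.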
If you’ve spent time on call, you’ve seen this. It’s not exotic. It’s what happens when systems built for steady state keep operating in steady-state mode during a crisis.
The Manual Workaround Everyone Already Knows
Experienced Kubernetes operators have figured this out. The unofficial answer is a pre-flight checklist at the start of an incident: suspend Argo CD sync, pin the HPA, set VPA to Off, lock the cluster autoscaler, pause the node recycler. Do all of that before you start diagnosing anything.
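As a rough sketch, the checklist might translate into commands like the ones this Python snippet assembles. Several things here are assumptions for illustration: the application, namespace, and workload names (`payments`, `prod`, `web`) are hypothetical, the HPA and VPA objects are assumed to share the workload’s name, and the cluster autoscaler is assumed to be self-managed and running as a Deployment named `cluster-autoscaler` in `kube-system`. Managed autoscalers and node recyclers are vendor-specific, so the recycler step is left as a comment. The snippet only builds and prints commands; it does not run anything.

```python
import json
import shlex


def freeze_commands(app: str, namespace: str, workload: str,
                    replicas: int) -> list[list[str]]:
    """Build (but do not run) the stabilization commands for one workload."""
    pin = json.dumps({"spec": {"minReplicas": replicas, "maxReplicas": replicas}})
    vpa_off = json.dumps({"spec": {"updatePolicy": {"updateMode": "Off"}}})
    return [
        # 1. Turn off Argo CD automated sync first, so it cannot revert
        #    the patches that follow.
        ["argocd", "app", "set", app, "--sync-policy", "none"],
        # 2. Pin the HPA: min and max both set to the current replica count.
        ["kubectl", "-n", namespace, "patch", "hpa", workload,
         "--type", "merge", "-p", pin],
        # 3. Set VPA to Off so it keeps recommending but stops evicting.
        ["kubectl", "-n", namespace, "patch", "verticalpodautoscaler", workload,
         "--type", "merge", "-p", vpa_off],
        # 4. Assumption: self-managed cluster autoscaler in kube-system.
        ["kubectl", "-n", "kube-system", "scale", "deployment",
         "cluster-autoscaler", "--replicas=0"],
        # 5. Pause the node recycler: site-specific, so no command is generated.
    ]


for cmd in freeze_commands("payments", "prod", "web", replicas=3):
    print(shlex.join(cmd))
```

Even in this form it’s four commands against three different tools, plus a fifth step that depends entirely on how your nodes get rotated.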
It works. A still cluster is a much easier cluster to debug.
But doing it manually has three real problems.
It’s slow. Working through all the steps correctly is, anecdotally, a 10- to 20-minute exercise depending on your setup. That’s 10 to 20 minutes you’re spending while the incident is active and customers are affected.
It’s error-prone. Under pressure, steps get missed. The node recycler is still running because someone forgot to pause it, and 20 minutes into your investigation a node drains, your pod moves, and the logs you’ve been analyzing are now historical.
And the reverse is just as risky. When the incident resolves, you have to undo everything. Miss a step on the way out and you’ve traded one problem for another. A cluster with the autoscaler still locked won’t respond to load. A GitOps tool still suspended means your next deployment quietly doesn’t ship. Every veteran on-call has a story about the incident that ended only for a new one to start an hour later because something never got re-enabled.
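One low-tech guard against the forgotten re-enable is to keep a ledger of what was frozen and check it before declaring the incident closed. The Python sketch below is a minimal illustration of that idea; the `Ledger` class and the step names are hypothetical, and a real version would need to persist somewhere that survives the responder’s laptop going to sleep.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Tracks which freeze steps were applied and which have been undone."""
    applied: list[str] = field(default_factory=list)
    undone: set[str] = field(default_factory=set)

    def record(self, step: str) -> None:
        self.applied.append(step)

    def mark_undone(self, step: str) -> None:
        if step not in self.applied:
            raise ValueError(f"{step!r} was never applied")
        self.undone.add(step)

    def outstanding(self) -> list[str]:
        """Steps still frozen, most recently applied first."""
        return [s for s in reversed(self.applied) if s not in self.undone]


ledger = Ledger()
for step in ["argocd-sync", "hpa-pin", "vpa-off", "cluster-autoscaler", "node-recycler"]:
    ledger.record(step)

# After the incident, someone undoes three of the five steps and stops.
for step in ["node-recycler", "vpa-off", "hpa-pin"]:
    ledger.mark_undone(step)

print(ledger.outstanding())  # ['cluster-autoscaler', 'argocd-sync']
```

The two leftovers in this example are exactly the failure modes above: an autoscaler that won’t respond to load and a GitOps tool that won’t ship the next deployment.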
What’s Actually Missing
Incident response and normal operations have fundamentally different requirements. In normal operations you want maximum autonomy. During an incident you want maximum stability. The cluster has to be able to switch modes, and right now most clusters don’t have a concept of a diagnostic mode at all.
What teams actually need is a way to atomically freeze the autonomous behaviors that interfere with diagnosis: something fast, reliable, and just as easy to reverse when the incident is over. Not a five-step checklist executed under stress. Something closer to a single operation that puts the cluster into a known, stable state for the duration of the incident window.
The mechanisms that matter most are the ones that change pod counts, node counts, or deployment state. Pinning HPA. Suspending GitOps sync. Locking the cluster autoscaler. Setting VPA to Off. Pausing node recyclers. Today each one is a separate operation, and collectively they’re what engineers are already doing by hand.
The gap isn’t in knowing what to do. It’s in having a reliable, fast way to do all of it at once, and undo it just as cleanly.
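Here is one way the shape of that single operation could look, sketched in Python with the standard library’s `ExitStack`. It is an illustration of the pattern only: the `frozen` helper and the step names are hypothetical, the freeze and unfreeze callables are stand-ins that just write to a list, and “atomic” here means best-effort unwinding, not a real transaction across separate APIs.

```python
from contextlib import ExitStack, contextmanager
from typing import Callable, Iterator

# (name, freeze, unfreeze); the name is only a label for the reader.
Step = tuple[str, Callable[[], None], Callable[[], None]]


@contextmanager
def frozen(steps: list[Step]) -> Iterator[None]:
    """Apply every freeze step; undo the applied ones in reverse order on exit."""
    with ExitStack() as stack:
        for _name, freeze, unfreeze in steps:
            freeze()                  # if this raises, earlier steps are unwound
            stack.callback(unfreeze)  # registered only after freeze succeeded
        yield


log: list[str] = []


def step(name: str) -> Step:
    # Hypothetical stand-ins: real steps would call kubectl, argocd, or a vendor API.
    return (name,
            lambda: log.append(f"freeze {name}"),
            lambda: log.append(f"unfreeze {name}"))


with frozen([step("gitops"), step("hpa"), step("vpa")]):
    log.append("diagnose")

print(log)
```

The useful property is that freeze and unfreeze are declared together, so the undo path can’t drift out of sync with the freeze path. A context manager only lives as long as its process, though, so anything meant to span a multi-hour incident would need to store its state outside the process.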
A Problem That Deserves Better
The Kubernetes community spends a lot of energy talking about observability. How to instrument better, alert smarter, reduce MTTR through better visibility. That work matters. But there’s a parallel problem that gets almost no attention: the cluster itself is making incidents harder to diagnose by continuing to operate autonomously while you’re investigating.
Better dashboards don’t help much when the system you’re watching keeps changing state. Structured logging doesn’t help when the pod you’re tracing gets evicted mid-investigation. The whole observability stack sits on top of a cluster that doesn’t know it’s being diagnosed, and that gap has a real cost measured in incident duration and engineer burnout.
Teams that handle incidents well have already internalized this. They have runbooks for stabilizing the cluster before they start diagnosing. They’ve been burned enough times by a moving target to know that the first five minutes of an incident should go toward creating a stable environment, not chasing the first symptom.
The question worth asking is why that stabilization step is still manual, error-prone, and undocumented in most organizations, and what it would take to make “freeze the cluster” a first-class operation, treated with the same seriousness as a rollback or a failover.
Until that happens, the rest of us will keep paying the tax: longer incidents, more cognitive load, more 2 a.m. mistakes. The cluster will keep moving, and we’ll keep trying to debug it anyway.


