It Worked Last Tuesday: What Operators Teach Us About Platform Reality
A funny thing happens after a successful “apply.” People relax.
I’ve watched this play out a lot: We write a tidy plan, run it, and the cloud obediently assembles itself. Screenshots are taken. A Slack thread is archived with a little party emoji. Then Wednesday happens. A node group rolls, a certificate ages out faster than anyone expected, or, my personal favorite, someone “temporarily” flips a setting for a debugging session and forgets to flip it back. Now the plan from Tuesday is historically accurate, like a yearbook photo. It looked great at the time.
Infrastructure as code was a big step forward. Versioned intent beats wikis and wishful thinking. But a lot of IaC is designed for moments, not months. You describe the desired state, press the button and hope the world doesn’t move too far before you press it again. If your platform is quiet and slow-changing, that’s fine. Most platforms aren’t.
Kubernetes operators come from a different mindset: Don’t just make it true — keep it true. Instead of running a script and hoping, you run a controller that watches reality and continuously reconciles it back to your intent. The first time I saw a well-written operator quietly fix something I didn’t know was broken yet, I had the same feeling you get when you realize the dishwasher actually does wash the cutlery properly if you stop fighting the little separator thingies.
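The shape of that loop is simple enough to sketch in a few lines. This is a toy, not any real operator’s code: `State`, `reconcile` and the two fields are hypothetical stand-ins for whatever your controller actually watches, with no Kubernetes machinery involved.

```go
package main

import "fmt"

// State is a hypothetical, simplified stand-in for cluster state.
type State struct {
	Replicas int
	TLSValid bool
}

// reconcile drives actual state toward desired state, one pass at a time.
// A real operator runs this on every watch event and on a resync timer,
// which is what turns "make it true" into "keep it true."
func reconcile(desired, actual State) (State, []string) {
	var actions []string
	if actual.Replicas != desired.Replicas {
		actions = append(actions, fmt.Sprintf("scale %d -> %d", actual.Replicas, desired.Replicas))
		actual.Replicas = desired.Replicas
	}
	if !actual.TLSValid {
		actions = append(actions, "renew certificate")
		actual.TLSValid = true
	}
	return actual, actions
}

func main() {
	desired := State{Replicas: 3, TLSValid: true}
	actual := State{Replicas: 2, TLSValid: false} // Wednesday happened
	actual, actions := reconcile(desired, actual)
	fmt.Println(actions, actual == desired) // [scale 2 -> 3 renew certificate] true
}
```

The point isn’t the two `if` statements; it’s that nobody has to notice the drift for it to get fixed.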
Here’s a simple example I’ve seen in a few teams. You want to expose a workload in a cluster to an internal audience across environments. On paper: Publish an endpoint, give it a stable name, make sure only the right identities can reach it, and keep the TLS fresh. In practice: Pods migrate, cluster IPs shift, nodes churn, DNS gets “creative,” and certificates mysteriously expire at 3 a.m. With a controller, you point at a Service or Ingress, and it does the boring bits on a loop: It updates endpoints as they move, renews credentials ahead of time, and keeps the name pointing at the thing that actually exists today. No one has to remember to “rerun the thing” after a rolling update. It’s not magic; it’s a control loop.
This isn’t an either/or story. Use the right tool for the lifetime of the task. One-shot tools are fantastic at provisioning durable stuff: networks, base clusters, long-lived databases that change on purpose and with notice. Controllers shine where change is the default state—where you’d otherwise be dispatching a human or a bash script to do the same fix every week.
There’s a cultural lesson hiding here, too. When we adopted IaC, we learned to review every change. That was healthy. Operators add a second habit: close the loop. Don’t just declare what good looks like; ship a small piece of software whose entire job is to keep it that way, and then instrument it so you can see when it can’t. The first time a team gets comfortable with that, you can feel the platform get calmer. Incidents don’t disappear, but they get shallow. You recover by design instead of by heroics.
A few patterns that helped us (and that I wish I’d learned sooner):
Pick opinions to match the operator’s job. Not every operator should be a magic box. Some are general-purpose building blocks meant to live in a thousand different environments. Those earn their keep by exposing configuration, with sane defaults and good docs, because there really are many valid ways to run them. Others promise a smooth path. Those should encode the paved road and hide the sharp edges. Our operator is in that second camp: We’re serving customers who want the thing to “just work.” When we removed two of three “harmless” flags, our incident rate dropped — not because flags are bad, but because more choices make it easier to make mistakes.
Model intent, not steps. If your API reads like a runbook — “first allocate this, then write that file” — you’re leaking implementation details that will bite you the moment the environment changes. Say what you want: “A private endpoint for this Service,” “a valid certificate for that hostname,” “credentials that rotate before this date.” Let the controller figure out how to satisfy that in the cluster you actually have, not the one you drew on a whiteboard six months ago.
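To make the contrast concrete, here’s what a step-shaped spec looks like next to an intent-shaped one. Every field name here is invented for illustration; no real CRD is being quoted:

```go
package main

import "fmt"

// Step-shaped: a runbook wearing a struct costume. It leaks *how*,
// and breaks the moment the environment stops matching the whiteboard.
type RunbookSpec struct {
	AllocateIP   string // step 1
	WriteDNSName string // step 2
	CertFilePath string // step 3: write that file
}

// Intent-shaped: says *what* should be true. The controller owns the how,
// and can satisfy the same intent in whatever cluster it actually finds.
type EndpointSpec struct {
	Service      string   // the Service to expose
	Hostname     string   // stable name clients use
	RotateBefore string   // renew credentials this long before expiry
	AllowedTo    []string // identities permitted to connect
}

func main() {
	spec := EndpointSpec{
		Service:      "billing",
		Hostname:     "billing.internal.example.com",
		RotateBefore: "720h",
		AllowedTo:    []string{"payments-team"},
	}
	fmt.Println(spec.Hostname)
}
```

Nothing in the second struct tells the controller which IP to allocate or which file to write — which is exactly why it survives the environment changing underneath it.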
Keep things small enough to reason about. The most tempting anti-pattern is the mega-controller that “just does everything.” It’s amazing right up until it isn’t, and then nobody wants to touch it. A handful of clear, composable reconcilers will always beat one giant brain that requires a four-hour orientation session and a strong coffee.
Finally, respect the blast radius. A good controller is humble. It uses least privilege, backs off when reality looks weird, and surfaces clear status so a human can step in.
If you squint, operators are just operationalized empathy. They assume your environment and your coworkers will change things for good reasons and bad, on schedules and off. They don’t punish that; they adapt. That’s a useful stance for platform teams in general. We’re not here to freeze reality. We’re here to give people an interface that stays true as reality moves.
When someone asks me, “Why operators?” I think of that Tuesday plan and the Wednesday surprise. One approach treats infrastructure like a snapshot you occasionally retouch. The other treats it like a livestream with a steady hand on the camera. Both have their place. The interesting part is choosing which one matches the half-life of the thing you’re building.
If you remember nothing else: Design for the cadence of change. Use provisioning tools for the things that change slowly and deliberately. Use controllers for the things that won’t sit still. And try to make the boring parts automatic so the exciting parts can be, you know, the actual product.