Ten Years of the Operator Pattern: What We Got Right, What We’d Change

June 5, 2026 Ketankumar Jani cloud native, devops, kubernetes, Operators, platform engineering

CoreOS introduced the operator pattern in November 2016, and nearly a decade later operators are everywhere. Many CNCF graduated projects ship one, most database vendors offer one, and most platform teams have written at least one of their own. We have enough operational experience now to ask the question that’s been polite to skip: was this a good idea?

The honest answer is yes, but with caveats nobody wrote down in 2016 because nobody knew them yet.

What the Pattern Got Right

The fundamental insight was correct in 2016 and still is. Operators encode domain expertise about stateful systems as code, putting things like provisioning, version upgrades, failover, backup, and point-in-time recovery into the same artifact that runs the workload. The Postgres operator carries the dozen smaller decisions that used to live in a senior DBA’s head, the etcd operator knows how to add a member without losing quorum, and the Kafka operator handles partition rebalancing without breaking consumers. That operations expertise previously lived in runbooks that went stale, wikis that nobody updated, and tribal memory that left when senior engineers left; it now lives in versioned, tested, deployable software. The field is collectively better off for it, and expertise is reusable in a way it wasn’t before.

The pattern also matched Kubernetes well, which isn’t a small thing. CRDs extend the API and controllers reconcile state, and the control loop pattern that Kubernetes uses for its built-in resources extends naturally to arbitrary domains. If you understand how Deployment controllers work, you mostly understand how every operator works, so the cognitive overhead for operators of operators is genuinely low. That’s good design.
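As a rough illustration of that control loop, here is a sketch in plain Python with no Kubernetes client involved. The names `reconcile`, `desired`, and `actual` are made up for this example, and a real controller would be driven by watch events and a work queue rather than direct calls.

```python
def reconcile(desired: dict, actual: dict) -> list[str]:
    """Move `actual` toward `desired` and report what changed."""
    actions = []
    # Create or update anything that is missing or has drifted.
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actual[name] = dict(spec)
            actions.append(f"update {name}")
    # Delete anything that is no longer wanted.
    for name in list(actual):
        if name not in desired:
            del actual[name]
            actions.append(f"delete {name}")
    return actions


desired = {"web": {"replicas": 3}, "worker": {"replicas": 1}}
actual = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}

first_pass = reconcile(desired, actual)
second_pass = reconcile(desired, actual)  # state has converged, nothing to do
print(first_pass)
print(second_pass)
```

The property that matters is that the second pass is a no-op: the loop compares state rather than replaying steps, which is the same shape whether the resource is a Deployment or a custom one.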

What We Got Wrong

The community vastly underestimated the operational maintenance cost of operators. Adopting an operator isn’t a one-time decision; it’s a commitment to track upstream releases, apply security patches on the operator’s schedule rather than yours, manage CRD migrations across versions (including stored data conversions), test operator upgrades against your specific workload patterns, debug controller behavior when the reconciliation loop does something unexpected, and plan for operator failure modes. When the operator is down you have no automation, but whether that also means no operations depends on the operator and on how you’ve designed around its absence.

Every operator you adopt is a real ongoing cost, and teams running 20 operators have 20 of these commitments. Most haven’t accounted for that in staffing, and you can usually tell because the operators that don’t have a clear owner are the ones perpetually a few versions behind, with security advisories ignored and bug reports filed by users who eventually give up.

Operators also don’t compose well. When Operator A and Operator B both want to manage a shared resource (a Service, a ConfigMap, a Secret), conflicts emerge because the Kubernetes API doesn’t really distinguish between “I created this” and “I claim this.” Two controllers fighting over the same resource is a real failure mode, and it’s hard to debug because each side thinks it’s right; the reconcile-and-retry loop turns into a slow-motion war of attrition that shows up as resource churn in your audit logs.
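The fight is easy to reproduce in miniature. The sketch below is a toy model in plain Python, not real controller code: `operator-a` and `operator-b` are hypothetical controllers that each believe they own the `port` field of one shared resource, and `audit_log` stands in for the cluster audit log.

```python
def make_controller(name: str, wanted_port: int):
    """Return a reconcile function that insists on its own port value."""
    def reconcile(resource: dict, audit_log: list[str]) -> None:
        if resource["port"] != wanted_port:
            audit_log.append(f"{name}: port {resource['port']} -> {wanted_port}")
            resource["port"] = wanted_port
    return reconcile


shared_service = {"port": 8080}
audit_log: list[str] = []
operator_a = make_controller("operator-a", 8080)
operator_b = make_controller("operator-b", 9090)

# Each round, both controllers observe the resource and "fix" it.
for _ in range(3):
    operator_a(shared_service, audit_log)
    operator_b(shared_service, audit_log)

for entry in audit_log:
    print(entry)
```

Neither controller ever errors, and each individual write looks legitimate. The only symptom is that the log keeps growing while the resource never settles, which is why this failure tends to be found in audit logs rather than in alerts.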

CRD versioning is harder than expected. The v1alpha1 to v1beta1 to v1 migration story works in theory, but in practice it breaks when you have to migrate stored data, update finalizers in place, and coordinate operator upgrades with consumer upgrades. Many production operators still serve v1beta1 APIs because the v1 migration is risky enough that nobody wants to be first. The conversion webhook story is solid until something goes wrong with the webhook, at which point you have stored objects that nobody can read.
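A simplified model of why stored data matters, in plain Python. Everything here is an assumption for illustration: the `example.com` group, the `size` to `replicas` rename, and the `broken_webhook` function that stands in for an unreachable conversion webhook. The real API server does considerably more than this.

```python
def convert_v1beta1_to_v1(obj: dict) -> dict:
    """Hypothetical conversion: v1beta1 `size` was renamed to v1 `replicas`."""
    return {
        "apiVersion": "example.com/v1",
        "name": obj["name"],
        "replicas": obj["size"],
    }


def read_as_v1(stored: dict, converter) -> dict:
    """Serve a stored object at v1, converting if it was persisted as v1beta1."""
    if stored["apiVersion"] == "example.com/v1":
        return stored
    return converter(stored)


def broken_webhook(obj: dict) -> dict:
    """Stand-in for a conversion webhook that cannot be reached."""
    raise ConnectionError("conversion webhook unavailable")


storage = [
    {"apiVersion": "example.com/v1beta1", "name": "db-a", "size": 3},
    {"apiVersion": "example.com/v1", "name": "db-b", "replicas": 5},
]

# With a working converter, every object can be served at v1.
healthy = [read_as_v1(obj, convert_v1beta1_to_v1) for obj in storage]

# With a broken converter, objects still stored at v1beta1 become unreadable.
unreadable = []
for obj in storage:
    try:
        read_as_v1(obj, broken_webhook)
    except ConnectionError:
        unreadable.append(obj["name"])

# Storage migration: rewrite every object at v1 while the converter still works.
storage = [read_as_v1(obj, convert_v1beta1_to_v1) for obj in storage]
after_migration = [read_as_v1(obj, broken_webhook) for obj in storage]
print(unreadable, len(after_migration))
```

In this model, objects already persisted at the served version never touch the converter, so rewriting storage ahead of time is what removes the webhook from the read path.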

The Over-Correction

Between roughly 2020 and 2022, the field over-corrected. Every team building anything for Kubernetes defaulted to writing an operator, and we ended up with operators for things that didn’t need to be operators: simple deployment wrappers, configuration distributors, things that were really cron jobs in API clothing.

Writing an operator is a substantial commitment. It requires controller-runtime expertise, CRD design judgment (which is its own skill, and the bad CRDs are usually obvious in hindsight), RBAC scoping that doesn’t accidentally hand out cluster-admin, observability for the controller itself (not just the workloads), and a maintenance plan that survives team turnover. Many of the use cases that produced operators in 2020-2022 would have been better served by a Helm chart, a Kustomize overlay, or a CI/CD step.

A few signs you probably don’t need an operator: the “controller” doesn’t have a control loop and just runs once to apply manifests (that’s a Job); the CRD wraps an existing resource with fewer fields (that’s an admission controller or a template); the reconciliation is a sequence of steps with no error recovery (that’s a script); or the thing being operated has no ongoing operations to encode, which leaves the operator with nothing to do, because an operator can only encode operational knowledge you actually have.

Many of the “operators” from the over-correction era are still running in production but contribute very little. They consume cluster resources, generate API traffic, hold leases, and don’t really earn their keep, which is why the better engineering teams are quietly archiving them in favor of simpler patterns.

What’s Worked Well

Stateful workload operators are the clearest success story. Postgres, MySQL, Cassandra, Kafka, Elasticsearch, MongoDB: provisioning is non-trivial, failover is fiddly, and backup-and-restore involves enough state-machine reasoning that getting them right by hand is rare. Rolling upgrades that preserve consensus or replication add another layer of subtlety. These are exactly the cases where encoding expertise as code pays for itself, and the production track record reflects that.

Cluster lifecycle operators have also worked well. Cluster API and its provider implementations have made Kubernetes itself manageable through Kubernetes APIs, and that recursion turned out to be a useful abstraction rather than a fragile one because the underlying cloud APIs were the messy part and CAPI is the thing that hides them.

Cross-cutting concerns are another category that’s earned its place: External-DNS, cert-manager, sealed-secrets, external-secrets-operator. These are all “watch resources, take action elsewhere” patterns, which is exactly what controllers are for. Application platforms have produced strong operators too, including Argo CD’s Application CRD, Crossplane compositions, and Knative’s Service. These represent durable abstractions that lots of users build on top of, and the operator pattern fits them well.

What Hasn’t Worked Well

Application-specific operators for stateless workloads have rarely justified themselves. The MyAppOperator that wraps a Deployment and a Service could almost always have been a Helm chart at a fraction of the cost, and the teams that wrote these tend to know it now. The same story applies to operators that fight Kubernetes built-ins: if you’re reimplementing scheduling, networking, or RBAC, you’re probably going to lose, and most of these projects either get deprecated or end up maintained heroically by a small team that can’t keep up with upstream changes.

Multi-tenant operators that try to manage workloads across many tenants often fail to scale. The Kubernetes API server, etcd, and the cache layers in client-go all assume the controller is responsible for a known set of resources, so when that set becomes “everything in the cluster across many namespaces,” the assumptions stop holding and you get hot-loops, watch storms, or quietly missed events.

What We’d Change

The community needs a clearer story for “this should not be an operator.” Defaults have improved, but the bar to write an operator is still set too low.

CRD evolution needs better tooling. The conversion webhook design handles the easy cases well, but the hard cases (renaming fields with semantic change, splitting one CRD into two, dropping a field that’s been written) need more support than they currently get.
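To make one of the hard cases concrete, here is a sketch of what goes wrong when a newer version drops a field that older clients have already written. The versions, the `legacyMode` field, and the annotation key are all hypothetical. Stashing the dropped value in an annotation is one workaround some projects use; it is shown as an illustration, not as a recommendation.

```python
ANNOTATION = "example.com/v1-legacy-mode"  # hypothetical annotation key


def v1_to_v2(obj: dict) -> dict:
    """Hypothetical upgrade: v2 drops `legacyMode`, which v1 clients still write."""
    return {"apiVersion": "example.com/v2", "name": obj["name"],
            "replicas": obj["replicas"]}


def v2_to_v1(obj: dict) -> dict:
    """Downgrade has to invent a value for the field v2 no longer carries."""
    return {"apiVersion": "example.com/v1", "name": obj["name"],
            "replicas": obj["replicas"], "legacyMode": False}


def v1_to_v2_preserving(obj: dict) -> dict:
    """Same upgrade, but stash the dropped value in an annotation."""
    converted = v1_to_v2(obj)
    converted["annotations"] = {ANNOTATION: str(obj["legacyMode"])}
    return converted


def v2_to_v1_preserving(obj: dict) -> dict:
    """Same downgrade, but restore the stashed value if one is present."""
    restored = v2_to_v1(obj)
    saved = obj.get("annotations", {}).get(ANNOTATION)
    if saved is not None:
        restored["legacyMode"] = saved == "True"
    return restored


original = {"apiVersion": "example.com/v1", "name": "db-a",
            "replicas": 3, "legacyMode": True}

lossy = v2_to_v1(v1_to_v2(original))
lossless = v2_to_v1_preserving(v1_to_v2_preserving(original))
print(lossy == original, lossless == original)
```

The naive pair silently flips `legacyMode` on a round trip, and nothing in the conversion machinery flags that. Round-trip tests like the last two lines are cheap to write and catch this class of bug before it reaches stored data.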

Operator composition needs conventions. When two operators interact, the failure modes shouldn’t be “fight to a standstill”; some kind of ownership semantics in the API would help, even if it’s just an annotation convention that the major controllers agree to honor.
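A sketch of what such an annotation convention might look like, again as a toy model in plain Python. The `example.com/managed-by` key and the "first controller to arrive claims the resource" rule are assumptions made for this example, not an existing Kubernetes convention.

```python
OWNER_KEY = "example.com/managed-by"  # hypothetical annotation key


def reconcile_with_claim(controller: str, wanted_port: int,
                         resource: dict, audit_log: list[str]) -> bool:
    """Reconcile only if this controller holds the claim on the resource."""
    annotations = resource.setdefault("annotations", {})
    owner = annotations.get(OWNER_KEY)
    if owner is None:
        # Unclaimed: the first controller to arrive takes ownership.
        annotations[OWNER_KEY] = controller
        owner = controller
    if owner != controller:
        # Someone else owns it: back off instead of overwriting.
        audit_log.append(f"{controller}: skipped, owned by {owner}")
        return False
    if resource.get("port") != wanted_port:
        resource["port"] = wanted_port
        audit_log.append(f"{controller}: set port {wanted_port}")
    return True


service = {"port": 8080}
audit_log: list[str] = []

for _ in range(3):
    reconcile_with_claim("operator-a", 8080, service, audit_log)
    reconcile_with_claim("operator-b", 9090, service, audit_log)

print(service["port"], service["annotations"][OWNER_KEY])
```

The losing controller still runs every round, but it records a skip instead of a write, so the resource stays stable and the conflict is visible. In a real cluster the claim step would itself need optimistic concurrency so that two controllers cannot both believe they arrived first.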

Operator lifecycle needs a story that doesn’t depend on Helm or on yet another operator. The operator-installer-operator pattern that OLM (Operator Lifecycle Manager) uses is a workaround for Kubernetes lacking native package management. It works, but it’s funny that we install operators with another operator.

Where the Field Is Going

In 2026, operators are increasingly pieces of larger platforms (Crossplane compositions, OperatorHub bundles, Cluster API providers) rather than standalone tools, which is a healthy direction. The “should this be an operator” conversation is more honest than it used to be, and newer projects start with simpler primitives like Jobs, CronJobs, controllers without CRDs, or plain Go programs talking to the Kubernetes API, adding operators only when the operational complexity actually warrants it.

Ten years in, the operator pattern has earned its place, and the next ten are about being more precise about when to use it.