Kubernetes Was the Easy Part

May 18, 2026 | Alan Shimel | AI, cloud native, kubernetes, Linux Foundation, MCPCon

Kubernetes was hard. Nobody who lived through the early container years should pretend otherwise. The industry had to learn a new operating model, a new control plane, a new vocabulary and a new way of thinking about infrastructure. Scheduling, service discovery, storage, networking, secrets, upgrades, observability and platform abstraction all had to be reworked around a container-first world.

But looking back, Kubernetes solved a relatively clean problem.

That may sound strange given how much pain Kubernetes caused and, in many organizations, still causes. But Kubernetes was built around deterministic orchestration. You declared a desired state. The control plane worked to converge the system toward that state. A pod was running or it was not. A deployment was healthy or it was not. A service had endpoints or it did not. A container image, for all its flaws, was still a known artifact. A manifest could be inspected. A configuration could be rolled back. A cluster could be observed through metrics, logs, events and traces.
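To make the desired-state idea concrete, here is a minimal sketch of a reconcile step in Python. It is not Kubernetes code. The function name, the replica-count dictionaries and the action strings are all assumptions made up for illustration. The point is only that the same desired and observed state always produce the same plan.

```python
from typing import Dict, List


def reconcile(desired: Dict[str, int], observed: Dict[str, int]) -> List[str]:
    """Return the actions that move observed replica counts to the desired ones.

    Hypothetical helper: a toy stand-in for a controller's reconcile step.
    """
    actions = []
    # Look at every workload named in either state, in a stable order.
    for name in sorted(desired.keys() | observed.keys()):
        want = desired.get(name, 0)
        have = observed.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} x {name}")
        elif have > want:
            actions.append(f"stop {have - want} x {name}")
    return actions


desired = {"web": 3, "worker": 2}
observed = {"web": 1, "worker": 2, "legacy": 1}
plan = reconcile(desired, observed)
print(plan)  # ['stop 1 x legacy', 'start 2 x web']
```

Once observed state matches desired state, the plan is empty. That is the tidy property agents do not give us.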

The next problem does not look nearly as tidy.

Agentic systems are bringing the cloud-native world into probabilistic orchestration. That is a very different animal. Agents do not just run. They reason, retrieve context, call tools, make choices, generate outputs and sometimes act on behalf of users or systems. They may take different paths to solve the same problem. They may produce different results from similar inputs. They may invoke different tools depending on context, memory, policy and model behavior.

That breaks a lot of assumptions.

This is one of the more important undercurrents at Open Source Summit North America 2026. The Linux Foundation and its broader ecosystem are no longer talking only about Linux, containers, Kubernetes and supply chains. Agentic AI, MCP, open AI infrastructure and standards for agent interoperability are now part of the open source infrastructure conversation. The Linux Foundation’s Agentic AI Foundation was launched with contributions including Anthropic’s Model Context Protocol, Block’s goose and OpenAI’s AGENTS.md, which tells you the agent discussion has already moved beyond demos and into governance, standards and shared plumbing. (Linux Foundation) The Agentic AI Foundation is also building a global events program around AGNTCon, MCPCon and MCP Dev Summits, specifically aimed at moving agents from experimentation into production. (Linux Foundation)

That is the real story. Agents are becoming infrastructure.

And once something becomes infrastructure, the hard questions start.

Kubernetes gave us a clean abstraction around workloads. It did not make infrastructure simple, but it gave the industry a common model. Desired state became the lingua franca. Platform teams could build around that. GitOps could build around that. Observability tools could build around that. Policy engines could build around that. Security teams could build around that. Developers did not need to understand every detail of the underlying system to ship software.

Agentic systems are not so cooperative.

What is the desired state of an agentic workflow? Is it the completion of a task? The quality of a decision? The safe use of a tool? The absence of policy violations? The explainability of the path taken? The fact that a human approved the right step at the right time? The answer is probably all of the above, which means we are no longer just managing infrastructure state. We are managing behavior.

That is a big shift.

Traditional cloud-native observability is still necessary, but it is not sufficient. Metrics, logs and traces tell us whether systems are alive, slow, broken or misconfigured. They do not fully tell us why an agent chose one action over another, which context it relied on, which tool it invoked, whether the prompt was manipulated, whether policy was evaluated correctly or whether the outcome was acceptable.

For agents, the question is not just “is the service healthy?” The question becomes “did the system behave within operational, security and business boundaries?”

That requires a different kind of observability. We need to see decisions, tool calls, context sources, memory access, prompt instructions, policy checks, approval gates, confidence signals and exception paths. We need behavioral traces, not just distributed traces. We need to understand not only what ran, but why it ran and under what authority.
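One way to picture a behavioral trace is as a record that carries the decision and its authority alongside the usual span data. The sketch below is an assumption, not an existing standard or library schema. The class name, every field name and the sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class BehavioralSpan:
    """Hypothetical record of one agent action: what ran, why, under what authority."""
    agent_id: str
    acting_for: str   # identity the agent acted on behalf of
    decision: str     # what the agent chose to do
    rationale: str    # agent-reported reason; treat as a claim, not ground truth
    tool: str         # tool that was invoked
    context_sources: List[str] = field(default_factory=list)
    policy_checks: Dict[str, bool] = field(default_factory=dict)
    approved_by: Optional[str] = None  # human approver, if any

    def within_bounds(self) -> bool:
        # An action with no recorded policy evaluation counts as out of bounds.
        return bool(self.policy_checks) and all(self.policy_checks.values())


span = BehavioralSpan(
    agent_id="ticket-bot",
    acting_for="user:alice",
    decision="close duplicate ticket",
    rationale="matched an existing open ticket",
    tool="tickets.update",
    context_sources=["ticket:1042", "kb:dedupe-policy"],
    policy_checks={"scope_ok": True, "pii_ok": True},
)
record = json.dumps(asdict(span))  # ready to ship to a log or trace backend
```

The design choice worth noting is that a missing policy check fails closed. A span that cannot show which policy was evaluated is not treated as healthy.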

Rollback gets even messier.

In Kubernetes, rollback usually means reverting a deployment, an image, a configuration or a manifest. That can be painful, but the pattern is understandable. In an agentic system, rollback might mean undoing a cloud configuration change, reopening a ticket, correcting a database entry, retracting a customer communication, reversing an access decision or explaining why an autonomous workflow took an action in the first place.

Some of those actions are reversible. Some are not. Some are technically reversible but politically, legally or operationally damaging.

That means the old instinct of “we can fix it in rollback” becomes dangerous. Agentic systems need stronger pre-action controls. They need policy before execution, not just remediation after execution. They need scoped permissions, human approval for high-risk actions, tool attestation, context validation and auditability built into the runtime path.
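Policy before execution can be sketched as a gate that every tool call must pass before it runs. This is a toy under stated assumptions. The agent name, the scope strings and the high-risk list are invented for the example, and a real system would pull them from an identity and policy service.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical scoped permissions per agent.
AGENT_SCOPES = {"ticket-bot": {"tickets.read", "tickets.update", "email.send"}}
# Hypothetical actions that are hard or impossible to roll back.
HIGH_RISK = {"email.send", "iam.grant"}


def check_before_execute(agent: str, action: str, approved: bool = False) -> Verdict:
    """Decide whether an action may run, before it runs."""
    # Unknown agents and out-of-scope actions are denied outright.
    if action not in AGENT_SCOPES.get(agent, set()):
        return Verdict.DENY
    # In-scope but high-risk actions wait for a human.
    if action in HIGH_RISK and not approved:
        return Verdict.NEEDS_APPROVAL
    return Verdict.ALLOW


print(check_before_execute("ticket-bot", "tickets.update"))  # Verdict.ALLOW
print(check_before_execute("ticket-bot", "email.send"))      # Verdict.NEEDS_APPROVAL
print(check_before_execute("ticket-bot", "iam.grant"))       # Verdict.DENY
```

The customer email is the case that matters here. It is in scope, but it cannot be unsent, so it stops at the approval gate instead of relying on rollback.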

Testing also changes.

CI/CD pipelines were built around repeatability. We write tests. We run tests. We expect the same input to produce the same output. We promote builds through environments based on whether they pass or fail. That model does not disappear, but it does start to strain when applied to agents.

Agentic testing is going to be more statistical than binary. We will still need unit tests, integration tests and regression tests around the software components. But we will also need scenario testing, adversarial testing, tool-call validation, policy simulation, prompt-injection testing and outcome evaluation. We will need to test ranges of acceptable behavior, not just exact outputs.
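Here is a rough sketch of what "statistical, not binary" might look like in a pipeline. The agent is a stand-in that picks an outcome at random, and the function names, the scenario string and the 0.85 threshold are all assumptions. A real harness would call the actual agent and use a real evaluator.

```python
import random
from typing import Callable


def pass_rate(agent: Callable, scenario: str, accept: Callable[[str], bool],
              trials: int = 200, seed: int = 0) -> float:
    """Run one scenario many times and return the fraction of acceptable outcomes."""
    rng = random.Random(seed)  # seeded so the evaluation itself is repeatable
    passed = sum(1 for _ in range(trials) if accept(agent(scenario, rng)))
    return passed / trials


def flaky_agent(scenario: str, rng: random.Random) -> str:
    # Hypothetical nondeterministic agent: usually refunds, sometimes escalates.
    return "refund" if rng.random() < 0.9 else "escalate"


# Exact-output test: only one answer counts as a pass.
strict = pass_rate(flaky_agent, "late delivery", lambda out: out == "refund")
# Range-of-behavior test: any outcome in the acceptable set counts.
ranged = pass_rate(flaky_agent, "late delivery",
                   lambda out: out in {"refund", "escalate"})

THRESHOLD = 0.85  # hypothetical promotion gate
print(f"strict={strict:.2f} ranged={ranged:.2f} promote={ranged >= THRESHOLD}")
```

The build gate becomes a threshold on a rate rather than a single green or red check, which is exactly the part that will feel unfamiliar.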

That will make a lot of current DevOps muscle memory uncomfortable.

The same is true for governance. Cloud-native governance has mostly centered on build-time and deploy-time controls. Scan the image. Verify the SBOM. Check the dependency. Apply policy-as-code. Control admission into the cluster. Manage secrets. Enforce access rules.

Agents move the risk into runtime.

The risk is not only what was deployed. It is what the agent decides to do after deployment. Which tool did it call? What data did it see? Which identity did it use? What policy boundary applied at that moment? Did the agent have permission to take that action in that context? Was there a human in the loop? Was there a reliable audit trail?

This is where runtime trust becomes the center of gravity. The artifact still matters. The model still matters. The infrastructure still matters. But the live decision path matters more than anything. A trusted deployment can still produce an untrusted action if the runtime controls are weak.

Scaling agents is another trap.

Scaling pods is conceptually simple. Add replicas. Distribute load. Watch resource limits. Adjust autoscaling policies. Scaling agents means something else entirely. It means scaling decision-makers. It means many autonomous or semi-autonomous systems acting in parallel, possibly across shared tools, shared memory, shared workflows and shared business processes.

That raises harder questions. Can two agents collide? Can they take conflicting actions? Can one agent undo another agent’s work? How is shared context managed? How are identities issued and scoped? What prevents agent sprawl? How does a platform team know which agents exist, what they can touch and who is accountable for their behavior?

This is where platform engineering gets pulled back to its roots. Platform engineering started by abstracting Kubernetes and cloud-native complexity for developers. Internal developer portals, golden paths and paved roads helped organizations make Kubernetes consumable. Now platform teams are being asked to do something larger. They need to provide the operating model for agentic workflows.

That means catalogs will need to include agents, tools, policies, identities, memory sources, approval chains and audit requirements. The platform will not just answer “where does this service run?” It will need to answer “what is this agent allowed to do, under what conditions, with which tools, against which systems, and with what evidence trail?”

That is a very different platform conversation.

The open source community has an important role here because the agent stack cannot be left entirely to proprietary platforms. MCP, agent frameworks, tool registries, identity models, policy engines, evaluation systems and runtime governance frameworks all need open standards and shared implementations. Microsoft’s Open Source Summit messaging casts open source as the foundation for AI and stresses the need for secure, predictable infrastructure for building apps and agents. It is a good example of how quickly this conversation is converging with the cloud-native world. (opensource.microsoft.com)

But open standards alone will not solve the problem. They create the common language. They do not remove the operational burden.

The uncomfortable truth is that most current cloud-native tooling assumes predictability. Agents attack that assumption at the root. Not maliciously, necessarily. Just structurally. They introduce variability where our tools expect repeatability. They introduce judgment where our pipelines expect determinism. They introduce context where our policies often expect static inputs. They introduce action where our observability stacks mostly record behavior after the fact.

That does not mean cloud native is obsolete. Quite the opposite. Cloud native is the foundation the agentic era will build on. Kubernetes, containers, service meshes, GitOps, policy-as-code and observability are not going away. They are becoming the substrate.

But the control plane is moving up the stack.

The next control plane will not just schedule workloads. It will coordinate intent, context, tools, policy, identity, memory and trust. It will need to understand not only whether something is running, but whether something should be allowed to act. That is the leap from deterministic orchestration to probabilistic orchestration.

Open Source Summit is a good place for this conversation because this is exactly the kind of problem open source communities have solved before. They took messy infrastructure and created shared operating models. They created common primitives. They built ecosystems around them. Kubernetes became the symbol of that achievement.

Now the industry needs to do it again, but this time the thing being orchestrated is less predictable.

Shimmy’s Take

Kubernetes was the hard part of the last decade. It forced the industry to rethink infrastructure, operations and developer experience. But Kubernetes also had one thing going for it: the problem had boundaries. Containers were strange at first, but they were still software artifacts running inside a control plane designed around desired state.

Agents are different. They operate with context. They call tools. They make decisions. They act. That makes them powerful, but it also makes them operationally uncomfortable.

The cloud-native community should not run from that discomfort. It should sit with it. Because the next great infrastructure challenge is already here. Kubernetes taught us how to orchestrate workloads. Agentic systems will force us to orchestrate behavior.

That is a much harder problem.

And yes, it may turn out that Kubernetes was the easy part.