From PagerDuty to ‘Agentic Ops’: The Rise of Self-Healing Kubernetes
For the last decade, the life of a site reliability engineer (SRE) has been defined by the 3 a.m. PagerDuty page. When the alert fires, a human wakes up, logs in, checks dashboards and frantically searches for a runbook.
We built sophisticated tools to make this easier. We have centralized logging, distributed tracing and fancy Grafana dashboards.
However, fundamentally, the workflow hasn’t changed: The machine breaks, and the human fixes it.
In 2026, we are witnessing the first real shift away from this model.
We are moving from Automated Ops (scripts that do what they are told) to Agentic Ops — systems that can reason, investigate and remediate problems without human intervention.
This isn’t science fiction. It is the convergence of three specific technologies: Extended Berkeley Packet Filter (eBPF), large language models (LLMs) and Kubernetes operators. Here is how they are combining to kill the 3 a.m. pager call.
1. eBPF: The ‘Eyes’ of the Agent
The biggest barrier to AI in DevOps has always been context. An LLM cannot fix a bug if it doesn’t know what happened. Past attempts at AIOps failed because they relied on logs, which are often incomplete, unstructured or missing entirely during a crash.
Enter eBPF.
eBPF allows us to attach programs to the Linux kernel safely. It sees everything: Every syscall, every network packet, every file read — without requiring the application to be instrumented.
For an AI agent, eBPF provides the ground truth. It doesn’t need to guess why a pod crashed; it can see the exact error code returned by the kernel when the process tried to allocate memory. It can see the exact TCP packet that was dropped by the firewall. This high-fidelity, structured data is the perfect food for a reasoning engine.
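To make that concrete, here is a minimal sketch of what "structured kernel truth" looks like to an agent. The event schema is a hypothetical example of what an eBPF collector (such as a bcc or bpftrace program emitting JSON) might export; the aggregation simply turns raw failed syscalls into the kind of summary a reasoning engine can consume.

```python
from collections import Counter
import errno

# Hypothetical structured events as an eBPF collector might export them:
# one record per failed syscall, with the exact kernel error code.
events = [
    {"pid": 4112, "comm": "api-server", "syscall": "mmap", "errno": errno.ENOMEM},
    {"pid": 4112, "comm": "api-server", "syscall": "mmap", "errno": errno.ENOMEM},
    {"pid": 5120, "comm": "worker", "syscall": "connect", "errno": errno.ECONNREFUSED},
]

def summarize(events):
    """Aggregate kernel error codes per process, so the agent reasons over
    'api-server failed mmap with ENOMEM, twice' instead of raw log lines."""
    counts = Counter(
        (e["comm"], e["syscall"], errno.errorcode[e["errno"]]) for e in events
    )
    return {f"{comm}:{sc}:{err}": n for (comm, sc, err), n in counts.items()}

print(summarize(events))
```

The point of the summary step is that the agent never has to parse free-text logs: the kernel already told it which call failed and why.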
2. The Agentic Loop: Observe, Orient, Decide, Act
The old ChatOps model was passive: You asked a bot, “Why is the site down?” and it queried Prometheus.
The Agentic model is active. It runs a continuous observe, orient, decide, act (OODA) loop.
Imagine a scenario where a microservice starts returning 500 errors.
- Observe: The Agent detects the anomaly via Prometheus alerts.
- Orient: It queries the eBPF data and sees a spike in database connection timeouts. It correlates this with a recent deployment that changed the connection pool size.
- Decide: It reasons. The connection pool is exhausted. The immediate fix is to roll back the deployment. The secondary fix is to increase the max_connections config.
- Act: It triggers a Kubernetes rollback via the API server.
Crucially, the human is on the loop, not in the loop. The agent notifies the SRE: “I detected a crash loop and rolled back to version v1.2. Service is restored. Here is the root cause analysis (RCA).”
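The scenario above can be sketched as code. Everything here is illustrative: the signal thresholds, the `Observation` fields and the action names are assumptions, not a real agent API, and the real "act" step would call the Kubernetes API server rather than return a string.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    error_rate: float            # fraction of 5xx responses (Observe)
    db_timeouts: int             # connection timeouts seen via eBPF (Orient)
    recent_deploy: Optional[str] # last deployment, if any

def decide(obs: Observation) -> str:
    """Decide: map a diagnosed cause to a constrained remediation."""
    if obs.error_rate < 0.05:
        return "noop"
    # Orient: correlate the error spike with pool exhaustion introduced
    # by a recent deployment.
    if obs.db_timeouts > 100 and obs.recent_deploy:
        return f"rollback:{obs.recent_deploy}"
    return "escalate-to-human"

def act(decision: str) -> str:
    """Act, then notify the on-call SRE (human on the loop, not in it)."""
    if decision.startswith("rollback:"):
        version = decision.split(":", 1)[1]
        # A real agent would trigger the rollback via the Kubernetes API here.
        return f"Rolled back to {version}. Service restored. RCA attached."
    return decision

print(act(decide(Observation(error_rate=0.32, db_timeouts=250, recent_deploy="v1.2"))))
```

Note that the loop's fallback is always escalation: when the correlation doesn't match a known pattern, the agent hands the incident to a human rather than guessing.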
3. Guardrails: Trusting the Machine
The immediate fear for any practitioner is: “What if the AI deletes the database?”
This is where Kubernetes operators and policy as code step in. In an Agentic Ops world, we don’t give AI root access. We provide it with a set of constrained tools.
We wrap sensitive actions in safe mode. An agent might have permission to restart a pod or scale a deployment, but not to delete a PersistentVolume. We use tools such as Open Policy Agent (OPA) or Kyverno to enforce these guardrails.
If the agent decides a destructive action is necessary (like wiping a cache), it can escalate to a human for approval. This tiered autonomy allows us to automate 90% of the mundane incidents (disk full, memory leak, hung process) while keeping humans in charge of the dangerous 10%.
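Tiered autonomy reduces to a small decision table. A sketch, with illustrative action names; in production this policy would be enforced at the cluster boundary by OPA or Kyverno rather than inside the agent itself:

```python
# Illustrative action tiers for an agent's constrained toolset.
AUTONOMOUS = {"restart_pod", "scale_deployment", "rollback_deployment"}
NEEDS_APPROVAL = {"flush_cache", "drain_node"}
FORBIDDEN = {"delete_persistentvolume", "delete_namespace"}

def authorize(action: str) -> str:
    """Return the guardrail verdict for a proposed agent action."""
    if action in FORBIDDEN:
        return "deny"
    if action in NEEDS_APPROVAL:
        return "escalate"   # page a human for explicit approval
    if action in AUTONOMOUS:
        return "allow"
    return "deny"           # default-deny anything unrecognized

print(authorize("restart_pod"))             # mundane: handled autonomously
print(authorize("flush_cache"))             # destructive: human approves
print(authorize("delete_persistentvolume")) # never available to the agent
```

The default-deny branch is the important design choice: an agent that hallucinates a new action name gets a refusal, not root access.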
4. Predictive Scaling: Moving Beyond the HPA
Finally, Agentic Ops changes how we scale. The horizontal pod autoscaler (HPA) is reactive. It waits for CPU usage to spike before it adds pods. By then, latency has already increased.
An agentic system looks at historical traffic patterns, calendar events (e.g., Black Friday) and upstream signals. It doesn’t just react to load; it predicts it.
It can pre-scale the cluster 10 minutes before the traffic wave hits, avoiding the latency degradation that a reactive HPA accepts as inevitable. This isn’t just better performance — it’s better economics. The agent can aggressively scale down when it predicts a lull, saving cloud costs that a conservative HPA configuration would waste.
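A toy version of the idea: forecast the next window’s request rate from a trailing average plus a calendar multiplier, then size the deployment ahead of the wave. Every number here (per-pod capacity, the event multiplier, the headroom factor) is an illustrative assumption, not a tuned value.

```python
from math import ceil

REQS_PER_POD = 500                     # assumed per-pod capacity (req/s)
CALENDAR_BOOST = {"black_friday": 4.0} # assumed multiplier for known events

def forecast(history, event=None):
    """Predict next-window load: trailing mean scaled by known calendar events."""
    baseline = sum(history[-6:]) / min(len(history), 6)
    return baseline * CALENDAR_BOOST.get(event, 1.0)

def target_replicas(history, event=None, headroom=1.2):
    """Size the deployment before the wave hits, with headroom, instead of
    waiting for CPU to spike the way a reactive HPA does."""
    return ceil(forecast(history, event) * headroom / REQS_PER_POD)

traffic = [900, 950, 1000, 980, 1010, 990]  # recent req/s samples
print(target_replicas(traffic))                  # quiet Tuesday
print(target_replicas(traffic, "black_friday"))  # pre-scaled for the surge
```

The same forecast drives the savings side: when the predicted load drops, `target_replicas` falls with it, and the agent can scale down sooner than a conservative HPA would dare.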
Conclusion: The New Role of the SRE
Does this mean the SRE is obsolete? Far from it.
The role is simply shifting up the stack. We are no longer operators who turn knobs. We are becoming architects of agents. Our job is to define the goals, set the constraints and train the models on our specific infrastructure context.
The goal of Agentic Ops isn’t to replace the engineer; it’s to let the engineer sleep through the night, knowing that the system is smart enough to save itself.


