Komodor Extends Autonomous AI Agent for Optimizing Kubernetes Clusters
Komodor today added autonomous self-healing and cost optimization capabilities to an artificial intelligence (AI) platform designed to automate site reliability engineering (SRE) workflows across Kubernetes environments.
Company CTO Itiel Shwartz said those capabilities further extend an AI agent, dubbed Klaudia, that Komodor developed to automatically detect, investigate, and remediate issues, with or without human engineers in the loop.
For example, the new capabilities make it possible for Klaudia to autonomously resolve common failures such as pod crashes, misconfigurations, and failed rollouts.
SRE teams can now dynamically right-size workloads to balance cost, performance, and reliability; intelligently schedule pods; maximize infrastructure utilization; and mitigate any issues that might arise from overly aggressive IT scaling policies. They can also invoke a PodMotion capability to move pods across nodes with zero downtime.
Klaudia is trained on telemetry that Komodor has aggregated from thousands of production environments, which gives the agent the level of Kubernetes domain expertise needed to perform tasks autonomously. The AI agent continuously monitors workloads, applies reasoning and causality to identify anomalies, and automatically remediates issues in alignment with IT policies.
Rather than trying to create an AI agent capable of managing all types of SRE workflows, Komodor has focused Klaudia squarely on optimizing Kubernetes deployments, noted Shwartz. Iterative learning loops driven by continuous health checks and user feedback allow the platform to resolve tasks more effectively and with greater precision over time, he added.
Guardrails also enable SRE teams to set the scope of automation according to predefined levels, ensuring actions stay within desired operational boundaries. Klaudia has already proven its ability to autonomously manage thousands of issues, so it’s now up to each overworked SRE team to decide to what degree to rely on Klaudia to reduce the number of tasks they need to perform manually, said Shwartz.
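To make the idea of level-based guardrails concrete, here is a minimal sketch in Python. It does not reflect Komodor's actual configuration or API, which the article does not describe. The level names, the action names, and the `decide` function are all hypothetical, chosen only to illustrate how a predefined automation level could bound what an agent is allowed to do on its own.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    """Hypothetical automation levels, ordered from least to most autonomy."""
    OBSERVE = 0        # agent only records what it would have done
    SUGGEST = 1        # agent proposes actions for a human to approve
    AUTO_LOW_RISK = 2  # agent executes low-risk actions on its own
    FULL_AUTO = 3      # agent executes every known action on its own


# Hypothetical mapping from a remediation action to the minimum level
# at which it may run without human approval.
REQUIRED_LEVEL = {
    "restart_pod": AutomationLevel.AUTO_LOW_RISK,
    "rollback_deployment": AutomationLevel.FULL_AUTO,
    "resize_requests": AutomationLevel.FULL_AUTO,
}


def decide(action: str, configured_level: AutomationLevel) -> str:
    """Return what the agent may do with `action` at the team's configured level."""
    required = REQUIRED_LEVEL.get(action)
    if required is None:
        # Actions outside the allow-list are never executed.
        return "deny"
    if configured_level >= required:
        return "execute"
    if configured_level >= AutomationLevel.SUGGEST:
        return "needs_approval"
    return "log_only"


if __name__ == "__main__":
    level = AutomationLevel.AUTO_LOW_RISK
    for action in ("restart_pod", "rollback_deployment", "delete_namespace"):
        print(action, "->", decide(action, level))
```

In this sketch, a team that is not yet comfortable with full autonomy can start at a lower level and raise it later, which mirrors the decision the article says each SRE team now faces.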
A recent Komodor analysis of thousands of incidents involving Kubernetes clusters finds IT teams are spending 34 workdays per year resolving issues, with 79% of those incidents stemming from recent system changes. The report also finds more than 60% of the time spent managing Kubernetes clusters goes to troubleshooting issues, while only 20% of incidents are resolved without escalation.
More troubling still, more than 65% of workloads run under half their requested CPU or memory, suggesting that wasted spending on the infrastructure required to run Kubernetes clusters is exceedingly high. A full 82% of Kubernetes workloads are overprovisioned, compared to 11% that are underprovisioned, the report finds. Komodor estimates almost 90% of organizations are also overspending on cloud resources, with capacity utilization often falling below 80%. More than a third of IT teams (37%) need to rightsize 50% or more of their workloads.
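As a rough illustration of how figures like these could be computed, the sketch below classifies workloads by comparing observed CPU usage to requested CPU. The report's exact methodology is not given in the article, so the thresholds here are assumptions: a workload using less than half its request is treated as overprovisioned, and one using more than its request as underprovisioned. The `Workload` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpu_request_millicores: int  # what the workload asks the scheduler for
    cpu_usage_millicores: int    # observed usage, e.g. a high percentile


def classify(w: Workload, over_threshold: float = 0.5,
             under_threshold: float = 1.0) -> str:
    """Label a workload by its usage-to-request ratio (assumed thresholds)."""
    if w.cpu_request_millicores <= 0:
        raise ValueError(f"{w.name}: CPU request must be positive")
    ratio = w.cpu_usage_millicores / w.cpu_request_millicores
    if ratio < over_threshold:
        return "overprovisioned"
    if ratio > under_threshold:
        return "underprovisioned"
    return "ok"


def overprovisioned_share(workloads: list[Workload]) -> float:
    """Fraction of workloads labeled overprovisioned (0.0 for an empty list)."""
    if not workloads:
        return 0.0
    over = sum(1 for w in workloads if classify(w) == "overprovisioned")
    return over / len(workloads)


if __name__ == "__main__":
    fleet = [
        Workload("api", 1000, 200),
        Workload("worker", 500, 600),
        Workload("cache", 400, 300),
        Workload("batch", 1000, 499),
    ]
    for w in fleet:
        print(w.name, classify(w))
    print("overprovisioned share:", overprovisioned_share(fleet))
```

The same comparison could be applied to memory. In practice, the choice of usage statistic (peak, average, or a percentile) and of observation window changes the result considerably.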
While there is always going to be some concern over the degree to which AI might eliminate the need for SREs altogether, the fact remains that most organizations struggle to hire and retain SREs. There simply isn’t enough available expertise. The issue then becomes how much to rely on AI agents to close that gap, given the level of risk associated with any outage that might result.
The one thing that is certain, however, is that when it comes to automating SRE tasks, those AI agents are only going to get better as they are exposed to more issues.



