Machine Learning in Kubernetes: Why Trust, Not Tech, is Your Biggest Hurdle
At its core, optimizing Kubernetes at scale is a massive math problem. And frankly, it’s a problem that we, as humans, are just not built to solve. The sheer number of variables, the constant flux of workloads, and the interplay between cost and performance create a dynamic so complex that our intuition often fails us, leading to the overprovisioning and wasted spend we see everywhere.
It’s an interesting paradox. Our engineering teams rightfully take immense pride in their automation. We’ve all worked hard to automate CI/CD pipelines, testing, and deployments, in short everything that accelerates delivery. Yet there’s one critical area where we consistently see the brakes get slammed: continuous resource optimization. This isn’t a backlog item; it’s a dynamic pipeline that demands a new approach. Why is a practice as fundamental to cloud efficiency as rightsizing still a consistent priority for only a fraction of organizations?
I believe the answer isn’t technical. It’s human.
The Developer’s Dilemma: Performance vs. Cost
Let’s be pragmatic. When a developer is pushing to meet a deadline, their primary concern is ensuring the application they just built doesn’t fall over in production. If you give them a choice between a larger instance size for peace of mind or a thousand dollars in monthly savings, they will choose the larger instance every single time. And you can’t blame them. Their job is to deliver reliable features that meet an SLA, not to be cloud economists.
This creates a natural tension. The business needs to control costs, but developers need to guarantee performance. The result is a cycle of overprovisioning, followed by periodic, manual “cleanup exercises” that are always too little, too late. This is why a staggering number of leaders feel that container complexity is outpacing their FinOps teams’ ability to manage it.
The Real Blocker is Trust, Not Automation
During a recent presentation, a colleague and I landed on a line that has stuck with me: “Automation isn’t the problem; trust is.”
Developers and platform engineers are more than willing to adopt automation when they trust that it will make their lives easier without compromising the stability of their applications. The reluctance we see in optimization stems from a lack of trust in “black box” recommendations. They worry that an automated change will degrade performance, causing an incident they’ll have to fix at 2 a.m.
This is where machine learning becomes so critical. Unlike simple statistical analysis, advanced ML models can learn the unique seasonal patterns of each workload, providing forecast-based recommendations that prove their reliability over time. The goal of the technology should be to codify all the concerns and operational guardrails an engineer has into the system itself. By doing this, you’re not replacing the engineer; you’re removing the need for a human-in-the-loop on routine, mathematically driven decisions. You’re finally giving them a system they can trust.
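To make "codifying guardrails" concrete, here is a minimal sketch in Python. It is not any particular product's implementation: the Guardrails fields, their default values, and the guarded_cpu_request function are all hypothetical names chosen for illustration, and the forecast peak is assumed to come from whatever model you already trust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical limits an engineer might want enforced on any automated change."""
    min_cpu_millicores: int = 100     # absolute floor for a CPU request
    max_cpu_millicores: int = 8000    # absolute ceiling for a CPU request
    max_step_fraction: float = 0.25   # move at most 25% of the current request per change
    headroom_fraction: float = 0.15   # keep 15% above the forecast peak


def guarded_cpu_request(current: int, forecast_peak: int, rails: Guardrails) -> int:
    """Return a new CPU request in millicores that respects the guardrails."""
    # Start from the forecast peak plus safety headroom.
    target = forecast_peak * (1 + rails.headroom_fraction)

    # Limit how far a single change may move away from the current request.
    max_step = current * rails.max_step_fraction
    target = min(max(target, current - max_step), current + max_step)

    # Apply the absolute floor and ceiling last so they always win.
    target = min(max(target, rails.min_cpu_millicores), rails.max_cpu_millicores)
    return round(target)
```

With the default values above, a workload currently requesting 1000 millicores whose forecast peak is 400 would be lowered only to 750 in one step, not all the way down. That gradual, bounded behavior is the kind of property that lets an engineer predict what the automation will do before it does it.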
From Optional Tool to Paved Road
So, how do we build that trust at an organizational level? What we’ve seen with our most successful customers is a fundamental shift in approach. They stop treating optimization as an optional tool for developers to pick up. Instead, their platform engineering teams provide it as a default, opt-out service.
When ML-powered rightsizing is woven into the fabric of the platform—part of the “paved road”—it’s no longer a feature developers have to evaluate. It’s simply how the platform works. This approach transforms optimization from a reactive, manual task into a continuous, automated process that runs from build time onward, preventing waste before it even happens. We’ve seen this model dramatically reduce Kubernetes costs while simultaneously improving performance and reliability.
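The difference between opt-in and opt-out comes down to what happens when a team says nothing. A minimal Python sketch of opt-out semantics follows; the annotation key example.com/rightsizing and the function name are hypothetical and not taken from any real tool.

```python
from typing import Mapping, Optional

# Hypothetical annotation key a platform team might define for its workloads.
OPT_OUT_ANNOTATION = "example.com/rightsizing"


def rightsizing_enabled(annotations: Optional[Mapping[str, str]]) -> bool:
    """Opt-out semantics: enabled unless the workload explicitly disables it."""
    value = (annotations or {}).get(OPT_OUT_ANNOTATION, "enabled")
    return value.strip().lower() != "disabled"
```

A workload with no annotations at all is optimized by default; only an explicit "disabled" value takes it off the paved road.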
Looking Ahead: The Future is Automated and Collaborative
This isn’t just a Kubernetes story. Kubernetes is the crucible where these complex, dynamic optimization challenges are being solved today because the pain is so acute. The trust and the frameworks we build here will become the blueprint for how we bring this same intelligence to the rest of the stack—from traditional IaaS to other “black box” services like Snowflake and Databricks.
The future of optimization isn’t about choosing between cost and performance, or between governance and developer flexibility. It’s about using intelligent automation to achieve both. It’s about building collaborative systems that empower engineers with cost visibility and give finance teams confidence in engineering decisions. By focusing on building trust, we can finally close the gap between insight and action, and let our teams get back to the work that truly matters: delivering value to our customers.
KubeCon + CloudNativeCon North America 2025 is taking place in Atlanta, Georgia, from November 10 to 13.