Cost Control for Kubernetes: Monitor, Right-Size, Govern
As Kubernetes moves from testbeds to production, managers are getting sticker shock from the bills a K8s deployment can incur. Whether hosted in-house or with a cloud provider, who knew cloud nativity would cost this much?
Turns out, a considerable portion of the work of running Kubernetes is not managing K8s itself, but managing its costs.
This was the topic of Sunday’s “Lightning Talk” video from The Tech Stuff, a personal EdTech site run by cloud native engineer Maya Mnaizel, who hosted.
Engineering manager Christian Dussol shared the lessons he learned working with a Kubernetes deployment for a financial firm.
The issue of cost really came to his attention after moving one production Kubernetes deployment to Azure. The expenses didn’t come all at once, but over time they added up, from all sorts of places: storage, networking, monitoring. This was OpEx, not CapEx.
Dussol insisted that this was not the fault of Kubernetes.
“Kubernetes is a tool,” he said. It will only do what you tell it to. “Kubernetes will only allocate the resource or schedule according to the configuration you decide to put in place.”
As excellent as Kubernetes is in efficiently managing distributed resources, it is indifferent to managing its own resources. That’s the manager’s job.
Likewise, delegating operations to the cloud does not reduce responsibility. Cloud resources must be managed and optimized as well. The cloud providers offer tools, but it’s up to the user to wield them effectively.
“Microsoft and Amazon will only do exactly what you ask, which means that you have to choose the right cloud services, configure them properly, compose properly, and then rely on [a financial operations framework] to monitor, optimize and operate in the continuous way,” Dussol summarized.
With Kubernetes, optimization is not merely an enhancement, but essential to the deployment’s success.
Cloud Inflation
With Kubernetes, over-provisioning runs rampant.
Nodes are routinely overprovisioned by developers who don’t know or care about costs. They lavish generous resource requests upon their projects to ensure reliability. Or they optimistically plan for some spike in traffic that may not happen. Architectural design decisions get locked in early, before operational feedback can be incorporated into the design.
Telemetry, which is absolutely essential to keeping costs down, can incur its own significant bills, not the least because of the storage costs for collecting all that operational data.
Across multiple projects, all this unused capacity tallies up an unnecessarily large bill. Dussol has seen cases of pods requesting three times the actual resources needed.
How can you tell if your cluster is overprovisioned? If it is using less than 50% of its resources (CPU, memory, etc.), it probably is, Dussol suggested.
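Dussol’s 50% rule of thumb is easy to turn into a quick check. A minimal sketch, assuming you have already pulled per-node figures from something like metrics-server; every number and node name here is hypothetical:

```python
# Hypothetical per-node CPU figures (in cores): what pods have requested,
# what they actually use (as metrics-server would report), and what the
# node can allocate.
nodes = {
    "node-a": {"requested": 7.2, "used": 2.1, "allocatable": 8.0},
    "node-b": {"requested": 6.5, "used": 3.4, "allocatable": 8.0},
}

def utilization(node: dict) -> float:
    """Fraction of allocatable CPU the workloads actually consume."""
    return node["used"] / node["allocatable"]

for name, stats in nodes.items():
    pct = utilization(stats) * 100
    flag = "over-provisioned" if pct < 50 else "ok"
    print(f"{name}: {pct:.0f}% CPU used ({flag})")
```

The same comparison against `requested` rather than `allocatable` would surface the request-versus-usage gap Dussol describes, such as pods asking for three times what they consume.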
Nodes Are the Real Cost Unit
Dussol spoke very highly of the Linux Foundation’s FinOps framework, which outlines a three-phase approach to cost management: inform, optimize, and operate.
With this approach, you first get visibility into the clusters. Then you right-size the provisioned resources to the workload and to your service-level agreements (SLAs). Then you automate the governance.
Dussol also added this piece of advice: Label everything.
Labels, for instance, will allow you to better balance costs against SLAs. Some cloud providers charge by the cluster. So, by grouping workloads, you can put development, staging, and quality control on shared clusters, while putting production workloads, with stricter SLAs, on their own, more performant clusters.
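What “label everything” can look like in practice is sketched below; the label keys and values are illustrative, not from the talk. The point is that consistent labels let cost tooling group and attribute spend by team, environment, and cost center:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api          # hypothetical workload
  labels:
    app: billing-api
    environment: staging     # dev / staging / QA can share a cluster
    team: payments
    cost-center: cc-1234     # lets FinOps tooling attribute spend
spec:
  selector:
    matchLabels:
      app: billing-api
  template:
    metadata:
      labels:
        app: billing-api
        environment: staging
        team: payments
        cost-center: cc-1234
    spec:
      containers:
        - name: api
          image: registry.example.com/billing-api:1.0
```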
SLAs should guide the kind of service needed for a workload. Running apps in serverless containers could reduce cost but they also could incur cold‑start delays, especially for applications that aren’t called frequently. If cold starts violate the SLAs for response times, then serverless is not an option.
Likewise, storage comes in different tiers. Azure, for instance, offers object, file and block storage, and each comes in a range of options, from least expensive to most performant. Understand what performance the workloads need, and procure the most cost-effective offering for that performance envelope.
Initially, Dussol’s team went with the premium SSD storage, which was unnecessarily expensive for many of the workloads.
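In Kubernetes on Azure, that tier choice surfaces as the StorageClass a PersistentVolumeClaim references. A hedged sketch, assuming the Azure Disk CSI driver; the class name is illustrative:

```yaml
# A cheaper standard-SSD tier for workloads that don't need premium IOPS.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-ssd        # illustrative name
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS  # vs. Premium_LRS for latency-sensitive workloads
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

Pointing non-critical workloads at a class like this, and reserving the premium tier for workloads whose SLAs demand it, is the kind of right-sizing the talk describes.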
Monitoring is one of the most important aspects of optimization, especially tracking the CPU and memory consumption of the workloads. You can’t manage what you don’t see. Effective telemetry is where you spot node sprawl and the gaps between requests and actual usage.
But at the same time, monitoring is another area that tends to get overprovisioned, which leads to unnecessary storage and network costs. Do you need to capture performance every second, or is there a cost-saving sampling rate that would still provide the needed data? How long do you need to keep the data? The sooner you cut it loose, the more you’ll save in storage. Though, especially in regulated industries like finance, there are also regulatory and internal rules for how long to keep the data.
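With Prometheus-style monitoring, for example, the two knobs described above are the scrape interval and the retention window. The values in this config fragment are illustrative, not recommendations:

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 60s      # per-second scraping is rarely worth the storage;
                            # a coarser interval often preserves the trends you need
  evaluation_interval: 60s

# Retention is set on the Prometheus server command line, not in this file, e.g.:
#   --storage.tsdb.retention.time=15d
# Regulated industries may need to keep data longer for audits, typically by
# offloading to cheaper long-term storage rather than hot disks.
```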
The Human Factor
Ultimately, cost is an organizational problem, requiring the input of multiple parties.
DevOps has the cross-department stance that makes it ideal for driving optimization. A DevOps team acts as a hub, closing gaps between developers, testing, and operations. Without an overview of the entire Kubernetes lifecycle, cost-cutting decisions can’t be made.
Perhaps counter-intuitively, Dussol is a big believer in silos. With siloed operations, you have an individual who can be held responsible for maintaining the performance of their operations against a given budget. For silos to work, however, there needs to be communication with the rest of the organization.
This is why the Cloud Native Computing Foundation’s Certified Kubernetes Application Developer (CKAD) can be a “secret weapon” in cost-containment, Dussol said. This training teaches the K8s developer about the whole support ecosystem, surfacing the connection between writing configuration code and cost management. It also gives them a shared language with the FinOps team.
The company Dussol works for has a Site Reliability Finance specialist, whose job it is to reconcile the system’s performance with the company’s budget. Each week they hold a “showback” meeting with the ops folks, showing how much they consumed in resources, just as a check of how well they are doing. Every quarter, costs are compared to forecasts to ensure they are within the limits.
The SR Finance person also sets the rules and overall governance, which is then used by the operational teams to help make allocation decisions. This is the governance loop as defined by the FinOps framework. Tools such as Kyverno can help with this process.
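As one example of the kind of guardrail Kyverno can enforce, a cluster policy can reject pods that omit resource requests and limits, closing off the unbounded over-provisioning discussed earlier. This sketch is adapted from Kyverno’s common “require requests and limits” pattern; the policy name and message are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits   # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    memory: "?*"   # "?*" requires a non-empty value
                    cpu: "?*"
                  limits:
                    memory: "?*"
```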
“The bottom line is we have to do our own homework, which means that we should master and know properly, the technology,” Dussol said.


