Report Details Raft of Kubernetes Management Challenges
An analysis of thousands of incidents involving Kubernetes clusters finds IT teams are spending 34 workdays per year resolving issues, with 79% of those incidents stemming from recent system changes.
The report, produced by Komodor, a provider of a platform for managing Kubernetes clusters, also finds more than 60% of the time spent managing Kubernetes clusters goes to troubleshooting, while only 20% of incidents are resolved without escalation.
More troubling still, more than 65% of workloads use less than half their requested CPU or memory, suggesting that wasted spending on the infrastructure required to run Kubernetes clusters is exceedingly high. A full 82% of Kubernetes workloads are overprovisioned, compared to 11% that are underprovisioned, the report finds. Komodor estimates almost 90% of organizations are also overspending on cloud resources, with capacity utilization often falling below 80%. More than a third of IT teams (37%) need to rightsize 50% or more of their workloads.
Komodor CTO Itiel Shwartz said the report makes it apparent that as more Kubernetes clusters are deployed to run cloud-native applications, there is still significant room for operational improvement, especially when it comes to change management.
Overall, the survey finds about 80% of organizations are now running Kubernetes clusters in a production environment, with most IT teams managing a little more than 20 clusters. However, 37% of organizations are now running more than 100 clusters, with nearly half (48%) having deployed them in four or more IT environments. In total, there has been a 35% year-over-year increase in the number of clusters being managed by Komodor customers, the report finds.
More than three quarters (77%) have also adopted GitOps to some degree as a methodology for managing Kubernetes environments, while 68% have established a dedicated platform team. More than a third (35%) have also invested in an artificial intelligence platform for IT operations (AIOps), with another 40% planning to explore AIOps capabilities by 2026.
Despite these investments, the median mean-time-to-detection (MTTD) of an issue is nearly 40 minutes for high-impact outages, with a median mean-time-to-resolution (MTTR) of more than 50 minutes. More than a third (38%) said they experience high-business-impact outages at least weekly, and 62% estimate the cost of major downtime at $1 million per hour. Median annual downtime is 177 hours, and on average five engineers are involved in each incident response.
The report also noted, however, that IT teams that have observability tools experience 40% less annual downtime and 24% lower hourly outage costs.
It’s not clear to what degree IT teams might be considering consolidating Kubernetes clusters to streamline management workflows, but organizations that do consolidate usually wind up deploying additional clusters before long, noted Shwartz. As such, the only option is to deploy some type of framework that makes it simpler to automate the management of highly distributed Kubernetes clusters, and all the add-on software packages they require, at scale, he added.
Regardless of how IT teams go about achieving that goal, the one certainty is that Kubernetes clusters in the enterprise are here to stay. The only issue that remains to be resolved is how to ensure best practices for managing them are actually being followed.