2023 Benchmark Kubernetes Report: 6 K8s Reliability Missteps

The cloud is increasingly the preferred destination for organizations to build their applications and services. Despite a year dominated by headlines about economic uncertainty as inflation grows and banks teeter, most organizations expect their cloud usage and spending this year to be about the same as planned (45%) or higher than planned (45%). New data in Flexera’s 2023 State of the Cloud report showed that just 10% of respondents expected cloud spend to be somewhat or significantly lower than they planned. Regardless of spending plans, many organizations are looking for ways to control high cloud costs while also ensuring the reliability of their Kubernetes workloads. Keeping your costs as low as possible, however, does not eliminate the need to keep the users of your platforms and services happy.

By analyzing data from over 150,000 workloads across hundreds of organizations, Fairwinds produced the 2023 Kubernetes Benchmark Report, which compares data collected in 2022 against the previous year’s benchmark. Industry reports showed that, despite increased adoption in development and production environments, aligning with Kubernetes best practices remains challenging for many organizations. Unfortunately, this lack of alignment often has real-world consequences, such as increased security risk, unmanaged cloud costs and reduced reliability for cloud apps and services. Six areas in the benchmark are tied to reliability, each related to a configuration misstep.

1. Memory Limits and Memory Requests Are Missing

According to Kubernetes best practices, you should always set resource requests and limits on your workloads, but it is difficult for most people to know what values to use for each application. Typically, this results in either never setting requests or limits at all, or setting them too high and never coming back to adjust them. According to the 2021 benchmark data, 41% of organizations set memory requests and limits for over 90% of their workloads. In the latest report, that number dropped to just 17%. This may be the result of developers and DevOps teams not knowing what limits to set, an increase in Kubernetes consumption without a corresponding increase in visibility into configurations, or a combination of the two. To make sure that your cluster scaling actions work properly, you need to set memory requests and limits on each pod. Setting them appropriately helps ensure that the applications on your Kubernetes clusters run as efficiently and reliably as possible.
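As a minimal sketch of what this looks like, the snippet below sets a memory request and limit on a single container; the pod name, image and values are illustrative placeholders, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app            # hypothetical name, for illustration only
spec:
  containers:
    - name: demo-app
      image: nginx:1.25     # placeholder image
      resources:
        requests:
          memory: "256Mi"   # the scheduler reserves this much memory for the pod
        limits:
          memory: "512Mi"   # the container is OOM-killed if it exceeds this
```

The request is what the scheduler uses to place the pod; the limit is the hard cap the kubelet enforces at runtime.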

2. Liveness and Readiness Probes Are Missing

In Kubernetes, you use probes to monitor the health of an application periodically. A liveness probe determines whether a container is running; when it detects a failing state, Kubernetes automatically restarts the container, which restores your service to an operational state. A readiness probe determines whether a container is ready to accept traffic; until it passes, the pod receives no requests from its Service. You should put a liveness probe in each container in the pod; without one, a faulty or non-functioning pod will run indefinitely, using up valuable resources and potentially causing application errors. The latest benchmark report showed that 83% of organizations were missing liveness or readiness probes on more than 10% of their workloads, up from 65% the previous year. Unfortunately, this issue is not improving.
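A minimal sketch of both probe types on a single container follows; the endpoint paths, ports and timing values are assumptions to adapt to your own application:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo-app
      image: nginx:1.25      # placeholder image that serves HTTP on port 80
      livenessProbe:
        httpGet:
          path: /            # hypothetical health endpoint
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 15    # a failing check triggers a container restart
      readinessProbe:
        httpGet:
          path: /            # hypothetical readiness endpoint
          port: 80
        periodSeconds: 5     # while failing, the pod is removed from Service endpoints
```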

3. Pull Policy Is Not Set to Always

Sometimes teams rely on cached versions of a Docker container image, which can cause reliability issues. By default, an image is pulled only if it is not already cached on the node attempting to run it. This can result in different versions of the image running on different nodes. It can also give a workload access to a cached image without having direct access to the ImagePullSecret. In the latest report, 25% of organizations relied on cached images for nearly all their workloads, a marked increase from 15% the previous year, negatively impacting application reliability.
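A minimal sketch of forcing the policy is below; the registry, image tag and secret name are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo-app
      image: registry.example.com/demo-app:1.4.2   # hypothetical private image
      imagePullPolicy: Always     # re-check the registry on every pod start
  imagePullSecrets:
    - name: registry-creds        # hypothetical secret granting pull access
```

Note that `Always` still reuses cached layers when the digest matches; it simply re-validates the image against the registry rather than trusting whatever happens to be on the node.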

4. Deployment Replicas Are Missing

New this year, the benchmark checked for Deployments with only a single replica, which can also negatively impact reliability. Based on the data, 25% of organizations are running over half their workloads with a single replica. This hurts reliability because, when the replica count is one and the node running that pod fails, the Deployment will replace the pod, but until the replacement is scheduled and running, zero replicas are available. Running multiple replicas helps organizations ensure that containers are stable and available.
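A minimal sketch of a Deployment with multiple replicas, using placeholder names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app             # hypothetical name
spec:
  replicas: 3                # losing one pod or node still leaves two replicas serving
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:1.25  # placeholder image
```

Multiple replicas only help if they land on different nodes, so pairing this with pod anti-affinity or topology spread constraints is worth considering.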

5. CPU Limits Are Missing

Based on data from 2021, 36% of organizations were missing CPU limits on fewer than 10% of their workloads. The latest report showed that the problem worsened across the board: 86% of organizations had more than 10% of workloads impacted. Specifying CPU limits is important because, without a limit, a container has no upper bound on how much CPU it can consume, which can slow down other workloads and exhaust all available CPU on the node.
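A minimal sketch of a CPU limit, with a placeholder value:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo-app
      image: nginx:1.25
      resources:
        limits:
          cpu: "500m"   # half a core; the container is throttled above this
```

When only a limit is set, Kubernetes defaults the CPU request to the same value.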

6. CPU Requests Are Missing

Previously, just 50% of organizations were missing CPU requests on at least 10% of their workloads. The latest benchmark shows that 78% of organizations have more than 10% of workloads impacted, and the number of organizations with 71-80% of workloads missing CPU requests rose from 0% to 17%. If a single pod is allowed to consume all of a node’s CPU and memory, it can starve other pods of resources. Setting resource requests appropriately increases the reliability of apps and services because it guarantees that a pod will have access to the resources it needs and prevents other pods from consuming all available resources on a node.
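A minimal sketch combining a CPU request with a limit; the values are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo-app
      image: nginx:1.25
      resources:
        requests:
          cpu: "250m"   # the scheduler only places the pod where this much is free
        limits:
          cpu: "500m"   # runtime cap, as in the previous section
```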

Kubernetes Reliability Remains a Challenge

Kubernetes offers exceptional potential value to organizations, enabling on-demand scalability and flexibility. At the same time, it is a complex environment with many available configurations, and learning how to adjust them appropriately for your environment and business requirements can be challenging and error-prone. The Kubernetes Benchmark Report can help by showing where other organizations are missing the mark and educating you on the changes that can make your organization’s deployments as secure, reliable and cost-efficient as possible.

Read the complete Kubernetes Benchmark Report.

Danielle Cook

Danielle Cook is the vice president of marketing at Fairwinds, a Kubernetes governance and security company. She can be reached at [email protected]
