Ten Common Kubernetes Misconfigurations That Cause Outages (And What You Can Do About Them)
Kubernetes has become the backbone of modern cloud-native infrastructure, powering everything from startups to Fortune 500 enterprises. Its ability to orchestrate containerized applications at scale has made it an indispensable tool for deploying flexible and efficient systems.
However, this complexity comes with its own set of challenges. As organizations rely more heavily on Kubernetes—especially in the age of AI and real-time data—ensuring its reliability is more crucial than ever. Even minor misconfigurations can lead to outages, risking not only service disruptions but also loss of trust and revenue.
In this context, understanding common pitfalls and how to prevent them is vital for building robust, dependable systems that can support innovation without chaos. This article explores ten of the most frequent Kubernetes misconfigurations that cause outages, providing practical insights on how to identify and mitigate these risks before they impact your users.
Missing CPU and Memory Requests
In today's hyper-scalable environments, declaring how much CPU and memory a workload needs may seem like a minor detail. However, not doing so can have a big impact, up to and including keeping Kubernetes from scheduling and running your pod reliably. Requests serve two key purposes:
- They reserve a minimum amount of CPU and memory capacity for pods. This helps Kubernetes determine which node to run pods on to efficiently allocate resources.
- They protect your nodes from resource shortages by preventing over-allocation of any single node.
Without requests, Kubernetes might schedule a pod onto a node that doesn't have enough capacity for it. Even if the pod uses few resources initially, its usage could grow over time and exhaust the node. With memory, the risk is even greater: if Kubernetes schedules a pod onto a node with little available memory, and the pod consumes more memory over time, it could trigger an out-of-memory (OOM) event that terminates the pod.
Setting requests won’t prevent a pod from consuming more memory, but it will reduce the likelihood of the pod crashing or being evicted. Starting in Kubernetes 1.34, you can set requests and limits at the pod level instead of the container level, making it even easier to manage workload compute capacity.
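For illustration, a minimal pod spec with container-level requests might look like this (the pod name, image, and values are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend          # hypothetical pod name
spec:
  containers:
    - name: app
      image: nginx:1.27       # illustrative image
      resources:
        requests:
          cpu: "250m"         # reserve a quarter of a CPU core
          memory: "256Mi"     # reserve 256 MiB of memory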
Missing CPU and Memory Limits
In contrast to requests, limits cap the maximum CPU and memory a pod is allowed to consume.
When you deploy a pod without CPU or memory limits, it can consume memory without bound, like any other process. If it keeps using memory without freeing any (known as a memory leak), the host will eventually run out of memory. At that point, a kernel process called the OOM (out of memory) killer jumps in and terminates the process before the system becomes unstable.
Setting requests and limits creates a range of compute resources the pod can consume, making it easier for Kubernetes to schedule the pod efficiently.
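As a sketch, adding limits to the requests shown earlier turns the resources block of the container spec into something like this (values are illustrative):
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "500m"         # CPU usage beyond half a core is throttled
          memory: "512Mi"     # the container is OOM-killed if it exceeds 512 MiB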
Missing Liveness Probes
Once a pod is running, it is not guaranteed to continue running. However, you can monitor the health and availability of the pod's containers using liveness probes. These periodically check a container, by sending an HTTP request, opening a TCP connection, or running a command inside it, and wait for a response. If the check fails or the container doesn't respond at all, the probe triggers a restart.
The power of liveness probes is their deep integration with the rest of Kubernetes. In addition to restarting containers, they can also wait for containers to finish starting, run commands in pods, and even provide a grace period before terminating a pod. In theory, the only time a service owner should have to manually check their pods is if the pod itself has a problem (like the dreaded CrashLoopBackOff state).
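As a minimal sketch, an HTTP liveness probe might look like the following, assuming the application exposes a /healthz endpoint on port 8080 (the endpoint, names, and timings are all assumptions):
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # hypothetical pod name
spec:
  containers:
    - name: app
      image: registry.example.com/api:1.2.3   # illustrative image
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /healthz      # assumed health endpoint
          port: 8080
        initialDelaySeconds: 10    # give the app time to start before probing
        periodSeconds: 5           # probe every five seconds
        failureThreshold: 3        # restart after three consecutive failures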
No Availability Zone (AZ) Redundancy
Availability Zones (AZs) help isolate failures and create redundancy in cloud environments, but they only work if your systems are configured to use them. If your service is only set up in a single AZ and that AZ fails, then your service will fail even if your cluster spans multiple AZs.
Kubernetes natively supports AZ redundancy in its control plane (the systems responsible for running the cluster) and worker nodes (the systems responsible for running your application pods). Cloud providers often build cluster redundancy into their platform, while you can use topology spread constraints to control how Kubernetes distributes your workloads across zones and regions. While this can lead to higher hosting costs, the benefits far outweigh the risk of an incident or outage, especially for critical services.
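For example, a deployment might spread its replicas evenly across zones with a topology spread constraint like this (the names and replica count are illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout              # hypothetical deployment name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                              # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule        # keep pods Pending rather than pile into one zone
          labelSelector:
            matchLabels:
              app: checkout
      containers:
        - name: app
          image: registry.example.com/checkout:1.0.0   # illustrative image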
Pods in ImagePullBackOff and CrashLoopBackOff States
Before Kubernetes can create a container, it must download the image containing the files and executable code needed to run it. Kubernetes uses your manifest to determine which image to use, where to retrieve it from, and which version to pull. If it can’t download the image, your pod may end up in an ImagePullBackOff state.
When troubleshooting an ImagePullBackOff:
- Check the image URL, name, and version for typos.
- Ensure your Kubernetes nodes can access the container repository.
- Check whether the repository is private; if so, check whether your nodes can authenticate, as shown in the sketch below.
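For the private-repository case, one common pattern is to reference registry credentials from the pod spec. A minimal sketch, assuming a docker-registry secret named regcred already exists in the namespace (the names and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: private-app           # hypothetical pod name
spec:
  imagePullSecrets:
    - name: regcred           # assumed pre-created docker-registry secret
  containers:
    - name: app
      image: registry.example.com/team/app:2.4.1   # illustrative private image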
Even once the image is pulled, there's still a risk of the pod failing to start. CrashLoopBackOff is an infamous pod status that appears when a container repeatedly crashes. Kubernetes keeps restarting the crashed container, exponentially increasing the delay between attempts up to a cap of five minutes, and reports the CrashLoopBackOff status while the pod waits for its next restart.
CrashLoopBackOff can have several causes, including:
- Application errors that cause the process to crash.
- Problems connecting to third-party services or dependencies.
- Trying to allocate unavailable resources to the container, like ports already in use or more memory than what’s available.
- A failed liveness probe.
There are many more reasons why a CrashLoopBackOff can happen, which is why it’s one of the most common issues that even experienced Kubernetes developers run into. Resolving it often requires looking at the pod’s logs or using kubectl describe to review its deployment details.
Unschedulable Pods
A pod is unschedulable when it’s been put into Kubernetes’ scheduling queue but can’t be deployed to a node. This isn’t a specific state like CrashLoopBackOff, but occurs when a pod enters the Pending state and doesn’t progress to Running. Pods can become unschedulable for many reasons, including:
- Too little free CPU or RAM in the cluster to meet the pod’s requests.
- Pod affinity or anti-affinity rules prevent it from being deployed to available nodes.
- Nodes are being cordoned due to updates or restarts.
- The pod requires a persistent volume that’s unavailable or inaccessible.
Unschedulable pods are often symptoms of a larger cluster problem. To troubleshoot, look at cluster logs and pod resource descriptions to identify why the pod was unschedulable, for example, by running:
kubectl get pods --field-selector=status.phase=Pending
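For illustration, a pod like the following would sit in the Pending state on any cluster whose nodes are smaller than its requests, since no node can satisfy them (the values are deliberately oversized and the names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: oversized-batch-job   # hypothetical pod name
spec:
  containers:
    - name: worker
      image: registry.example.com/batch:1.0   # illustrative image
      resources:
        requests:
          cpu: "64"           # more CPU than any node in this scenario offers
          memory: "512Gi"     # more memory than any node in this scenario offers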
Application Version Mismatches
A software deployment best practice is to pin your deployments to a specific code version. It lowers the risk of deploying the wrong version, and in Kubernetes, it ensures that every replica of your application runs the same version. You can specify image versions in two ways:
- Tags are names created by the image maintainer. A single tag can point to different image versions over time (e.g., latest always refers to the most recently published version).
- Digests result from running the image through a hashing function like SHA256. Each digest identifies a single version of a container. This makes them much more valuable for version pinning.
The risk of using tags is that they can change at any time. For example, if you deploy a pod today using the image’s latest tag and then deploy a replica of the pod tomorrow, how can you be sure the image wasn’t updated in between? You might end up with two different versions running side-by-side. The solution is to use a fixed version reference, like the digest.
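As a sketch, pinning a deployment's image by digest looks like this (the digest shown is a placeholder, not a real hash):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments              # hypothetical deployment name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: app
          # pinned to one exact image version via its digest (placeholder value)
          image: registry.example.com/payments@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef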
Init Container Errors
Init containers run before the main container in a pod. They’re often used to prepare an environment for the main container, so it has everything it needs to run.
For example, large language models (LLMs) require datasets that can be several gigabytes (GB). You can create an init container that downloads these datasets so that the LLM container has the data it needs when it starts.
Unfortunately, init containers introduce a point of failure. If an init container fails, Kubernetes restarts it until it succeeds, unless the pod's restartPolicy is set to Never, in which case the whole pod is marked as failed. A repeatedly failing init container leaves the pod stuck in the Init:CrashLoopBackOff status. If you have several init containers defined, they must all run from start to finish, adding time and potential failure points to startup.
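A minimal sketch of the dataset-download pattern described above might look like this (the images, URL, and paths are all placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: llm-server            # hypothetical pod name
spec:
  volumes:
    - name: model-data
      emptyDir: {}            # scratch space shared between init and main containers
  initContainers:
    - name: fetch-dataset
      image: curlimages/curl:8.8.0    # illustrative downloader image
      # placeholder URL; the real dataset location would go here
      command: ["curl", "-L", "-o", "/data/dataset.bin", "https://example.com/dataset.bin"]
      volumeMounts:
        - name: model-data
          mountPath: /data
  containers:
    - name: llm
      image: registry.example.com/llm-server:1.0   # illustrative model-serving image
      volumeMounts:
        - name: model-data
          mountPath: /data    # the dataset is already in place when this container starts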
Horizontal Pod Autoscaler (HPA) Misconfigurations
Horizontal autoscaling is a critical Kubernetes feature, but it’s easy to get wrong. A horizontal pod autoscaler (HPA) automatically adds or removes replicas to or from a deployment in response to a metric, such as CPU or memory usage. When the metric passes a threshold that you configure, Kubernetes acts accordingly to ensure there’s enough capacity or to free unused resources.
Creating an HPA only requires three things: a metric, a target value, and a minimum and maximum number of replicas. For example, this command scales an nginx deployment between one and four replicas, adding replicas whenever average CPU utilization across its pods exceeds 50%:
kubectl autoscale deployment nginx --cpu-percent=50 --min=1 --max=4
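The equivalent declarative form, using the autoscaling/v2 API, might look like this (the names mirror the command above and are illustrative):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx                 # hypothetical HPA name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx               # the deployment being scaled
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # target average CPU utilization across replicas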
But what if you reach the maximum, and demand is still high? If the HPA can no longer deploy replicas to meet the threshold, its ScalingLimited condition will be set to true. This carries similar risks to not having an HPA since the service might not have enough resources to meet customer demand. Similarly, if something is preventing your deployment from scaling (e.g., one of the pods is stuck in CrashLoopBackOff), the HPA’s AbleToScale condition will be false.
To ensure your HPA can scale effectively, increase the maximum number of pods to provide extra overhead in case of traffic spikes, and monitor your HPA for any unusual status conditions, e.g. by running:
kubectl get hpa hpa-name -o jsonpath='{.status.conditions[*]}'
Detecting Kubernetes Risks is Critical for Reliability
While any of these risks can be enough to take down your Kubernetes applications, most of them are relatively easy to fix once you know they're there. That makes it essential to have a method for detecting and surfacing these risks so your team can remediate them before they cause an incident or outage.