Spot Check: Guide to K8s Spot Node Savings

Hesitant about introducing spot nodes into your Kubernetes clusters? You’re not alone. Despite their bargain pricing, spot nodes offer no availability guarantees, which can make leveraging them feel like a high-wire act with no net.

Some dev teams are unaware that the three biggest public cloud Kubernetes services now support spot instances (Google Cloud GKE support launched in 2018, Azure AKS in October 2020 and Amazon EKS in December 2020). However, each cloud provider clearly warns customers to use spot instances only for fault-tolerant applications, because availability cannot be guaranteed.

Thus, it falls to the teams running Kubernetes to identify the workloads that are appropriate for spot instances—and do so at their own risk. Without a reliable method for vetting workloads, this becomes a high-stakes gamble. Choose correctly and you save up to 90% on cloud resource costs. Choose incorrectly and you can suffer expensive downtime. 

The following spot-readiness checklist can help take the risk out of incorporating spot instances in your clusters. By applying this checklist to your public cloud Kubernetes workloads, you can more accurately recognize the workloads that can be safely scheduled on spot instances (and identify those that shouldn’t be running on spot). As a result, you can take advantage of meaningful cost savings without risking availability issues. 

Understanding Spot Instances

Public cloud providers offer their spare compute capacity as spot instances, often at tremendously discounted rates. Supply and demand govern the details of spot nodes, including their specific pricing, availability and deployment location. When demand for an instance type rises, the cloud provider sends interruption notices to spot instances of that type and spins them down within minutes (sometimes even faster). This is why leveraging spot resources requires the ability to identify workloads that are fault-tolerant.

Kubernetes itself provides dynamic workload replication and scaling, giving applications resilience against node failure and simplifying the use of spot instances. If you use the relevant Kubernetes scheduling primitives correctly (taints and tolerations, in this case), the scheduler will automatically move spot-scheduled workloads to on-demand nodes if spot capacity disappears from the cluster. It’s worth noting that this requires either excess on-demand capacity in the cluster or some sort of cluster autoscaler that can create on-demand capacity as needed.
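
To make this concrete, here is a minimal sketch using the Kubernetes Python client. The taint key, value and container image are illustrative assumptions, not provider defaults; substitute whatever taint your provider or node pool configuration actually applies to spot nodes.

```python
from kubernetes import client

# Assumed example: spot node pools carry a taint such as "spot=true:NoSchedule".
# Replace the key/value with the taint your provider or node pool really uses.
spot_toleration = client.V1Toleration(
    key="spot",
    operator="Equal",
    value="true",
    effect="NoSchedule",
)

# Attaching the toleration to a pod spec lets the scheduler place this workload
# on tainted spot nodes, while workloads without it stay on on-demand capacity.
pod_spec = client.V1PodSpec(
    containers=[client.V1Container(name="app", image="example/app:latest")],
    tolerations=[spot_toleration],
)
```

The useful property of this pattern is the default: a workload that lacks the toleration can never land on a tainted spot node, so only workloads you have explicitly vetted opt in.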

Your Spot-Readiness Checklist

To assess the spot readiness of your Kubernetes workloads, you should check these five criteria:

1) Controller type. Some Kubernetes controller types indicate that a workload is not spot-ready; StatefulSets are the clearest example. Most other controller types, such as Deployments and DaemonSets, can be spot-ready depending on the other items on the checklist. If you have a StatefulSet that passes every other check on the list, consider whether it should be a StatefulSet at all.

2) Replica count. If a workload’s replica count is configured to 1, it’s not spot-ready. A spot node outage that removes that one replica from the cluster would take the entire workload down with it. Greater replica counts indicate that a workload is horizontally scalable, making it more spot-ready. 

3) Local storage. Having a writable volume, like emptyDir, indicates that a workload isn’t spot-ready. A pod that shuts down mid-write can easily lose data.

4) Pod disruption budget (PDB). A PDB tells Kubernetes how many replicas of a workload must remain available during voluntary disruptions. If a PDB is in place, divide the minimum available replicas by the total replicas. If this ratio is greater than 0.5, the workload likely isn’t spot-ready because it demands high availability. This is a heuristic; failing the check shouldn’t instantly disqualify a workload, since further investigation might find that the PDB is stricter than it needs to be.

5) Rolling update strategy (Deployments only). By default, a Deployment uses a rolling update strategy with maxUnavailable set to 25%. For Deployments with this configuration, perform the same calculation used for PDBs: divide the minimum replicas guaranteed to stay available during a rolling update (replicas minus maxUnavailable) by the total replicas, and flag the workload if the ratio exceeds 0.9. Because percentage values of maxUnavailable are rounded down, 25% resolves to zero unavailable replicas for any Deployment with three or fewer replicas, so the ratio is 1.0 and the check fails. In effect, this check requires default-configured Deployments to run more than three replicas, as the sketch below illustrates.
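
To make the arithmetic in the last two checks concrete, here is a minimal Python sketch of the whole checklist. The spot_ready helper, the workload dictionary and its field names are hypothetical stand-ins for values you would pull from your own manifests or the Kubernetes API; the thresholds follow the checklist above.

```python
import math

def spot_ready(workload):
    """Heuristic sketch of the five checks; `workload` is a plain dict."""
    # 1) Controller type: treat StatefulSets as not spot-ready.
    if workload["kind"] == "StatefulSet":
        return False

    # 2) Replica count: a single replica means one spot interruption
    # takes the whole workload down.
    replicas = workload["replicas"]
    if replicas <= 1:
        return False

    # 3) Local storage: writable volumes such as emptyDir risk data loss
    # when a node is reclaimed mid-write.
    if any(v.get("emptyDir") is not None for v in workload.get("volumes", [])):
        return False

    # 4) Pod disruption budget: minAvailable / replicas above 0.5 suggests
    # the workload demands more availability than spot can promise.
    min_available = workload.get("pdbMinAvailable")
    if min_available is not None and min_available / replicas > 0.5:
        return False

    # 5) Rolling update strategy (Deployments only): with the default 25%
    # maxUnavailable, the percentage rounds down, so it resolves to zero
    # for three or fewer replicas and the ratio is 1.0 (> 0.9).
    if workload["kind"] == "Deployment":
        max_unavailable = math.floor(replicas * 0.25)
        if (replicas - max_unavailable) / replicas > 0.9:
            return False

    return True

# Worked example: a default-configured Deployment with 3 replicas fails
# check 5 (ratio 3/3 = 1.0), while 4 replicas passes (ratio 3/4 = 0.75).
print(spot_ready({"kind": "Deployment", "replicas": 3}))  # False
print(spot_ready({"kind": "Deployment", "replicas": 4}))  # True
```

Treat the output the way the article suggests: a failed check is a prompt for closer human inspection, not an automatic disqualification.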

Pair With Expert Oversight

Along with the checklist above, have domain experts review and approve each workload for use on spot nodes. The taints and tolerations mentioned earlier are valuable, recommended tools for ensuring that only spot-ready workloads are scheduled on spot nodes. By carefully vetting workloads against the spot-readiness checklist, backed by expert guidance, your teams can leverage spot nodes to quickly and safely realize significant cost savings.

Kirby Drumm

Kirby Drumm is the Field Engineering Lead at Kubecost, an open source solution for understanding and optimizing Kubernetes spend. He has more than 30 years of software engineering experience, including extensive work in system integration across a variety of products and industries.