Cost-Effective Reliability: Making Sense of Multi-Zone Kubernetes
Look around the ‘net and you’re likely to see Kubernetes operators worrying about keeping their applications up in the face of failures – and, at the same time, worrying about the cost of cloud computing (here’s one of many examples). The increasing adoption of multi-zone clusters touches both of these worries: multi-zone clusters make people feel better about reliability, but we hear horror stories about sudden cost increases driven by cross-zone traffic – tens of thousands of dollars per month in the worst cases.
Like any other technology, of course, multi-zone clusters are not themselves good or bad: using technology effectively is always about tradeoffs, and multi-zone clusters are no different. In this article, we’ll take a close look at how clusters are built, and how things fail, to help shed some light on what multi-zone clusters improve – and what they cost.
Clusters and How They Fail
Clusters are built from nodes – a node is Kubernetes jargon for a computer (either physical or virtual) running the code that lets it participate in the cluster. There’s not too much to this: glossing over lots of details, the node’s job is to run the containers that the Kubernetes control plane tells it to run, and to handle the bookkeeping around what it’s running.
A Kubernetes cluster will happily run with just a single node – I have a demo cluster running on a single Raspberry Pi 4 node sitting within arm’s reach right now. For production, though, you want the cluster to survive any single node failing, so multi-node clusters are the common case in the real world. (Having multiple nodes also makes it much easier to perform upgrades or other maintenance on nodes.)
The node hardware failing isn’t the only kind of problem we can have in production. Real clusters aren’t running on a few Raspberry Pis sitting on a desk; they’re running on hardware in dedicated data centers. There are levels to this, and cluster operators tend to organize their hardware to reflect that:
- Multiple servers get collected into racks;
- Multiple racks get collected into data centers;
- Multiple data centers get collected into an availability zone (AZ);
- And finally, multiple AZs get collected into a region.
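As a concrete example of how this hierarchy shows up in Kubernetes itself, the zone and region levels are exposed through well-known node labels, typically set automatically by cloud providers. Here’s a minimal sketch – the node and zone names are hypothetical:

```yaml
# Excerpt of a Node object; topology.kubernetes.io/zone and
# topology.kubernetes.io/region are the standard well-known labels.
apiVersion: v1
kind: Node
metadata:
  name: worker-a1                              # hypothetical node name
  labels:
    topology.kubernetes.io/zone: us-east-1a    # this node's availability zone
    topology.kubernetes.io/region: us-east-1   # the region containing that zone
```

These labels are what all of the zone-aware routing mechanisms we’ll discuss later rely on.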
We can see failures at any of these levels. A single node failing might be a VM crashing, or a physical server having a hardware problem. Kubernetes can generally ride through this by simply restarting the affected pods (assuming, of course, that there’s enough excess capacity on other nodes).
A lot of nodes failing at once might be a whole rack losing power, or possibly a data center catching on fire. Kubernetes should be able to ride through this, too, if your cluster wasn’t unlucky enough to have all its nodes running on the affected hardware!
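One standard way to avoid that kind of bad luck is to ask the scheduler to spread a workload’s pods across zones with a topology spread constraint. A minimal sketch, assuming a hypothetical `my-app` workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # hypothetical workload name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone  # spread across zones...
          whenUnsatisfiable: ScheduleAnyway         # ...but don't block scheduling entirely
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: app
          image: my-app:latest                      # hypothetical image
```

With this in place, losing any one zone’s hardware takes out only a fraction of the replicas rather than potentially all of them.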
Beyond that, it’s possible for an entire AZ – or even an entire region – to go off the air. Hollywood would say that this is the province of invading armies or meteor strikes, but when I’ve seen this sort of thing happen, it’s been a BGP misconfiguration or some such prosaic thing. And yes, I’ve personally been affected at all these levels (including, I’m sad to say, data centers catching on fire). The unfortunate reality is that things fail, and also that the higher the level, the longer things can take to fix.
Multi-Zone Clusters
We mitigate failing nodes, racks, or data centers by running clusters with multiple nodes. We mitigate entire zones going off the air by running clusters with nodes in multiple zones. But multi-zone clusters tend to get expensive: every time traffic from a node in one zone goes to a node in a different zone, the cloud provider is likely to bill us. This might seem like the providers are just being greedy, but there are real costs involved: traffic between zones often has to traverse long-haul networks with much more limited bandwidth than the switch fabrics within a data center, and adding bandwidth there really does get expensive.
At scale, this can have a real impact on your bottom line, as those folks who’ve told us about monthly bills for tens of thousands of dollars can attest! Unfortunately, it can be tricky for Kubernetes by itself to really control this while still preserving reliability.
(To round out our mitigation strategies, by the way: we mitigate entire regions going off the air by designing multi-cluster applications. But that’s a topic for another article.)
Cross-Zone Traffic
So if managing cross-zone traffic is important, why is it tricky for Kubernetes? The answer here lies in the history of Kubernetes’ networking. By default, Kubernetes handles routing inside the cluster with a component called kube-proxy – and kube-proxy has historically not had a concept of what zone a given node was in, so it couldn’t do anything to try to manage cross-zone traffic.
Kubernetes has tried a few different things meant to improve this situation over the years: you may have heard of topology keys, topology aware routing (TAR), and the very recent traffic distribution. All of these things rely on labeling nodes to tell Kubernetes which zone the node is in. Where they differ is in how complex they are to configure, and – crucially – in what they do when things start failing.
Topology keys and TAR could both prevent cross-zone traffic when everything was going well, which is great for reducing costs. However, they wouldn’t ever allow traffic outside the zone – not even if all the pods in your zone went down… which is problematic for reliability. (They also had fairly baroque requirements around the number of zones, the number of endpoints per zone, and so on.) Traffic distribution, for those running clusters new enough to have it, improves on this: it can let traffic cross zones if all the endpoints in a given zone go away.
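For those newer clusters, opting in is a single field on the Service. Here’s a sketch with a hypothetical Service name; the older TAR approach used an annotation instead, shown commented out:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                       # hypothetical Service name
  # The older topology-aware-routing mechanism was enabled via annotation:
  # annotations:
  #   service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
  trafficDistribution: PreferClose   # prefer in-zone endpoints, but fall back
                                     # to other zones if the local zone has none
```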
But all three of these mechanisms share a major caveat: they only pay attention to whether or not a pod passes its readiness check. If all the pods in a given zone pass their readiness checks, but they’re all overloaded and slow, none of them will send traffic out of the zone – even if bringing in capacity from other zones would dramatically improve performance.
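To make that caveat concrete, here’s what a typical readiness probe looks like (the path and timings are hypothetical). Note that it’s strictly pass/fail: a pod that answers just inside the timeout counts exactly as ready as one that answers instantly:

```yaml
# Excerpt from a pod spec: readiness has no notion of "slow", only up or down.
containers:
  - name: app
    image: my-app:latest          # hypothetical image
    readinessProbe:
      httpGet:
        path: /healthz            # hypothetical health endpoint
        port: 8080
      periodSeconds: 5            # check every five seconds
      timeoutSeconds: 1           # any response within a second counts as healthy
      failureThreshold: 3         # three straight failures mark the pod unready
```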
Smarter Routing
Taking all this into consideration, if you really want to control costs and maintain reliability, your only current option is to bring in some additional software. Service meshes and CNIs can both dramatically extend your options for Kubernetes networking, with Linkerd’s High Availability Zonal Load Balancing (HAZL) being the first thing designed for exactly this job of minimizing cross-zone traffic while still crossing zones when the application needs it. Since a mesh knows how well a given endpoint is actually performing – not just whether it’s ready – it can make much better decisions about when to allow traffic to cross zones. This approach requires running additional software, but it can give you the best of both worlds: all the reliability guarantees of multi-zone clusters, without the out-of-control costs.
Overall, multi-zone clusters can be a powerful tool for preserving your ability to function even in the face of widespread failures. Using them effectively, as always, requires understanding what they can do, what they protect against – and how they can affect your cloud provider bill.