Scaling is Still Hard. Machines Make it Easier

At StormForge, we’ve been focused on extracting the most from your applications. As we’ve talked at length about optimization, sustainability and related concepts, one implementation topic has reared its head over and over again, in the area of resource utilization. That is … just plain old scaling.

Scaling, as a concept, is hard. Not only does it need to be automatic so you don’t have your operations team acting like modern-day switchboard operators, but it also needs to be responsive so your business doesn’t suffer. It needs to be flexible so you can dynamically fit varying needs, and it needs to be efficient so you don’t end up paying through the nose. To use it well, you then need a solid understanding of your application’s behavior; otherwise, you’re putting your faith in default values that were never meant for your app.

Newer and better tools for scaling appear and evolve all the time, depending on the infrastructure ecosystem you work in. We’re lucky to be focused on the Kubernetes ecosystem, which is growing at least as rapidly as it was five years ago, and there are a lot of scaling opportunities. Before we even worry about the ability to dynamically scale your clusters’ node count, the apps themselves can take advantage of the horizontal pod autoscaler (HPA), which is a great tool that is part of Kubernetes itself. The HPA, however excellent it is at adding and removing pods, is not without its shortcomings when it comes to implementation, even counting the great features that came out around the 1.18 release, such as individual rate controls.
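To make the HPA's behavior concrete, here is a minimal sketch of the replica calculation described in the Kubernetes documentation: the desired count is the current count scaled by the ratio of the observed metric to the target, rounded up, with no action taken when that ratio is within a tolerance of 1.0 (0.1 by default). The function and parameter names are illustrative, not part of any Kubernetes API, and the sketch leaves out min/max replica bounds, stabilization windows and rate policies.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the HPA's documented ratio rule (names are hypothetical)."""
    ratio = current_metric / target_metric
    # If the metric is close enough to the target, leave the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Otherwise scale the replica count by the ratio and round up.
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

Everything interesting about tuning lives in which metric you feed in and what target you pick; the arithmetic itself is the easy part.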

As implied above, default settings can be far from optimal in any capacity. Depending on your app and load balancer, you may even find yourself scaling up infinitely, and not notice until you have an angry CFO pinging you on Slack, or you run out of IPs, whichever comes first. That’s when you’ll realize that default CPU or memory load, for example, wasn’t actually a good measure of your app, and now you want some custom metrics for the HPA to scale against. So, your team spends several sprints analyzing your applications and capturing lots of metrics (many of which end up being irrelevant). You arrive at what you describe as “Pretty good, we think,” while you shrug, knowing it would take an entire research team to get better accuracy beyond what you picked from a default list of Prometheus variables.
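The runaway case is easy to reproduce on paper. The toy simulation below applies a ratio-based scaling rule (replicas multiplied by observed metric over target, rounded up, with a 10% dead band) to two made-up apps: one where a fixed load is shared across pods, and one where every pod reports the same value no matter how many pods exist. All numbers, the `simulate` helper and the metric functions are assumptions for illustration; a real autoscaler also applies rate limits and stabilization that this sketch ignores.

```python
import math


def simulate(per_pod_metric, target, start=2, max_replicas=1000, steps=20):
    """Repeatedly apply a ratio-based scaling rule and record replica counts."""
    replicas = start
    history = [replicas]
    for _ in range(steps):
        ratio = per_pod_metric(replicas) / target
        # Only act when the metric is more than 10% away from the target.
        if abs(ratio - 1.0) > 0.1:
            replicas = min(max_replicas, max(1, math.ceil(replicas * ratio)))
        history.append(replicas)
    return history


# Well-behaved metric: 600 units of load spread evenly across the pods.
shared = simulate(lambda n: 600 / n, target=60)

# Poorly chosen metric: each pod reports 90 regardless of pod count,
# so adding pods never brings the metric any closer to the target.
stuck = simulate(lambda n: 90, target=60)

if __name__ == "__main__":
    print("shared load settles at:", shared[-1])
    print("flat metric climbs to: ", stuck[-1])
```

In the first case the replica count settles once the per-pod value reaches the target. In the second, nothing ever pulls the metric down, so the only thing that stops the growth is the replica cap (or the CFO).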

The above scenario is so (un)surprisingly typical, it is exactly why vendors like us are pointing machine learning at this problem space. Without the type of work above, you know in the back of your mind that it’s a powder keg. It’s only a matter of time before something explodes, and an AWS bill that’s 200 times higher than normal shows up, possibly without warning. (Of course, there are many other ways besides scaling for that to happen, too, but that’s a separate topic.) You know you have to do scaling properly, but doing it takes time, human effort and an incredible amount of introspection and analysis that throws off roadmap timetables like nobody’s business. You could just bring in some consultants, but at that point you should stop doing these things by hand and let the computers do what computers are good at: many iterations of repetitive tasks. Let the computer churn through the unimportant details and show you that spike of importance in a sea of otherwise lost man-hours. Throwing ML at this is the best path forward today to find scaling that properly fits your apps, instead of relying on someone else’s analysis of someone else’s app.

It’s also important to point out that all of the above is only about reactive scaling, and not about proactive scaling of any kind. That requires a whole additional level of analysis that goes far beyond just “Reserve 20 times our normal capacity for Black Friday.” What are the actual patterns of your dev work on weekends? Can you correlate usage not just to inbound connections, but to where you are in a pay cycle or world events? It’s almost a given that you are missing scaling behaviors just because you’re doing it by hand, and you didn’t know to look in the first place.
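As a very small taste of what that pattern-hunting looks like, the sketch below buckets historical load samples by day of the week and averages them, which is roughly the first question you would ask before pre-provisioning capacity. The `weekday_profile` helper and the sample values are hypothetical; a real analysis would need far more data and far more dimensions (hour of day, pay cycles, events) than this.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def weekday_profile(samples):
    """Average load per weekday (0 = Monday) from (timestamp, load) pairs."""
    buckets = defaultdict(list)
    for timestamp, load in samples:
        buckets[timestamp.weekday()].append(load)
    return {day: mean(loads) for day, loads in buckets.items()}


# Made-up history: two Mondays and one Saturday.
samples = [
    (datetime(2021, 10, 11, 9, 0), 100),
    (datetime(2021, 10, 18, 9, 0), 140),
    (datetime(2021, 10, 16, 9, 0), 20),
]
profile = weekday_profile(samples)

if __name__ == "__main__":
    for day, avg in sorted(profile.items()):
        print(day, avg)
```

Even this trivial grouping is something most teams never get around to doing by hand, which is the point: the patterns you don't look for are the ones you miss.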

If you find it worth your time to write scripts to automate menial human tasks, why wouldn’t you turn the same pattern loose on preventing surprises and actually picking up on missed opportunities to make things better, faster and/or cheaper? As with many things in the world of on-demand technology, the surprises can cost you far more than the prevention ever would, and you could have been getting ahead instead of falling behind.


Noah Abrahams

Noah runs the Las Vegas Kubernetes community, and shepherds OSS projects at StormForge. He is a CNCF ambassador, has previously been part of the Kubernetes Contributor Experience team, and led the non-code contributor guide efforts. He has been working in Cloud for over a decade and in Tech since the 90s.
