Scaling Kubernetes With AI and Machine Learning
If you are a site reliability engineer (SRE) for a large Kubernetes-powered application, optimizing resources and performance is a daunting job. Some spikes, like a busy shopping day, can be broadly scheduled; but planning for them properly requires a painstaking understanding of the behavior of hundreds of microservices and their interdependencies, re-evaluated with each new release – not a scalable approach, to say nothing of the monotony and stress it imposes on the SRE. Moreover, there will always be unexpected peaks to respond to. Continually keeping tabs on performance and putting the optimal amount of resources in the right place is essentially impossible.
The way this is being solved now is through gross overprovisioning or a combination of guesswork and endless alerts – requiring support teams to review and intervene. It’s simply not sustainable or practical and certainly not scalable. But it’s just the kind of problem that machine learning and AI thrive on. We have spent the last decade dealing with such problems, and the arrival of the latest generation of AI tools, such as generative AI, has opened the possibility of applying machine learning to the real problems of the SRE to realize the promise of AIOps.
Turning Up the Compute Knob
No matter how good your observability dashboard is, the amount of data and the need for agility are just too much. You have to provision adequate resources to achieve the desired response times and error rates. It is not unusual for people in this role to peg compute utilization at 30% “to be safe” and then monitor hundreds of microservices to ensure the desired service-level agreement (SLA) is met. The end result is costly, not just in compute resources but also in the DevOps resources dedicated to maintaining the SLA.
It seems that for all it has brought us, Kubernetes has evolved beyond the capabilities of those charged with operating it. Horizontal Pod Autoscaling (HPA) and other reactive scaling solutions still leave SREs guessing at what CPU utilization threshold will hold up across varying traffic loads and service-graph dependencies. Traffic does not have a linear relationship to microservice load, and thus to performance, and traffic is not the only reason to change the application’s deployment state: SREs also monitor issues like temperature, faults, and latency.
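To make the guesswork concrete: the HPA control loop essentially applies the documented rule desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The minimal Python sketch below (our own illustration, not part of Kubernetes or Smart Scaler) shows that the target utilization an SRE picks directly dictates fleet size, and a single static number rarely suits every microservice and traffic pattern.

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float,
                         min_replicas: int = 1,
                         max_replicas: int = 50) -> int:
    """Approximation of the documented HPA rule:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# The SRE must pick target_utilization up front. A conservative 30% target keeps
# latency safe but nearly triples the fleet compared with a 90% target.
for target in (0.30, 0.60, 0.90):
    print(f"target={target:.0%} -> replicas={hpa_desired_replicas(10, 0.75, target)}")
```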
A typical Kubernetes application has, on average, several hundred microservices, and each one depends on others in a web of interconnected relationships. It is not easy for a person to view and understand it all, make detailed changes, and then repeat the exercise for every release of every microservice, week after week. So SREs figuratively “turn up the compute knob” and hope it improves whatever has dropped below the service-level objective (SLO). But increasing resources for a microservice is useless when the real bottleneck is another service it depends on.
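A toy example makes the dependency problem concrete. In the sketch below (with made-up per-replica throughput numbers), the chain’s capacity is set by its slowest dependency, so adding replicas to the wrong service buys nothing.

```python
def chain_capacity(frontend_replicas: int, backend_replicas: int,
                   frontend_rps_per_replica: float = 200.0,
                   backend_rps_per_replica: float = 50.0) -> float:
    """Requests/sec a two-service chain can sustain: capped by the slowest dependency."""
    return min(frontend_replicas * frontend_rps_per_replica,
               backend_replicas * backend_rps_per_replica)

print(chain_capacity(4, 4))    # 200.0 rps: the backend is the bottleneck
print(chain_capacity(16, 4))   # 200.0 rps: quadrupling the frontend changed nothing
print(chain_capacity(4, 16))   # 800.0 rps: scaling the dependency is what helps
```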
Simply turning up the “compute knob” does not solve everything – and will definitely be the most expensive solution. Kubernetes needed a better system-level solution that was practical and did the job.
An Ideal Use Case for AI
In 2023, when someone says AI, they almost inevitably mean ChatGPT. ChatGPT is a generative AI tool that, at each step, chooses the best next word. While the architecture required for a strong AIOps platform is very different from ChatGPT’s (more on that later), the goal is similar: choose the best next state for the application.
The intricately interconnected ecosystems of modern microservice applications are too big and complex for the SRE team to comprehend in detail and make those decisions. Most efforts to auto-scale these applications fail to take into account the nuanced requirements and performance needs of individual services.
Putting Digital Twins Through Their Paces
Training data is the fuel for AI. To teach a model to operate a mission-critical Kubernetes instance, we need to develop good information about how its performance can be optimized. Digital twins have been used for decades in fields from manufacturing to racing to recreate a digital equivalent of a real system and study its behavior. In our case, we use performance metrics to build a digital twin of each microservice.
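What such a twin might look like is easiest to see in code. The sketch below is a deliberately simple illustration that assumes latency can be regressed on requests-per-replica; the real model form, features, and metrics are not described here.

```python
import numpy as np

class MicroserviceTwin:
    """Illustrative per-service digital twin: a least-squares fit that predicts
    p95 latency (ms) from request rate and replica count. The feature set and
    model form here are assumptions for the sketch, not Smart Scaler's internals."""

    def __init__(self):
        self.coef = None

    def fit(self, rps, replicas, latency_ms):
        # Feature: requests per replica (a crude load proxy), plus a quadratic term.
        load = np.asarray(rps) / np.asarray(replicas)
        X = np.column_stack([np.ones_like(load), load, load ** 2])
        self.coef, *_ = np.linalg.lstsq(X, np.asarray(latency_ms), rcond=None)

    def predict(self, rps, replicas):
        load = rps / replicas
        return float(self.coef @ np.array([1.0, load, load ** 2]))

# Fit on hypothetical observed metrics, then ask "what if" questions offline.
twin = MicroserviceTwin()
twin.fit(rps=[100, 200, 400, 800], replicas=[2, 2, 4, 4],
         latency_ms=[20, 35, 38, 90])
print(twin.predict(rps=600, replicas=4))  # predicted latency at an unseen state
```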
In reinforcement learning (RL), the digital twins are used to create a simulation environment that generates an observation space in which a model can be trained to discover and learn the best paths (also known as ‘trajectories’) for guiding the system to states with the desired properties in terms of cost, performance, and so on. In our case, we use proximal policy optimization (PPO) as the RL training algorithm, and our approach is service-graph aware so that the microservice dependencies that affect scaling are taken into account. Ultimately, we will have a model-free network that continually learns from operational experience.
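Concretely, the observation space can be pictured as a small simulated cluster whose dynamics come from the digital twins. The sketch below is an illustration under assumed state, action, and reward definitions (toy_twin, ScalingEnv, and all constants are hypothetical), not Smart Scaler’s actual environment; a PPO implementation from a standard RL library would be trained against something shaped like it.

```python
import numpy as np

def toy_twin(rps: float, replicas: int) -> float:
    """Stand-in for a fitted digital twin (see the earlier sketch): predicted
    p95 latency in ms grows with requests per replica. Purely illustrative."""
    load = rps / replicas
    return 10.0 + 0.004 * load ** 2

class ScalingEnv:
    """Toy RL environment built on per-service twins. Observation: offered rps
    plus replica counts; action: scale each service down/hold/up by one replica;
    reward: -(SLO-violation penalty + compute cost). The state layout, reward
    shaping, and constants are assumptions for illustration only."""

    def __init__(self, twins, slo_ms=100.0, cost_per_replica=0.05):
        self.twins = twins
        self.slo_ms = slo_ms
        self.cost_per_replica = cost_per_replica
        self.reset()

    def reset(self):
        self.replicas = np.full(len(self.twins), 4)
        self.rps = 400.0
        return self._obs()

    def _obs(self):
        return np.concatenate([[self.rps], self.replicas.astype(float)])

    def step(self, action):
        # action[i] in {-1, 0, +1}: remove, keep, or add one replica of service i.
        self.replicas = np.clip(self.replicas + action, 1, 64)
        self.rps = max(50.0, self.rps + np.random.normal(0.0, 40.0))  # drifting traffic
        # The worst predicted latency across the service graph drives the SLO term.
        worst_latency = max(t(self.rps, r) for t, r in zip(self.twins, self.replicas))
        slo_penalty = max(0.0, worst_latency - self.slo_ms)
        cost = self.cost_per_replica * self.replicas.sum()
        reward = -(slo_penalty + cost)
        return self._obs(), reward, False, {}

# A PPO learner from a standard RL library would interact with this environment
# to learn a scaling policy; random actions stand in for the policy here.
env = ScalingEnv([toy_twin, toy_twin])
obs = env.reset()
for _ in range(3):
    action = np.random.randint(-1, 2, size=len(env.twins))
    obs, reward, done, info = env.step(action)
    print(obs, round(reward, 2))
```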
In summary, Smart Scaler looks at the current state, understands what the historical and digital twin models did in that state, and generates the next state – optimal not just for the next minute but for the path forward. Keeping a Kubernetes instance running optimally is not unlike playing an endless game of Go: the model makes its best move now while thinking far into the future.
Better Responsiveness and Ongoing Improvement
Kubernetes has come a long way. There is extensive tool-level automation but not much effective system-level automation, perhaps because of the sheer amount of activity within a Kubernetes instance. We boiled the problem down to deciding the best next state for the application. By training an AI on the modeled behavior of the digital twins, Smart Scaler can provide that optimization now.
Just as ChatGPT gets better with more input, Smart Scaler is retrained regularly for each client instance. As it collects new data, it only gets better at generating the best path through future system states. In addition to deepening its understanding of each client instance, Smart Scaler will continue to train the general model on more microservices. In the next few years, we anticipate that the parameters of the underlying neural network will grow from hundreds of thousands to a few million.
People have been playing with generative AI that can produce words and images for a general audience. We are seeing how the same technology can transform our digital experience.
For SRE Now and Developers of the Future
SREs today could benefit from a transformation. SRE teams are often asked to contribute to their own SLOs, and they simply don’t know where to begin. It seems that the complexity of Kubernetes has outpaced the ability of humans alone to operate it. Smart Scaler does interact with human operators and allows extensive operator control, but those operators are no longer called upon to make critical decisions from thousands of data points with only seconds to act. Today, Smart Scaler does that work – it improves outcomes for existing Kubernetes instances. Looking ahead, applying AIOps models and moving toward autonomous infrastructure can allow a new level of complexity and scale for microservices applications.