Building the Ideal Kubernetes Distro for SaaS: Part 1

January 17, 2022 / January 14, 2022 | Rick Spencer | Influx, InfluxDB, kubernetes, managed Kubernetes service, SaaS

This is a two-part series about the unique challenge of running a multi-cloud, multi-tenant SaaS solution on Kubernetes. Part one covers the problem domain, how running a SaaS product is different from the originally intended use for Kubernetes and what is and is not well-supported for this use case. Part two will follow up with specific examples and suggestions for how the Kubernetes user community could potentially solve these problems.

InfluxDB Cloud is one of the largest commercially available services using Kubernetes today. Kubernetes provides the core orchestration layer to create and deliver InfluxDB Cloud across 10 global cloud clusters among AWS, Google Cloud and Microsoft Azure—all with a single codebase. Kubernetes lets us give our customers the serverless experience they want in a way that wouldn’t be possible without the orchestration capabilities.

At the time of this writing, our team operates 16 Kubernetes clusters—nine of which are production clusters running customers’ workloads. While Kubernetes is a very useful tool for providing a cloud abstraction layer that allows us to run our services on multiple clouds with one codebase, the Kubernetes API is not really designed for this use case. Our team has proven that you can use Kubernetes as a cloud abstraction layer at scale, but doing so takes significant upfront effort.

This raises the question, “Why can’t we have a distribution that is cloud vendor-neutral, and is designed from the first install to be compatible with important third-party services that deliver such things as observability, marketplace integration, self-managed and SaaS channels, CI/CD and robust security?”

Understanding a New Problem

Kubernetes is designed for operators to have a single cluster running multiple stateless and (dare I say) trivial workloads. At InfluxData, on the other hand, we are running multiple clusters each with a single mission-critical workload. So, in some respects, like many of our peers, we are swimming upstream of the Kubernetes community and against the current of the API’s design.

In a previous role, I managed a team that delivered a popular Linux distribution, focused on providing a premium experience for desktop users. Part of the philosophy of that distro (and one of the things that made it so popular) was the notion of “sensible defaults.” While the distribution did not limit choice, we focused on providing a set of default applications that were known to work well—and, most importantly, work well together. A user who stuck with the defaults was swimming downstream with the rest of the community and therefore had a much more productive experience.

In my dream world, such a community would form around a Kubernetes distribution designed for folks like me, who are responsible for mission-critical, multi-cloud, multi-region workloads. So if we could build the “dream Kubernetes distro” that SaaS providers need, here’s what it would look like.

Becoming CSP-Agnostic

The first thing to realize is that this ideal distribution will not come from any cloud service provider (CSP). After all, creating truly portable workloads is not really part of their business model. For example, CSPs’ managed Kubernetes services are usually at different versions, so the best you can do is maintain one codebase on the lowest common denominator version currently supported by the slowest CSP to upgrade.
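To make the lowest-common-denominator constraint concrete, here is a minimal sketch in Python. The provider names and version numbers are invented for illustration and do not reflect what any real CSP currently offers; the function names are hypothetical as well. The idea is simply that the version you can standardize on is the minimum of each provider's newest supported version.

```python
from typing import Dict, Tuple

# Hypothetical newest Kubernetes version offered by each managed service.
# These values are made up for illustration only.
LATEST_SUPPORTED: Dict[str, str] = {
    "csp_a": "1.22",
    "csp_b": "1.21",
    "csp_c": "1.20",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.21' into (1, 21) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def common_target_version(latest_by_csp: Dict[str, str]) -> str:
    """Return the newest version every CSP can run: the minimum of the maxima."""
    if not latest_by_csp:
        raise ValueError("need at least one CSP")
    return min(latest_by_csp.values(), key=parse_version)


if __name__ == "__main__":
    print(common_target_version(LATEST_SUPPORTED))  # prints 1.20
```

Note that the comparison is numeric rather than lexical, so "1.9" correctly sorts below "1.10". In this sketch a single slow provider holds every cluster back, which is exactly the frustration described above.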

Additionally, CSPs tend to have different levels of support for the different services that you might be plugging into, not to mention completely different APIs for providing services such as disks, etc. Even logging is different on the different cloud providers.

So, this dream distribution should provide an installer that works on all CSPs and is kept up-to-date on all CSPs while supplying—where possible—APIs and solutions that work on all clouds.

Controlled Updates

Every SaaS provider must be very intentional about writing services that survive updates to new Kubernetes versions. Nonetheless, getting an alert that your pods are being relocated to new nodes because the CSP’s managed service determined that right now is an ideal time to upgrade your cluster is a really frustrating experience, especially if it coincides with the discovery of bugs that keep your services from handling such upgrades cleanly.

This problem goes back to the point that Kubernetes is designed for people running relatively trivial stateless workloads. This ideal distribution would not make such assumptions and would allow an SRE team to plan and canary updates to new versions. And because the distribution would support all CSPs, an SRE team could automate upgrades to get every cluster to the same version in a sane way, regardless of the CSP.
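One way to picture a planned, canaried rollout is as an ordered list of upgrade waves. The following Python sketch is only an assumption about how such a plan might be expressed, not a description of how InfluxData or any distribution does it. The Cluster type, the plan_upgrade_waves function and the cluster names are all hypothetical. It upgrades non-production clusters first, then one production canary per CSP, then everything else.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cluster:
    name: str         # hypothetical cluster name
    csp: str          # which cloud provider hosts the cluster
    production: bool  # True if the cluster runs customer workloads


def plan_upgrade_waves(clusters: List[Cluster]) -> List[List[Cluster]]:
    """Group clusters into ordered waves: non-production, canaries, the rest."""
    non_prod = [c for c in clusters if not c.production]
    canaries: List[Cluster] = []
    remaining: List[Cluster] = []
    seen_csps = set()
    for cluster in clusters:
        if not cluster.production:
            continue
        if cluster.csp not in seen_csps:
            # The first production cluster seen on each CSP acts as its canary.
            seen_csps.add(cluster.csp)
            canaries.append(cluster)
        else:
            remaining.append(cluster)
    # Drop empty waves so callers only iterate over real work.
    return [wave for wave in (non_prod, canaries, remaining) if wave]


if __name__ == "__main__":
    fleet = [
        Cluster("stg-a", "csp_a", False),
        Cluster("prod-a1", "csp_a", True),
        Cluster("prod-a2", "csp_a", True),
        Cluster("prod-b1", "csp_b", True),
    ]
    for number, wave in enumerate(plan_upgrade_waves(fleet), start=1):
        print(number, [c.name for c in wave])
```

The point of the sketch is that the SRE team, not the CSP, decides the order and timing. Each wave would be gated on the health of the previous one before proceeding.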

Sensible Defaults

It’s at this point that certain choices need to be made. Which services, both internal and external, should we choose as the defaults? Of course, this distribution will allow users to deviate from these defaults, but based on my experience, someone starting a SaaS application on Kubernetes will need the following:

A CI/CD system that implements GitOps
A monitoring and SLO solution
A multifaceted security solution
The ability to allow enterprise customers to run the application in their own infrastructure
An integration with CSP marketplaces
A service mesh configured for multi-cloud, multi-region applications

By coalescing around a standard set of solutions for these requirements, we can consolidate our efforts across the industry to provide high availability and a productive developer experience and turn our attention to providing value to customers. We can deliver code (instead of spending time figuring out how to deliver code) and we can more easily support our peers in the industry while doing so.

To Be Continued…

My next article in this series will include specific suggestions for what those sensible defaults could be, and how that might impact SaaS providers such as ourselves. I will be name-checking specific solutions that I hold in high regard, and I will continue to share what we’ve learned on our own journey.