Making the Most of Kubernetes for Big Data
Kubernetes has grown beyond its roots as a flexible container orchestration system to become the de facto operating system of the cloud, capable of running almost anything. Its popularity is evidenced by the record-breaking attendance at this year’s KubeCon + CloudNativeCon North America.
As more enterprises embrace Kubernetes, cost optimization has become an urgent imperative. Startups and enterprises that experienced sticker shock when migrating to the cloud must demonstrate the ROI of their cloud investments and manage costs in a disciplined way. Fortunately, many cost optimization approaches are now available that benefit, rather than disrupt, business operations. Understanding the historical trends behind the development and adoption of Kubernetes, and identifying the key cost optimization solutions, goes a long way toward a successful and cost-effective Kubernetes deployment.
Kubernetes and Hadoop: A Tale of Two Islands
Kubernetes emerged in the mid-2010s from Borg, Google's internal system for orchestrating containerized workloads at massive scale. The open source community quickly adopted Kubernetes, building an ecosystem around containerization.
At around the same time, a vastly different technology was maturing to support ever-larger datasets: Hadoop. As an open source project, Kubernetes lacked Borg's support for big data and was therefore considered unsuitable for the kinds of large-scale data and computing problems that Hadoop was addressing. With its mortal pods that serve a single function and then vanish, Kubernetes was ideal for stateless workloads like microservices, which simply accept a request, process it and return a response without saving state. Hadoop, in contrast, was suited to stateful workloads that persist state across sessions, such as applications that process massive datasets across hundreds of machines.
The parallel rise of Kubernetes and Hadoop in the mid-2010s meant that the computing world was divided into two islands:
Island 1: Kubernetes and microservices
Island 2: Hadoop and big data
The disjointed nature of these two islands made it difficult for enterprises to bridge the divide between them. Practically speaking, an organization would have three choices:
- Run all its applications on one island or the other
- Attempt to run microservices applications on Hadoop or big data applications on Kubernetes
- Absorb the cost and overhead of adopting both islands
None of these choices was a clear winner, and organizations quickly found that navigating two independent islands of computation was costly and unwieldy.
All Islands Are Not Equal: Apache Spark Disrupts the Status Quo
Meanwhile, another technology was emerging that would disrupt this stalemate of sorts between Kubernetes and Hadoop. The growth of Apache Spark—with its high performance, ease of use and versatility—surfaced the challenges of managing two separate computing worlds. Kubernetes began to take the lead over Hadoop as a platform for Spark, offering a number of advantages that hold true today:
- The robust scalability of Kubernetes complements the scalability of Spark—essential for processing massive datasets and accommodating varying workload demands.
- Kubernetes containerization simplifies the deployment and management of Spark applications across multi-cloud environments.
As a result, Kubernetes, not Hadoop, emerged as the most logical container orchestration system for enterprise-grade Spark.
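As a minimal sketch of what this looks like in practice, the PySpark snippet below points a Spark session directly at a Kubernetes cluster. The API server URL, container image and namespace are placeholder values for illustration, not a real deployment:

```python
from pyspark.sql import SparkSession

# A minimal sketch of running Spark on Kubernetes from PySpark.
# The API server URL, container image and namespace below are
# placeholders -- substitute the values for your own cluster.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-demo")
    .master("k8s://https://kubernetes.example.com:6443")  # cluster API server (placeholder)
    .config("spark.kubernetes.container.image", "example.com/spark:3.5.0")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.executor.instances", "4")  # scale out by adding executor pods
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

# Each executor runs in its own pod, so the same job definition is
# portable across EKS, GKE, AKS or an on-premises cluster.
df = spark.range(1_000_000)
print(df.selectExpr("sum(id)").collect())
spark.stop()
```

Because the executors are ordinary pods, scaling the job up or down is just a matter of changing the executor count, and the same configuration travels unchanged between clouds.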
AI/ML Workloads: Further Support for Stateful Apps on Kubernetes
We are now seeing a convergence of these two islands in the form of big data on Kubernetes. In a single year, from 2021 to 2022, the number of big data projects on Kubernetes grew by 35%. And 61% of Kubernetes professionals say that they’re running stateful applications like Apache Spark on Kubernetes.
The exponential growth of artificial intelligence (AI) and machine learning (ML) further drives this convergence. Many AI/ML applications require immense computational capacity, which makes running stateful applications on Kubernetes increasingly compelling. The tools that have recently emerged from the open source community around Kubernetes are indicative of this alignment with AI/ML and other stateful applications:
- Airflow: A Kubernetes-friendly tool for orchestrating big data workflows (see the sketch after this list)
- YuniKorn: A resource scheduler that provides flexible and efficient resource utilization for Spark and other applications on Kubernetes
- Kubeflow: A platform for building and deploying portable, scalable machine learning (ML) and MLOps workflows on Kubernetes
- Volcano: A batch scheduler designed for running high-performance workloads on Kubernetes
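To make the Airflow entry concrete, here is a minimal, hypothetical DAG that launches a Spark job in a Kubernetes pod. It assumes the cncf.kubernetes provider package is installed; the image, namespace and spark-submit arguments are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# A minimal sketch of orchestrating a big data step on Kubernetes with
# Airflow. The image, namespace and command are placeholders for a real
# Spark job in your own registry and cluster.
with DAG(
    dag_id="spark_on_k8s_daily",
    start_date=datetime(2023, 11, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_spark_job = KubernetesPodOperator(
        task_id="run_spark_job",
        name="spark-etl",
        namespace="spark-jobs",
        image="example.com/spark:3.5.0",           # placeholder image
        cmds=["/opt/spark/bin/spark-submit"],
        arguments=[
            "--master", "k8s://https://kubernetes.default.svc",
            "local:///opt/spark/examples/src/main/python/pi.py",
        ],
        get_logs=True,  # stream the pod's logs into the Airflow task log
    )
```

Each scheduled run spins up a fresh pod, runs the job and tears it down, which is exactly the mortal-pod model described earlier, now applied to a stateful big data pipeline.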
These projects underscore an increasing reliance on Kubernetes as an efficient, reliable and robust option not just for bread-and-butter stateless applications like microservices but also for stateful applications built on Spark and AI. Today, the evolution of computing is being driven by the breakneck growth of AI, including large language models (LLMs) like ChatGPT. OpenAI, the creator of ChatGPT, now runs its deep learning infrastructure on Kubernetes, taking advantage of the platform's inherent portability and speed.
The Cost Optimization Imperative
Now that Kubernetes has evolved into a mature and highly performant system for both microservices and stateful applications, the massive volume, variety and velocity of data running on it make it critical for organizations to get the most from their investment. According to the State of Kubernetes 2023 report, 57% of respondents measure the ROI of their Kubernetes deployments in terms of cost savings. Without accounting for the cost implications of Spark's notorious inefficiencies, for example, there is a good chance that a Spark on Kubernetes initiative will flounder.
Cost optimization of stateful apps on Kubernetes is a two-step activity. First, optimization efforts can focus on the Kubernetes infrastructure itself (whether Amazon EKS, Google GKE, Microsoft AKS or another platform) through initiatives such as right-sizing nodes or purchasing capacity in advance at a discount. These actions provide essential cost savings at the platform level.
The second essential part of cost optimization is optimizing the Kubernetes applications themselves. An effective way to deal with the inefficiency inherent in these applications is an automated solution with real-time visibility into actual cluster usage, one that remediates waste second by second, tunes the application automatically and keeps costs under control. Organizations that apply such methods at the application layer will hold a strategic cost advantage over competitors as workloads grow larger and more complex.
Developers are often tempted to become mired in manual fixes, including 'whack-a-mole' efforts to tweak resource provisioning to match the ever-fluctuating demands of their applications. But engineers can only do so much by hand: on average, typical Spark on Kubernetes applications are overprovisioned by 30% to 50% or more.
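To see why manual tweaking falls short, consider the arithmetic behind that overprovisioning figure. The toy Python sketch below compares requested capacity against hypothetical usage samples; a real tool would pull such samples continuously from a metrics source like the Kubernetes metrics server or Prometheus, and act on them in real time:

```python
# A toy sketch of the overprovisioning arithmetic described above.
# The usage samples are hypothetical illustrations; a real tool would
# pull them continuously from a metrics API rather than hard-code them.

def waste_ratio(requested: float, used_samples: list[float]) -> float:
    """Fraction of a requested resource that sits idle on average."""
    avg_used = sum(used_samples) / len(used_samples)
    return max(0.0, 1.0 - avg_used / requested)

# An executor pod that requests 4 CPU cores but averages ~2.2 cores of use
requested_cores = 4.0
observed_cores = [2.5, 1.9, 2.2, 2.1, 2.4]  # hypothetical samples

print(f"wasted capacity: {waste_ratio(requested_cores, observed_cores):.1%}")
# -> about 45% idle, squarely within the 30% to 50% range cited above
```

Because the usage samples fluctuate from hour to hour, any hand-tuned request is stale almost as soon as it is set, which is exactly why the automated, continuous approach described above wins at scale.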
The Common Future of Kubernetes, the Cloud and AI/ML
With the convergence of the formerly separate islands of big data and Kubernetes, accelerated by the rapid adoption of stateful applications like those built on Spark, AI will serve as the rocket fuel that drives greater decentralization across multiple cloud environments and opens incredible possibilities for innovation.
No matter how our world is transformed by such advances, I believe a few fundamental elements are here to stay:
- The Cloud: Only the cloud can provide the infrastructure necessary for large-scale AI development and deployment on systems such as Kubernetes, as OpenAI has demonstrated.
- Distributed Computation on Kubernetes: Training LLMs requires massive amounts of compute, typically supplied by GPU farms in which each GPU is, in effect, a supercomputer on a chip. Such immense computing power can only be delivered through distributed computing across cloud networks, and in environments where computing is distributed, the portability of containers makes Kubernetes the obvious choice.
- Cost Optimization: Computing power may appear infinite, but money is not. Cost optimization at both the platform and the application level will continue to play a pivotal role in determining the leaders in this new world. Cloud-based Kubernetes solutions, including Amazon EKS, Google GKE and Microsoft AKS, will continue to build out their native cost optimization tools and leverage those from the open source community.
All this exciting potential comes with a caveat: the imperative to spend money and other resources wisely so that we have the means to take advantage of the advances technology will offer. The winners in this new world will be those who maximize ROI by achieving transformative technological goals while simultaneously optimizing costs.
To hear more about cloud-native topics, join the Cloud Native Computing Foundation and the cloud-native community at KubeCon + CloudNativeCon North America 2023, November 6-9, 2023.