Best of 2023: Using Airflow to Run Spark on Kubernetes
As we close out 2023, we at Cloud Native Now wanted to highlight the most popular articles of the year. Following is the latest in our series of the Best of 2023.
Ganesh Kumar Singh contributed to this article.
In recent years, there has been a significant surge in companies using Apache Spark on Kubernetes (K8s). Kubernetes offers many benefits with this approach; in fact, a recent survey stated that 96% of organizations are now either using or evaluating Kubernetes. As more and more businesses migrate to the cloud, the number of companies deploying Spark on Kubernetes continues to rise. However, it’s important to note that this approach does have its drawbacks. Enterprises that choose to run Spark with Kubernetes must be prepared to tackle the challenges that come with this solution. This means having a strong understanding of their infrastructure and being able to optimize its performance across multiple dimensions. Ultimately, success with Spark on Kubernetes depends on the ability to monitor and manage the platform effectively.
This blog will detail the steps for setting up a Spark app on Kubernetes using the Airflow scheduler. The goal is to enable data engineers to program the stack seamlessly for similar workloads and requirements.
Benefits of Running Spark on Kubernetes
Kubernetes can save time and effort and provide a better experience while executing Spark jobs. In addition, there are business benefits, including scalability, reliability, visibility and cost-effectiveness. Running Spark on Kubernetes also provides portability to any cloud environment, making it less dependent on any particular cloud provider. This approach saves time in orchestrating, distributing and scheduling Spark jobs across different cloud providers.
In addition, this solution uses a common Kubernetes ecosystem that enables features such as continuous deployment, role-based access control (RBAC), dedicated node-pools and autoscaling, among others.
Understanding the Technologies
Before we set up, let’s look at all the technologies involved.
Kubernetes
Kubernetes is a container management system originally developed by Google. It helps manage containerized applications across physical, virtual and cloud environments. Kubernetes is a highly flexible container orchestration tool for consistently delivering complex applications running on clusters of hundreds or thousands of individual servers.
Spark
Apache Spark is a distributed processing system for handling big data workloads. It is an open source platform that leverages in-memory caching and optimized query execution to deliver fast queries on data of any size. Spark is designed to be a fast and versatile engine for large-scale data processing.
Airflow
Apache Airflow is an open source platform designed for developing, scheduling and monitoring batch-oriented workflows. Airflow provides an extensible Python framework that enables users to create workflows connecting with virtually any technology. The platform includes a web interface that helps manage the state of workflows. Airflow is highly versatile and can be deployed in many ways, ranging from a single process on a laptop to a distributed setup capable of supporting the largest data workflows.
Spark on Kubernetes Using Airflow
Apache Spark is a high-performance open source analytics engine designed for processing massive volumes of data using data parallelism and fault tolerance. Kubernetes, on the other hand, is an open source container orchestration platform that automates application deployment, scaling and management. When used together, Spark and Kubernetes offer a powerful combination that delivers exceptional results. Simply put, Spark provides the computing framework while Kubernetes manages the cluster, providing users with an operating system-like interface for managing multiple clusters. This results in unparalleled cluster use and allocation flexibility, which can lead to significant cost savings.
The Spark on Kubernetes operator is a great choice for submitting a single Spark job to run on Kubernetes. However, users often need to chain multiple Spark and other types of jobs into a pipeline and schedule that pipeline to run periodically. In this scenario, Apache Airflow is a popular solution: an open source platform that allows users to programmatically author, schedule and monitor workflows, and it can itself be run on Kubernetes.
The Current Setup
Kubernetes is used to create a Spark cluster from which parallel jobs are launched. Job launches are not managed directly through the master node of the Spark cluster but from another node running an instance of Airflow. This provides more control over the executed jobs as well as features such as backfill execution, that is, running executions for dates in the past once a schedule has been defined. Airflow has a robust server and scheduler and provides a Python API with which programmers can specify tasks and their execution as a directed acyclic graph (DAG).
There are three major steps to running a Spark application on Kubernetes using the Airflow scheduler:
- Setting up Kubernetes Cluster
- Setting up the Spark operator on Kubernetes
- Installing Airflow on Kubernetes
Setting up Kubernetes Cluster
The setup is done using RKE2. The RKE2 documentation covers the full installation process; a minimal sketch of a single-node server install follows.
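The commands below follow the RKE2 quick start; exact paths and the kubeconfig location may differ for your version and distribution:
$ curl -sfL https://get.rke2.io | sh -
$ systemctl enable rke2-server.service
$ systemctl start rke2-server.service
$ export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
$ kubectl get nodes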
Spark Operator Setup on Kubernetes
- After setting up the K8s cluster, install the Spark operator using the following commands:
$ kubectl create namespace spark-operator
$ kubectl create namespace spark-jobs
$ helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
$ helm install <name of spark operator> spark-operator/spark-operator --namespace spark-operator --set webhook.enable=true
- Set the operator to run Spark applications in the custom namespace by adding --set sparkJobNamespace=spark-jobs:
$ helm install spark-operator spark-operator/spark-operator --namespace spark-operator --set webhook.enable=true --set sparkJobNamespace=spark-jobs
- Create a service account called spark and a clusterrolebinding:
$ kubectl create serviceaccount spark -n spark-jobs
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-jobs:spark --namespace=spark-jobs
Spark-operator setup is now complete.
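Before moving on, it is worth confirming that the operator pod is running and that the new service account can manage pods in the spark-jobs namespace. This is a quick sanity check, not part of the original setup:
$ kubectl get pods -n spark-operator
$ kubectl auth can-i create pods --as=system:serviceaccount:spark-jobs:spark -n spark-jobs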
- Next, test the Spark operator by submitting a sample Spark application using a deployment file. Here’s how you can code the file:
sample.yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.4"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar"
  sparkVersion: "2.4.4"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.4
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.4
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
- Start the application
$ kubectl apply -f sample.yaml
- List out all sparkapplication jobs
$ kubectl get sparkapplication -n spark-jobs
- To check the Spark application log, use the following command:
$ kubectl -n <namespace> logs -f <driver pod name>
Example:
$ kubectl -n spark-jobs logs -f spark-pi-driver
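To re-run the test, the finished application can be deleted first; spark-pi is the name defined in sample.yaml:
$ kubectl delete sparkapplication spark-pi -n spark-jobs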
Installing Airflow on Kubernetes
- Create a namespace in the K8s cluster:
$ kubectl create namespace airflow
- Install Airflow using Helm
$ git clone -b main https://github.com/airflow-helm/charts.git
$ helm upgrade --install airflow charts/airflow -f values.yaml -n airflow
$ helm upgrade --install airflow /home/ubuntu/airflow/charts/charts/airflow -f values.yaml --namespace airflow
Some of the changes that were pushed (for a specific requirement) in charts/charts/airflow/values.yaml are as follows; a sketch of these overrides appears after the list:
- Changed the executor type from CeleryExecutor to KubernetesExecutor
- Disabled the Flower component
- Disabled Redis
- Added the Git repository URL where Airflow will check for DAG files
- Configured web UI user accounts for the defined users and roles with the desired access. The values used here are placeholders that programmers can completely customize.
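A rough sketch of what these overrides might look like in the community chart's values.yaml. Key names can differ between chart versions, and the repository URL, branch and user details below are placeholders:
airflow:
  executor: KubernetesExecutor        # switch from the default CeleryExecutor
  users:
    - username: admin                 # placeholder web UI account; customize as needed
      password: admin
      role: Admin
      email: admin@example.com
      firstName: admin
      lastName: admin
flower:
  enabled: false                      # Flower is only useful with the Celery executor
redis:
  enabled: false                      # not needed with the KubernetesExecutor
dags:
  gitSync:
    enabled: true
    repo: "https://github.com/<your-org>/<your-dag-repo>.git"   # placeholder DAG repository
    branch: main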
Once the installation is done, we can see the Airflow pods and services:
$ kubectl get pods -n airflow
$ kubectl get svc -n airflow
The Airflow UI can be reached through the airflow-web service using the user account details configured in the values.yaml file.
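One common way to reach the UI during testing is to port-forward the airflow-web service; this assumes the web service listens on port 8080, which is the chart's default:
$ kubectl port-forward svc/airflow-web 8080:8080 -n airflow
Then open http://localhost:8080 and log in with the credentials defined in values.yaml.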
- Next, create a Kubernetes connection ID (kubernetes_conn_id) from the Airflow web UI:
Select Admin >> Connections >> add a new connection >> set the connection ID and connection type.
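The same connection can also be created from the Airflow CLI. This is a minimal sketch: the connection name myk8s matches the kubernetes_conn_id used in the DAG below, and the extra keys depend on the provider version:
$ airflow connections add myk8s --conn-type kubernetes --conn-extra '{"in_cluster": true}'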
Sample DAG used for testing:
sample-dag.py
from datetime import timedelta, datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

default_args = {
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'my-second-dag',
    default_args=default_args,
    description='simple dag',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022, 11, 17),
    catchup=False,
    tags=['example']
) as dag:
    # Submit the SparkApplication defined in new-spark-pi.yaml through the Spark operator
    t1 = SparkKubernetesOperator(
        task_id='n-spark-pi',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file="new-spark-pi.yaml",
        namespace="spark-jobs",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        do_xcom_push=True,
        dag=dag
    )
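The SparkKubernetesSensor imported above is not used in this minimal DAG. A hedged sketch of how it could be chained after t1, inside the same with DAG block, to wait for the submitted application to finish; the task ID and XCom template here are illustrative:
    # Hypothetical follow-up task: watch the SparkApplication submitted by t1 until it completes.
    # The application name is pulled from the XCom pushed by the operator (do_xcom_push=True).
    t2 = SparkKubernetesSensor(
        task_id='n-spark-pi-monitor',
        namespace='spark-jobs',
        application_name="{{ task_instance.xcom_pull(task_ids='n-spark-pi')['metadata']['name'] }}",
        kubernetes_conn_id='myk8s',
        api_group='sparkoperator.k8s.io',
        api_version='v1beta2',
        attach_log=True,
        dag=dag
    )
    t1 >> t2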
Sample new-spark-pi.yaml file:
new-spark-pi.yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.4"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar"
  sparkVersion: "2.4.4"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.4
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.4
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
- Once the DAG and Spark application file are pushed to the configured repo, Airflow automatically picks up the job and starts processing.
This can be verified by viewing the Spark driver pod log with the command shared previously (kubectl -n spark-jobs logs -f spark-pi-driver).
At this point, the development process can start. Adding a PVC for Postgres to preserve all the data in Airflow is recommended. Additionally, depending on your requirements, consider adding PVCs to all necessary pods, including the Spark application.
Adding a PVC for Postgres
- To add the PVC, set enabled to true under the persistence section
- Add the storageClass (for example, rook-cephfs)
- Add the size according to requirements
A minimal values.yaml sketch for these settings follows.
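Assuming the chart's embedded PostgreSQL subchart, the persistence section might look roughly like this; the key paths, storage class name and size below are placeholders to adapt to your cluster:
postgresql:
  persistence:
    enabled: true
    storageClass: rook-cephfs   # hypothetical storage class; use whatever your cluster provides
    size: 8Gi                   # size according to requirements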
Conclusion
Spark is a powerful data analytics platform that empowers you to build and deliver machine learning applications with ease. With Kubernetes, you can automate containerized application hosting and optimize resource use in clusters. By setting up Spark instances on K8s clusters, businesses can unlock a seamless and well-documented process that streamlines data workflows.