Kubernetes Begins Work on Pod Checkpoint/Restore
Kubernetes may be getting a feature that has long been enjoyed by supercomputer users: checkpoint/restore.
Just as the name indicates, checkpoint/restore is the ability to bookmark a distributed workload so that if even a single server goes down and disrupts the work, the job can be reloaded and resumed from the most recent checkpoint.
That beats starting the program anew: for long-running jobs in the high performance computing (HPC) community, a restart from scratch can mean losing weeks’ worth of work.
But in Kubernetes, the feature would offer another value: better utilization of resources.
No Checkpoints in Kubernetes
Surprisingly, Kubernetes has not had this feature until now, at least not in any easily usable form.
But Adrian Reber, a Red Hat senior principal software engineer, along with some other volunteers, has set up the Kubernetes Checkpoint Restore Working Group to bring checkpoints to the open source orchestration engine, as Reber explained in a FOSDEM talk Saturday.
Reber himself has been working on the problem for several years. He and several like-minded colleagues contributed a checkpoint/restore feature for containers to Kubernetes in 2022.
That feature, which graduated to beta in Kubernetes 1.30, is mostly used for debugging and other forensics work.
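Under the hood, the kubelet exposes this as a per-container API endpoint: POST /checkpoint/{namespace}/{pod}/{container} on the kubelet's authenticated port. The Go sketch below shows roughly what triggering it looks like; the node and pod names are placeholders, and a real call would present kubelet client certificates rather than skipping TLS verification.

```go
// Rough sketch: triggering a container checkpoint through the
// kubelet's per-container endpoint. Node and pod names are
// placeholders; a real call needs kubelet client certificates.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// The kubelet serves POST /checkpoint/{namespace}/{pod}/{container}
	// on its authenticated port (10250 by default).
	url := "https://node-1:10250/checkpoint/default/my-pod/my-container"

	// Skipping server verification only to keep the sketch short.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	resp, err := client.Post(url, "application/json", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// On success, the kubelet writes a checkpoint archive under
	// /var/lib/kubelet/checkpoints/ and returns its path.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```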
A Working Group for Checkpoint/Restore
The team behind that feature, Reber says, “was a really small group. It was difficult for people from the outside understanding what we did and why we did it.” He needed a formal working group to bring in more contributors.
In May, Reber submitted a pull request to the Kubernetes community to start a new working group for the effort, and it was approved later that year. The first meeting attracted about 25 people, signaling broader interest in such a feature.
Most of the work thus far has gone into setting up the group itself: establishing the Slack channel, a mailing list, a regular meeting time (Thursdays at 6 p.m. CET) and other supporting infrastructure.
Benefits of Checkpoint/Restore
Kubernetes was originally developed for running stateless workloads, so the need for checkpoints was fairly minimal. If something went awry, the pod could simply be restarted.
Today’s Kubernetes production workloads, especially AI workloads, tend to be longer-running and stateful, so fault tolerance is becoming a necessity for the applications themselves.
But a proper checkpoint/restore can be beneficial in other ways for a cloud native ecosystem, Rebe argues.
For one, it can be used to shorten start-up times, by having the initial configuration state pre-prepared for run time.
Java users should take note here.
“You initialize the Java application, you take a checkpoint, and then you restore it from the checkpoint, then it will start much faster,” Reber says.
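The same pattern applies to any runtime with an expensive warm-up phase, not just Java. Here is a minimal, purely illustrative sketch in Go (the 30-second init is an invented stand-in for class loading, JIT warm-up and cache priming): pay the initialization cost once, and checkpoint at the ready point so every restore skips it.

```go
// Minimal sketch of the warm-up-then-checkpoint pattern Reber
// describes. The numbers and names are invented for illustration.
package main

import (
	"fmt"
	"time"
)

// expensiveInit stands in for class loading, JIT warm-up,
// cache priming and similar one-time startup costs.
func expensiveInit() map[string]string {
	time.Sleep(30 * time.Second)
	return map[string]string{"config": "loaded"}
}

func main() {
	state := expensiveInit()

	// A checkpoint taken here (externally, e.g. via the kubelet
	// endpoint shown earlier) captures the fully warmed-up process,
	// so every restore skips expensiveInit entirely.
	fmt.Println("ready:", state["config"])

	// Keep serving; a restored instance resumes from this point.
	select {}
}
```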
Most promisingly, checkpointing could also help with resource allocation: it provides a foundation for stopping a workload and redistributing it when more suitable resources become available elsewhere.
The industry’s hunger for GPUs has spiked interest in checkpoint/restore for this very reason, Reber says.
Checkpointing Pods
Container checkpointing was a good start, but being able to checkpoint entire pods is really what would make this feature useful.
“Kubernetes is more about pods. There are plenty of resources connected to pods,” Reber told the FOSDEM crowd.
The group is now crafting a Kubernetes Enhancement Proposal (KEP) to describe how a pod checkpointer would work and how it should be designed. They must also develop a reference implementation: a proof of concept to show that the architecture actually works.
Reber wants to take an iterative approach, moving in small steps to ensure the basic functionality works. As a result, many pod resources may not be addressed in the initial releases.
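There is no pod-level API yet; defining one is exactly what the KEP is for. But to make that first iteration concrete, a naive pod checkpointer might do little more than walk a pod's containers and call today's per-container kubelet endpoint for each. The sketch below is purely illustrative and is not the working group's design; the pod names are placeholders and the checkpointContainer helper is invented.

```go
// Purely illustrative: what a naive first-iteration pod checkpointer
// might do with today's per-container kubelet endpoint. This is NOT
// the working group's design.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// checkpointContainer hits the kubelet's beta per-container endpoint.
// Production code would present kubelet client certificates instead
// of skipping TLS verification.
func checkpointContainer(node, ns, pod, container string) error {
	url := fmt.Sprintf("https://%s:10250/checkpoint/%s/%s/%s",
		node, ns, pod, container)
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Post(url, "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("checkpointing %s failed: %s", container, resp.Status)
	}
	return nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Look up the pod to find the node it runs on and its containers.
	pod, err := client.CoreV1().Pods("default").Get(
		context.TODO(), "my-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Naive step one: checkpoint each container in turn. A real pod
	// checkpointer must also capture shared state (network namespace,
	// volumes) and take all checkpoints at a consistent point in time.
	for _, c := range pod.Spec.Containers {
		if err := checkpointContainer(pod.Spec.NodeName, pod.Namespace,
			pod.Name, c.Name); err != nil {
			panic(err)
		}
	}
}
```

Even this toy loop makes the hard part visible: per-container checkpoints say nothing about the pod's shared network namespace or volumes, which is why the KEP and reference implementation are needed in the first place.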
Ultimately, however, the goal would be to integrate checkpoint/restore into the Kubernetes scheduler itself, so it can move pods around to different nodes for the best fit, efficiency-wise.
“We’re welcoming everybody who’s interested in the topic to join us, help us move this forward in Kubernetes, and hopefully we can see it implemented pretty soon,” he says.


