Kubernetes Could Use a Different Linux Scheduler
A pair of Cambridge researchers have found a way to squeeze 10–20% more capacity from Kubernetes clusters, simply by tweaking the way the Linux kernel schedules jobs.
During a talk at FOSDEM, University of Cambridge Professor Richard Mortier described the work, the bulk of which he admitted was completed by his graduate student Al Amjad Tawfiq Isstaif, who couldn’t make the conference.
The research came about in trying to understand why Kubernetes clusters were not running as efficiently on the hardware as they could be.
In short, they found that the built-in Linux scheduler is ill-suited to running Kubernetes jobs, especially serverless tasks. By changing the scheduling algorithm to one better suited to completing many short-running tasks, the researchers increased CPU utilization.
CFS is Not Always ‘Fair’
These days, we expect a modern processor to juggle multiple tasks on behalf of the operating system. How does it choose which jobs to execute? That’s the job of the OS scheduler.
For Linux, this is currently the role of the Completely Fair Scheduler (CFS), first implemented in Linux 2.6.23 in 2007. When CFS was added, Linux was (and still is) widely used for running web servers, which are essentially sets of long-running daemons.
The scheduler relies heavily on cgroups to group together the many different types of jobs it has to run. Cgroups are the Linux kernel feature that provides a way to set resource utilization limits: CPU shares per process, as well as memory, I/O, and network resources for each group. A single cgroup can bundle multiple tasks, each one run from a container in a pod.
CFS’ own idea of fairness comes down to a single rule: Prioritize the task that has thus far been executed the least. This way, it minimizes the wait time as much as possible for all jobs.
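That rule can be sketched in a few lines. This is a toy simulation, not kernel code: the real CFS keeps tasks in a red-black tree keyed on virtual runtime, but a min-heap captures the same "least-run task goes next" behavior.

```python
import heapq

def schedule(tasks: dict[str, int], slices: int, quantum: int = 4) -> list[str]:
    """Simulate CFS-style picking: tasks maps name -> runtime so far (ms).

    Each round, the task that has run the least gets a quantum; returns
    the order in which tasks were picked.
    """
    heap = [(runtime, name) for name, runtime in tasks.items()]
    heapq.heapify(heap)
    order = []
    for _ in range(slices):
        runtime, name = heapq.heappop(heap)  # least-run task wins
        order.append(name)
        heapq.heappush(heap, (runtime + quantum, name))  # charge it a slice
    return order

# A freshly arrived task keeps getting picked until it catches up with
# the long-running daemon.
print(schedule({"daemon": 12, "new": 0}, slices=4))
# ['new', 'new', 'new', 'daemon']
```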
Fair, right?
Kubernetes jobs tend to be more densely packed together than the LAMP stacks of yore, though. The appeal of Kubernetes is, after all, the ability to quickly spin up and run jobs.
Following orders, CFS rotates through these jobs, unaware of the rising overhead they are incurring.
“If you have a lot of tasks hanging around because they’re not getting completed, you also have a higher rate of context switching,” Mortier said in the presentation. “You get this kind of multiplicative effect where you end up increasing the overhead in the system quite substantially.”
Serverless jobs are especially hard-hit by CFS. “Serverless is a particularly poor workload in the sense that it will see this problem quite badly,” Mortier added.
Kubernetes Workloads Hit Different
The researchers reasoned that the trick is to cut down on context switching, the computational tax incurred when switching the CPU’s attention from one task to another: the old task’s state must be saved to memory, the new task’s state loaded, and the scheduling tree rebalanced.
“As you’re trying to pack more things onto the system, the performance goes down, and it’s going down as the average time per context switch is increasing,” Mortier said.
CFS gives each task an initial time slice of about 4 milliseconds. Each context switch between two tasks costs 10-20 microseconds.
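A back-of-envelope sketch using those figures shows why longer run queues hurt. The defaults here are illustrative assumptions, not the researchers' measured values: CFS divides a target latency period among runnable tasks (down to a minimum granularity), so more tasks mean shorter slices and a larger fraction of time lost to switching.

```python
def switch_overhead(nr_tasks: int,
                    target_latency_ms: float = 24.0,
                    min_slice_ms: float = 0.75,
                    switch_cost_us: float = 15.0) -> float:
    """Fraction of CPU time lost to context switching (illustrative numbers).

    With 6 tasks and a 24 ms target latency, each slice is ~4 ms, matching
    the figure quoted above; the switch cost is taken as ~15 us.
    """
    slice_ms = max(target_latency_ms / nr_tasks, min_slice_ms)
    return switch_cost_us / (slice_ms * 1000 + switch_cost_us)

for n in (6, 12, 32):
    print(f"{n:3d} tasks -> {switch_overhead(n):.2%} of CPU lost to switching")
```

The absolute percentages stay small, but they grow with queue length, and each switch also leaves behind cold caches, a cost this simple model does not count.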
An Adjustment to the Scheduler
The researchers’ new algorithm, called Latency-Aware Group Scheduling (LAGS), has the chief goal of maximizing the task completion rate, thereby minimizing the number of context switches required.
LAGS’ mission is to “get the light tasks out of the way. Keep the run queues short. Don’t waste your time doing lots of system bookkeeping overheads,” Mortier said.
The scheduler prioritizes the tasks in the cgroup with the shortest time left to complete execution. It does this using CGroup Load Credits, a mechanism that tracks CPU usage by thread.
“By prioritising task completion over strict fairness, the enhanced scheduler is able to drain contended CPU run queues more rapidly and reduce time lost to context switching,” the researchers wrote.
It is less fair to all the tasks, but by pushing to completion the tasks nearest to done, the algorithm reduces the number of tasks on the server’s plate, opening up resources for longer-running jobs.
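The core idea can be sketched as a shortest-remaining-time-first drain. This is a hypothetical illustration of the principle, not the researchers' implementation, and the remaining-time estimates are assumed inputs (in the paper they come from the CGroup Load Credits accounting).

```python
import heapq

def drain_order(tasks: dict[str, int]) -> list[str]:
    """tasks maps name -> estimated remaining ms; run shortest-remaining first.

    Light tasks leave the queue quickly, shrinking the run queue and the
    bookkeeping that comes with it.
    """
    heap = [(remaining, name) for name, remaining in tasks.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)  # run it to completion, freeing a queue slot
    return order

# Short serverless functions drain first; the long batch job runs last,
# but now with far fewer tasks left to context-switch against.
print(drain_order({"batch-job": 900, "fn-a": 3, "fn-b": 7}))
# ['fn-a', 'fn-b', 'batch-job']
```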
A Multi-Core Scheduling Problem
While saving a few microseconds of context switching here or there may not seem like much, they can add up to actual hardware savings over time.
In their benchmark, the University of Cambridge researchers established a baseline workload spread across 600 containers. With no scheduling magic, this test workload could most effectively be run across 16 nodes, with about a 30% utilization rate across the 60 cores being used.
CFS allowed the researchers to multiplex the work so that it could be run on only 12 cores, each running at 45% efficiency, before performance degradation kicked in.
But CFS-LAG, applied as a kernel patch, came in with even better numbers: The same work could be executed with only 10 cores, running at 55% efficiency each.
“So you get the same amount done, but you only need 10 nodes instead of 14 nodes to achieve that,” Mortier said.
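Treating the reported core counts and utilization figures as given, a quick sanity check shows why the totals work out: the useful work delivered is roughly cores times utilization, and it stays about constant as the same workload is packed onto fewer, busier cores.

```python
def useful_cores(cores: int, utilization: float) -> float:
    """Effective cores of work delivered at a given utilization rate."""
    return cores * utilization

# Figures reported above: 12 cores at 45% under CFS vs. 10 cores at 55%
# under the CFS-LAG patch.
for name, cores, util in [("CFS multiplexed", 12, 0.45),
                          ("CFS-LAG patch", 10, 0.55)]:
    print(f"{name}: {cores} cores at {util:.0%} -> "
          f"~{useful_cores(cores, util):.1f} cores of useful work")
```

Both configurations deliver roughly five and a half cores' worth of work; the LAG variant simply needs less hardware to do it.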
Next Steps for a New Scheduler
The good news is that you don’t need to throw out the CFS baby with the bathwater: LAGS could be implemented as a sub-scheduler architecture (CFS-LAGS) applied only to designated cgroups.
In user space, configuration would involve working with the cgroup interface and kernel flags.
Of course, getting changes into the canonical upstream CFS would be a heroic undertaking. In the meantime, interested parties could write their own patches, or look at an eBPF implementation.
The researchers’ work is summarized in an arXiv paper, “Mitigating context switching in densely packed Linux clusters with Latency-Aware Group Scheduling.”


