Netflix Found a Faster Way to Load Containers
The initial appeal of containers was hardware agnosticism: what runs on your machine runs in production, as long as both run on x86 CPUs. This interoperability is a big factor in the scalability of Kubernetes.
But when speed really matters, understanding the hardware you run on can make a difference.
When Netflix upgraded its container runtime from Docker to the open source containerd, it noticed some nodes were stalling out as operations scaled up.
This was a major problem. When a Netflix user decides what to play, it triggers hundreds of containers behind the scenes. So it was quite important to ruthlessly track down even the tiniest performance bottleneck.
The culprit turned out to be the way Netflix initialized containers. But the bug only showed itself in certain cases. With a bit of sleuthing, Netflix engineers found the delay was worse on some CPU architectures than others, according to a blog post.
“Understanding and optimizing both the software stack and the hardware it runs on is key to delivering seamless user experiences at Netflix scale,” wrote Netflix Senior Software Engineer Andrew Halaney and Netflix Senior Performance Engineer Harshad Sane, in an engineering blog post, “Mount Mayhem at Netflix: Scaling Containers on Modern CPUs.”
Long Boot Times
Like most Kubernetes operations, a Netflix application scales until it maxes out a node, then procures a new instance and continues scaling there.
After about 100 containers, however, Netflix found that its servers were starting to slow. A health check that read the mount table (a list of everything mounted on a server) would take 30 seconds or longer to complete. This was problematic because Linux dedicates the CPU to checking whether the operation has completed, delaying everything else in the processing queue.
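Netflix has not published the health check itself, but the slow operation, reading the kernel's mount table, can be sketched in a few lines of Python (the function name and timing harness here are illustrative, not Netflix's code):

```python
import time

def read_mount_table(path="/proc/self/mounts"):
    """Parse the kernel's mount table into (device, mountpoint, fstype) tuples.

    Every mount on the host produces one line; the more mounts a node
    accumulates, the more work this read represents.
    """
    entries = []
    with open(path) as f:
        for line in f:
            device, mountpoint, fstype, *_ = line.split()
            entries.append((device, mountpoint, fstype))
    return entries

start = time.monotonic()
mounts = read_mount_table()
print(f"{len(mounts)} mounts read in {time.monotonic() - start:.4f}s")
```

On a healthy node this loop typically finishes in a fraction of a second, which is what made the 30-second readings such a clear sign that something pathological was happening underneath.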
Too Many UIDs
Under the older Docker setup, containers typically shared the host’s user ID (UID) pool. When Netflix migrated to containerd—the runtime responsible for executing the Kubelet’s management tasks—it shifted to a more secure User Namespace model. In this setup, each container is isolated with its own unique UID range.
However, this approach required the kernel to individually ‘idmap’ every layer of a container image to that specific range. For a multi-layered image, this meant a massive spike in kernel calls just to instantiate a single container.
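Conceptually, a user-namespace mapping is just an offset table, the same shape as an entry in /proc/&lt;pid&gt;/uid_map. A minimal sketch of the translation, using an illustrative base of 100000 rather than Netflix's actual range:

```python
def map_uid(container_uid, host_base=100000, count=65536):
    """Translate a container UID to the host UID it appears as, mirroring a
    uid_map entry of "0 100000 65536" (illustrative values, not Netflix's).

    Container UIDs 0..65535 show up on the host as 100000..165535.
    """
    if not 0 <= container_uid < count:
        raise ValueError("UID outside the mapped range")
    return host_base + container_uid

print(map_uid(0))     # container root appears as host UID 100000
print(map_uid(1000))  # an unprivileged container user -> 101000
```

The arithmetic itself is cheap; the expense at Netflix's scale came from the kernel work of creating a separate idmapped mount for every image layer so this translation applies to each one.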
When assembling a container’s root filesystem, containerd is very needy with its requests for kernel-level locks, and that neediness dominates the CPU’s time, especially for containers with more than 50 layers.
“If a node is starting many containers at once, every CPU ends up busy trying to execute these mounts,” the pair of Netflix engineers wrote. “Any system trying to quickly set up many containers is prone to this, and this is a function of the number of layers in the container image.”
Spinning up 100 containers, for instance, would require 20,200 mounts, each needing its own system calls into the kernel.
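The scaling works out as in this back-of-the-envelope sketch; the 202-mounts-per-container figure is inferred from the 20,200 total for 100 containers, not a number Netflix published directly:

```python
def total_mounts(containers, mounts_per_container):
    """Total mount operations when every container pays a per-layer cost:
    the total grows with both container count and image depth."""
    return containers * mounts_per_container

# ~202 mounts per container is inferred from the 20,200 figure above.
print(total_mounts(100, 202))  # 20200 under the old per-layer scheme
print(total_mounts(100, 1))    # 100 once each container needs one mount
```

The same arithmetic explains why the eventual fix, which cut the per-container cost to a constant, was so effective.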
Multi-Core Engineering
Netflix is not alone in suffering from this issue. Meta engineers, working with thousands of containers for AI inferencing, have bemoaned the performance drops that came along with managing containers with thousands of UIDs.
In fact, this has been a common system design issue in multi-core processing. Too often, data structures become the bottleneck, argued one widely cited paper from the 2010 USENIX Symposium on Operating Systems Design and Implementation (OSDI). Those researchers found that many commonly used applications — including Exim, memcached, Apache, PostgreSQL, and MapReduce — suffer from kernel-level bottlenecks on multi-core servers.
Linux kernel developer Christian Brauner, a lead developer of the kernel’s next-generation mount API, has long argued that the mount() system call is ill-suited to modern containers, suggesting that file-descriptor-based mounting could be used instead.
The CPU Bottleneck
Things got even weirder when the engineering team looked at the differences between the hardware running these containers.
Predominantly, these timeouts were happening on the Intel Xeon-based AWS r5.metal instances (Skylake/Cascade Lake architecture, 96 virtual CPUs). They happened far less frequently on either the newer seventh-generation Intel-based m7i.metal-24xl or the AMD EPYC-based m7a.24xlarge, both of which were also used in Netflix’s Kubernetes deployments.
The engineering team developed a microbenchmark to compare the lock contention across different multi-core systems. They found that the older mesh architecture used in r5.metal chips was the bottleneck. The design struggled to synchronize the global mount lock across multiple cores, leading to massive cache-line contention.
The other AWS instances use a distributed architecture, in which clusters of cores each have their own local last-level cache. Lock contention is rarer in these designs.
“Centralized cache management amplified cache contention while distributed cache design smoothly scaled under load,” the engineers concluded.
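Netflix’s microbenchmark is not public, and Python’s GIL hides true cache-line effects, but the shape of the experiment, many threads hammering one global lock versus each thread using its own, can be sketched like this:

```python
import threading
import time

def bench(num_threads, iters, locks):
    """Run `iters` critical sections on each of `num_threads` threads.

    `locks` holds either one shared lock (every thread contends for it)
    or one lock per thread (uncontended); threads pick locks[i % len(locks)].
    """
    counters = [0] * num_threads

    def worker(i):
        lock = locks[i % len(locks)]
        for _ in range(iters):
            with lock:
                counters[i] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.monotonic() - start
    assert all(c == iters for c in counters)  # sanity: no lost updates
    return elapsed

n, iters = 8, 20_000
global_time = bench(n, iters, [threading.Lock()])
sharded_time = bench(n, iters, [threading.Lock() for _ in range(n)])
print(f"one global lock: {global_time:.3f}s, per-thread locks: {sharded_time:.3f}s")
```

The sharded variant typically completes faster because each acquisition is uncontended; the kernel’s global mount lock behaves like the first case, with every core fighting over the same cache line.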
The Fix Goes Upstream
The Netflix engineering team tackled the issue at the software level, namely by reducing the number of kernel system calls that containerd was making.
They created a pull request that changed how containerd used the global lock. The fix was made possible by Linux kernel 6.3, released in April 2023, which introduced support for recursive ID-mapped bind mounts.
Instead of taking the global lock to idmap each layer, containerd now performs a single recursive ID-mapped bind mount of the parent directory where all those layers reside.
“This makes the number of mount operations go from O(n) to O(1) per container, where n is the number of layers in the image,” the engineers wrote.
The PR, which included the metrics from the microbenchmark, was merged into containerd version 2.2, released in November.
But Netflix also considered the hardware, opting to route workloads away from r5.metal and to the architectures that scaled better under these conditions.
“This experience underscores the importance of holistic performance engineering,” they concluded.


