NVIDIA’s ComputeDomains Aims to Simplify Multi-Node NVLink for Kubernetes
NVIDIA is pushing deeper into the world of high-performance AI infrastructure, unveiling a Kubernetes-native abstraction called ComputeDomains that promises to ease a major difficulty in today's AI development: enabling secure, high-bandwidth GPU communication across multiple server nodes.
ComputeDomains addresses a challenge at the center of NVIDIA's broader strategy for supporting systems such as the GB200 NVL72, a rack-scale AI system geared for training and inference of extremely large AI models. The problem is that traditional single-node setups confine GPUs to one server chassis, limiting scalability. NVLink removes that barrier by forming a unified GPU fabric, essentially turning a rack into a single accelerated compute plane. But while the hardware has advanced quickly, Kubernetes hasn't historically understood how to coordinate workloads that depend on these high-speed fabrics. That's the gap ComputeDomains attempts to close.
A Dynamic Process
At a high level, ComputeDomains gives Kubernetes the awareness it needs to manage NVLink-connected GPUs without relying on rigid, pre-configured cluster connections. The feature is built into NVIDIA’s Dynamic Resource Allocation (DRA) driver. DRA allows workloads to request GPUs dynamically, and now, with ComputeDomains layered on top, those workloads gain access to the cross-node memory operations that NVLink enables.
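In practice, a distributed job declares its need for cross-node NVLink connectivity as a Kubernetes resource. The sketch below is illustrative, based on NVIDIA's published DRA driver examples; the API version, field names, and resource names are assumptions and may differ across driver releases:

```yaml
# Illustrative ComputeDomain resource (API version and field names are
# assumptions based on NVIDIA's DRA driver examples, not a definitive spec).
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: training-job-domain
spec:
  # Number of nodes the workload's pods are expected to span.
  numNodes: 4
  channel:
    # Generates a ResourceClaimTemplate; pods reference it to receive
    # an IMEX channel when they are scheduled.
    resourceClaimTemplate:
      name: training-job-imex-channel
```

The key design point is that nothing here names specific nodes: the IMEX domain is formed around wherever the scheduler actually places the pods.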
The value of this abstraction becomes clearer when looking at what happens under the hood. NVIDIA’s Internode Memory Exchange Service (IMEX) is the driver-level mechanism that manages GPU memory permissions across nodes. In earlier generations, IMEX domains had to be configured manually, forcing operators to assign workloads to specific nodes. That inflexibility worked against Kubernetes’ design principles of elasticity and fault isolation.
ComputeDomains extends IMEX into the Kubernetes control plane. Now, when a distributed job is scheduled, the platform automatically creates an IMEX domain around whichever nodes the pods land on. When the job finishes, the domain is torn down. The whole process is dynamic and workload-aware.
Optimizes Capacity, Security
NVIDIA has validated the model on DGX systems using the GB200 architecture, and the company says it will scale to future deployments, including systems leveraging much larger NVLink fabrics.
In practice, this means developers running multi-node PyTorch or TensorFlow jobs can rely on cross-GPU bandwidth without needing to understand how IMEX channels map to the cluster.
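For example, each worker pod in such a job would reference the domain's generated claim template in its spec, so that scheduling the pod automatically enrolls its node in the IMEX domain. This fragment is a hedged sketch under the same assumptions as above; the image tag and claim names are placeholders:

```yaml
# Illustrative worker pod fragment (assumed field names). Referencing the
# ComputeDomain's claim template places this pod inside the dynamically
# created IMEX domain alongside the job's other workers.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-worker-0
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.10-py3   # placeholder image tag
    resources:
      claims:
      - name: imex-channel
  resourceClaims:
  - name: imex-channel
    resourceClaimTemplateName: training-job-imex-channel
```

From the framework's perspective, the training script is unchanged; the cross-node GPU memory permissions are handled underneath by IMEX.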
For enterprises trying to maximize GPU utilization, dynamic NVLink allocation reduces idle capacity and prevents the resource fragmentation that often plagues large clusters. For security-sensitive environments, ComputeDomains creates isolated communication zones so workloads can’t access GPU memory from neighboring jobs.
Bringing NVLink Awareness Into Kubernetes
With AI models growing more complex and inference consuming an ever-larger share of workloads, GPU interconnect bandwidth now acts as a limit on overall performance. NVLink's ability to deliver consistently higher throughput than PCIe remains essential, but only if orchestration layers can tap into that performance.
NVIDIA has been quick to raise the new feature's profile: its KAI scheduler and DGX Cloud Lepton service are already incorporating ComputeDomains as a standard layer.
Installation of the new DRA driver requires Kubernetes 1.32 or later, along with Container Device Interface (CDI) support. NVIDIA says the feature is under rapid development, with forthcoming updates planned to boost elasticity and fault tolerance.
In the broader story of AI infrastructure, ComputeDomains highlights a trend that’s becoming increasingly clear: As GPUs evolve into tightly coupled, multi-node systems, the orchestration stack must evolve with them. By bringing NVLink awareness directly into Kubernetes, NVIDIA is offering a bridge between cutting-edge hardware and the container-native workflows that dominate modern AI development.