CNCF Expands Efforts to Run AI Inference Workloads on Kubernetes Clusters
This week at the KubeCon + CloudNativeCon Europe conference, the Cloud Native Computing Foundation (CNCF) and Red Hat revealed that the llm-d framework, created to deploy artificial intelligence (AI) workloads across distributed computing environments based on Kubernetes clusters, has been contributed to the consortium.
Additionally, the CNCF has published a stricter set of Kubernetes AI Requirements (KARs) as part of a Kubernetes AI Conformance Program that seeks to ensure AI inference engines can run at scale on Kubernetes clusters. Stable in-place pod resizing, which lets inference models adjust their resources without needing to restart, and workload-aware scheduling to avoid resource deadlocks during distributed training are now mandatory requirements.
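In-place pod resizing is expressed through a resize policy in the pod spec. A minimal, illustrative manifest (the pod name and image are hypothetical, and the exact behavior depends on the cluster's Kubernetes version having in-place pod vertical scaling enabled) might look like:

```yaml
# Illustrative pod spec: resizePolicy marks which resources may be
# changed in place without restarting the container. Assumes a
# Kubernetes version with in-place pod vertical scaling enabled.
apiVersion: v1
kind: Pod
metadata:
  name: inference-server          # hypothetical name
spec:
  containers:
  - name: model
    image: example.com/llm-server:latest   # hypothetical image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired  # CPU can be resized in place
    - resourceName: memory
      restartPolicy: NotRequired
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        cpu: "8"
        memory: 32Gi
```

A controller or operator can then patch the pod's resource requests and limits, and the kubelet applies the change without recreating the container, which is what lets an inference server grow or shrink without dropping in-flight requests.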
Other technical benchmarks added in v1.35 of the KAR specification include support for high-performance pod-to-pod communication, advanced inference ingress capabilities and disaggregated inference support.
The CNCF also plans to move beyond self-assessments for KAR to create a specialized “Verify Conformance Bot” to provide more rigorous, third-party validation of a platform’s AI-readiness. Later this year, the program is also planning to expand to include Sovereign AI standards, focusing on enhanced sandboxing and data privacy.
The overall Kubernetes AI Conformance Program has also been expanded to validate that agentic AI workloads running in sandboxes on Kubernetes clusters can be ported across multiple Kubernetes environments.
Finally, the CNCF this week revealed that OVHcloud, SpectroCloud, JD Cloud and China Unicom Cloud are now KAR certified, bringing the total number of certified platforms to 31 since the initiative was launched last November.
Jonathan Bryce, executive director for the CNCF, said the consortium is trying to marshal multiple open source initiatives to further ensure that Kubernetes becomes the dominant platform for running AI inference workloads.
Those projects, in addition to the llm-d framework, range from the Kube Resource Orchestrator (KRO) project initially launched by Amazon Web Services (AWS) and the Kueue job queueing system to a collaboration with NVIDIA on the newly launched open source NVIDIA AI Cluster Runtime (AICR), which will create recipes for automatically deploying AI workloads.
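For context, Kueue gates job admission through queue objects defined as Kubernetes custom resources. A minimal, illustrative setup (the flavor, queue and namespace names are assumptions, as are the quota values) might look like:

```yaml
# Illustrative Kueue setup: a ClusterQueue holding a cluster-wide GPU
# quota and a namespaced LocalQueue that teams submit jobs to.
# All names and quota figures are hypothetical.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ai-training
spec:
  namespaceSelector: {}          # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: ai-training
```

Batch jobs then opt in with a `kueue.x-k8s.io/queue-name` label, and Kueue holds each job until the queue has quota for it, which is how distributed training runs avoid the partial-scheduling deadlocks the conformance requirements call out.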
The CNCF will continue to focus on core technologies that will enable IT operations teams to deploy AI workloads at scale, while sister consortia such as the PyTorch Foundation will develop open source tools for building AI models, said Bryce. In fact, the llm-d framework developed by Red Hat is an extension of the vLLM inference and serving engine that is currently being advanced under the auspices of the PyTorch Foundation.
Ultimately, Bryce noted, it is only a matter of time before more AI workloads are deployed on platforms designed to scale out rather than up. Kubernetes is an ideal platform for running those data-intensive workloads because it is designed to let workloads dynamically scale up and down as needed. That capability will prove crucial as more AI inference workloads are distributed to the network edge, many of them based on smaller large language models (LLMs) trained on private rather than public data, he added.
It is still early days so far as deploying AI inference workloads in distributed computing environments is concerned, but one thing is already clear: in the not-too-distant future, the volume of AI workloads is set to eclipse every other class of workload running on Kubernetes clusters.


