Google Extends Kubernetes Service to Safely Run Agentic AI Workloads
Google this week at the KubeCon + CloudNativeCon North America 2025 conference revealed it is making available a technical preview of a sandbox capability on the Google Kubernetes Engine (GKE) service that can be used to run and secure agentic artificial intelligence (AI) workloads.
Additionally, Google is now making available a GKE Inference Gateway that reduces Time-to-First-Token (TTFT) latency by as much as 96% and token costs by as much as 25%. Google has also added a Pod Snapshots capability that makes it simpler to restore pods in the event of a node failure.
Google is also adding a GKE Buffers application programming interface (API) that provides near-instant access to compute capacity, along with an Autopilot compute class for standard clusters, to streamline provisioning of infrastructure resources.
Finally, Google announced that Google Cloud is doubling the capacity of GKE to now support clusters of up to 130,000 nodes.
Dave Bartoletti, a senior product manager for Google Cloud, said clusters of that size would primarily be used to train massive large language models (LLMs), but they do show what can be achieved in terms of the size and scale of a Kubernetes environment that is increasingly being used to run AI workloads.
The GKE Agent Sandbox, meanwhile, makes it possible to securely isolate AI agent workloads both as they are developed and after they are deployed in a production environment, said Bartoletti. It is based on gVisor, an open source sandbox for Linux platforms that Google previously made available to isolate workloads at the kernel level.
That capability is especially critical for AI agents, which may behave in unpredictable ways; deploying them in a sandbox environment ensures they remain isolated, said Bartoletti.
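GKE already exposes gVisor isolation to workloads through a Kubernetes RuntimeClass, and the Agent Sandbox builds on that same layer. Google has not published the full Agent Sandbox API in this preview, so the sketch below only illustrates the underlying mechanism: scheduling a pod onto gVisor-isolated nodes by setting its RuntimeClass, using the Python Kubernetes client. The pod name and container image are hypothetical, and the cluster is assumed to have a Sandbox-enabled node pool.

```python
# A minimal sketch (not the Agent Sandbox API itself): run a container under
# GKE's gVisor sandbox by requesting the "gvisor" RuntimeClass in the pod spec.
# Assumes a GKE cluster with a Sandbox-enabled node pool and local kubeconfig
# credentials; the pod name and container image below are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="agent-sandbox-demo"),
    spec=client.V1PodSpec(
        runtime_class_name="gvisor",  # routes the pod to gVisor-isolated nodes
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="agent",
                image="us-docker.pkg.dev/example-project/agents/agent:latest",  # hypothetical
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "500m", "memory": "512Mi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Because gVisor intercepts system calls in a user-space kernel, an agent that attempts an unexpected operation is contained within the sandbox rather than reaching the host node, which is what makes this model attractive for code generated or executed by AI agents.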
Google is making a case for using the GKE service to build and deploy AI applications, providing IT teams with access to CPUs, graphics processing units (GPUs) and the Tensor Processing Units (TPUs) that Google developed as a lower-cost alternative to GPUs. Earlier this week, Google added a set of next-generation Axion CPUs and Ironwood TPUs to Google Cloud. Each Ironwood TPU delivers 4,614 FP8 TFLOPS and pairs 192 GB of HBM3E memory with bandwidth reaching 7.37 TB/s; the TPUs can be accessed via pods that contain as many as 9,216 AI accelerators, delivering a combined 42.5 ExaFLOPS of FP8 processing capability.
Clusters on the GKE service also take advantage of DRANET, an open source Kubernetes network driver that uses Dynamic Resource Allocation (DRA) to assign high-performance network interfaces directly to workloads, noted Bartoletti.
It’s not clear what percentage of AI workloads are running on Kubernetes clusters. Many initial AI projects were built by data science teams that often lacked any Kubernetes expertise. However, as IT teams become more involved in deploying AI inference engines, the number of AI workloads running on Kubernetes clusters, which are designed to scale up and down dynamically in ways that help reduce costs, is steadily increasing.
The challenge, of course, is finding IT professionals who not only have experience managing Kubernetes clusters but also understand the nuances of the AI workloads deployed on them.



