The Inference Bottleneck: Architecting Kubernetes Autoscaling for Production LLMs
Generative AI (GenAI) is moving into production, but native Kubernetes autoscaling is fundamentally broken for large language model (LLM) inference. LLM inference engines preemptively allocate GPU memory for the KV cache, so traditional horizontal pod autoscaler (HPA) memory metrics trigger false scaling events. This article breaks down how platform engineers must abandon CPU/memory metrics in favor of token-centric observability (TTFT, TPOT) and custom controllers to dynamically scale GPU workloads without burning cloud budgets.
The KV Cache Illusion: Why Memory Metrics Fail
To understand why the HPA fails LLMs, you have to understand how modern inference engines manage GPU memory.
When an inference server boots up, it loads the model weights into GPU VRAM and dedicates the remaining memory to the key-value (KV) cache. The KV cache stores the attention keys and values of tokens already processed, so the model doesn’t have to recompute attention over the entire prompt for every new token it generates.
Engines such as vLLM use a technique called PagedAttention: the server pre-allocates nearly all of the remaining GPU memory up front and manages it as fixed-size blocks, so KV cache entries can be packed densely without fragmentation.
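In practice, the fraction of VRAM the engine claims up front is an explicit serving flag. A minimal sketch of the container portion of a vLLM Deployment, assuming the upstream vllm/vllm-openai image (the model name, image tag, and resource values are illustrative):

```yaml
# Fragment of a Deployment pod spec; image tag, model name, and sizes are placeholders.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"
      # Claim ~90% of VRAM up front for weights plus paged KV cache blocks.
      - "--gpu-memory-utilization=0.90"
    ports:
      - containerPort: 8000   # OpenAI-compatible API; also serves Prometheus metrics
    resources:
      limits:
        nvidia.com/gpu: 1
```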
If you configure a standard Kubernetes HPA to scale an inference deployment when memory utilization hits 80%, the HPA will immediately trigger a scale-out event the second the pod starts. The inference engine is intentionally hovering at 90%+ memory utilization to maximize the KV cache. High memory usage here is not a distress signal; it is the desired state.
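To see the failure mode concretely, here is a hedged sketch of the memory-based HPA described above; with the engine deliberately sitting near full memory, this object would scale out immediately after startup (names and thresholds are placeholders):

```yaml
# Anti-pattern: a memory-utilization HPA pointed at an inference Deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-memory-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # Always breached: the engine pre-allocates its cache by design.
```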
Scaling based on CPU is equally flawed. The host CPU is merely passing tensors to the GPU via PCIe. The CPU might sit at 10% utilization while the GPU is bottlenecked with a massive continuous batching queue. The orchestrator is blind to the actual hardware doing the work.
Shifting the Paradigm: Token-Centric Observability
To scale LLMs efficiently, platform engineers must build autoscalers that observe the actual user experience. In the world of LLMOps, the only metrics that matter are token-centric.
- Time to First Token (TTFT): How long does the user wait before the model starts streaming the response? This is an indicator of the prefill phase latency.
- Time per Output Token (TPOT): How fast are subsequent words generated? This indicates decode phase latency.
- Continuous Batching Queue Depth: How many requests are waiting in the inference engine’s internal queue before they can be assigned a KV cache block?
To expose these metrics to Kubernetes, your inference engine must emit them via a Prometheus-compatible /metrics endpoint. Once Prometheus is scraping the queue-depth gauges and TTFT histograms, you can decouple your scaling logic from the Kubelet’s cAdvisor metrics entirely.
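If you run the Prometheus Operator, a ServiceMonitor aimed at the engine’s metrics port is one common way to wire this up; the labels, namespaces, and port name below are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: vllm              # Must match the labels on the inference Service.
  namespaceSelector:
    matchNames:
      - inference
  endpoints:
    - port: http             # Named Service port fronting the engine's metrics endpoint.
      path: /metrics
      interval: 15s
```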
Architecting the AI Autoscaler
The solution is to bypass the native resource metrics API and utilize custom metric adapters or purpose-built controllers.
Approach 1: KEDA and PromQL
The most immediate fix is deploying Kubernetes Event-driven Autoscaling (KEDA). Instead of watching the pod’s memory, KEDA queries your Prometheus server directly.
You can construct a ScaledObject that evaluates a PromQL query calculating the 95th percentile of TTFT over a rolling 60-second window. If the TTFT breaches your 500-millisecond SLA, KEDA instructs the underlying HPA to scale from 1 to N replicas.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: vllm_ttft_p95
        query: histogram_quantile(0.95, sum(rate(vllm:request_prompt_time_seconds_bucket[1m])) by (le))
        threshold: "0.5"
```
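The same ScaledObject can carry more than one trigger. A hedged sketch of a queue-depth trigger that could sit alongside the latency trigger above, assuming the engine exposes a waiting-request gauge such as vLLM’s vllm:num_requests_waiting (metric names vary by engine and version):

```yaml
# Additional entry under the same triggers list: scale on engine queue depth.
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
    metricName: vllm_queue_depth
    query: sum(vllm:num_requests_waiting)
    threshold: "20"          # Illustrative; tune to batch size and latency SLA.
```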
Approach 2: The Custom Controller Pattern
For high-scale enterprise environments, relying on Prometheus polling intervals (often 15 to 30 seconds) is too slow. If a burst of 1,000 complex prompts hits the API, you need sub-second scaling reactions.
This is where the industry is moving toward custom Kubernetes operators written in Go (similar to the logic behind open-source projects like the KubeAI autoscaler). Using controller-runtime, platform teams can build dedicated controllers that sit directly on the message bus or API gateway (such as Envoy), intercept request load in real time, and adjust the inference deployment’s replicas instantly, bypassing the metrics-scraping lifecycle entirely.
The Node Scaling Imperative (GitOps and Karpenter)
Scaling the pods is only half the battle. If your cluster has no available GPU nodes, your newly scaled inference pods will sit in a pending state indefinitely. Furthermore, keeping idle A100 or H100 GPUs running ‘just in case’ will incinerate your cloud budget.
Your token-centric pod autoscaler must be tightly coupled with a just-in-time node provisioner like Karpenter.
When the custom autoscaler requests a new pod, Karpenter instantly evaluates its nodeSelector, tolerations, and GPU resource requests (e.g., nvidia.com/gpu: 1), makes a direct API call to the cloud provider, and spins up a new GPU node in seconds. By wrapping this entire configuration (the inference deployment, the custom scaling thresholds, and the Karpenter NodePools) into a GitOps pipeline via ArgoCD, platform teams can guarantee reproducible, cost-efficient AI infrastructure.
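For reference, a hedged sketch of a GPU NodePool on AWS; the API group, label keys, instance families, and limits below are assumptions that depend on your Karpenter version and provider:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "p4d"]      # Illustrative GPU instance families.
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule         # Only pods tolerating the GPU taint land here.
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
  limits:
    nvidia.com/gpu: 16               # Cap on total GPUs this pool may provision.
  disruption:
    consolidationPolicy: WhenEmpty   # Reclaim nodes as soon as the pods scale back in.
    consolidateAfter: 5m
```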
The Era of the AI Platform Engineer
We can no longer treat Kubernetes as a dumb container scheduler.
Hosting GenAI requires deep mechanical sympathy between the orchestrator, the GPU hardware and the application layer.
By abandoning traditional CPU/memory HPA triggers, exposing token-centric metrics and architecting custom autoscaling controllers, we prevent the ‘inference bottleneck’.
The evolution of the DevOps engineer into the AI platform engineer is no longer a future prediction; it is an immediate operational necessity.


