The Ultimate Guide to GPU Scaling With Karpenter
The era of “GPU at any cost” has officially ended. As AI and ML move from the research lab to production at scale, the focus is shifting from simply acquiring compute to orchestrating it with precision.
In the Kubernetes world, Karpenter has emerged as the superior tool for this shift. Unlike the legacy Cluster Autoscaler, which relies on fixed node groups, Karpenter provisions nodes dynamically based on pending workloads.
If you are running GPU workloads on Amazon EKS, here is your definitive guide to scaling mistakes to stop making and what to do instead.
5 GPU Scaling Mistakes to Stop Making
Stop Using ASGs for GPUs
The Cluster Autoscaler relies on pre-defined Auto Scaling Groups (ASGs), forcing you to guess capacity in advance. If your ML training job needs a p4d instance and you only have an ASG for g4dn, your pod stays pending forever.
Karpenter bypasses ASGs entirely by provisioning capacity directly through the EC2 Fleet API, allowing it to launch the exact instance type a workload requests within seconds.
Stop Over-Constraining Instance Families
A common mistake carried over from the Cluster Autoscaler mindset is pinning a Karpenter NodePool to a single instance type like g5.xlarge. In the Spot market, this is a recipe for insufficient-capacity errors. Stop being picky.
Instead, define broad categories in your NodePool. For example:
- instance-category: ["g", "p"]
- instance-generation: {operator: Gt, values: ["3"]}
This gives Karpenter a wide menu to choose from, significantly increasing your chances of finding cheap Spot capacity.
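Put together, a NodePool expressing these broad requirements might look like the following sketch (it assumes the `karpenter.sh/v1` API and a matching EC2NodeClass; the names `gpu-pool` and `gpu` are placeholders):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool                 # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu                # assumes this EC2NodeClass exists
      requirements:
        # Broad categories instead of a single pinned instance type
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"]
        # Let Karpenter fall back to On-Demand when Spot is unavailable
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```

With `instance-category` and `instance-generation` instead of a fixed type, Karpenter can evaluate every eligible g- and p-family instance in the Spot market rather than waiting on one pool.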
Stop Trusting Default Resource Reporting for P5 & G6 Instances
There is a known architectural “gotcha” with the latest hardware: some NVIDIA P5 and G6 instances currently report a GPU count of 0 in the EC2 DescribeInstanceTypes API response.
If you don’t account for this, Karpenter might mistakenly schedule non-GPU pods onto these expensive nodes or fail to provision them for your actual ML jobs. Until this is patched, explicitly exclude these families from your general-purpose NodePools using the NotIn operator.
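One way to do this is an extra requirement in the general-purpose NodePool, sketched here as a fragment of the `requirements` list shown in Karpenter's NodePool schema:

```yaml
# Keep the affected families out of a general-purpose NodePool
# until the DescribeInstanceTypes data is corrected.
- key: karpenter.k8s.aws/instance-family
  operator: NotIn
  values: ["p5", "g6"]
```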
Stop Paying Cold-Start Costs From Image Pulls
Cold starts are especially costly for GPU workloads, and AI container images are notoriously bloated, often exceeding 10GB. If a pod spends 5 minutes in ImagePullBackOff while a GPU node sits idle, you burn expensive capacity before any work begins.
Stop relying on default image pulls during cold starts. Explore the Data on EKS (DoEKS) initiative for optimized blueprints, and consider peer-to-peer distribution tools like Dragonfly or Spegel to pull images from neighboring nodes instead of the registry.
For extremely large images, use lazy loading via Seekable OCI (SOCI) or eStargz to start containers before the full image downloads.
Stop Ignoring Voluntary and Involuntary Disruption Types
Karpenter is aggressive about efficiency, and without a clear distinction between voluntary and involuntary disruptions, teams can lose work unnecessarily or assume Karpenter is unsafe when it is actually behaving as designed.
- Voluntary disruptions: Initiated by Karpenter, such as consolidation, drift detection, or expiration, and are typically handled gracefully with advance notice.
- Involuntary disruptions: Triggered by external events such as Spot interruptions or hardware failures. These are abrupt and outside of Karpenter’s control.
If a job cannot tolerate stopping mid-run, configure Karpenter to avoid voluntary disruption for that workload.
8 Best Practices for GPU Scaling
Use Spot-to-Spot Consolidation (With Caution)
Most people know Karpenter can swap On-Demand capacity for Spot instances. But the real pro-move is using the SpotToSpotConsolidation feature. Note that this is still an experimental setting in current versions.
When enabled, Karpenter will continuously monitor the Spot market and swap your current Spot instance for a different Spot instance if it becomes cheaper.
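Because it is gated behind a feature flag, you enable it on the Karpenter controller rather than in a NodePool. A sketch of the relevant Helm values fragment, assuming a chart version recent enough to carry the gate:

```yaml
# Helm values fragment for the Karpenter chart.
# Assumes a version where the SpotToSpotConsolidation gate exists (v0.34+).
settings:
  featureGates:
    spotToSpotConsolidation: true
```

Note that Karpenter will only consolidate spot-to-spot when the NodePool is flexible to a meaningful number of instance types (at the time of writing, at least 15 candidates), so this pairs naturally with the broad-category requirements above.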
Pre-Fetch Images With EBS Snapshots
If you use Bottlerocket, AWS’s minimal, container-optimized operating system for Kubernetes, you can dramatically reduce GPU cold starts by using its dual-volume architecture.
Bottlerocket stores container images on a separate data volume, allowing you to pre-seed nodes with large ML images, snapshot that volume to EBS, and reference the snapshot ID in your EC2NodeClass so new GPU nodes boot with images already present.
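A sketch of what this looks like in an EC2NodeClass (the name, sizes, and snapshot ID are placeholders; the `snapshotID` field is part of the `karpenter.k8s.aws/v1` block-device schema):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-bottlerocket                   # placeholder name
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest           # illustrative; pin in production
  blockDeviceMappings:
    # OS volume (read-only in Bottlerocket's dual-volume design)
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 4Gi
        volumeType: gp3
    # Data volume holding container images, restored from a snapshot
    # of a volume that was pre-seeded with your large ML images
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        snapshotID: snap-0123456789abcdef0 # placeholder snapshot ID
```

New GPU nodes launched from this class boot with the images already on disk, so the kubelet skips the multi-gigabyte pull entirely.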
Use Time-Slicing, But Know Its Limits
Not all GPU workloads require exclusive device access. While large training jobs may need full A100 or H100 instances, many inference and dev/test workloads use only a fraction of available GPU capacity.
GPU time-slicing allows multiple workloads to share a single physical GPU, increasing density and utilization for these lighter use cases. Recent Karpenter releases added native support for multi-resource requests and scheduling capabilities, simplifying GPU slice placement.
Time-slicing provides no memory isolation. One pod can exhaust VRAM and impact all others. For production workloads requiring isolation, use MIG instead.
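Time-slicing is configured in the NVIDIA device plugin rather than in Karpenter. A minimal sketch of the plugin's sharing config, assuming the plugin is deployed to read a ConfigMap of this shape (the name and replica count are placeholders):

```yaml
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu
# replicas. No memory isolation: pods share VRAM on the same device.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # placeholder name
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```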
Use the do-not-disrupt Annotation
To protect those 48-hour training jobs from voluntary disruptions (like consolidation), use the karpenter.sh/do-not-disrupt: "true" annotation.
When you apply this to your pod, Karpenter will not touch the node through automated consolidation or drift until the pod completes or enters a terminal phase.
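The annotation goes on the pod template, not the Job object itself. A sketch (job name and image are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-llm                             # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Blocks Karpenter's voluntary disruption (consolidation,
        # drift) while this pod is running
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```

Remember this only guards against voluntary disruptions; a Spot interruption will still reclaim the node.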
Use Gang Scheduling for Distributed Training
Distributed training requires all pods to start at once. If Karpenter provisions 7 out of 8 nodes and the 8th fails, the 7 active GPUs sit idle, wasting expensive cycles.
Start using a gang scheduler like Kueue or the NVIDIA KAI Scheduler. These tools ensure Karpenter only provisions nodes if the entire group can be satisfied, preventing partial allocation waste.
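With Kueue, for example, the Job is submitted suspended and admitted only when quota for the whole gang is available, so Karpenter never provisions a partial set of nodes. A sketch, assuming a LocalQueue named gpu-queue already exists (names and image are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-training                          # placeholder name
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue      # assumed LocalQueue
spec:
  suspend: true          # Kueue unsuspends the Job only on admission
  parallelism: 8         # all 8 workers admitted together or not at all
  completions: 8
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-registry/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```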
Pin AMIs To Prevent Drift
GPU workloads are extremely sensitive to NVIDIA driver versions.
If you use a dynamic AMI alias like @latest, a new EKS AMI release could trigger a drift event, causing Karpenter to recycle your entire GPU fleet to update the drivers — potentially breaking CUDA compatibility.
To maintain a stable environment, always pin your AMI version (e.g., al2023@v20240807) in the EC2NodeClass.
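In the EC2NodeClass, this is a one-line change from the dynamic alias to a pinned release:

```yaml
# EC2NodeClass fragment: pin the AMI to a specific release
# instead of the drift-prone @latest alias.
spec:
  amiSelectorTerms:
    - alias: al2023@v20240807
```

Upgrades then become a deliberate, tested change to this field rather than a surprise fleet recycle.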
Configure EFA to Fix NCCL Timeouts
Running distributed training on P4 or P5 instances requires Elastic Fabric Adapter (EFA) for NCCL’s low-latency collective operations.
A common cause of NCCL timeouts in EFA-enabled clusters is a misconfigured security group. If the EC2NodeClass does not include a self-referential rule allowing all ports and protocols, NCCL traffic can be blocked.
To avoid this, ensure the EC2NodeClass references a security group that explicitly allows inbound traffic from itself.
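The rule itself lives in AWS rather than in Karpenter; expressed here as a hypothetical CloudFormation fragment (the logical ID `NodeSecurityGroup` is a placeholder for your node security group):

```yaml
# Self-referencing rule allowing all traffic between nodes in the
# same security group, as EFA requires for NCCL collectives.
NodeSecurityGroupSelfIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref NodeSecurityGroup            # placeholder logical ID
    SourceSecurityGroupId: !Ref NodeSecurityGroup
    IpProtocol: "-1"                           # all protocols, all ports
```

EFA also expects a matching self-referencing outbound rule, so check egress as well as ingress when debugging timeouts.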
Increase Pod Density With Prefix Assignment Mode
GPU instances often have a low limit on the number of pods they can support due to Elastic Network Interface (ENI) IP address constraints. If you are time-slicing and want to run 20 pods on one node, you’ll hit the IP limit fast.
Enable prefix assignment mode in your VPC CNI. This allows each ENI to handle more IP addresses, ensuring pod density is limited by the GPU’s power, not the network’s plumbing.
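Prefix assignment is switched on via an environment variable on the VPC CNI's aws-node DaemonSet (available in VPC CNI v1.9+); with it enabled, each ENI hands out /28 IPv4 prefixes instead of individual addresses. A fragment of the container env:

```yaml
# aws-node DaemonSet container env fragment (VPC CNI v1.9+)
env:
  - name: ENABLE_PREFIX_DELEGATION
    value: "true"
```

In practice this is usually applied with `kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true` or through the EKS add-on configuration.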
Why Karpenter’s Bin-Packing Model Is Critical for GPU Efficiency
Adopting Karpenter shifts capacity management from a node-centric model to a declarative, bin-packing approach. Instead of immediately provisioning new nodes, Karpenter evaluates existing capacity and rearranges workloads to pack resources more efficiently.
This mindset is especially important for GPUs, where capacity is expensive and often underutilized. Applying bin packing to GPUs prioritizes utilization, fitting compatible workloads together and reclaiming wasted capacity before scaling out.
In a world where GPU capacity is scarce and expensive, scaling has shifted from simply adding nodes to orchestrating the right capacity at the right time. This is where Karpenter’s bin-packing model becomes critical.


