The Final Workload
The scale of what’s coming has been understated.
AI isn’t just changing our tools. It’s shifting the center of gravity for how we build, operate and participate in computing itself. We’re not bolting intelligence onto infrastructure anymore. We’re making it native. And in that shift, one workload is quietly becoming dominant: inference.
For most of the past year, the conversation has revolved around model training. Rightfully so. The headlines go to the companies raising billions, wiring up nuclear-scale data centers to crank through massive corpora of text, images and video. But inference—the act of applying those trained models in the real world—is where AI actually becomes useful. It's how AI goes to market. And by many estimates, it's 50 to 100 times bigger than training.
NVIDIA CEO Jensen Huang recently said that inference will drive demand for compute resources 100 times greater than that for model training. Not because training is trivial, but because it happens once—or at most periodically. Inference is constant. Every search query, voice prompt and AI-generated draft runs through an inference engine. It’s the business end of AI. The workload we’ll all have to run, every day, everywhere.
And yet, we don’t treat it that way. It gets framed as a sideshow—something to be run on last year’s GPUs, once the new ones are busy training. That thinking is going to cause real problems. Inference isn’t an afterthought. It’s the default runtime of modern applications.
Which raises the big question: Are we ready for that?
Infrastructure Is Hard (Still)
We’ve been here before. Every leap in computing—mainframe to client/server, client/server to cloud—came with a reinvention of infrastructure. And each time, open source showed us the way. Linux, Apache, Kubernetes. These weren’t commercial products handed down from on high. They were messy, bottom-up, community-built efforts to make sense of massive change.
Inference is pushing us into another reinvention. And it’s going to be just as messy.
Unlike training, which happens in relatively few, specialized locations, inference is everywhere. On the edge. In the cloud. Embedded in workflows we barely notice. It’s ambient. That means our infrastructure—the compute, the networking, the storage—has to adapt. Fast.
And right now, it’s not keeping up.
We have a Cambrian explosion of new hardware architectures. Startups are designing 2,000-core chips. Huawei just announced a networking fabric built specifically to compete with NVIDIA’s NVLink. Cerebras and others are rethinking what a processing unit even is. That’s exciting. But hardware alone doesn’t run inference.
We need software. And right now, the software stack is a fragmented sprawl of open source projects, each evolving at breakneck speed in its own silo. If you’re a Fortune 500 IT leader trying to operationalize AI at scale, the message from the community is essentially: Here’s 2,000 GitHub repos. Good luck.
It Doesn’t Have to Be This Way
I believe in open source. But open code isn’t enough. We need coordination. We need shared roadmaps. We need communities—plural—to talk to each other, align efforts and think about the full stack, not just their slice of it.
There’s precedent for this. The cloud-native movement was born from the realization that infrastructure doesn’t get simpler as it scales. It gets exponentially more complex. The Cloud Native Computing Foundation (CNCF) was built around this insight—not to create more projects, but to create coherence as apps were built in fundamentally new ways. We need that again to address the infrastructure demands of AI; we need a concerted effort to make inference infrastructure buildable, deployable and operable as the very concept of “apps” is redefined.
This is already starting to happen. The smartest infrastructure teams I know are asking new questions: How do we route inference traffic efficiently? What does observability look like when the unit of work is a probabilistic model? How do we guarantee security when prompts, not code, define behavior?
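To make the routing question concrete, here is a minimal sketch in Python of one way an inference gateway might choose a backend. The backend names, context limits and traffic weights are hypothetical, and a production gateway would also weigh latency, cost, queue depth and hardware availability; this only illustrates the shape of the problem.

```python
import random
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    max_context: int   # largest prompt (in tokens) this pool can serve
    weight: float      # share of eligible traffic it should receive

# Hypothetical backend pools; names and limits are illustrative only.
BACKENDS = [
    Backend("small-model-pool", max_context=8_192, weight=0.8),
    Backend("large-model-pool", max_context=128_000, weight=0.2),
]

def route(prompt_tokens: int) -> Backend:
    """Pick a backend that can hold the prompt, weighted by its traffic share."""
    eligible = [b for b in BACKENDS if b.max_context >= prompt_tokens]
    if not eligible:
        raise ValueError("prompt exceeds every backend's context window")
    total = sum(b.weight for b in eligible)
    r = random.uniform(0, total)
    for backend in eligible:
        r -= backend.weight
        if r <= 0:
            return backend
    return eligible[-1]

if __name__ == "__main__":
    print(route(prompt_tokens=2_000).name)   # usually the small pool
    print(route(prompt_tokens=50_000).name)  # always the large pool
```

Even this toy version surfaces the harder questions: what the gateway should log when the "right" answer is probabilistic, and how it should treat a prompt as untrusted input rather than trusted code.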
These are hard problems. But infrastructure has always been hard. That’s not new. What’s new is the scale and speed of the demand.
The Final Workload
A colleague at KubeCon said to me recently, “Inference is the biggest workload in human history.” At first, it sounded hyperbolic. But the more I thought about it, the more it resonated. Every digital interaction is a chance for a model to run. Every model inference is a unit of cognitive labor, scaled. This is the factory of the Intelligence Age: not scaling manual labor, but cognition.
That’s why I call it the final workload. Not because it’s the last thing we’ll ever invent. But because it changes the very nature of what we build. Inference systems are beginning to improve themselves. We’re on the cusp of software that writes its own prompts, adjusts its own weights, orchestrates its own deployments.
We’re entering a phase where the boundary between creator and consumer of software blurs. There is no upstream or downstream. Just the stream.
What We Build Now Matters
We can be cynical about this. We can treat it as hype, or just another tech cycle. Or we can recognize it as a real, foundational shift.
Inference is not just an architectural challenge. It’s a collective challenge. And open source communities—at their best—are collective problem solvers. They’re messy. But they’re how we’ve built everything that mattered in infrastructure for the past 25 years.
So let’s do it again. Let’s make inference infrastructure something we’re proud to hand off to the next generation.