Best of 2024: Why Generative AI and Wasm are a Perfect Match

December 31, 2024 | December 20, 2024 | Matt Butcher | cloud native computing, containers, docker, Wasm, WebAssembly

Generative AI and WebAssembly (Wasm) are two trending technologies. Each has broken out of its earlier stereotypes and is now proving itself in surprising ways. And just as many technologies arise in pairs, these two may turn out to complement each other.

What is Generative AI?

Generative AI is a subset of the broader quest for artificial intelligence. In short, generative AI describes algorithms that generate works (text, images, voice, music) that mimic human compositions. For example, OpenAI’s DALL-E produces images, Meta’s Llama 2 writes text and GitHub’s Copilot creates code. Each of these generates works that are indistinguishable (at least at first glance) from what an artist, writer or software developer might produce.

At a lower level, generative AI works by analyzing human works and discerning patterns, which it can then replicate. Let’s take a large language model (LLM) as an example. An LLM is trained on copious amounts of text. That text may consist of books, Wikipedia, chat sessions, blog posts, games… whatever might represent a “normal” human text. From these texts, an LLM builds up elaborate maps of how words (or even parts of words) relate to each other. These maps are numerical in nature. When it receives a prompt (such as a request from a user), the LLM performs some highly complex calculations and generates a response simply by mimicking patterns that appear in existing texts.

Let’s give it a try. I’m using Llama 2, and I’ll ask it: Describe a bird.

This is its answer:

“Sure, I’d be happy to describe a bird! Here is a detailed description of a bird: The bird I am observing is a medium-sized bird with a slate-gray back and wings and a white belly. Its head is a bright yellow, with a distinctive crest on top that it can raise and lower at will.”

The answer it has generated sounds both human and accurately descriptive. But it was produced by calculating how the words in my prompt relate to one another and then generating, word by word, an answer that is probabilistically in the same terminological space.
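
To make the idea of probabilistic generation concrete, here is a deliberately tiny sketch in Python. It is not how Llama 2 works internally (real LLMs use neural networks over sub-word tokens); it only counts which word follows which in a made-up three-sentence corpus and then samples the next word from those counts. The corpus and all function names are invented for illustration.

```python
import random
from collections import Counter, defaultdict

# Tiny stand-in corpus (an assumption for illustration);
# a real LLM is trained on vastly more text.
CORPUS = (
    "the bird has a gray back . "
    "the bird has a white belly . "
    "the bird has a yellow head ."
)

def build_bigram_counts(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn the raw follow-counts for `word` into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

def generate(counts, start, max_words=8, seed=0):
    """Sample one word at a time, each conditioned on the previous word."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(max_words - 1):
        probs = next_word_probabilities(counts, words[-1])
        if not probs:
            break  # no known follower for this word
        choices, weights = zip(*probs.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

counts = build_bigram_counts(CORPUS)
print(next_word_probabilities(counts, "a"))  # gray, white, yellow: one third each
print(generate(counts, "the"))
```

Every output begins with "the bird has a" because those are the only continuations the corpus contains; the word after "a" is where chance comes in. An LLM does something similar in spirit, but with billions of learned parameters instead of a small table of counts.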

So generative AI is essentially a way of generating human-like texts using sophisticated math, which requires enormous computing power.

CPUs are good at general computation, but GPUs (originally created to do fast floating-point and matrix math) are excellent for AI-grade computing. Asking Llama 2 to describe a bird on my laptop may require minutes of hefty CPU number crunching, but if I run it on an NVIDIA A100 GPU, it takes just a second or two.

Generative AI is in its early stages, and the tools to interact with this technology are still developing. I am particularly excited, though, about Wasm’s potential to work with GenAI.

What is WebAssembly?

WebAssembly (abbreviated to Wasm) is a technology standard originally developed for the web browser. JavaScript is the only language natively supported in all browsers. But vast troves of the world’s software are not written in JavaScript. Wouldn’t it be nice to take an existing C library and use it from JavaScript? That was the initial intuition that led to the creation of Wasm.

Wasm is a binary format that many programming languages can compile to. It has two huge advantages: unlike “native” binary formats, Wasm can run on any operating system and processor architecture. And (thanks to its browser origins) it is also very secure, with each binary being run in a highly restrictive sandbox.
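
One small way to see the platform neutrality: every Wasm module, whatever language it was compiled from and whatever machine it will run on, starts with the same eight-byte preamble (the magic bytes `\0asm` followed by the version number 1 as a little-endian 32-bit integer). The Python sketch below builds that preamble and checks for it; the helper names are invented for illustration.

```python
import struct

WASM_MAGIC = b"\x00asm"  # the four bytes 0x00 0x61 0x73 0x6D
WASM_VERSION = 1         # version field of the Wasm binary format

def minimal_module():
    """Return the 8-byte preamble, which is itself a valid (empty) module."""
    return WASM_MAGIC + struct.pack("<I", WASM_VERSION)

def looks_like_wasm(data):
    """Check whether `data` starts with the Wasm magic bytes and version 1."""
    if len(data) < 8 or data[:4] != WASM_MAGIC:
        return False
    (version,) = struct.unpack("<I", data[4:8])
    return version == WASM_VERSION

print(minimal_module().hex())               # 0061736d01000000
print(looks_like_wasm(minimal_module()))    # True
print(looks_like_wasm(b"\x7fELF" + bytes(4)))  # False: a Linux ELF header
```

Contrast this with native formats such as ELF, Mach-O or PE, where the file itself encodes a target operating system and processor architecture.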

These key attributes are useful well beyond the browser.

Consider the needs of a cloud computing environment. At its core, cloud computing allows a provider to lease CPU, memory, storage and network to other consumers. Take AWS, for example: Amazon makes it possible for anyone to lease chunks of computing and storage in its vast data centers, all without ever giving users access to the physical hardware or the host operating system. Everything is virtualized for us. Like VMs and containers, Wasm provides a virtualized runtime that a provider can lease to customers.

Wasm has some additional virtues that make it great for serverless-style workloads (like AWS Lambda or Azure Functions): Wasm binaries can cold start in under a millisecond and run at near-native speeds. They can be moved inside the data center with ease because they are compact and cross-platform. And because they can be moved and started so quickly, they are vastly cheaper to operate than their peers—Docker containers and virtual machines. They consume scant resources and can be moved to underutilized segments of compute.

All of that is generic, though. When it comes to AI, we can see a specific reason why Wasm is a perfect match for generative AI.

Bringing Generative AI and Wasm Together

Generative AI requires hefty chunks of GPU or CPU compute power. And AI-grade GPUs are currently very expensive. An NVIDIA A100 (the GPU that took only a second or two to run my ‘describe a bird’ example) is so hard to buy that most of us could not get our hands on one even if we could afford it. Cloud providers and large companies are scooping up these beefy GPUs as fast as they can be produced. But even they are starved for more.

One way to ease the demand problem is to bolster the efficiency of use. And that’s where Wasm shines. While a typical application may need long periods of exclusive access to a GPU, Wasm binaries running in the serverless style will only consume a GPU for a few moments while an AI inference is being run. Then, the GPU is immediately released to handle other requests. Thus, efficiency is gained by more effectively time-slicing access to the GPU.
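
A back-of-the-envelope calculation shows why time-slicing matters. The Python sketch below compares giving every application its own GPU with sharing a pool of GPUs that are only held for the duration of each inference. All of the numbers (50 applications, 6 requests per minute, 1.5 seconds per inference, a 70% utilization ceiling) are assumptions chosen for illustration, and the model ignores queueing delays and bursts.

```python
import math

def gpus_needed_exclusive(num_apps):
    """Each long-running application holds one GPU for itself."""
    return num_apps

def gpus_needed_shared(num_apps, requests_per_minute,
                       seconds_per_inference, max_utilization=0.7):
    """GPUs needed when a GPU is held only while an inference runs.

    `max_utilization` leaves headroom so the pool is not planned at 100%.
    """
    busy_seconds_per_minute = (
        num_apps * requests_per_minute * seconds_per_inference
    )
    gpu_equivalents = busy_seconds_per_minute / 60.0
    return max(1, math.ceil(gpu_equivalents / max_utilization))

# Hypothetical workload: 50 apps, 6 requests/minute each, 1.5 s per inference.
print(gpus_needed_exclusive(50))        # 50
print(gpus_needed_shared(50, 6, 1.5))   # 11
```

In this made-up scenario each application keeps a GPU busy only 15% of the time (6 × 1.5 s out of every 60 s), so the shared pool serves the same load with roughly a fifth of the hardware.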

Wasm also solves another problem related to hardware access. It would be prohibitively expensive to dedicate an AI-grade GPU to each application developer working on AI. Yet traditional programming tools build (or use) native binaries that require the developer to write specifically for a particular GPU make and model. A developer must, therefore, do their testing on a remote (and expensive to operate) machine with the right GPU architecture.

Wasm is different. Its platform neutrality means that developers can write code that, when run locally, makes use of the plain old CPU or graphics card that comes with their workstation and then deploy this into a production environment that has access to powerful AI-grade GPUs. This is how, for example, the open source Fermyon Spin framework functions.
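
The pattern can be sketched in a few lines of Python. This is not Spin's API; `Host`, `select_backend` and the two placeholder inference functions are hypothetical names used only to show the shape of the idea: the application code stays the same, and the host it lands on decides which hardware does the work.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Host:
    """Hypothetical description of the machine the runtime is running on."""
    name: str
    has_ai_gpu: bool

def infer_on_cpu(prompt: str) -> str:
    # Placeholder for a slow, CPU-only inference path.
    return f"[cpu] answer to: {prompt}"

def infer_on_gpu(prompt: str) -> str:
    # Placeholder for a fast path backed by an AI-grade GPU.
    return f"[gpu] answer to: {prompt}"

def select_backend(host: Host) -> Callable[[str], str]:
    """The runtime, not the application, picks the backend for this host."""
    return infer_on_gpu if host.has_ai_gpu else infer_on_cpu

def describe(prompt: str, host: Host) -> str:
    """Application code: identical on the laptop and in production."""
    return select_backend(host)(prompt)

laptop = Host("laptop", has_ai_gpu=False)
production = Host("production", has_ai_gpu=True)
print(describe("Describe a bird.", laptop))      # [cpu] answer to: Describe a bird.
print(describe("Describe a bird.", production))  # [gpu] answer to: Describe a bird.
```

The point of the design is that nothing in `describe` mentions a GPU make or model, so the same build can be tested locally and deployed unchanged.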

In 2024, advances in the WebAssembly System Interface for Neural Networks (WASI-NN) standard will take this even further, as will several projects that aim to bring WebGPU support to Wasm. The latter will enable developers to issue low-level GPU instructions (rather than high-level AI inferences) and have the underlying Wasm runtime translate them into the correct instructions for the hardware it is running on.

Finally, Wasm’s portability brings another boon in the realm of AI. The mantra that we should “move compute toward data” resonates well with Wasm’s platform neutrality. In practice, this means that because applications are short-lived and platform-neutral, an AI orchestration system can move the Wasm binary as close to the data (the model, the vector database) as possible. This removes the need to send gobs of data from storage to some remote compute service.
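
Some rough arithmetic illustrates why moving the binary beats moving the data. The sizes and the link speed in this Python sketch are assumptions picked for illustration (a 5 MB Wasm binary, a 7 GB model file, a 1 Gbit/s link), not measurements.

```python
MB = 1_000_000
GB = 1_000_000_000

def transfer_seconds(size_bytes, bandwidth_bytes_per_second):
    """Idealized transfer time: size divided by bandwidth, no latency."""
    return size_bytes / bandwidth_bytes_per_second

# Hypothetical sizes and link speed.
wasm_binary_size = 5 * MB   # a compact Wasm application
model_size = 7 * GB         # a model file stored next to the GPUs
link_speed = 125 * MB       # 1 Gbit/s is about 125 MB per second

print(transfer_seconds(wasm_binary_size, link_speed))  # about 0.04 seconds
print(transfer_seconds(model_size, link_speed))        # 56.0 seconds
```

Under these assumptions, shipping the application to the data takes a few hundredths of a second, while shipping the model to the application takes close to a minute.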

It’s Happening Now

AI saw explosive growth in 2023. In that same period, Wasm’s perception as a browser technology transformed into enthusiasm for its role in the cloud-native ecosystem. As 2024 kicks off, we are watching these two trends coalesce before our eyes. Don’t be surprised to see the two technologies frequently mentioned together over the course of this year. They’re a perfect match.