CAST AI Helps Cost-Optimize LLMs Running on Kubernetes
CAST AI today added a feature to detect which large language model (LLM) is being used to drive an artificial intelligence (AI) application, and then recommend and install an alternative inference engine that is less expensive to run.
Announced at the Google Cloud Next ’24 conference, the AI Optimizer capability, which is being added to the CAST AI automation platform for Kubernetes clusters late in the second quarter, will initially be made available on the Google Kubernetes Engine (GKE) service. Support for other Kubernetes cloud services that support the open application programming interface (API) defined by OpenAI will follow.
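The practical significance of OpenAI API compatibility is that an application’s client code does not need to change when the inference endpoint behind it does. A minimal sketch of that pattern, assuming a hypothetical OpenAI-compatible endpoint served from a cluster (the URL, key and model name below are placeholders, not CAST AI specifics), might look like this:

```python
# Minimal sketch of calling an OpenAI-compatible inference endpoint.
# The base_url, api_key and model name are hypothetical placeholders;
# any endpoint that implements the OpenAI API shape can be swapped in
# without changing the rest of the application code.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example-cluster.internal/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="example-llm",  # whichever model the endpoint actually serves
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)

print(response.choices[0].message.content)
```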
AI Optimizer analyzes users and their associated API keys, overall usage patterns, and the balance of input versus output tokens to determine which cloud instance of a graphics processing unit (GPU) will run an AI model most efficiently. The capability will substantially reduce the total cost of AI by enabling organizations to right-size AI inference engines based on price/performance metrics, said Laurent Gil, chief product officer for CAST AI.
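To make the right-sizing idea concrete, the comparison described here reduces to price/performance arithmetic: the hourly price of a GPU instance divided by the tokens it can serve in that hour, weighted by the observed mix of input and output tokens. The sketch below illustrates that arithmetic only; the instance names, prices and throughput figures are invented for the example and are not CAST AI data.

```python
# Illustrative price/performance comparison across GPU instance types.
# All instance names, hourly prices and throughput numbers are hypothetical;
# the point is the arithmetic: hourly price divided by the tokens an
# instance can serve per hour, weighted by the observed input/output mix.

# Hypothetical observed workload profile.
input_token_share = 0.7   # 70% of tokens are prompt (input) tokens
output_token_share = 0.3  # 30% are generated (output) tokens

# Hypothetical instance catalog: hourly price (USD) and sustained
# throughput in tokens per second for input processing and output generation.
instances = {
    "gpu-small":  {"price_per_hour": 1.20, "input_tps": 8000,  "output_tps": 900},
    "gpu-medium": {"price_per_hour": 2.50, "input_tps": 20000, "output_tps": 2200},
    "gpu-large":  {"price_per_hour": 6.00, "input_tps": 45000, "output_tps": 5000},
}

def cost_per_million_tokens(spec: dict) -> float:
    """Blend input/output throughput by the workload mix, then convert
    the hourly price into a cost per one million tokens served."""
    blended_tps = (input_token_share * spec["input_tps"]
                   + output_token_share * spec["output_tps"])
    tokens_per_hour = blended_tps * 3600
    return spec["price_per_hour"] / tokens_per_hour * 1_000_000

# Rank the hypothetical instances by cost per million tokens served.
for name, spec in sorted(instances.items(), key=lambda kv: cost_per_million_tokens(kv[1])):
    print(f"{name}: ${cost_per_million_tokens(spec):.4f} per million tokens")
```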
CAST AI will also shortly add orchestration capabilities that enable IT operations teams to choose among the automation the company provides for deploying AI models, the tools made available by Hugging Face, and the open source Grok platform championed by Elon Musk.
At this juncture, most organizations can only advance a small number of generative AI initiatives. In addition to GPU scarcity, the cost of building and deploying these applications is significantly higher than that of existing legacy applications. As a result, IT operations teams are increasingly tasked with optimizing the deployment of the AI models that data science teams create. Most data science teams lack an appreciation for IT costs, so it’s not uncommon for an AI model to be selected based on how well it’s known rather than its actual cost.
In the meantime, AI continues to emerge as a “killer application” for Kubernetes environments because the platform makes it simple to dynamically scale the consumption of IT infrastructure resources on demand.
However, the complexity of managing Kubernetes environments has limited the number of organizations that have the IT skills required to deploy and manage AI models on Kubernetes clusters. In addition, data scientists are often intimidated by the complexity of Kubernetes in much the same way application developers often are. There is a clear need to make Kubernetes more accessible to both data scientists and developers without requiring an IT operations team to manually provision each application.
Long term, it’s hard to envision any modern application that won’t include AI capabilities in one fashion or another. In some cases that may simply mean invoking an LLM via an API. However, the number of applications that embed AI models to ensure performance will only increase. In that regard, AI models will ultimately be just another software artifact, to be managed like any other within the context of a larger DevOps workflow.
Photo credit: Aaron Burden on Unsplash