The Future of AI Is Hybrid
By Alex Woodie—Artificial intelligence today is largely something that occurs in the cloud, where huge AI models are trained and deployed on massive racks of GPUs. But as AI makes its inevitable migration into the applications and devices that people use every day, it will need to run on smaller compute devices deployed at the edge and connected to the cloud in a hybrid manner.
That’s the prediction of Luis Ceze, the University of Washington computer science professor and OctoAI CEO, who has closely watched the AI space evolve over the past few years. According to Ceze, AI workloads will need to break out of the cloud and run locally if the technology is going to have the impact foreseen by many.
In a recent interview with Datanami, Ceze gave several reasons for this shift. For starters, the Great GPU Squeeze is forcing AI practitioners to search for compute wherever they can find it, which makes the edge look downright hospitable today, he says.
“If you think about the potential here, it’s that we’re going to use generative AI models for pretty much every interaction with computers,” Ceze says. “Where are we going to get compute capacity for all of that? There’s not enough GPUs in the cloud, so naturally you want to start making use of edge devices.”
Enterprise-level GPUs from Nvidia continue to push the bounds of accelerated compute, but edge devices are also seeing big speed-ups in compute capacity, Ceze says. Apple and Android devices are often equipped with GPUs and other AI accelerators, which will provide the compute capacity for local inferencing.
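To make the idea of local inferencing concrete, here is a minimal sketch, not drawn from the article, of running a small quantized model entirely on a laptop- or phone-class device. It assumes the llama-cpp-python bindings and a locally stored GGUF model file; the file name is a placeholder.

```python
# Minimal sketch of on-device inference (illustrative; assumes llama-cpp-python
# is installed and a small quantized GGUF model file is stored locally).
from llama_cpp import Llama

# Load the model from local storage; n_gpu_layers=-1 offloads all layers to the
# device's GPU/accelerator (e.g. Metal on Apple silicon) when one is available.
llm = Llama(model_path="models/llama-7b-q4.gguf", n_gpu_layers=-1, n_ctx=2048)

# Run a completion entirely on the local device -- no network round trip.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize today's meeting notes."}]
)
print(result["choices"][0]["message"]["content"])
```

The point of the sketch is simply that the heavy lifting happens on the device’s own GPU or accelerator, which is exactly the compute capacity Ceze expects to be tapped.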
The network latency involved in relying on cloud data centers to power AI experiences is another factor pushing AI toward a hybrid model, Ceze says.
“You can’t make the speed of light faster and you cannot make connectivity be absolutely guaranteed,” he says. “That means that running locally becomes a requirement, if you think about latency, connectivity, and availability.”
Early GenAI adopters often chain multiple models together when developing AI applications, and that trend is only accelerating. Whether it’s OpenAI’s massive GPT models, Meta’s popular Llama models, Mistral’s open models, or any of the thousands of other open source models available on Hugging Face, the future is shaping up to be multi-model.
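As one illustration of that chaining pattern, the sketch below feeds the output of a language model into an image model. The use of the OpenAI Python client and the specific model names are stand-ins chosen for the example, not something described in the article.

```python
# Sketch of chaining two models (illustrative only; the client and model names
# here are stand-ins, not from the article).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: a language model turns a rough idea into a detailed image prompt.
draft = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Write a one-sentence prompt for an illustration of hybrid cloud/edge AI."}],
)
image_prompt = draft.choices[0].message.content

# Step 2: an image model renders the prompt produced by the first model.
image = client.images.generate(model="dall-e-3", prompt=image_prompt, n=1)
print(image.data[0].url)
```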
The same type of framework flexibility that enables a single app to utilize multiple AI models also enables a hybrid AI infrastructure that mixes on-prem and cloud models, Ceze says. Where a model runs still matters, but developers will have the option to run it locally or in the cloud.
“People are building with a cocktail of models that talk to each other,” he says. “Rarely it’s just a single model. Some of these models could run locally when they can, when there’s some constraints for things like privacy and security…But when the compute capabilities and the model capabilities that can run on the edge device aren’t sufficient, then you run on the cloud.”
At the University of Washington, Ceze led the team that created Apache TVM (Tensor Virtual Machine), an open source machine learning compiler framework that allows AI models to run on different CPUs, GPUs, and other accelerators. That team, now at OctoAI, maintains TVM and uses it to provide cloud portability for its AI service.
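To give a sense of the portability TVM provides, here is a minimal sketch of compiling an ONNX model with TVM’s Relay frontend and retargeting it by changing a single string. The model file, input name, and shape are assumptions for the example; details vary by model and TVM version.

```python
# Minimal sketch of compiling a model with Apache TVM's Relay frontend
# (assumes an ONNX model file and its input name/shape; details vary by model).
import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("resnet18.onnx")
shape_dict = {"input": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Swap the target string to retarget the same model: "llvm" for CPU,
# "cuda" for Nvidia GPUs, "metal" for Apple devices, and so on.
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
print(module.get_output(0).numpy().shape)
```

The same compiled workflow can be pointed at a server GPU or a phone-class accelerator, which is the property that makes a hybrid deployment plausible in the first place.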
“We’ve been heavily involved with enabling AI to run on a broad range of devices, and our commercial products evolved to be the OctoAI platform. I’m very proud of what we built there,” Ceze says. “But there’s definitely clear opportunities now for us to enable models to run locally and then connect it to the cloud, and that’s something that we’ve been doing a lot of public research on.”
In addition to TVM, other tools and frameworks are emerging to enable AI models to run on local devices, such as MLC LLM and Google’s MLIR project. According to Ceze, what the industry needs now is a layer to coordinate the models running on-prem and in the cloud.
“The lowest layer of the stack is what we have a history of building, so these are AI compilers, runtime systems, etc.,” he says. “That’s what fundamentally allows you to use the silicon well to run these models. But on top of that, you still need some orchestration layer that figures out when should you call to the cloud? And when you call to the cloud, there’s a whole serving stack.”
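The sketch below shows the kind of decision such an orchestration layer might make, routing a request to the device or to the cloud based on privacy, connectivity, and whether the local model is capable enough. Every name in it (run_local, run_cloud, the connectivity check) is a hypothetical placeholder, not an OctoAI or TVM API.

```python
# Sketch of the routing decision an orchestration layer might make
# (run_local, run_cloud, and the thresholds are hypothetical placeholders).
import socket

def has_connectivity(host: str = "8.8.8.8", port: int = 53, timeout: float = 1.0) -> bool:
    """Cheap reachability check; a real system would track latency and failures."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def run_local(prompt: str) -> str:   # placeholder for an on-device model call
    return f"[local] {prompt}"

def run_cloud(prompt: str) -> str:   # placeholder for a cloud serving-stack call
    return f"[cloud] {prompt}"

def route_request(prompt: str, privacy_sensitive: bool, local_model_fits: bool) -> str:
    # Privacy-sensitive requests stay on the device whenever the local model can serve them.
    if privacy_sensitive and local_model_fits:
        return run_local(prompt)
    # No connectivity but a capable local model: prefer the edge.
    if local_model_fits and not has_connectivity():
        return run_local(prompt)
    # Otherwise fall back to the larger models behind the cloud serving stack.
    if has_connectivity():
        return run_cloud(prompt)
    return run_local(prompt)  # last resort: a degraded local answer beats no answer
```

The interesting design work, as Ceze notes, sits above the compilers and runtimes: deciding when a request is worth the round trip to the cloud at all.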
The future of AI development will parallel Web development over the past quarter century, where all the processing except HTML rendering started out on the server, but gradually shifted to running on the client device too, Ceze says.