The AI Serving Substrate
The technological progress made in AI over the last ten years is breathtaking — from AlexNet in 2012 to the recent release of ChatGPT, which has taken large foundation models and conversational AI to another level. These incredible research innovations have shown the immense potential of AI to impact everything from healthcare to manufacturing, finance, climate, communication, and travel, to how we interact with the world around us.
AI can help solve any problem represented by data, assuming suitable algorithms and enough computational resources. However, what often goes unsaid amid the excitement and press around these research innovations is the challenge of practically deploying them. To realize their full potential to improve human lives, these innovations need to permeate the applications we use daily — they can’t live solely in the deep pockets of AI research institutes and Big Tech. In this post, we discuss how deficiencies in existing serving technologies make the deployment of AI models to cloud server environments challenging.
Today's real-world AI applications require a production team to build, maintain and evolve their own AI Serving Substrate.
We use this term to refer to the bottom layer of tools and distributed computing technologies that are required to build a modern scalable AI-enabled cloud application.
These substrates typically include machine learning frameworks like TensorFlow, PyTorch, ONNX Runtime and TensorRT, AI serving frameworks like TensorFlow Serving, TorchServe, or Nvidia’s Triton Inference Server, and containerization and orchestration technologies like Docker and Kubernetes. With these components, production teams strive to support user demand and meet or improve on requirements for cost, throughput, latency, and the quality of model predictions or generated content — all while avoiding hardware lock-in and maintaining cloud optionality.
Achieving these goals is difficult because the current generation of serving substrates are usually custom in-house designs assembled with duct tape from many uncooperative components. This negatively impacts deployment velocity when new kinds of models need to be deployed, leads to reliability problems scaling these complicated ad-hoc systems, and prevents the use of the latest features needed by the most advanced models. As a result, these systems get replaced every few years as requirements change and model architectures evolve.
At Modular, we have productized huge AI workloads and delivered them to billions of users, and we have replaced numerous serving substrates along the way. Let’s look at some of the problems that need to be addressed to solve this once and for all.
What does a modern AI Cloud application look like?
Modern cloud applications are complex distributed systems orchestrating data flows across several independent micro-services and components — sometimes spanning both cloud and edge. These distributed applications have evolved to require a serving substrate responsible for scaling and managing them.
Let’s look at a typical application stack, like automatic speech recognition (ASR) in the cloud, and explore some of the root causes of the problems we have faced.
The many challenges in production AI serving
Modern AI systems have evolved to face many challenges that go beyond a simple serving binary.
The complexity of integrating multiple AI frameworks
A key source of complexity stems from the challenges of serving multiple machine learning frameworks (TensorFlow, PyTorch, JAX, ONNX). Production teams are often asked to provide a unified system in order to be agile and responsive to the needs of research teams. The typical solution is to layer on top of multiple frameworks and use a “lowest-common-denominator” wrapper for the underlying runtimes. Unfortunately, this typically prevents the use of the most powerful and differentiating features that those frameworks provide.
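A minimal sketch of what such a wrapper often looks like (all class and method names here are hypothetical, not from any specific serving system): every framework is hidden behind the same generic dict-in/dict-out interface, which is exactly why framework-specific capabilities cannot be expressed through it.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class InferenceBackend(ABC):
    """Lowest-common-denominator interface hiding each framework."""

    @abstractmethod
    def predict(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Run one inference; only generic dict-in/dict-out is expressible."""

class TorchBackend(InferenceBackend):
    """Wraps any callable model; a real version would wrap a torch.nn.Module."""

    def __init__(self, model):
        self.model = model

    def predict(self, inputs):
        # The generic call path cannot surface framework-specific features
        # such as compilation modes, custom batching hooks, or CUDA graphs.
        return {"output": self.model(inputs["input"])}
```

The narrowness of `predict()` is the point: anything a framework offers beyond "tensor in, tensor out" has no place to plug in.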
Another problem is that implementation details such as execution environment (remote vs. local) and communication protocol (gRPC vs. HTTP transport) are leaked to the application developer. As a result, developers typically need to directly manage crucial functional and performance aspects such as fault-tolerance and load-balancing instead of the serving API abstracting these implementation details away.
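To illustrate the leak described above, here is a hedged sketch of the kind of client code that results (the replica list, `send_grpc_request` stub, and retry policy are all hypothetical): because the serving API exposes placement and protocol details, every application ends up re-implementing load balancing and fault tolerance by hand.

```python
import random
import time

# Hypothetical replica fleet: the caller must know where models run.
REPLICAS = ["10.0.0.1:8500", "10.0.0.2:8500"]

def send_grpc_request(endpoint, payload):
    """Stand-in for a real gRPC stub call to one replica."""
    return {"endpoint": endpoint, "result": payload}

def call_with_retries(payload, max_attempts=3):
    """Ad-hoc load balancing and fault tolerance, re-implemented per app."""
    for attempt in range(max_attempts):
        endpoint = random.choice(REPLICAS)      # caller-side load balancing
        try:
            return send_grpc_request(endpoint, payload)  # transport baked in
        except ConnectionError:
            time.sleep(2 ** attempt)            # caller-side backoff
    raise RuntimeError("all replicas failed")
```

A better-abstracted serving API would own the replica set, the transport, and the retry policy, leaving the application with a single logical call.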
This is all made worse by the monolithic nature of machine learning frameworks, which are difficult to integrate, have large and complicated dependencies, and think they should own all the resources on a given machine.
Overly simplified systems have hidden ceilings
Many AI serving substrates provide a simplified API for orchestrating the many components in the system, but these APIs are often limited and can be very slow. The challenge arises when your application becomes successful and you need to start scaling it. These stacks frequently run into reliability problems, fail to scale to larger deployments, do not deliver the latency and throughput the application requires, and cannot integrate with the more advanced use cases described below.
The appeal of a simple stack quickly loses its charm when you find you need to rewrite your application on top of a much more complicated (but also more powerful) substrate.
Challenges multiplexing applications onto shared resources
At scale, production cloud environments host multiple applications that each have their own AI models. Different models have varying compute and memory requirements, different traffic patterns, and latency and throughput needs - depending on the context of the application. Further, models typically have multiple versioned variants in production for A/B testing new algorithms and to support safe and incremental model update rollouts without disrupting production traffic. These factors combine to create complexity for cloud application developers and their production release and management operations.
These challenges are inherent to our domain, but existing systems do little to help - they push the pain onto DevOps and MLOps teams which drives the need for bespoke serving substrates. These substrates must manage model storage and caching, retrieve model features, load-balance and route model inference requests, proactively auto-scale serving capacity across cloud regions, scale with model and data parallelism, implement monitoring and logging, and respond to dynamically changing traffic volumes. Production teams often have amazing engineers who can tackle and solve these problems, but doing so is often not the highest priority of the team or the best ROI for their time.
Large AI models create new challenges for scale & reliability
Giant AI models have been growing at an astonishing rate in terms of their size and prevalence across domains (NLP, Vision, Speech, Multimodal AI tasks, etc). While large transformers are wildly popular, other model architectures like Mixture-of-Experts (MoE) and Recommender models (which have a large number of sparse features and embeddings) can also have hundreds of billions of parameters. Individual commodity cloud machines lack the memory and floating-point compute capacity to run these models by themselves.
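Some back-of-envelope arithmetic makes the single-machine limit concrete (the parameter count, precision, and accelerator size below are illustrative assumptions, not figures from any specific deployment):

```python
# Back-of-envelope memory math for a large model, with assumed numbers:
params = 175e9                 # parameters in a GPT-3-class model
bytes_per_param = 2            # fp16 weights
weights_gb = params * bytes_per_param / 1e9    # 350 GB just for weights

accelerator_gb = 80            # memory on a single high-end GPU
min_devices = int(-(-weights_gb // accelerator_gb))   # ceiling division
# 350 GB / 80 GB -> at least 5 devices, before counting activations,
# KV caches, or the framework's own runtime overhead.
```

Under these assumptions the weights alone need five accelerators, which is why serving such models is inherently a distributed-systems problem.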
Distributing these models across multiple machines can help, but the existing substrates are typically bespoke implementations that only work for select model architectures and hardware targets. One small deviation from the supported models and execution environments can require significant rewrites or switching to a new serving substrate. This is a huge problem for organizations that favor fast iteration and high-velocity research.
The reliability of large models is another under-recognized concern, given that distributed serving systems are typically based on frameworks like the Message Passing Interface (MPI). MPI and similar systems are designed for reliable High Performance Computing environments and lack fault tolerance to software, network, and machine failures — so they don’t work well on commodity cloud environments where such failures are common. Furthermore, they don’t support the elasticity (scaling the number of workers up and down) that one would expect in a typical scalable cloud environment.
Challenges managing cost and project-level spend when delivering value
The performance of an AI application in production involves far more than meeting a specific QPS, throughput, or latency threshold. The system must also remain manageable as the design changes and workloads evolve, and costs need to be tied back to the operations teams responsible for each workload. Most of the challenges come from poor integration of the components in the serving substrate, and from gaps in system-level performance observability.
Today’s serving substrates emit some basic metrics like request volume, latency, and device compute utilization, but fail to provide an integrated resource utilization view that covers fixed and variable memory usage, and utilization of key I/O data-paths (Network, PCIe, Disks, etc.). They also don't provide an integrated diagnostic view of end-to-end inference performance to enable pinpointing bottlenecks and unlocking cost and performance optimizations. Finally, the top-line metrics of actual occupancy and utilization rates of allocated compute resources and provisioned compute capacity are not tracked even though they are the dominant factor in operational costs.
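As a sketch of the missing top-line metric, consider an occupancy rate: the fraction of allocated (and billed) accelerator time that actually performed inference. The function and the example figures below are hypothetical, purely to show the shape of the calculation.

```python
def occupancy_rate(busy_device_seconds, allocated_device_seconds):
    """Fraction of allocated (and billed) device time doing useful work."""
    if allocated_device_seconds == 0:
        return 0.0
    return busy_device_seconds / allocated_device_seconds

# Example: 8 GPUs provisioned for one hour, but inference kernels ran for
# a combined 1.2 GPU-hours, so most of the spend bought idle capacity.
rate = occupancy_rate(busy_device_seconds=1.2 * 3600,
                      allocated_device_seconds=8 * 3600)
```

An occupancy of 0.15 in this example means 85% of the provisioned capacity was paid for but unused, which is exactly the kind of cost signal today's substrates fail to surface.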
It is difficult to see how to really address this while the fundamental components of the substrate are disaggregated and so uncooperative with each other.
Unlock AI serving for the future
Developers need to be able to integrate AI into performance-sensitive production applications with zero friction and serve AI reliably and cost-effectively at scale. We’ve seen the power of large models, but they can only benefit the world if they can be deployed cost-effectively and reliably — and they’ll keep getting bigger and more complicated. An AI serving substrate that addresses these pain points will dramatically improve the value-to-cost ratio for AI adoption in cloud applications.
Modular is tackling the hardest problems in AI infrastructure because we believe solving these will unlock the power of AI in ways that no one else can. We encourage you to apply for our open roles if you are passionate about these problems and driving the next wave of innovation in AI.