Improving AI for the world
If you’ve followed Modular since our founding in early 2022, you know we’ve been talking a lot about how AI software is broken, fragmented, too expensive, and negatively impacts AI developers and the world. We’ve written a manifesto about why the world needs better AI software, discussed the importance of modularity in AI software, explained how current tech stacks struggle with large models, and described where today’s AI serving stacks fall short. We only recently started sharing our results, including the world’s fastest matrix multiplication algorithms across multiple CPU targets.
This post is different. We’re excited to finally share what we’ve been building at Modular. This announcement begins Modular’s journey to radically change the nature of AI programmability, usability, scalability, and compute.
We strongly believe in a world that gives fair and equitable access to AI for all. To achieve our vision of enabling AI to be used by anyone, anywhere, we are rethinking and reinventing the entire AI stack from the ground up. Our next-generation developer platform scales to help you defragment your AI and compute stacks and realize more value from the incredible pace at which AI is evolving.
Two incredible breakthroughs
Modular is moving AI infrastructure from the research era into the production era. AI infrastructure itself has been rapidly evolving under heavy research for years now. Still, the lessons learned needed to be brought forward into a production-quality system that combines first-principles thinking with rigorous engineering and a design that scales to help unify the world. To that end, we are excited to announce two new technological breakthroughs on our next-generation AI developer platform:
- The world’s fastest unified inference engine that is easy to use, portable, and highly performant, making it easier than ever to power your AI models in production and save money in the process.
- Mojo 🔥, a programming language for all AI developers that combines the usability of Python with the performance of C, bringing programmability back to AI while unifying the hardware landscape.
Let’s dive deeper into each.
The world’s fastest unified inference engine
We built the Modular Inference Engine to defragment and simplify AI deployment. It is the world’s fastest unified AI execution engine, powering all your PyTorch and TensorFlow workloads while delivering significant usability, performance, and portability gains.
With the Modular Inference Engine, the benefits to AI developers are clear:
- Deploy more models, faster: Powered by state-of-the-art technologies, the Modular Inference Engine allows developers to get rid of bespoke toolchains, radically simplifying AI deployment. Through simple APIs available in popular languages like Python and C/C++, developers can quickly deploy models trained using PyTorch and TensorFlow without intermediate model conversions or pre-optimization steps.
- Full compatibility with existing frameworks and servers: Unlike other AI engines, the Modular Inference Engine natively supports all the operators available in the latest versions of the training frameworks. It is also fully compatible with models containing control flow and fully supports dynamic shapes (such as text sequences of varying lengths, like those that power BERT and GPT). It is extensible by design, allowing developers to write their own custom operators. It is also fully compatible with existing cloud serving solutions such as NVIDIA’s Triton Inference Server and TensorFlow Serving, and supports on-prem deployment as well.
- Take advantage of the latest hardware: The Modular Inference Engine puts portability in the hands of every developer. The same installation packages “just work” everywhere regardless of the platform, micro-architecture capabilities (AVX512, AVX2, Neon), hardware type (CPU, GPU, xPU), or vendor. This means in practice that developers can now migrate workflows to new targets without changing toolchains or rewriting code. The ability to quickly experiment with different hardware allows businesses to gather the data required to make well-informed cost/performance tradeoffs - instead of guessing.
- Maximize performance, minimize costs: Unlike other solutions in the market, developers do not have to trade performance to unlock portability. The Modular Inference Engine is incredibly fast, delivering 3-4x latency and throughput gains out-of-the-box on state-of-the-art models across Intel, AMD, and Graviton architectures, unlocking compute everywhere. Check out performance.modular.com for a detailed breakdown of latency, throughput, and total inference cost across popular AI architectures.
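To make the dynamic-shapes point above concrete: a static-shape engine forces you to pad every input to one fixed size, while a dynamic-shape engine runs each batch at its natural length. This is a minimal, illustrative plain-Python sketch of that padding overhead (the `pad_batch` helper is hypothetical and not part of any Modular API):

```python
# Illustrative only: variable-length token-id sequences like those
# BERT- or GPT-style models consume. A static-shape engine requires
# padding every sequence to one fixed length; dynamic-shape support
# lets the engine skip this wasted work.
def pad_batch(sequences, pad_id=0):
    """Pad variable-length sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[101, 2023, 102], [101, 2023, 2003, 2936, 102]]
padded = pad_batch(batch)
# Every row is now length 5; the trailing pad_id tokens are pure
# overhead that a dynamic-shape engine avoids computing over.
```

The longer the spread between the shortest and longest sequence in a batch, the more compute padding wastes, which is why native dynamic-shape support matters for text workloads.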
The Inference Engine is available now to a limited number of early-access partners, and developers can join the waitlist to get access here. You can also check out a preview of our Python and C/C++ APIs.
Mojo 🔥: A programming language for all AI developers
That’s right, we built a new programming language: we had to in order to defragment AI software development.
Mojo 🔥 combines the usability of Python with the performance of C, and adds new abilities to scale to AI accelerators. This unlocks unparalleled programmability of AI hardware and extensibility of AI models, without breaking what you love about Python.
Python is a powerful high-level language with clean, easy-to-learn syntax and an expansive ecosystem of libraries: Python powers almost all AI research today. However, it also has well-known scalability issues: it doesn’t scale to the world’s largest workloads, nor does it reach edge devices. Instead, production AI deployment happens in other languages like C++ and CUDA. The result is a fragmented AI software landscape that reduces AI developer productivity and slows the pipeline from research to production.
Mojo solves these problems, bringing superpowers to AI developers:
- Write everything in one language: Mojo combines the parts of Python that researchers love with the systems programming features that currently require C, C++, and CUDA. Mojo is built on top of next-generation compiler technologies that unlock significant performance gains when you add types to your programs, enable you to define zero-cost abstractions, provide Rust-like memory safety, and power unique autotuning and compile-time metaprogramming capabilities.
- Unlock Python performance: Mojo is built on Modular’s high-performance heterogeneous runtime and uses MLIR, which gives it access to all AI hardware: threading, low-level hardware features like TensorCores and AMX extensions, and the ability to reach into accelerators. And oh yeah, it’s fast… very fast! In fact, Mojo is 35,000x faster than Python when running numeric algorithms like Mandelbrot because it can take full advantage of your hardware.
- Access the entire Python ecosystem: All this power doesn’t mean you have to sacrifice what you already know and love. Mojo doesn’t just look and feel like Python - it also gives you access to the full Python ecosystem, including favorites like NumPy, Pandas, and Matplotlib, as well as your existing custom Python code.
- Upgrade your models and the Modular stack: Mojo isn’t a side project: all of Modular’s in-house kernels are written in Mojo, which is why the Modular Inference Engine delivers such amazing performance and portability. The engine lets you extend your models with pre- and post-processing operations written in Mojo, and replace existing operations with custom ones. Mojo enables you to customize the Modular stack with kernel fusion, graph rewrites, and shape functions, all without recompiling the framework or writing any C++ or CUDA.
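The Mandelbrot comparison above refers to workloads like this one: a tight escape-time loop where per-iteration interpreter overhead dominates in pure Python, and where typed, compiled code sees its largest speedups. Here is an illustrative plain-Python version of that kernel (the specific function is our sketch, not Modular’s benchmark code):

```python
# A pure-Python Mandelbrot escape-time kernel: millions of calls to a
# loop like this dominate the benchmark, and each iteration pays full
# interpreter and boxed-arithmetic overhead that typed Mojo code avoids.
def mandelbrot_escape(c, max_iter=200):
    """Return the iteration at which |z| exceeds 2, or max_iter if it never does."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c  # the core recurrence z_{n+1} = z_n^2 + c
        if abs(z) > 2:
            return i
    return max_iter

# Points inside the set never escape; points far outside escape at once.
inside = mandelbrot_escape(0j)       # runs all 200 iterations
outside = mandelbrot_escape(2 + 2j)  # escapes on the first iteration
```

Rendering an image means running this loop once per pixel, which is why removing the per-iteration overhead compounds into such large end-to-end speedups.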
The start of an incredible journey
Modular is striving to radically reshape the nature of AI and accelerated compute. Today is just the very beginning of a long journey. Our vision is broad: we want to enable AI to be used by anyone, anywhere. We hope that our infrastructure acts as a catalyst that unlocks more AI developers, more innovation, and ultimately better AI products.