This is part two of a two-part series on how we are fixing the compilation stack for developing with large models. Read part one here. Subscribe to our newsletter to get future posts delivered to your inbox!
A quick recap
In our last post, we highlighted several requirements we needed to satisfy for large-model development. A brief recap of these is below:
- Only allocating when necessary: It is more efficient to memory map large data (like weights) from disk, allowing the operating system to page in data on demand.
- No hashing or uniquing: Comparing 2 gigabytes of data for equality is something we should never be doing; weights should be identified by name instead of being implicitly uniqued by content.
- Enabling in-place mutation: If there is only one user of the data, we should be able to quantize, transform and manipulate data in place instead of making a copy of it first.
- Destruction: The data we are working with is huge, and we need to deallocate it when the last reference to the data is destroyed.
- Fast serialization: Whether JITing, searching optimization parameters, or just iterating locally, we cache IR for many reasons, and it should be fast.
Fixing the weight attributes
The first four requirements address one fundamental problem with how we've been using MLIR: weights are constant data, but shouldn't be managed like other MLIR attributes. Until now, we've been trying to place a square peg into a round hole, creating a lot of wasted space that's costing us development velocity (and, therefore, money for users of the tools).
We decided we needed to manage this weight data differently than other types of attributes. This prompted our first fundamental extension to MLIR, "Resources," a mechanism to separate data from its references within the computation.
Each blob of serialized MLIR may now contain additional sections, known as “resource” sections. These sections include either “dialect” resources (a dialect is essentially a namespace-like abstraction used when extending MLIR) or “external” resources (for toolchain-specific data). The data within these sections is represented as simple key-value pairs, creating a JSON-like structure, like so:
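A minimal sketch of this layout, written in Python purely for illustration. The `render_resources` helper is hypothetical and not part of MLIR; it prints nested key-value pairs, assuming the textual format wraps file-level metadata in `{-#` and `#-}` delimiters. The pipeline string is an illustrative value, not one taken from a real reproducer.

```python
def render_resources(sections: dict) -> str:
    """Render nested dicts as a JSON-like, key-value resource section."""

    def render(value, indent: int) -> str:
        pad = "  " * indent
        if isinstance(value, dict):
            inner = ",\n".join(
                f"{pad}  {key}: {render(val, indent + 1)}"
                for key, val in value.items()
            )
            return "{\n" + inner + "\n" + pad + "}"
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, str):
            return '"' + value + '"'
        return str(value)

    body = ",\n".join(
        f"  {name}: {render(entries, 1)}" for name, entries in sections.items()
    )
    return "{-#\n" + body + "\n#-}"


# An "external" resource holding reproducer configuration (illustrative values).
reproducer = {
    "external_resources": {
        "mlir_reproducer": {
            "pipeline": "builtin.module(func.func(cse,canonicalize))",
            "disable_threading": True,
        }
    }
}

text = render_resources(reproducer)
print(text)
```

The printed section names the resource kind first (`external_resources`), then the owning key (`mlir_reproducer`), then the individual entries.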
The above example shows how we’ve adapted MLIR to use resources for reproducers. An MLIR reproducer is effectively a configuration containing execution information, such as what transformation pipeline to execute, and is used to reproduce a failure or crash. Before resources, we represented this information using a comment at the top of an MLIR file. Instead, using resources, we have now incorporated it as a first-class piece of information.
To store weights, we can now use the resource section to hold the big data blob that used to be uniqued and immortalized. In the IR, we shift to using light-weight references for attributes instead of the underlying data:
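The idea can be sketched in Python as follows. `DenseResourceRef` is a hypothetical stand-in for such an attribute, not an MLIR API; the printed form assumes the `dense_resource<key>` spelling used by MLIR's textual format, and the key `model_weights` and the tensor type are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenseResourceRef:
    """Hypothetical attribute that stores a key, never the weight bytes."""

    key: str          # name of the blob in the resource section
    tensor_type: str  # e.g. "tensor<4xf32>"

    def __str__(self) -> str:
        return f"dense_resource<{self.key}> : {self.tensor_type}"


# The blob lives outside the IR, keyed by name (a plain dict in this sketch).
resources = {"model_weights": bytes(16)}  # 4 x f32, all zeros

ref = DenseResourceRef("model_weights", "tensor<4xf32>")
ir_line = f"%weights = arith.constant {ref}"
print(ir_line)

# The reference still prints even if the data has been dropped.
del resources["model_weights"]
ir_line_without_data = f"%weights = arith.constant {ref}"
```

Because the attribute carries only a name, the printed IR stays the same size no matter how large the blob is.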
Encoding resources this way also brings some secondary benefits:
- Printing IR for debugging is less error-prone, leading to a better development experience: Resources are specialized sections; we don’t have to worry about accidentally dumping 4 gigabytes to the screen while debugging something.
- We can soundly process the IR without the data present: With the IR only holding references to the data and not the data itself, we can omit the underlying resource data if desired. For example, this greatly simplifies reproducers that don’t need the big weight data (consider sending a colleague a 20-megabyte file instead of a 1.2-gigabyte file).
By introducing resources as a new concept, we’ve finally been able to build a clean separation between program and data. Now we never pass our weight data directly to an attribute. Instead, we pass a weak reference to the attribute and pass the data to a specialized manager. With this, we now have much more control over when and how weights are allocated, mutated, and destroyed.
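A minimal sketch of such a manager, assuming a simple name-keyed table with reference counts. `ResourceManager` and its methods are hypothetical names chosen for this illustration and do not reflect the actual MLIR implementation.

```python
class ResourceManager:
    """Hypothetical manager: owns blobs by name, frees them on last release."""

    def __init__(self):
        self._blobs = {}      # name -> bytearray (mutable storage)
        self._refcounts = {}  # name -> number of live references

    def insert(self, name: str, data: bytearray) -> str:
        if name in self._blobs:
            raise KeyError(f"resource {name!r} already exists")
        self._blobs[name] = data  # contents are never hashed or compared
        self._refcounts[name] = 1
        return name               # the attribute only ever holds this key

    def acquire(self, name: str) -> str:
        self._refcounts[name] += 1
        return name

    def release(self, name: str) -> None:
        self._refcounts[name] -= 1
        if self._refcounts[name] == 0:  # last reference gone: free the data
            del self._blobs[name], self._refcounts[name]

    def mutate_in_place(self, name: str, fn) -> None:
        if self._refcounts[name] != 1:
            raise RuntimeError("resource is shared; copy before mutating")
        blob = self._blobs[name]
        for i, byte in enumerate(blob):
            blob[i] = fn(byte)

    def get(self, name: str) -> bytearray:
        return self._blobs[name]

    def __contains__(self, name: str) -> bool:
        return name in self._blobs


manager = ResourceManager()
storage = bytearray([10, 20, 30, 40])
key = manager.insert("model_weights", storage)

manager.mutate_in_place(key, lambda b: b // 10)  # toy "quantization" step
same_buffer = manager.get(key) is storage        # no copy was made
manager.release(key)                             # last reference: blob freed
```

The single-owner check in `mutate_in_place` is one possible policy; a real system could instead copy on write when the blob is shared.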
A new binary encoding for MLIR
With a better representation of our weights, the only thing we needed now was a more efficient method of storing these weights when serializing our MLIR representation.
Until this point, MLIR only had a textual serialization format, which used ASCII hex strings for its weight representation. However, our end goal was to have our local development flow be as fast as possible. To do that, it became clear that we needed to remove text and add a proper binary format to MLIR.
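To make the cost of hex text concrete, here is a small Python sketch comparing the two representations. The 1 KiB random buffer is a stand-in for real weight data, and the `0x` prefix is an assumption about how the hex string is written.

```python
import os

weights = os.urandom(1024)  # stand-in for raw weight bytes

# Textual form: every byte becomes two ASCII hex characters.
as_text = "0x" + weights.hex()
# Binary form: the bytes are stored as-is.
as_binary = weights

print(len(as_text), len(as_binary))  # 2050 1024

# Reading the text back requires a parsing pass over every character,
# whereas the binary form can be used (or memory mapped) directly.
decoded = bytes.fromhex(as_text[2:])
```

The text form is twice the size before any parsing cost is counted, and that overhead grows linearly with the weights.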
Binary formats require a lot of consideration, especially as they often form the basis for stability in compilers. For MLIR, this takes extra care: we need to be efficient for our use cases (which can differ depending on who you ask), we want to be fast, and MLIR/LLVM cannot add dependencies on third-party encoding libraries.
One nice aspect of MLIR, though, is that its generality makes it nearly trivial to encode. All operations in MLIR have the same structure, so every operation uses the same encoding. Most of the complexity is making sure that our few core concepts are compact and very efficient. Given these constraints, we decided to use a custom encoding.
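The post does not describe the encoding itself, so the following is only a generic illustration of how a dependency-free format can keep small values compact: a standard unsigned LEB128 variable-length integer. This is a swapped-in, well-known technique and should not be read as the byte layout MLIR uses.

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as LEB128: 7 payload bits per byte."""
    if value < 0:
        raise ValueError("only unsigned values are supported")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: final byte
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0):
    """Decode one LEB128 value; return (value, offset just past it)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


# Small values (the common case for indices and counts) take a single byte.
encoded = encode_varint(300)
value, next_offset = decode_varint(encoded)
print(encoded.hex(), value, next_offset)
```

Since most indices and counts in a program are small, a scheme like this spends one byte on them instead of a fixed four or eight.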
What is the user impact?
In the end, adding resources and a binary encoding to MLIR has made our toolchain and development workflow significantly faster and substantially reduced our memory usage, improving both performance and development velocity. It has also benefited MLIR more broadly (more on that later).
To validate this, we tested our changes across models of various sizes, measuring the speed of a real-life lowering and optimization pipeline in our MLIR-based graph compiler (from a TensorFlow serialized model to the input format of our runtime) and the memory used during that process.
Speed: Compilation Workflow
MLIR is now significantly faster. Going from a serialized TensorFlow model (from a checkout of TensorFlow 2.10) to our runtime input format, a process that involves many transformations of the underlying program representation, was ~1.8-2x faster in terms of wall clock time than before, with speed scaling consistently across the various model sizes.
Diving a bit deeper, the TF serialized model processing is now basically instant: all our time is spent writing the large weight data to disk when generating the MLIR. In fact, the actual time spent in our code is about 10x faster than before. Most of the time is now bounded by the speed at which the SSD writes >1 gigabyte of data to disk.
For ML developers using our tools, this means faster model compilation, thereby improving productivity and iteration time. It also benefits production environments that load (and compile) models, for example when dynamically loading and unloading models based on incoming traffic, as in use cases with many personalized or fine-tuned user models.
Speed: Serialization
Serialization is also faster thanks to the introduction of a binary encoding. Interacting with MLIR via external tools depends on reading and writing serialized MLIR, whether for introspection, caching, reproducer generation, or other purposes. Again, we tested serialization performance across various model sizes and saw a significant speed-up, with peak performance being SSD bound. More specifically, reading textual data for larger models took ~5 seconds compared to <10ms for reading binary, and writing was more than ~5x faster for binary than for textual formats.
For Modular, this enables us to develop infrastructure and tooling around MLIR that would otherwise be prohibitively slow or expensive. For example, it would allow us to provide an efficient debugger that relies on caching model representations throughout the compilation workflow, to improve the underlying compiler performance, and much more.
Memory Usage
Finally, the mmap capabilities of our binary serialization and the separation of IR and data via resources have also significantly reduced memory consumption. Across all model sizes, we use less memory during the compilation process. Where before we had to allocate memory proportional to the size of the weights in a model, we no longer have to allocate at all for the weights, meaning we save significant memory every time we compile.
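The effect of memory mapping can be sketched with Python's standard `mmap` module. The file name and the 1 MiB size are stand-ins chosen for this example; real weight files would be far larger.

```python
import mmap
import os
import tempfile

# Write a stand-in "weights" file (1 MiB) to disk.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 4096)

with open(path, "rb") as f:
    # Map the file read-only; the OS pages data in only when it is touched.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

size = len(mapped)
view = memoryview(mapped)     # zero-copy view over the mapping
chunk = view[4096:4096 + 8]   # slicing a memoryview does not copy
first_bytes = bytes(chunk)    # only these 8 bytes are copied
print(size, list(first_bytes))

# Release the views before closing the mapping, then clean up.
chunk.release()
view.release()
mapped.close()
os.remove(path)
```

The process never allocates a buffer the size of the file; it only touches the pages it actually reads.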
Building for everyone
At Modular, we intend to lift the AI ecosystem for everyone, not just ourselves. The improvements we’ve discussed, both the new resource representation and the binary encoding, have been contributed upstream to the LLVM/MLIR repository. And while we were inspired to make these improvements to MLIR to solve our customers’ problems and improve our internal infrastructure, the impact of these changes isn’t limited to our use cases: it will enhance other products using MLIR as a foundational technology. For example, because of this work, the MLIR community is now discussing MLIR stability guarantees.
Ultimately though, the value of these contributions, and the evolution of these foundational technologies, flow directly into the products we build for our customers. These contributions represent a small look into the vast number of core improvements we are making. Whether you are working with large models or deploying on-device, Modular is actively working to make all of AI infrastructure significantly more performant and easier to use than anything else in existence today.
Lastly, we are so excited about the future of AI, the LLVM community, MLIR, and our contributions from the earliest days that we made the decision to become a Platinum Sponsor of LLVM, to continue to support and grow the LLVM community for many years to come.
If you are an experienced developer in AI infrastructure or an MLIR practitioner, you should definitely consider joining Modular. If you like AI and enjoy using great tools, please subscribe to our newsletter.