Increasing development velocity of giant AI models

August 12, 2022

River Riddle

AI Compiler Engineer

Eric Johnson

Product Lead

This post is part one of a two-part series on how we are improving productivity for the engineers and scientists developing large models. Subscribe to our newsletter to get future posts delivered to your inbox!

Machine learning models are getting larger and larger — some might even say, humongous. The world’s most advanced technology companies have been in an arms race to see who can train the largest model (MUM, OPT, GPT-3, Megatron), while other companies focused on production systems have scaled their existing models to great effect. Through all the excitement, what’s gone unsaid is the myriad of practical challenges larger models present for existing AI infrastructure and developer workflows.

One of the many challenges of developing large models is the painful experience of working with tooling that isn’t equipped to deal with models with huge weights — which can be upwards of 100 gigabytes. An unspoken truth is that AI deployment engineers often have to wait minutes or even hours for tooling to work with large models. This isn’t great for productivity, isn’t how we want AI specialists to spend their time, and reduces the ability to iterate rapidly.

At Modular, we recognize that developer productivity is a significant part of the cost of training and deploying models. We are constantly optimizing our toolchain to improve the lives of our early customers and our internal developers. This post discusses the technical challenges of managing many gigabytes of data in the compilation process and the changes we have made to our infrastructure (and the MLIR compiler framework) to solve them.

Working with AI models is uniquely challenging

If you aren’t familiar with graph transformations, optimizations, and compilers in machine learning, there is no need to be intimidated. These are techniques used to increase the performance and portability of an AI model, or to enable it to deploy to some target hardware. There are high-level “compilers” like the TensorFlow Lite Converter, which transforms and optimizes a TensorFlow SavedModel into a highly optimized program format (i.e., a FlatBuffer) for execution on edge devices. There are also domain-specific compilers like XLA and the TorchScript JIT compiler, which create or consume an intermediate representation (e.g., a “graph”) of an AI model and compile it to another format, such as machine code or a domain-specific runtime representation (e.g. CUDA graphs).

Compiling an AI graph is actually quite different from traditional compilation problems. An AI graph contains two things: 1) the graph topology (how the layers are interconnected) and 2) the model weights (parameters associated with specific layers). In terms of size, the graph topology is on the order of kilobytes, whereas weights are on the order of megabytes and gigabytes. For example, look at some of the bigger models released by Meta. The Open Pre-trained Transformers have 30B, 66B, or even 175B+ parameters, which equates to 100+ gigabytes of weights. There are even larger models, like Gopher and Megatron.
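For a rough sense of the arithmetic, here is a back-of-the-envelope sketch. It assumes 16 bits (2 bytes) per parameter; real checkpoints vary in precision and format, so treat the numbers as order-of-magnitude only.

```cpp
#include <cstdio>

int main() {
  // Assumption: 2 bytes per parameter (fp16). Actual checkpoints vary.
  const double bytesPerParam = 2.0;
  const double gib = 1024.0 * 1024.0 * 1024.0;

  const double params[] = {30e9, 66e9, 175e9};
  for (double p : params)
    std::printf("%4.0fB params -> ~%.0f GiB of weights\n", p / 1e9,
                p * bytesPerParam / gib);
  // Prints roughly: 30B -> ~56 GiB, 66B -> ~123 GiB, 175B -> ~326 GiB.
  return 0;
}
```

Even at the small end of that table, the weights dwarf the graph topology by several orders of magnitude.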

Training Compute-Optimal Large Language Models (DeepMind)

Large models are not handled well by existing tools in the AI ecosystem. For example, protobufs have an inherent 2-gigabyte limit, creating problems for model formats backed by this serialization format. In the latest version of TensorRT, “transformer-based networks such as BERT and GPT, can consume CPU memory up to 10 times the model size during compilation”, which makes it difficult to use for large models. In ONNX, users must split the model weights across many files to support large models. All of this introduces unnecessary complexity into the AI development workflow, can mean losing a “single source of truth” for a model, and generally makes it harder to distribute models.

To compound the situation, the heft of these weights can lead you to add workarounds that complicate your overall AI development workflow. For example, at Modular, we built a mechanism for caching temporary files because certain compiler stages were taking 2+ minutes — which broke our developers out of interactive flow state.

Like other workarounds, we realized that this caching was a “duct tape solution”: it wasn’t 100% reliable and didn’t help when the cache missed. Because we care so much about developer productivity, we decided to tackle the core of the problem.

MLIR in the Modular compilation stack

The Modular stack leverages the MLIR compiler infrastructure to represent and transform AI models, including AI operator graphs (for multiple frameworks), mid-level runtime primitives, and low-level machine code generation. Our team includes many of MLIR’s foundational architects, who were deeply involved in releasing MLIR to the world, and we continue to actively maintain large portions of core MLIR today.

Multi-level Intermediate Representation (MLIR)

MLIR is a sub-project of the LLVM compiler infrastructure project that provides a modern toolkit for building domain-specific compilers. It provides a set of core building blocks necessary for modeling, analyzing, and transforming a wide range of computational domains, including hardware design, quantum computing, and artificial intelligence, among many others.

MLIR has allowed us to build a single cohesive system that spans the entire stack, and that is more powerful, more layered, more extensible, and easier to maintain than conventional stacks. Using unified infrastructure lets improvements transfer easily across our tooling stack and enables greater composability and modularity across our entire development workflow.

Modular is not alone in leveraging MLIR — many other systems use MLIR for representation and transformation, including TensorFlow, XLA, PyTorch, ONNX, etc. As this ecosystem continues to grow, we can all celebrate MLIR’s benefits, but we must also continue to invest in its evolution.

MLIR is a good thing, but its approach for managing weights was not!

One of the fundamental building blocks of MLIR is an Attribute, which you can think of as a form of constant data that is “unique’d” (aka memoized or interned). Attributes are user extensible, meaning they may take various forms depending on the use case. Attributes are used for things like constant expression values (e.g. “5”, “10.0”, etc.), string literals, enumerators (e.g. “less than”, “greater than”, “equal to”, etc.), arrays of data, and far more. Most MLIR-based AI tooling uses attributes to hold the weights of AI models.
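As a rough illustration, here is a minimal sketch against the upstream MLIR C++ APIs (not our internal code) of how attributes are typically built, whether they hold a small scalar or a tensor of weights:

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/MLIRContext.h"

using namespace mlir;

int main() {
  MLIRContext context;
  Builder builder(&context);

  // Small scalar attributes: cheap to allocate and unique within the context.
  Attribute five = builder.getI32IntegerAttr(5);
  Attribute pi = builder.getF64FloatAttr(3.14159);
  Attribute name = builder.getStringAttr("my_layer");

  // A dense tensor attribute: this is how much MLIR-based AI tooling holds
  // model weights, regardless of whether the tensor is 4 bytes or 2 GB.
  auto type = RankedTensorType::get({2, 2}, builder.getF32Type());
  float values[] = {1.0f, 2.0f, 3.0f, 4.0f};
  Attribute weights = DenseElementsAttr::get(type, ArrayRef<float>(values));

  (void)five; (void)pi; (void)name; (void)weights;
  return 0;
}
```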

However, this is a problem: model weights can be enormous, and MLIR stores a two-gigabyte weight tensor the same way it stores a four-byte tensor — in an attribute containing a unique’d array of elements. The issue is obvious given that we just used the words unique’d and gigabytes so close together!

Here is the challenge: when something is unique’d in MLIR, it is allocated, hashed, and stored within an "MLIRContext". These objects have lifetimes attached to the MLIRContext, and they are not destroyed until the context is destroyed. This is great for small values because we can pass them around and compare unique'd objects by pointer, share allocations for attributes (very common), and more.
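Here is a small sketch of what uniquing means in practice, again using the upstream MLIR C++ APIs: constructing the “same” attribute twice hands back literally the same object, owned by the context.

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/MLIRContext.h"
#include <cassert>

using namespace mlir;

int main() {
  MLIRContext context;
  Builder builder(&context);

  // Both calls hash the value and look it up in the context's storage.
  Attribute a = builder.getI32IntegerAttr(42);
  Attribute b = builder.getI32IntegerAttr(42);

  // Uniquing means equality is just a pointer comparison...
  assert(a == b);

  // ...and the underlying storage lives as long as `context` does.
  // The same machinery applies when the value is a multi-gigabyte
  // weight tensor: it too is allocated, hashed, and kept alive here.
  return 0;
}
```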

These benefits turn into a liability with huge weight tensors: we don’t want to reallocate, copy, or unique them. We also don’t want them to live forever: it is important to deallocate big weights when the computation no longer references them. For example, when we run a tool that quantizes our model, it needs to transform the operator graph and generate new weights — and can end up with multiple copies of that data which all live for the duration of the compilation process.
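To make the lifetime problem concrete, here is a hedged sketch of a toy quantization step (the helper and its logic are hypothetical, not our actual quantizer). The point is that the new attribute joins the old one in the context, and neither can be freed until the context itself is destroyed.

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/MLIRContext.h"
#include <cstdint>
#include <vector>

using namespace mlir;

// Hypothetical helper: crude float -> int8 "quantization" of a weight
// attribute, assuming the input holds f32 elements.
static DenseElementsAttr quantizeWeights(DenseElementsAttr fp32Weights,
                                         Builder &builder) {
  std::vector<int8_t> quantized;
  for (float value : fp32Weights.getValues<float>())
    quantized.push_back(static_cast<int8_t>(value * 127.0f));

  auto newType = RankedTensorType::get(fp32Weights.getType().getShape(),
                                       builder.getIntegerType(8));
  // This allocates, hashes, and uniques ANOTHER copy of the data in the
  // MLIRContext. The original fp32 attribute is still alive and cannot be
  // deallocated until the context itself is destroyed.
  return DenseElementsAttr::get(newType, ArrayRef<int8_t>(quantized));
}
```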

Another problem for ML tooling was how MLIR serialized to the file system. When we started, MLIR had no binary serialization format, just a textual one. This is a problem for large weights because each byte of binary data gets emitted in hexadecimal form, taking twice the space of the data it encodes. That means we not only spend a long time producing the hex (about 20 seconds for a decently sized multi-gigabyte model), but our intermediate files are also twice as big as they should be: 2x an already big number!
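To see where the 2x comes from, here is a standalone sketch of hex encoding (not MLIR’s actual printer): every input byte becomes two output characters before any surrounding syntax is added.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Each input byte becomes two hex characters, so the textual form of an
// N-byte weight blob is at least 2*N bytes on disk, plus the surrounding
// attribute syntax, e.g. something like:
//   dense<"0x89504E47..."> : tensor<1024x1024xf32>
std::string toHex(const std::vector<uint8_t> &bytes) {
  static const char digits[] = "0123456789ABCDEF";
  std::string out;
  out.reserve(bytes.size() * 2);  // 2x the original size, up front.
  for (uint8_t byte : bytes) {
    out.push_back(digits[byte >> 4]);
    out.push_back(digits[byte & 0xF]);
  }
  return out;
}
```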

A bigger impact than just developer productivity

This well-intended design mechanism can cripple even the best compilers. The most obvious challenge is that it compounds the time necessary to compile, inspect, and transform a model. If you have ever used the excuse, "My code's compiling," you'll be aware of the pain this creates. Here, we are forcing the processor to continuously allocate, copy, and hash multiple gigabytes of data.

XKCD - Compiling

A bigger problem than compile time is that memory use impacts larger-scale architectural features in the Modular stack. For example, because our compiler and technology stack is highly parallel and utilizes advanced features like online search, memory use directly affects how much work we can do in parallel, which is important for getting the highest quality of results.

At Modular, it is core to our ethos that we build tools users will fall in love with. We realize that advanced features simply won’t get used if they are difficult to use, impact productivity, or come with significant caveats (e.g. they don’t work in all cases). We love that fixing these foundational problems with large weights allows us to subtract complexity from our users’ lives and workflows.

Fixing this the right way: core additions to MLIR

The team at Modular includes prominent stakeholders and maintainers of MLIR, and a big part of our culture is to “Build it Right”; this applies to every project we contribute to. As we contribute to and drive the evolution of MLIR, we have a vested interest in ensuring that each step is right for the project overall, and we collaborate with the MLIR community at large to gain consensus on our approach.

We took a step back to understand what we needed to solve this problem in large-model tooling, and listed out our requirements:

  • Only allocate memory when necessary: We know it is more efficient to memory-map large data (like weights) from disk instead of copying it into malloc’d blocks (see the sketch after this list).
  • No hashing or uniquing: We shouldn’t have to check equality of 2-gigabyte blobs of data; weights should be identified by name instead of being implicitly unique’d by content.
  • Enable inline mutation: If there is only one user of the data, we should be able to quantize, transform, and manipulate it in place instead of making a copy first.
  • Enable deallocation: Because the data we are working with is huge, we need to be able to deallocate it when the last reference to it is destroyed.
  • Fast serialization: Whether JITing, searching optimization parameters, or just iterating locally, we cache IR for many reasons, and it should be fast.
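As a sketch of the first point, here is a minimal POSIX example of memory-mapping a weights file (a generic illustration of the technique, not the mechanism we shipped). The OS pages data in lazily, so mapping a 100-gigabyte file copies nothing up front.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map a weights file read-only. Pages are faulted in on demand by the OS,
// so nothing is copied up front, and clean pages can be dropped under
// memory pressure instead of living for the whole compilation.
const void *mapWeights(const char *path, size_t &sizeOut) {
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return nullptr;

  struct stat info;
  if (fstat(fd, &info) != 0) {
    close(fd);
    return nullptr;
  }
  sizeOut = static_cast<size_t>(info.st_size);

  void *data = mmap(/*addr=*/nullptr, sizeOut, PROT_READ, MAP_PRIVATE, fd,
                    /*offset=*/0);
  close(fd);  // The mapping stays valid after closing the descriptor.
  return data == MAP_FAILED ? nullptr : data;
}

// Later, when the last reference to the weights goes away:
//   munmap(const_cast<void *>(data), size);
```

This also ties directly into the deallocation requirement: unmapping when the last reference disappears returns the memory immediately, rather than holding it for the lifetime of a context.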

None of these insights are particularly novel, but these are requirements that traditional compilers (e.g. for typical CPU programming languages) don’t run into.

As a result of our design work, we added two extensions to MLIR that better support our needs and generalize to many other use cases throughout the community. We are actively contributing these back upstream to the MLIR community.

In part two of this series, we discuss how we separated IR and data, and the overall end-state impact. Hint: it’s huge. If you are an experienced developer in AI infrastructure or an MLIR practitioner, you should definitely consider joining Modular. If you like AI and enjoy using great tools, please subscribe to our newsletter.
