June 7, 2024

MAX 24.4 - Introducing quantization APIs and MAX on macOS

Today, we're thrilled to announce the release of MAX 24.4, which introduces a powerful new quantization API for MAX Graphs and extends MAX’s reach to macOS. Together, these let developers use a single toolchain to build Generative AI pipelines locally and seamlessly deploy them to the cloud, all with industry-leading performance. Using the Quantization API reduces the latency and memory cost of Generative AI pipelines by up to 8x on desktop platforms like macOS, and up to 7x on cloud CPU architectures like Intel and Graviton, without requiring developers to rewrite models or update any application code.

Release highlights:

  • New Quantization API for MAX Graphs: Reduce LLM inference latency by 7x compared to llama.cpp context encoding through decreased memory usage and improved performance with the new Quantization API. MAX Graph supports BF16, INT4, and INT6 quantization, including K-Quantization.
  • Llama on MAX: MAX 24.4 includes new implementations of Llama 3 and Llama 2, demonstrating the full power of MAX Graphs and the Quantization API!
  • MAX on macOS: Everything you love about MAX is now available on Apple silicon, further enhancing MAX’s programmability and portability. Developers can build state-of-the-art AI pipelines locally and seamlessly deploy them to cloud systems like Intel x86 and ARM Graviton.
  • Mojo 🔥 Improvements: Mojo 24.4 features several performance improvements to the core language and standard library. For the core language, def functions are more flexible, and advanced users will appreciate more capable loop unrolling features, the ability to return safe references, and many others.
  • Community-Driven Innovation: Mojo 24.4 features 214 community pull requests from 18 contributors, with 30 contributed features accounting for 11% of all improvements in the latest Mojo release. These include performance and quality-of-life improvements to the standard library collections, new data and filetype operations, and updates to SIMD bitwise operations. You can find a complete list of contributors and enhancements in the Mojo 24.4 release notes.

You can get started with MAX 24.4 now through the Modular developer console or by installing it directly from your terminal:

curl -s https://get.modular.com | sh -

Head over to the MAX documentation for complete instructions on installing or updating MAX.

Quantization API

The new MAX Quantization API is a huge step in bringing state-of-the-art performance to models built with MAX Graphs.

Why does quantization matter?

Token generation in LLMs is memory-bound: each generated token requires reading the model weights from memory. Reducing the size of the model weights from FP32 to INT4 shrinks the amount of data that has to be read, which improves performance roughly in proportion without significantly impacting model quality. Reduced model size also decreases the hardware requirements for running LLMs, making models more widely available and cost-effective to run.
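To make the memory argument concrete, here is a minimal Python sketch of symmetric block-wise 4-bit quantization and the resulting weight footprint. It is not the MAX Quantization API and does not reproduce any specific format MAX supports; the function names, the block size of 32, and the 16-bit per-block scale are all assumptions chosen for illustration.

```python
# Illustrative sketch only: symmetric block-wise 4-bit quantization.
# NOT the MAX Quantization API; every name and constant here is hypothetical.

BLOCK_SIZE = 32  # assumed number of weights that share one scale


def quantize_block(values):
    """Map a block of floats to signed 4-bit integers plus one float scale."""
    max_abs = max(abs(v) for v in values)
    # Scale so the largest magnitude maps to 7, the largest positive 4-bit value.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate float values from the 4-bit integers."""
    return [scale * q for q in quants]


def bytes_fp32(num_weights):
    """Footprint of full-precision weights: 4 bytes each."""
    return num_weights * 4


def bytes_int4(num_weights, block_size=BLOCK_SIZE, scale_bytes=2):
    """Footprint at 4 bits per weight plus one scale (assumed 16-bit) per block."""
    num_blocks = -(-num_weights // block_size)  # ceiling division
    packed = (num_weights + 1) // 2             # two 4-bit values per byte
    return packed + num_blocks * scale_bytes


if __name__ == "__main__":
    block = [0.1 * i - 1.6 for i in range(BLOCK_SIZE)]
    scale, quants = quantize_block(block)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.4f}, worst-case error={worst:.4f}")

    n = 8_000_000_000  # a hypothetical 8-billion-parameter model
    print(f"FP32: {bytes_fp32(n) / 1e9:.1f} GB, INT4: {bytes_int4(n) / 1e9:.1f} GB")
```

Under these assumptions the per-block scales add a small overhead, so the footprint shrinks by somewhat less than the ideal 8x that the bit widths alone would suggest.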

MAX’s quantization API makes transitioning from full precision to INT4 quantization easier, a massive win for the MAX and Mojo developer communities.

New Quantized Llama Models

To highlight the power of MAX’s quantization API, we’re releasing two new quantized LLMs as part of the MAX 24.4 release: Llama 3 and Llama 2. Both are built entirely in Mojo 🔥 using the MAX Graph API, and they are the first in a series of models meeting the need for state-of-the-art LLMs that are performant and portable across all CPU types.

You can download Llama 3 now and try it out!

# get the latest MAX examples from GitHub
git clone https://github.com/modularml/max.git
# navigate to the llama3 pipeline
cd max/examples/graph-api/pipelines/llama3
# run INT4 quantized llama3!
mojo ../../run_pipeline.🔥 llama3 \
  --prompt "I believe the meaning of life is"

Read more in the MAX Getting Started guide.

MAX on macOS

MAX is now available for macOS, delivering the full suite of acceleration and inference APIs to Apple silicon. This includes the new, fully quantized Llama 3 model, which has a more than 8x performance boost in context encoding using INT4 compared to FP32.

Developers can seamlessly transition from building SOTA models on their development machines to putting them into production on Intel x86 and ARM Graviton cloud-serving infrastructure. We’re excited to expand the portfolio of hardware platforms supported by MAX, delivering on the promise of programmability and portability. You can get started with MAX on macOS today!

New Developer Resources

To support our growing community of developers and users, we’ve completely reworked our documentation to focus on the user journey with MAX. There’s now a single API reference to cover the entire MAX platform and a new Getting Started guide that makes it faster and easier to get MAX up and running. This new experience makes it easier to understand and use all of MAX’s capabilities, and we’re excited to see what the Mojo and MAX communities will build!

In addition to the refreshed docs, we’re excited to announce Modular AI resources: a centralized hub for the latest and most relevant research papers on LLMs, Generative AI, and optimized ML systems.

🚀 Get Started with MAX 24.4!

Download MAX 24.4 to get started with the new MAX Graph Quantization API and start accelerating your models today. Read the docs to learn more, and check out our examples on how to run Llama 3 with the MAX Engine.

We’re excited to see what you build with MAX 24.4 ⚡️ and Mojo 🔥!

Until next time! 🔥


Modular Team