MAX 24.3 - Introducing MAX Engine Extensibility

May 2, 2024

Modular Team

Today, we’re thrilled to announce the launch of MAX 24.3, highlighting a preview of the new MAX Engine Extensibility API that allows developers to unify, program, and compose their AI pipelines on top of our next-generation compiler and runtime stack for best-in-class performance.

MAX Engine is a next-generation compiler and runtime library for running AI inference. With support for PyTorch (TorchScript), ONNX, and native Mojo models, it delivers low-latency, high-throughput inference on a wide range of hardware to accelerate your entire AI workload. Furthermore, the MAX platform empowers you to harness the full potential of the MAX Engine through the creation of bespoke inference models using our MAX Graph APIs.

In this release, we continue to build on our incredible technology foundation to improve programmability of MAX for your workloads with 24.3 features including:

  • Custom Operator Extensibility: Write custom operators for MAX models using the Mojo programming language for intuitive and performant extensibility.
  • Mojo 🔥 Improvements: The Mojo language and standard library continue to mature, with key improvements that will be welcomed by Python experts. This includes enhancements to built-in types like Tuple, and support for optional and variadic arguments in function types. Read the What’s New in Mojo 24.3 blog post and check out the complete list of changes in the Mojo 24.3 changelog.
  • Fewer Dependencies and Smaller Package Size: TensorFlow support has been removed from the standard MAX package, making it 60% smaller, resulting in faster download times and fewer dependencies. TensorFlow is still available for enterprise users. Contact us for more information.
  • Community-Driven Innovation: Following the open sourcing of the Mojo standard library, this release includes the first community-submitted PRs to the Mojo standard library – featuring 32 significant community contributions improving the built-in types and usability of the Mojo standard library. Together with our amazing community, we’re shaping the future of AI development!

Custom Operators

One of the major features of the MAX 24.3 release is a preview of the ability to easily work with custom operations when building AI pipelines. If you’re new to custom operations in AI pipelines, they are operations that you define and implement on your own. This matters a lot when you want to build novel mathematical operations or algorithms that might not ship out-of-the-box in AI frameworks like PyTorch or ONNX, when you need to optimize performance for specific tasks, or when you need to utilize hardware acceleration that is not natively supported in these frameworks.

Writing custom operations in frameworks like PyTorch and ONNX is challenging because:

  • They are far too slow when implemented in Python.
  • Implementing them in low-level languages like C++ and CUDA requires deep expertise.
  • There’s no guarantee your custom operation will work consistently across different hardware platforms.
  • Maintaining custom ops to work with new framework updates is incredibly challenging.

In addition to all this, the API surfaces you have to work with when implementing, registering, building, and executing custom operations are cumbersome. The traditional AI stack is fragmented and slows down AI innovation for everyone.

It doesn’t have to be this hard

Imagine a world where you can write a custom operation for your AI workload that seamlessly executes across hardware, compiles cleanly, and generates platform-independent packages. It's as easy as writing, packaging, and executing your op. Rather than just tell you, let's walk through an example from our docs. We'll add a custom op for ONNX's Det operator, which is currently not supported in MAX Engine, so any ONNX model using it fails to compile. To add it, we start by writing the op in Mojo:

```mojo
from python import Python
from .python_utils import tensor_to_numpy, numpy_to_tensor
from max import register
from max.extensibility import Tensor, empty_tensor


@register.op("monnx.det_v11")
fn det[type: DType, rank: Int](x: Tensor[type, rank]) -> Tensor[type, rank - 2]:
    try:
        print("Hello, custom DET!")
        var np = Python.import_module("numpy")
        var np_array = tensor_to_numpy(x, np)
        var np_out = np.linalg.det(np_array)
        return numpy_to_tensor[type, rank - 2](np_out)
    except e:
        print(e)
        return empty_tensor[type, rank - 2](0)
```
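The `rank - 2` in the signature reflects the semantics of the Det operator: it computes determinants over the trailing two (square) dimensions, reducing the tensor's rank by two. Here's a quick sketch of that behavior using NumPy, which is exactly what the kernel above delegates to:

```python
import numpy as np

# A batch of 4 square 3x3 matrices: shape (4, 3, 3), rank 3.
x = np.random.rand(4, 3, 3)

# np.linalg.det computes one determinant per trailing 2-D matrix,
# collapsing the last two dimensions: result shape (4,), rank 1.
d = np.linalg.det(x)
print(x.ndim - d.ndim)  # 2
```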

To package the custom op, create a directory that includes the above Mojo code, plus an empty __init__.mojo file. Then, pass that directory name to the mojo package command like below:

```
custom_ops
├── __init__.mojo
├── det.mojo
└── python_utils.mojo
```

```shell
mojo package custom_ops
```

And then all we have to do is load our ONNX model into MAX Engine with the custom op and run inference with the Python API:

```python
from max import engine
import numpy as np

session = engine.InferenceSession()
model = session.load("onnx_det.onnx", custom_ops_path="custom_ops.mojopkg")

for tensor in model.input_metadata:
    print(f"name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}")

input_x = np.random.rand(3, 3, 5).astype(np.float32)
input_a = np.random.rand(5, 3).astype(np.float32)
input_b = np.random.rand(3).astype(np.float32)

result = model.execute(X=input_x, A=input_a, B=input_b)
print(result)
```

And boom 💥 the ONNX model with our new op is optimized and executing:

```
Compiling model...
Done!
name: X, shape: [3, 3, 5], dtype: DType.float32
name: A, shape: [5, 3], dtype: DType.float32
name: B, shape: [3], dtype: DType.float32
Hello, custom DET!
{'Z': array([-0.04415698, -0.00949615,  0.07051321], dtype=float32)}
```
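As a sanity check on those shapes: if we assume the model computes Det(X·A + B) (the exact graph inside onnx_det.onnx isn't shown here, but the input and output shapes are consistent with that), the output shape Z = (3,) follows directly, since (3, 3, 5) times (5, 3) gives (3, 3, 3) and Det collapses the trailing two dimensions:

```python
import numpy as np

x = np.random.rand(3, 3, 5).astype(np.float32)
a = np.random.rand(5, 3).astype(np.float32)
b = np.random.rand(3).astype(np.float32)

# (3, 3, 5) @ (5, 3) -> (3, 3, 3); adding B broadcasts along the last axis.
m = x @ a + b

# Det over the trailing two square dimensions: (3, 3, 3) -> (3,)
z = np.linalg.det(m)
print(z.shape)  # (3,)
```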

And that's not all ….

Not only can you add custom operators to existing ONNX and PyTorch models, you can also write your own MAX Graphs with custom operation extensions in Mojo: a cleaner, more performant approach with far fewer dependencies. MAX Graphs make the whole process of building and implementing a custom operation much easier for everyone. When using the MAX Graph API, you also benefit from:

  • AI Pipelines Unification: Centralize all your AI workflows with a single source of truth. By authoring kernels in Mojo, developers write kernels once and reuse them across different ML frameworks by simply adding one annotation, ensuring seamless integration and fewer discrepancies.
  • Accelerated Development Cycle: Experience a smoother, faster development process with no need for traditional build processes, linkers, or C++ compilers. MAX Engine simplifies your workflow, allowing you to iterate and deploy rapidly.
  • Built-in Performance Optimization: Leverage automatic kernel fusion and graph optimizations with the MAX Engine runtime for free, letting you focus on building your AI pipelines rather than optimization.
  • Zero Cost Abstractions: Because MAX Engine and Mojo are built on top of MLIR, they take full advantage of modern compiler infrastructure. Your code is inlined and participates directly in the compiler's optimization phases, enhancing efficiency and reducing overhead.
  • Portability: MAX Engine ensures that your Mojo code is portable across diverse platforms and hardware setups, as long as they support LLVM or MLIR. This provides hardware optionality and frees you from platform-specific constraints, broadening your deployment options.
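To make the kernel-fusion point above concrete with a toy, framework-free sketch (plain Python, not MAX code): fusing two elementwise passes into one avoids materializing an intermediate buffer, which is the kind of rewrite a fusing compiler performs for you automatically:

```python
def scale_then_relu_unfused(xs):
    # Two passes: the first materializes a full intermediate list.
    tmp = [v * 2.0 for v in xs]
    return [max(v, 0.0) for v in tmp]

def scale_then_relu_fused(xs):
    # One pass, no intermediate buffer: what a fusing compiler rewrites to.
    return [max(v * 2.0, 0.0) for v in xs]

data = [-1.5, 0.0, 2.5]
assert scale_then_relu_fused(data) == scale_then_relu_unfused(data)
print(scale_then_relu_fused(data))  # [0.0, 0.0, 5.0]
```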

Check out some MAX Graph API examples here - we have a lot more incredible additions coming to MAX Graphs soon.

Download MAX 24.3 Now 🚀

MAX 24.3 is just the beginning of some incredible updates we're delivering on our MAX roadmap, with macOS support and quantization coming soon and GPU support coming this summer. Head over to the Modular Developer Portal now to download MAX 24.3 and get started. Read the docs to learn more, check out our examples to learn how to build custom operators for the MAX Engine, and share your feedback on how we can improve upon this preview release.

We’re excited to see what you build with MAX 24.3 and Mojo. Get started with MAX and Mojo now!

Until next time! 🔥

Modular Team