Today, we’re excited to release Modular Platform 25.7, an update that deepens our vision of a unified, high-performance compute layer for AI. With a fully open MAX Python API, an experimental next-generation modeling API, expanded hardware support for NVIDIA Grace superchips, and a safer, more capable Mojo GPU programming experience, this release moves us closer to an ecosystem where developers spend less time fighting infrastructure and more time advancing what AI can do.
MAX: Faster, Easier, and More Open
MAX continues to evolve into the fastest cross-vendor inference framework available. In 25.7, we’ve deepened MAX across three fronts: openness, modeling flexibility, and production-grade control.
MAX: A Great Way to Build High-Performance Inference Models
With this release, the entire Python interface to MAX is now fully open-source. This gives developers:
- Visibility into how MAX models are built, executed, and served across hardware vendors.
- Real examples of how features like MAX ↔ PyTorch interoperability were implemented.
- Greater clarity on internal abstractions through newly-visible unit tests, making it easier to reason about performance and correctness.
- Better bug reporting and contribution paths, enabling faster iteration across the ecosystem.
We’ve been working hard to refine our APIs, enable a first-class "eager" programming model, and lean into the power of our next-generation technology stack. It’s now time to open up to more developers, so we’re open-sourcing the full API and starting to teach how to build MAX-native models.
The New (Experimental) Model API and Development Workflow
25.7 introduces a redesigned Model API consisting of the new max.nn.module_v3 and max.experimental.tensor – together, they deliver our most significant modeling upgrade since MAX launched.
Our new API gives developers an experience more aligned with the higher-level model abstractions they’re already familiar with, bringing better modeling ergonomics to MAX. You write models the way you’re used to, and MAX handles the heavy lifting underneath, as the sketch after this list shows:
- PyTorch-like syntax with minimal cognitive overhead for defining and composing models
- Lazy evaluation with eager semantics that match PyTorch’s eager mode, for easier debugging
- No more manual graph construction or session-load-run ceremonies
- Intuitive weight loading with load_static_dict(), similar to PyTorch
- Pure MAX/Mojo stack: no dependencies on PyTorch, NumPy, or other frameworks
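To make this concrete, here’s a minimal sketch of a model written against the new API. The package paths (max.nn.module_v3, max.experimental.tensor) are the ones named in this release; the specific layer class, constructor arguments, and tensor helpers below are illustrative assumptions rather than the definitive API surface.

```python
# Hedged sketch of the experimental Model API. Package paths come from this
# post; the Module/Linear names and Tensor helpers are assumptions and may
# differ from the actual API.
from max.experimental.tensor import Tensor
from max.nn.module_v3 import Linear, Module  # assumed module primitives


class TinyMLP(Module):
    """A two-layer MLP defined with PyTorch-like composition."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.up = Linear(d_in, d_hidden)
        self.down = Linear(d_hidden, d_out)

    def __call__(self, x: Tensor) -> Tensor:
        # Lazy evaluation with eager semantics: intermediates can be printed
        # or inspected while debugging, with no explicit graph construction.
        hidden = self.up(x)
        return self.down(hidden)


model = TinyMLP(16, 64, 4)
# model.load_static_dict(weights)  # weight loading, as described above
out = model(Tensor.ones([2, 16]))  # assumed constructor for a ones tensor
print(out.shape)
```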
When you’re ready for production, just add model.compile(input_type) to unlock the performance benefits of full model compilation. This enables ahead-of-time graph compilation with the speed and memory-efficiency benefits you’ve come to expect from MAX.
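Continuing the sketch above, opting into compilation is a one-line change. The post specifies only the model.compile(input_type) call itself; describing the input with max.graph.TensorType here is an assumption about how that type might be spelled.

```python
# Hedged continuation of the sketch above: ahead-of-time compilation.
# Whether compile() accepts exactly this input description is an assumption.
from max.dtype import DType
from max.graph import DeviceRef, TensorType

input_type = TensorType(DType.float32, shape=[2, 16], device=DeviceRef.CPU())
compiled = model.compile(input_type)  # full ahead-of-time graph compilation
out = compiled(Tensor.ones([2, 16]))  # same calling convention, compiled execution
```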
The latest API dramatically speeds up model development, debugging, and customization – especially for teams extending frontier-scale architectures. To get started with the new model APIs, check out our new online book on building an LLM from scratch with MAX: a guided lesson that starts with GPT-2 and explains each component of the transformer model along the way. We’d love your feedback on forum.modular.com, and you can report bugs or feature requests on github.com/modular/modular/issues.
Expanded Model and Hardware Support
25.7 significantly broadens MAX’s reach across both models and hardware, ensuring high-performance inference on the newest accelerators and system architectures.
- MAX now supports bfloat16 models running on GPUs attached to ARM CPU hosts, including Grace Hopper (GH200) and Grace Blackwell (GB200) systems. This unlocks higher performance and lower power consumption on next-generation NVIDIA platforms built around the Grace superchip architecture.
- MAX delivers outsized performance wins on vision models, unlocking 30-80% additional throughput on Qwen2.5-VL compared to 25.6 and more than 2x the performance of vLLM.
- Early support for GPT-OSS has landed, with MXFP4 integration coming soon to unlock major 4-bit performance gains.
Together, these updates deepen MAX’s position as the fastest fully portable inference engine. Stay tuned for very significant model and performance updates in our next release.
Dynamic LoRA for Real-Time Specialization
25.7 introduces support for Dynamic LoRA, initially tuned for speech and low-latency real-time workloads. This allows developers to:
- Hot-swap LoRA adapters during runtime
- Personalize models without restarting the server
- Use LoRA for speaker-specific or domain-specific behaviors
- Achieve near-zero-overhead tuning for production workloads
This is essential for teams that require rapid, high-fidelity model adaptation at minimal cost. Today, dynamic LoRA in MAX powers real-time applications like Inworld AI’s voice cloning, and we’ll make it broadly available for more models shortly.
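For a sense of what per-request adapter selection can look like against an OpenAI-compatible endpoint (the interface MAX serving exposes), here’s a rough sketch. The endpoint URL, the adapter identifier, and the convention of selecting an adapter through the model field are assumptions for illustration, not the documented MAX LoRA interface.

```python
# Illustrative only: choosing a hot-swapped LoRA adapter per request against an
# OpenAI-compatible server. The URL, adapter name, and "model"-field convention
# are assumptions; consult the MAX serving docs for the actual interface.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed local MAX serve address
    json={
        "model": "base-model:speaker-042-adapter",  # hypothetical adapter identifier
        "messages": [
            {"role": "user", "content": "Read this line back in the target voice."}
        ],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```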
Mojo: Safer, Better Apple GPU Support, and More
Mojo is evolving into the simplest, most portable way to write high-performance GPU code. In the process, we are enhancing our hardware support, adding new language features, and improving the overall safety of the language.
Enhanced Apple Silicon GPU Support
We launched initial support for Apple silicon GPUs in 25.6, and in 25.7 we’ve expanded coverage of the fundamental intrinsics and capabilities of these GPUs. This includes:
- Support for synchronization primitives
- Access to shared memory
- Mapping of warp operations to Metal equivalents
As a result, you can now run 20 of our popular Mojo GPU puzzles on Apple silicon GPUs in 25.7, up from 5 GPU puzzles in our last release.
Mojo now provides one of the cleanest paths to developing AI kernels on Apple hardware — no vendor-specific APIs required.
Safer GPU Programming
Correctness and illegal memory access errors in GPU code can be subtle and extremely hard to diagnose. Catching these earlier and making them more obvious saves an enormous amount of time when writing new kernels or AI models. In 25.7, some significant improvements have been made to the Mojo language and libraries to catch common GPU programming issues:
- GPU functions now have strong type checking using enqueue_function_checked(), identifying many potential crashes and bad memory accesses before they happen.
- A newly reworked UnsafePointer type that fixes some frequent issues with lifetimes and accidentally unsafe changes.
- No more implicit conversions between Int and UInt types, a source of common numerical correctness issues in GPU kernels.
- Much better error messages around constraint failures: traces, line numbers, and printed parameter values.
- Expanded support for using Address Sanitizer with Mojo code to identify memory leaks and illegal accesses.
Beyond language and library features, a new TestSuite module lays the foundation for better unit testing in Mojo. It provides a clean interface for test results, automatic test discovery, and more.
These safety improvements align with our long-term goal: Mojo makes GPU programming feel like writing high-level, safe systems code.
Try the Latest Updates and Join the Community!
The best way to experience everything in 25.7 is to try it yourself.
You can get everything you need to deploy an LLM with MAX, write GPU kernels with Mojo, and build models with the experimental API today by installing the modular package with pip, uv, pixi, or conda. For more details, see our quickstart guide.
Once you’re set up, you can:
- Explore the fully open-source MAX Python API on GitHub
- Walk through our guided lessons to build a transformer LLM from scratch
- Review the complete list of changes in the MAX and Mojo changelogs
- Browse the updated Guides section on docs.modular.com, now including all developer guides and tutorials for the Modular Platform
25.7 is a major step toward a unified, fully portable compute layer for AI – and we’re building it in the open. Your questions, feedback, and contributions directly shape the platform, so join the discussion and report any issues or request features.
We’re excited to see what you build. Come experiment, contribute, and help define the future of high-performance, hardware-portable AI with us.