Blog

Democratizing AI Compute Series
Go behind the scenes of the AI industry with Chris Lattner

Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA
In this blog post, we’ll continue our journey to build a state-of-the-art (SOTA) matmul kernel on NVIDIA Blackwell by exploring the cluster launch control (CLC) optimization. At the end of the post we’ll improve our performance by another 15% and achieve 1772 TFLOPs, exceeding that of the current SOTA.

Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul
In this post we are going to continue our journey and improve our performance by more than 50x our initial kernel benchmark. Along the way we are going to explain more GPU programming concepts and leverage novel Blackwell features.

Matrix Multiplication on Blackwell: Part 1 - Introduction
This series of blog posts will showcase how one can: 1. Write a high-performance GPU kernel on Blackwell that offers performance competitive to that of NVIDIA's cuBLAS implementation. 2. Shows how one can leverage Mojo's special features to make the kernel as simple as possible.

How is Modular Democratizing AI Compute? (Democratizing AI Compute, Part 11)
Given time, budget, and expertise from a team of veterans who’ve built this stack before, Modular set out to solve one of the defining challenges of our era: how to Democratize AI Compute. But what does that really mean—and how does it all add up?
No items found within this category
We couldn’t find anything. Try changing or resetting your filters.

Get started guide
Install MAX with a few commands and deploy a GenAI model locally.
Read Guide
Browse open models
500+ models, many optimized for lightning-fast performance
Browse models




