Blog

Democratizing AI Compute Series
Go behind the scenes of the AI industry with Chris Lattner

Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA
In this blog post, we’ll continue our journey to build a state-of-the-art (SOTA) matmul kernel on NVIDIA Blackwell by exploring the cluster launch control (CLC) optimization. By the end of the post, we’ll have improved performance by another 15% and reached 1772 TFLOPs, exceeding the current SOTA.

Matrix Multiplication on Blackwell: Part 1 - Introduction
This series of blog posts will showcase how one can: 1. Write a high-performance GPU kernel on Blackwell that offers performance competitive with that of NVIDIA's cuBLAS implementation. 2. Leverage Mojo's special features to make the kernel as simple as possible.
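For a concrete sense of the baseline such a series starts from, here is a minimal, deliberately naive dense matmul written in plain CUDA C++. It is not code from the series (which is written in Mojo), and the kernel and variable names are purely illustrative: each thread computes one output element, with none of the Blackwell-specific optimizations the posts cover.

// Naive CUDA matmul sketch: one thread per element of C = A * B.
// Illustrative only -- not code from the blog series.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void naive_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

int main() {
    const int M = 1024, N = 1024, K = 1024;
    float *A, *B, *C;
    cudaMallocManaged(&A, sizeof(float) * M * K);
    cudaMallocManaged(&B, sizeof(float) * K * N);
    cudaMallocManaged(&C, sizeof(float) * M * N);
    for (int i = 0; i < M * K; ++i) A[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0f;

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    naive_matmul<<<grid, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();

    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * K);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

Compiled with nvcc (for example, nvcc naive_matmul.cu -o naive_matmul, with the file name chosen here for illustration), it prints C[0] = 2048.0 for the 1024x1024x1024 problem above. The gap between this kind of one-thread-per-output kernel and cuBLAS-class performance is what the series sets out to close in Mojo.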