Blog


Democratizing AI Compute Series

Go behind the scenes of the AI industry with Chris Lattner


Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA

In this blog post, we’ll continue our journey to build a state-of-the-art (SOTA) matmul kernel on NVIDIA Blackwell by exploring the cluster launch control (CLC) optimization. By the end of the post, we’ll have improved performance by another 15%, reaching 1772 TFLOPs and exceeding the current SOTA.

September 19, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak


Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance

In this post, we continue the journey and discuss how to leverage the 2SM technique along with pipelining to increase our performance by about 5x and get within 85% of state-of-the-art (SOTA).

September 12, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak


Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul

In this post we continue our journey and improve performance to more than 50x that of our initial kernel benchmark. Along the way, we explain more GPU programming concepts and leverage novel Blackwell features.

September 5, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak


Matrix Multiplication on Blackwell: Part 1 - Introduction

This series of blog posts showcases how one can: 1. Write a high-performance GPU kernel on Blackwell with performance competitive with NVIDIA's cuBLAS implementation. 2. Leverage Mojo's special features to keep the kernel as simple as possible.

August 28, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang


How is Modular Democratizing AI Compute? (Democratizing AI Compute, Part 11)

Given time, budget, and expertise from a team of veterans who’ve built this stack before, Modular set out to solve one of the defining challenges of our era: how to Democratize AI Compute. But what does that really mean—and how does it all add up?

June 20, 2025 / Chris Lattner


Modular’s bet to break out of the Matrix (Democratizing AI Compute, Part 10)

May 8, 2025 / Chris Lattner


Why do HW companies struggle to build AI software? (Democratizing AI Compute, Part 9)

April 22, 2025 / Chris Lattner


What about the MLIR compiler infrastructure? (Democratizing AI Compute, Part 8)

April 8, 2025 / Chris Lattner


What about Triton and Python eDSLs? (Democratizing AI Compute, Part 7)

In this post, we’ll break down how Python eDSLs work, their strengths and weaknesses, and take a close look at Triton.

March 26, 2025 / Chris Lattner


What about TVM, XLA, and AI compilers? (Democratizing AI Compute, Part 6)

March 12, 2025 / Chris Lattner

