Blog


Democratizing AI Compute Series

Go behind the scenes of the AI industry with Chris Lattner


Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA

In this blog post, we’ll continue our journey to build a state-of-the-art (SOTA) matmul kernel on NVIDIA Blackwell by exploring the cluster launch control (CLC) optimization. By the end of the post, we’ll have improved performance by another 15%, reaching 1772 TFLOPs and exceeding the current SOTA.

September 19, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak


Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance

In this post, we continue the journey and discuss how to leverage the 2SM technique along with pipelining to increase our performance by about 5x and get within 85% of state-of-the-art (SOTA).

September 12, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak


Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul

In this post we continue our journey and improve performance to more than 50x that of our initial kernel benchmark. Along the way, we explain more GPU programming concepts and leverage novel Blackwell features.

September 5, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak


Matrix Multiplication on Blackwell: Part 1 - Introduction

This series of blog posts showcases how one can: 1. Write a high-performance GPU kernel on Blackwell with performance competitive with NVIDIA's cuBLAS implementation. 2. Leverage Mojo's special features to keep the kernel as simple as possible.

August 28, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang


How is Modular Democratizing AI Compute? (Democratizing AI Compute, Part 11)

Given time, budget, and expertise from a team of veterans who’ve built this stack before, Modular set out to solve one of the defining challenges of our era: how to Democratize AI Compute. But what does that really mean—and how does it all add up?

June 20, 2025 / Chris Lattner


Modular’s bet to break out of the Matrix (Democratizing AI Compute, Part 10)

May 8, 2025 / Chris Lattner


Why do HW companies struggle to build AI software? (Democratizing AI Compute, Part 9)

April 22, 2025 / Chris Lattner


What about the MLIR compiler infrastructure? (Democratizing AI Compute, Part 8)

April 8, 2025 / Chris Lattner


What about Triton and Python eDSLs? (Democratizing AI Compute, Part 7)

In this post, we’ll break down how Python eDSLs work, their strengths and weaknesses, and take a close look at Triton.

March 26, 2025 / Chris Lattner


What about TVM, XLA, and AI compilers? (Democratizing AI Compute, Part 6)

March 12, 2025 / Chris Lattner

