
Blog


Democratizing AI Compute Series

Go behind the scenes of the AI industry with Chris Lattner

News / Series

Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA

In this blog post, we’ll continue our journey to build a state-of-the-art (SOTA) matmul kernel on NVIDIA Blackwell by exploring the cluster launch control (CLC) optimization. By the end of the post, we’ll have improved performance by another 15%, reaching 1772 TFLOPs and exceeding the current SOTA.

September 19, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak

News / Series

Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance

In this post, we continue the journey and discuss how to leverage the 2SM technique along with pipelining to increase our performance by about 5x, getting within 85% of state-of-the-art (SOTA).

September 12, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak

News / Series

Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul

In this post we continue our journey and improve performance to more than 50x that of our initial kernel benchmark. Along the way we explain more GPU programming concepts and leverage novel Blackwell features.

September 5, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang, Abdul Dakkak

News / Series

Matrix Multiplication on Blackwell: Part 1 - Introduction

This series of blog posts showcases how to: 1. Write a high-performance GPU kernel on Blackwell with performance competitive with NVIDIA's cuBLAS implementation. 2. Leverage Mojo's special features to make the kernel as simple as possible.

August 28, 2025 / Ali Taha, Jiexiang Liu, Hengjie Wang

News / Series

How is Modular Democratizing AI Compute? (Democratizing AI Compute, Part 11)

Given time, budget, and expertise from a team of veterans who’ve built this stack before, Modular set out to solve one of the defining challenges of our era: how to democratize AI compute. But what does that really mean, and how does it all add up?

June 20, 2025 / Chris Lattner

News / Series

Modular’s bet to break out of the Matrix (Democratizing AI Compute, Part 10)

May 8, 2025 / Chris Lattner

News / Series

Why do HW companies struggle to build AI software? (Democratizing AI Compute, Part 9)

April 22, 2025 / Chris Lattner

News / Series

What about the MLIR compiler infrastructure? (Democratizing AI Compute, Part 8)

April 8, 2025 / Chris Lattner

News / Series

What about Triton and Python eDSLs? (Democratizing AI Compute, Part 7)

In this post, we’ll break down how Python eDSLs work, their strengths and weaknesses, and take a close look at Triton.

March 26, 2025 / Chris Lattner

News / Series

What about TVM, XLA, and AI compilers? (Democratizing AI Compute, Part 6)

March 12, 2025 / Chris Lattner

  • Series

    Democratizing Compute Series

    Go behind the scenes of the AI industry in this blog series by Chris Lattner. Trace the evolution of AI compute, dissect its current challenges, and discover how Modular is raising the bar with the world’s most open inference stack.

11-part series

  • Series

    Matrix Multiplication on Blackwell

    Learn how to write a high-performance GPU kernel on Blackwell that offers performance competitive to that of NVIDIA's cuBLAS implementation while leveraging Mojo's special features to make the kernel as simple as possible.

4-part series

