September 12, 2025

Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance

Ali Taha

Jiexiang Liu

Hengjie Wang

Abdul Dakkak

In the prior blog posts in this series (part 1 and part 2), we talked about how to achieve ~300 TFLOPs of matmul performance on NVIDIA Blackwell GPUs. In this post, we continue the journey and discuss how to leverage the 2xSM MMA technique along with pipelining to increase our performance by about 5x and get within 85% of state-of-the-art (SOTA).

Figure 1: roadmap to 85% performance

Kernel 5: multicast and 2xSM MMA

Since NVIDIA’s Hopper generation, Streaming Multiprocessors (SMs) can be grouped, and Cooperative Thread Arrays (CTAs) within the same SM group can access each other’s shared memory (also known as distributed shared memory access). This brings up two advanced optimizations on Blackwell:

  • Tensor Memory Accelerator (TMA) multicasting—supported since Hopper: SMs collaborate on loading a tile into shared memory.
  • 2xSM Matrix Multiply-Accumulate (MMA): the tensor cores of 2 SMs collaborate on one large MMA operation using inputs from both SMs' shared memory.

Let’s examine both techniques and see how they are reflected in code. The first thing we need to do is inform the code about the size of the cluster (the cluster_shape). This can be done in Mojo using the @__llvm_metadata decorator:

Mojo
@__llvm_metadata(`nvvm.cluster_dim`=cluster_shape)
fn blackwell_tma_pair_umma_kernel[
    a_type, ...other parameters..., cluster_shape
]

The above sets the “CTA cluster” launch shape for the kernel at compile time. CTAs within the same cluster can access each other’s shared memory.

CTA memory multicasting

To understand memory multicasting, let’s take a simple example and examine what we’d have to do without the multicasting feature. For visualization, we will use A, B, and C as 256x256 matrices. In this case, we will launch 4 CTAs.

Figure 2: matrix multiplication of two 256x256 matrices with 4 CTAs

Each CTA will compute a 128x128 tile. Let’s examine which tiles each CTA loads as we step through the K/BK iterations:

Figure 3: CTA tile loading

Note the redundancy in the memory loads. This is similar to the redundancy we noticed at the beginning of the blog series, but this time the redundancy is occurring at tile granularity instead of element granularity.
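
To put a rough number on that redundancy, here is a small back-of-the-envelope sketch in plain Mojo (the constants and prints are ours, purely for illustration): with the 2x2 grid of CTAs above, every A tile and every B tile is fetched from global memory twice.

Mojo
# Back-of-the-envelope count of redundant global memory loads for the 256x256
# example (the constants below are ours, purely for illustration).
def main():
    alias ctas_m = 2  # CTAs along M
    alias ctas_n = 2  # CTAs along N

    # Per K-step, every CTA fetches one A tile and one B tile, but CTAs in the
    # same row need the same A tile and CTAs in the same column the same B tile.
    var a_tiles_fetched = ctas_m * ctas_n
    var a_tiles_unique = ctas_m
    var b_tiles_fetched = ctas_m * ctas_n
    var b_tiles_unique = ctas_n

    print("each A tile is loaded", a_tiles_fetched // a_tiles_unique, "times")  # 2
    print("each B tile is loaded", b_tiles_fetched // b_tiles_unique, "times")  # 2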

How can we mitigate this? If we group these 4 CTAs into a 2x2 cluster, the SMs can access each other’s shared memory, which allows each SM to load a tile from global memory and then broadcast the data to its neighbors.

With this capability, we can adjust our load strategy: the two CTAs along the same row each load exactly one half of the tile from A and broadcast that half to their neighbor. That way, both CTAs perform half the loads while still ending up with the full tile. Visually, this looks like:

Figure 4: CTAs multicasting tile rows of A

We can extend this technique to save the loads for the B matrix as well:

Figure 5: CTAs multicasting tile columns of B

This technique is called multicasting, and it scales. For example, when launching a 4x4 cluster, each CTA loads only a fraction of its tiles and gets the rest of the data from its peers. Let’s look at how this is implemented in Mojo. We begin by declaring the TMA ops on the host side:

Mojo
a_tma_op = create_tma_tile[
    a_type, 2, (BM // cluster_shape[1], BK), swizzle_mode=a_swizzle
](ctx, a)
b_tma_op = create_tma_tile[
    b_type, 2, (BN // cluster_shape[0], BK), swizzle_mode=b_swizzle
](ctx, b)

Since each CTA will slice the shared memory tiles for A and B and broadcast them to corresponding peer CTAs, we divide the TMA tile by the cluster dimension. In the GPU kernel, we launch the multicast load using:

Mojo
if elect_one_thread:
    ...
    a_tma_op.async_multicast_load(
        a_smem_slice,
        tma_mbar[0],
        (UInt(i * BK), UInt(a_gmem_slice_coord)),
        a_multicast_mask,
    )

The API is similar to the async_load TMA API we showed in our last blog but with a few key differences, discussed in the following paragraphs.

First, there is an additional TMA mask, a_multicast_mask. The mask is a 16-bit value describing which CTAs are involved in the load: each bit stands for one CTA, and there can be at most 16 CTAs in one cluster. For example, in the figure above, CTAs 0 and 2 multicast the A tile, so their masks are both 0b101. In Mojo we can compute that via:

Mojo
var rank_m = block_id_in_cluster.x

# CLUSTER_M and CLUSTER_N are cluster dimensions
@parameter
for i in range(CLUSTER_N):
    a_multicast_mask |= 1 << (i * CLUSTER_M)
a_multicast_mask <<= rank_m
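
As a sanity check, the same bit manipulation can be written as a standalone host-side helper (a hypothetical function, not part of the kernel); for the 2x2 cluster above it reproduces the 0b101 and 0b1010 masks.

Mojo
# Standalone version of the mask computation above, for illustration only.
# Bit i of the mask selects CTA i of the cluster.
fn a_mask(rank_m: Int, cluster_m: Int, cluster_n: Int) -> Int:
    var mask = 0
    for i in range(cluster_n):
        mask |= 1 << (i * cluster_m)  # one bit per peer CTA sharing this A tile
    return mask << rank_m             # shift the pattern onto this CTA's rank


def main():
    # 2x2 cluster: CTAs 0 and 2 share one A tile, CTAs 1 and 3 share the other.
    print(a_mask(0, 2, 2))  # 5  == 0b0101
    print(a_mask(1, 2, 2))  # 10 == 0b1010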

The layout tensor a_smem_slice is sliced from the original tile by offsetting the base pointer:

Mojo
alias a_tma_load_size = a_desc_layout.size()
var rank_n = block_id_in_cluster.y
var a_smem_slice = __type_of(a_smem_tile)(
    a_smem_tile.ptr + rank_n * a_tma_load_size
)

We also update the corresponding coordinates in global memory for this tile slice:

Mojo
# Number of rows per slice
alias a_tma_rows = a_desc_layout.shape[0].value()
a_gmem_slice_coord = block_idx.x * BM + Int(rank_n) * a_tma_rows
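
To make the indexing concrete, here is the same arithmetic with example numbers (BM = 128 and a 2-wide cluster, matching the kernels above; the standalone snippet is purely illustrative):

Mojo
# Worked example of the global row coordinate computed above.
def main():
    alias BM = 128              # rows of A per CTA tile
    alias a_tma_rows = BM // 2  # rows loaded by each CTA in a 2-wide cluster
    var block_idx_x = 1         # say, the second CTA row of the grid

    for rank_n in range(2):
        var coord = block_idx_x * BM + rank_n * a_tma_rows
        print("rank_n =", rank_n, "-> loads global rows starting at", coord)  # 128, 192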

2xSM MMA

We have shown how distributed shared memory and multicast can help reduce the data volume from global memory to shared memory. However, there is still one limitation: the tiles are duplicated in distributed shared memory. Looking at CTAs 0 and 1 in the previous example, we see that both CTAs keep a copy of the BNxBK tile in shared memory, as shown on the left side of Figure 6 below. Given that each CTA has access to the other’s shared memory, keeping two copies of the same tile is a waste of resources.

Figure 6: duplication of tiles in distributed shared memory

Blackwell’s 2xSM MMA instruction (tcgen05.mma.cta_group::2) is designed to address this problem. As shown on the right side of Figure 6, CTAs 0 and 1 work as a pair and each loads only half of the B tile. The 2xSM MMA instruction sees both halves in shared memory and coordinates the tensor cores on both SMs to complete an MMA operation as large as two single-SM MMAs. Comparing the left and right sides of the figure, the 2 SMs compute the same MMA workload of 2*BM x MMA_N x BK (where MMA_N denotes the N dimension of the MMA shape) and produce the same result, but the 2xSM instruction halves the shared memory used for the B tile.

Compared to a multicast load with single-SM MMA, the 2xSM MMA load reduces shared memory traffic in two ways. First, CTAs 0 and 1 still load the same halves of the B tile as in the multicast load (see Figure 5), but it is now a regular TMA transfer, i.e. the multicast step is omitted. Moreover, the traffic from shared memory to tensor memory for the B tile is also halved, since we only keep one copy of it in distributed shared memory.
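
A quick back-of-the-envelope comparison makes the saving concrete, assuming the single-SM case keeps the full MMA_N-wide B tile in each CTA (as on the left of Figure 6) while the 2xSM case keeps only half (the snippet below is just illustrative arithmetic):

Mojo
# Per-CTA shared memory for the B operand, using MMA_N = 256, BK = 64, BF16.
def main():
    alias MMA_N = 256
    alias BK = 64
    alias bytes_per_elem = 2  # BF16

    # Multicast + single-SM MMA: each CTA of the pair keeps the full B tile.
    var b_smem_1sm = MMA_N * BK * bytes_per_elem
    # 2xSM MMA: each CTA keeps only its half of the B tile.
    var b_smem_2sm = (MMA_N // 2) * BK * bytes_per_elem

    print("B tile per CTA, single-SM MMA:", b_smem_1sm // 1024, "KB")  # 32 KB
    print("B tile per CTA, 2xSM MMA:     ", b_smem_2sm // 1024, "KB")  # 16 KB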

The code to launch a 2xSM MMA is quite similar to launching a single-SM MMA (see our previous blog), with the exception of using elect_one_cta. Since two SMs collaborate on one MMA operation, only one CTA should launch the instruction, and that’s the CTA with the even ID in each pair. Hence, elect_one_cta is a predicate that selects the launching CTA (also called the leader CTA).

Mojo
if elect_one_cta:
    # wait for data arriving in shared memory
    ...
    if elect_one_thread:
        @parameter
        for j in range(num_k_mmas):
            var c_scale = 0 if i == 0 and j == 0 else 1
            alias idx = IntTuple(0, MMA_K * j)
            alias a_offset = a_smem_layout(idx) * sizeof[a_type]()
            alias b_offset = b_smem_layout(idx) * sizeof[b_type]()
            mma[cta_group](
                adesc + a_offset,
                bdesc + b_offset,
                tmem_addr,
                idesc,
                c_scale=c_scale,
            )
        mma_arrive_multicast[cta_group](mma_mbar, c_mma_mask)

The mma function above will invoke the tcgen05.mma.cta_group::2 instruction when cta_group is 2. We also change the arrive function from mma_arrive (see the previous blog) to mma_arrive_multicast, which takes cta_group and signals the leader CTA’s memory barrier when cta_group=2.

Tensor memory layout

With the 2xSM MMA instruction, the two CTAs partition the M dimension (MMA_M) evenly (BM = MMA_M / 2), and each holds half of the result, of shape BM x MMA_N, in tensor memory. The instruction supports MMA_M values of 128 and 256. For MMA_M = 256, the TMEM layout of each half is the same as for single-SM MMA, as shown in Figure 7 below: the top half is stored in the leader CTA and the lower half in its paired CTA. The output to global memory is the same as in kernel 4 from our last post.

Figure 7: tensor memory layout
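
To make the split concrete for the MMA_M = 256 configuration used here, each CTA of the pair ends up holding a 128 x 256 accumulator tile in its own tensor memory (a small illustrative calculation, not kernel code):

Mojo
# How the 2xSM result splits across the CTA pair for MMA_M = MMA_N = 256.
def main():
    alias MMA_M = 256
    alias MMA_N = 256
    alias BM = MMA_M // 2  # rows held by each CTA of the pair

    print("leader CTA holds rows 0 ..", BM - 1, "as a", BM, "x", MMA_N, "tile")
    print("paired CTA holds rows", BM, "..", MMA_M - 1, "as a", BM, "x", MMA_N, "tile")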

For MMA_M = 128, the layout is different than above. We skip the details here and refer the reader to the code since MMA_M = 128 is less important in production.

Figure 8: 20% SOTA performance with CTA memory multicast and 2xSM MMA

With this implementation, we are now at 360.2 TFLOPs, about 20% of our SOTA goal. Looking at the profile, even though we are using the advanced MMA instruction, we are still limited by global memory throughput because the computation still waits for data transfers to complete. Our next optimization removes this barrier and increases the overlap between computation and memory transfer.

Kernel 6: 2SM pipelining

To keep the code clean throughout the explanation of this kernel, we wrap the logic that loads tiles into shared memory in a function load_AB(), the logic that issues the MMAs in a function consume_AB(), and the logic that writes out the result in a function store_C().

If we operate at this granularity, the previous kernel performs the following:

Mojo
for i in range(K // BK):
    # Issue asynchronous TMA load
    load_AB()
    # Wait on barrier until the tiles are loaded
    tma_mbar[0].wait(tma_phase)
    # Issue asynchronous MMA
    consume_AB()
    # Wait on barrier until the tiles are consumed
    mma_mbar[0].wait(mma_phase)

# Write the results out
store_C()

At any given moment, half of our hardware is idle: either the tensor cores are waiting for data to arrive, or the TMA is waiting for the MMA to finish so that the shared memory buffer becomes available. This data dependence prevents a CTA from leveraging both hardware units simultaneously.

Figure 9: data dependent contention for CTAs

Pipelining MMA and TMA

The classic way to overcome this idleness is to introduce multiple buffers in shared memory and overlap MMA with TMA via pipelining. Before we do that, we need to check our shared memory usage.

On a Blackwell GPU, a kernel can access up to 227 KB of shared memory (shared memory and the L1 cache together provide 228 KB, but 1 KB has to be reserved for L1). So far, with the largest 2xSM MMA instruction shape of 256x256x16 for BF16, our usage is:

  • one A tile: BM x BK x 2B = (MMA_M / 2) * 64 * 2B = 16 KB
  • one B tile: BN x BK x 2B = (MMA_N / 2) * 64 * 2B = 16 KB
  • one C tile: BM x MMA_N x 2B = 64 KB

We have used less than half of the capacity. Leveraging the shared memory capacity, we introduce 5 stages and organize them into a circular buffer.
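
The choice of 5 stages falls directly out of this budget; here is a quick check with the numbers listed above (plain arithmetic, not kernel code):

Mojo
# Why five pipeline stages fit: 227 KB usable shared memory, a 64 KB C tile,
# and 16 KB + 16 KB of A and B tiles per stage.
def main():
    alias smem_capacity_kb = 227
    alias c_tile_kb = 64
    alias stage_kb = 16 + 16  # one A tile plus one B tile per stage

    print("max pipeline stages:", (smem_capacity_kb - c_tile_kb) // stage_kb)  # 5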

Figure 10: a 5 stage circular buffer

Then we let TMA and MMA iterate through the buffers in parallel, so that while the MMA is consuming one buffer, the TMA can prefetch data into another buffer for subsequent computation. As a result, we increase the overlap between computation and communication.
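
Conceptually, the TMA and MMA sides simply walk the same ring of buffers, tracking a stage index and a phase bit so that barrier waits can tell the current trip around the ring from the previous one. The sketch below shows that bookkeeping in isolation (the variable names are ours, not the kernel's):

Mojo
# Ring-buffer bookkeeping shared by the producer (TMA) and consumer (MMA).
def main():
    alias num_stages = 5
    var stage = 0
    var phase = 0

    for i in range(12):  # e.g. 12 K-iterations
        print("iteration", i, "-> buffer", stage, ", phase", phase)
        stage += 1
        if stage == num_stages:
            stage = 0
            phase = 1 - phase  # flip the phase after a full trip around the ring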

Figure 11: overlapping of communication with computation using a circular buffer

Warp specialization

To implement this pipelined pattern, we need to touch on another important concept: warp specialization. Thus far we’ve been using 4 warps in the previous kernels, since reading TMEM takes all four warps while launching both the TMA and the MMA is done from a single thread. To issue TMA and MMA in parallel and have them operate on different pipeline stages, we need to specialize different warps for each task. This is called warp specialization. The high-level code structure is as follows:

Mojo
if WarpRole.is_main_load():
    for i in range(num_iters):
        ...
        mma_mbar[stage].wait(phase)
        # Issue TMA loads and signal tma_mbar
        ...

if WarpRole.is_mma():
    for i in range(num_iters):
        ...
        tma_mbar[stage].wait(phase)
        # Issue 2xSM MMA and signal mma_mbar
        ...
        mma_arrive_multicast[cta_group](mma_mbar, c_mma_mask)

We specialize one warp for issuing TMA and one for issuing MMA, and they iterate through the tiles concurrently. Additionally, the warps need to inform each other when a tile has arrived or has been consumed and its buffer is ready for new data. This communication is done via memory barriers. In the above, we use the same memory barriers as in the previous kernels; however, the MMA warp now waits on the TMA barrier (signaled by the TMA warp) to know when the input is ready for computation, and the warp specialized for TMA operations waits on the MMA barrier to know when a buffer is ready for new data.

Output

We still need four warps to read TMEM for the output, so, following warp specialization, we assign four separate warps to it. In the next post, this will be necessary to overlap the output with TMA and MMA. We also need another memory barrier so that the MMA warp can tell the output warps that the MMA results are ready. The high-level structure becomes:

Mojo
if WarpRole.is_main_load():
    for i in range(num_iters):
        ...
        mma_mbar[stage].wait(phase)
        # Issue TMA loads and signal tma_mbar
        ...

if WarpRole.is_mma():
    for i in range(num_iters):
        ...
        tma_mbar[stage].wait(phase)
        # Issue 2xSM MMA and signal mma_mbar
        ...
        mma_arrive_multicast[cta_group](mma_mbar, c_mma_mask)

    # Signal output warps once all MMAs have been issued
    if elect_one_sync():
        mma_arrive_multicast[cta_group](compute_barrier, mma_complete_mask)

if WarpRole.is_epilogue():
    compute_barrier[].wait()
    # Store result to global memory
    ...

Here the MMA warp additionally signals a new memory barrier, compute_barrier. The output warps then wait on compute_barrier to ensure the result is ready. The storing part is the same as before.

Benchmarking this pipelining strategy increases our performance and puts us at 1429 TFLOPs, or 81% of SOTA.

Figure 12: 81% of SOTA performance with 2SM pipelining and warp specialization

Kernel 7: double-buffering the write-out

Thus far we have been optimizing how data is loaded into the matmul, but we can also optimize how the result is stored. First, let’s briefly review how we move the result from TMEM to global memory.

Mojo
def store_c():
    # move all of tensor memory into registers
    registers = tcgen05_ld[parameters](tmem_address)

    # move all of registers into shared memory
    for block_offset in range(BN // TMA_BN):
        for st_matrix_offset in range(TMA_BN // 16):
            st_matrix[c_smem_tile, registers]

    # move the entire shared memory tile into global memory
    c_tma_op.async_store(
        c_tma_tile,
        (
            (block_idx.x * MMA_N + thread_idx.x * TMA_BN),
            (block_idx.y * BM),
        ),
    )

There are three main steps when storing the C data: TMEM to registers, registers to shared memory, and shared memory to global memory. These transfers are executed sequentially due to the data dependence between them. However, that’s not necessary, especially since the TMA store is asynchronous. Applying the same pipelining idea as before, we can overlap these output stages.

To do that, we declare two output tiles (the double buffer) of shape BM x stageN in shared memory and ping-pong between them. For simplicity, we start with stageN = 32 (a value that can be tuned). The double-buffer pipeline strategy is illustrated below for two iterations. After stmatrix stores the first output tile to shared memory, we issue the TMA store but do not block on it. Instead, we immediately begin loading from TMEM and writing the data into shared memory using the other buffer. With this schedule, the TMA store overlaps with the TMEM ⇒ register ⇒ shared memory transfer for the other tile.

Figure 13: iteration 1 of the double-buffer pipeline
Figure 14: iteration 2 of the double-buffer pipeline

To implement the pipeline strategy in Mojo, we first break the output into MMA_N / stageN = 8 iterations. At each iteration, we process stageN = 32 columns from TMEM. The TMEM load and stmatrix code is omitted for simplicity, but it is largely the same as in the previous kernels, with the exception of operating on smaller tiles.

Mojo
# Process 32 columns at a time
alias stageN = 32
alias num_stages = MMA_N // stageN

@parameter
for stage in range(num_stages):
    # Load TMEM
    ...
    # Store to shared memory using stmatrix
    ...
    if elect_one_thread:
        # Issue TMA store
        ...
        c_tma_op.commit_group()

        @parameter
        if stage < num_stages - 1:
            # Keep one TMA store in flight
            c_tma_op.wait_group[1]()
        else:
            # Last stage: wait for all TMA stores to finish
            c_tma_op.wait_group[0]()

The important part of the code is how we synchronize the TMA stores. Recall that commit_group commits the previously issued TMA stores as one group, and wait_group[N] waits until at most N of the most recently committed groups are still in flight (all older groups have finished). In the code above, the first iteration does not wait, since it can proceed using the second buffer. Iterations two through num_stages - 1 wait until there is only one TMA store in flight, which lets that store overlap with the TMEM load and stmatrix of the next iteration. Finally, the last iteration waits for all TMA stores to complete.
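
To see why this bound leaves exactly one store overlapping, here is the counting written out explicitly (pure arithmetic with illustrative names, no GPU calls):

Mojo
# At each stage we commit one TMA store group; wait_group[1] then guarantees
# that all but (at most) the newest group have finished.
def main():
    alias num_stages = 8  # MMA_N // stageN = 256 // 32
    var committed = 0

    for stage in range(num_stages):
        committed += 1  # commit_group() for this stage's TMA store
        var allowed_in_flight = 0
        if stage < num_stages - 1:
            allowed_in_flight = 1  # wait_group[1]
        var guaranteed_done = committed - allowed_in_flight
        print("stage", stage, ": finished >=", guaranteed_done, ", in flight <=", allowed_in_flight)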

This pipelined execution comes with an additional (free) benefit. To explain it, let's review our shared memory usage in kernel 6.

  • 5 copies of A and B tiles for pipelining: 5 * (BM + BN) * BK * 2B = 160 KB
  • C tile for output: BM * BN * 2B = 64 KB
  • Minor usage for memory barriers and the tensor memory address.

The C output tile accounts for 64 KB of that roughly 224 KB. With the double buffer, the output now uses only 2 * BM * stageN * 2B = 16 KB, so we can spend the saved 48 KB on a deeper pipeline and more overlap between TMA and MMA.
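
Concretely, with the same per-stage sizes as above, the freed shared memory is enough for one more pipeline stage (again just budget arithmetic, not kernel code):

Mojo
# Pipeline depth before and after double-buffering the output.
def main():
    alias smem_capacity_kb = 227
    alias stage_kb = 32       # 16 KB A tile + 16 KB B tile per stage
    alias old_output_kb = 64  # full BM x MMA_N C tile
    alias new_output_kb = 16  # 2 x (BM x stageN) double buffer

    print("stages with the full C tile:  ", (smem_capacity_kb - old_output_kb) // stage_kb)  # 5
    print("stages with the double buffer:", (smem_capacity_kb - new_output_kb) // stage_kb)  # 6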

This optimization gives an additional 64 TFLOPs of speedup, and the kernel now reaches 85% of SOTA.

Figure 15: 85% of SOTA performance with double-buffering the write-out

Next steps

So, how do we get the last 15%? While we overlap loads from global memory with computation by pipelining the TMA loads and MMAs, the cost of storing results to global memory is still exposed (even though we performed some optimization in kernel 7). We also suffer from the launch overhead between subsequent CTA dispatches. Pictorially, the overhead looks like:

Figure 16: overhead from CTA launches

In the next blog post, we will address both limitations by introducing a persistent kernel that uses another advanced Blackwell feature, cluster launch control (CLC), and demonstrate how we close the gap to achieve SOTA.


Ali Taha

AI Performance Engineer

Hengjie Wang

AI Performance Engineer

Hengjie Wang is a software engineer focusing on performance optimizations for AI and scientific applications. He has many years of experience developing and optimizing large-scale scientific applications on world-ranking supercomputers, and has also developed deep learning algorithms to advance physical simulations. Before joining Modular, he was a postdoctoral scholar at Lawrence Berkeley National Laboratory, where he participated in developing the exascale projects MFIX-Exa and AMReX on national supercomputers. He is a big fan of Go and enjoys hiking and dog training.

Abdul Dakkak

AI Compiler Engineer

Expert in machine learning, compilers, programming languages, and accelerated computing. Before Modular, Abdul led the development of AI compilers for GPUs at Microsoft Research and the Mathematica Compiler at Wolfram Research. Abdul has developed open-source tools for accelerating real-world applications to optimize their performance across the hardware and software stack.