September 19, 2025

Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA

Ali Taha

Jiexiang Liu

Hengjie Wang

Abdul Dakkak

In this blog post, we’ll continue our journey to build a state-of-the-art (SOTA) matmul kernel on NVIDIA Blackwell by exploring the cluster launch control (CLC) optimization. At the end of the post we’ll improve our performance by another 15% and achieve 1772 TFLOPs, exceeding that of the current SOTA.

Kernel 8: CLC Persistent Kernel

In the Part 3 post, we mentioned that the matmul kernel still suffers from two major overheads:

  1. Shared memory and barrier initialization needs to restart between waves (we will discuss what this means below).
  2. The writes to the C matrix don’t overlap with compute.

In this section, we’ll discuss how we removed these two overheads using a persistent kernel.

What is a persistent kernel?

To understand persistent kernels, we must first explain what a wave is. In a GPU, a wave refers to a batch of thread blocks that can be assigned to all available streaming multiprocessors (SMs) at once. Kernel execution can be viewed as a sequence of waves processed one after another until all work is complete.
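
To make this concrete, here is a small sketch (the tile sizes and the helper usage are illustrative, not taken from our kernel) that counts the waves needed for a 4096x4096x4096 matmul when each thread block computes a 128x256 output tile on a 148-SM B200:

Mojo
from math import ceildiv

fn main():
    # Illustrative numbers: a 4096x4096 output split into 128x256 block tiles.
    var num_tiles = ceildiv(4096, 128) * ceildiv(4096, 256)  # 32 * 16 = 512 tiles
    var num_sms = 148  # number of SMs on a B200
    # A wave fills every available SM with one thread block at a time.
    var num_waves = ceildiv(num_tiles, num_sms)  # 4 waves, the last one partially full
    print("tiles:", num_tiles, "waves:", num_waves)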

In the case of matmul, where thread blocks represent the entire workload and the thread block IDs correspond to block tile coordinates, a persistent kernel means that the kernel author, rather than the hardware, controls the scheduling of these block tile coordinates.

On the Hopper architecture, we can define a tile scheduler that works like this:

  • Launch as many Cooperative Thread Arrays (CTAs) as there are available SMs
  • Make each one compute the matrix multiply accumulate (MMA) and store an output tile
  • Once the write-out is complete, give it the coordinates of the next tile to compute
  • Keep going for Num_Tiles / Num_SMs iterations

The following code shows how we fetch the next work tile for the current CTA.

Mojo
fn fetch_next_work(mut self) -> WorkInfo:
    self.idx += num_idle_ctas
    return self.idx

With the above, the high-level code structure using a tile scheduler looks like this:

Mojo
# every CTA in cluster participates in loads
if WarpRole.is_producer():
    while work_info.is_valid():
        # load
        ...
        work_info = scheduler.fetch_next_work()

# only leader CTAs in cluster participate in MMA
if WarpRole.is_consumer():
    while work_info.is_valid():
        # MMA
        ...
        store_C()
        work_info = scheduler.fetch_next_work()

The scheduler keeps the CTA persistent by feeding it with new work tiles until there is no more work i.e. work_info becomes invalid. By keeping the CTA resident on the SM, we can eliminate the overhead time taken to launch the next block, and get our scheduling to look like this:
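
Concretely, the mapping from a linear tile index to block tile coordinates in such a static scheduler can be sketched as follows (the function and parameter names are illustrative, not the kernel's actual WorkInfo plumbing):

Mojo
# Illustrative: advance a persistent CTA to its next output tile by striding
# through the tile space, then convert the linear index to (m, n) coordinates.
fn next_tile_coords(
    mut idx: Int, num_ctas: Int, num_tiles_m: Int, num_tiles_n: Int
) -> Tuple[Int, Int, Bool]:
    idx += num_ctas  # every resident CTA strides by the CTA count
    var is_valid = idx < num_tiles_m * num_tiles_n
    # Row-major mapping from the linear index to block tile coordinates.
    return (idx // num_tiles_n, idx % num_tiles_n, is_valid)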

However, the static persistent kernel has a few issues:

  1. It assumes that the kernel is always utilizing all SMs. This is not true if the GPU is launching multiple kernels at the same time.
  2. It does not know which SMs are idle/busy.

These factors can cause sub-optimal scheduling if other kernels are running on the same GPU. It can also cause SM starvation if some thread blocks run a few waves ahead of others. Both of these issues have been addressed in Blackwell using Cluster Launch Control (CLC).

A deep dive into the CLC scheduler

New in NVIDIA’s Blackwell architecture is a scheduler on the silicon. Rather than relying on software-managed distribution, the hardware now orchestrates work through an elegant producer-consumer model. The idea now becomes:

  1. Launch as many thread blocks as the problem shape requires.
  2. A scheduler warp (warp 4 of the first CTA in a cluster) tries to “cancel” work tiles that have not yet launched, claiming their coordinates for the CTAs already resident on the SMs.
  3. The scheduler warp also writes the coordinates to a dedicated shared memory location in all CTAs within the cluster, signaling the arrival of 16 bytes of data.
  4. Once the signal arrives, all CTAs within the cluster read the coordinates from that shared memory location.

This process continues until all thread blocks are processed by the GPU.

A high level implementation of the tile scheduler looks like this in Mojo:

Mojo
struct TileScheduler[num_pipeline_stages: Int]:

    @always_inline
    fn fetch_next_work(self, ...) -> WorkInfo:
        # Wait for the 16 bytes to arrive.
        index, phase = consumer_state.index(), consumer_state.phase()
        self.full_mbar[index].wait(phase)
        # Read the work coordinate from the shared memory location.
        var work_tile = work_info_from_clc_response(self.clc_response[index])
        # Signal the scheduler CTA that the work coordinate has been fetched.
        self.empty_mbar[index].arrive_cluster(0)
        return work_tile

    @always_inline
    fn advance_to_next_work(self, ...) -> PipelineState:
        index, phase = producer_state.index(), producer_state.phase()
        # Wait for the previous work coordinate to be fetched.
        self.empty_mbar[index].wait(phase)
        # Set an arrival signal of 16 bytes.
        self.full_mbar[index].arrive_and_expect_bytes(16)
        if elect_one_sync():
            # Try to cancel thread blocks and write the coordinate
            # to clc_response.
            clusterlaunchcontrol_try_cancel[multicast=True](
                self.clc_response + index,
                self.full_mbar + index,
            )
        return producer_state.next()

Here we use empty_mbar and full_mbar to signal when work tile coordinates arrive and are fetched. Looking closely, you can see that we also implemented pipelining for the TileScheduler. We'll discuss the rationale behind this design decision in the next section.

With this implementation, we can at least overlap the next wave’s Tensor Memory Accelerator (TMA) load with the current wave’s C write-out. But we still have more work to do to remove the overhead from CLC scheduling and the stalls between MMAs from different waves.

Pipelining CLC fetches

To remove the CLC scheduling overhead, we pipeline the CLC fetches. This is the same idea we applied in prior kernels:

Mojo
# Assume num_clc_pipeline_stages is 2.
var clc_response_ptr = (smem_ptr + num_clc_pipeline_stages).bitcast[Int128]()
var scheduler = TileScheduler()
var clc_consumer_state = PipelineState[num_clc_pipeline_stages]()
var clc_producer_state = PipelineState[num_clc_pipeline_stages](0, 1, 0)

if WarpRole.is_main_load():
    for work in scheduler(clc_consumer_state):
        # Do TMA load.
        ...

if WarpRole.is_scheduler() and is_first_cta_in_cluster:
    for work in scheduler(clc_consumer_state):
        # Pre-fetch the next tile while the current one is processed.
        clc_producer_state = next(scheduler(clc_producer_state))

# MMA and Epilogue warps.
...

In the above code, if we assume a 2-stage pipeline, then we reserve 2 shared memory locations for the CLC response in each CTA. The load_warp always waits for the coordinate to be written into the shared memory slot at the current pipeline index, but the scheduler_warp can race ahead and write the next coordinate into the next slot.

The only scenario where the scheduler_warp needs to wait for the load_warp is when the pipeline index has looped through the entire pipeline and the coordinate still has not been fetched by the load_warp.
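
For reference, the index/phase mechanism that makes this safe can be sketched as a simplified pipeline state (this mirrors the behavior of PipelineState but is not its actual implementation): the index wraps around the stage count, and a phase bit flips on every wrap, so an mbarrier wait on (index, phase) only succeeds for the current trip through the buffer rather than a previous one.

Mojo
# Simplified model of a pipeline state with num_stages stages.
struct SimplePipelineState[num_stages: Int]:
    var _index: Int
    var _phase: UInt32

    fn __init__(out self, index: Int = 0, phase: UInt32 = 0):
        self._index = index
        self._phase = phase

    fn index(self) -> Int:
        return self._index

    fn phase(self) -> UInt32:
        return self._phase

    fn next(mut self):
        self._index += 1
        if self._index == num_stages:
            self._index = 0
            self._phase ^= 1  # flip the phase bit on wrap-around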

Now the scheduling looks better: the overhead of scheduling the work tile overlaps with the next wave’s TMA load.

TMEM as a circular buffer

Using a single Tensor Memory (TMEM) address for the accumulator introduces two major drawbacks:

  • Idle Warps: While epilogue warps transfer data from TMEM → SMEM → GMEM, the producer and consumer warps sit idle. If they do work on the next tile, they’ll overwrite the data in tensor memory that the epilogue warps are reading from.
  • Sequential MMA and output: The output and the next wave's MMA are still being executed sequentially. This is because of the data dependency from TMEM between the MMA warp and the output warp.

To resolve this, we can treat TMEM as a circular buffer just like what we did with shared memory for A and B. Let’s say we choose MMA_N == 256. Then, while epilogue warps write out results from columns 0-255, the mma warps could concurrently be accumulating the next tile's results in columns 256-511.

Conceptually, this is a natural optimization. We just need to define accum_full and accum_empty barriers to signal which stage of the TMEM buffer is empty or full. And because it only impacts our mma and epilogue warps, the code change to support this is minimal:

MMA warp

Mojo
var accum_producer_state = PipelineState[2](0, 1, 0)
var accum_consumer_state = PipelineState[2]()
var stage_stride_cols = 512 // 2  # = 256 columns per stage

if WarpRole.is_mma():
    for work in scheduler:
        # Wait for either half of tmem to be empty.
        accum_index, phase = accum_producer_state.index(), accum_producer_state.phase()
        accum_empty_mbar[accum_index].wait(phase)
        # Compute the MMA into a specific buffer (column 0 or 256).
        var tmem_offset = accum_index * stage_stride_cols
        consumer_main_loop[...](
            tmem_addr | tmem_offset,  # accumulate at col 0 or 256
            accum_index=accum_index,
            ...
        )
        # Signal that this half of tmem is now full, to make it ready for
        # write-out and to ensure it is not over-written until it has
        # been emptied.
        accum_full_mbar[accum_index].arrive()
        accum_producer_state = next(accum_producer_state)

Output warp

Mojo
if WarpRole.is_epilogue():
    for work_info in scheduler:
        # Wait for the accumulation buffer to be full.
        index, phase = accum_consumer_state.index(), accum_consumer_state.phase()
        accum_full_mbar[index].wait(phase)
        # Calculate which buffer to read from (column 0 or 256).
        var tmem_offset = index * stage_stride_cols
        # Read from tensor memory with the offset.
        var stage_tmem_addr = (
            (tmem_addr | (warp_id * 32 << 16))  # row offset of the warp
            + tmem_offset       # 256 columns if splitting tmem into halves
            + (stage * stageN)  # offset of the multi-stage optimization in kernel 7
        )
        store_C[...](stage_tmem_addr, ...)
        # Signal that this buffer is empty.
        accum_empty_mbar[index].arrive()
        accum_consumer_state = next(accum_consumer_state)

After fixing the 2 major overheads mentioned above, we now have our optimal scheduling.

Comparing this to our Hopper (H100) matmul scheduling, we can see that every stage of the Blackwell (B200) kernel executes asynchronously. This is all thanks to Blackwell’s new tensor memory and CLC enabling a producer-consumer model to schedule work tiles.

With all these optimizations in place, our kernel performs at 100.6% of cuBLAS at 1772.9 TFLOPs for shape 4096x4096x4096.

However, if we look at other shapes (such as 8192x8192x8192), our performance is still at 90% of cuBLAS, so there are further optimizations we can make.

Kernel 9: Thread Block Swizzle

Now that we can schedule work tiles to SMs, the only optimization remaining is to improve the L2 cache hit rate. We can do this with a trick called thread block swizzling.

To understand this, we first need to visualize how CLC schedules block tiles to each wave. In the following plot, let’s assume we are launching a 6x5 cluster grid, creating 28 cluster-wise work tiles.

Let’s also assume the GPU has 9 SMs available in each wave. The work tiles assigned to each wave are shown below.

Looking at Wave 0, we see that matrix A needs work tiles ([0, 5], k) and matrix B needs work tiles ([0, 1], k). That’s 6 loads for matrix A and 2 loads for matrix B from global memory, because only work tiles ([0, 2], k) from matrix A and ([0, 1], k) from matrix B are reused across multiple output tiles. With such a low reuse rate, these tiles can be evicted from L2 before they are reused.

The swizzling pattern allows us to move across the N dimension in a zig-zag, which helps schedule a batch of work tiles that share the same row and column coordinates and can therefore reuse the same A and B tiles. The number of tiles to move across the N dimension is called the block_swizzle_size. The optimal block_swizzle_size depends on the problem shape and the number of cluster-wise block tiles being launched.
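
The mapping itself is just index arithmetic. Here is a hedged sketch (the function name and signature are illustrative, not the scheduler's actual code) that walks the tile grid in strips of block_swizzle_size tiles along N, moving across the strip before stepping down in M; it assumes the number of N tiles is a multiple of block_swizzle_size:

Mojo
# Illustrative swizzled mapping from a linear work-tile index to (m, n) block
# tile coordinates. Within each strip of block_swizzle_size columns we advance
# along N first, then step down along M, so the tiles in one wave share a
# small set of A rows and B columns.
fn swizzled_tile_coords[
    block_swizzle_size: Int
](linear_idx: Int, num_tiles_m: Int) -> Tuple[Int, Int]:
    var tiles_per_strip = block_swizzle_size * num_tiles_m
    var strip = linear_idx // tiles_per_strip
    var idx_in_strip = linear_idx % tiles_per_strip
    var tile_m = idx_in_strip // block_swizzle_size
    var tile_n = strip * block_swizzle_size + idx_in_strip % block_swizzle_size
    return (tile_m, tile_n)

With block_swizzle_size = 2 and 9 work tiles per wave, this ordering assigns wave 0 the tiles (0, 0) through (3, 1) plus (4, 0), which is exactly the reuse pattern described below.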

Now let’s look at an example using a block_swizzle_size of 2. We are now moving across the N dimension by 2 before moving down along the M dimension:

Mapping this back to each wave, the work tiles that are scheduled to wave 0 and 1 now look like this:

As you can see in wave 0, matrix A needs work tiles ([0, 4], k) and matrix B needs work tiles ([0, 1], k) from global memory. That removes one load from global memory compared to the scheduling without thread block swizzling.

Putting things together

In this blog post series, we’ve explained how we implemented increasingly optimized matmul kernels using features on the Blackwell GPU. The following table provides a recap of our kernel implementation and optimizations so far.


Kernel Performance Comparison
Kernel name                      TFLOPs   % of cuBLAS
Ultra Naive (no Tensor Cores)       5.6           0.3
Tensor Cores                      155.0           8.8
Swizzling                         295.6          16.8
TMA + ST_Matrix                   293.6          16.7
2SM MMA                           360.2          20.4
Pipelining                       1429.6          81.1
Double Buffering Output          1493.0          84.7
CLC Persistent                   1772.9         100.6

Applying to Shapes in Production

Up to now, and to keep things simple, we have concentrated on square matrices with shapes such as 4096 and 8192. This is great for the purposes of the blog post, but these shapes do not occur in production LLMs. In practice, the M dimension depends on the batch size and prompt lengths, which range from O(10^2) to O(10^3), while the N and K dimensions are determined by the model. The figure below shows the shapes for the gemma-3-27B-it model.

When optimizing for shapes used in production, the key is to choose the optimal parameters, including the MMA shape, pipeline stages, block swizzling pattern, etc., to efficiently orchestrate the optimizations covered thus far. These parameters are chosen to minimize the workload per SM, or equivalently, to maximize the use of all SMs. For instance, serving Gemma 3 involves a matmul of shape MxNxK = 512x8192x5376. Using the 2xSM MMA instruction shape 256x256 we demonstrated before creates 512/128 x 8192/256 = 128 CTAs, whereas reducing the instruction shape to 256x224 creates 148 CTAs, which maps perfectly to the 148 SMs on B200.
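
The arithmetic behind that choice is easy to sanity-check. The sketch below (a standalone illustration, not kernel code) mirrors the computation in the text: for a 2xSM MMA of shape mma_m x mma_n, each CTA owns mma_m / 2 rows of the output, so the grid contains ceildiv(M, mma_m / 2) x ceildiv(N, mma_n) CTAs.

Mojo
from math import ceildiv

# CTA count for a 2xSM MMA of shape mma_m x mma_n, mirroring the
# 512/128 x 8192/256 computation above.
fn num_ctas(M: Int, N: Int, mma_m: Int, mma_n: Int) -> Int:
    return ceildiv(M, mma_m // 2) * ceildiv(N, mma_n)

fn main():
    # Gemma 3 serving matmul: M x N x K = 512 x 8192 x 5376.
    print(num_ctas(512, 8192, 256, 256))  # 4 * 32 = 128 CTAs
    print(num_ctas(512, 8192, 256, 224))  # 4 * 37 = 148 CTAs, one per B200 SM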

However, some shapes require nuanced parameter choices that are hard to determine analytically, since there is a subtle trade-off between pipeline stages and output tiles (see part 3 of the series). The good news is that Mojo has a built-in autotuning framework (kbench) to select the optimal parameters for a given problem shape. Via autotuning on real workloads, MAX matches or exceeds the SOTA implementation by up to 6% when running Gemma 3 27B on an NVIDIA Blackwell system.

Summary

This blog post walked through how MAX outperforms SOTA implementations of matmul. Along the way, we showcased how the GPU hardware has become more complex with many features required to work in unison to achieve peak TFLOPs. As we look into the future, GPUs will become even more powerful and sophisticated. In turn, the programming patterns required to achieve peak performance must also become more sophisticated, and Mojo is up to the challenge.

We’ve shown how to program several GPU features directly in Mojo, but this is just a taste. In subsequent blog posts, we’ll showcase the tools that help us write high-performance code, describe how Mojo achieves performance while retaining strong programming ergonomics, and discuss how you can use these patterns to make your code portable across hardware. Stay tuned!



Ali Taha

AI Performance Engineer

Hengjie Wang

AI Performance Engineer

Hengjie Wang is a software engineer focusing on performance optimizations for AI and scientific applications. He has many years of experience developing and optimizing large-scale scientific applications on world-ranking supercomputers, and has also developed deep learning algorithms to advance physical simulations. Before joining Modular, he was a postdoctoral scholar at Lawrence Berkeley National Lab, where he participated in developing the exascale projects MFIX-Exa and AMReX on national supercomputers. He is a big fan of Go and enjoys hiking and dog training.

Abdul Dakkak

AI Compiler Engineer

Expert in machine learning, compilers, programming languages, and accelerated computing. Before Modular, Abdul led the development of AI compilers for GPUs at Microsoft Research and the Mathematica Compiler at Wolfram Research. Abdul has developed open-source tools for accelerating real-world applications to optimize their performance across the hardware and software stack.