In the first blog post in this series we explained Nvidia's Blackwell GPU architecture and concluded with a 4-line kernel that was a bit worse than cuBLAS. In fact, the performance was a lot worse, coming in at 0.3% of cuBLAS and leaving 1758 TFLOPS on the table.
In this post we are going to continue our journey and improve our performance to more than 50x that of our initial kernel benchmark. Along the way we are going to explain more GPU programming concepts and leverage novel Blackwell features. Note that this is not the end of the blog series; we will continue to improve upon the methods presented here in subsequent blog posts.

To keep things simple, we will be looking at a specific matmul shape where the A matrix is MxK, the B matrix is KxN (transposed), and the resultant C matrix is MxN, with M=N=K=4096. We'll assume the same shape throughout this blog series; in the last post we'll show how our techniques generalize to any shape.
Recall our 4 line matmul from before, and let’s zoom in on the core computation:
Each Fused Multiply Add (FMA) operation requires two global loads and one memory write. The issue with global memory is that, while abundant, it's considerably slower than other kinds of memory. Therefore, the craft of optimizing matmul lies in avoiding or hiding memory loads and stores by leveraging the memory hierarchy available on the GPU. The following figure visually explains the latencies of the different operations we will be using over the course of this series.

Taking a step back, we can visualize the memory access for our 4-line matmul by assigning a different color to each thread, then illustrate how each thread reads data from the input matrices.

- Thread 0 computes C[0, 0], reads row 0 in A and column 0 in B
- Thread 1 computes C[0, 1], reads row 0 in A and column 1 in B
- Thread 2 computes C[1, 0], reads row 1 in A and column 0 in B
- Thread 3 computes C[1, 1], reads row 1 in A and column 1 in B
Considering just these four threads, note that each thread loads one complete row and one complete column to calculate a single output value. If we count the memory loads across all four threads, each row and each column is loaded twice. These observations point us toward the first improvement we can make: reducing slow global memory accesses.
Shared memory
A common technique to reduce redundant loads is called loop tiling. The idea of tiling is simple: load a small tile of the matrix into a much faster memory cache, so the processor can perform all the necessary calculations on that block of data without constantly having to go back to the slow main memory. Once it's finished with one tile, it loads the next one.
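To make the idea concrete, here is a minimal, CPU-only Python sketch of loop tiling for C = A·B. It is only a model of the access pattern (the real kernels in this series are Mojo GPU kernels, and the B matrix there is stored transposed); the tiny sizes are placeholders for the 64x64x64 tiles we use below.

```python
# CPU-only model of loop tiling for C = A @ B.
# "Loading a tile" is simulated by slicing into small Python lists,
# standing in for the copy from global memory into shared memory.
M = N = K = 8          # tiny sizes so the sketch runs instantly
BM = BN = BK = 4       # tile sizes (the real kernel uses 64x64x64)

A = [[float(i * K + k) for k in range(K)] for i in range(M)]
B = [[float(k * N + j) for j in range(N)] for k in range(K)]
C = [[0.0] * N for _ in range(M)]

for m0 in range(0, M, BM):                      # one (m0, n0) pair per block
    for n0 in range(0, N, BN):
        acc = [[0.0] * BN for _ in range(BM)]   # accumulator for this output tile
        for k0 in range(0, K, BK):              # the K/BK loop
            # "Load" one BMxBK tile of A and one BKxBN tile of B.
            a_tile = [row[k0:k0 + BK] for row in A[m0:m0 + BM]]
            b_tile = [row[n0:n0 + BN] for row in B[k0:k0 + BK]]
            # Multiply-accumulate while the tiles sit in fast memory.
            for i in range(BM):
                for j in range(BN):
                    acc[i][j] += sum(a_tile[i][k] * b_tile[k][j] for k in range(BK))
        # Write the finished tile back to "global memory" exactly once.
        for i in range(BM):
            for j in range(BN):
                C[m0 + i][n0 + j] = acc[i][j]
```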
We'll use shared memory as this cache; since each Blackwell SM gives us 228KB of shared memory, we can share data between the threads of a block and perform tiling across the whole block. Here's what we'll do:

We will partition our matrices into tiles: BMxBK tiles for matrix A, and BNxBK tiles for matrix B, where BMxBNxBK = 64x64x64. We will discuss the possible values and their restrictions later, but for now, these can be treated as tiles of our 4096x4096 square matrices. The only constraint we should be aware of right now is that the tiles must not exceed the size of shared memory.
For the first iteration of the K/BK loop, we load a tile of size BMxBK from matrix A and a tile of size BNxBK from matrix B. We know this fits into our shared memory, since that's two 64x64 tiles, or 8192 2-byte elements in total. This results in a memory requirement of about 16KB per block, which is far less than the 228KB of available shared memory. We then perform the matrix-multiply-accumulate (MMA) operation on these tiles, and store the result as an intermediate value:

In the second iteration, we load the next two tiles into shared memory, and we add the result of this MMA to the result from the previous iteration. This keeps going for K/BK iterations (in our case 64 iterations) until we have processed the final tile. Once the K/BK loop is done, we will have the final output tile, which can then be written to global memory just once.
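Here is the same bookkeeping as a tiny Python sketch, just to pin down the numbers (assuming BF16 inputs, as in the rest of this post):

```python
# Back-of-the-envelope numbers for the tiling described above.
M = N = K = 4096
BM = BN = BK = 64
ELEM_BYTES = 2                                  # BF16 inputs

k_iterations = K // BK                          # tiles consumed along K per output tile
smem_bytes = (BM * BK + BN * BK) * ELEM_BYTES   # one A tile + one B tile resident at a time

print(k_iterations)   # 64 iterations of the K/BK loop
print(smem_bytes)     # 16384 bytes, about 16KB, well under the 228KB per SM
```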

With this idea sketched out, we can now move on to our second, improved matmul kernel implementation.
Kernel 2: TMA and tensor cores
In this section we will develop our next kernel. Kernel 2 is more advanced than the initial kernel and will use both tiling and tensor cores for optimization. As a rough idea, here is what the kernel will do:
We will store our B matrix in its transposed form to ensure a coalesced layout when accessing it. This can be done via a Layout transform:
The kernel does require some host-side setup changes, which we will explain incrementally.
1 - Loading tiles into shared memory
The Nvidia Hopper architecture introduced the Tensor Memory Accelerator (TMA), a specialized hardware unit that transfers data between the GPU’s global memory (GMEM) and shared memory (SMEM) asynchronously.
To use the TMA, we need to first create a tensor map on the host and pass it to the kernel. The tensor map is a 128B data chunk encoding the input tensor's shape, strides, and global memory address. (The tensor map can also encode a swizzling pattern, an optimization we'll discuss a little later.) You can easily create a TMA tile in Mojo using the provided APIs:
Below is how we use the TMA object within the kernel:
At a high level, one thread (elect_one_thread) launches the asynchronous copies and uses a memory barrier (tma_mbar) to guarantee that the copies have completed. Let's go through this one step at a time.
a_tma_op.async_copy takes three arguments:
- a_smem_tile: a LayoutTensor providing the tile's shared memory address
- tma_mbar: a memory barrier to track how much data has been transferred
- (i * BK, block_idx.y * BM): the current tile's coordinates in global memory, depending on the iteration and block coordinate (see part 1)
The need for a TMA barrier
Because the TMA operation is asynchronous, we need some way to ensure the MMA cannot proceed until the tiles are fully resident in shared memory. This is why we use a memory barrier (mbar): threads will wait/block on this barrier until the tiles have been copied to shared memory.
This is done by initializing each thread with its own barrier phase (tma_phase=0). The memory barrier's internal phase value is also initialized to 0. While a thread's phase matches the barrier's phase, the thread cannot get past the barrier; only when the two phases differ can the thread proceed.
The above code causes the threads to block; if we trace the execution, it looks like this:

Prior to the TMA transfer, the barrier is initialized with how many bytes to expect from the TMA via tma_mbar[0].expect_bytes(expected_bytes). The expected bytes are the total number of bytes in both transferred tiles. The TMA continuously updates the barrier with the number of bytes it has transferred, and once that count reaches the expected total, the phase of the barrier flips and the threads can proceed.

We then manually toggle the tma_phase of each thread via tma_phase ^= 1 to ensure that the threads block on the next iteration until the TMA transfers that iteration's tiles into shared memory.
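If the phase dance sounds abstract, here is a toy Python model of the logic described above. It is not the real mbarrier API (the real barrier lives in shared memory and is updated by the TMA hardware); it only mimics the expect/arrive/phase-flip bookkeeping.

```python
# Toy model of the barrier phase protocol: expect bytes, count arrivals,
# flip the phase when everything has landed, and let waiting threads through
# only when their phase differs from the barrier's phase.
class ToyBarrier:
    def __init__(self):
        self.phase = 0
        self.expected = 0
        self.arrived = 0

    def expect_bytes(self, n):          # set before the TMA copy is issued
        self.expected = n
        self.arrived = 0

    def arrive(self, nbytes):           # the TMA "reports" transferred bytes
        self.arrived += nbytes
        if self.arrived >= self.expected:
            self.phase ^= 1             # all bytes landed: flip the barrier phase

    def can_pass(self, thread_phase):   # a thread may pass only if phases differ
        return self.phase != thread_phase


bar = ToyBarrier()
thread_phase = 0                   # each thread starts with tma_phase = 0

bar.expect_bytes(2 * 64 * 64 * 2)  # two 64x64 BF16 tiles
print(bar.can_pass(thread_phase))  # False: copy not finished, thread blocks

bar.arrive(2 * 64 * 64 * 2)        # the TMA finishes the transfer
print(bar.can_pass(thread_phase))  # True: phase flipped, thread proceeds

thread_phase ^= 1                  # thread toggles its phase for the next tile
print(bar.can_pass(thread_phase))  # False again until the next copy completes
```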

Mojo does provide abstractions that hide some more TMA details and optimization tricks for you. For example, if we ask you what the a_tma_op layout is, what would your answer be?
💭 We said the tile is BM x BK, so if BM is 64 and BK is 64, then the ((shape), (stride)) tuple is ((64, 64):(64, 1)), given that it is Row-Major/K-Major.
And you would, of course, be correct! Then the next question: how many fetches will the TMA unit need to load our 64x64 tile?
💭 Well, since it's declared with a 64x64 layout, obviously one.
And you would, of course, be wrong. It actually takes 8 fetches to load the tile. While we specify a logical tile size of 64x64, the TMA hardware partitions our 64x64 tile into eight 64x8 sub-tiles and loads them one at a time. To explain why it has to be done this way, we need to introduce the "core matrix."
Core matrices
There is a hidden nuance with TMA, Tensor Cores, and Nvidia GPUs at large: the core matrix.
The concept is simple: tensor cores don't understand elements, they understand matrices. They can only view a matrix as a group of 8x16B tiles—that's core matrices of 8x8 elements as far as we're concerned.

tcgen05.mma supports 8 canonical layouts of core matrices in shared memory (inherited from WGMMA), based on the layout (row-major or column-major) and swizzling mode. Our current kernel uses K-major for both A and B, corresponding to a layout where each column of core matrices (8x1) needs to be contiguous in shared memory. This is why the descriptor layout shows (64, 8)—the TMA copies one column of 8 core matrices at a time, and this operation is repeated 8 times to fill the 64-element width of the tile.
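As a quick sanity check, here is that arithmetic as a short Python sketch, using the 8x16B core-matrix size from above:

```python
# Why the 64x64 K-major BF16 tile takes 8 TMA fetches.
BM, BK = 64, 64
ELEM_BYTES = 2                      # BF16
CORE_BYTES = 16                     # one core-matrix row is 16B (8 elements)

core_k = CORE_BYTES // ELEM_BYTES   # 8 K-elements covered by one core matrix
fetch_shape = (BM, core_k)          # one contiguous column of core matrices: (64, 8)
num_fetches = BK // core_k          # 64 / 8 = 8 fetches to cover the whole tile
print(fetch_shape, num_fetches)
```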

Of course, our Mojo library's async_copy abstracts this complexity away from you, such that a programmer only has to issue a copy and can expect the tile to show up in shared memory.
2 - Issuing the MMA instructions
To recap a bit, the 5th generation tensor cores introduced in Blackwell come with a new set of instructions (the tcgen05 instructions) and 3 fundamental improvements for the MMA operation:
- Increased the largest tcgen05.mma shape to 128x256x16 for a single SM, compared to the previous 64x256x16 on Hopper. This means we almost double the throughput.
- Introduced a 2SM tcgen05.mma with shapes up to 256x256x16 (we will explain the 2SM operation in a subsequent blog post in this series).
- Decreased register pressure by introducing a new kind of memory called Tensor Memory, which allows the tcgen05.mma instruction to store its result in Tensor Memory instead of register memory. But what is tensor memory?
What is Tensor Memory?
There is another improvement in Blackwell which we mentioned briefly in the first part of this blog series, and that is Tensor Memory (TMEM).

TMEM is a 256KB on-chip memory space specialized for storing the inputs and outputs of tcgen05 MMA instructions. It has 128 lanes of 512 columns, for a total of 65,536 cells; each cell is 4 bytes, for a total of 256KB. Allocation is done by columns, with a granularity of 32 columns, meaning the smallest amount you can allocate is 32 columns (16KB) at a time.
Prior Nvidia generations had to use general-purpose registers to store matmul results, which has a few issues:
- Register space is scarce, with only 64K registers per SM, so there was contention between the tensor cores and the general-purpose ALUs.
- Registers are thread-private, while MMAs were warp-level operations on pre-Blackwell GPUs. Thus the warp launching an MMA operation needed to wait for its completion before continuing with tasks that depend on the MMA result, e.g. the epilogue.
TMEM addresses these issues, separating the registers used by the ALUs from the storage required by the tensor cores.
This is how we make use of tcgen05.mma and tensor memory in our code:
The tcgen05.mma instruction is executed asynchronously and is similar to the TMA operations: it is launched by a single thread and guarded by a memory barrier. The difference is that here we use mma_arrive, which wraps tcgen05.commit, to signal the memory barrier and link it with the MMA instructions on the fly.
Note that we issue num_k_mmas MMA instructions (instead of just feeding both the A and B tiles to the tensor cores once and multiplying them together). The reason is that the BM×BN×BK tiling alone is not sufficient: the actual hardware instruction has size restrictions. The tcgen05.mma instruction requires the K dimension to be 32B (i.e. 16 elements for BF16/FP16), so the MMA takes 4 iterations for BK = 64.
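Here is that counting as a small Python sketch (BF16 inputs assumed, as in our kernel):

```python
# How many tcgen05.mma instructions one output tile ends up needing.
K = 4096
BK = 64
ELEM_BYTES = 2                       # BF16
MMA_K_BYTES = 32                     # the instruction consumes 32B of K at a time

mma_k = MMA_K_BYTES // ELEM_BYTES    # 16 elements of K per instruction
num_k_mmas = BK // mma_k             # 4 MMAs per loaded pair of tiles
total_mmas = (K // BK) * num_k_mmas  # 64 tile iterations x 4 = 256 per output tile
print(num_k_mmas, total_mmas)
```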

And, yes, this means we are doing a nested tiling strategy. The MMA function invokes the tcgen05.mma instruction, which accumulates its result in tensor memory at address tmem_addr. To allocate the tensor memory one needs to perform:
This allocation turns out to be quite non-trivial. First, the allocation needs to be issued by a single warp (not a single thread). Moreover, we have to take a detour through shared memory to get the allocated tmem address.
The inputs and configurations for tcgen05.mma are encoded into descriptors:
- Instruction descriptor (idesc): This encodes the instruction shape, data type, matrix layout, etc. This descriptor does not change across iterations since these properties remain constant throughout the computation.
- Shared memory descriptors (adesc and bdesc): These encode information about the shared memory layout and access patterns for matrices A and B. These descriptors do change and are incremented across the num_k_mma iterations, because the shared memory addresses change as we move through different K-slices of the matrices.
📚 A deeper dive into this is provided in the Appendix.
And then, once the MMA is complete, it arrives on the MMA barrier. This works basically the same way as the TMA barrier, blocking all threads until the MMA operation is complete. This way, no thread proceeds to launch the next iteration's TMA operation until all MMA operations on the current tiles are complete.
3 - TMEM → registers
We’ve now essentially covered the two main functions:
The results have been accumulated and stored in tensor memory. The next question is: how do we get them out of tensor memory and into global memory?
The only way to move data out of tensor memory is to move it into registers first. This can be done via the tcgen05_ld operation:
This instruction is pretty complicated, but let's break it down. If we look at the documentation on how the data is stored in tensor memory, we observe that tensor memory holds a 64x64 C_tile. The layout organization and access pattern, according to Figure 215 in Nvidia's Parallel Thread Execution ISA Version 9.0, is as follows:

So to access the memory, each warp in our launched block needs to read out 16 lanes, and the entire warp group (4 warps) reads out all 64 lanes. The parameters (datapaths and bits) specify this load pattern, and tcgen05_ld internally dispatches the tcgen05.ld.16x256b instruction to load each set of lanes.
This means that on each iteration the threads load 256 bits, or 8 elements, per lane across the columns of tensor memory (not 16 elements: recall from our first blog that we accumulate the results in FP32 to preserve accuracy, so each element occupies a 4-byte slot in tensor memory), for BN/8 iterations. This means each one of the 32 threads within the warp holds 4 elements per load (see Figure 185).
Repeat this BN//8 = 8 times, and each thread now holds 32 elements of the tile in an array of registers. Once this is done, we know that we've successfully transferred all the data into our registers, and we can deallocate the tensor memory that we allocated.
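Here is that bookkeeping as a short Python sketch:

```python
# Per-thread accounting for the tcgen05.ld.16x256b loads described above.
BN = 64
LANES_PER_LOAD = 16                    # the load touches 16 TMEM lanes
BITS_PER_LANE = 256
FP32_BITS = 32                         # accumulators are FP32, 4B per element
THREADS_PER_WARP = 32

elems_per_load = LANES_PER_LOAD * BITS_PER_LANE // FP32_BITS    # 128 values per warp
elems_per_thread = elems_per_load // THREADS_PER_WARP           # 4 values per thread
num_loads = BN // (BITS_PER_LANE // FP32_BITS)                  # BN / 8 = 8 iterations
print(elems_per_thread, num_loads, elems_per_thread * num_loads)  # 4 8 32
```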
4 - Registers → GMEM
In terms of code, here is where we are now:
We're still missing a key step: write_c_tile_to_global_memory, which moves the data from registers into global memory. To move the data, we begin by identifying where we want to write it. Our matrix is 4096x4096, and each block is responsible for a 64x64 tile of this output. If we focus our attention on, say, block_idx.y = 2, block_idx.x = 2, this block is responsible for outputting the 3rd tile of the 3rd row of tiles:

We use the LayoutTensor.tile() method to extract a tile of the output matrix:
Then we further tile it for each warp:

Let's focus on warp 0's tile and how it is affected. c_gmem_warp_tile of tile 0 represents the first 16 rows by 64 columns (16xBN), and we need to map this 16x64 tile to warp 0, since that is where the accumulator values are stored. The following plot shows how elements are mapped to lanes (threads) for the tcgen05.ld.16x256b PTX instruction:

There is quite some indexing involved. What if there was some way for us to create views—or little pockets—into the warp’s tile to give each thread a layout of exactly the pockets it needs to put its data into? Wouldn’t that be cool? If only we had a library function that did this. Mojo provides such a function, so this can be done succinctly:
This might be a bit complicated for folks who have just had their first look at LayoutTensor, so let's visualize the view for thread 0. The first part of the code recognizes that, since each thread stores 2 consecutive elements, the 16x64 tile can be viewed as a 16x32 tile of 2-value vectors:

This is followed by .distribute[Layout.row_major(8, 4)], which distributes the 16x32 vectors over 8x4 threads repeatedly, as demonstrated below. The offset is calculated as row_major(8, 4)(lane_id()). For example, thread 0 gets the vector at (0, 0) in all the sub-matrices (green cell) and thread 6 similarly gets the vector at (1, 3) (blue cells in the figure below). In fact, each sub-matrix maps identically to Nvidia's layout in Figure 185:

This results in 2x8 sub-matrices, each sub-matrix storing 8x4 2-value vectors. Furthermore, distribute gives each thread the view of exactly the pockets it needs, just as we promised.

With this mapping, the output to global memory is trivially accomplished by a loop:
Using num_vecs_n, num_vecs_m=(8,2) as an example, each warp writes out one sub-matrix at a time, first twice across the M dimension, then eight times across the N dimension. The sketch below spells out this loop, followed by a quick overview of it in action:
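Below is a Python sketch of that loop for a single warp. It assumes a thread sitting at position (tr, tc) of the row_major(8, 4) arrangement owns the 2-value vector at (tr, tc) of every 8x4 sub-matrix, and it leaves the exact lane-to-(tr, tc) mapping as an input, since that mapping comes from the tcgen05.ld layout discussed above; the loop order is illustrative.

```python
# One warp writing its 32 accumulator elements back to "global memory".
# tr, tc: the thread's position in the row_major(8, 4) arrangement (assumed input).
# warp_row0, warp_col0: the warp tile's origin inside the CTA's output tile.
def write_warp_tile(c, acc, warp_row0, warp_col0, tr, tc,
                    num_vecs_m=2, num_vecs_n=8):
    # acc holds num_vecs_m * num_vecs_n vectors of 2 values, in (m, n) order.
    for i in range(num_vecs_m):              # across the 2 sub-matrices in M
        for j in range(num_vecs_n):          # across the 8 sub-matrices in N
            v0, v1 = acc[i * num_vecs_n + j]
            row = warp_row0 + 8 * i + tr     # sub-matrices are 8 rows tall
            col = warp_col0 + 2 * (4 * j + tc)  # each vector covers 2 columns
            c[row][col] = v0
            c[row][col + 1] = v1

# Tiny usage example on a 16x64 tile for the thread at position (0, 0):
C_tile = [[0.0] * 64 for _ in range(16)]
acc = [(float(k), float(k) + 0.5) for k in range(16)]
write_warp_tile(C_tile, acc, warp_row0=0, warp_col0=0, tr=0, tc=0)
```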




The above is for a single warp. If we zoom out to the CTA level, we can map the CTA's tile to the C matrix in global memory as follows:

5 - Set up shared memory for everything above
Before looking at how to set up the shared memory, let's first examine what our shared memory stack looks like on the SM; this is the setup() phase we've skipped so far. We use shared memory primarily for the input tiles, the memory barriers, and the TMEM allocation.
The setup code above grabs the base address from the dynamic shared memory allocation (external_memory), then grows the offset as shown below.
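If it helps to see the bookkeeping spelled out, here is a rough Python model of that offset arithmetic. The tile sizes match the kernel, but the 8-byte barrier size and the 16-byte alignment are illustrative assumptions, not the exact layout the Mojo library produces.

```python
# Rough sketch of carving dynamic shared memory into the pieces this kernel needs.
BM = BN = BK = 64
ELEM_BYTES = 2           # BF16 input tiles
MBARRIER_BYTES = 8       # assumed size of one barrier object
ALIGN = 16               # assumed alignment between regions

def align_up(x, a=ALIGN):
    return (x + a - 1) // a * a

offset = 0
a_tile_offset = offset
offset = align_up(offset + BM * BK * ELEM_BYTES)      # 8KB A tile
b_tile_offset = offset
offset = align_up(offset + BN * BK * ELEM_BYTES)      # 8KB B tile
tma_mbar_offset = offset
offset = align_up(offset + MBARRIER_BYTES)
mma_mbar_offset = offset
offset = align_up(offset + MBARRIER_BYTES)
tmem_addr_offset = offset                             # one 32-bit slot for the TMEM address
offset += 4

print(a_tile_offset, b_tile_offset, tma_mbar_offset,
      mma_mbar_offset, tmem_addr_offset, offset)      # total stays far below 228KB
```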

Putting things together and benchmarking this kernel, we end up getting 155.0 TFLOPS, a 28x improvement over our naive kernel. But to put things in perspective, the kernel is still operating at only 8.7% of cuBLAS' performance.

Kernel 3: swizzling
One overhead Kernel 2 has is launching multiple TMA calls to load each input tile. Recall that the reason we had to do that is that BK=64, but the canonical layout needed by the tensor core only allows us to copy 16B along K at a time. There are other layouts supporting a larger K dimension per copy, e.g. the widest 128B layout, which is best illustrated by CUTLASS's comment:
The math indicates we can indeed use a single row-major tile of BMxBK with BK = 64 (see 8T above), as long as we combine it with Swizzle<3, 4, 3>. What's a swizzle? And why the magic pattern of <3, 4, 3>? To understand this better, let's brush up on our understanding of shared memory.
Shared Memory Banks
Shared memory consists of 32 consecutive 4B wide banks:

Each bank within shared memory can service one request per cycle, and multiple threads accessing different banks can all be serviced in the same cycle. That is, bank 0 services thread 0 and bank 16 services thread 1 at the same time:

Bank conflicts
However, what if two requests access the same bank? For example, consider the case where two threads access bank 0 for different addresses, say thread 1 now wants to access the element at row 3, column 0.

Now this takes 2 cycles: bank 0 first services thread 0, and one cycle later, bank 0 services thread 1. Intuitively, this makes sense. To maximize throughput, the GPU is designed to load at most 128B per cycle by sweeping across all banks (32 banks times 4B per bank). A second load to the same bank needs to be scheduled in a later cycle.
Recall that the instructions are issued by warp. When threads within a warp try to access different addresses mapped to the same bank, the hardware has to break the execution down into multiple cycles. This stall in execution is called a bank conflict, and it's obviously bad for performance.
If we apply the above to the 128B canonical layout (i.e. the tile is BMxBK with BK=64), then the 8 rows of the first core matrix all map to the same banks, 0-3.

This would create an 8-way bank conflict per core matrix and cause the write for each row to happen sequentially. Obviously we need a technique to read the required data without the stalls created by bank conflicts.
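Here is a small Python sketch that reproduces the conflict: it maps byte offsets to banks the way the hardware does (32 banks, 4B wide) and checks the first 16B of each of the 8 rows of the first core matrix.

```python
# Without swizzling, the first 16B of every 128B row land in banks 0-3,
# so the 8 rows of a core matrix collide 8 ways.
ROW_BYTES = 64 * 2                      # BK = 64 BF16 elements per row = 128B

def bank(byte_offset):
    return (byte_offset // 4) % 32      # 32 banks, 4B wide each

for row in range(8):                    # the 8 rows of the first core matrix
    start = row * ROW_BYTES             # where this row begins in shared memory
    banks = sorted({bank(start + b) for b in range(16)})   # its first 16B
    print(f"row {row}: banks {banks}")  # every row prints banks [0, 1, 2, 3]
```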
Swizzling
Swizzling is a technique to solve bank conflicts. Swizzling uses a bitwise xor (^) to swap indices so that the data does not reside in the same bank. Let's show swizzling via an example; assume we have 16 banks for simplicity.
Row 0 stays the same by ^00:

Row 1: flip adjacent pairs (0↔1, 2↔3, 4↔5, ...) by ^01:

Row 2: flip pairs of pairs (01↔23, 45↔67) by ^10:

Row 3: reverse each group of four (0↔3, 1↔2, 4↔7, 5↔6) by ^11:

Note that the same index (1-16) on different rows has been swapped to different banks. That is, there is no bank conflict when threads access elements on different rows at the same index.
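The toy example is small enough to run as-is; here it is in Python, using 0-based indices instead of 1-16:

```python
# 16-bank toy swizzle: row r XORs each column index with r (0b00..0b11).
NUM_BANKS = 16
for row in range(4):
    swizzled = [col ^ row for col in range(NUM_BANKS)]
    print(f"row {row}: {swizzled}")

# Threads reading index 5 on rows 0-3 now hit banks 5, 4, 7, 6: all different.
print([5 ^ row for row in range(4)])
```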
128 byte swizzling
Let's decipher the 128B swizzle pattern of <3, 4, 3>. The first 3 corresponds to 2^3 = 8: the number of rows in the core matrix. The 4 maps to 2^4 = 16B (the width of the core matrix, 8 elements x 2B). The last 3 is 2^3 = 8, which implies 8 chunks of 16B span the entire 32 banks (128B). With these values, the swizzle function provides the correct xor pattern to resolve bank conflicts for core matrices. The pattern can be written in Mojo and generalized for common patterns. Visually it looks like:

Each group of 8 elements (16B = 8*2B) is formed by zeroing the 3 LSBs of the xor operand. The xor computation then swaps the groups within each row, like we showed before. As a result, corresponding groups of 8 elements on different rows land in different banks. And this continues as shown below:

The full mathematical details are presented in the appendix. If we look at how two adjacent core matrices get swizzled onto 32 banks:

With the swizzle, the above becomes:

With the swizzle, no two elements from the same core matrix ever end up in the same bank: a core matrix is 16 bytes wide and 8 rows tall, and the swizzle spreads those 128 bytes across all 32 banks. Hence there are no bank conflicts with the 128-byte swizzle.
🍭 This is why swizzling is extremely useful, and every high-performance GPU kernel will use it.
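To see the effect end to end, here is a Python sketch that applies the <3, 4, 3> xor pattern (the formula is spelled out in the appendix) to the same 8 rows that conflicted above:

```python
# Swizzle<3, 4, 3> on byte offsets: take bits 7-9 as a mask, xor it into bits 4-6.
def swizzle_128b(byte_offset, bits=3, base=4, shift=3):
    mask = (byte_offset >> (base + shift)) & ((1 << bits) - 1)
    return byte_offset ^ (mask << base)

ROW_BYTES = 64 * 2                       # 128B per row of the BF16 tile

def bank(byte_offset):
    return (byte_offset // 4) % 32       # 32 banks, 4B wide each

for row in range(8):                     # the 8 rows of the first core matrix
    start = row * ROW_BYTES
    banks = sorted({bank(swizzle_128b(start + b)) for b in range(16)})
    print(f"row {row}: banks {banks}")   # rows now hit banks 0-3, 4-7, ..., 28-31
```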
The updated kernel
The code changes for the kernel are minimal, since support for swizzling comes through the library's layout tensors and the instruction itself. The only code change necessary is telling the TMA and tcgen05.mma which swizzle mode to adopt:
Since LayoutTensor understands swizzling, we can hide the details of the swizzle operations behind the layout tensor API and leave our code unchanged.
Performance
With the above optimizations we are able to achieve 288.3 TFLOPS on B200 (an 87% improvement). In other words, the effects of shared memory bank conflicts basically cut our performance in half. By resolving the bank conflicts we are at 16.4% of cuBLAS and quickly closing the gap.

Kernel 4: Packing output in shared memory and using TMA store
In the previous kernel's output stage, we write two contiguous BF16 values to global memory at a time. That's only 4B per store, which is quite small given that Blackwell supports up to 32B per store instruction (st.global.v8.b32). Moreover, we can use TMA to store an entire output tile per instruction, reducing the number of issued instructions.
Pack output in shared memory
To leverage the TMA store, we need to pack the output data into shared memory. This is done by copying the registers to shared memory after loading the output from tensor memory to registers. Since the output in global memory is BF16, we must cast the registers from FP32 to BF16 before copying them to shared memory.
But, since the output result is fragmented across registers in a particular layout (the 16x256-bit load, see the TMEM → registers section), we need to handle that layout when copying the registers to shared memory. Luckily, Nvidia provides the stmatrix instructions to facilitate this step. The stmatrix instruction stores 8x16B core matrices, distributed in exactly the 16x256-bit layout, to shared memory, with the flexibility that users can specify the shared memory address for each row. There appears to be a mismatch between 256 bits (32B) and 16B per row; this is because the data loaded from TMEM is FP32 and we cast it to BF16 when storing to shared memory. The stmatrix instruction stores up to four core matrices (2x2) per instruction. As a result, packing the 16x64 (BN=64) warp tile takes four iterations of stmatrix.
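The arithmetic behind those four iterations, as a quick Python sketch:

```python
# Packing one 16x64 BF16 warp tile with stmatrix.
BN = 64
ELEM_BYTES = 2                           # output is BF16 after the FP32 -> BF16 cast
CORE_ROWS, CORE_BYTES = 8, 16            # one core matrix is 8 rows x 16B

rows_per_inst = 2 * CORE_ROWS                    # 2x2 core matrices: 16 rows...
cols_per_inst = 2 * CORE_BYTES // ELEM_BYTES     # ...and 16 BF16 columns per stmatrix
iterations = BN // cols_per_inst                 # 64 / 16 = 4 stmatrix calls per warp tile
print(rows_per_inst, cols_per_inst, iterations)
```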

Note that this would hit bank conflicts when writing into shared memory, so we use 128B swizzling (BN * 2B = 128B) to avoid the conflicts.
TMA store
With the data swizzled and packed in shared memory, we can launch the TMA store operation to copy the data back to global memory asynchronously. The code below shows the TMA store and how it handles synchronization. Before the TMA issues the asynchronous store, we need to fence the memory via fence_async_view_proxy to ensure the previous packing into shared memory is visible to the TMA store.
After issuing the TMA store, we first commit the stores using commit_group(). This groups the issued stores from the last commit up to that point in the program. The subsequent wait_group[N]() waits until only N groups of stores are still in flight. For instance, if there are 3 committed groups, wait_group[2]() ensures the first group has completed while the last two groups may still be in flight. In the above code, wait_group[2]() guards all TMA stores to finish. The ability to wait by commit groups allows you to build pipelines and overlap other tasks efficiently in later optimizations.
There is another difference between the TMA store and the TMA load: multiple threads can issue TMA stores in parallel.
Here TMA_BN is based on the swizzling mode (e.g. TMA_BN=64 for 128B swizzling and BF16). If the tile dimension BN is greater than TMA_BN, then we need to divide the dimension by TMA_BN and launch multiple TMA stores. For instance, BN=128 maps to two stores as shown below, which are launched by two threads to maximize parallelism.

Performance
This kernel stays at around the same performance: 293.6 TFLOPS, 0.7% slower to be exact. Why? Well, the performance is still fundamentally bound by global memory accesses.

The following plot shows the compute and memory throughput from NCU. The green bars are kernel 3 and the blue bars are kernel 4. As you can see, the compute and memory throughput are very low for both kernels.

Furthermore, the true power of the TMA store lies in its asynchrony, which enables pipelining and overlapping operations. The current kernel lays the right groundwork for us to leverage these features in subsequent posts.
In conclusion, in this post we demonstrated how to tile matmul and how to program a Blackwell GPU with the optimal instructions for performance, using features such as TMA loads/stores, tcgen05.mma, stmatrix, etc. This effort results in a 58x improvement over the naive kernel, but still falls behind the performance of cuBLAS.
Future blog posts will build upon the kernel presented in this blog to improve the underlying schedule and algorithm of execution. Specifically, the next post will showcase how to build a warp specialized pipeline to overlap data transfer and computation to get performance that’s closer to state-of-the-art.
Appendix
Descriptors
tcgen05.mma uses descriptors to specify the input data's layout in shared memory, the instruction shape, data type, etc. To create an smem descriptor in Mojo one does:
adesc = MMASmemDescriptor.create[aSBO, aLBO, a_swizzle](a_smem_tile.ptr)
The MMASmemDescriptor takes care of encoding all this information in the format required by tcgen05.mma. The most important details are LBO and SBO:
- LBO (leading dimension byte offset): the number of bytes between two adjacent core matrices in the K dimension.
- SBO (stride dimension byte offset): the number of bytes between two adjacent core matrices in the M/N dimension.
In kernel 2, i.e. without swizzling, printing this out for A shows:
As shown below, LBO is 1024B because the distance between two columns of core matrices is BM*16B = 1024B. SBO is 128B because the size of each core matrix is 8x16B = 128B.
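As a quick check, the same numbers in Python:

```python
# Descriptor offsets for the unswizzled A tile in kernel 2, in bytes.
BM = 64
CORE_ROWS, CORE_BYTES = 8, 16    # one core matrix is 8 rows x 16B

LBO = BM * CORE_BYTES            # next column of core matrices along K: 64 * 16B = 1024B
SBO = CORE_ROWS * CORE_BYTES     # next core matrix along M: 8 * 16B = 128B
print(LBO, SBO)                  # 1024 128
```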

The UMMA descriptor idesc follows a similar pattern, except that it's 32 bits and encodes other information about sparsity, data type, whether the matrices are transposed or not, and so on. We refer the reader to our source code for the detailed encoding.
Swizzling mathematics
The mathematical definition of swizzling is shown below, for a swizzle defined as Swizzle(bits, base, shift):
If you look at the lower-level Mojo code, the swizzle is implemented as:
Consider the 128B swizzle where bits=3, base=4, and shift=3. The math indicates we extract bits 7-9 of the input address as a mask and xor bits 4-6 with it, which generates the pattern shown in kernel 3:
