October 17, 2025

Achieving State-of-the-Art Performance on AMD MI355 — in Just 14 Days

Tracy Sharpe

Anand Pratap Singh

Prince Jain

Abdul Dakkak

In late August, AMD and TensorWave reached out to collaborate on a presentation for AMD’s Media Tech Day—they asked if we could demo MAX on AMD Instinct™ MI355 on September 16th. There was just one problem: no one at Modular had access to an MI355.

That gave us just two weeks from the point we’d first have access. We saw this as an opportunity: could we achieve state-of-the-art performance on MI355 in just 2 weeks? We were confident we could, because we had spent several years building a technology platform to enable rapid AI hardware bringup.

Why AI Hardware Enablement Is Hard

Bringing new AI hardware online isn’t supposed to happen in just two weeks.

Why? The modern AI software ecosystem is fragmented. Different companies own different layers — hardware vendors are chasing silicon sales, researchers are prototyping bleeding-edge ideas, and application developers are stitching things together. The result? Complex, brittle stacks that are hard to extend and even harder to optimize.

That’s why, from day one, we built the Modular stack to make AI hardware enablement fast, consistent, and maintainable.

We often joke that the hardware is no longer the hard part — it’s the software that slows everyone down.

A Foundation Built for Portability

The Modular software stack — from Mojo (our programming language) to MAX (our inference framework) and Mammoth (our distributed serving system) — is architected for portability, meaning it can move quickly onto the latest hardware architectures and deliver SOTA performance.

Here’s how our design pays off:

  • Architecture-agnostic design. Mojo and MAX do not hardcode GPU knowledge and optimizations. Instead, they delegate all hardware-specific details to libraries.
  • Library-directed execution. Offloading, scheduling, and instruction selection are controlled through abstractions rather than hard-coded pathways in the compiler.
  • Parametric operations. Instead of “magic constants” (like SIMD widths or tile shapes), we use parameterized kernels that can be retuned for new hardware.

Because 99.9% of the stack is architecture-agnostic, adding support for a new GPU mostly involves updating a few kernels. MAX and Mammoth just work — out of the box.
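
To make the "parametric operations" point concrete, here is a toy sketch (in Python, for illustration only; the real kernels are written in Mojo) of the idea: tile shapes and SIMD widths live in per-target parameter tables rather than being baked into kernel code. The names and numbers below are hypothetical, not Modular's actual tunings.

Python
from dataclasses import dataclass

# Hypothetical illustration of "parametric operations": kernel behavior is
# driven by per-target parameters instead of constants baked into the code.
@dataclass(frozen=True)
class KernelParams:
    tile_m: int      # output tile rows per workgroup
    tile_n: int      # output tile columns per workgroup
    tile_k: int      # reduction-depth step per iteration
    simd_width: int  # vector width used for loads and stores

# Per-architecture parameter tables; the kernel source itself never changes.
# The values are made up for illustration, not Modular's actual tunings.
PARAMS = {
    "mi300": KernelParams(tile_m=128, tile_n=128, tile_k=32, simd_width=4),
    "mi355": KernelParams(tile_m=256, tile_n=128, tile_k=64, simd_width=8),
}

def select_params(target: str) -> KernelParams:
    """Bringing up a new GPU mostly means adding and tuning one table entry."""
    return PARAMS[target]

print(select_params("mi355"))

In this model, bringing up a new GPU amounts to adding a table entry and tuning it, which is essentially what the rest of this post describes.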

That design is what allowed us to move so quickly once MI355 hardware arrived. In fact, we only needed to revise the parts of the stack that correspond to what changed in the hardware itself. These all reside in specific kernels—so let's look at the new features in MI355.

What’s New in AMD MI355

To understand why MI355 requires only incremental kernel updates, let's examine what the MI355 architecture offers and how Modular kernels address these features:

| MI355 feature | Impact | Our adjustment |
| --- | --- | --- |
| New FP32→BF16 conversion instructions | Faster casting between data types | Updated cast dispatch in Mojo’s standard library |
| Larger tensor-core tile sizes | Higher arithmetic intensity | Expanded matmul tile selection space |
| Increased shared memory (64 KB → 160 KB) | Larger tiles, less memory pressure | Tuned tile-size parameters |
| Transposed DRAM→shared-memory loads | Removes extra in-place transpositions | Updated LayoutTensor load logic |
| Larger HBM capacity | Allows larger dynamic batch sizes | Nothing — handled automatically by MAX’s scheduler |

These new hardware features in MI355 are specifically designed to optimize matmul-like operations, so that’s the only category that required changes in the Modular stack.
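
As a rough illustration of why the larger shared memory matters, consider the footprint of the tiles a matmul workgroup stages on-chip. The tile shapes and the simple A-tile-plus-B-tile staging model below are assumptions made for the sake of the arithmetic, not our actual kernel layout.

Python
# Rough arithmetic only (an illustration, not Modular's heuristic): assume a
# matmul workgroup stages one A tile (tile_m x tile_k) and one B tile
# (tile_k x tile_n) of BF16 data in shared memory.
BYTES_PER_ELEM = 2  # BF16

def lds_bytes(tile_m: int, tile_n: int, tile_k: int) -> int:
    return (tile_m * tile_k + tile_k * tile_n) * BYTES_PER_ELEM

for name, capacity in [("64 KB shared memory", 64 * 1024), ("160 KB shared memory", 160 * 1024)]:
    for tile in [(128, 128, 64), (256, 256, 128)]:
        need = lds_bytes(*tile)
        verdict = "fits" if need <= capacity else "does not fit"
        print(f"{name}: tile {tile} needs {need // 1024} KB -> {verdict}")

Under this simplified model, the bigger tile only becomes viable with MI355's larger shared memory, which is exactly why the tile-size parameters are worth retuning.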

Two Weeks to SOTA: A Day-by-Day Story

Thanks to the stack design described above, our work on AMD's MI355 effectively started well before we had hardware access: we had architected the Modular stack from the start to enable fast hardware bringup.

With only 2 weeks until the demo, we had to be selective about what we could develop in a reasonable timeframe. While Modular moves fast, we pride ourselves on maintaining a good work-life balance—so we wanted to accomplish our goals without going into crunch mode.

Day 0: Preparing Without Hardware

Before we even touched an MI355, we began testing code generation offline. Mojo’s hardware-agnostic backend allowed us to simulate instruction paths — verifying the emitted assembly matched expectations. Try it yourself with Compiler Explorer.

By understanding the features available in the new hardware and ensuring we emit the correct instructions, we can prototype optimizations without actually running the program on the new hardware 🤯.
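
For example, one low-tech way to sanity-check code generation offline is to scan the assembly the compiler emits (say, saved from Compiler Explorer) for the conversion instruction you expect a cast to lower to. The mnemonic pattern in this sketch is an assumption for illustration; consult the target ISA documentation for the exact name.

Python
import re
import sys

# Hypothetical offline check: scan assembly emitted by the compiler (e.g., the
# output pane of Compiler Explorer saved to a file) for the instruction we
# expect an FP32 -> BF16 cast to lower to. The pattern below is an assumption
# for illustration; consult the target ISA manual for the exact mnemonic.
EXPECTED = re.compile(r"\bv_cvt\w*_bf16_f32\b")

def uses_native_bf16_conversion(asm_path: str) -> bool:
    with open(asm_path) as f:
        return any(EXPECTED.search(line) for line in f)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "cast_kernel.s"
    print("native bf16 conversion found:", uses_native_bf16_conversion(path))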

Day 1: First Login, First Success

On September 1st, TensorWave provisioned MI355 systems for us. We logged in, ran amd-smi, and saw the new hardware come online:

Bash
modular@mia1-p1-g49:~$ amd-smi
+------------------------------------------------------------------------------+
| AMD-SMI 26.0.0+d30a0afe    amdgpu version: 6.14.14    ROCm version: 7.0.0    |
|-------------------------------------+----------------------------------------|
| BDF          GPU-Name               | Mem-Uti  Temp   UEC  Power-Usage       |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti  Fan    Mem-Usage              |
|=====================================+========================================|
| 0000:05:00.0 AMD Instinct MI355X    | 0 %      44 °C  0    229/1400 W        |
| 0    1       6       SPX/NPS1       | 0 %      N/A    283/294896 MB          |
|-------------------------------------+----------------------------------------|
| 0000:15:00.0 AMD Instinct MI355X    | 0 %      45 °C  0    226/1400 W        |
| 1    3       7       SPX/NPS1       | 0 %      N/A    283/294896 MB          |
|-------------------------------------+----------------------------------------|
| 0000:65:00.0 AMD Instinct MI355X    | 0 %      43 °C  0    233/1400 W        |
| 2    2       5       SPX/NPS1       | 0 %      N/A    283/294896 MB          |
|-------------------------------------+----------------------------------------|
| 0000:75:00.0 AMD Instinct MI355X    | 0 %      43 °C  0    225/1400 W        |
| 3    0       4       SPX/NPS1       | 0 %      N/A    283/294896 MB          |
|-------------------------------------+----------------------------------------|
| 0000:85:00.0 AMD Instinct MI355X    | 0 %      40 °C  0    226/1400 W        |
| 4    5       2       SPX/NPS1       | 0 %      N/A    283/294896 MB          |
|-------------------------------------+----------------------------------------|
| 0000:95:00.0 AMD Instinct MI355X    | 0 %      41 °C  0    226/1400 W        |
| 5    7       3       SPX/NPS1       | 0 %      N/A    283/294896 MB          |
|-------------------------------------+----------------------------------------|

We ran pip install modular, launched a serving endpoint with MAX, and everything worked out of the box—you can do the same by following our quickstart guide.
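
If you follow the quickstart, a minimal smoke test against the endpoint might look like the sketch below. It assumes MAX is serving an OpenAI-compatible API on localhost port 8000 and that the model name matches whatever the server loaded; adjust both for your setup.

Python
# Minimal smoke test: send one chat request to a locally running MAX endpoint.
# The base URL, port, and model name are assumptions; use the values from your
# own serving command (see the quickstart guide).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server loaded
    messages=[{"role": "user", "content": "Say hello from an MI355."}],
    max_tokens=32,
)
print(response.choices[0].message.content)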

We weren't leveraging MI355's new hardware features yet, but we could profile and identify execution bottlenecks. We also mapped our execution against an internal performance estimation tool (stay tuned for an upcoming blog post about this).

With the hardware and initial validation in hand, we could now form a plan.

Week 1: Finding the Levers

We benchmarked the MI355 GPUs against B200 GPUs, running models such as Llama, Gemma, and Mistral. We analyzed the results against theoretical upper bounds to identify optimization opportunities—we've developed tools to automate much of this analysis. The tools pointed to matmul optimization as the path to significant performance gains and SOTA results.
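
The "theoretical upper bound" analysis is essentially a roofline estimate. Here is a simplified sketch: the peak FLOP/s and bandwidth values are placeholders rather than MI355 datasheet numbers, and the traffic model assumes each matrix is read or written exactly once.

Python
# Back-of-the-envelope roofline estimate. The peak FLOP/s and bandwidth below
# are placeholders (assumptions), not MI355 datasheet numbers, and the traffic
# model assumes each matrix is read or written exactly once.
def matmul_upper_bound(m, n, k, bytes_per_elem, peak_flops, peak_bw):
    flops = 2 * m * n * k                               # multiply-accumulates
    traffic = (m * k + k * n + m * n) * bytes_per_elem  # read A and B, write C
    t_compute = flops / peak_flops
    t_memory = traffic / peak_bw
    return flops / max(t_compute, t_memory)             # best-case FLOP/s

PEAK_FLOPS = 2.5e15  # placeholder peak for the chosen precision
PEAK_BW = 8.0e12     # placeholder HBM bandwidth in bytes/s

bound = matmul_upper_bound(8192, 8192, 8192, 2, PEAK_FLOPS, PEAK_BW)
print(f"upper bound for 8192^3 BF16 matmul: {bound / 1e12:.0f} TFLOP/s")

Shapes whose measured throughput sits well below this kind of bound are the ones worth optimizing first, which is how matmul rose to the top of the list.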

The matmul implementation — only about 500 lines of well-commented code — was easy to adapt to run efficiently on MI355 hardware. We also wanted portability across AMD hardware—the same code running at SOTA on MI300, MI325, and MI355—because even with tight deadlines, we still value good software design.

After working on the matmul code for a few hours, with the problem size fixed at M=N=K=8192, we showed that minor parameterization changes could deliver performance close to AMD’s hipBLASLt library (which we measured to be SOTA and faster than hipBLAS).

| Matmul kernel | GFLOP/s |
| --- | --- |
| Mojo (day 0) | 1202302.79 |
| hipBLASLt | 1561446.68 |
| Mojo (day 1) | 1610514.29 |
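
For context on how to read those numbers: GFLOP/s for a square matmul is 2·M·N·K divided by the kernel time, and the relative speedup is just the ratio of the two rates.

Python
# How to read the table: GFLOP/s = 2*M*N*K / (time_in_seconds * 1e9), and the
# relative speedup is the ratio of the two rates.
M = N = K = 8192
flops = 2 * M * N * K

mojo_day1 = 1_610_514.29  # GFLOP/s, from the table above
hipblaslt = 1_561_446.68

time_ms = flops / (mojo_day1 * 1e9) * 1e3
print(f"Mojo day-1 time for the 8192^3 matmul: {time_ms:.3f} ms")
print(f"Speedup over hipBLASLt: {mojo_day1 / hipblaslt:.3f}x")  # ~1.03, i.e. ~3% faster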

With a few more tweaks the same day, the Mojo matmul kernel was now 3% faster than SOTA. This day-one experiment proved our goal was achievable.

The rest of our week 1 effort included:

  • Generalizing our implementation across the different shapes present in the models
  • Refining our heuristics for optimal kernel parameter selection (see the sketch after this list)
  • Setting up automation for benchmarking
  • Configuring our remote login system to make compute resources easy to access and share
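
Here is the kind of toy heuristic we mean by shape-driven parameter selection. The candidate tiles and the cost model are made up for illustration and are not Modular's actual selection logic: pick, for each GEMM shape a model actually uses, the candidate tile that wastes the least work on padding, breaking ties toward larger tiles for higher arithmetic intensity.

Python
import math

# A toy shape-driven heuristic (not Modular's actual selection logic): among a
# few candidate tiles, pick the one that pads the output matrix the least,
# breaking ties toward larger tiles for higher arithmetic intensity.
CANDIDATE_TILES = [(128, 128), (256, 128), (256, 256)]  # (tile_m, tile_n), illustrative

def padded_work(m: int, n: int, tile_m: int, tile_n: int) -> int:
    return math.ceil(m / tile_m) * tile_m * math.ceil(n / tile_n) * tile_n

def pick_tile(m: int, n: int) -> tuple[int, int]:
    return min(CANDIDATE_TILES, key=lambda t: (padded_work(m, n, *t), -t[0] * t[1]))

# Example GEMM shapes: large square shapes prefer big tiles, skinny ones do not.
for shape in [(8192, 8192), (300, 8192), (1, 8192)]:
    print(shape, "->", pick_tile(*shape))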

Of course, the first week wasn’t without some issues. We encountered delays due to hardware misconfiguration and missing GPU operators in the Kubernetes integration. Despite the setbacks, we had strong performance numbers by the end of the week and we enjoyed the weekend.

Week 2: From Optimization to Demo

We had one week to go—the demo presentation was the following Tuesday.

We continued optimizing the matmul kernel and began exploring other opportunities, such as optimizing the Attention kernel’s performance. By the middle of week 2, we had strong numbers to share with our partners.

Demo preparation was also in full swing. The effort required the IT team to make sure we could present live benchmark results, the design team to make our presentation professional, and the product team to help craft the message.

Meanwhile, our engineering team continued to optimize our stack for the new hardware, using the automation we enabled in week 1 to feed live numbers to the other teams.

By Friday, MAX nightlies were running smoothly on MI355, and our benchmarks were consistently outperforming AMD’s custom fork of vLLM. With these results, we were ready to demo on Tuesday!

💡 Fun fact: The MI355 bring-up wasn’t just fast; it was done by only two engineers working normal hours, with no late nights and zero crunch. Turns out, great architecture scales both performance and sanity.

Results: Outperforming the Field

By the end of two weeks, MAX outperformed AMD’s optimized vLLM fork by up to 2.2× across multiple workloads — all while maintaining full portability across MI300, MI325, and MI355. These gains came partly from the kernels, but also from the entire stack working together to deliver both performance and portability.

Even more remarkably, the entire effort involved just two engineers working on MI355 bringup for two weeks, one of whom had a pre-planned vacation, so in reality it was closer to 1.5 engineers for two weeks. In total, the effort resulted in 20 small PRs and zero late nights. This is a textbook case of software architecture enabling velocity.

The performance results speak for themselves.

At the AMD Media Tech Day, MAX was the only inference solution to demonstrate clear TCO advantages when compared to NVIDIA’s flagship Blackwell architecture.

Are We Done? Not Even Close.

This two-week sprint was just the beginning. Since that demo, we’ve expanded MI355 support, added early Apple silicon support, and achieved state-of-the-art performance on both NVIDIA Blackwell and AMD MI355X. Check out our recent 25.6 release.

Our mission at Modular remains the same: To make AI hardware enablement fast, portable, and universal — no matter whose silicon it runs on.

Thank you to TensorWave for partnering with us as well as providing the MI355X systems and to AMD for inviting us to their Media Tech Day event.

Stay tuned for more updates — and for upcoming deep-dives on the kernel optimizations that made this milestone possible.



Tracy Sharpe

Mojo Kernel Engineer

Tracy is a software engineer focusing on low-level optimizations. He worked for over 25 years at Microsoft on systems projects, where he developed the operating system kernels for three generations of the Xbox game console. He then built the MLAS library for ONNX Runtime, which provides high-performance matrix multiplication and convolution and is used throughout Microsoft's offerings. He is currently exploring the intersection of hardware and compiler technology.

Anand Pratap Singh

Engineer

Abdul Dakkak

AI Compiler Engineer

Expert in machine learning, compilers, programming languages, and accelerated computing. Before Modular, Abdul led the development of AI compilers for GPUs at Microsoft Research and the Mathematica Compiler at Wolfram Research. Abdul has developed open-source tools that accelerate real-world applications and optimize their performance across the hardware and software stack.