Modular: Structured Mojo Kernels Part 4 - Portability and the Road Ahead

Feature	NVIDIA B200	AMD MI355X
Thread group size	32 (warp)	64 (wave)
Register allocation	Dynamic	Static (shared equally)
Dedicated tensor memory	Yes (TMEM, 256KB)	No (registers only)
Hardware barriers	`mbarrier` (arrive/wait, byte counting)	None (use atomics or `s_barrier`)
Async memory engine	TMA (descriptor-based, hardware im2col)	`load_to_lds` only
Shared memory per SM/CU	228KB	160KB
Matrix instruction	`tcgen05.mma`	`mfma`

Component	Common baseline	NVIDIA specialization	AMD specialization
TileIO	Software-managed loads	TMA (async, descriptor-based)	Cooperative `load_to_lds`
TileOp	Generic MMA wrapper	`tcgen05.mma` (Blackwell) / `wgmma` (Hopper)	`mfma`
TilePipeline	Software barriers	Hardware `mbarrier`	Atomic counters
Scheduling	Software tile distribution	CLC (hardware scheduling)	Software fallback

Metric	CUTLASS	Mojo	Change
Conv2d-specific code	~870 lines	~130 lines	-85%
Block-scaled addition	~1,500 lines	~200 lines	-87%
Performance	Baseline	Equal	Zero-cost
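
The Change column follows directly from the line counts. As a quick check of that arithmetic, here is a small Python sketch using the approximate counts from the table; the helper name `percent_change` is only for illustration and is not part of any Modular tooling.

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Approximate line counts taken from the table above.
conv2d = percent_change(870, 130)
block_scaled = percent_change(1500, 200)

print(f"Conv2d-specific code: {conv2d:.0f}%")
print(f"Block-scaled addition: {block_scaled:.0f}%")
```

Because the source counts are approximate, the percentages are rounded to whole numbers.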

AMD MI355X has a fundamentally different execution model: statically allocated registers, no TMEM, and no `mbarrier`. The ping-pong pattern addresses the static register allocation constraint directly, while the tile-based decomposition and layout algebra carry over unchanged.

Portability means progressive specialization, not source compatibility. The component boundaries make it possible to swap platform-specific implementations without rewriting kernel logic.
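
To make the idea of swapping components behind a fixed boundary concrete, here is a minimal sketch in plain Python (not Mojo, and not Modular's actual API). The names `TileIO` and `TileOp` come from the component table; `SoftwareTileIO`, `GenericTileOp`, `CountingTileIO`, and `Kernel` are hypothetical and exist only for this illustration. No real hardware feature is modeled: `CountingTileIO` simply stands in for a platform specialization that honors the same contract with different internals.

```python
from dataclasses import dataclass
from typing import Protocol

Matrix = list[list[float]]


class TileIO(Protocol):
    def load(self, m: Matrix, row: int, col: int, size: int) -> Matrix: ...


class TileOp(Protocol):
    def mma(self, acc: Matrix, a: Matrix, b: Matrix) -> None: ...


class SoftwareTileIO:
    """Common baseline: plain software-managed tile loads."""

    def load(self, m: Matrix, row: int, col: int, size: int) -> Matrix:
        # Slices clamp at the matrix edge, so partial tiles are handled.
        return [r[col:col + size] for r in m[row:row + size]]


class GenericTileOp:
    """Common baseline: multiply-accumulate of one tile pair into `acc`."""

    def mma(self, acc: Matrix, a: Matrix, b: Matrix) -> None:
        for i in range(len(a)):
            for j in range(len(b[0])):
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))


class CountingTileIO(SoftwareTileIO):
    """Hypothetical stand-in for a specialization: same contract, extra internals."""

    def __init__(self) -> None:
        self.loads = 0

    def load(self, m: Matrix, row: int, col: int, size: int) -> Matrix:
        self.loads += 1
        return super().load(m, row, col, size)


@dataclass
class Kernel:
    """Kernel logic written once against the component interfaces."""

    io: TileIO
    op: TileOp
    tile: int

    def matmul(self, a: Matrix, b: Matrix) -> Matrix:
        n, k, p = len(a), len(b), len(b[0])
        c = [[0.0] * p for _ in range(n)]
        for i in range(0, n, self.tile):
            for j in range(0, p, self.tile):
                rows, cols = min(self.tile, n - i), min(self.tile, p - j)
                acc = [[0.0] * cols for _ in range(rows)]
                for kk in range(0, k, self.tile):
                    a_tile = self.io.load(a, i, kk, self.tile)
                    b_tile = self.io.load(b, kk, j, self.tile)
                    self.op.mma(acc, a_tile, b_tile)
                # Write the finished output tile back into C.
                for di, row in enumerate(acc):
                    c[i + di][j:j + cols] = row
        return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]

baseline = Kernel(SoftwareTileIO(), GenericTileOp(), tile=2)
counting_io = CountingTileIO()
specialized = Kernel(counting_io, GenericTileOp(), tile=2)

result_baseline = baseline.matmul(a, b)
result_specialized = specialized.matmul(a, b)
print(result_baseline == result_specialized)
```

The `Kernel.matmul` loop never changes between the two configurations; only the object behind the `TileIO` boundary does. In Mojo, the same selection would presumably happen at compile time through metaprogramming rather than through runtime objects, which this Python sketch does not attempt to show.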

Composition beats specialization because of Mojo's metaprogramming. CUTLASS uses C++ template specialization: 500K+ lines, NVIDIA-only. Structured kernels use composition: small components, clean interfaces, platform-agnostic kernel logic. That difference follows from the language itself.

The numbers hold up at scale: a 48% code reduction, conv2d in about 130 lines, block-scaled matmul in about 200 lines, and no performance cost. The reductions compound as you build more kernels on the architecture.

This is a foundation. Portable warp-specialized kernels, attention fusion, and automated scheduling are next. The principles stay the same, but the scope of this work will continue to expand.

Structured Mojo Kernels Part 4 - Portability and the Road Ahead

AMD MI355X: a genuinely different machine

Why warp specialization does not map to AMD

The ping-pong pipeline pattern

What is shared, what adapts

The portable warp-specialized kernel

How Mojo enables kernel composition

Series conclusion

The numbers

Three principles

What comes next

Get started

Summary

