Modular: Structured Mojo Kernels Part 3

Matmul kernel warp roles (7 warps):
Warps	Role	Responsibility
0–3 (128 threads)	Epilogue	TMEM → registers → SMEM → global memory via TMA
4 (32 threads)	Scheduler	CLC-based work distribution
5 (32 threads)	Load	TMA loads for A and B tiles
6 (32 threads)	MMA	Tensor core operations, TMEM accumulation

Conv2d kernel warp roles with fused residual add (8 warps):
Warps	Role	Responsibility
0–3	Epilogue	TMEM → registers, add residual, store
4	Scheduler	CLC-based work distribution
5	MainLoad	TMA im2col loads for activation and filter
6	MMA	Tensor core operations
7	EpilogueLoad	TMA loads for residual tensor C
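The two role tables above can be read as a lookup from warp index to role. The sketch below is a plain Python model of that lookup, not Modular's API: the dictionary names and the `role_of` helper are hypothetical, and only the role names and warp indices come from the tables.

```python
from typing import Dict

WARP_SIZE = 32  # threads per warp, as in the thread counts of the first table

# Matmul kernel: warps 0-6 (first table).
MATMUL_ROLES: Dict[int, str] = {
    0: "Epilogue", 1: "Epilogue", 2: "Epilogue", 3: "Epilogue",
    4: "Scheduler",
    5: "Load",
    6: "MMA",
}

# Conv2d kernel with fused residual add (second table): warp 5 becomes the
# im2col MainLoad role and warp 7 is added for the residual load.
CONV2D_ROLES: Dict[int, str] = {**MATMUL_ROLES, 5: "MainLoad", 7: "EpilogueLoad"}


def role_of(thread_idx: int, roles: Dict[int, str]) -> str:
    """Return the role of the warp that owns this thread (hypothetical helper)."""
    return roles[thread_idx // WARP_SIZE]


# Thread 130 sits in warp 4 (the scheduler) in both kernels.
print(role_of(130, MATMUL_ROLES), role_of(130, CONV2D_ROLES))
```

In this model the conv2d table is built from the matmul table rather than copied, which mirrors the point that the extra warp is an addition and not a fork.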

Layer	Conv only	Conv + Residual	Overhead
16×16, 128→128	0.037 ms	0.037 ms	~0%
32×32, 256→256	0.068 ms	0.066 ms	~0%
64×64, 256→128	0.068 ms	0.066 ms	~0%

Component	Standard	Block-Scaled	Blockwise FP8	Shared?
InputTilePipeline	✓	✓	✓	Yes
InputProducer/Consumer	✓	✓	✓	Yes
ProducerConsumerPipeline	✓	✓	✓	Yes
MmaWarpContext	✓	✓	✓	Yes
EpilogueWarpContext	✓	✓	✓	Yes
TileScheduler	✓	✓	✓	Yes
TilePayload	Standard	BlockScaled	BlockwiseFP8	No
Epilogue	Standard	Scale-aware	Per-K scales	No

Benchmark	Mean Delta	Notes
Llama 8B Decode	-0.2%	Performance parity
Llama 8B Prefill	-0.1%	Performance parity
Llama 405B TP8	+0.2%	Slightly faster

Conv2d composes by swapping TileIO. Replace the contiguous tile loader with an im2col-aware variant; reuse the entire matmul pipeline, compute, epilogue, and scheduler. The result is ~130 lines of conv-specific code, compared with CUTLASS's separate 870-line kernel.
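To make the loader swap concrete, here is a minimal pure-Python sketch of the idea, not the Mojo implementation. The function names (`matmul`, `contiguous_loader`, `im2col_loader`) are hypothetical, and the sketch assumes a single H x W x C image, stride 1, and no padding. The shared `matmul` only ever asks a loader for element A[m, k]; convolution falls out of changing how that index is resolved.

```python
from typing import Callable, List

Matrix = List[List[float]]
Loader = Callable[[int, int], float]


def matmul(load_a: Loader, b: Matrix, m_dim: int, k_dim: int) -> Matrix:
    """Shared compute path: it only knows how to request A[m, k] from a loader."""
    n_dim = len(b[0])
    return [
        [sum(load_a(m, k) * b[k][n] for k in range(k_dim)) for n in range(n_dim)]
        for m in range(m_dim)
    ]


def contiguous_loader(a: Matrix) -> Loader:
    """Plain matmul: A is stored as an ordinary M x K matrix."""
    return lambda m, k: a[m][k]


def im2col_loader(x: List[List[List[float]]], r_dim: int, s_dim: int) -> Loader:
    """Conv2d: resolve A[m, k] on the fly from an H x W x C activation.

    m indexes the output pixel (p, q); k indexes the filter tap (r, s, c).
    """
    c_dim = len(x[0][0])
    q_dim = len(x[0]) - s_dim + 1  # output width for stride 1, no padding

    def load(m: int, k: int) -> float:
        p, q = divmod(m, q_dim)
        rs, c = divmod(k, c_dim)
        r, s = divmod(rs, s_dim)
        return x[p + r][q + s][c]

    return load


# 3x3 single-channel image, 2x2 filter, one output channel.
x = [[[1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0]], [[7.0], [8.0], [9.0]]]
w = [[1.0], [0.0], [0.0], [1.0]]  # filter flattened to K x N with K = R*S*C = 4
out = matmul(im2col_loader(x, 2, 2), w, m_dim=4, k_dim=4)
print(out)  # [[6.0], [8.0], [12.0], [14.0]]
```

Nothing in `matmul` changed between the two cases, which is the property the paragraph above attributes to the TileIO swap.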

Block-scaled matmul composes by parameterizing TilePipeline. The TilePayload trait separates synchronization from tile storage. Three payload implementations share one pipeline with zero changes to barrier management.
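The separation of synchronization from tile storage can be sketched as a pipeline class that is generic over its payload type. This is an illustrative Python model under stated assumptions, not the Mojo trait: the class names, the `full` flags standing in for hardware barriers, and the `produce`/`consume` methods are all hypothetical, and only two of the three payloads are shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, TypeVar


@dataclass
class StandardPayload:
    """Tile storage for a plain matmul stage: A and B tiles only."""
    a: List[float] = field(default_factory=list)
    b: List[float] = field(default_factory=list)


@dataclass
class BlockScaledPayload(StandardPayload):
    """Same tiles plus per-block scale factors (field names are assumptions)."""
    a_scales: List[float] = field(default_factory=list)
    b_scales: List[float] = field(default_factory=list)


P = TypeVar("P")


class TilePipeline(Generic[P]):
    """Ring of stages with full/empty flags; it never inspects payload contents."""

    def __init__(self, stages: List[P]) -> None:
        self.stages = stages
        self.full = [False] * len(stages)  # stand-in for barrier state
        self.head = 0  # next stage the producer fills
        self.tail = 0  # next stage the consumer drains

    def produce(self, fill: Callable[[P], None]) -> bool:
        i = self.head
        if self.full[i]:
            return False  # stage not yet released by the consumer
        fill(self.stages[i])
        self.full[i] = True
        self.head = (i + 1) % len(self.stages)
        return True

    def consume(self, use: Callable[[P], None]) -> bool:
        i = self.tail
        if not self.full[i]:
            return False  # nothing produced yet
        use(self.stages[i])
        self.full[i] = False
        self.tail = (i + 1) % len(self.stages)
        return True


def fill_scaled(p: BlockScaledPayload) -> None:
    p.a, p.b = [1.0], [2.0]
    p.a_scales, p.b_scales = [0.5], [4.0]


# The same pipeline class drives both payload types.
std = TilePipeline([StandardPayload(), StandardPayload()])
scaled = TilePipeline([BlockScaledPayload(), BlockScaledPayload()])

seen: List[float] = []
scaled.produce(fill_scaled)
scaled.consume(lambda p: seen.append(p.a[0] * p.a_scales[0] * p.b[0] * p.b_scales[0]))
```

Adding a payload here means adding a dataclass and the functions that fill and use it; the stage bookkeeping in `TilePipeline` stays untouched.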

The 8th warp extends without forking. Conv2d's fused residual add introduces one warp role without touching the existing seven. The measured overhead is ~0% because the residual load hides behind compute.
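A toy timing model shows why a load that overlaps with compute can cost nothing. The sketch below is an assumption-laden illustration, not a measurement: the function names are hypothetical, the per-tile times are made-up numbers, and the cost of the add itself is ignored.

```python
def tile_time(mma_ms: float, residual_load_ms: float, dedicated_load_warp: bool) -> float:
    """Per-tile time in a toy model.

    With a dedicated load warp the residual load runs concurrently with the
    MMA work, so the tile takes as long as the slower of the two. Without it,
    the load is serialized after compute.
    """
    if dedicated_load_warp:
        return max(mma_ms, residual_load_ms)
    return mma_ms + residual_load_ms


def overhead_pct(mma_ms: float, residual_load_ms: float, dedicated_load_warp: bool) -> float:
    """Overhead of the fused residual relative to compute alone, in percent."""
    fused = tile_time(mma_ms, residual_load_ms, dedicated_load_warp)
    return 100.0 * (fused - mma_ms) / mma_ms


# Illustrative numbers only: compute takes 1.0 ms, the residual load 0.25 ms.
overlapped = overhead_pct(1.0, 0.25, dedicated_load_warp=True)
serialized = overhead_pct(1.0, 0.25, dedicated_load_warp=False)
print(overlapped, serialized)  # 0.0 25.0
```

In this model the overhead stays at zero for as long as the residual load is shorter than the compute it overlaps with.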

The abstractions are zero-cost. SASS output is identical between structured and legacy kernels. Llama benchmarks confirm performance parity.

Changes stay localized. New kernels compose from existing components. Fixes propagate automatically. Each new kernel variant costs a predictable number of incremental lines, not a full reimplementation.

Structured Mojo Kernels Part 3 - Composition in Practice

The Blackwell execution model

Context-managed warp lifecycle

Composition axis 1: Conv2d via TileIO swap

What changes

What stays the same

The 8th warp: fused residual add

What others do instead

Composition axis 2: Block-scaled matmul via TilePayload

The TilePayload trait

Three payloads, one pipeline

The generic pipeline

What you change vs what you reuse

Zero-cost abstractions

What these two examples show

TL;DR

