
March 11, 2026

Structured Mojo Kernels Part 2 - The Three Pillars

Fabio Riccardi

Modular Kernel Team

Engineering

In Part 1, we introduced the structured kernel architecture and made the case for why it exists. Conventional GPU kernels (where pipeline coordination, data movement, and compute interleave across thousands of lines) are difficult to maintain and difficult to extend. Adding block-scaled quantization to a conventional kernel means writing a new kernel. Adding support for a new hardware feature means touching code that has nothing to do with that feature. Structured Mojo Kernels address this by separating concerns into three components with defined interfaces, and we showed the result: 48% less code, identical performance, and a conv2d kernel that reused the entire matmul infrastructure with only ~130 lines of conv-specific code.

This post examines how these components work and why their boundaries matter. TileIO, TilePipeline, and TileOp each own one part of the kernel's job. These are patterns, not library APIs. The concrete implementations live in the open-source kernels library, but the patterns are what make the architecture work. That distinction matters because GPU kernels don't stay static. Structured Mojo Kernels enable durable patterns that work across NVIDIA and AMD hardware generations without rewriting the kernel structure.

💡
Code: All kernels mentioned in this series are available in the Modular repository.

The Data Flow Model

A structured matmul kernel moves data through three distinct memory levels: global memory (DRAM), shared memory (SMEM/LDS), and registers or tensor memory. Each of the three components owns one leg of this journey. TileIO moves tiles from DRAM to shared memory. TilePipeline signals when tiles are ready to consume or free to refill. TileOp reads from shared memory, runs MMA instructions, and writes accumulators. No component reaches into another's domain.

Data Flow Model Diagram

Each component owns a specific part of this flow

  • TileIO: Moves data between global and shared memory
  • TilePipeline: Coordinates when tiles are ready or consumed
  • TileOp: Executes the actual computation

TileIO: The Data Movement Specialist

TileIO handles all data movement between memory hierarchies. Its responsibility is to hide the complexity of platform-specific DMA operations behind a clean interface.

What TileIO Encapsulates

| Concern | NVIDIA | AMD | Encapsulated By |
|---|---|---|---|
| DMA operation | TMA async_copy | load_to_lds | load() |
| Address calculation | TMA descriptors | Buffer resources | Constructor |
| Layout transformation | TMA swizzle modes | Manual swizzle | Configuration |
| Masking/bounds | TMA handles | Manual checks | load() |
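The "layout transformation" row deserves a concrete picture. Swizzling permutes where each tile element lands in shared memory so that the threads of a warp hit distinct memory banks. As a rough illustration (a toy Python sketch, not the kernels-library implementation), the classic XOR swizzle looks like this:

```python
def xor_swizzle(row: int, col: int, bits: int = 3) -> int:
    """Permute the column index by XOR-ing in low row bits.

    Elements that share a column across consecutive rows land in
    different shared-memory banks, avoiding bank conflicts when a
    warp reads a column of the tile.
    """
    return col ^ (row & ((1 << bits) - 1))

# Within each row the mapping is a permutation of the columns,
# so swizzled elements never collide in shared memory.
for row in range(8):
    cols = [xor_swizzle(row, c) for c in range(8)]
    assert sorted(cols) == list(range(8))
```

Because XOR with a constant is a bijection, each row's columns are merely permuted, never collided. TMA applies an equivalent permutation in hardware via its swizzle modes, while the AMD loader computes it in software.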

NVIDIA: TileLoaderTMA

On NVIDIA SM100, TileIO wraps TMA (Tensor Memory Accelerator) operations. The loader stores a pointer to the TMA descriptor and a multicast mask for cluster distribution:

mojo
struct TileLoaderTMA[
    tma_origin: ImmutOrigin,
    dtype: DType,
    gmem_layout: Layout,
    desc_layout: Layout,
    /,
    *,
    cta_group: Int,
](TrivialRegisterPassable):
    """TMA-based tile loader for SM100."""

    comptime TmaOp = TMATensorTile[Self.dtype, Self.gmem_layout, Self.desc_layout]

    # TMA descriptor pointer (referencing grid constant)
    var tma_op: Pointer[Self.TmaOp, Self.tma_origin]
    # Multicast mask for cluster distribution
    var multicast_mask: UInt16

    @always_inline
    fn load[tile_layout: Layout, /, alignment: Int = 128](
        self,
        dest: SMemTile[Self.dtype, tile_layout, alignment=alignment],
        ref[AddressSpace.SHARED] barrier: SharedMemBarrier,
        k_coord: UInt,
        row_coord: UInt,
    ):
        """Load a tile asynchronously via TMA hardware."""
        self.tma_op[].async_multicast_load[Self.cta_group](
            dest, barrier, (k_coord, row_coord), self.multicast_mask
        )

The loader is deliberately minimal; it wraps one TMA descriptor and one multicast mask to expose a data movement primitive. The kernel manages all orchestration (k-group iteration, barrier management, expected byte counts), and interacts with the loader through the stage readiness/consumption protocol exposed by TilePipeline, which we'll discuss in "TilePipeline: The Pipeline Coordinator".

AMD: TileLoaderLDS

On AMD, TileIO uses buffer resources for cooperative global-to-LDS loading. The architecture requires a fundamentally different loading strategy:

mojo
struct TileLoaderLDS[
    dtype: DType,
    src_layout: Layout,
    src_tile_layout: Layout,
    num_loading_warps: Int,
    swizzle: Optional[Swizzle] = Optional[Swizzle](),
    load_width: Int = simd_width_of[dtype](),
    use_full_tile_width: Bool = False,
](TrivialRegisterPassable):
    """Cooperative global→LDS tile loader with swizzle support.

    Loading Modes (controlled by use_full_tile_width):
    - False: Interleaved layout for BF16 where MMA_K < BK
    - True: Row-major layout for FP8 where MMA_K == BK
    """

    var buffer: AMDBufferResource
    var thread_row: Int
    var thread_col: Int
    var warp_id: Int

    # Each warp loads a portion of the tile cooperatively
    # AMDBufferResource provides automatic out-of-bounds clamping to zero

Despite the completely different mechanisms (one hardware DMA, one software-cooperative), the call site for both looks the same. A kernel that adds block-scaled quantization needs to change TileLoaderTMA to handle two descriptors instead of one; it does not touch the pipeline or the compute. That's the boundary doing its job.


TilePipeline: The Pipeline Coordinator

TilePipeline manages the producer-consumer pipeline that overlaps loading, computing, and storing. It does not move data and does not run MMA instructions. It signals when tiles are full or empty.

The Pipeline Model

Pipeline Model Diagram

NVIDIA: InputTilePipeline

On NVIDIA SM100, TilePipeline uses hardware mbarrier primitives. mbarrier provides hardware-accelerated arrive/wait, automatic byte counting that integrates with TMA, and phase tracking across pipeline stages. Parameterizing the pipeline by payload type means the same synchronization logic works for both standard and block-scaled tile configurations; only the payload differs:

mojo
struct InputTilePipeline[
    Payload: TilePayload,
    num_group_stages: Int,
    k_group_size: Int,
](TrivialRegisterPassable):
    """Tile pipeline with configurable payload type.

    Separates synchronization from tile storage. The Payload parameter
    (e.g., StandardTilePayload or BlockScaledTilePayload) holds tile arrays.
    """

    comptime Pipeline = ProducerConsumerPipeline[Self.num_group_stages]

    var pipeline: Self.Pipeline
    var payload: Self.Payload

    @always_inline
    fn __init__(out self, barriers: Self.BarrierArray, payload: Self.Payload):
        self.pipeline = Self.Pipeline(barriers.ptr)
        self.payload = payload

    fn producer(ref [origin]self) -> InputProducer[...]:
        """Get producer role handle."""
        return InputProducer(pipeline_ptr=Pointer(to=self))

    fn consumer(ref [origin]self) -> InputConsumer[...]:
        """Get consumer role handle."""
        return InputConsumer(pipeline_ptr=Pointer(to=self))

The InputProducer and InputConsumer role handles provide acquire() methods that return context managers. The with producer.acquire() as tiles: pattern automatically handles barrier waits on entry and stage advancement on exit.
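The semantics of that protocol can be modeled in a few lines of plain Python (a toy sketch with illustrative names, using a semaphore per direction in place of mbarrier phase tracking):

```python
from contextlib import contextmanager
from threading import Semaphore

class StagedPipeline:
    """Toy model of a multi-stage producer-consumer tile pipeline."""

    def __init__(self, num_stages: int):
        self.num_stages = num_stages
        self.stages = [None] * num_stages   # tile storage (the "payload")
        self.empty = Semaphore(num_stages)  # stages free to refill
        self.full = Semaphore(0)            # stages ready to consume
        self.head = 0                       # producer's next stage
        self.tail = 0                       # consumer's next stage

    @contextmanager
    def producer_acquire(self):
        self.empty.acquire()                # wait for a free stage
        yield self.head
        self.head = (self.head + 1) % self.num_stages
        self.full.release()                 # signal: tile is ready

    @contextmanager
    def consumer_acquire(self):
        self.full.acquire()                 # wait for a ready tile
        yield self.tail
        self.tail = (self.tail + 1) % self.num_stages
        self.empty.release()                # signal: stage may be refilled
```

A producer writes pipeline.stages[stage] inside its with block; a consumer reads inside its own. The bounded ring is what lets num_stages loads stay in flight while earlier tiles are still being consumed.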

AMD: Atomic Counter Synchronization

AMD lacks hardware mbarrier. The consequence is that AMD kernels cannot offload barrier byte-counting to hardware; instead, all synchronization is explicit. The design uses a SyncStrategy trait to make the ring buffer logic independent of the synchronization mechanism:

mojo
trait SyncStrategy(TrivialRegisterPassable):
    """Interface for producer-consumer synchronization protocols."""

    fn wait_producer_acquire(self, tile_idx: Int, stage: Int, phase: Int32): ...
    fn signal_producer_release(mut self, tile_idx: Int, stage: Int): ...
    fn wait_consumer_acquire(self, tile_idx: Int, stage: Int, phase: Int32): ...
    fn signal_consumer_release(mut self, tile_idx: Int, stage: Int): ...

Two concrete implementations trade off simplicity for reduced contention:

mojo
struct SingleCounterSync[...](SyncStrategy):
    """One atomic counter per tile. Simpler but higher contention."""
    var sync_counter: SMemArray[Int32, total_tiles]

struct SplitCounterSync[...](SyncStrategy):
    """Separate producer/consumer counters. Reduces contention."""
    var producer_counters: SMemArray[Int32, total_tiles]
    var consumer_counters: SMemArray[Int32, total_tiles]

The underlying synchronization primitives are worth examining directly:

mojo
@always_inline
fn wait_for_counter(
    counter: UnsafePointer[Int32, address_space = AddressSpace.SHARED],
    threshold: Int32,
):
    """Spin-wait until counter reaches threshold."""
    while Atomic.load(counter) < threshold:
        inlined_assembly["s_sleep 0", ...]()

@always_inline
fn increment_counter_if_first_thread(
    counter: UnsafePointer[Int32, address_space = AddressSpace.SHARED],
    increment: Int32,
):
    """Atomically increment counter, but only from first thread in warp."""
    if thread_idx.x % WARP_SIZE == 0:
        _ = Atomic.fetch_add(counter, increment)

Two non-obvious decisions here. First, only the first thread in each warp increments the counter. All threads in a warp execute in lockstep, so having all 64 threads issue fetch_add would produce 64 increments where one is correct. Second, s_sleep 0 yields to the warp scheduler rather than burning compute cycles in a tight spin. On AMD, issuing a zero-cycle sleep hint lets other wavefronts make progress while this one waits.
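The resulting protocol is easiest to see as a counting argument: each stage's counter counts completed fills and drains together, so at phase p the producer may refill once the counter has reached 2p (the stage has been drained p times) and the consumer may read once it has reached 2p + 1. A toy single-threaded Python model, with assertions standing in for the spin-waits:

```python
class SingleCounterStage:
    """Toy model of one-atomic-counter-per-stage synchronization.

    The counter counts completed fills and drains. At phase p:
      producer may fill  once counter >= 2*p      (drained p times)
      consumer may drain once counter >= 2*p + 1  (filled p+1 times)
    On real hardware the waits are spin loops on an LDS atomic;
    here they are plain assertions in a single-threaded trace.
    """

    def __init__(self):
        self.counter = 0

    def producer_fill(self, phase: int):
        assert self.counter >= 2 * phase       # stand-in for wait_for_counter
        self.counter += 1                      # stand-in for atomic fetch_add

    def consumer_drain(self, phase: int):
        assert self.counter >= 2 * phase + 1
        self.counter += 1
```

Running fill/drain in alternation advances the phase indefinitely without ever resetting the counter, which is why a single monotonically increasing atomic suffices.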

Despite the completely different primitives, the NVIDIA and AMD TilePipeline implementations expose the same acquire/release interface. The ring buffer logic above it is platform-agnostic.


TileOp: The Compute Specialist

TileOp encapsulates all matrix computation: loading operands from shared memory to registers (or TMEM), executing MMA/mfma instructions, and managing accumulator storage.

NVIDIA SM100: TMEM-Based Accumulator

Blackwell introduces Tensor Memory (TMEM), a 256KB per-SM scratchpad dedicated to accumulator storage. This matters because moving accumulators out of the register file frees VGPRs for other uses and enables larger tile sizes. TileOp manages the full TMEM lifecycle through a RAII context manager:

mojo
struct MmaWarpContext[
    opc: OutputPipelineConfig,
    mma_threads: Int,
    epilogue_threads: Int,
](TrivialRegisterPassable):
    """MMA warp context - owns TMEM lifecycle and output pipeline.

    __enter__: Signals epilogue that TMEM is allocated
    __exit__: Waits for epilogue, deallocates TMEM
    """

    comptime Tmem = TmemAllocation[Self.opc.cta_group]
    comptime Pipeline = OutputTilePipeline[Self.opc]
    comptime Dealloc = TmemDeallocBarrier[Self.opc.cta_group]

    var tmem: Self.Tmem
    var output_pipeline: Self.Pipeline
    var dealloc_barrier: Self.Dealloc

    @always_inline
    fn __enter__(self) -> Self:
        Self.Sync.arrive()  # Signal epilogue that TMEM is ready
        return self

    @always_inline
    fn __exit__(self):
        self.dealloc_barrier.complete_dealloc(self.tmem)

The context manager matters beyond tidiness. TMEM must be deallocated only after the epilogue warp has finished reading accumulators. If the MMA warp deallocates early, the epilogue reads garbage. If it never deallocates, the next kernel invocation fails to allocate. Expressing this as __enter__/__exit__ makes the invariant structural: the compiler enforces that deallocation happens at scope exit, and there is no code path that bypasses it.
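The ordering guarantee can be illustrated with a plain Python context manager (hypothetical event names, recording a trace instead of touching real barriers):

```python
from contextlib import contextmanager

events = []

@contextmanager
def tmem_lifetime():
    """Toy model of MmaWarpContext's TMEM lifecycle invariant."""
    events.append("tmem_allocated")
    events.append("signal_epilogue")          # __enter__: epilogue may read
    try:
        yield
    finally:
        events.append("wait_epilogue_done")   # __exit__: epilogue finished
        events.append("tmem_deallocated")     # runs on every exit path

with tmem_lifetime():
    events.append("mma_compute")
```

Whatever happens inside the with body, including early exits, the trace always ends with the wait followed by the deallocation: the invariant is enforced by scope structure, not by programmer discipline.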

The MMA operation itself receives tiles from the pipeline and executes tcgen05.mma instructions, writing results to TMEM accumulators.

AMD MI300X: Register-Based Accumulator

On AMD, there is no dedicated tensor memory. Accumulators live in VGPRs alongside operands, and the compiler manages their allocation:

mojo
struct AmdTileOperator[
    InType: DType,
    OutType: DType,
    mma_shape: IndexList[3],
    ...
](TrivialRegisterPassable):
    """Matrix operations on AMD using mfma instructions."""

    var _a_reg_tile: Self.ARegTileType
    var _b_reg_tile: Self.BRegTileType
    var out_reg_tile: Self.OutRegTileType  # Accumulator in registers

    fn load_tile_fragment[k_tile_idx: Int](
        self,
        smem_tile_a: TileTensor,
        smem_tile_b: TileTensor,
    ):
        """Load operands from LDS to registers with swizzle handling."""
        @parameter
        for i in range(num_fragments):
            self._a_reg_tile[i] = smem_tile_a.load[swizzle](i, k_tile_idx)
            self._b_reg_tile[i] = smem_tile_b.load[swizzle](i, k_tile_idx)

    fn mma_compute[k_tile_idx: Int](self):
        """Execute mfma instruction."""
        self.out_reg_tile = mfma[mma_shape](
            self._a_reg_tile, self._b_reg_tile, self.out_reg_tile,
        )

The Key Difference

| Aspect | NVIDIA SM100 | AMD MI300X |
|---|---|---|
| Accumulator storage | Tensor Memory (256KB) | Registers (VGPRs) |
| Instruction | tcgen05.mma | mfma |
| Operand source | Direct from SMEM | Must load to registers first |
| Allocation | Explicit allocate/deallocate | Compiler-managed |

Both TileOp implementations present the same conceptual interface (load fragments, compute MMA, store results) but the resource lifecycle is structurally different. NVIDIA's RAII requirement is a direct consequence of explicit TMEM allocation. AMD needs no such wrapper because the compiler handles register lifetimes automatically.


TileTensor: The Unified Foundation

All three components share a common foundation: TileTensor. This abstraction carries compile-time layout information (shape, stride, swizzle patterns), address space tags (GMEM, SMEM, registers), and type-safe access patterns. It is what allows TileIO, TilePipeline, and TileOp to hand tiles to each other without any component knowing the other's internal representation.

Example: Thread-to-Element Mapping

mojo
# Map 64 threads to a 16x64 output tile
comptime thread_layout = Layout.row_major(8, 4)

# Each thread handles 2 consecutive elements
var my_elements = (
    output_tile
    .vectorize[1, 2]()                      # View as 2-element vectors
    .distribute[thread_layout](lane_id())   # My portion
)

# Write my portion
@parameter
for i in range(my_elements.shape[0]()):
    for j in range(my_elements.shape[1]()):
        my_elements[i, j] = accumulator[i * stride + j]
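One plausible way to read distribute (a Python sketch under the assumption that the 16x64 tile is viewed as a 16x32 grid of 2-element vectors, spread row-major across 32 lanes): the tile is cut into thread-layout-shaped blocks, and each lane takes its fixed position in every block.

```python
def distribute(tile_rows, tile_cols, tl_rows, tl_cols, lane):
    """Elements owned by `lane` under a row-major (tl_rows, tl_cols)
    thread layout: the lane takes its fixed position in every
    thread-layout-sized block of the tile."""
    r0, c0 = divmod(lane, tl_cols)
    return [(r, c)
            for r in range(r0, tile_rows, tl_rows)
            for c in range(c0, tile_cols, tl_cols)]

# The per-lane views partition the tile: every vector is owned
# by exactly one lane, with no gaps and no overlaps.
owned = sorted(e for lane in range(32) for e in distribute(16, 32, 8, 4, lane))
assert owned == [(r, c) for r in range(16) for c in range(32)]
```

The partition property is what makes the final write loop safe: no two threads ever write the same output element, without any synchronization in the epilogue.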

Putting It Together: A Complete Pipeline

Here's how the three components interact in the actual SM100 structured matmul kernel. Each warp role has one responsibility, and the pipeline structure is explicit in the code.

Setup: shared memory and pipeline initialization

mojo
# From the SM100 structured matmul kernel's run() method

# Setup: shared memory, pipeline, and context
ref smem = external_memory[...].bitcast[Self.SmemType]()[]
var tile_payload = Self.TilePayload(smem.a_tiles(), smem.b_tiles())
var input_pipeline = Self.InputTilePipeline(
    smem.pipelines.input_barriers(), tile_payload)
var ctx = Self.Context(smem.pipelines.tmem_addr())
var work_iter = scheduler.work_iterator()

TMA Load Warp: data movement only

The load warp acquires producer stages, issues TMA loads, and signals completion. It does not touch accumulators or barriers beyond what TilePipeline exposes.

mojo
# TMA Load Warp: data movement specialist
if WarpRole.is_main_load():
    with input_pipeline.producer() as producer:
        while work_iter.has_work():
            with work_iter.next() as current:
                for i in range(0, num_iters, Self.config.k_group_size):
                    with producer.acquire() as tiles:
                        Self.load_input_tiles(a_tma_op, b_tma_op, tiles, ...)
                syncwarp()
        producer.drain()

MMA Warp: compute only

The MMA warp waits on the input pipeline, computes, and signals the output pipeline. mma_ctx owns the TMEM allocation for the lifetime of this block: __enter__ signals the epilogue that accumulators are allocated, and __exit__ deallocates only after the epilogue confirms it has finished reading.

mojo
# MMA Warp: compute specialist
if WarpRole.is_mma():
    var mma_ctx = Self.MmaCtx.create(...)
    with mma_ctx:
        while work_iter.has_work():
            with work_iter.wait_and_advance():
                if ctx.elect_one_cta:
                    with mma_ctx.output_pipeline.producer() as output_stage:
                        with input_pipeline.consumer() as consumer:
                            for i in range(0, num_iters, Self.config.k_group_size):
                                with consumer.acquire() as input_tiles:
                                    Self.mma(output_stage.tmem, input_tiles, mma_op, ...)

Epilogue Warp: output only

The epilogue warp reads from TMEM through the output pipeline and writes to global memory. It signals the MMA warp when it is done, triggering TMEM deallocation.

mojo
# Epilogue Warp: output specialist
if WarpRole.is_epilogue():
    var epi_ctx = Self.EpilogueCtx(...)
    with epi_ctx:
        while work_iter.has_work():
            with work_iter.next() as current:
                with epi_ctx.output_pipeline.consumer() as output_stage:
                    tile_writer.write_batched(smem.c_tiles(), output_stage, ...)

The load warp never touches accumulators. The MMA warp never issues TMA loads. The epilogue warp never manages barriers. Each component enforces its boundary not through convention but through the types it exposes.
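The whole division of labor can be mimicked end to end with ordinary Python threads, one per role, and bounded queues standing in for the pipelines. This is a toy stand-in for intuition, not a model of the actual kernel:

```python
import queue
import threading

def tiled_matmul(A, B, num_stages=2):
    """Toy three-role pipeline: load -> mma -> epilogue over K-steps."""
    M, K, N = len(A), len(A[0]), len(B[0])
    tiles = queue.Queue(maxsize=num_stages)   # input pipeline (bounded ring)
    out = queue.Queue()                       # output pipeline
    C = [[0.0] * N for _ in range(M)]

    def load_role():                          # data movement only
        for k in range(K):
            col_a = [A[i][k] for i in range(M)]
            tiles.put((col_a, B[k]))
        tiles.put(None)                       # drain signal

    def mma_role():                           # compute only
        acc = [[0.0] * N for _ in range(M)]   # accumulator ("TMEM"/registers)
        while (t := tiles.get()) is not None:
            col_a, row_b = t
            for i in range(M):
                for j in range(N):
                    acc[i][j] += col_a[i] * row_b[j]
        out.put(acc)

    def epilogue_role():                      # output only
        acc = out.get()
        for i in range(M):
            C[i][:] = acc[i]

    roles = [threading.Thread(target=f)
             for f in (load_role, mma_role, epilogue_role)]
    for r in roles:
        r.start()
    for r in roles:
        r.join()
    return C
```

tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) reproduces the ordinary matrix product, and the bounded tiles queue means the load role can run at most num_stages K-steps ahead of the compute role: exactly the back-pressure TilePipeline provides.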


What’s Next

We've now seen each pillar from the inside: TileIO moves data, TilePipeline coordinates timing, TileOp runs the math. The same three abstractions appear on both NVIDIA and AMD, but the implementations themselves diverge significantly. In Part 3, we show what that divergence looks like end to end: NVIDIA SM100 with 7-warp specialization, tensor memory, and TMA pipelines alongside AMD MI300X with 8-wave ping-pong, register-based accumulation, and atomic synchronization. Same structural pattern, different everything underneath.


TL;DR

  • TileIO owns data movement: hides TMA vs. buffer load differences behind a single load() call
  • TilePipeline owns coordination: mbarrier on NVIDIA, atomic counters on AMD, same acquire/release interface
  • TileOp owns computation: TMEM on NVIDIA (with explicit RAII lifecycle), registers on AMD (compiler-managed)
  • TileTensor is the shared data abstraction: one type across all memory levels and address spaces
  • Component boundaries are enforced by types, not convention: the load warp cannot issue MMA instructions because TilePipeline does not expose that interface

