
February 27, 2026

Structured Mojo Kernels Part 1 - Why Structured Kernels?

Fabio Riccardi

Modular Kernel Team

Engineering

The Elephant in the Data Center

The GPU programming community is facing a difficult reality: writing high-performance GPU code is unreasonably hard, and it's getting worse.

The bar for peak TFLOPS keeps getting higher, and kernel authors are expected to achieve rapid model bringup on ever-changing hardware. This is the code that powers LLM inference, trains foundation models, and runs production AI infrastructure. Everyone is betting on this code to lower TCO and customer latency. In this environment, high-performance kernels are an essential competitive edge.

Most of this code is brittle by necessity. A typical production kernel spans 3,000-5,000 lines of tightly-coupled logic where a single misplaced barrier will silently corrupt your results. Good luck figuring out which one at 2am when your inference server is on fire.

This complexity is a direct consequence of how GPU hardware has evolved. It's created a problem: only a handful of engineers on the planet can write this code well.

The Accessibility Problem

GPUs are everywhere. They're in every cloud, every data center, and (let's not forget) in every laptop and phone. The demand for GPU-accelerated software has never been higher. Yet the pool of people who can write efficient GPU code hasn't grown to match.

Why? Complexity acts as a gatekeeper. You need to understand:

  • Warp-level programming and thread divergence
  • Memory hierarchies (registers, shared memory, L2, HBM)
  • Asynchronous execution and software pipelining
  • Platform-specific intrinsics (TMA, wgmma, mfma, buffer loads...)
  • Barrier synchronization in multi-stage pipelines


Writing performant GPU code typically requires years of specialized experience. We recently wrote about this challenge in our four-part series on Blackwell performance.

The DSL Tradeoff

This is where DSLs like Triton entered the picture, and credit where it's due: Triton makes GPU programming more accessible. Write something that looks like Python, get reasonable GPU performance. For many use cases, that's sufficient. However, there's a catch: for inference workloads, "reasonable" isn't good enough.

When you're serving millions of requests per day, the difference between 85% and 100% of peak performance is the difference between 6 GPUs and 7 GPUs. Multiply that across your fleet, and you're talking significant money. You can't close this performance gap without dropping below the abstraction layer entirely. At that point you've lost the productivity benefit that justified using the DSL in the first place. High-level abstractions and full hardware control are in direct tension, and inference workloads tend to force the issue.
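The fleet math is simple to sketch. Here is a back-of-the-envelope calculation in Python; the demand and throughput figures are illustrative assumptions, not numbers from any real deployment:

```python
import math

def gpus_needed(demand_tflops: float, peak_tflops: float, efficiency: float) -> int:
    """GPUs required to serve a fixed demand when each GPU delivers
    only a fraction (efficiency) of its peak throughput."""
    return math.ceil(demand_tflops / (peak_tflops * efficiency))

# Hypothetical fleet: demand equivalent to ~5.9 GPUs running flat out.
demand, peak = 5900.0, 1000.0
print(gpus_needed(demand, peak, 1.00))  # 6 GPUs at 100% of peak
print(gpus_needed(demand, peak, 0.85))  # 7 GPUs at 85% of peak
```

One extra GPU per six, multiplied across a fleet, is where the "significant money" comes from.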

(For a deeper dive on this tradeoff, see our DAIC post on Python eDSLs.)

Our Requirements

What we actually need, and what the industry has been chasing for decades, is the trinity of:

  1. Performance: Hit peak hardware utilization
  2. Productivity: Write code that humans can understand and maintain
  3. Portability: Run efficiently across different GPU architectures

Pick any two, they said. You can't have all three.

At Modular, we respectfully disagree. And this blog series is going to show you why.

In our Matrix Multiplication on Blackwell series, we showed how to squeeze every last TFLOP out of NVIDIA's latest hardware. That series was about understanding the hardware at the lowest level and writing code that matches it exactly.

This series asks a different question: How do we write GPU kernels that are both fast AND maintainable?

The answer is Structured Mojo Kernels: a software architecture that brings the separation of concerns we take for granted in every other domain of software to the world of GPU programming, through modular components, clear interfaces, and code you can actually read six months later.

Here’s the kicker: none of this costs anything at runtime. Mojo's compile-time metaprogramming means the abstractions exist only in the source code. The GPU assembly the compiler produces is identical to what you'd get from a hand-written, monolithic kernel. You get the maintainability of structured software and the performance of code written close to the metal without the headaches.

💡
Code: All kernels mentioned in this series are available in the Modular repository.

The Growing Complexity Problem

Modern GPU kernels have become extraordinarily complex. A production matmul kernel targeting NVIDIA's Blackwell or AMD's MI300X typically spans 3,000-5,000 lines of highly interdependent code. This complexity results in a number of issues:

The Monolithic Kernel Problem

Here’s what this looks like in practice. This is from our 3,721-line SM100 matmul kernel — the production code that powered our Blackwell inference before the structured refactoring. Just the shared memory and pipeline setup requires extracting 11 fields from the SMEM struct, constructing 3 separate LayoutTensorIter objects with full type annotations, creating 3 ProducerConsumerPipeline instances, and manually extracting raw pointers for barriers and TMEM:

mojo
# Original kernel: shared memory and pipeline setup (trimmed from ~60 lines)

ref smem_storage = external_memory[...].bitcast[SmemType]()[]

# Manually extract 11 named fields from the SMEM struct
ref a_smem_storage = smem_storage.a_smem
ref b_smem_storage = smem_storage.b_smem
ref c_smem_storage = smem_storage.c_smem
ref tma_mma_mbars_storage = smem_storage.tma_mma_mbars
ref accum_mbars_storage = smem_storage.accum_mbars
ref clc_mbars_full_storage = smem_storage.clc_mbars_full
ref clc_mbars_empty_storage = smem_storage.clc_mbars_empty
ref clc_response_storage = smem_storage.clc_response
ref clc_throttle_storage = smem_storage.clc_throttle_mbars
ref tmem_addr_storage = smem_storage.tmem_addr
ref tmem_dealloc_mbar_storage = smem_storage.tmem_dealloc_mbar

# Construct typed iterators with full layout + address space annotations
var a_smem = LayoutTensorIter[a_type, a_smem_layout, MutAnyOrigin,
    address_space = AddressSpace.SHARED, alignment=128,
](a_smem_storage.unsafe_ptr(), SmemType.a_smem_size)
var b_smem = LayoutTensorIter[b_type, b_smem_layout, MutAnyOrigin,
    address_space = AddressSpace.SHARED, alignment=128,
](b_smem_storage.unsafe_ptr(), SmemType.b_smem_size)
var c_smem_iter = LayoutTensorIter[c_type, Layout.row_major(...), MutAnyOrigin,
    address_space = AddressSpace.SHARED, alignment=128,
](c_smem_storage.unsafe_ptr(), SmemType.c_smem_size)

# Create 3 separate pipeline objects from raw barrier pointers
var load_mma_pipeline = ProducerConsumerPipeline[...](
    tma_mma_mbars_storage.unsafe_ptr())
var mma_output_pipeline = ProducerConsumerPipeline[...](
    accum_mbars_storage.unsafe_ptr())
var load_clc_pipeline = ProducerConsumerPipeline[...](
    clc_throttle_storage.unsafe_ptr())

# Extract raw pointers for TMEM address, CLC responses, barriers
var ptr_tmem_addr = tmem_addr_storage.unsafe_ptr()
clc_response = clc_response_storage.unsafe_ptr()
clc_full_mbar = clc_mbars_full_storage.unsafe_ptr()
clc_empty_mbar = clc_mbars_empty_storage.unsafe_ptr()
tmem_dealloc_mbar = tmem_dealloc_mbar_storage.unsafe_ptr()

This is before the kernel even reaches its barrier initialization or main loop. Every field extraction, every pipeline constructor, every raw pointer — they’re all loose variables in the same flat scope, repeated across every kernel variant.

In the structured kernel, the same setup is:

mojo
# Structured kernel: shared memory and pipeline setup

ref smem = external_memory[
    Scalar[DType.uint8], address_space = AddressSpace.SHARED, alignment=128,
]().bitcast[Self.SmemType]()[]

# Pipeline bundles tile storage with synchronization
var tile_payload = Self.TilePayload(smem.a_tiles(), smem.b_tiles())
var input_pipeline = Self.InputTilePipeline(
    smem.pipelines.input_barriers(), tile_payload)

# Kernel context encapsulates election vars, CTA coords, and masks
var ctx = Self.Context(smem.pipelines.tmem_addr())

Six lines. We’ll explain these components in detail shortly, but even at a glance the difference is clear: SmemType encapsulates all tile storage, barrier arrays, and CLC state behind typed accessors. InputTilePipeline bundles the barriers with tile data into a single object that manages the producer-consumer lifecycle. And Context replaces the scattered election variables and CTA coordinate calculations.

When everything is interleaved in a monolithic kernel, making changes becomes treacherous:

  • Adding a feature (like block-scaled quantization) requires touching dozens of locations across hundreds of lines
  • Debugging a race condition means understanding the entire file’s control flow
  • Porting to new hardware often means a complete rewrite

From Hardware Scheduling to Software Orchestration

Why did kernels become so complex? The answer lies in how GPU hardware has evolved.

GPU Programming Model Evolution

The Old Model: Let Hardware Handle It

In the SIMT era, GPU programming was relatively simple:

mojo
# Classic SIMT pattern (conceptual)
fn simple_kernel(data: Tensor):
    idx = thread_id()
    result = compute(data[idx])
    output[idx] = result
    # Hardware scheduler handles everything else

The hardware scheduler would:

  • Swap warps in/out as memory operations completed
  • Hide latency through massive parallelism
  • Manage the memory hierarchy transparently

The New Model: Software Must Orchestrate

Modern GPUs have shifted to coarse-grained, long-latency operations:

Era     | Operation Style | Latency Hiding
--------|-----------------|--------------------
SIMT    | Many small ops  | Hardware scheduling
Modern  | Few large ops   | Software pipelining

Today's tensor cores issue matrix operations spanning hundreds of cycles. DMA engines (TMA on NVIDIA, buffer loads on AMD) transfer entire tiles asynchronously. The programmer must now explicitly:

  1. Pipeline stages: Overlap load, compute, and store
  2. Manage barriers: Coordinate producer-consumer relationships
  3. Specialize threads: Assign different warps different roles
  4. Control memory explicitly: Decide what lives where, and when

In this persistent kernel paradigm, kernels launch once and manage their entire execution lifecycle:

Persistent Kernel Pipeline

    The pipeline overlap is critical: while Tile 1 is being computed, Tile 2 is being loaded and Tile 0 is being stored. This overlap is what achieves peak performance, yet it also creates the complexity problem.
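The overlap can be written down as a schedule: at step t the kernel loads tile t, computes tile t-1, and stores tile t-2. Here is a minimal Python model of that three-stage schedule (a conceptual sketch, not the actual kernel logic):

```python
def pipeline_schedule(num_tiles: int):
    """Yield (load, compute, store) tile indices for each pipeline step.
    None means that stage is idle (pipeline fill or drain)."""
    for step in range(num_tiles + 2):
        load    = step     if step < num_tiles else None
        compute = step - 1 if 0 <= step - 1 < num_tiles else None
        store   = step - 2 if 0 <= step - 2 < num_tiles else None
        yield load, compute, store

for step, (l, c, s) in enumerate(pipeline_schedule(4)):
    print(f"step {step}: load={l} compute={c} store={s}")
```

At step 2 the schedule is exactly the situation described above: tile 2 loading, tile 1 computing, tile 0 storing. The two extra steps at the end are the drain phase that the manual kernels must also handle explicitly.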


    The Solution: Structured Kernel Architecture

    Structured Mojo Kernels address this complexity through separation of concerns. Instead of one monolithic kernel, we organize functionality into three core components, which share a unified configuration layer and a unified data management layer.

Structured Kernel Architecture

    Each component has a single responsibility and well-defined interfaces:

Component    | Responsibility                                                       | Encapsulates
-------------|----------------------------------------------------------------------|---------------------------------------
TileIO       | Moves data between memory levels; acts as a producer to TilePipeline | TMA/DMA, layout transforms, swizzling
TilePipeline | Coordinates pipeline stages and manages shared memory                | Barriers, producer-consumer sync
TileOp       | Executes compute operations; acts as a consumer to TilePipeline      | MMA instructions, register management

    Separation of concerns keeps the components simpler:

    • TileIO doesn't know about compute operations.
    • TilePipeline doesn't know about memory layouts.
    • TileOp doesn't know about global memory.
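To make the separation concrete, here is a deliberately tiny Python analogue of the three components wired together. The class names come from the table above; everything else (the list-based "memory", the doubling "compute") is invented for illustration:

```python
from collections import deque

class TilePipeline:
    """Coordinates staging; knows about buffering, not layouts or math."""
    def __init__(self, stages: int):
        self.buf = deque(maxlen=stages)
    def push(self, tile): self.buf.append(tile)    # producer side
    def pop(self): return self.buf.popleft()       # consumer side

class TileIO:
    """Moves data in; knows about the data source, not compute."""
    def __init__(self, source): self.source = source
    def load(self, i): return self.source[i]

class TileOp:
    """Executes compute; knows about math, not where tiles came from."""
    def apply(self, tile): return [x * 2 for x in tile]

io, pipe, op = TileIO([[1, 2], [3, 4]]), TilePipeline(stages=2), TileOp()
for i in range(2):
    pipe.push(io.load(i))                            # TileIO produces
results = [op.apply(pipe.pop()) for _ in range(2)]   # TileOp consumes
print(results)  # [[2, 4], [6, 8]]
```

Note what each class does not import or touch: swapping the data source changes only TileIO, and swapping the operation changes only TileOp, which is the whole point of the separation.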

    The Power of Separation

With the old monolithic kernels, adding block-scaled quantization (a feature requiring 4 TMA loads instead of 2) meant adding a separate 1,500-line kernel: 1,500 lines that were mostly copy-pasted from existing kernels, but with changes scattered throughout the code.

    With structured components, those changes were almost all localized in a single place: the “payload” structure representing the data passed between the components.

    mojo
    # Before: Scattered changes across 3000+ lines
    
    # After: Localized changes
    @register_passable("trivial")
    struct BlockScaledTilePayload[...](TilePayload):
        var a_tiles: Pointer[...]
        var b_tiles: Pointer[...]
        var sfa_tiles: Pointer[...]  # NEW: Scale factor A
        var sfb_tiles: Pointer[...]  # NEW: Scale factor B
    

    The pipeline coordination (TilePipeline), compute logic (TileOp), and data movement patterns (TileIO) didn’t need to change because they're properly decoupled.
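The decoupling pattern carries over to any language with structured payload types. A Python sketch of the same idea, assuming hypothetical field names modeled on the Mojo struct above: pipeline code written against the base payload keeps working when a variant adds fields.

```python
from dataclasses import dataclass

@dataclass
class TilePayload:
    a_tiles: list
    b_tiles: list

@dataclass
class BlockScaledTilePayload(TilePayload):
    sfa_tiles: list  # NEW: scale factors for A
    sfb_tiles: list  # NEW: scale factors for B

def run_pipeline(payload: TilePayload) -> int:
    """Pipeline logic depends only on the base interface, so it runs
    unchanged when the payload type gains new fields."""
    return len(payload.a_tiles)

print(run_pipeline(TilePayload([1], [2])))                       # 1
print(run_pipeline(BlockScaledTilePayload([1], [2], [3], [4])))  # 1
```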

    Pipeline Lifecycle Management

    The encapsulation payoff extends to the MMA and epilogue warps. In the original kernel, the MMA warp manually manages the output pipeline — calling producer_stage(), wait_consumer(), producer_step(), and handling TMEM allocation, lock release, and deallocation as separate operations:

    mojo
    # Original: MMA warp with manual pipeline management
    if WarpRole.is_mma():
        tcgen05_alloc[config.cta_group](ptr_tmem_addr, max_tmem_cols)
        syncwarp()
        named_barrier_arrive[MMA_THREADS + EPILOGUE_THREADS](1)
        tmem_addr = ptr_tmem_addr[0]
    
        while work_info.is_valid():
            next_work_info = scheduler.fetch_next_work(work_info, clc_pipe_consumer_state)
            clc_pipe_consumer_state.step()
    
            if elect_one_cta:
                var mma_output_mma_stage = mma_output_pipeline.producer_stage()
                mma_output_pipeline.wait_consumer()
                var tmem_offset = tmem_addr + (mma_output_mma_stage * stage_stride_cols)
    
                for i in range(num_iters // config.k_group_size):
                    consumer_main_loop[
                        block_tile_shape = config.block_tile_shape,
                        mma_shape = config.mma_shape,
                        cta_group = config.cta_group,
                        cluster_shape = config.cluster_shape,
                        k_group_size = config.k_group_size,
                    ](tmem_offset, a_smem, b_smem, load_mma_pipeline,
                      mma_op, elect_one_warp, i * config.k_group_size, 0)
                    load_mma_pipeline.consumer_step()
    
                # ... mma_arrive / mma_arrive_multicast based on cta_group ...
                mma_output_pipeline.producer_step()
            work_info = next_work_info
    
        tcgen05_release_allocation_lock[config.cta_group]()
        tmem_dealloc_mbar[].wait()
        tcgen05_dealloc[config.cta_group](tmem_addr, max_tmem_cols)

Every pipeline transition is explicit: producer_stage() → wait_consumer() → work → producer_step(). TMEM allocation at the top, lock release and deallocation at the bottom. Miss any step and you get a silent hang or corruption.

    In the structured kernel, context managers handle the entire lifecycle:

    mojo
    # Structured: MMA warp with context-managed pipeline
    if WarpRole.is_mma():
        var tmem = Self.Tmem.allocate(smem.pipelines.tmem_addr())
        var mma_ctx = Self.MmaCtx(tmem, Self.OutputPipeline(...), Self.TmemDealloc(...))
    
        with mma_ctx:  # handles TMEM alloc → lock release → dealloc on exit
            while work_iter.has_work():
                with work_iter.wait_and_advance():
                    if ctx.elect_one_cta:
                        with mma_ctx.output_pipeline.producer() as output_stage:
                            with input_pipeline.consumer() as consumer:
                                for i in range(0, num_iters, Self.config.k_group_size):
                                    with consumer.acquire() as input_tiles:
                                        Self.mma(output_stage.tmem, input_tiles,
                                            mma_op, ctx.elect_one_warp, UInt32(i), 0)

    The with mma_ctx: block handles the full TMEM lifecycle (allocate, sync with epilogue, deallocate). The nested with blocks handle output pipeline producer staging, input pipeline consumer stepping, and per-tile barrier acquire/release. No manual step() calls, no explicit deallocation sequence.
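The guarantee these blocks provide is the classic context-manager contract: the release half of the protocol runs no matter how the body exits. A minimal Python illustration of the same idea (the names producer_stage, wait_consumer, and producer_step mirror the Mojo code above but the implementation is invented):

```python
from contextlib import contextmanager

log = []

@contextmanager
def producer_stage(stage_id: int):
    """Pair acquire/release automatically: the release step runs
    even if the body raises or returns early."""
    log.append(f"wait_consumer({stage_id})")      # acquire: wait for a free slot
    try:
        yield stage_id
    finally:
        log.append(f"producer_step({stage_id})")  # release: advance the pipeline

with producer_stage(0) as s:
    log.append(f"load_tiles({s})")

print(log)  # ['wait_consumer(0)', 'load_tiles(0)', 'producer_step(0)']
```

Because the release lives in a finally block, "forgot the step() call" ceases to be an expressible bug, which is exactly the class of silent hang the manual version is prone to.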


    In Practice: The TMA Load Warp

    We’ve seen how structured kernels simplify setup and MMA pipeline management. Now let’s look at the TMA load warp — the warp responsible for moving tiles from global memory into shared memory.

    Both versions use the same warp-specialized architecture — dedicated warps for TMA loading, MMA compute, scheduling, and epilogue. The difference is how pipeline synchronization is managed within each warp.

    Original: Manual Pipeline Management

    In the original kernel, the TMA load warp manually manages pipeline state — calling wait_consumer() before loading, producer_step() after, and implementing its own drain loop to prevent early CTA exit:

    mojo
    # Original: TMA load warp with manual pipeline stepping
    if WarpRole.is_main_load():
        while work_info.is_valid():
            # CLC throttle: manual wait/arrive/step
            if is_first_cta_in_cluster and required_clc_query:
                load_clc_pipeline.wait_consumer()
                var load_clc_producer_state = load_clc_pipeline.producer_stage()
                _ = load_clc_pipeline.producer_mbar(load_clc_producer_state)[0].arrive()
                load_clc_pipeline.producer_step()
    
            for i in range(num_iters // config.k_group_size):
                load_AB[
                    block_tile_shape = config.block_tile_shape,
                    mma_shape = config.mma_shape,
                    cta_group = config.cta_group,
                    k_group_size = config.k_group_size,
                ](a_tma_op, b_tma_op, a_smem, b_smem, load_mma_pipeline,
                  peer_cta_coord, (UInt(work_info.m), UInt(work_info.n)),
                  a_multicast_mask, b_multicast_mask,
                  i * config.k_group_size, elect_one_cta)
                load_mma_pipeline.producer_step()
    
            syncwarp()
            var next_work_info = scheduler.fetch_next_work(
                work_info, clc_pipe_consumer_state)
            work_info = next_work_info
            clc_pipe_consumer_state.step()
    
        # Manual drain: prevent CTA exit while peer is still consuming
        for i in range(config.num_pipeline_stages // config.k_group_size):
            load_mma_pipeline.wait_consumer()
            load_mma_pipeline.producer_step()

    The pipeline protocol is spread across the loop: producer_step() after each tile load, additional wait_consumer() calls inside the load_AB helper (not shown), manual CLC throttle signaling, and an explicit drain loop at the end. Every step() call must be paired correctly or the kernel hangs.

    Structured: Context-Managed Pipelines

    The structured version uses context managers to handle pipeline transitions automatically. The with blocks guarantee correct acquire/release ordering at compile time:

    mojo
    # Structured: TMA load warp with context-managed pipeline
    if WarpRole.is_main_load():
        with input_pipeline.producer() as producer:
            while work_iter.has_work():
                with work_iter.next() as current:
                    work_iter.throttle_signal(ctx.is_first_cta_in_cluster)
    
                    for i in range(0, num_iters, Self.config.k_group_size):
                        with producer.acquire() as tiles:
                            Self.load_input_tiles(
                                a_tma_op, b_tma_op, tiles,
                                ctx.peer_cta_coord,
                                (UInt(current.m), UInt(current.n), UInt(current.k_start)),
                                ctx.a_multicast_mask, ctx.b_multicast_mask,
                                UInt32(i), ctx.elect_one_cta)
    
                    syncwarp()
    
            producer.drain()  # automatic drain before CTA exit

    The with producer.acquire() as tiles: block replaces the manual wait_consumer() → load → producer_step() sequence. The with work_iter.next() as current: block replaces the manual scheduler.fetch_next_work() and clc_pipe_consumer_state.step(). And producer.drain() replaces the hand-written drain loop.

    The structured version:

• Encapsulates the pipeline protocol in context managers: acquire() / release() replace manual wait_consumer() / producer_step() pairs
• Bundles work iteration with scheduling: work_iter.next() replaces scattered scheduler.fetch_next_work() + step() calls
• Guarantees correctness: the with blocks make it impossible to forget a step() call or misorder operations
• Enables code reuse: load_input_tiles, mma, and TileWriterType are separate methods shared across kernel variants

    Performance: Zero-Cost Abstractions

    At this point the reasonable question is: what does all this structure cost at runtime? Nothing. The abstractions exist only in the source code.

    This is a direct consequence of how Mojo handles compile-time execution. Four properties work together to ensure the structured components leave no runtime trace:

    1. @register_passable("trivial") guarantees core types live entirely in registers (no heap allocation, no pointer indirection).
    2. @always_inline ensures every function call is inlined at compile time, leaving no call overhead in the generated code.
    3. Compile-time parameters mean the GPU never evaluates configuration logic. It receives code that was specialized for exactly this use case at compile time.
4. RAII patterns make resource management explicit in the source code without adding runtime cost. The compiler resolves ownership and lifetimes statically, generating no additional instructions in the output.
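Property 3 (compile-time parameters) has a loose run-time analogue in any language: resolve configuration once when the kernel is constructed so the hot path carries no config branches. A Python sketch of that specialization idea, with entirely hypothetical config names; in Mojo the same resolution happens at compile time, leaving literally zero residue:

```python
def make_kernel(k_group_size: int, use_scales: bool):
    """Resolve configuration up front; the returned function contains
    no per-call config branches (a run-time stand-in for Mojo's
    compile-time parameter specialization)."""
    if use_scales:
        def kernel(tiles, scales):
            # specialized variant: scale factors baked into the loop
            return [t * s * k_group_size for t, s in zip(tiles, scales)]
    else:
        def kernel(tiles, scales=None):
            # specialized variant: no scale-factor handling at all
            return [t * k_group_size for t in tiles]
    return kernel

plain = make_kernel(k_group_size=2, use_scales=False)
scaled = make_kernel(k_group_size=2, use_scales=True)
print(plain([1, 2, 3]))          # [2, 4, 6]
print(scaled([1, 2], [10, 10]))  # [20, 40]
```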

    Mojo is currently the only language with all these features working together for GPU programming. C++ has templates and RAII, but no @register_passable for guaranteeing register placement. Rust has zero-cost abstractions, but GPU support is still maturing. CUDA has GPU support, but limited compile-time metaprogramming. Python has great ergonomics, but no compile-time execution.

    Mojo combines full compile-time metaprogramming, guaranteed resource management, and first-class GPU support in a single language, enabling these abstractions to deliver at zero runtime cost.

    The production numbers bear this out. Here's actual data from our SM100 kernel:

Metric      | Legacy sm100 | Structured sm100 | Change
------------|--------------|------------------|-------
Total Lines | 14,683       | 7,634            | -48%
Main Kernel | 3,721        | 1,843            | -50%
Performance | ~1770 TFLOPS | ~1770 TFLOPS     | Equal

    We cut the code nearly in half while maintaining identical performance. The abstractions exist only in your text editor. The generated GPU assembly is indistinguishable from hand-written code.

When you do need to go lower, when a specific kernel demands something the structured components don't express, Mojo lets you. Nothing in this architecture prevents you from writing directly to the hardware. The difference is that you'll rarely need to, and when you do, you'll be modifying a codebase that's half the size and considerably easier to reason about.

    What's Next

    This post has laid out the problem and the architecture at a high level. The next three posts get into the specifics:

    • Part 2: The Three Pillars: TileIO, TilePipeline, and TileOp explained with real code, covering how each component works and why the interfaces are designed the way they are.
    • Part 3: Platform Implementations: How the same structured patterns map onto NVIDIA Blackwell and AMD MI300X, and what changes when the underlying hardware does.
    • Part 4: Composition and Unification: Extending the architecture for block-scaled matmul and other demanding kernels.

    The goal of this series isn't just to show you how to write fast GPU code. It's to show you how to write GPU code that stays fast as requirements evolve, stays understandable as your team grows, and stays portable as the hardware landscape shifts.

    Because the next GPU architecture is coming. And the one after that. And the code you write today should still be maintainable when it arrives.


    TL;DR

    1. GPU programming has an accessibility problem. A decade of architectural changes has progressively shifted responsibility for managing execution onto the programmer. That burden gatekeeps hardware that should be available to far more developers.
    2. DSLs narrow the gap but don't close it. Triton and similar tools trade peak performance for productivity. This may be good enough for training, but it doesn’t keep up with the demands of today’s inference workloads.
    3. Performance, productivity, and portability are not mutually exclusive. They require the right abstractions and the right language primitives, but they are achievable together.
    4. Structured kernels are the path. Separate concerns, compose components, and let the compiler do the rest.
5. Zero-cost abstractions are real. Mojo is uniquely positioned with the compile-time metaprogramming and resource management features needed to make high-level abstractions disappear at runtime. No other language currently offers this combination for GPU programming.

    By implementing Structured Mojo Kernels, we cut our kernel codebase nearly in half while maintaining identical performance. The abstractions aren't slowing anything down: they're enabling us to maintain and extend our kernel library faster than ever.

