The Elephant in the Data Center
The GPU programming community is facing a difficult reality: writing high-performance GPU code is unreasonably hard, and it's getting worse.
The bar for peak TFLOPS keeps rising, and kernel authors are expected to achieve rapid model bringup on ever-changing hardware. This is the code that powers LLM inference, trains foundation models, and runs production AI infrastructure. Everyone is betting on this code to lower TCO and customer latency. In this environment, high-performance kernels are an essential competitive edge.
Most of this code is brittle by necessity. A typical production kernel spans 3,000-5,000 lines of tightly-coupled logic where a single misplaced barrier will silently corrupt your results. Good luck figuring out which one at 2am when your inference server is on fire.
This complexity is a direct consequence of how GPU hardware has evolved, and it has created a problem: only a handful of engineers on the planet can write this code well.
The Accessibility Problem
GPUs are everywhere. They're in every cloud, every data center, and (let's not forget) in every laptop and phone. The demand for GPU-accelerated software has never been higher. Yet the pool of people who can write efficient GPU code hasn't grown to match.
Why? Complexity acts as a gatekeeper. You need to understand:
- Warp-level programming and thread divergence
- Memory hierarchies (registers, shared memory, L2, HBM)
- Asynchronous execution and software pipelining
- Platform-specific intrinsics (TMA, wgmma, mfma, buffer loads...)
- Barrier synchronization in multi-stage pipelines
Writing performant GPU code typically requires years of specialized experience. We recently wrote about this challenge in our four-part series on Blackwell performance.
The DSL Tradeoff
This is where DSLs like Triton entered the picture, and credit where it's due: Triton makes GPU programming more accessible. Write something that looks like Python, get reasonable GPU performance. For many use cases, that's sufficient. However, there's a catch: for inference workloads, "reasonable" isn't good enough.
When you're serving millions of requests per day, the difference between 85% and 100% of peak performance is the difference between 6 GPUs and 7 GPUs. Multiply that across your fleet, and you're talking significant money. You can't close this performance gap without dropping below the abstraction layer entirely. At that point you've lost the productivity benefit that justified using the DSL in the first place. High-level abstractions and full hardware control are in direct tension, and inference workloads tend to force the issue.
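The fleet-sizing arithmetic is easy to make concrete. Here is a minimal Python sketch (the numbers are illustrative, not actual fleet data) showing how a 15-point efficiency gap turns a 6-GPU workload into a 7-GPU one:

```python
import math

def gpus_needed(required_tflops: float, peak_tflops: float, efficiency: float) -> int:
    """Smallest GPU count whose delivered throughput covers the requirement."""
    delivered_per_gpu = peak_tflops * efficiency
    return math.ceil(required_tflops / delivered_per_gpu)

# Illustrative workload: 5.9 peak-GPU-equivalents of sustained throughput.
full = gpus_needed(5.9, peak_tflops=1.0, efficiency=1.00)  # kernels at 100% of peak
good = gpus_needed(5.9, peak_tflops=1.0, efficiency=0.85)  # kernels at 85% of peak
print(full, good)  # 6 vs 7 GPUs for the same workload
```

Multiplied across thousands of replicas, that extra GPU per serving group is where the money goes.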
(For a deeper dive on this tradeoff, see our DAIC post on Python eDSLs.)
Our Requirements
What we actually need, and what the industry has been chasing for decades, is the trinity of:
- Performance: Hit peak hardware utilization
- Productivity: Write code that humans can understand and maintain
- Portability: Run efficiently across different GPU architectures
Pick any two, they said. You can't have all three.
At Modular, we respectfully disagree. And this blog series is going to show you why.

In our Matrix Multiplication on Blackwell series, we showed how to squeeze every last TFLOP out of NVIDIA's latest hardware. That series was about understanding the hardware at the lowest level and writing code that matches it exactly.
This series asks a different question: How do we write GPU kernels that are both fast AND maintainable?
The answer is Structured Mojo Kernels: a software architecture that brings the separation of concerns we take for granted in every other domain of software to the world of GPU programming, through modular components, clear interfaces, and code you can actually read six months later.
Here’s the kicker: none of this costs anything at runtime. Mojo's compile-time metaprogramming means the abstractions exist only in the source code. The GPU assembly the compiler produces is identical to what you'd get from a hand-written, monolithic kernel. You get the maintainability of structured software and the performance of code written close to the metal without the headaches.
The Growing Complexity Problem
Modern GPU kernels have become extraordinarily complex. A production matmul kernel targeting NVIDIA's Blackwell or AMD's MI300X typically spans 3,000-5,000 lines of highly interdependent code, and that complexity creates problems throughout the kernel's lifecycle.

Here’s what this looks like in practice. This is from our 3,721-line SM100 matmul kernel — the production code that powered our Blackwell inference before the structured refactoring. Just the shared memory and pipeline setup requires extracting 11 fields from the SMEM struct, constructing 3 separate LayoutTensorIter objects with full type annotations, creating 3 ProducerConsumerPipeline instances, and manually extracting raw pointers for barriers and TMEM:
This is before the kernel even reaches its barrier initialization or main loop. Every field extraction, every pipeline constructor, every raw pointer — they’re all loose variables in the same flat scope, repeated across every kernel variant.
In the structured kernel, the same setup is:
Six lines. We’ll explain these components in detail shortly, but even at a glance the difference is clear: SmemType encapsulates all tile storage, barrier arrays, and CLC state behind typed accessors. InputTilePipeline bundles the barriers with tile data into a single object that manages the producer-consumer lifecycle. And Context replaces the scattered election variables and CTA coordinate calculations.
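To give a feel for the shape of that change, here is an illustrative Python analogy (the real kernel is Mojo; `SmemType`, `InputTilePipeline`, and `Context` below are toy stand-ins named after the components in the text, not the actual implementations):

```python
from dataclasses import dataclass, field

# SmemType: all tile storage and barrier state behind typed accessors,
# instead of 11 loose field extractions in the kernel's flat scope.
@dataclass
class SmemType:
    a_tiles: list = field(default_factory=list)
    b_tiles: list = field(default_factory=list)
    barriers: list = field(default_factory=list)

# InputTilePipeline: bundles barriers with tile data and owns the
# producer-consumer lifecycle, instead of exposing raw pointers.
@dataclass
class InputTilePipeline:
    smem: SmemType
    num_stages: int = 3

# Context: replaces scattered election variables and CTA coordinates.
@dataclass
class Context:
    block_x: int
    block_y: int

def kernel_setup(block_x: int, block_y: int):
    # The entire flat-scope preamble collapses into a few constructor calls.
    smem = SmemType()
    pipeline = InputTilePipeline(smem)
    ctx = Context(block_x, block_y)
    return smem, pipeline, ctx

smem, pipeline, ctx = kernel_setup(0, 1)
```

The win is not fewer characters; it is that every kernel variant gets the same setup by construction instead of by copy-paste.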
When everything is interleaved in a monolithic kernel, making changes becomes treacherous:
- Adding a feature (like block-scaled quantization) requires touching dozens of locations across hundreds of lines
- Debugging a race condition means understanding the entire file’s control flow
- Porting to new hardware often means a complete rewrite
From Hardware Scheduling to Software Orchestration
Why did kernels become so complex? The answer lies in how GPU hardware has evolved.

The Old Model: Let Hardware Handle It
In the SIMT era, GPU programming was relatively simple:
The hardware scheduler would:
- Swap warps in/out as memory operations completed
- Hide latency through massive parallelism
- Manage the memory hierarchy transparently
The New Model: Software Must Orchestrate
Modern GPUs have shifted to coarse-grained, long-latency operations:
| Era | Operation Style | Latency Hiding |
|---|---|---|
| SIMT | Many small ops | Hardware scheduling |
| Modern | Few large ops | Software pipelining |
Today's tensor cores issue matrix operations spanning hundreds of cycles. DMA engines (TMA on NVIDIA, buffer loads on AMD) transfer entire tiles asynchronously. The programmer must now explicitly:
- Pipeline stages: overlap load, compute, and store
- Synchronize: coordinate producer and consumer warps with explicit barriers
- Schedule work: fetch and distribute tiles for the kernel's entire lifetime
In this persistent-kernel paradigm, kernels launch once and manage their entire execution lifecycle:

The pipeline overlap is critical: while Tile 1 is being computed, Tile 2 is being loaded and Tile 0 is being stored. This overlap is what achieves peak performance, yet it also creates the complexity problem.
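The overlap pattern can be sketched as a schedule. This Python toy (stage names are generic, not kernel APIs) computes which tile occupies each stage of a three-deep software pipeline at every step:

```python
def pipeline_schedule(num_tiles: int, stages=("load", "compute", "store")):
    """Which tile occupies each pipeline stage at every step."""
    depth = len(stages)
    timeline = []
    for step in range(num_tiles + depth - 1):
        active = {}
        for lag, stage in enumerate(stages):
            tile = step - lag  # each later stage works on an earlier tile
            if 0 <= tile < num_tiles:
                active[stage] = tile
        timeline.append(active)
    return timeline

# At steady state (step 2), three tiles are in flight at once:
# tile 2 loading, tile 1 computing, tile 0 storing.
print(pipeline_schedule(4)[2])  # {'load': 2, 'compute': 1, 'store': 0}
```

Keeping all stages busy at every steady-state step is the whole game; the complexity comes from the synchronization needed to make that safe.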
The Solution: Structured Kernel Architecture
Structured Mojo Kernels address this complexity through separation of concerns. Instead of one monolithic kernel, we organize functionality into three core components, which share a unified configuration layer and a unified data management layer.

Each component has a single responsibility and well-defined interfaces:
| Component | Responsibility | Encapsulates |
|---|---|---|
| TileIO | Moves data between memory levels. Acts as a producer to TilePipeline. | TMA/DMA, layout transforms, swizzling |
| TilePipeline | Coordinates pipeline stages and manages shared memory. | Barriers, producer-consumer sync |
| TileOp | Executes compute operations. Acts as a consumer to TilePipeline. | MMA instructions, register management |
Separation of concerns keeps the components simpler:
- TileIO doesn't know about compute operations.
- TilePipeline doesn't know about memory layouts.
- TileOp doesn't know about global memory.
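The separation can be sketched as a toy Python analogy (class names mirror the table above; the real components are Mojo and far richer). The point is that each class touches only its own concern:

```python
from collections import deque

class TilePipeline:
    """Coordinates stages; knows about synchronization, not layouts or math."""
    def __init__(self, num_stages: int):
        self.slots = deque(maxlen=num_stages)

    def produce(self, tile):
        self.slots.append(tile)      # producer side, fed by TileIO

    def consume(self):
        return self.slots.popleft()  # consumer side, drained by TileOp

class TileIO:
    """Moves data in; knows about memory, not what compute does with it."""
    def __init__(self, source, pipeline: TilePipeline):
        self.source, self.pipeline = source, pipeline

    def load_next(self):
        self.pipeline.produce(self.source.pop(0))

class TileOp:
    """Executes compute; knows nothing about where tiles came from."""
    def __init__(self, pipeline: TilePipeline):
        self.pipeline = pipeline
        self.accum = 0

    def step(self):
        self.accum += self.pipeline.consume()  # stand-in for an MMA

memory = [1, 2, 3]
pipe = TilePipeline(num_stages=2)
io, op = TileIO(memory, pipe), TileOp(pipe)
for _ in range(3):
    io.load_next()
    op.step()
print(op.accum)  # 6
```

Swap the `TileIO` for a different memory path, or the `TileOp` for a different compute operation, and the other two classes are untouched; that substitutability is what the interfaces buy.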
The Power of Separation
With the old monolithic kernels, adding block-scaled quantization (a feature requiring 4 TMA loads instead of 2) meant adding a separate 1,500-line kernel—1,500 lines that were mostly copy-pasted from existing kernels, but with changes scattered throughout the code.
With structured components, those changes were almost all localized in a single place: the “payload” structure representing the data passed between the components.
The pipeline coordination (TilePipeline), compute logic (TileOp), and data movement patterns (TileIO) didn’t need to change because they're properly decoupled.
Pipeline Lifecycle Management
The encapsulation payoff extends to the MMA and epilogue warps. In the original kernel, the MMA warp manually manages the output pipeline — calling producer_stage(), wait_consumer(), producer_step(), and handling TMEM allocation, lock release, and deallocation as separate operations:
Every pipeline transition is explicit: producer_stage() → wait_consumer() → work → producer_step(). TMEM allocation at the top, lock release and deallocation at the bottom. Miss any step and you get a silent hang or corruption.
In the structured kernel, context managers handle the entire lifecycle:
The `with mma_ctx:` block handles the full TMEM lifecycle (allocate, sync with epilogue, deallocate). The nested `with` blocks handle output pipeline producer staging, input pipeline consumer stepping, and per-tile barrier acquire/release. No manual `step()` calls, no explicit deallocation sequence.
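The lifecycle idea translates directly to context-manager semantics. Here is a hedged Python analogy (`MmaContext` and its log entries are invented for illustration; the real kernel manages TMEM in Mojo):

```python
class MmaContext:
    """Owns a TMEM-like resource lifecycle: allocate on enter, sync and
    deallocate on exit, so no code path can skip a step."""
    def __init__(self):
        self.allocated = False
        self.log = []

    def __enter__(self):
        self.allocated = True
        self.log.append("allocate")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.log.append("sync_with_epilogue")
        self.allocated = False
        self.log.append("deallocate")
        return False  # never swallow errors from the body

mma_ctx = MmaContext()
with mma_ctx:
    mma_ctx.log.append("issue_mma")  # stand-in for the main loop's work
print(mma_ctx.log)
```

Because teardown lives in `__exit__`, the deallocation sequence runs on every path out of the block, which is exactly the class of bug the manual version made possible.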
In Practice: The TMA Load Warp
We’ve seen how structured kernels simplify setup and MMA pipeline management. Now let’s look at the TMA load warp — the warp responsible for moving tiles from global memory into shared memory.
Both versions use the same warp-specialized architecture — dedicated warps for TMA loading, MMA compute, scheduling, and epilogue. The difference is how pipeline synchronization is managed within each warp.
Original: Manual Pipeline Management
In the original kernel, the TMA load warp manually manages pipeline state — calling wait_consumer() before loading, producer_step() after, and implementing its own drain loop to prevent early CTA exit:
The pipeline protocol is spread across the loop: producer_step() after each tile load, additional wait_consumer() calls inside the load_AB helper (not shown), manual CLC throttle signaling, and an explicit drain loop at the end. Every step() call must be paired correctly or the kernel hangs.
Structured: Context-Managed Pipelines
The structured version uses context managers to handle pipeline transitions automatically. The with blocks guarantee correct acquire/release ordering at compile time:
The `with producer.acquire() as tiles:` block replaces the manual `wait_consumer()` → load → `producer_step()` sequence. The `with work_iter.next() as current:` block replaces the manual `scheduler.fetch_next_work()` and `clc_pipe_consumer_state.step()`. And `producer.drain()` replaces the hand-written drain loop.
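A hedged Python sketch of the acquire/step/drain discipline (the `ProducerPipeline` class and its simulated consumer are invented for illustration; the real synchronization uses hardware barriers):

```python
from collections import deque
from contextlib import contextmanager

class ProducerPipeline:
    """Acquire/release discipline for pipeline stages, sketched in Python."""
    def __init__(self, num_stages: int):
        self.free = deque(range(num_stages))  # stages the consumer released
        self.in_flight = deque()              # stages holding loaded tiles

    @contextmanager
    def acquire(self):
        stage = self.free.popleft()   # stands in for wait_consumer()
        yield stage                   # body loads tiles into this stage
        self.in_flight.append(stage)  # stands in for producer_step()

    def consumer_release(self, ):
        self.free.append(self.in_flight.popleft())

    def drain(self):
        # Wait until every in-flight stage is consumed (simulated here),
        # replacing the hand-written drain loop before CTA exit.
        while self.in_flight:
            self.consumer_release()

producer = ProducerPipeline(num_stages=2)
for tile in range(3):
    with producer.acquire() as stage:
        pass  # load `tile` into `stage` would go here
    producer.consumer_release()  # simulated consumer frees the stage
producer.drain()
print(len(producer.in_flight))  # 0: nothing left in flight at exit
```

The pairing of wait and step is enforced by the shape of the `with` block rather than by programmer discipline, so a missing `step()` simply cannot be written.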
The structured version:
- Encapsulates pipeline protocol in context managers — `acquire()`/`release()` replace manual `wait_consumer()`/`producer_step()` pairs
- Bundles work iteration with scheduling — `work_iter.next()` replaces scattered `scheduler.fetch_next_work()` + `step()` calls
- Guarantees correctness — the `with` blocks make it impossible to forget a `step()` call or misorder operations
- Enables code reuse — `load_input_tiles`, `mma`, and `TileWriterType` are separate methods shared across kernel variants
Performance: Zero-Cost Abstractions
At this point the reasonable question is: what does all this structure cost at runtime? Nothing. The abstractions exist only in the source code.
This is a direct consequence of how Mojo handles compile-time execution. Four properties work together to ensure the structured components leave no runtime trace:
- `@register_passable("trivial")` guarantees core types live entirely in registers (no heap allocation, no pointer indirection).
- `@always_inline` ensures every function call is inlined at compile time, leaving no call overhead in the generated code.
- Compile-time parameters mean the GPU never evaluates configuration logic. It receives code that was specialized for exactly this use case at compile time.
- RAII patterns make resource management explicit in the source code without adding runtime cost. The compiler resolves ownership and lifetimes statically, generating no additional instructions in the output.
Mojo is currently the only language with all these features working together for GPU programming. C++ has templates and RAII, but no `@register_passable` for guaranteeing register placement. Rust has zero-cost abstractions, but GPU support is still maturing. CUDA has GPU support, but limited compile-time metaprogramming. Python has great ergonomics, but no compile-time execution.
Mojo combines full compile-time metaprogramming, guaranteed resource management, and first-class GPU support in a single language, enabling these abstractions to deliver at zero runtime cost.
The production numbers bear this out. Here's actual data from our SM100 kernel:
| Metric | Legacy sm100 | Structured sm100 | Change |
|---|---|---|---|
| Total Lines | 14,683 | 7,634 | -48% |
| Main Kernel | 3,721 | 1,843 | -50% |
| Performance | ~1770 TFLOPS | ~1770 TFLOPS | Equal |
We cut the code nearly in half while maintaining identical performance. The abstractions exist only in your text editor. The generated GPU assembly is indistinguishable from hand-written code.
When you do need to go lower, when a specific kernel demands something the structured components don't express, Mojo lets you. Nothing in this architecture prevents you from writing directly to the hardware. The difference is that you'll rarely need to, and when you do, you'll be modifying a codebase that's half the size and considerably easier to reason about.
What's Next
This post has laid out the problem and the architecture at a high level. The next three posts get into the specifics:
- Part 2: The Three Pillars: TileIO, TilePipeline, and TileOp explained with real code, covering how each component works and why the interfaces are designed the way they are.
- Part 3: Platform Implementations: How the same structured patterns map onto NVIDIA Blackwell and AMD MI300X, and what changes when the underlying hardware does.
- Part 4: Composition and Unification: Extending the architecture for block-scaled matmul and other demanding kernels.
The goal of this series isn't just to show you how to write fast GPU code. It's to show you how to write GPU code that stays fast as requirements evolve, stays understandable as your team grows, and stays portable as the hardware landscape shifts.
Because the next GPU architecture is coming. And the one after that. And the code you write today should still be maintainable when it arrives.
TL;DR
- GPU programming has an accessibility problem. A decade of architectural changes has progressively shifted responsibility for managing execution onto the programmer. That burden gatekeeps hardware that should be available to far more developers.
- DSLs narrow the gap but don't close it. Triton and similar tools trade peak performance for productivity. This may be good enough for training, but it doesn’t keep up with the demands of today’s inference workloads.
- Performance, productivity, and portability are not mutually exclusive. They require the right abstractions and the right language primitives, but they are achievable together.
- Structured kernels are the path. Separate concerns, compose components, and let the compiler do the rest.
- Zero-cost abstractions are real. Mojo is uniquely positioned with the compile-time metaprogramming and resource management features needed to make high-level abstractions disappear at runtime. No other language currently offers this combination for GPU programming.
By implementing Structured Mojo Kernels, we cut our kernel codebase nearly in half while maintaining identical performance. The abstractions aren't slowing anything down: they're enabling us to maintain and extend our kernel library faster than ever.




