
February 27, 2026

Structured Mojo Kernels Part 1 - Why Structured Kernels?

Fabio Riccardi

Modular Kernel Team

Engineering

The Elephant in the Data Center

The GPU programming community is facing a difficult reality: writing high-performance GPU code is unreasonably hard, and it's getting worse.

The bar for peak TFLOPS keeps getting higher, and kernel authors are expected to achieve rapid model bringup on ever-changing hardware. This is the code that powers LLM inference, trains foundation models, and runs production AI infrastructure. Everyone is betting on this code to lower TCO and customer latency. In this environment, high-performance kernels are an essential competitive edge.

Most of this code is brittle by necessity. A typical production kernel spans 3,000-5,000 lines of tightly-coupled logic where a single misplaced barrier will silently corrupt your results. Good luck figuring out which one at 2am when your inference server is on fire.

This complexity is a direct consequence of how GPU hardware has evolved. It's created a problem: only a handful of engineers on the planet can write this code well.

The Accessibility Problem

GPUs are everywhere. They're in every cloud, every data center, and (let's not forget) in every laptop and phone. The demand for GPU-accelerated software has never been higher. Yet the pool of people who can write efficient GPU code hasn't grown to match.

Why? Complexity acts as a gatekeeper. You need to understand:

  • Warp-level programming and thread divergence
  • Memory hierarchies (registers, shared memory, L2, HBM)
  • Asynchronous execution and software pipelining
  • Platform-specific intrinsics (TMA, wgmma, mfma, buffer loads...)
  • Barrier synchronization in multi-stage pipelines


Writing performant GPU code typically requires years of specialized experience. We recently wrote about this challenge in our four-part series on Blackwell performance.

The DSL Tradeoff

This is where DSLs like Triton entered the picture, and credit where it's due: Triton makes GPU programming more accessible. Write something that looks like Python, get reasonable GPU performance. For many use cases, that's sufficient. However, there's a catch: for inference workloads, "reasonable" isn't good enough.

When you're serving millions of requests per day, the difference between 85% and 100% of peak performance is the difference between 6 GPUs and 7 GPUs. Multiply that across your fleet, and you're talking significant money. You can't close this performance gap without dropping below the abstraction layer entirely. At that point you've lost the productivity benefit that justified using the DSL in the first place. High-level abstractions and full hardware control are in direct tension, and inference workloads tend to force the issue.
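The fleet math is simple to sketch. Here is a back-of-the-envelope calculation in Python; the demand and throughput figures are illustrative assumptions, not numbers from any real deployment:

```python
import math

def gpus_needed(demand_tflops: float, peak_tflops: float, efficiency: float) -> int:
    """GPUs required to serve a fixed demand when each GPU delivers
    only a fraction (efficiency) of its peak throughput."""
    return math.ceil(demand_tflops / (peak_tflops * efficiency))

# Hypothetical fleet: demand equivalent to ~5.9 GPUs running flat out.
demand, peak = 5900.0, 1000.0
print(gpus_needed(demand, peak, 1.00))  # 6 GPUs at 100% of peak
print(gpus_needed(demand, peak, 0.85))  # 7 GPUs at 85% of peak
```

One extra GPU per six, multiplied across a fleet, is where the "significant money" comes from.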

(For a deeper dive on this tradeoff, see our DAIC post on Python eDSLs.)

Our Requirements

What we actually need, and what the industry has been chasing for decades, is the trinity of:

  1. Performance: Hit peak hardware utilization
  2. Productivity: Write code that humans can understand and maintain
  3. Portability: Run efficiently across different GPU architectures

Pick any two, they said. You can't have all three.

At Modular, we respectfully disagree. And this blog series is going to show you why.

In our Matrix Multiplication on Blackwell series, we showed how to squeeze every last TFLOP out of NVIDIA's latest hardware. That series was about understanding the hardware at the lowest level and writing code that matches it exactly.

This series asks a different question: How do we write GPU kernels that are both fast AND maintainable?

The answer is Structured Mojo Kernels: a software architecture that brings the separation of concerns we take for granted in every other domain of software to the world of GPU programming, through modular components, clear interfaces, and code you can actually read six months later.

Here’s the kicker: none of this costs anything at runtime. Mojo's compile-time metaprogramming means the abstractions exist only in the source code. The GPU assembly the compiler produces is identical to what you'd get from a hand-written, monolithic kernel. You get the maintainability of structured software and the performance of code written close to the metal without the headaches.

💡
Code: All kernels mentioned in this series are available in the Modular repository.

The Growing Complexity Problem

Modern GPU kernels have become extraordinarily complex. A production matmul kernel targeting NVIDIA's Blackwell or AMD's MI300X typically spans 3,000-5,000 lines of highly interdependent code. This complexity results in a number of issues:

The Monolithic Kernel Problem

Here’s what this looks like in practice. This is from our 3,721-line SM100 matmul kernel — the production code that powered our Blackwell inference before the structured refactoring. Just the shared memory and pipeline setup requires extracting 11 fields from the SMEM struct, constructing 3 separate LayoutTensorIter objects with full type annotations, creating 3 ProducerConsumerPipeline instances, and manually extracting raw pointers for barriers and TMEM:

mojo
# Original kernel: shared memory and pipeline setup (trimmed from ~60 lines)

ref smem_storage = external_memory[...].bitcast[SmemType]()[]

# Manually extract 11 named fields from the SMEM struct
ref a_smem_storage = smem_storage.a_smem
ref b_smem_storage = smem_storage.b_smem
ref c_smem_storage = smem_storage.c_smem
ref tma_mma_mbars_storage = smem_storage.tma_mma_mbars
ref accum_mbars_storage = smem_storage.accum_mbars
ref clc_mbars_full_storage = smem_storage.clc_mbars_full
ref clc_mbars_empty_storage = smem_storage.clc_mbars_empty
ref clc_response_storage = smem_storage.clc_response
ref clc_throttle_storage = smem_storage.clc_throttle_mbars
ref tmem_addr_storage = smem_storage.tmem_addr
ref tmem_dealloc_mbar_storage = smem_storage.tmem_dealloc_mbar

# Construct typed iterators with full layout + address space annotations
var a_smem = LayoutTensorIter[a_type, a_smem_layout, MutAnyOrigin,
    address_space = AddressSpace.SHARED, alignment=128,
](a_smem_storage.unsafe_ptr(), SmemType.a_smem_size)
var b_smem = LayoutTensorIter[b_type, b_smem_layout, MutAnyOrigin,
    address_space = AddressSpace.SHARED, alignment=128,
](b_smem_storage.unsafe_ptr(), SmemType.b_smem_size)
var c_smem_iter = LayoutTensorIter[c_type, Layout.row_major(...), MutAnyOrigin,
    address_space = AddressSpace.SHARED, alignment=128,
](c_smem_storage.unsafe_ptr(), SmemType.c_smem_size)

# Create 3 separate pipeline objects from raw barrier pointers
var load_mma_pipeline = ProducerConsumerPipeline[...](
    tma_mma_mbars_storage.unsafe_ptr())
var mma_output_pipeline = ProducerConsumerPipeline[...](
    accum_mbars_storage.unsafe_ptr())
var load_clc_pipeline = ProducerConsumerPipeline[...](
    clc_throttle_storage.unsafe_ptr())

# Extract raw pointers for TMEM address, CLC responses, barriers
var ptr_tmem_addr = tmem_addr_storage.unsafe_ptr()
clc_response = clc_response_storage.unsafe_ptr()
clc_full_mbar = clc_mbars_full_storage.unsafe_ptr()
clc_empty_mbar = clc_mbars_empty_storage.unsafe_ptr()
tmem_dealloc_mbar = tmem_dealloc_mbar_storage.unsafe_ptr()

This is before the kernel even reaches its barrier initialization or main loop. Every field extraction, every pipeline constructor, every raw pointer — they’re all loose variables in the same flat scope, repeated across every kernel variant.

In the structured kernel, the same setup is:

mojo
# Structured kernel: shared memory and pipeline setup

ref smem = external_memory[
    Scalar[DType.uint8], address_space = AddressSpace.SHARED, alignment=128,
]().bitcast[Self.SmemType]()[]

# Pipeline bundles tile storage with synchronization
var tile_payload = Self.TilePayload(smem.a_tiles(), smem.b_tiles())
var input_pipeline = Self.InputTilePipeline(
    smem.pipelines.input_barriers(), tile_payload)

# Kernel context encapsulates election vars, CTA coords, and masks
var ctx = Self.Context(smem.pipelines.tmem_addr())

Six lines. We’ll explain these components in detail shortly, but even at a glance the difference is clear: SmemType encapsulates all tile storage, barrier arrays, and CLC state behind typed accessors. InputTilePipeline bundles the barriers with tile data into a single object that manages the producer-consumer lifecycle. And Context replaces the scattered election variables and CTA coordinate calculations.

When everything is interleaved in a monolithic kernel, making changes becomes treacherous:

  • Adding a feature (like block-scaled quantization) requires touching dozens of locations across hundreds of lines
  • Debugging a race condition means understanding the entire file’s control flow
  • Porting to new hardware often means a complete rewrite

From Hardware Scheduling to Software Orchestration

Why did kernels become so complex? The answer lies in how GPU hardware has evolved.

GPU Programming Model Evolution

The Old Model: Let Hardware Handle It

In the SIMT era, GPU programming was relatively simple:

mojo
# Classic SIMT pattern (conceptual)
fn simple_kernel(data: Tensor):
    idx = thread_id()
    result = compute(data[idx])
    output[idx] = result
    # Hardware scheduler handles everything else

The hardware scheduler would:

  • Swap warps in/out as memory operations completed
  • Hide latency through massive parallelism
  • Manage the memory hierarchy transparently

The New Model: Software Must Orchestrate

Modern GPUs have shifted to coarse-grained, long-latency operations:

Era     | Operation Style | Latency Hiding
--------|-----------------|--------------------
SIMT    | Many small ops  | Hardware scheduling
Modern  | Few large ops   | Software pipelining

Today's tensor cores issue matrix operations spanning hundreds of cycles. DMA engines (TMA on NVIDIA, buffer loads on AMD) transfer entire tiles asynchronously. The programmer must now explicitly:

  1. Pipeline stages: Overlap load, compute, and store
  2. Manage barriers: Coordinate producer-consumer relationships
  3. Specialize threads: Assign different warps different roles
  4. Control memory explicitly: Decide what lives where, and when

In this persistent kernel paradigm, kernels launch once and manage their entire execution lifecycle:

Persistent Kernel Pipeline

    The pipeline overlap is critical: while Tile 1 is being computed, Tile 2 is being loaded and Tile 0 is being stored. This overlap is what achieves peak performance, yet it also creates the complexity problem.
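The overlap can be written down as a schedule: at step t the kernel loads tile t, computes tile t-1, and stores tile t-2. Here is a minimal Python model of that three-stage schedule (a conceptual sketch, not the actual kernel logic):

```python
def pipeline_schedule(num_tiles: int):
    """Yield (load, compute, store) tile indices for each pipeline step.
    None means that stage is idle (pipeline fill or drain)."""
    for step in range(num_tiles + 2):
        load    = step     if step < num_tiles else None
        compute = step - 1 if 0 <= step - 1 < num_tiles else None
        store   = step - 2 if 0 <= step - 2 < num_tiles else None
        yield load, compute, store

for step, (l, c, s) in enumerate(pipeline_schedule(4)):
    print(f"step {step}: load={l} compute={c} store={s}")
```

At step 2 the schedule is exactly the situation described above: tile 2 loading, tile 1 computing, tile 0 storing. The two extra steps at the end are the drain phase that the manual kernels must also handle explicitly.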


    The Solution: Structured Kernel Architecture

    Structured Mojo Kernels address this complexity through separation of concerns. Instead of one monolithic kernel, we organize functionality into three core components, which share a unified configuration layer and a unified data management layer.

Structured Kernel Architecture

    Each component has a single responsibility and well-defined interfaces:

Component    | Responsibility                                                       | Encapsulates
-------------|----------------------------------------------------------------------|---------------------------------------
TileIO       | Moves data between memory levels; acts as a producer to TilePipeline | TMA/DMA, layout transforms, swizzling
TilePipeline | Coordinates pipeline stages and manages shared memory                | Barriers, producer-consumer sync
TileOp       | Executes compute operations; acts as a consumer to TilePipeline      | MMA instructions, register management

    Separation of concerns keeps the components simpler:

    • TileIO doesn't know about compute operations.
    • TilePipeline doesn't know about memory layouts.
    • TileOp doesn't know about global memory.
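To make the separation concrete, here is a deliberately tiny Python analogue of the three components wired together. The class names come from the table above; everything else (the list-based "memory", the doubling "compute") is invented for illustration:

```python
from collections import deque

class TilePipeline:
    """Coordinates staging; knows about buffering, not layouts or math."""
    def __init__(self, stages: int):
        self.buf = deque(maxlen=stages)
    def push(self, tile): self.buf.append(tile)    # producer side
    def pop(self): return self.buf.popleft()       # consumer side

class TileIO:
    """Moves data in; knows about the data source, not compute."""
    def __init__(self, source): self.source = source
    def load(self, i): return self.source[i]

class TileOp:
    """Executes compute; knows about math, not where tiles came from."""
    def apply(self, tile): return [x * 2 for x in tile]

io, pipe, op = TileIO([[1, 2], [3, 4]]), TilePipeline(stages=2), TileOp()
for i in range(2):
    pipe.push(io.load(i))                            # TileIO produces
results = [op.apply(pipe.pop()) for _ in range(2)]   # TileOp consumes
print(results)  # [[2, 4], [6, 8]]
```

Note what each class does not import or touch: swapping the data source changes only TileIO, and swapping the operation changes only TileOp, which is the whole point of the separation.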

    The Power of Separation

With the old monolithic kernels, adding block-scaled quantization (a feature requiring 4 TMA loads instead of 2) meant adding a separate 1,500-line kernel: 1,500 lines that were mostly copy-pasted from existing kernels, but with changes scattered throughout the code.

    With structured components, those changes were almost all localized in a single place: the “payload” structure representing the data passed between the components.

    mojo
    # Before: Scattered changes across 3000+ lines
    
    # After: Localized changes
    @register_passable("trivial")
    struct BlockScaledTilePayload[...](TilePayload):
        var a_tiles: Pointer[...]
        var b_tiles: Pointer[...]
        var sfa_tiles: Pointer[...]  # NEW: Scale factor A
        var sfb_tiles: Pointer[...]  # NEW: Scale factor B
    

    The pipeline coordination (TilePipeline), compute logic (TileOp), and data movement patterns (TileIO) didn’t need to change because they're properly decoupled.
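The decoupling pattern carries over to any language with structured payload types. A Python sketch of the same idea, assuming hypothetical field names modeled on the Mojo struct above: pipeline code written against the base payload keeps working when a variant adds fields.

```python
from dataclasses import dataclass

@dataclass
class TilePayload:
    a_tiles: list
    b_tiles: list

@dataclass
class BlockScaledTilePayload(TilePayload):
    sfa_tiles: list  # NEW: scale factors for A
    sfb_tiles: list  # NEW: scale factors for B

def run_pipeline(payload: TilePayload) -> int:
    """Pipeline logic depends only on the base interface, so it runs
    unchanged when the payload type gains new fields."""
    return len(payload.a_tiles)

print(run_pipeline(TilePayload([1], [2])))                       # 1
print(run_pipeline(BlockScaledTilePayload([1], [2], [3], [4])))  # 1
```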

    Pipeline Lifecycle Management

    The encapsulation payoff extends to the MMA and epilogue warps. In the original kernel, the MMA warp manually manages the output pipeline — calling producer_stage(), wait_consumer(), producer_step(), and handling TMEM allocation, lock release, and deallocation as separate operations:

    mojo
    # Original: MMA warp with manual pipeline management
    if WarpRole.is_mma():
        tcgen05_alloc[config.cta_group](ptr_tmem_addr, max_tmem_cols)
        syncwarp()
        named_barrier_arrive[MMA_THREADS + EPILOGUE_THREADS](1)
        tmem_addr = ptr_tmem_addr[0]
    
        while work_info.is_valid():
            next_work_info = scheduler.fetch_next_work(work_info, clc_pipe_consumer_state)
            clc_pipe_consumer_state.step()
    
            if elect_one_cta:
                var mma_output_mma_stage = mma_output_pipeline.producer_stage()
                mma_output_pipeline.wait_consumer()
                var tmem_offset = tmem_addr + (mma_output_mma_stage * stage_stride_cols)
    
                for i in range(num_iters // config.k_group_size):
                    consumer_main_loop[
                        block_tile_shape = config.block_tile_shape,
                        mma_shape = config.mma_shape,
                        cta_group = config.cta_group,
                        cluster_shape = config.cluster_shape,
                        k_group_size = config.k_group_size,
                    ](tmem_offset, a_smem, b_smem, load_mma_pipeline,
                      mma_op, elect_one_warp, i * config.k_group_size, 0)
                    load_mma_pipeline.consumer_step()
    
                # ... mma_arrive / mma_arrive_multicast based on cta_group ...
                mma_output_pipeline.producer_step()
            work_info = next_work_info
    
        tcgen05_release_allocation_lock[config.cta_group]()
        tmem_dealloc_mbar[].wait()
        tcgen05_dealloc[config.cta_group](tmem_addr, max_tmem_cols)

Every pipeline transition is explicit: producer_stage() → wait_consumer() → work → producer_step(). TMEM allocation at the top, lock release and deallocation at the bottom. Miss any step and you get a silent hang or corruption.

    In the structured kernel, context managers handle the entire lifecycle:

    mojo
    # Structured: MMA warp with context-managed pipeline
    if WarpRole.is_mma():
        var tmem = Self.Tmem.allocate(smem.pipelines.tmem_addr())
        var mma_ctx = Self.MmaCtx(tmem, Self.OutputPipeline(...), Self.TmemDealloc(...))
    
        with mma_ctx:  # handles TMEM alloc → lock release → dealloc on exit
            while work_iter.has_work():
                with work_iter.wait_and_advance():
                    if ctx.elect_one_cta:
                        with mma_ctx.output_pipeline.producer() as output_stage:
                            with input_pipeline.consumer() as consumer:
                                for i in range(0, num_iters, Self.config.k_group_size):
                                    with consumer.acquire() as input_tiles:
                                        Self.mma(output_stage.tmem, input_tiles,
                                            mma_op, ctx.elect_one_warp, UInt32(i), 0)

    The with mma_ctx: block handles the full TMEM lifecycle (allocate, sync with epilogue, deallocate). The nested with blocks handle output pipeline producer staging, input pipeline consumer stepping, and per-tile barrier acquire/release. No manual step() calls, no explicit deallocation sequence.
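The guarantee these blocks provide is the classic context-manager contract: the release half of the protocol runs no matter how the body exits. A minimal Python illustration of the same idea (the names producer_stage, wait_consumer, and producer_step mirror the Mojo code above but the implementation is invented):

```python
from contextlib import contextmanager

log = []

@contextmanager
def producer_stage(stage_id: int):
    """Pair acquire/release automatically: the release step runs
    even if the body raises or returns early."""
    log.append(f"wait_consumer({stage_id})")      # acquire: wait for a free slot
    try:
        yield stage_id
    finally:
        log.append(f"producer_step({stage_id})")  # release: advance the pipeline

with producer_stage(0) as s:
    log.append(f"load_tiles({s})")

print(log)  # ['wait_consumer(0)', 'load_tiles(0)', 'producer_step(0)']
```

Because the release lives in a finally block, "forgot the step() call" ceases to be an expressible bug, which is exactly the class of silent hang the manual version is prone to.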


    In Practice: The TMA Load Warp

    We’ve seen how structured kernels simplify setup and MMA pipeline management. Now let’s look at the TMA load warp — the warp responsible for moving tiles from global memory into shared memory.

    Both versions use the same warp-specialized architecture — dedicated warps for TMA loading, MMA compute, scheduling, and epilogue. The difference is how pipeline synchronization is managed within each warp.

    Original: Manual Pipeline Management

    In the original kernel, the TMA load warp manually manages pipeline state — calling wait_consumer() before loading, producer_step() after, and implementing its own drain loop to prevent early CTA exit:

    mojo
    # Original: TMA load warp with manual pipeline stepping
    if WarpRole.is_main_load():
        while work_info.is_valid():
            # CLC throttle: manual wait/arrive/step
            if is_first_cta_in_cluster and required_clc_query:
                load_clc_pipeline.wait_consumer()
                var load_clc_producer_state = load_clc_pipeline.producer_stage()
                _ = load_clc_pipeline.producer_mbar(load_clc_producer_state)[0].arrive()
                load_clc_pipeline.producer_step()
    
            for i in range(num_iters // config.k_group_size):
                load_AB[
                    block_tile_shape = config.block_tile_shape,
                    mma_shape = config.mma_shape,
                    cta_group = config.cta_group,
                    k_group_size = config.k_group_size,
                ](a_tma_op, b_tma_op, a_smem, b_smem, load_mma_pipeline,
                  peer_cta_coord, (UInt(work_info.m), UInt(work_info.n)),
                  a_multicast_mask, b_multicast_mask,
                  i * config.k_group_size, elect_one_cta)
                load_mma_pipeline.producer_step()
    
            syncwarp()
            var next_work_info = scheduler.fetch_next_work(
                work_info, clc_pipe_consumer_state)
            work_info = next_work_info
            clc_pipe_consumer_state.step()
    
        # Manual drain: prevent CTA exit while peer is still consuming
        for i in range(config.num_pipeline_stages // config.k_group_size):
            load_mma_pipeline.wait_consumer()
            load_mma_pipeline.producer_step()

    The pipeline protocol is spread across the loop: producer_step() after each tile load, additional wait_consumer() calls inside the load_AB helper (not shown), manual CLC throttle signaling, and an explicit drain loop at the end. Every step() call must be paired correctly or the kernel hangs.

    Structured: Context-Managed Pipelines

    The structured version uses context managers to handle pipeline transitions automatically. The with blocks guarantee correct acquire/release ordering at compile time:

    mojo
    # Structured: TMA load warp with context-managed pipeline
    if WarpRole.is_main_load():
        with input_pipeline.producer() as producer:
            while work_iter.has_work():
                with work_iter.next() as current:
                    work_iter.throttle_signal(ctx.is_first_cta_in_cluster)
    
                    for i in range(0, num_iters, Self.config.k_group_size):
                        with producer.acquire() as tiles:
                            Self.load_input_tiles(
                                a_tma_op, b_tma_op, tiles,
                                ctx.peer_cta_coord,
                                (UInt(current.m), UInt(current.n), UInt(current.k_start)),
                                ctx.a_multicast_mask, ctx.b_multicast_mask,
                                UInt32(i), ctx.elect_one_cta)
    
                    syncwarp()
    
            producer.drain()  # automatic drain before CTA exit

    The with producer.acquire() as tiles: block replaces the manual wait_consumer() → load → producer_step() sequence. The with work_iter.next() as current: block replaces the manual scheduler.fetch_next_work() and clc_pipe_consumer_state.step(). And producer.drain() replaces the hand-written drain loop.

    The structured version:

• Encapsulates the pipeline protocol in context managers: acquire() / release() replace manual wait_consumer() / producer_step() pairs
• Bundles work iteration with scheduling: work_iter.next() replaces scattered scheduler.fetch_next_work() + step() calls
• Guarantees correctness: the with blocks make it impossible to forget a step() call or misorder operations
• Enables code reuse: load_input_tiles, mma, and TileWriterType are separate methods shared across kernel variants

    Performance: Zero-Cost Abstractions

    At this point the reasonable question is: what does all this structure cost at runtime? Nothing. The abstractions exist only in the source code.

    This is a direct consequence of how Mojo handles compile-time execution. Four properties work together to ensure the structured components leave no runtime trace:

    1. @register_passable("trivial") guarantees core types live entirely in registers (no heap allocation, no pointer indirection).
    2. @always_inline ensures every function call is inlined at compile time, leaving no call overhead in the generated code.
    3. Compile-time parameters mean the GPU never evaluates configuration logic. It receives code that was specialized for exactly this use case at compile time.
4. RAII patterns make resource management explicit in the source code without adding runtime cost. The compiler resolves ownership and lifetimes statically, generating no additional instructions in the output.
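Property 3 (compile-time parameters) has a loose run-time analogue in any language: resolve configuration once when the kernel is constructed so the hot path carries no config branches. A Python sketch of that specialization idea, with entirely hypothetical config names; in Mojo the same resolution happens at compile time, leaving literally zero residue:

```python
def make_kernel(k_group_size: int, use_scales: bool):
    """Resolve configuration up front; the returned function contains
    no per-call config branches (a run-time stand-in for Mojo's
    compile-time parameter specialization)."""
    if use_scales:
        def kernel(tiles, scales):
            # specialized variant: scale factors baked into the loop
            return [t * s * k_group_size for t, s in zip(tiles, scales)]
    else:
        def kernel(tiles, scales=None):
            # specialized variant: no scale-factor handling at all
            return [t * k_group_size for t in tiles]
    return kernel

plain = make_kernel(k_group_size=2, use_scales=False)
scaled = make_kernel(k_group_size=2, use_scales=True)
print(plain([1, 2, 3]))          # [2, 4, 6]
print(scaled([1, 2], [10, 10]))  # [20, 40]
```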

    Mojo is currently the only language with all these features working together for GPU programming. C++ has templates and RAII, but no @register_passable for guaranteeing register placement. Rust has zero-cost abstractions, but GPU support is still maturing. CUDA has GPU support, but limited compile-time metaprogramming. Python has great ergonomics, but no compile-time execution.

    Mojo combines full compile-time metaprogramming, guaranteed resource management, and first-class GPU support in a single language, enabling these abstractions to deliver at zero runtime cost.

    The production numbers bear this out. Here's actual data from our SM100 kernel:

Metric      | Legacy sm100 | Structured sm100 | Change
------------|--------------|------------------|-------
Total Lines | 14,683       | 7,634            | -48%
Main Kernel | 3,721        | 1,843            | -50%
Performance | ~1770 TFLOPS | ~1770 TFLOPS     | Equal

    We cut the code nearly in half while maintaining identical performance. The abstractions exist only in your text editor. The generated GPU assembly is indistinguishable from hand-written code.

When you do need to go lower, when a specific kernel demands something the structured components don't express, Mojo lets you. Nothing in this architecture prevents you from writing directly to the hardware. The difference is that you'll rarely need to, and when you do, you'll be modifying a codebase that's half the size and considerably easier to reason about.

    What's Next

    This post has laid out the problem and the architecture at a high level. The next three posts get into the specifics:

    • Part 2: The Three Pillars: TileIO, TilePipeline, and TileOp explained with real code, covering how each component works and why the interfaces are designed the way they are.
    • Part 3: Platform Implementations: How the same structured patterns map onto NVIDIA Blackwell and AMD MI300X, and what changes when the underlying hardware does.
    • Part 4: Composition and Unification: Extending the architecture for block-scaled matmul and other demanding kernels.

    The goal of this series isn't just to show you how to write fast GPU code. It's to show you how to write GPU code that stays fast as requirements evolve, stays understandable as your team grows, and stays portable as the hardware landscape shifts.

    Because the next GPU architecture is coming. And the one after that. And the code you write today should still be maintainable when it arrives.


    TL;DR

    1. GPU programming has an accessibility problem. A decade of architectural changes has progressively shifted responsibility for managing execution onto the programmer. That burden gatekeeps hardware that should be available to far more developers.
    2. DSLs narrow the gap but don't close it. Triton and similar tools trade peak performance for productivity. This may be good enough for training, but it doesn’t keep up with the demands of today’s inference workloads.
    3. Performance, productivity, and portability are not mutually exclusive. They require the right abstractions and the right language primitives, but they are achievable together.
    4. Structured kernels are the path. Separate concerns, compose components, and let the compiler do the rest.
5. Zero-cost abstractions are real. Mojo is uniquely positioned with the compile-time metaprogramming and resource management features needed to make high-level abstractions disappear at runtime. No other language currently offers this combination for GPU programming.

    By implementing Structured Mojo Kernels, we cut our kernel codebase nearly in half while maintaining identical performance. The abstractions aren't slowing anything down: they're enabling us to maintain and extend our kernel library faster than ever.

