
April 13, 2026

TileTensor Part 1 - Safer, More Efficient GPU Kernels

Lukas Hermann

Engineering

Part 1: Why TileTensor?

Writing a high-performance GPU kernel means thinking carefully about memory (see our series on Matrix Multiplication on Blackwell). Not just what data to load, but how that data is laid out and how it maps to physical addresses. Getting this right manually is tedious and error-prone. TileTensor is Mojo's answer: a tensor type that lets kernel authors express complex memory layouts precisely, safely, and efficiently.

This post covers what TileTensor does, why it exists, and how to use it. Part 2 covers the Mojo language features that made the design possible.

Background: layouts, strides, and swizzles

To understand why TileTensor is useful, you need to understand what a layout is and why it matters.

A layout describes both the logical shape of a tensor and how its elements map to physical memory addresses. It has two components: a shape (such as (1024, 8)) and a stride (such as (8, 1)). The stride tells you how many elements to step in memory for each step along a given logical dimension. A stride of 1 in the final dimension makes this row-major. These pairs are written together as ((1024, 8):(8, 1)).
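As a plain-Python sketch of that mapping (illustrative only, not the Mojo API): the physical offset of a logical coordinate is the dot product of the coordinate with the strides.

```python
# Illustrative sketch (not the Mojo API): a layout maps a logical
# coordinate to a flat memory offset via a dot product with the strides.
def linear_index(coords, strides):
    return sum(c * s for c, s in zip(coords, strides))

# The ((1024, 8):(8, 1)) row-major layout from the text:
strides = (8, 1)
print(linear_index((0, 3), strides))  # 3: stepping along a row moves by 1
print(linear_index((2, 0), strides))  # 16: stepping down a column moves by 8
```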

Layouts can also be nested (as shown in the Mojo Manual). A layout like ((1024, (4, 2)):(8, (2, 1))) describes a tiled memory arrangement where a 1024-row, 8-column logical space maps to a specific interleaved physical pattern. In this example, we have interleaved the ‘inner’ 8 values in a row to arrangement [0, 2, 4, 6, 1, 3, 5, 7]. This flexibility is what makes layout algebra useful for GPU programming: you can express row-major, column-major, and tiled arrangements in a single framework.
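To see where the [0, 2, 4, 6, 1, 3, 5, 7] pattern comes from, here is a small Python sketch (illustrative only, not the Mojo API) that evaluates the inner ((4, 2):(2, 1)) mode of the layout: a column index splits colexicographically into a (4, 2) coordinate, and the offset is that coordinate's dot product with the strides (2, 1).

```python
# Illustrative sketch (not the Mojo API): evaluate the inner mode of
# ((1024, (4, 2)):(8, (2, 1))) for the 8 columns within one row.
def inner_offset(c: int) -> int:
    c0, c1 = c % 4, c // 4  # split c into the nested (4, 2) coordinate
    return c0 * 2 + c1 * 1  # dot product with the nested strides (2, 1)

print([inner_offset(c) for c in range(8)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```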

But shape and stride alone are not the full picture. GPU shared memory is organized into banks, and when multiple threads in a warp access the same bank simultaneously, their accesses serialize (a bank conflict). The solution is swizzling: rearranging the memory layout to distribute accesses across different banks. Swizzle patterns are part of a layout's definition, and they're one of the main reasons you want a dedicated abstraction rather than maintaining index arithmetic by hand. A swizzle cannot be expressed as an affine transform: no combination of shape and stride values produces the non-linear bit permutation that swizzling requires. This is why layout algebra needs a separate swizzle component rather than folding everything into strides. TileTensor makes the layout a first-class, compile-time object, so indexing, vectorization, and correctness constraints are generated and checked rather than handwritten.
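A toy Python model (illustrative only; real hardware has 32 banks and wider bank words) shows why the XOR matters: without swizzling, eight threads reading down a column all hit bank 0; with it, they hit eight different banks.

```python
# Toy model of shared-memory banking (illustrative only): assume 8 banks
# and a 32x8 row-major tile, so element (r, c) lives at address r*8 + c.
def bank(r: int, c: int, swizzled: bool, banks: int = 8) -> int:
    col = c ^ (r & 0x7) if swizzled else c  # the XOR swizzle from the text
    return (r * 8 + col) % banks

# Eight threads each read logical column 0 of rows 0..7:
print([bank(r, 0, swizzled=False) for r in range(8)])  # all bank 0: 8-way conflict
print([bank(r, 0, swizzled=True) for r in range(8)])   # banks 0..7: conflict-free
```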

For a full walkthrough of layout algebra, see Modular's layout algebra documentation.

The indexing problem

Before getting to TileTensor, here's what the problem looks like without it.

Suppose you want to load a 2D tile of a matrix, where the tile is stored in shared memory in a specific interleaved layout to avoid bank conflicts. This example uses a toy XOR swizzle to illustrate the class of bugs; real kernels use hardware- and layout-specific swizzles and vectorized accesses. Without a layout abstraction, here is a kernel that does this, launched with a block size of (32, 8):

c++
// Manual 2D-to-1D index computation for a swizzled 32x8 tile
// tile_row and tile_col are logical coordinates
// physical address requires computing the swizzled bank index
__global__ void load_tile(const float* src, float* dst,
                          int tile_row, int tile_col, int stride) {
    __shared__ float smem[32 * 8];  // 32 rows x 8 cols
    for (int r = threadIdx.y; r < 32; r += blockDim.y) {
        for (int c = threadIdx.x; c < 8; c += blockDim.x) {
            // Swizzle: XOR col with bits from row to distribute across banks
            int swizzled_col = c ^ (r & 0x7);
            int logical_idx  = (tile_row * 32 + r) * stride + (tile_col * 8 + c);
            int physical_idx = r * 8 + swizzled_col;
            smem[physical_idx] = src[logical_idx];
            // do some operation
            dst[logical_idx]   = smem[physical_idx];
        }
    }
}

This is easy to get wrong. The swizzle formula is specific to the tile shape and the hardware bank width. Change the tile shape or target a different GPU and you have to recompute it. There's no type system help.

In CUDA C++ you can use the CuTe library to simplify the program:

c++
#include <cuda_runtime.h>

#include <cute/atom/copy_atom.hpp>
#include <cute/swizzle.hpp>
#include <cute/tensor.hpp>

using namespace cute;

template <typename SrcTensor, typename DstTensor>
__global__ void load_tile(SrcTensor src, DstTensor dst, int tile_row, int tile_col) {
    using SmemLayout = decltype(
        composition(
            Swizzle<3, 0, 3>{},
            make_layout(
                make_shape(Int<32>{}, Int<8>{}),
                make_stride(Int<8>{}, Int<1>{})
            )
        )
    );

    __shared__ float smem_buf[32 * 8];
    auto s_tile = make_tensor(make_smem_ptr(smem_buf), SmemLayout{});

    auto g_tile = local_tile(
        src,
        make_tile(Int<32>{}, Int<8>{}),
        make_coord(tile_row, tile_col)
    );
    auto d_tile = local_tile(
        dst,
        make_tile(Int<32>{}, Int<8>{}),
        make_coord(tile_row, tile_col)
    );

    for (int r = threadIdx.y; r < 32; r += blockDim.y) {
        for (int c = threadIdx.x; c < 8; c += blockDim.x) {
            s_tile(r, c) = g_tile(r, c);
            d_tile(r, c) = s_tile(r, c);
        }
    }
}

This is more expressive than manual indexing, but SrcTensor and DstTensor are unconstrained template parameters. Nothing in the type system prevents passing the wrong layout, a mismatched element size, or a non-swizzled layout where a swizzled one is required. Errors show up at run time as incorrect results or as silent correctness bugs that only appear at specific tile sizes.

Here's the same operation in Mojo with TileTensor:

mojo
from layout import TileTensor, TensorLayout, row_major, stack_allocation
from layout.swizzle import Swizzle
from std.gpu import block_dim, thread_idx


def load_tile_kernel(
    src: TileTensor[DType.float32, ...],
    dst: TileTensor[mut=True, DType.float32, ...],
    tile_row: Int,
    tile_col: Int,
):
    comptime assert src.flat_rank == 2
    comptime assert dst.flat_rank == 2

    g_tile = src.tile[32, 8](tile_row, tile_col)
    s_tile = stack_allocation[
        dtype=DType.float32, address_space=AddressSpace.SHARED
    ](row_major[32, 8]())

    d_tile = dst.tile[32, 8](tile_row, tile_col)
    comptime thread_layout = row_major[32, 8]()

    for r in range(thread_idx.y, 32, block_dim.y):
        for c in range(thread_idx.x, 8, block_dim.x):
            tid = thread_layout(r, c)
            g_frag = g_tile.distribute[thread_layout](tid)
            s_frag = s_tile.distribute[thread_layout, Swizzle(3, 0, 3)](tid)
            d_frag = d_tile.distribute[thread_layout](tid)

            s_frag[0, 0] = g_frag[0, 0]
            d_frag[0, 0] = s_frag[0, 0]

The three versions compute the same tile load. In the CUDA version, the swizzle formula is handwritten and specific to a 32x8 tile on hardware with 8-bank shared memory. Change the tile dimensions or move to a GPU with different bank geometry and the formula has to be recomputed manually. The CuTe version removes the explicit index arithmetic, but SrcTensor and DstTensor are unconstrained template parameters: nothing prevents passing a non-swizzled layout where a swizzled one is required. In the Mojo version, the swizzle is part of the layout passed to distribute, so the compiler generates the address computation from that description and enforces at compile time that the access pattern is valid.

Constructing a TileTensor

Here's what construction looks like for both static and dynamic shapes, on host and device:

mojo
from std.gpu import DeviceContext
from layout import TileTensor, Idx, row_major
from layout.tile_layout import Layout

def main() raises:
    a = alloc[Float32](16)
    tensor_8s = TileTensor(a, row_major[8]())           # static shape
    tensor_8s_idx = TileTensor(a, row_major(Idx(8)))    # static via Idx
    tensor_8s_param = TileTensor(a, row_major(Idx[8]())) # static via parameter
    row, col = 8, 2
    tensor_8d2 = TileTensor(a, row_major(Idx(row), Idx(col))) # dynamic

    with DeviceContext() as ctx:
        b = ctx.enqueue_create_buffer[DType.float32](16)
        gmem_8s = TileTensor(b, row_major[8]())         # static
        col = 8
        gmem_8d = TileTensor(b, row_major(Idx(col)))    # dynamic

The equivalent in CuTe:

c++
#include <cute/tensor.hpp>
#include <cute/numeric/integral_constant.hpp>
#include <cute/pointer.hpp>
#include <malloc.h>

int main() {
  float* A = (float*)(malloc(sizeof(float)*16));

  Tensor tensor_8 = make_tensor(A, make_layout(Int<8>{}));
  Tensor tensor_8s = make_tensor(A, Int<8>{});
  Tensor tensor_8d2 = make_tensor(A, 8, 2);

  float* B;
  cudaMalloc(&B, 8 * 16 * sizeof(float));
  Tensor gmem_8s = make_tensor(make_gmem_ptr(B), Int<8>{});
  Tensor gmem_8d = make_tensor(make_gmem_ptr(B), 8);

  return 0;
}

The construction syntax is similar. The difference shows up when you use the tensor.

Shared memory allocation follows the same pattern. In Mojo:

mojo
from layout import coord, col_major, row_major, stack_allocation

def kernel():
    smem_layout = coord[4, 8]()
    smem_4x8_col = stack_allocation[DType.float32, AddressSpace.SHARED](
        col_major(smem_layout)
    )
    smem_4x8_row = stack_allocation[DType.float32, AddressSpace.SHARED](
        row_major(smem_layout)
    )

In CuTe:

c++
void kernel() {
  Layout smem_layout = make_layout(make_shape(Int<4>{},Int<8>{}));
  __shared__ float smem[decltype(cosize(smem_layout))::value];
  Tensor smem_4x8_col = make_tensor(make_smem_ptr(smem), smem_layout);
  Tensor smem_4x8_row = make_tensor(make_smem_ptr(smem), shape(smem_layout), LayoutRight{});
}

Memory tiling

One of the core use cases for TileTensor is expressing tiled data movement between global memory and shared memory. Modern GPU hardware provides dedicated instructions for this: NVIDIA's Tensor Memory Accelerator (TMA) and AMD's Data Movement Engine (DME) both operate at the tile level. TileTensor's layout system maps directly onto these instructions, letting you express tile-level transfers without manually computing addresses or managing alignment constraints. For a broader look at tile-based computation patterns in Mojo, see the structured kernels series.

Type safety at the call site

The type-safety difference is most visible when writing kernel functions. Here's a vector add in Mojo:

mojo
from std.gpu import global_idx
from layout import TileTensor

def vector_add_kernel[
    dtype: DType
](
    a: TileTensor[dtype, element_size=1, ...],
    b: TileTensor[dtype, element_size=1, ...],
    c: TileTensor[mut=True, dtype, element_size=1, ...],
    n: Int,
):
    comptime assert a.flat_rank == 1, "a must be rank 1"
    comptime assert b.flat_rank == 1, "b must be rank 1"
    comptime assert c.flat_rank == 1, "c must be rank 1"

    idx = global_idx.x
    if idx < n:
        c[idx] = a[idx] + b[idx]

The type signature guarantees that a, b, and c share the same dtype and are all scalar. The equivalent in CuTe:

c++
template <typename TensorA, typename TensorB, typename TensorC>
__global__ void vector_add_kernel(TensorA A, TensorB B, TensorC C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C(idx) = A(idx) + B(idx);
    }
}

TensorA, TensorB, and TensorC are unconstrained. Nothing prevents passing operands of different types, mismatched shapes, or tensors with incompatible layouts. Expressing these constraints with CuTe types requires seven template parameters, and the signatures become much more complex:

c++
#include <cuda_runtime.h>
#include <cute/tensor.hpp>
using namespace cute;
template <typename T, typename ShapeA, typename StrideA,
          typename ShapeB, typename StrideB,
          typename ShapeC, typename StrideC>
__global__ void vector_add_kernel(
    Tensor<ViewEngine<gmem_ptr<T const>>, Layout<ShapeA, StrideA>> A,
    Tensor<ViewEngine<gmem_ptr<T const>>, Layout<ShapeB, StrideB>> B,
    Tensor<ViewEngine<gmem_ptr<T>>,       Layout<ShapeC, StrideC>> C,
    int N)
{
    static_assert(rank_v<Layout<ShapeA, StrideA>> == 1, "A must be rank 1");
    static_assert(rank_v<Layout<ShapeB, StrideB>> == 1, "B must be rank 1");
    static_assert(rank_v<Layout<ShapeC, StrideC>> == 1, "C must be rank 1");
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C(idx) = A(idx) + B(idx);
    }
}

Compiler integration and diagnostics

TileTensor works with Mojo's where clause, so shape constraints participate in overload resolution and produce clear error messages when violated:

mojo
def my_kernel(
    tensor: TileTensor
) where tensor.static_shape[0] == 4:
    pass

def call_my_kernel():
    stack = InlineArray[Float32, 16](uninitialized=True)
    tensor = TileTensor(stack, row_major[4, 4]())
    my_kernel(tensor)

Passing a tensor with a different static shape produces an error that points directly at the violated constraint.

This extends to element-level constraints for Scalars or SIMD vectors:

mojo
def takes_scalar(s: Scalar) -> type_of(s):
    return s

def get_00_scalar(t: TileTensor[element_size=1, ...]):
    comptime assert t.flat_rank == 2
    result = takes_scalar(t[0, 0])

def takes_vector(s: SIMD[_, 4]) -> type_of(s):
    return s

def get_00_vector(t: TileTensor[element_size=4, ...]):
    comptime assert t.flat_rank == 2
    result = takes_vector(t[0, 0])

def kernel(t: TileTensor[element_size=1, ...]):
    scalar = get_00_scalar(t)
    vector = get_00_vector(t.vectorize[4]())

The element type returned by t[0, 0] is determined by element_size at compile time. No casting is required.
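As a rough Python analogy (not the Mojo API), element_size and vectorize change the granularity of indexing: a single index returns a SIMD-width chunk instead of one scalar.

```python
# Rough analogy (not the Mojo API): vectorizing by width 4 regroups a
# flat buffer so that one index yields a 4-wide element.
def vectorize(buf, width):
    assert len(buf) % width == 0, "length must divide evenly by width"
    return [buf[i : i + width] for i in range(0, len(buf), width)]

data = list(range(8))
print(data[0])                # 0: scalar element (element_size=1)
print(vectorize(data, 4)[0])  # [0, 1, 2, 3]: one vector element (element_size=4)
```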

Mojo's constraint system works nicely with TileTensor and lets us express safety requirements that the LSP can evaluate:

mojo
from layout import TileTensor, stack_allocation, row_major, col_major


def my_kernel(t: TileTensor) where t.is_row_major:
    pass


def calls_my_kernel():
    col_major_tensor = stack_allocation[DType.float32, AddressSpace.SHARED](
        col_major[16, 16]() # ((16, 16):(1, 16))
    )
    row_major_tensor = stack_allocation[DType.float32, AddressSpace.SHARED](
        row_major[16, 16]() # ((16, 16):(16, 1))
    )
    my_kernel(row_major_tensor) # works just fine
    my_kernel(col_major_tensor) # LSP error: invalid call to 'my_kernel': violated constraint

In the example above, TileTensor.is_row_major checks that the layout's final stride is statically 1. The parser then evaluates that the final stride of col_major_tensor's layout is 16, which violates that constraint.
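The check itself reduces to simple layout arithmetic. A Python sketch (illustrative only, not the Mojo implementation):

```python
# Illustrative sketch (not the Mojo implementation): a layout counts as
# row-major here when its innermost stride is statically 1.
def is_row_major(strides) -> bool:
    return strides[-1] == 1

print(is_row_major((16, 1)))  # True:  row_major[16, 16] -> ((16, 16):(16, 1))
print(is_row_major((1, 16)))  # False: col_major[16, 16] -> ((16, 16):(1, 16))
```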

What's next

TileTensor is now the default tensor type across Modular's kernel library, and the same pattern applies across attention, normalization, and matrix multiplication kernels. Static layouts eliminate entire categories of run-time overhead, and the type system catches layout mismatches before they reach hardware.

The next post breaks down the Mojo language features that made TileTensor possible and the benefits of its small run-time footprint.
