Blog

🚨

News

Engineering

Why LLM Inference Needs a New Kind of Router - Part 3

Most routing stacks ship with a fixed set of algorithms: round-robin, least-requests, consistent hashing, etc. These are generally independent implementations rather than composable components. As a result, when a customer asks for "consistent hashing with a concurrency cap" or "cache-aware with session stickiness," it requires adding a new algorithm from scratch. Disaggregated prefill/decode increases this proliferation. Every variant traditionally has its own HTTP handler, discovery logic, proxy code, and session management. That requires hundreds of lines of additional plumbing per variant.

June 5, 2026

Aayush Deshpande

Deep Dhillon

Alexandr Nikitin

Michael Dunn-OConnor

Read

🚨

News

Engineering

Three trends from MLSys 2026

The shared conclusion of these talks was that agentic engineering requires substantially greater rigor in specification, design, and validation.

May 29, 2026

Michael Dunn-OConnor

Brian Zhang

Shouzheng Liu

Read

🚨

News

Engineering

Why LLM Inference Needs a New Kind of Router - Part 2

To route a request to the pod with the best cached prefix, you need to know which blocks are cached on which pod. That sounds simple until you look at the numbers. You may have hundreds of pods, each with thousands of cached blocks. State can change hundreds of times per second. Across this complexity, queries need to return in microseconds because they sit on the critical path of every inference request.

May 21, 2026

Aayush Deshpande

Deep Dhillon

Alexandr Nikitin

Michael Dunn-OConnor

Read

🚨

News

Engineering

Why LLM Inference Needs a New Kind of Router - Part 1

HTTP routing has been a solved problem for many years. Round-robin, consistent hashing, least-connections. Pick one, put it in front of a pool of identical servers, and the traffic spreads pretty evenly.

May 8, 2026

Aayush Deshpande

Deep Dhillon

Alexandr Nikitin

Michael Dunn-OConnor

Read

🚨

News

Engineering

TileTensor Part 1 - Safer, More Efficient GPU Kernels

Suppose you want to load a 2D tile of a matrix, where the tile is stored in shared memory in a specific interleaved layout to avoid bank conflicts. This example uses a toy XOR swizzle to illustrate the class of bugs; real kernels use hardware- and layout-specific swizzles and vectorized accesses. Without a layout abstraction, here is how you would launch a kernel with a block size of (32,8):

April 13, 2026

Lukas Hermann

Read

🚨

News

Engineering

Structured Mojo Kernels Part 4 - Portability and the Road Ahead

GPU portability has a mixed track record. “Write once, run everywhere” usually means “write once, run slowly everywhere.” CUTLASS does not attempt portability beyond NVIDIA hardware and is usually limited within a generation of the hardware. Triton provides portability but performance degrades on non-NVIDIA targets. The conventional wisdom is that you have to choose between being portable or being fast.

April 3, 2026

Fabio Riccardi

Modular Kernel Team

Read

🚨

News

Engineering

Software Pipelining for GPU Kernels: Part 1 - The Pipeline Problem

Flash Attention is a simple algorithm: tiled back-to-back matmuls with an online softmax algorithm in between. The algorithm fits in a few dozen lines of pseudocode. Yet Flash Attention 4's production kernel is 2,875 lines, and the hardest part to get right isn't the math. It's the async execution and pipelining synchronization, all hand-derived from a schedule that no standard debugging tool can verify.

March 30, 2026

Yingbo Ma

Read

🚨

News

Engineering

Structured Mojo Kernels Part 3 - Composition in Practice

This post shows the practical benefit of this modular design. We take two real kernel families, conv2d and block-scaled matmul, and trace exactly how they are built around the matmul foundation. In both cases, a new kernel family requires changing one component while leaving the rest untouched. The conv2d kernel adds roughly 130 lines of new code, whileBlock-scaled matmul adds roughly 200 with no performance degradation.

March 26, 2026

Fabio Riccardi

Modular Kernel Team

Read

🚨

News

Engineering

Structured Mojo Kernels Part 2 - The Three Pillars

This post explains the components of Structured Mojo Kernels: TileIO, TilePipeline, and TileOp. Each component forms a node in a kernel execution pipeline, and the links between them create a logical separation of concerns that makes kernels easier to extend and update. That organization matters because GPU kernels don't stay static. By abstracting hardware optimized implementations into patterns, the same kernel structure can adapt across NVIDIA and AMD hardware generations with minimal rewrite.

March 11, 2026

Fabio Riccardi

Modular Kernel Team

Read

🚨

News

Engineering

Structured Mojo Kernels Part 1 - Peak Performance, Half the Code

GPU programming has always demanded precision, but the cost of that precision keeps rising. A production matmul kernel written in C++ spans 3,000–5,000 lines of tightly coupled code where a misplaced barrier silently corrupts results. That complexity gatekeeps hardware that should be available to far more developers, and it's a direct product of how GPUs have evolved: with each architecture generation, more of the orchestration burden has shifted onto the programmer.

March 5, 2026

Fabio Riccardi

Modular Kernel Team

Read

Series
Democratizing Compute
Go behind the scenes of the AI industry in this blog series by Chris Lattner. Trace the evolution of AI compute, dissect its current challenges, and discover how Modular is raising the bar with the world’s most open inference stack.
11 part series
View Series
Series
Matrix Multiplication on Blackwell
Learn how to write a high-performance GPU kernel on Blackwell that offers performance competitive to that of NVIDIA's cuBLAS implementation while leveraging Mojo's special features to make the kernel as simple as possible.
4 part series
View Series
Series
Structured Mojo Kernels
Learn how Mojo simplifies GPU programming with modular kernel architecture, compile-time abstractions, and zero-cost performance across modern GPU hardware.
4 part series
View Series
Series
Software Pipelining for GPU Kernels
Explore software pipelining for GPU kernels from first principles. We formalize dependencies as a graph, solve for the optimal schedule with a constraint solver, and show how it all integrates into MAX via pure Mojo.
1 part series
View Series
Series
Why LLM Inference Needs a New Kind of Router
This series walks through why traditional HTTP routing breaks down under LLM workloads and how Modular Cloud solves it with a three-layer architecture built for cache-aware routing.
2 part series
View Series
Series
TileTensor
This series walks through how Modular built TileTensor, a Mojo tensor type that lets kernel authors express complex memory layouts precisely, safely, and efficiently.
1 part series
View Series

No items found within this category

We couldn’t find anything. Try changing or resetting your filters.

Build the future of AI with Modular

Get started - FREE

View Editions

Sign up today
Signup to our Cloud Platform today to get started easily.
Sign Up
Browse open models
Browse our model catalog, or deploy your own custom model
Browse models

Blog

Sign up for our newsletter