Why LLM Inference Needs a New Kind of Router

This series walks through why traditional HTTP routing breaks down under LLM workloads and how Modular Cloud solves it with a three-layer architecture built for cache-aware routing.

Posts in this series

Why LLM Inference Needs a New Kind of Router - Part 1

HTTP routing has been a solved problem for many years. Round-robin, consistent hashing, least-connections. Pick one, put it in front of a pool of identical servers, and the traffic spreads pretty evenly.

May 8, 2026

Why LLM Inference Needs a New Kind of Router - Part 2

To route a request to the pod with the best cached prefix, you need to know which blocks are cached on which pod. That sounds simple until you look at the numbers. You may have hundreds of pods, each with thousands of cached blocks, and that state can change hundreds of times per second. Despite this scale and churn, lookups need to return in microseconds because they sit on the critical path of every inference request.

May 21, 2026
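
To make the lookup described in the Part 2 summary concrete, here is a minimal sketch of a block index, not the Modular Cloud implementation. It assumes prompts are split into fixed-size token blocks and that each block hash is chained to the previous one, so a hash identifies the whole prefix up to that block. The names `PrefixIndex`, `block_hashes`, and `BLOCK_SIZE` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per block; an assumed value for illustration


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the prefix ending there."""
    hashes = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        block = tokens[i:i + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


class PrefixIndex:
    """Hypothetical in-memory map from block hash to the pods caching it."""

    def __init__(self):
        self._pods_by_block = defaultdict(set)

    def add(self, pod, block_hash):
        self._pods_by_block[block_hash].add(pod)

    def evict(self, pod, block_hash):
        self._pods_by_block[block_hash].discard(pod)

    def best_pod(self, tokens):
        """Return (pod, matched_blocks) for the longest cached prefix."""
        candidates = None
        best = (None, 0)
        for depth, h in enumerate(block_hashes(tokens), start=1):
            pods = self._pods_by_block.get(h, set())
            candidates = pods if candidates is None else candidates & pods
            if not candidates:
                break
            # Ties are broken by name here; a real router would also weigh load.
            best = (min(candidates), depth)
        return best
```

A router would call `best_pod` on the tokenized prompt and fall back to a load-based choice when no pod matches. The sketch ignores concurrency and memory layout, which is where the microsecond budget is actually spent.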

Why LLM Inference Needs a New Kind of Router - Part 3

Most routing stacks ship with a fixed set of algorithms: round-robin, least-requests, consistent hashing, and so on. These are generally independent implementations rather than composable components. As a result, when a customer asks for "consistent hashing with a concurrency cap" or "cache-aware with session stickiness," each request means writing a new algorithm from scratch. Disaggregated prefill/decode multiplies the variants further. Every variant traditionally has its own HTTP handler, discovery logic, proxy code, and session management, which adds up to hundreds of lines of extra plumbing per variant.

June 5, 2026
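
As a rough illustration of the composability argument in the Part 3 summary, the sketch below builds "consistent hashing with a concurrency cap" from two small stages instead of a new algorithm. It is an assumption-laden example, not the design the series describes: `Pod`, `concurrency_cap`, `hash_order`, and `compose` are hypothetical names, and rendezvous (highest-random-weight) hashing stands in for a consistent-hash ring because it is shorter to write.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    in_flight: int = 0  # requests currently being served


def concurrency_cap(limit):
    """Stage: drop pods that already have `limit` requests in flight."""
    def keep(pods, request):
        return [p for p in pods if p.in_flight < limit]
    return keep


def hash_order(key_field):
    """Stage: order pods by rendezvous hash of the request key and pod name."""
    def order(pods, request):
        key = request[key_field]

        def weight(pod):
            return hashlib.sha256(f"{key}|{pod.name}".encode()).hexdigest()

        return sorted(pods, key=weight, reverse=True)
    return order


def compose(*stages):
    """Chain stages into one routing function; returns a pod or None."""
    def route(pods, request):
        for stage in stages:
            pods = stage(pods, request)
        return pods[0] if pods else None
    return route


# "Consistent hashing with a concurrency cap" as a composition of two stages.
sticky_capped = compose(concurrency_cap(2), hash_order("session"))
```

Because every stage takes and returns a list of pods, a further variant such as session stickiness on top of cache awareness would be one more stage in the `compose` call rather than a separate handler, under the assumptions of this sketch.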
