Blog

Democratizing AI Compute Series
Go behind the scenes of the AI industry with Chris Lattner
Latest

Why LLM Inference Needs a New Kind of Router - Part 2
In Part 1, we argued that LLM routing is qualitatively different from HTTP routing. Inference backends hold state that traditional load balancers ignore. This post covers the first of the three layers we identified: the data layer that makes that state queryable on the hot path of every inference request.

Hippocratic AI partners with Modular to power flexible, high-quality inference for real-time patient conversations
Every millisecond matters in real-time voice, and at Hippocratic AI's scale latency gains compound directly into better patient experience and per-node efficiency. Production deployments run across multiple frameworks, including SGLang and vLLM, with ongoing evaluation of emerging frameworks for additional latency headroom, alongside a hardware roadmap spanning NVIDIA, AMD, and future-generation accelerators.

Translating to Mojo via AI Agents
At Modular, we’re always experimenting with the latest agentic programming tools, integrating the best ones into our workflows, and learning quite a few lessons along the way. One thing we realized is that the Mojo language is ideally suited to the needs of modern AI coding agents.

Inkwell: Why Your Inference Platform Matters As Much As Your Model
Inkwell is a web app that lets users create interactive storybooks with a custom character along infinite branching paths. When you open a story, the first page of text and image art streams in: text appears character by character via WebSocket within the first second, the illustration paints in as you read, and by the time you tap a choice, the next page is already written and illustrated. Creating a user experience around the seamless generation of new content requires an inference layer that can perform at scale.

Modular 26.3: Mojo 1.0 Beta, MAX Video Gen, and more
Surprise: Mojo 1.0 is officially in beta! Modular’s 26.3 release includes new features and modalities, but the headline is the Mojo 1.0 beta, with a clear plan to finalize 1.0 in the coming months. We share details below, alongside other key announcements in 26.3, including video generation in MAX with Wan 2.2 and MAX framework updates.

Modverse #54: AMD AI DevDay, New Modular Offices, and a Community That Keeps Shipping
There was a lot to celebrate in April: the community shipped GPU renderers, FFmpeg bindings, raylib wrappers, BLAS routines, and a 2D graphics API, just to name a few. The team connected with tons of developers at AMD AI DevDay and our joint meetup with AMD, two new Modular offices opened on two different continents, and Gemma 4 launched with same-day support on NVIDIA and AMD. Here’s the April roundup.

TileTensor Part 1 - Safer, More Efficient GPU Kernels
Suppose you want to load a 2D tile of a matrix, where the tile is stored in shared memory in a specific interleaved layout to avoid bank conflicts. This example uses a toy XOR swizzle to illustrate the class of bugs; real kernels use hardware- and layout-specific swizzles and vectorized accesses. Without a layout abstraction, here is how you would launch a kernel with a block size of (32,8):
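
The bank-conflict problem this excerpt describes can be illustrated with a small host-side model. The sketch below is not the kernel from the post; it is a Python simulation under stated assumptions: 32 shared-memory banks of one 4-byte word each, a 32x32 tile of words, and a toy swizzle that XORs the column index with the row index. All names (`linear_index`, `swizzled_index`, `worst_bank_conflict`) are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumptions for this toy model (not taken from the post):
NUM_BANKS = 32   # shared memory modeled as 32 banks, one word wide
TILE_ROWS = 32   # tile height in words
TILE_COLS = 32   # tile width in words


def linear_index(row: int, col: int) -> int:
    """Plain row-major word offset of element (row, col)."""
    return row * TILE_COLS + col


def swizzled_index(row: int, col: int) -> int:
    """Toy XOR swizzle: permute columns within each row by XOR with the row."""
    return row * TILE_COLS + (col ^ (row % TILE_COLS))


def worst_bank_conflict(index_fn, col: int) -> int:
    """Largest number of rows whose element in `col` maps to the same bank.

    Models one thread per row reading down a single column of the tile.
    """
    banks = Counter(index_fn(row, col) % NUM_BANKS for row in range(TILE_ROWS))
    return max(banks.values())


if __name__ == "__main__":
    print("linear  :", worst_bank_conflict(linear_index, 5))
    print("swizzled:", worst_bank_conflict(swizzled_index, 5))
```

Under these assumptions, a column read through `linear_index` lands every row in the same bank, while `swizzled_index` spreads the same read across distinct banks. Because the swizzle only permutes columns within a row, every element still has a unique slot; getting that index arithmetic wrong by hand is the class of bug a layout abstraction is meant to prevent.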




