
Blog


Democratizing AI Compute Series

Go behind the scenes of the AI industry with Chris Lattner

Latest

Company

Day Zero: MiniMax M3 Open Weights on Modular Cloud

To avoid the repeated loads, MSA inverts the mapping: it groups queries by the KV block they selected and executes in key-block-major order, which MiniMax calls “KV outer gather Q”. Arithmetic intensity improves because each block is loaded once, partial attention is computed for every query that selected it, and the partial results are then merged.
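The idea in this excerpt can be illustrated with a small pure-Python sketch. It is not MiniMax's or Modular's kernel: the function names, the scalar values, and the per-query `(max, denominator, weighted sum)` state are all assumptions chosen for illustration. It inverts a query-to-block selection into block-to-queries, touches each block once, and merges partial softmax results with the usual running-max rescaling.

```python
import math
from collections import defaultdict

def group_queries_by_block(selected):
    """Invert query -> selected KV blocks into KV block -> queries."""
    by_block = defaultdict(list)
    for q, blocks in enumerate(selected):
        for b in blocks:
            by_block[b].append(q)
    return dict(by_block)

def merge_partials(a, b):
    """Merge two partial softmax results: (max logit, denominator, weighted value sum)."""
    m_a, d_a, v_a = a
    m_b, d_b, v_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, d_a * s_a + d_b * s_b, v_a * s_a + v_b * s_b

def block_major_attention(scores, values, selected):
    """Hypothetical layout: scores[q][b] holds the logits of query q against
    the keys of block b; values[b] holds one scalar value per key."""
    state = {}
    for b, queries in group_queries_by_block(selected).items():
        block_values = values[b]  # the block is fetched once for all its queries
        for q in queries:
            logits = scores[q][b]
            m = max(logits)
            w = [math.exp(x - m) for x in logits]
            partial = (m, sum(w), sum(wi * vi for wi, vi in zip(w, block_values)))
            state[q] = merge_partials(state[q], partial) if q in state else partial
    return {q: v / d for q, (_, d, v) in state.items()}
```

A real kernel would work on tiles of vectors on the GPU; the scalar version only shows the grouping and the merge step.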

June 11, 2026 / Modular Team

Community

Modverse #55: Mojo 1.0 Beta, Community Mojo Libraries, and Real-Time Patient Conversations Powered by MAX

This edition captures everything happening across the Modular ecosystem, from developers building with MAX and Mojo🔥 to the broader impact Modular is having across AI infrastructure. Here's a look at what's been happening lately.

June 10, 2026 / Caroline Frasca

Engineering

Why LLM Inference Needs a New Kind of Router - Part 3

Most routing stacks ship with a fixed set of algorithms: round-robin, least-requests, consistent hashing, etc. These are generally independent implementations rather than composable components. As a result, when a customer asks for "consistent hashing with a concurrency cap" or "cache-aware with session stickiness," it requires adding a new algorithm from scratch. Disaggregated prefill/decode increases this proliferation. Every variant traditionally has its own HTTP handler, discovery logic, proxy code, and session management. That requires hundreds of lines of additional plumbing per variant.
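As a rough illustration of what composable components could look like, here is a minimal Python sketch of "consistent hashing with a concurrency cap" built from two reusable stages. Everything in it is hypothetical (`Pod`, `concurrency_cap`, `hash_pick`, `compose`) and is not the router described in the post. The picker uses rendezvous (highest-random-weight) hashing as a stand-in for a consistent-hashing ring.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    in_flight: int = 0

def concurrency_cap(limit):
    """Filter stage: keep only pods below the in-flight request limit."""
    def stage(pods, request):
        return [p for p in pods if p.in_flight < limit]
    return stage

def hash_pick(key_field):
    """Picker stage: rendezvous hashing on one request field."""
    def pick(pods, request):
        key = str(request[key_field])
        def weight(p):
            return hashlib.sha256(f"{p.name}|{key}".encode()).hexdigest()
        return max(pods, key=weight)
    return pick

def compose(*filters, pick):
    """Build a routing policy from any number of filters plus one picker."""
    def route(pods, request):
        candidates = list(pods)
        for f in filters:
            candidates = f(candidates, request)
        if not candidates:
            raise RuntimeError("no eligible pod")
        return pick(candidates, request)
    return route
```

For example, `compose(concurrency_cap(2), pick=hash_pick("session"))` keeps a session on the same pod until that pod reaches two in-flight requests, then falls over to the next-best pod for that key. A new variant is a new combination of stages rather than a new algorithm.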

June 5, 2026 / Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, Michael Dunn-OConnor

Engineering

Three trends from MLSys 2026

The shared conclusion of these talks was that agentic engineering requires substantially greater rigor in specification, design, and validation.

May 29, 2026 / Michael Dunn-OConnor, Brian Zhang, Shouzheng Liu

Engineering

Why LLM Inference Needs a New Kind of Router - Part 2

To route a request to the pod with the best cached prefix, you need to know which blocks are cached on which pod. That sounds simple until you look at the numbers. You may have hundreds of pods, each with thousands of cached blocks, and that state can change hundreds of times per second. Despite that scale and churn, queries must return in microseconds because they sit on the critical path of every inference request.
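To make the lookup concrete, here is a minimal in-memory sketch, assuming each cached block is identified by a hash and a request arrives as an ordered list of block hashes. The `BlockIndex` class and its methods are hypothetical and ignore the concurrency and update-rate concerns the excerpt raises. It only shows the query shape: walk the prefix and intersect the sets of pods that hold each block.

```python
from collections import defaultdict

class BlockIndex:
    """Maps a block hash to the set of pods currently caching that block."""

    def __init__(self):
        self._pods_by_block = defaultdict(set)

    def add(self, pod, block_hash):
        self._pods_by_block[block_hash].add(pod)

    def evict(self, pod, block_hash):
        self._pods_by_block[block_hash].discard(pod)

    def best_pod(self, block_hashes):
        """Return (pod, matched_blocks) for the longest cached prefix,
        or (None, 0) when no pod holds the first block."""
        candidates = None
        best = (None, 0)
        for depth, h in enumerate(block_hashes, start=1):
            holders = self._pods_by_block.get(h, set())
            candidates = set(holders) if candidates is None else candidates & holders
            if not candidates:
                break
            # Ties are broken by name so the result is deterministic.
            best = (min(candidates), depth)
        return best
```

A production index would need lock-free or sharded structures to stay within a microsecond budget; this sketch trades all of that for readability.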

May 21, 2026 / Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, Michael Dunn-OConnor

Community

How I built a pure Mojo app (and 10 libraries) with AI agents

To build it, I needed libraries that did not exist yet or did not support the exact required features. So I built them:

May 19, 2026 / Ehsan M. Kermani

Company

Hippocratic AI partners with Modular to power flexible, high-quality inference for real-time patient conversations

Every millisecond matters in real-time voice, and at Hippocratic AI's scale, latency gains compound directly into better patient experience and per-node efficiency. Production deployments run across multiple frameworks, including SGLang and vLLM, with ongoing evaluation of emerging frameworks for additional latency headroom, alongside a hardware roadmap spanning NVIDIA, AMD, and future-generation accelerators.

May 18, 2026 / Modular Team

Product

Translating to Mojo via AI Agents

At Modular, we’re always experimenting with the latest agentic programming tools, integrating the best ones into our workflows, and learning quite a few lessons along the way. One thing we realized is that the Mojo language is ideally suited to the needs of modern AI coding agents.

May 13, 2026 / Brad Larson, Modular Team

Product

Inkwell: Why Your Inference Platform Matters As Much As Your Model

Inkwell is a web app that lets users create interactive storybooks with a custom character along infinite branching paths. When the user opens a story, the first page of text and art streams in: text appears character-by-character via WebSocket within the first second, the illustration paints in as the user reads, and by the time the user taps a choice, the next page is already written and illustrated. Creating a user experience around the seamless generation of new content requires an inference layer that can perform at scale.

May 12, 2026 / Tim Davis

Engineering

Why LLM Inference Needs a New Kind of Router - Part 1

HTTP routing has been a solved problem for many years. Round-robin, consistent hashing, least-connections. Pick one, put it in front of a pool of identical servers, and the traffic spreads pretty evenly.

May 8, 2026 / Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, Michael Dunn-OConnor

