
Blog


Democratizing AI Compute Series

Go behind the scenes of the AI industry with Chris Lattner

Latest

News / Engineering

Why LLM Inference Needs a New Kind of Router - Part 1

HTTP routing has been a solved problem for many years. Round-robin, consistent hashing, least-connections. Pick one, put it in front of a pool of identical servers, and the traffic spreads pretty evenly.
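Two of those classic strategies fit in a few lines each. A minimal Python sketch, with hypothetical server names and no real networking, just to show the selection logic:

```python
from itertools import cycle

# Hypothetical pool of identical servers (names are illustrative only).
servers = ["app-1", "app-2", "app-3"]

# Round-robin: hand out servers in a fixed rotation.
rr = cycle(servers)
def round_robin():
    return next(rr)

# Least-connections: pick the server with the fewest active requests.
active = {s: 0 for s in servers}
def least_connections():
    target = min(active, key=active.get)
    active[target] += 1  # caller decrements when the request finishes
    return target

# Six round-robin picks spread evenly across the three servers.
picks = [round_robin() for _ in range(6)]
```

Both work precisely because the backends are interchangeable, which is the assumption LLM inference breaks.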

May 8, 2026 / Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, Michael Dunn-OConnor

News / Product

Modular 26.3: Mojo 1.0 Beta, MAX Video Gen, and more

Surprise: Mojo 1.0 is officially in beta! Modular’s 26.3 release includes new features and modalities, but the headline is that we’ve officially hit beta for Mojo 1.0, with a clear plan to finalize Mojo 1.0 in the coming months. We share details below, alongside other key announcements in our 26.3 release including video generation in MAX with Wan 2.2 and MAX framework updates.

May 7, 2026 / Modular Team

News / Community

Modverse #54: AMD AI DevDay, New Modular Offices, and a Community That Keeps Shipping

There was a lot to celebrate in April: the community shipped GPU renderers, FFmpeg bindings, raylib wrappers, BLAS routines, and a 2D graphics API, just to name a few. The team connected with tons of developers at AMD AI DevDay and our joint meetup with AMD, two new Modular offices opened on two different continents, and Gemma 4 launched with same-day support on NVIDIA and AMD. Here’s the April roundup.

May 4, 2026 / Caroline Frasca

News / Case Study

How Frontier Coding Agents Built a Video Diffusion Pipeline on MAX

In a clear demonstration of how rapidly AI coding agents are becoming capable of challenging systems engineering work, two of the five agents produced a working MAX pipeline. The models we tested were:

April 16, 2026 / Rajan Agarwal, Evan Chu, Tim Davis, Eric Johnson

News / Engineering

TileTensor Part 1 - Safer, More Efficient GPU Kernels

Suppose you want to load a 2D tile of a matrix, where the tile is stored in shared memory in a specific interleaved layout to avoid bank conflicts. This example uses a toy XOR swizzle to illustrate the class of bugs; real kernels use hardware- and layout-specific swizzles and vectorized accesses. Without a layout abstraction, here is how you would launch a kernel with a block size of (32,8):
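The index math behind a toy XOR swizzle can be sketched in a few lines of Python. The 8x8 tile size and the swizzle itself are illustrative assumptions, not TileTensor's actual layout; real kernels, as the post notes, use hardware- and layout-specific swizzles:

```python
# Toy XOR swizzle over an 8x8 tile (sizes are assumptions for illustration).

TILE = 8

def linear(row, col):
    # Naive row-major layout: a column of threads produces strided addresses
    # that can pile into the same shared-memory bank.
    return row * TILE + col

def swizzled(row, col):
    # XOR the column index with the row index so each row permutes its
    # columns differently, spreading same-column accesses across banks.
    return row * TILE + (col ^ row)

# XOR with a value below TILE permutes columns within a row, so the swizzled
# layout is still a bijection: every (row, col) maps to a unique slot.
slots = {swizzled(r, c) for r in range(TILE) for c in range(TILE)}
```

Getting this mapping wrong on either the store or the load side silently corrupts data, which is exactly the class of bug a layout abstraction rules out.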

April 13, 2026 / Lukas Hermann

News / Company

Modular Opens Edinburgh & San Francisco Offices

Modular is growing! We've opened offices in two cities with deep ties to the work we do.

April 10, 2026 / Conor Bronsdon

News / Engineering

Structured Mojo Kernels Part 4 - Portability and the Road Ahead

GPU portability has a mixed track record. “Write once, run everywhere” usually means “write once, run slowly everywhere.” CUTLASS does not attempt portability beyond NVIDIA hardware, and even there it is often tied to a single hardware generation. Triton offers portability, but performance degrades on non-NVIDIA targets. The conventional wisdom is that you must choose between being portable and being fast.

April 3, 2026 / Fabio Riccardi, Modular Kernel Team

News / Product

Day Zero Launch: Fastest Performance for Gemma 4 on NVIDIA and AMD

Our benchmarks show 15% higher throughput than vLLM on NVIDIA B200.

April 2, 2026 / Modular Team

News / Company

Modverse #53: From GTC to Edinburgh, a Community Building Momentum

This edition covers one of the busiest stretches in Modular's recent history: four days at GTC, a new office on another continent, fresh community builds, and a release that expands what MAX and Mojo🔥 can do. Here's everything that's been happening across the ecosystem.

March 31, 2026 / Inaara Walji

News / Engineering

Software Pipelining for GPU Kernels: Part 1 - The Pipeline Problem

Flash Attention is a simple algorithm: tiled back-to-back matmuls with an online softmax in between. It fits in a few dozen lines of pseudocode. Yet Flash Attention 4's production kernel is 2,875 lines, and the hardest part to get right isn't the math: it's the async execution and pipelining synchronization, all hand-derived from a schedule that no standard debugging tool can verify.
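The online softmax at the core of that pseudocode is itself tiny. A minimal Python sketch (scalar, single-pass; not the MAX/Mojo kernel): keep a running max `m` and a running sum `s` of `exp(x - m)`, rescaling `s` whenever a new maximum appears, so scores can be consumed tile by tile without overflow:

```python
import math

def online_softmax(xs):
    m = float("-inf")  # running max of the values seen so far
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final normalization uses the global max and rescaled sum.
    return [math.exp(x - m) / s for x in xs]

probs = online_softmax([1.0, 2.0, 3.0])
```

One pass over the data, constant extra state per row: that property is what makes the tiling legal, and the hard 2,875 lines are about overlapping these updates with the matmuls, not the arithmetic itself.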

March 30, 2026 / Yingbo Ma


Build the future of AI with Modular

View Editions
  • Sign up today

    Sign up for our Cloud Platform today to get started easily.

    Sign Up
  • Browse open models

    Browse our model catalog, or deploy your own custom model.

    Browse models