Gemma 4 just dropped on Modular, Day Zero! Read More →

April 2, 2026

Day Zero Launch: Fastest Performance for Gemma 4 on NVIDIA and AMD

Modular Team

Product

Today, Google DeepMind released the Gemma 4 family of models, the latest state-of-the-art open models from the same team behind Google Gemini models. We’re thrilled to be day zero launch partners and invite developers and enterprises to try out Gemma 4’s multimodal capabilities on Modular Cloud with state of the art performance on both NVIDIA and AMD.

Our benchmarks show 15% higher throughput when compared to vLLM on NVIDIA B200.

Meet the Gemma 4 family

Modular is hosting all variants of Google’s new Gemma 4 family of models. All are natively multimodal, supporting text, images, and video with dynamic resolution and aspect ratio.

Gemma 4 31B is a 31-billion-parameter dense model featuring a redesigned architecture that improves both efficiency and long-context quality. With a 256K context window, it's built for demanding tasks that require deep reasoning across large inputs.

Gemma 4 26B A4B is a Mixture-of-Experts (MoE)model with 26B total parameters but only 4B activated per forward pass, meaning you get the quality of a much larger model at a fraction of the compute cost. It also supports a 256K context window and is designed to fit the memory footprint of high-end servers.

Whether you're running OCR, video and image understanding, or 256K-context workflows, the same MAX-powered engine that handles your initial tests runs your production Modular Cloud endpoint, so there are no surprises when you scale.

High performance Gemma 4 on NVIDIA and AMD

Modular Cloud runs on MAX, our inference framework that unifies GPU kernels, graph compilation, and high-performance serving in a single hardware-agnostic platform. Modular allows teams to move fast: we optimized Gemma 4 to state-of-the-art across NVIDIA and AMD and brought up a production-ready endpoint on Modular Cloud within a few days.

Performance is a given in MAX, and on NVIDIA B200, Gemma 4 on MAX is 15% faster than vLLM while experiencing no accuracy degradation. Stay tuned for more details coming soon.

Accuracy MAX (B200)vLLM* (B200)MAX (MI355)
MMLU Pro84.72%(not measured)84.94%
GSM8K Llama COT95.94%94.69%94.37%
ChartQA84.69%84.38%84.38%

*Benchmarked using the officially provided vllm-0.18.2rc1.dev73

Try Gemma 4 on Modular Cloud Today

Gemma 4 is one of the most capable open models available - but capability only matters if you can ship it. Modular Cloud gives you a straight line from first API call to production endpoint, optimized across NVIDIA and AMD hardware, with no infrastructure archaeology required.

Try it immediately - use this chat window where we've opened up a 10-prompt free experience.

Loading...

With Modular Cloud, you can:

  • Transition from playground to production without switching stacks - The same MAX-powered engine that runs your tests runs your production endpoint.
  • Choose the right GPU for your workload - Run on NVIDIA or AMD, whichever fits your cost and throughput requirements. The stack picks the right configuration and batch sizing so you're not guessing.
  • Ship multimodal and long-context features faster - OCR, video + image understanding, and 256K-context workflows are supported out of the box.

Start building on Modular Cloud →

Or, try it for free yourself with MAX.

Interested in running MAX on your own infrastructure or in your own cloud? Book a demo and we'll walk through a deployment built around your models, hardware, and performance goals.

Discover what Modular can do for you

Request a demo

Read more from Modular

View all blogs

Build the future of AI with Modular

View Editions
  • Person with blonde hair using a laptop with an Apple logo.

    Sign up today

    Signup to our Cloud Platform today to get started easily.

    Sign Up
  • Magnifying glass emoji with black handle and round clear lens.

    Browse open models

    Browse our model catalog, or deploy your own custom model

    Browse models
No items found.