Gemma 4 26B A4B: Optimized MoE Inference on NVIDIA & AMD

Gemma 4 26B A4B is a Mixture-of-Experts (MoE) model with 26B total parameters, of which only 4B are activated per token, so you get quality closer to that of a much larger model at a fraction of the compute cost. It also supports a 256K context window and is designed to fit within the memory of high-end servers.
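
To make the "total vs. activated parameters" idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind MoE layers. It is a toy under stated assumptions, not Gemma's actual architecture: the expert count (8), the number of active experts (2), the router scores, and the names top_k_route and moe_forward are all hypothetical and chosen only for illustration.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the k selected experts for one token and mix their outputs."""
    routes = top_k_route(router_logits, k)
    mixed = sum(weight * experts[i](x) for i, weight in routes)
    return mixed, [i for i, _ in routes]

# Hypothetical toy setup: 8 experts, 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
# Each "expert" is a stand-in function; expert i scales its input by i + 1.
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
# Router scores for a single token (made-up values).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2]

output, used = moe_forward(3.0, experts, logits, TOP_K)
print(f"experts that ran: {used} ({len(used)} of {NUM_EXPERTS})")
```

All eight experts have to be held in memory, but only two do any work for this token. That is why an MoE model's memory footprint tracks its total parameter count while its per-token compute tracks the activated count.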

Example Usage

Output

  Speculative decoding uses a smaller draft model to predict multiple
  tokens ahead, then verifies them against the full model in a single
  pass. Accepted tokens skip individual generation steps, improving
  throughput without sacrificing accuracy. It's most effective when the
  draft model closely matches the target model's distribution.
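
The draft-then-verify loop described in the output above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how any particular serving stack implements it: it uses greedy acceptance (production systems typically use a probabilistic acceptance rule and score all drafted positions in one batched pass), and target_next, draft_next, speculative_step, and generate are hypothetical names with toy "models" that simply count upward.

```python
def speculative_step(target_next, draft_next, prefix, num_draft):
    """One draft-then-verify round with greedy acceptance.

    target_next and draft_next map a token list to the next token.
    Returns the tokens accepted in this round (always at least one).
    """
    # 1. The cheap draft model proposes num_draft tokens autoregressively.
    proposed = []
    for _ in range(num_draft):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target model checks each proposed position. A real system does
    #    this in a single batched forward pass; the loop here is for clarity.
    accepted = []
    for token in proposed:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the verify pass yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

def generate(target_next, draft_next, prompt, max_new, num_draft=3):
    """Repeat speculative rounds until max_new tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens += speculative_step(target_next, draft_next, tokens, num_draft)
    return tokens[: len(prompt) + max_new]

# Toy models: the target counts upward modulo 10; the draft agrees with it
# except right after a 4, where it guesses wrong.
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

print(generate(target_next, draft_next, [0], 12))
```

With greedy acceptance the result is identical to decoding with the target model alone; the draft model only changes how many target passes are needed. The closer the draft tracks the target, the more tokens each round accepts.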

Code to use

Python


  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  response = client.chat.completions.create(
      model="google/gemma-4-26B-A4B-it",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Explain speculative decoding in 3 sentences."},
      ],
      stream=True,
  )
  
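  # Print tokens as they arrive; skip chunks that carry no text.
  for chunk in response:
      if chunk.choices and chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="", flush=True)
  print()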

Output

  “The image shows a fox wearing a backpack in a forest.”

Input image

Cute orange cartoon fox with a small brown backpack sitting on grass in a bright, colorful forest.

Code to use

  import base64
  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  with open("fox.png", "rb") as image_file:
      image_data = base64.b64encode(image_file.read()).decode("utf-8")
  
  response = client.chat.completions.create(
      model="google/gemma-4-26B-A4B-it",
      messages=[
          {
              "role": "user",
              "content": [
                  {
                      "type": "text",
                      "text": "Describe this image in one sentence.",
                  },
                  {
                      "type": "image_url",
                      "image_url": {
                          "url": f"data:image/png;base64,{image_data}"
                      },
                  },
              ],
          }
      ],
  )
  
  print(response.choices[0].message.content)
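
The example above hardcodes image/png in the data URL. If you send other formats, a small helper can derive the MIME type from the file extension instead. This is a sketch using only the Python standard library; the function names image_to_data_url and image_message are hypothetical, and whether the endpoint accepts formats other than PNG is an assumption you should verify.

```python
import base64
import mimetypes
from pathlib import Path

def image_to_data_url(path):
    """Encode an image file as a base64 data URL with a guessed MIME type."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def image_message(path, prompt):
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
        ],
    }
```

The returned dictionary has the same shape as the message in the example, so it can be passed as messages=[image_message("fox.png", "Describe this image in one sentence.")].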

Model Details

Developed by: Google
Model family: google/gemma-4-26B-A4B-it
Modality: LLM, Vision
Context Window: 256K
Total Params: 31.3B
Precision: FP4
Deployment options: Shared, Dedicated, Self-hosted

Why choose Gemma 4 26B A4B on Modular?

High performance, out of the box
Run leading open models with strong default performance and the ability to optimize down to the kernel — extracting more from every GPU.
Lower Infrastructure Costs
Deploy efficiently across NVIDIA and AMD hardware to reduce GPU count, increase throughput, and avoid expensive closed-model licensing.
Easy Integration
Integrate through an OpenAI-compatible endpoint, swap models freely, and scale across clouds or hardware without redesigning your application stack.

Want to self-host this model with our open source infrastructure?

Read How

🔥 Trending models

DeepSeek V4 Pro

DeepSeek V4 Pro is a 1.6T MoE model with 49B active parameters and a 1M context window, featuring hybrid attention for efficient long-context inference.

LLM

FLUX.2 Klein 9B

FLUX.2 [klein] 9B is a 9 billion parameter rectified flow transformer capable of generating images from text descriptions and supports multi-reference editing capabilities.

Image

GLM-5.2

GLM-5.2 is Zhipu AI's newest open-weights model, optimized for coding, agentic workloads, and sustained execution of ultra-long-horizon tasks. Built on the GLM-5.1 Mixture-of-Experts architecture (754B total parameters, ~40B active) and expanded to a 1M-token context window, it is designed for long-running agent tasks, large coding workloads, and long-context understanding.

LLM

Kimi K2.7 Code

Kimi K2.7 Code is Moonshot AI's coding-focused agentic model, built on the Kimi K2.6 architecture. It shares the same 1T-parameter Mixture-of-Experts design (32B active per token, 384 experts, MLA attention) with a MoonViT vision encoder and a 256K-token context window. K2.7 Code delivers substantial gains on real-world long-horizon software engineering tasks while reducing thinking-token usage by approximately 30% compared with K2.6. Thinking and preserve_thinking are always enabled for consistent reasoning across multi-turn agentic sessions.

LLM

Vision

Similar models

Gemma 4 31B

Gemma 4 31B is a 31-billion-parameter dense model featuring a redesigned architecture that improves both efficiency and long-context quality. With a 256K context window, it's built for demanding tasks that require deep reasoning across large inputs.

LLM

Vision

MiniMax M3

MiniMax M3 is an open-weight, natively multimodal frontier model from MiniMax with ~428B total parameters and ~23B activated parameters. It combines frontier-level coding and agentic performance, an ultra-long context window of up to 1M tokens, and mixed-modality training across text, image, and video. It introduces MiniMax Sparse Attention (MSA) to make million-token context computationally viable, delivering up to 9x prefill and 15x decode speedups over M2 at 1M context.

LLM

Vision

GLM-5.2

LLM

gpt-oss-120b

gpt-oss-120b by OpenAI is a 117B MoE model with 5.1B active parameters and 128 experts, featuring reasoning capabilities.

LLM

View all models

Get started with Modular

Request a demo
Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.
- Distributed, large-scale online inference endpoints
- Highest performance to maximize ROI and minimize latency
- Deploy in Modular cloud or your cloud
- View all features with a custom demo
Book a demo
Talk with our sales lead Jay!
30min demo. Evaluate with your workloads. Ask us anything.

Talk to us!
Book a demo for a personalized walkthrough of Modular in your environment. Learn how teams use it to simplify systems and tune performance at scale.
- Custom 30 min walkthrough of our platform
- Cover specific model or deployment needs
- Flexible pricing to fit your specific needs
Start using MAX (FREE)
Run any open source model in 5 minutes, then benchmark it. Scale it to millions yourself (for free!).
Install MAX
What is MAX?
Start using Mojo (FREE)
Install Mojo and get up and running in minutes. A simple install, familiar tooling, and clear docs make it easy to start writing code immediately.
Install Mojo🔥
What is Mojo🔥?
