

Gemma 4 26B A4B

Gemma 4 26B A4B is a Mixture-of-Experts (MoE) model with 26B total parameters, of which only 4B are activated per forward pass — so you get the quality of a much larger model at a fraction of the compute cost. It also supports a 256K context window and is designed to fit within the memory footprint of high-end servers.
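As a rough sketch of why MoE activation matters, here is the back-of-envelope arithmetic using the figures from this page. Treating per-token compute as proportional to active parameters is a simplifying assumption, not a precise cost model:

```python
# Back-of-envelope: fraction of parameters active per token.
# Figures come from this page; per-token compute scaling linearly
# with active parameters is a simplifying assumption.
total_params_b = 26.8   # total parameters, in billions
active_params_b = 4.0   # parameters activated per forward pass, in billions

active_fraction = active_params_b / total_params_b
print(f"Active fraction per token: {active_fraction:.1%}")

# Under this approximation, a dense 26.8B model would do roughly
# 1 / active_fraction times more per-token matmul work.
print(f"Approximate per-token compute saving: {1 / active_fraction:.1f}x")
```

In other words, each token touches roughly 15% of the weights, which is where the "larger-model quality at a fraction of the compute" claim comes from.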

Example Usage

Input
Python

  from openai import OpenAI
  
  # Point the standard OpenAI client at Modular's OpenAI-compatible endpoint.
  client = OpenAI(
      base_url="https://api.modular.com",
      api_key="<your_api_token>",
  )
  
  response = client.chat.completions.create(
      model="google/gemma-4-26B-A4B-it",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Explain speculative decoding in 3 sentences."},
      ],
      stream=True,  # stream tokens back as they are generated
  )
  
  # Print each streamed chunk as it arrives; guard against chunks
  # with no choices (e.g. a final usage-only chunk) or empty deltas.
  for chunk in response:
      if chunk.choices and chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="")
Output

  Speculative decoding uses a smaller draft model to predict multiple
  tokens ahead, then verifies them against the full model in a single
  pass. Accepted tokens skip individual generation steps, improving
  throughput without sacrificing accuracy. It's most effective when the
  draft model closely matches the target model's distribution.
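The draft/verify loop described in that output can be sketched with toy "models" — plain Python functions standing in for real draft and target models. Everything here is illustrative, not an actual serving implementation:

```python
# Toy sketch of speculative decoding. `draft` and `target` are stand-ins
# for real language models: each maps a context (list of ints) to its
# next token. Illustrative only, not a real inference engine.

def target(context):
    # "Ground truth" next-token rule: last token plus one.
    return context[-1] + 1

def draft(context):
    # Cheap draft model that usually agrees with the target,
    # but guesses wrong whenever the last token is divisible by 4.
    if context[-1] % 4 == 0:
        return context[-1] + 2  # wrong guess
    return context[-1] + 1

def speculative_step(context, k=4):
    """Draft k tokens ahead, then verify them against the target model.

    Accepted tokens are kept; at the first mismatch we fall back to the
    target's own token, so the output always matches what the target
    alone would have generated.
    """
    # 1) Draft proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) Target verifies all k proposals "in one pass".
    accepted = []
    ctx = list(context)
    for t in proposed:
        expected = target(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: take the target's token and stop.
            accepted.append(expected)
            break
    return accepted

print(speculative_step([1], k=4))   # → [2, 3, 4, 5]: 3 drafts accepted, 1 corrected
print(speculative_step([4], k=4))   # → [5]: draft wrong immediately, target corrects
```

When the draft agrees with the target (as in the first call), several tokens are committed per verification pass, which is exactly where the throughput win comes from.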
Model Details
  • Developed by
    Google
  • Model family
    google/gemma-4-26B-A4B-it
  • Modality
    LLM, Vision
  • Context Window
    256K
  • Total Params
    26.8B
  • Precision
    BF16
  • Deployment options
    Shared, Dedicated, Self-hosted

Why choose Gemma 4 26B A4B on Modular?

  • High performance, out of the box

    Run leading open models with strong default performance and the ability to optimize down to the kernel — extracting more from every GPU.

  • Lower infrastructure costs

    Deploy efficiently across NVIDIA and AMD hardware to reduce GPU count, increase throughput, and avoid expensive closed-model licensing.

  • Easy integration

    Integrate through an OpenAI-compatible endpoint, swap models freely, and scale across clouds or hardware without redesigning your application stack.
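To make the "swap models freely" point concrete, here is a minimal sketch: with an OpenAI-compatible endpoint, swapping models means changing one string in the request payload. The helper function and the second model id below are hypothetical placeholders for illustration, not part of Modular's API or catalog:

```python
# Sketch: with an OpenAI-compatible endpoint, switching models changes
# one string in the request body. `build_chat_request` and the second
# model id are hypothetical, for illustration only.

def build_chat_request(model: str, prompt: str) -> dict:
    """Request body shared by every model behind the same endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

req_a = build_chat_request("google/gemma-4-26B-A4B-it", "Hello!")
req_b = build_chat_request("<another/model-id>", "Hello!")

# Only the model id differs; messages, auth, and endpoint stay the same.
assert {k for k in req_a if req_a[k] != req_b[k]} == {"model"}

# With a client configured as in the example above, either request
# would be sent as:
#   client.chat.completions.create(**req_a)
```

No other application code changes, which is what lets you benchmark and swap models without redesigning the stack.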

Want to self-host this model with our open source infrastructure?
Read How


Get started with Modular

  • Request a demo

    Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

    • Distributed, large-scale online inference endpoints

    • Highest performance to maximize ROI and minimize latency

    • Deploy in Modular cloud or your cloud

    • View all features with a custom demo

    Book a demo

    Talk with our sales lead Jay!

    30-minute demo. Evaluate with your workloads. Ask us anything.

  • Talk to us!

    Book a demo for a personalized walkthrough of Modular in your environment. Learn how teams use it to simplify systems and tune performance at scale.

    • Custom 30 min walkthrough of our platform

    • Cover specific model or deployment needs

    • Flexible pricing to fit your specific needs

    Book a demo


  • Start using MAX

    (FREE)

    Run any open source model in 5 minutes, then benchmark it. Scale it to millions yourself (for free!).

  • Start using Mojo

    (FREE)

    Install Mojo and get up and running in minutes. A simple install, familiar tooling, and clear docs make it easy to start writing code immediately.