
Gemma 3 12B: Cost-Efficient Multimodal Inference

Gemma 3 12B by Google DeepMind is a dense 12B-parameter instruction-tuned model supporting text and vision inputs.

Example Usage

Input
Python

  from openai import OpenAI
  
  # Point the standard OpenAI client at Modular's OpenAI-compatible endpoint.
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  # Request a streamed chat completion from the instruction-tuned Gemma 3 12B.
  response = client.chat.completions.create(
      model="google/gemma-3-12b-it",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Explain speculative decoding in 3 sentences."},
      ],
      stream=True,
  )
  
  # Print tokens as they arrive instead of waiting for the full response.
  for chunk in response:
      if chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="")
Output

  Speculative decoding uses a smaller draft model to predict multiple
  tokens ahead, then verifies them against the full model in a single
  pass. Accepted tokens skip individual generation steps, improving
  throughput without sacrificing accuracy. It's most effective when the
  draft model closely matches the target model's distribution.
Input

  Image: a cute orange cartoon fox with a small brown backpack sitting on grass in a bright, colorful forest.
Output

  "The image shows a fox wearing a backpack in a forest."
  
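The image request above goes through the same OpenAI-compatible chat API as the text example. The snippet below is a minimal sketch that assumes the endpoint accepts OpenAI-style image_url content parts; the local file fox.png is a placeholder.

Python

  import base64
  
  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  # Encode a local image as a data URL (fox.png is a hypothetical file).
  with open("fox.png", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode("utf-8")
  
  # Send text and image together as a multi-part user message.
  response = client.chat.completions.create(
      model="google/gemma-3-12b-it",
      messages=[
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": "Describe this image in one sentence."},
                  {
                      "type": "image_url",
                      "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                  },
              ],
          },
      ],
  )
  
  print(response.choices[0].message.content)
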
Model Details
  • Developed by
    Google DeepMind
  • Model family
    google/gemma-3-12b-it
  • Modality
    LLM, Vision
  • Context Window
    128K tokens
  • Total Params
    12B
  • Precision
    BF16 / QAT-INT4
  • Deployment options
    Shared, Dedicated, Self-hosted

Why choose Gemma 3 12B on Modular?

  • High Performance, Out of the Box

    Run leading open models with strong default performance and the ability to optimize down to the kernel — extracting more from every GPU.

  • Lower Infrastructure Costs

    Deploy efficiently across NVIDIA and AMD hardware to reduce GPU count, increase throughput, and avoid expensive closed-model licensing.

  • Easy Integration

    Integrate through an OpenAI-compatible endpoint, swap models freely, and scale across clouds or hardware without redesigning your application stack (see the sketch after this list).
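
Because the endpoint speaks the OpenAI chat protocol, switching models is a one-line change to the model string. The sketch below is a minimal illustration and assumes the server also exposes the standard /v1/models listing; any model id other than google/gemma-3-12b-it is a placeholder.

Python

  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  # Discover which models the endpoint serves (assumes the standard /v1/models route).
  for model in client.models.list():
      print(model.id)
  
  # Swapping models is just a different model id; the request shape stays the same.
  response = client.chat.completions.create(
      model="google/gemma-3-12b-it",  # replace with any other hosted model id
      messages=[{"role": "user", "content": "Say hello in one word."}],
  )
  print(response.choices[0].message.content)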

Gemma 3 12B
Want to self-host this model with our open source infrastructure?
Read How

Get started with Modular

  • Request a demo

    Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

    • Distributed, large-scale online inference endpoints

    • Highest performance to maximize ROI and minimize latency

    • Deploy in Modular cloud or your cloud

    • View all features with a custom demo

    Book a demo

    Talk with our sales lead Jay!

    30-minute demo. Evaluate with your workloads. Ask us anything.

  • Talk to us!

    Book a demo for a personalized walkthrough of Modular in your environment. Learn how teams use it to simplify systems and tune performance at scale.

    • Custom 30-minute walkthrough of our platform

    • Cover specific model or deployment needs

    • Flexible pricing to fit your specific needs

    Book a demo

    Talk with our sales lead Jay!

  • Start using MAX

    ( FREE )

    Run any open source model in 5 minutes, then benchmark it. Scale it to millions yourself (for free!).

  • Start using Mojo

    ( FREE )

    Install Mojo and get up and running in minutes. A simple install, familiar tooling, and clear docs make it easy to start writing code immediately.