
Gemma 3 12B: Cost-Efficient Multimodal Inference

Gemma 3 12B by Google DeepMind is a dense 12B-parameter instruction-tuned model supporting text and vision inputs.

Example Usage

Input
Python

  from openai import OpenAI
  
  # Point the standard OpenAI client at Modular's OpenAI-compatible endpoint.
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  # Request a streamed chat completion from the instruction-tuned Gemma 3 12B.
  response = client.chat.completions.create(
      model="google/gemma-3-12b-it",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Explain speculative decoding in 3 sentences."},
      ],
      stream=True,
  )
  
  # Print tokens as they arrive instead of waiting for the full response.
  for chunk in response:
      if chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="")
Output

  Speculative decoding uses a smaller draft model to predict multiple
  tokens ahead, then verifies them against the full model in a single
  pass. Accepted tokens skip individual generation steps, improving
  throughput without sacrificing accuracy. It's most effective when the
  draft model closely matches the target model's distribution.
Input

  Image: a cute orange cartoon fox with a small brown backpack sitting on grass in a bright, colorful forest.
Output

  "The image shows a fox wearing a backpack in a forest."
  
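The image request above goes through the same OpenAI-compatible chat API as the text example. The snippet below is a minimal sketch that assumes the endpoint accepts OpenAI-style image_url content parts; the local file fox.png is a placeholder.

Python

  import base64
  
  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  # Encode a local image as a data URL (fox.png is a hypothetical file).
  with open("fox.png", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode("utf-8")
  
  # Send text and image together as a multi-part user message.
  response = client.chat.completions.create(
      model="google/gemma-3-12b-it",
      messages=[
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": "Describe this image in one sentence."},
                  {
                      "type": "image_url",
                      "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                  },
              ],
          },
      ],
  )
  
  print(response.choices[0].message.content)
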
Model Details
  • Developed by
    Google DeepMind
  • Model family
    google/gemma-3-12b-it
  • Modality
    LLM, Vision
  • Context Window
    128K tokens
  • Total Params
    12B
  • Precision
    BF16 / QAT-INT4
  • Deployment options
    Shared, Dedicated, Self-hosted

Why choose Gemma 3 12B on Modular?

  • High Performance, Out of the Box

    Run leading open models with strong default performance and the ability to optimize down to the kernel — extracting more from every GPU.

  • Lower Infrastructure Costs

    Deploy efficiently across NVIDIA and AMD hardware to reduce GPU count, increase throughput, and avoid expensive closed-model licensing.

  • Easy Integration

    Integrate through an OpenAI-compatible endpoint, swap models freely, and scale across clouds or hardware without redesigning your application stack (see the sketch after this list).
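
Because the endpoint speaks the OpenAI chat protocol, switching models is a one-line change to the model string. The sketch below is a minimal illustration and assumes the server also exposes the standard /v1/models listing; any model id other than google/gemma-3-12b-it is a placeholder.

Python

  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  # Discover which models the endpoint serves (assumes the standard /v1/models route).
  for model in client.models.list():
      print(model.id)
  
  # Swapping models is just a different model id; the request shape stays the same.
  response = client.chat.completions.create(
      model="google/gemma-3-12b-it",  # replace with any other hosted model id
      messages=[{"role": "user", "content": "Say hello in one word."}],
  )
  print(response.choices[0].message.content)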

Gemma 3 12B
Want to self-host this model with our open source infrastructure?
Read How

Get started with Modular

  • Request a demo

    Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

    • Distributed, large-scale online inference endpoints

    • Highest performance to maximize ROI and minimize latency

    • Deploy in Modular cloud or your cloud

    • View all features with a custom demo

    Book a demo

    Talk with our sales lead Jay!

    30-minute demo. Evaluate with your workloads. Ask us anything.

  • Talk to us!

    Book a demo for a personalized walkthrough of Modular in your environment. Learn how teams use it to simplify systems and tune performance at scale.

    • Custom 30-minute walkthrough of our platform

    • Cover specific model or deployment needs

    • Flexible pricing to fit your specific needs

    Book a demo

    Talk with our sales lead Jay!

  • Start using MAX

    ( FREE )

    Run any open source model in 5 minutes, then benchmark it. Scale it to millions yourself (for free!).

  • Start using Mojo

    ( FREE )

    Install Mojo and get up and running in minutes. A simple install, familiar tooling, and clear docs make it easy to start writing code immediately.