Gemma 4 31B: Dense Vision-Language Inference on Modular

Gemma 4 31B is a dense, 31-billion-parameter vision-language model from Google. Its redesigned architecture improves both efficiency and long-context quality, and its 256K-token context window suits demanding tasks that require deep reasoning across large inputs.

Example Usage

Output

  Speculative decoding uses a smaller draft model to predict multiple
  tokens ahead, then verifies them against the full model in a single
  pass. Accepted tokens skip individual generation steps, improving
  throughput without sacrificing accuracy. It's most effective when the
  draft model closely matches the target model's distribution.
Code to use (Python)

  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  response = client.chat.completions.create(
      model="google/gemma-4-31B-it",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Explain speculative decoding in 3 sentences."},
      ],
      stream=True,
  )
  
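  # Print tokens as they arrive; some chunks carry no choices or no text.
  for chunk in response:
      if chunk.choices and chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="", flush=True)
  print()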
Output

  “The image shows a fox wearing a backpack in a forest.”
Input image (fox.png): a cute orange cartoon fox with a small brown backpack, sitting on grass in a bright, colorful forest.
Code to use (Python)

  import base64
  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://model.api.modular.com",
      api_key="<your_api_token>",
  )
  
  # Read the image and base64-encode it so it can be sent inline as a data URL.
  with open("fox.png", "rb") as image_file:
      image_data = base64.b64encode(image_file.read()).decode("utf-8")
  
  response = client.chat.completions.create(
      model="google/gemma-4-31B-it",
      messages=[
          {
              "role": "user",
              "content": [
                  {
                      "type": "text",
                      "text": "Describe this image in one sentence.",
                  },
                  {
                      "type": "image_url",
                      "image_url": {
                          "url": f"data:image/png;base64,{image_data}"
                      },
                  },
              ],
          }
      ],
  )
  
  print(response.choices[0].message.content)
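
The example above hard-codes `image/png` in the data URL. As a minimal sketch, assuming you may also want to send JPEG or other image files, the helper below infers the MIME type from the file extension using only the standard library. The name `image_to_data_url` is hypothetical and not part of any Modular or OpenAI API, and which image formats the endpoint accepts is not stated on this page.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the `url` value of the `image_url` content part, for example `{"url": image_to_data_url("fox.png")}`.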
Model Details
  • Developed by
    Google
  • Model family
    google/gemma-4-31B-it
  • Modality
    LLM, Vision
  • Context Window
    256K
  • Total Params
    31.3B
  • Precision
    BF16
  • Deployment options
    Shared, Dedicated, Self-hosted
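
For self-hosted sizing, the parameter count and precision listed above give a rough floor on weight memory. A back-of-the-envelope sketch, assuming 2 bytes per BF16 parameter and counting weights only (KV cache, activations, and runtime overhead come on top and depend on context length and batch size):

```python
TOTAL_PARAMS = 31.3e9   # "Total Params" from the list above
BYTES_PER_PARAM = 2     # BF16 stores each value in 16 bits


def weight_memory_gb(params: float = TOTAL_PARAMS,
                     bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


print(f"{weight_memory_gb():.1f} GB of weights")  # prints "62.6 GB of weights"
```

Treat the result as a lower bound for planning, not a sizing recommendation.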

Why choose Gemma 4 31B on Modular?

  • High performance, out of the box

    Run leading open models with strong default performance, then optimize down to the kernel level to extract more from every GPU.

  • Lower Infrastructure Costs

    Deploy efficiently across NVIDIA and AMD hardware to reduce GPU count, increase throughput, and avoid expensive closed-model licensing.

  • Easy Integration

    Integrate through an OpenAI-compatible endpoint, swap models freely, and scale across clouds or hardware without redesigning your application stack.
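
Because the endpoint follows the OpenAI chat-completions shape, swapping models can be as small as changing the `model` string. A minimal sketch of that idea, reusing the base URL and model ID from the examples above; the helper names `build_chat_request` and `ask` are hypothetical, and any other model ID you pass is your own assumption about what the catalog offers.

```python
from typing import Any

BASE_URL = "https://model.api.modular.com"  # taken from the examples above


def build_chat_request(model: str, prompt: str) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def ask(model: str, prompt: str, api_key: str) -> str:
    """Send one prompt to the given model and return the reply text."""
    from openai import OpenAI  # imported lazily so the builder works without the SDK

    client = OpenAI(base_url=BASE_URL, api_key=api_key)
    response = client.chat.completions.create(**build_chat_request(model, prompt))
    return response.choices[0].message.content
```

Calling `ask("google/gemma-4-31B-it", "Hello", api_key="<your_api_token>")` targets this model; pointing the same call at a different hosted model only changes the first argument, and the rest of the application code stays the same.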

Get started with Modular

  • Request a demo

    Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

    • Distributed, large-scale online inference endpoints

    • High performance to maximize ROI and minimize latency

    • Deploy in Modular cloud or your cloud

    • View all features with a custom demo

    Book a demo

    Talk with our sales lead Jay!

    30-minute demo. Evaluate with your workloads. Ask us anything.

  • Talk to us!

    Book a demo for a personalized walkthrough of Modular in your environment. Learn how teams use it to simplify systems and tune performance at scale.

    • Custom 30-minute walkthrough of our platform

    • Cover specific model or deployment needs

    • Flexible pricing to fit your specific needs

  • Start using MAX (free)

    Run any open-source model in 5 minutes, then benchmark it. Scale it to millions yourself (for free!).

  • Start using Mojo (free)

    Install Mojo and get up and running in minutes. A simple install, familiar tooling, and clear docs make it easy to start writing code immediately.