
Mistral Small 3.1 24B
Mistral Small 3.1 by Mistral AI is a dense 24B parameter model supporting text and vision.
- Developed by: Mistral AI
- Model family: Mistral AI
- Modality: LLM, Vision
- Context window: 128K
- Total params: 24B
- Precision: BF16 / FP8
- Deployment options: Shared, Dedicated, Self-hosted
Why choose Mistral Small 3.1 24B on Modular?
Run leading open models with strong default performance and the ability to optimize down to the kernel — extracting more from every GPU.
Deploy efficiently across NVIDIA and AMD hardware to reduce GPU count, increase throughput, and avoid expensive closed-model licensing.
Integrate through an OpenAI-compatible endpoint, swap models freely, and scale across clouds or hardware without redesigning your application stack.
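Because the endpoint is OpenAI-compatible, any standard Chat Completions client can talk to it. A minimal sketch using only the Python standard library is below; the `BASE_URL` and `MODEL_ID` values are assumptions for illustration, so substitute the endpoint address and model ID from your own deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed local serving endpoint
MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"  # assumed model ID

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a Chat Completions request for an OpenAI-compatible server."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # Local deployments typically ignore the API key.
            "Authorization": "Bearer EMPTY",
        },
        method="POST",
    )

req = build_chat_request("Summarize this model's capabilities in one sentence.")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` (or any HTTP client) returns a JSON body whose reply text lives at `choices[0].message.content`, the same shape the OpenAI API uses, which is what lets you swap models or providers without changing application code.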
Get started with Modular
Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.
Distributed, large-scale online inference endpoints
Highest performance to maximize ROI and minimize latency
Deploy in Modular cloud or your cloud
View all features with a custom demo

Book a demo
Talk with our sales lead Jay!
30-minute demo. Evaluate with your workloads. Ask us anything.
Book a demo for a personalized walkthrough of Modular in your environment. Learn how teams use it to simplify systems and tune performance at scale.
Custom 30-minute walkthrough of our platform
Cover specific model or deployment needs
Flexible pricing to fit your specific needs
Run any open-source model in 5 minutes, then benchmark it. Scale it to millions yourself (for free!).
Install Mojo and get up and running in minutes. A simple install, familiar tooling, and clear docs make it easy to start writing code immediately.