BATCH INFERENCE

High-throughput, low-cost batch inference tailored for enterprise scale

Get started with a call

We provide fast turnaround times, exceptional throughput, and significant cost savings for complex batch inference workloads of any scale, all while ensuring high accuracy.

80% cheaper

2x faster inference

85% cheaper

Industry-leading speed and accuracy at the lowest cost, thanks to our unique ability to procure and utilize dynamically priced, low-cost GPUs.

Trillion-token scale

Higher rate limits and throughput than other providers. Scale instantly to 3,000 GPUs and trillions of tokens if needed, or pace the work out over the year.

Keep your data private

We connect with your data storage, process the batches, and save the results back. Your data is never uploaded or stored on our servers.

Where we stand out

The best available pricing

Unlike other providers, our inference prices are market-based: the token price tracks the underlying compute cost, so you always get the lowest available price for the fastest GPU. We also run across multiple hardware platforms to deliver the best price-performance.

Up to 67% cheaper content summarization

The best available accuracy

Our MAX inference engine consistently outperforms other providers on accuracy across key benchmarks like DocVQA, MathVista, and ChartQA. Let us know which benchmarks you prefer, and we'll deliver custom accuracy results on the first batch workload of your contract.

Up to 10% higher accuracy than other providers

Simple Process

Discovery Call

A short call to understand your workload and choose the right batch size, model, and price.

Kick off workloads

Once the contract is signed, we'll kick off the first workload in as soon as 24 hours.

Track batches in real time

Onboard our scalable batch API and monitor the progress of your submitted batches (see the sketch after these steps).

Invoice & Payment

Receive usage analytics for the entire job with your invoice.
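
To make the onboarding and tracking steps concrete, here is a minimal sketch of what submitting and monitoring a batch could look like through an OpenAI-compatible batch API, using the standard openai Python client. The base URL, API-key variable, model choice, and file-upload flow are illustrative assumptions rather than a confirmed interface; in practice we connect directly to your data storage, as described elsewhere on this page.

```python
# Minimal sketch of an OpenAI-compatible batch workflow.
# Assumptions: BATCH_API_BASE_URL and BATCH_API_KEY are placeholders for
# credentials provided during onboarding; the model name is one example
# from the supported-models table.
import json
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["BATCH_API_BASE_URL"],
    api_key=os.environ["BATCH_API_KEY"],
)

# 1. Write requests to a JSONL file: one chat-completion request per line.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": f"Summarize document {i}."}],
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests) + "\n")

# 2. Upload the request file and create the batch job.
input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the batch finishes, then download the results.
while (batch := client.batches.retrieve(batch.id)).status not in ("completed", "failed"):
    time.sleep(60)

if batch.status == "completed":
    print(client.files.content(batch.output_file_id).text)
```

The JSONL format carries a per-request custom ID, so results can be joined back to your source records once the batch completes.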

Built for trillion-token, sensitive, multimodal use cases

We natively support very large-scale batch inference, with far higher rate limits and throughput than other providers. Unlike other services, we don't force you to upload petabytes of data to our servers. We natively handle multimodal use cases and support any open-source or proprietary model you want to run batches on.
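
For multimodal workloads, an individual batch request could look like the following sketch, which uses the standard OpenAI-compatible chat format with an image part; the model name (taken from the table below) and the image URL are illustrative examples, not a prescribed setup.

```python
# Illustrative JSONL request line for a multimodal (vision) batch job, using the
# OpenAI-compatible chat format. The model and image URL are example placeholders.
import json

request = {
    "custom_id": "receipt-0001",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "Qwen/Qwen2.5-VL-72B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the line items from this receipt as JSON."},
                {"type": "image_url", "image_url": {"url": "https://example.com/receipt-0001.png"}},
            ],
        }],
    },
}
print(json.dumps(request))  # one line of the batch input file
```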

Bespoke enterprise support

Our batch inference is designed for large-scale, mostly enterprise use cases. That lets us be more hands-on than traditional self-serve providers.

  • Want a custom deployment?
  • Need to hit specific latency, throughput, or cost requirements?
  • Is there a model that's performing better in your evals, but we're not serving it?

We'll work with you to get the best possible outcomes.

Trusted hardware providers

Modular has teamed up with San Francisco Compute to deliver the most competitively priced OpenAI-compatible batch inference available. Our solution is over 85% cheaper than most providers of open-source models. Through this exclusive partnership, we utilize their advanced hardware pricing technology, enabling us to efficiently run MAX at trillion-token scale.

Kick off a workload now

Hop on a quick call to get started, then send us your batches.


Our sales team reads each request and will reach out with next steps.

Supported models

We currently support the following models at a lower average cost than every other current provider. Exact prices, latency, and throughput depend on the use case and current market conditions.

Model | Hugging Face Name | Size | Price per 1M Tokens
gpt-oss-120b | openai/gpt-oss-120b | 120B | $0.04 input / $0.20 output
gpt-oss-20b | openai/gpt-oss-20b | 20B | $0.02 input / $0.08 output
Llama-3.1-405B-Instruct | meta-llama/Llama-3.1-405B-Instruct | 405B | $0.50 input / $1.50 output
Llama-3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct | 70B | $0.052 input / $0.156 output
Llama-3.1-8B-Instruct | meta-llama/Meta-Llama-3.1-8B-Instruct | 8B | $0.008 input / $0.02 output
Llama 3.2 Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | 11B | $0.072 input / $0.072 output
Qwen-2.5-72B-Instruct | Qwen/Qwen2.5-72B-Instruct | 72B | $0.065 input / $1.25 output
Qwen2.5-VL 72B | Qwen/Qwen2.5-VL-72B-Instruct | 72B | $0.125 input / $0.325 output
Qwen2.5-VL 32B | Qwen/Qwen2.5-VL-32B-Instruct | 32B | $0.125 input / $0.325 output
Qwen3 32B | Qwen/Qwen3-32B | 32B | $0.05 input / $0.15 output
Qwen3 A3B 30B | Qwen/Qwen3-30B-A3B-Instruct-2507 | 30B | $0.05 input / $0.15 output
Qwen 3-14B | Qwen/Qwen3-14B | 14B | $0.04 input / $0.12 output
Qwen 3-8B | Qwen/Qwen3-8B | 8B | $0.014 input / $0.055 output
QwQ-32B | Qwen/QwQ-32B | 32B | $0.075 input / $0.225 output
Gemma-3-27B-in-chat | google/gemma-3-27b-it | 27B | $0.05 input / $0.15 output
Gemma-3-12B-in-chat | google/gemma-3-12b-it | 12B | $0.04 input / $0.08 output
Gemma-3-4B-in-chat | google/gemma-3-4b-it | 4B | $0.016 input / $0.032 output
Mistral Small 3.2 2506 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 | 24B | $0.04 input / $0.08 output
Mistral Nemo 2407 | mistralai/Mistral-Nemo-Instruct-2407 | 12B | $0.02 input / $0.06 output
InternVL3-78B | OpenGVLab/InternVL3-78B | 78B | $0.125 input / $0.325 output
InternVL3-38B | OpenGVLab/InternVL3-38B | 38B | $0.125 input / $0.325 output
InternVL3-14B | OpenGVLab/InternVL3-14B | 14B | $0.072 input / $0.072 output
InternVL3-9B | OpenGVLab/InternVL3-9B | 9B | $0.05 input / $0.05 output
DeepSeek-R1 | deepseek-ai/DeepSeek-R1 | 671B | $0.28 input / $1.00 output
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | 671B | $0.112 input / $0.456 output
Llama-4-Maverick-17B-128E-Instruct | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 400B | $0.075 input / $0.425 output
Llama-4-Scout-17B-Instruct | meta-llama/Llama-4-Scout-17B-16E-Instruct | 109B | $0.05 input / $0.25 output
Qwen3 Coder A35B 480B | Qwen/Qwen3-Coder-480B-A35B-Instruct | 480B | $0.32 input / $1.25 output
Qwen3 A22B 2507 235B | Qwen/Qwen3-235B-A22B-Instruct-2507 | 235B | $0.32 input / $1.25 output
Kimi K2 | moonshotai/Kimi-K2-Instruct | 1T | $0.30 input / $1.25 output
GLM 4.5 | zai-org/GLM-4.5 | 358B | $0.30 input / $1.10 output
GLM 4.5 Air | zai-org/GLM-4.5-Air | 110B | $0.16 input / $0.88 output
GLM 4.5V | zai-org/GLM-4.5V | 108B | $0.30 input / $0.90 output
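
As a rough illustration of how the per-token list prices above translate into a job-level cost, here is a back-of-the-envelope sketch; the token counts are invented for the example, and actual prices are market-based and agreed during the discovery call.

```python
# Back-of-the-envelope cost estimate at the list prices shown above.
# (input, output) USD per 1M tokens; token counts below are illustrative.
PRICES_PER_1M = {
    "meta-llama/Meta-Llama-3.1-8B-Instruct": (0.008, 0.02),
    "openai/gpt-oss-120b": (0.04, 0.20),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a batch job at list prices."""
    in_price, out_price = PRICES_PER_1M[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: 1B input tokens and 200M output tokens on Llama-3.1-8B-Instruct:
# 1,000 x $0.008 + 200 x $0.02 = $8.00 + $4.00 = $12.00
print(estimate_cost("meta-llama/Meta-Llama-3.1-8B-Instruct", 1_000_000_000, 200_000_000))
```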

Developer Approved

actually flies on the GPU

@ Sanika

"after wrestling with CUDA drivers for years, it felt surprisingly… smooth. No, really: for once I wasn’t battling obscure libstdc++ errors at midnight or re-compiling kernels to coax out speed. Instead, I got a peek at writing almost-Pythonic code that compiles down to something that actually flies on the GPU."

pure iteration power

@ Jayesh

"This is about unlocking freedom for devs like me, no more vendor traps or rewrites, just pure iteration power. As someone working on challenging ML problems, this is a big thing."

impressed

@ justin_76273

“The more I benchmark, the more impressed I am with the MAX Engine.”

performance is insane

@ drdude81

“I tried MAX builds last night, impressive indeed. I couldn't believe what I was seeing... performance is insane.”

easy to optimize

@ dorjeduck

“It’s fast which is awesome. And it’s easy. It’s not CUDA programming...easy to optimize.”

potential to take over

@ svpino

“A few weeks ago, I started learning Mojo 🔥 and MAX. Mojo has the potential to take over AI development. It's Python++. Simple to learn, and extremely fast.”

was a breeze!

@ NL

“Max installation on Mac M2 and running llama3 in (q6_k and q4_k) was a breeze! Thank you Modular team!”

high performance code

@ jeremyphoward

"Mojo is Python++. It will be, when complete, a strict superset of the Python language. But it also has additional functionality so we can write high performance code that takes advantage of modern accelerators."

one language all the way

@ fnands

“Tired of the two language problem. I have one foot in the ML world and one foot in the geospatial world, and both struggle with the 'two-language' problem. Having Mojo - as one language all the way through would be awesome.”

works across the stack

@ scrumtuous

“Mojo can replace the C programs too. It works across the stack. It’s not glue code. It’s the whole ecosystem.”

completely different ballgame

@ scrumtuous

“What @modular is doing with Mojo and the MaxPlatform is a completely different ballgame.”

AI for the next generation

@ mytechnotalent

“I am focusing my time to help advance @Modular. I may be starting from scratch but I feel it’s what I need to do to contribute to #AI for the next generation.”

surest bet for longterm

@ pagilgukey

“Mojo and the MAX Graph API are the surest bet for longterm multi-arch future-substrate NN compilation”

12x faster without even trying

@ svpino

“Mojo destroys Python in speed. 12x faster without even trying. The future is bright!”

feeling of superpowers

@ Aydyn

"Mojo gives me the feeling of superpowers. I did not expect it to outperform a well-known solution like llama.cpp."

very excited

@ strangemonad

“I'm very excited to see this coming together and what it represents, not just for MAX, but my hope for what it could also mean for the broader ecosystem that mojo could interact with.”

impressive speed

@ Adalseno

"It worked like a charm, with impressive speed. Now my version is about twice as fast as Julia's (7 ms vs. 12 ms for a 10 million vector; 7 ms on the playground. I guess on my computer, it might be even faster). Amazing."

amazing achievements

@ Eprahim

“I'm excited, you're excited, everyone is excited to see what's new in Mojo and MAX and the amazing achievements of the team at Modular.”

Community is incredible

@ benny.n

“The Community is incredible and so supportive. It’s awesome to be part of.”

huge increase in performance

@ Aydyn

"C is known for being as fast as assembly, but when we implemented the same logic on Mojo and used some of the out-of-the-box features, it showed a huge increase in performance... It was amazing."

Latest Blog Posts