July 31, 2025

SF Compute and Modular Partner to Revolutionize AI Inference Economics

Modular Team

SF Compute Team

Modular has partnered with SF Compute to address a fundamental asymmetry in the AI ecosystem: while model capabilities advance exponentially, the economic structures governing compute costs remain anchored in legacy paradigms. 

We’re excited to launch the Large Scale Inference Batch API, a high-throughput, asynchronous interface built for large-scale offline inference tasks like data labeling, summarization, and synthetic data generation. At launch, it supports 20+ state-of-the-art models across language, vision, and multimodal domains, from efficient 1B models to 600B+ frontier systems. Powered by Modular’s high-performance inference stack and SF Compute’s real-time spot market, the API delivers up to 80% lower cost than typical market alternatives.
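Here’s what a batch submission might look like in practice. This is a minimal sketch assuming an OpenAI-compatible batch workflow (a JSONL file upload followed by a batch job); the base URL, API key handling, and exact parameters below are illustrative assumptions, so consult the API documentation for the real values.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible batch endpoint

# Hypothetical base URL and key, for illustration only.
client = OpenAI(base_url="https://api.sfcompute.example/v1", api_key="YOUR_KEY")

# Batch jobs are typically submitted as a JSONL file: one request per line.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": f"Summarize document {i}."}],
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# Upload the file and create the batch job; results arrive asynchronously.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Submitted batch {job.id}: {job.status}")
```

Each line of the JSONL file is an independent request, and results come back asynchronously once the batch completes, which is what makes this shape a good fit for offline workloads like labeling and summarization.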

Try it today: we’re offering tens of millions of free batch inference tokens to the first 100 new customers who get started now.

The Best Price-Performance for the Rest of Us

The economics of AI inference are fundamentally broken: underutilized hardware, rigid pricing, and infrastructure built for traditional AI workloads. The collaboration between SF Compute and Modular rethinks this from first principles, combining a real-time spot market for GPUs with an industry-leading AI serving stack to unlock entirely new token economics for deploying AI inference at scale. The following models are supported at launch:

| Model ID | Hugging Face Name | Size |
| --- | --- | --- |
| DeepSeek‑R1 | deepseek-ai/DeepSeek-R1 | 671B |
| DeepSeek‑V3 | deepseek-ai/DeepSeek-V3 | 671B |
| DeepSeek‑R1‑Distill‑Llama‑70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 70B |
| Llama‑3‑70B‑chat | meta-llama/Llama-3-70b-chat-hf | 70B |
| Llama‑3.1‑405B‑Instruct | meta-llama/Meta-Llama-3.1-405B-Instruct | 405B |
| Llama‑3.1‑70B‑Instruct | meta-llama/Meta-Llama-3.1-70B-Instruct | 70B |
| Llama‑3.1‑8B‑Instruct | meta-llama/Meta-Llama-3.1-8B-Instruct | 8B |
| Llama‑3.3‑70B‑Instruct | meta-llama/Meta-Llama-3.3-70B-Instruct | 70B |
| Llama‑4‑Scout‑17B‑Instruct | meta-llama/Llama-4-Scout-17B-16E-Instruct | 109B |
| Llama‑4‑Maverick‑17B‑128E‑Instruct | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 400B |
| Llama 3.2 Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | 11B |
| Mistral‑7B‑Instruct | mistralai/Mistral-7B-Instruct-v0.1 | 7B |
| Mixtral‑8x7B‑Instruct | mistralai/Mixtral-8x7B-Instruct-v0.1 | 56B |
| Mistral‑Small‑24B‑Instruct | mistralai/Mistral-Small-24B-Instruct-2501 | 24B |
| Qwen‑2.5‑72B‑Instruct | Qwen/Qwen2.5-72B-Instruct | 72.7B |
| Qwen‑2.5‑7B‑Instruct | Qwen/Qwen2.5-7B-Instruct | 7B |
| Qwen 3‑14B | Qwen/Qwen3-14B | 14.8B |
| Qwen 3‑8B | Qwen/Qwen3-8B | 8.2B |
| QwQ‑32B | Qwen/QwQ-32B | 32.5B |
| InternVL3‑9B | OpenGVLab/InternVL3-9B | 9B |
| InternVL3‑14B | OpenGVLab/InternVL3-14B | 14B |
| InternVL3‑38B | OpenGVLab/InternVL3-38B | 38B |
| InternVL3‑78B | OpenGVLab/InternVL3-78B | 78B |
| Gemma‑3‑12B‑in‑chat | google/gemma-3-12b-it | 12B |
| Gemma‑3‑27B‑in‑chat | google/gemma-3-27b-it | 27B |

Under the hood, SF Compute provides real-time access to thousands of NVIDIA H100 and H200 GPUs, with AMD MI300X/MI325X coming soon, via its dynamic-pricing marketplace. Spot rates often fall below $1.40/hour, far below the current $6–$8/hour on-demand standard and even lower than the 1–3 year locked-in reserved rates of traditional clouds. Modular’s Platform complements this with compiler-native execution, GenAI-specific serving optimizations, and the world’s most performant AI kernels, achieving up to 60% higher throughput relative to existing industry infrastructure on NVIDIA and AMD hardware.
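As a sanity check on how these two levers combine, here is a quick back-of-the-envelope calculation; the rates below are illustrative midpoints drawn from the figures above, not quoted prices, and real savings depend on workload, model, and market conditions.

```python
# Back-of-the-envelope check of the pricing figures quoted above.
# All numbers are illustrative; real spot rates vary with market conditions.

spot_rate = 1.40       # $/GPU-hour, typical SF Compute spot rate
on_demand_rate = 7.00  # $/GPU-hour, midpoint of the $6-$8 on-demand range

price_ratio = spot_rate / on_demand_rate
print(f"Spot is {1 - price_ratio:.0%} cheaper per GPU-hour")  # ~80%

# Up to 60% higher throughput lowers cost *per token* even further,
# since cost/token scales with (price per hour) / (tokens per hour).
throughput_gain = 1.60
print(f"Cost per token: ~{price_ratio / throughput_gain:.2f}x of baseline")
```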

Together, we are able to deliver a structural shift in how inference is monetized. While incumbents rely on fixed, over-provisioned infrastructure to preserve margins, we optimize for volume, efficiency, and developer value, collapsing the cost stack and returning the gains to users.

Modular + SF Compute: The Power of Unification

For years, AI development has been constrained by rigid hardware silos and inflexible cloud infrastructure. NVIDIA’s CUDA stack dominated the ecosystem, while AMD’s ROCm struggled to gain adoption despite competitive hardware. Meanwhile, traditional cloud platforms enforced fixed provisioning, long-term contracts, and static pricing models. The result: artificial scarcity, vendor lock-in, and concentrated pricing power that stifles innovation and inflates the true cost of AI deployment.

Our architecture treats this as an engineering problem rather than a market reality. By combining SF Compute’s unified cloud marketplace with Modular’s hardware-abstraction platform, we’ve built true fungibility across compute vendors. This required solving several uniquely difficult technical problems:

  • Hardware unification: Modular’s Platform provides unified cluster, serving, and kernel APIs, delivering industry-leading performance without sacrificing portability across heterogeneous hardware platforms.
  • Cloud unification: SF Compute’s platform abstracts physical infrastructure behind a programmable spot market and dynamic scheduler, enabling seamless allocation across heterogeneous compute backends.
  • Intelligent placement & routing: Together, they automatically place workloads using a multidimensional optimizer that weighs latency, bandwidth, model characteristics, and real-time market pricing; a simplified sketch follows this list.
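To make the placement idea concrete, below is a deliberately simplified, hypothetical sketch of what such a scorer could look like. The offer fields, weights, and scoring function are all illustrative assumptions; the production optimizer is internal and far more sophisticated.

```python
from dataclasses import dataclass

@dataclass
class GpuOffer:
    """A candidate allocation from the spot market (illustrative fields)."""
    name: str
    price_per_hour: float      # current spot price, $/GPU-hour
    est_latency_ms: float      # network latency to the cluster
    interconnect_gbps: float   # node-to-node bandwidth
    fits_model: bool           # enough memory for weights + KV cache

def placement_score(offer: GpuOffer, w_price=1.0, w_latency=0.01, w_bw=0.05) -> float:
    """Lower is better: a weighted blend of cost, latency, and bandwidth.

    A real optimizer would also weigh queue depth, data locality,
    preemption risk, and per-model throughput profiles.
    """
    if not offer.fits_model:
        return float("inf")
    return (w_price * offer.price_per_hour
            + w_latency * offer.est_latency_ms
            - w_bw * offer.interconnect_gbps)

offers = [
    GpuOffer("H100-cluster-a", 1.35, 12.0, 900.0, True),
    GpuOffer("H200-cluster-b", 1.80, 8.0, 900.0, True),
    GpuOffer("MI300X-cluster-c", 1.10, 25.0, 400.0, True),
]
best = min(offers, key=placement_score)
print(f"Route workload to: {best.name}")
```

The point of the sketch is the shape of the problem: every accelerator type reduces to a single comparable score, so hardware choice becomes a continuous market decision rather than an upfront architectural commitment.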

The result: H100s, H200s, MI300X/MI325Xs (coming soon), and next-generation accelerators compete in a single market on pure price-performance. For developers, hardware complexity disappears: models are routed to the best-fit resources, transparently. This isn’t just cost reduction; it’s a fundamental redefinition of how AI gets built and deployed.

Unlocking AI Innovation

The launch of the Large Scale Inference Batch API marks the first step in a larger transformation of AI infrastructure economics. Our roadmap targets key inefficiencies across the stack with innovations designed to unlock dramatically better cost-performance:

  • Long-running, online inference support for persistent, low-latency applications
  • Expanded model and hardware compatibility across the API and marketplace
  • Additional multi-cluster-level optimizations to further reduce compute overhead for large workloads

Start saving money today!

Ready to operate in a world where inference economics no longer constrain your ambitions? Reach out to the SF Compute team to get a quote for your use case.
