July 31, 2025

SF Compute and Modular Partner to Revolutionize AI Inference Economics

Modular Team

SF Compute Team

Modular has partnered with SF Compute to address a fundamental asymmetry in the AI ecosystem: while model capabilities advance exponentially, the economic structures governing compute costs remain anchored in legacy paradigms. 

We’re excited to launch the Large Scale Inference Batch API, a high-throughput, asynchronous interface built for large-scale offline inference tasks like data labeling, summarization, and synthetic data generation. At launch, it supports 20+ state-of-the-art models across language, vision, and multimodal domains, from efficient 1B models to 600B+ frontier systems. Powered by Modular’s high-performance inference stack and SF Compute’s real-time spot market, the API delivers up to 80% lower cost than typical market alternatives.
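Here’s what a batch submission might look like in practice. This is a minimal sketch assuming an OpenAI-compatible batch workflow (a JSONL file upload followed by a batch job); the base URL, API key handling, and exact parameters below are illustrative assumptions, so consult the API documentation for the real values.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible batch endpoint

# Hypothetical base URL and key, for illustration only.
client = OpenAI(base_url="https://api.sfcompute.example/v1", api_key="YOUR_KEY")

# Batch jobs are typically submitted as a JSONL file: one request per line.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": f"Summarize document {i}."}],
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# Upload the file and create the batch job; results arrive asynchronously.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Submitted batch {job.id}: {job.status}")
```

Each line of the JSONL file is an independent request, and results come back asynchronously once the batch completes, which is what makes this shape a good fit for offline workloads like labeling and summarization.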

Try it today: we’re offering tens of millions of free batch inference tokens to the first 100 new customers who get started now.

The Best Price-Performance for the Rest of Us

The economics of AI inference are fundamentally broken: underutilized hardware, rigid pricing, and infrastructure built for traditional AI workloads. The collaboration between SF Compute and Modular rethinks this from first principles, combining a real-time spot market for GPUs with an industry-leading AI serving stack to unlock entirely new token economics for deploying AI inference at scale. The following models are supported at launch:

| Model ID | Hugging Face Name | Size |
| --- | --- | --- |
| DeepSeek‑R1 | deepseek-ai/DeepSeek-R1 | 671B |
| DeepSeek‑V3 | deepseek-ai/DeepSeek-V3 | 671B |
| DeepSeek‑R1‑Distill‑Llama‑70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 70B |
| Llama‑3‑70B‑chat | meta-llama/Llama-3-70b-chat-hf | 70B |
| Llama‑3.1‑405B‑Instruct | meta-llama/Meta-Llama-3.1-405B-Instruct | 405B |
| Llama‑3.1‑70B‑Instruct | meta-llama/Meta-Llama-3.1-70B-Instruct | 70B |
| Llama‑3.1‑8B‑Instruct | meta-llama/Meta-Llama-3.1-8B-Instruct | 8B |
| Llama‑3.3‑70B‑Instruct | meta-llama/Meta-Llama-3.3-70B-Instruct | 70B |
| Llama‑4‑Scout‑17B‑Instruct | meta-llama/Llama-4-Scout-17B-16E-Instruct | 109B |
| Llama‑4‑Maverick‑17B‑128E‑Instruct | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 400B |
| Llama 3.2 Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | 11B |
| Mistral‑7B‑Instruct | mistralai/Mistral-7B-Instruct-v0.1 | 7B |
| Mixtral‑8x7B‑Instruct | mistralai/Mixtral-8x7B-Instruct-v0.1 | 56B |
| Mistral‑Small‑24B‑Instruct | mistralai/Mistral-Small-24B-Instruct-2501 | 24B |
| Qwen‑2.5‑72B‑Instruct | Qwen/Qwen2.5-72B-Instruct | 72.7B |
| Qwen‑2.5‑7B‑Instruct | Qwen/Qwen2.5-7B-Instruct | 7B |
| Qwen 3‑14B | Qwen/Qwen3-14B | 14.8B |
| Qwen 3‑8B | Qwen/Qwen3-8B | 8.2B |
| QwQ‑32B | Qwen/QwQ-32B | 32.5B |
| InternVL3‑9B | OpenGVLab/InternVL3-9B | 9B |
| InternVL3‑14B | OpenGVLab/InternVL3-14B | 14B |
| InternVL3‑38B | OpenGVLab/InternVL3-38B | 38B |
| InternVL3‑78B | OpenGVLab/InternVL3-78B | 78B |
| Gemma‑3‑12B‑in‑chat | google/gemma-3-12b-it | 12B |
| Gemma‑3‑27B‑in‑chat | google/gemma-3-27b-it | 27B |

Under the hood, SF Compute provides real-time access to thousands of NVIDIA H100 and H200 GPUs, with AMD MI300X/MI325X coming soon, via its dynamic-pricing marketplace. Spot rates often fall below $1.40/hour, far below the current $6–$8/hour on-demand standard and even lower than the 1–3 year locked-in reserved rates of traditional clouds. Modular’s Platform complements this with compiler-native execution, GenAI-specific serving optimizations, and the world’s most performant AI kernels, achieving up to 60% higher throughput relative to existing industry infrastructure on NVIDIA and AMD hardware.
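As a sanity check on how these two levers combine, here is a quick back-of-the-envelope calculation; the rates below are illustrative midpoints drawn from the figures above, not quoted prices, and real savings depend on workload, model, and market conditions.

```python
# Back-of-the-envelope check of the pricing figures quoted above.
# All numbers are illustrative; real spot rates vary with market conditions.

spot_rate = 1.40       # $/GPU-hour, typical SF Compute spot rate
on_demand_rate = 7.00  # $/GPU-hour, midpoint of the $6-$8 on-demand range

price_ratio = spot_rate / on_demand_rate
print(f"Spot is {1 - price_ratio:.0%} cheaper per GPU-hour")  # ~80%

# Up to 60% higher throughput lowers cost *per token* even further,
# since cost/token scales with (price per hour) / (tokens per hour).
throughput_gain = 1.60
print(f"Cost per token: ~{price_ratio / throughput_gain:.2f}x of baseline")
```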

Together, we are able to deliver a structural shift in how inference is monetized. While incumbents rely on fixed, over-provisioned infrastructure to preserve margins, we optimize for volume, efficiency, and developer value, collapsing the cost stack and returning the gains to users.

Modular + SF Compute: The Power of Unification

For years, AI development has been constrained by rigid hardware silos and inflexible cloud infrastructure. NVIDIA’s CUDA stack dominated the ecosystem, while AMD’s ROCm struggled to gain adoption despite competitive hardware. Meanwhile, traditional cloud platforms enforced fixed provisioning, long-term contracts, and static pricing models. The result: artificial scarcity, vendor lock-in, and concentrated pricing power that stifles innovation and inflates the true cost of AI deployment.

Our architecture treats this as an engineering problem rather than a market reality. By combining SF Compute’s unified cloud marketplace with Modular’s hardware-abstraction platform, we’ve built true fungibility across compute vendors. This required solving several uniquely difficult technical problems:

  • Hardware unification: Modular’s Platform provides unified cluster, serving, and kernel APIs, delivering industry-leading performance without sacrificing portability across heterogeneous hardware platforms.
  • Cloud unification: SF Compute’s platform abstracts physical infrastructure behind a programmable spot market and dynamic scheduler, enabling seamless allocation across heterogeneous compute backends.
  • Intelligent placement & routing: Together, they automatically place workloads using a multidimensional optimizer that weighs latency, bandwidth, model characteristics, and real-time market pricing; a simplified sketch follows this list.
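To make the placement idea concrete, below is a deliberately simplified, hypothetical sketch of what such a scorer could look like. The offer fields, weights, and scoring function are all illustrative assumptions; the production optimizer is internal and far more sophisticated.

```python
from dataclasses import dataclass

@dataclass
class GpuOffer:
    """A candidate allocation from the spot market (illustrative fields)."""
    name: str
    price_per_hour: float      # current spot price, $/GPU-hour
    est_latency_ms: float      # network latency to the cluster
    interconnect_gbps: float   # node-to-node bandwidth
    fits_model: bool           # enough memory for weights + KV cache

def placement_score(offer: GpuOffer, w_price=1.0, w_latency=0.01, w_bw=0.05) -> float:
    """Lower is better: a weighted blend of cost, latency, and bandwidth.

    A real optimizer would also weigh queue depth, data locality,
    preemption risk, and per-model throughput profiles.
    """
    if not offer.fits_model:
        return float("inf")
    return (w_price * offer.price_per_hour
            + w_latency * offer.est_latency_ms
            - w_bw * offer.interconnect_gbps)

offers = [
    GpuOffer("H100-cluster-a", 1.35, 12.0, 900.0, True),
    GpuOffer("H200-cluster-b", 1.80, 8.0, 900.0, True),
    GpuOffer("MI300X-cluster-c", 1.10, 25.0, 400.0, True),
]
best = min(offers, key=placement_score)
print(f"Route workload to: {best.name}")
```

The point of the sketch is the shape of the problem: every accelerator type reduces to a single comparable score, so hardware choice becomes a continuous market decision rather than an upfront architectural commitment.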

The result: H100s, H200s, MI300X/MI325Xs (coming soon), and next-generation accelerators compete in a single market on pure price-performance. For developers, hardware complexity disappears: models are routed to the best-fit resources, transparently. This isn’t just cost reduction; it’s a fundamental redefinition of how AI gets built and deployed.

Unlocking AI Innovation

The launch of the Large Scale Inference Batch API marks the first step in a larger transformation of AI infrastructure economics. Our roadmap targets key inefficiencies across the stack with innovations designed to unlock dramatically better cost-performance:

  • Long-running, online inference support for persistent, low-latency applications
  • Expanded model and hardware compatibility across the API and marketplace
  • Additional multi-cluster-level optimizations to further reduce compute overhead for large workloads

Start saving money today!

Ready to operate in a world where inference economics no longer constrain your ambitions? Reach out to the SF Compute team to get a quote for your use case.
