When selling GPUs as a commodity meets the fastest inference engine, cost savings can skyrocket.

Customer

- 20+ AI models
- 80% cheaper batch inference
- 1 unified Batch API

"Modular’s team is world class. Their stack slashed our inference costs by 80%, letting our customer dramatically scale up. They’re fast, reliable, and real engineers who take things seriously. We’re excited to partner with them to bring down prices for everyone, to let AI bring about wide prosperity."

Evan Conrad

CEO - San Francisco Compute

Problem

AI teams didn't just need cheaper GPUs — they needed smarter infrastructure. Inference at scale was too expensive, and throwing more hardware at the problem wasn't working.

San Francisco Compute (SF Compute) operates a GPU marketplace that gives AI companies access to large-scale GPU clusters by the hour for training and inference workloads, without forcing them into expensive long-term contracts.

As a platform serving everyone from AI startups to research labs, SF Compute was seeing growing demand from customers who not only needed raw compute power but also wanted competitively priced AI inference. That led to a lightbulb moment: what if SF Compute could offer the world's best AI infrastructure, on its own compute marketplace, at the most competitive price in the market?

Solving this required more than adding GPUs or optimizing scheduling; it demanded a complete reimagining of how AI inference could be accelerated at the hardware level, scaling optimized batch workloads onto a GPU marketplace that dynamically allocates compute by the hour.

Solution

SF Compute and Modular partnered to build the world's cheapest, large-volume batch API across leading industry AI models - we call it the SF Compute Large Scale Inference API, powered by Modular. It's a high-throughput, asynchronous batch inference interface that supports more than 20 state-of-the-art models across language, vision, and multimodal domains, ranging from efficient 7B-parameter models to 600B+ frontier systems including DeepSeek-R1, Llama 3.3 70B, QwQ, Qwen, InternVL, and many more. By combining Modular's high-efficiency inference stack with SF Compute's real-time spot market, the API delivers inference at up to 80% lower cost than the current market baseline.
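
To make the asynchronous batch workflow concrete, here is a minimal Python sketch of the usual submit-poll-fetch pattern for a batch inference API. The base URL, field names, and job lifecycle below are illustrative assumptions, not the documented SF Compute interface; see the Large Scale Inference API docs for the actual request format.

```python
import json
import time

import requests

API_BASE = "https://batch.example.com/v1"  # hypothetical endpoint, not the real API URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Each line of the batch file is one independent chat-completion request.
batch_jsonl = "\n".join(
    json.dumps({
        "custom_id": f"req-{i}",
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
    })
    for i, prompt in enumerate(["Summarize GPU spot markets.", "What is batch inference?"])
)

# Upload the request file, then create an asynchronous batch job from it.
upload = requests.post(
    f"{API_BASE}/files", headers=HEADERS,
    files={"file": ("batch.jsonl", batch_jsonl)},
).json()
job = requests.post(
    f"{API_BASE}/batches", headers=HEADERS,
    json={"input_file_id": upload["id"]},
).json()

# Poll until the job settles, then download the results file.
while True:
    status = requests.get(f"{API_BASE}/batches/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(30)

if status["status"] == "completed":
    print(requests.get(f"{API_BASE}/files/{status['output_file_id']}", headers=HEADERS).text)
```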

By combining SF Compute’s unified cloud marketplace with Modular’s hardware abstraction platform, we’ve built true fungibility across compute vendors. This required solving several uniquely challenging technical hurdles:

- Hardware unification: Modular's Platform provides unified model, cluster, serving, and kernel development APIs, delivering industry-leading performance without sacrificing portability across heterogeneous hardware platforms.

- Cloud unification: SF Compute’s platform abstracts physical infrastructure behind a programmable spot market and dynamic scheduler, enabling seamless allocation across heterogeneous compute backends.

- Intelligent placement & routing: Together, they automatically allocate workloads using a dynamic workload router that adapts to current infrastructure load, bandwidth availability, model performance profiles, and real-time market pricing. For developers, that means faster inference, no tuning, and no hardware headaches (a simplified sketch of the routing idea follows this list).
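
To make the routing idea more tangible, here is a tiny, purely illustrative Python sketch of how a router might score candidate GPU pools on throughput per dollar while discounting heavily loaded pools. The data structures, numbers, and weighting are hypothetical and are not SF Compute's or Modular's actual implementation, which also accounts for factors such as bandwidth availability.

```python
from dataclasses import dataclass

@dataclass
class GpuPool:
    name: str              # e.g. "H100-spot-us-west"
    spot_price: float      # $/GPU-hour from the live marketplace
    utilization: float     # current load, 0.0-1.0
    tokens_per_sec: float  # measured throughput for the target model on this hardware

def score(pool: GpuPool) -> float:
    """Higher is better: throughput per dollar, discounted on busy pools."""
    if pool.utilization >= 0.95:  # effectively full, never route here
        return 0.0
    headroom = 1.0 - pool.utilization
    return (pool.tokens_per_sec / pool.spot_price) * headroom

def route(pools: list[GpuPool]) -> GpuPool:
    """Pick the best-fit pool for the next batch shard."""
    return max(pools, key=score)

pools = [
    GpuPool("H100-spot", spot_price=1.40, utilization=0.70, tokens_per_sec=2400),
    GpuPool("A100-spot", spot_price=0.90, utilization=0.40, tokens_per_sec=1300),
    GpuPool("MI300X-spot", spot_price=1.10, utilization=0.55, tokens_per_sec=2100),
]
print(route(pools).name)  # the pool with the best price-performance right now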

The result: H100s, A100s, MI300Xs, and next-gen accelerators compete in a single market on pure price-performance. For developers, hardware complexity disappears - models are routed to the best-fit resources, transparently. This isn’t just cost reduction – it’s a fundamental redefinition of how AI gets built and deployed.

Results

Under the hood, SF Compute provides real-time access to thousands of NVIDIA H200, H100, A100, and soon AMD MI300X GPUs via its dynamic pricing marketplace. Spot rates are often below $1.40/hour - far lower than the typical $6–$8/hour on-demand rates - and even lower than the 1–3 year locked-in pricing offered by traditional clouds.
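
As a back-of-the-envelope check on what that gap means in practice, the short sketch below compares the quoted spot rate with the midpoint of the $6-$8/hour on-demand range for a hypothetical 8-GPU node running for a day; the node size and duration are illustrative, not a quoted benchmark.

```python
# Rough GPU-hour cost comparison using the rates quoted above.
SPOT_RATE = 1.40       # $/GPU-hour on the spot marketplace
ON_DEMAND_RATE = 7.00  # $/GPU-hour, midpoint of the typical $6-$8 range

gpus, hours = 8, 24    # e.g. one 8-GPU node for one day (illustrative)
spot_cost = SPOT_RATE * gpus * hours
on_demand_cost = ON_DEMAND_RATE * gpus * hours

print(f"spot: ${spot_cost:,.2f}, on-demand: ${on_demand_cost:,.2f}")
print(f"savings: {1 - spot_cost / on_demand_cost:.0%}")  # roughly 80%
```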

The Modular Platform complements this with compiler-native execution, speculative decoding, and the world's most performant hardware-agnostic AI kernels, routinely achieving 90%+ GPU utilization. With our combined powers, this stack delivers up to 80% lower cost per token compared to existing providers 🚀.

This isn’t just competitive pricing - it’s a structural shift in how inference is monetized. While incumbents rely on fixed, over-provisioned infrastructure to preserve margins, we optimize for volume, efficiency, and developer value - collapsing the cost stack and returning the gains to users.

Below is a list of all supported models available for batch inference. Ready to deploy your model for up to 80% less? Start using the Batch Inference API. Get Early Access Today!

| Model ID | Hugging Face Name | Size (params) |
| --- | --- | --- |
| DeepSeek-R1 | deepseek-ai/DeepSeek-R1 | 671B |
| DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | 671B |
| DeepSeek-R1-Distill-Llama-70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 70B |
| Llama-3-70B-chat | meta-llama/Llama-3-70b-chat-hf | 70B |
| Llama-3.1-405B-Instruct | meta-llama/Meta-Llama-3.1-405B-Instruct | 405B |
| Llama-3.1-70B-Instruct | meta-llama/Meta-Llama-3.1-70B-Instruct | 70B |
| Llama-3.1-8B-Instruct | meta-llama/Meta-Llama-3.1-8B-Instruct | 8B |
| Llama-3.3-70B-Instruct | meta-llama/Meta-Llama-3.3-70B-Instruct | 70B |
| Llama-4-Scout-17B-Instruct | meta-llama/Llama-4-Scout-17B-16E-Instruct | 109B |
| Llama-4-Maverick-17B-128E-Instruct | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 400B |
| Llama 3.2 Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | 11B |
| Mistral-7B-Instruct | mistralai/Mistral-7B-Instruct-v0.1 | 7B |
| Mixtral-8x7B-Instruct | mistralai/Mixtral-8x7B-Instruct-v0.1 | 56B |
| Mistral-Small-24B-Instruct | mistralai/Mistral-Small-24B-Instruct-2501 | 24B |
| Qwen-2.5-72B-Instruct | Qwen/Qwen2.5-72B-Instruct | 72.7B |
| Qwen-2.5-7B-Instruct | Qwen/Qwen2.5-7B-Instruct | 7B |
| Qwen 3-14B | Qwen/Qwen3-14B | 14.8B |
| Qwen 3-8B | Qwen/Qwen3-8B | 8.2B |
| QwQ-32B | Qwen/QwQ-32B | 32.5B |
| InternVL3-9B | OpenGVLab/InternVL3-9B | 9B |
| InternVL3-14B | OpenGVLab/InternVL3-14B | 14B |
| InternVL3-38B | OpenGVLab/InternVL3-38B | 38B |
| InternVL3-78B | OpenGVLab/InternVL3-78B | 78B |
| Gemma-3-12B-in-chat | google/gemma-3-12b-it | 12B |
| Gemma-3-27B-in-chat | google/gemma-3-27b-it | 27B |

Want a model added? Talk to SF Compute — they will respond in hours, not weeks.

About

San Francisco Compute

SF Compute operates a revolutionary GPU marketplace that provides on-demand access to large-scale GPU clusters, enabling AI companies to rent exactly the compute capacity they need - whether it's 1,024 GPUs for a week or 8 GPUs for 2 hours. They are fundamentally transforming how AI infrastructure is consumed by creating a liquid market for compute as a commodity.

Read more about their flexible infrastructure approach on SF Compute's Inference page.

Start building with Modular

Talk to us