INFERENCE SOLUTIONS
Shared Endpoints. Fast performance.
Our compute, our infrastructure, our GPUs - rapidly experiment with the latest models on a $/token basis. The easiest way to integrate and get started fast. Pay only for what you use.
Why choose Shared Endpoints?
Shared endpoints on $/token pricing
Pay only for what you use. Shared endpoints scale to zero when idle and burst to meet demand - no reserved capacity, no minimum spend. Ideal for prototyping, dev/test, and variable-traffic production workloads where predictable per-token pricing beats committed compute.
SCALE TO ZERO. PAY PER TOKEN. NO MINIMUMS.
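To see how pure per-token billing pencils out, here is a back-of-the-envelope sketch in Python. The rates and traffic numbers are hypothetical placeholders, not Modular's published pricing - plug in the rates from your console.

```python
# Back-of-the-envelope estimate for a workload billed purely per token.
# NOTE: the prices below are hypothetical placeholders, not Modular's published rates.
PRICE_PER_1M_INPUT_TOKENS = 0.50   # USD, hypothetical
PRICE_PER_1M_OUTPUT_TOKENS = 1.50  # USD, hypothetical

def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly spend with no reserved capacity and no minimums."""
    daily = (
        requests_per_day * input_tokens / 1e6 * PRICE_PER_1M_INPUT_TOKENS
        + requests_per_day * output_tokens / 1e6 * PRICE_PER_1M_OUTPUT_TOKENS
    )
    return daily * 30

# A bursty dev/test workload: 2,000 requests/day, ~800 tokens in, ~300 tokens out.
print(f"${monthly_cost(2_000, 800, 300):,.2f}/month")  # ~$51/month, and $0 when idle
```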
NVIDIA and AMD GPU selection
Choose the GPU that fits your workload's price-performance profile. MAX compiles natively for both NVIDIA and AMD - switch between vendors as pricing and availability shift. No other shared inference endpoint offers AMD. That's a pricing lever only Modular can give you.
GPU VENDOR CHOICE = PRICING LEVERAGE
Forward-deployed engineers
Your dedicated Modular engineer profiles your production traffic, identifies latency bottlenecks, writes custom MAX architectures and Mojo kernels, and pushes optimizations to your deployment. Not quarterly business reviews - weekly optimization cycles. Not support tickets - engineers who ship code to your stack.
CUSTOM ENGINEERING, NOT GENERIC OPTIMIZATION
Custom model deployment
Bring your own model - fine-tuned, custom architecture, or proprietary weights - and we convert it to a highly optimized MAX graph. Upload it, and Modular Cloud compiles and serves it at the same $/token pricing. Custom Mojo kernels are available for novel architectures, and every deployment gets an OpenAI-compatible API endpoint out of the box (see the example below).
ANY MODEL. CUSTOM KERNELS. MANAGED INFRA.
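Because endpoints speak the OpenAI API, existing clients work unchanged. Here is a minimal sketch using the openai Python SDK; the base URL, model name, and environment variable are illustrative placeholders, not documented values - use the endpoint details from your Modular console.

```python
# Minimal sketch: calling a shared endpoint through its OpenAI-compatible API.
# The base_url, env var, and model name are illustrative placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-modular-endpoint.com/v1",  # placeholder URL
    api_key=os.environ["MODULAR_API_KEY"],                   # hypothetical env var
)

response = client.chat.completions.create(
    model="deepseek-r1",  # any hosted or custom model exposed by your endpoint
    messages=[{"role": "user", "content": "Summarize MAX in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```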
Compiler-optimized, not wrapper-optimized
Other providers wrap vLLM or TensorRT and call it optimization. Modular's MLIR compiler fuses the entire inference path - graph, runtime, memory, scheduling - into a single compiled unit. Compilation is a deeper lever than configuration. That's why MAX is up to 2x faster than wrapper-based stacks.
FULL GRAPH COMPILATION VS. RUNTIME TUNING
90% smaller runtime, faster scaling
MAX runtime is under 700MB. Alternatives ship 7GB+. That means new replicas start in seconds, not minutes. Model swaps are near-instant. Storage and bandwidth costs drop dramatically at scale. Cold starts that feel warm.
<700MB VS 7GB+. 10X FASTER COLD STARTS.
Top AI models, or your custom ones
Our forward-deployed engineers optimize every deployment for SOTA performance - whether you're running a top open model or a custom model.
Build with popular models:
- DeepSeek R1 (LLM, Coding / Reasoning): Frontier-class models (V3, R1) built for complex reasoning, coding, and math, at dramatically lower inference cost than comparable proprietary models.
- Kimi K2.5 (LLM, Coding / Multimodal): Moonshot AI's 1T-parameter MoE model optimized for agentic tasks, tool use, and coding.
- MiniMax (LLM, Coding): Large-scale MoE model (456B params) optimized for long-context tasks up to 1M tokens.
Modular vs. the competition
| Modular | Alternatives |
|---|---|
| Hardware Portability: NVIDIA + AMD GPUs in the same deployment, meaning more options and lower TCO. | Vendor Lock-In: NVIDIA-only. Zero GPU vendor choice across every managed cloud competitor. |
| Embedded Performance Engineering: Forward-deployed engineers who write custom Mojo kernels for your workloads. | Generic Platform Optimizations: No per-customer engineering. No dedicated engineers on your account. Generic optimizations applied everywhere. |
| Unified GPU Pricing: Simple $/token pricing for shared endpoints and $/minute for dedicated ones. | Blackbox Infrastructure & Pricing: No visibility into quantization, batching, or what's been done to your model. You're paying for a black box. |
| Vertically Integrated Stack: SOTA dynamic cloud orchestration. Compiler-aware auto-scaling: MAX understands model memory, batching state, and KV-cache. Mojo provides portable SOTA kernels. | Runtime Wrappers: CUDA research (ATLAS, Megakernel). vLLM/TensorRT wrappers. Runtime optimization, not compilation. |
| 10x Lighter Runtime: <700MB runtime. 10x faster cold starts. Simpler operations. | Multi-GB Runtime: 7GB+ runtimes. Slow cold starts. Heavy container overhead. |
Compare deployment options
| | Self-Hosted | Our Cloud | Your Cloud |
|---|---|---|---|
| Support | Active community and fast responses on Discord, Discourse, and GitHub | Dedicated support and engineering team, standard and custom SLAs/SLOs | Dedicated support and engineering team, standard and custom SLAs/SLOs |
| Models | Hundreds of models in our model repo; view top performers | Top performers available for dedicated endpoints, plus custom model deployment | Top performers available for dedicated endpoints, plus custom model deployment |
| Platform access | Deploy MAX and Mojo yourself, anywhere you want. Build with open source | Modular Platform with a console for deploying, scaling, and managing your AI endpoints | Modular Platform with a console for deploying, scaling, and managing your AI endpoints |
| Scalability | Scale on your own with the MAX container | Auto-scaling, scale to zero, burst capacity | Auto-scaling, proven at Fortune 500 scale |
| Deployment location | Self-deployed, anywhere | Our cloud | Your cloud or hybrid |
| Compute hardware | NVIDIA, AMD, Apple Silicon, and more on hardware you own | NVIDIA & AMD GPUs in our cloud; more hardware types coming soon | NVIDIA & AMD GPUs; Intel, AMD & ARM CPUs, deployed in your cloud |
| Custom kernels | Your engineers write custom kernels for your workloads | Modular engineers tune kernels for your workloads | Modular engineers write custom kernels for your workloads |
| Forward-deployed engineers | Available with a support plan | Included | Included, working in your environment |
| Security & compliance | SOC 2 Type I certified | SOC 2 Type I certified (Type II in progress) | SOC 2 Type I certified (Type II in progress) |
| Billing & pricing | Free | Per token (shared); per minute (dedicated) | Per minute deployed; use your AWS/GCP/Azure credits and commits |
| Enterprise contract | | | |
Get started with Modular
Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.
- Distributed, large-scale online inference endpoints
- Highest performance to maximize ROI and minimize latency
- Deploy in Modular's cloud or your own
- View all features with a custom demo

Book a demo
Talk with our sales lead Jay!
30-minute demo. Evaluate with your workloads. Ask us anything.