
Bring your own cloud (BYOC)

Your Cloud, Our Engineers, Any GPU.

Built on BentoCloud’s production-proven BYOC infrastructure - now powered by MAX’s compiler and Mojo’s custom kernels. Modular orchestrates the stack while inference runs in your VPC. You own the hardware, data, and cloud credits.

Diagram: Modular's control plane (scheduling, monitoring, scaling, endpoint management) connects to a cloud/VPC environment running the MAX Runtime, compiled models, and Mojo kernels on NVIDIA, AMD, or Apple GPUs.

How Modular BYOC works

Built on BentoCloud’s battle-tested architecture: Modular’s control plane (outside your VPC) handles endpoint management, scaling policies, monitoring dashboards, and the model registry. Your data plane (inside your VPC) runs MAX containers with compiled models. Inference inputs and outputs never leave your network.

Data stays in your VPC. Always. Proven at Fortune 500 scale.
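
What the split means in practice, as a minimal sketch: the inference endpoint resolves only inside your network, so prompts and completions never cross the control plane. MAX serves an OpenAI-compatible API; the hostname, port, and model name below are illustrative placeholders, not Modular's published defaults.

```python
import requests

# Hypothetical in-VPC endpoint: this DNS name resolves only inside your
# private network. The control plane sees health, scaling, and usage
# metadata; it never sees the request or response payloads below.
DATA_PLANE_URL = "http://inference.internal.example.vpc:8000/v1/chat/completions"

resp = requests.post(
    DATA_PLANE_URL,
    json={
        "model": "llama-3.1-8b-instruct",  # placeholder model name
        "messages": [{"role": "user", "content": "Summarize our Q3 numbers."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```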

Why choose Modular in your cloud?

  • Modular Compute Unit (MCU) pricing

    One billing metric that normalizes across GPU types, model sizes, and workload patterns. MCUs make it simple to forecast costs whether you’re running on NVIDIA H100s or AMD MI300Xs. Transparent, predictable, cloud-credit compatible. (A worked cost sketch follows this list.)

    One metric. Any GPU. Predictable costs.

  • Production-proven IaC setup (via BentoCloud)

    Modular provisions infrastructure in your cloud account using BentoCloud’s battle-tested IaC automation: K8s clusters, container registries, networking, storage. Already running at scale for Fortune 500 companies. Works with AWS, GCP, Azure, and OCI. SOC 2 Type II certified. (An illustrative IaC sketch follows this list.)

    Proven at scale. SOC 2 Type II. Your account.

  • Forward-deployed engineers on your account

    Your dedicated Modular engineers don’t just monitor dashboards. They profile your inference workloads, identify bottlenecks, write custom Mojo kernels, and push optimizations directly to your deployment. Continuous performance improvement, not break-fix support.

    Engineers who ship code to your stack

  • GPU portability inside your VPC

    Run NVIDIA H100s, AMD MI300Xs, or both in the same BYOC deployment. MAX compiles for each target automatically. If AMD offers better spot pricing in your region, shift workloads without rewriting anything. No other BYOC provider supports multi-vendor GPU deployments. (The cost sketch after this list shows the vendor comparison.)

    NVIDIA + AMD in the same deployment

  • Auto-scaling with compiler awareness

    BYOC auto-scaling isn’t generic K8s HPA. MAX-aware scaling understands model memory requirements, continuous batching state, and KV-cache utilization to make smarter scale-up and scale-down decisions. Scale to zero when idle. Burst to meet demand. (A toy sketch of the idea follows this list.)

    Compiler-aware, not just CPU/memory-aware

  • Use your cloud credits and commits

    BYOC runs in your cloud account. AWS reserved instances, GCP committed use discounts, Azure reservations, startup credits - all apply directly to your BYOC inference spend. MCU pricing layers on top. No double-billing.

    Cloud credits + MCU pricing
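
As promised above, a worked sketch of MCU-style cost forecasting. The per-GPU MCU rates and the $/MCU price below are invented placeholders; actual rates come from your Modular contract. The point is that one metric spans vendors, which also makes the H100-vs-MI300X comparison from the portability bullet a two-line calculation.

```python
# Illustrative MCU cost forecast. All rates are assumed placeholder
# values, not Modular's published pricing.
MCU_PER_GPU_HOUR = {"nvidia-h100": 10.0, "amd-mi300x": 8.0}  # assumed
PRICE_PER_MCU = 0.05  # assumed $/MCU

def monthly_cost(gpu: str, count: int, hours: float = 730.0) -> float:
    """One metric regardless of vendor: GPU-hours -> MCUs -> dollars."""
    return MCU_PER_GPU_HOUR[gpu] * count * hours * PRICE_PER_MCU

print(f"4x H100:   ${monthly_cost('nvidia-h100', 4):,.0f}/mo")  # $1,460/mo
print(f"4x MI300X: ${monthly_cost('amd-mi300x', 4):,.0f}/mo")   # $1,168/mo
```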
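
Modular runs the provisioning for you through BentoCloud's automation; you never write it yourself. Purely to illustrate the kind of resources involved, here is a Pulumi-style sketch of an AWS variant. Resource names are placeholders, and this is not Modular's actual template.

```python
import pulumi
import pulumi_aws as aws
import pulumi_eks as eks

# Illustrative only: the categories of resources BYOC provisioning
# creates in your account (cluster, registry, storage).
cluster = eks.Cluster("byoc-inference")               # managed K8s cluster
registry = aws.ecr.Repository("byoc-max-containers")  # container registry
models = aws.s3.Bucket("byoc-model-artifacts")        # model/object storage

pulumi.export("kubeconfig", cluster.kubeconfig)
pulumi.export("registry_url", registry.repository_url)
```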
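
And a toy sketch of the compiler-aware scaling idea: the decision consumes signals the runtime can expose, such as KV-cache utilization and batch queue depth, rather than CPU and memory alone. The signal names and thresholds here are assumptions, not Modular's scheduler.

```python
from dataclasses import dataclass

@dataclass
class ReplicaStats:
    kv_cache_utilization: float  # fraction of KV-cache memory in use
    queued_requests: int         # requests waiting for a batch slot

def desired_replicas(stats: list[ReplicaStats], current: int) -> int:
    """Toy scale decision driven by runtime signals, not CPU/memory."""
    if not stats:
        return current
    avg_kv = sum(s.kv_cache_utilization for s in stats) / len(stats)
    queued = sum(s.queued_requests for s in stats)
    if avg_kv > 0.85 or queued > 8 * current:
        return current + 1          # KV-cache pressure or backlog: scale up
    if avg_kv < 0.30 and queued == 0:
        return max(current - 1, 0)  # idle: scale down, to zero if needed
    return current

print(desired_replicas([ReplicaStats(0.91, 12), ReplicaStats(0.88, 9)], 2))  # 3
```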

Who is Modular BYOC for?

  • Fastest time to production

    Compliance stops being the long pole: inference data stays in your VPC, in your region, under your compliance policies. HIPAA, GDPR, SOC 2, FedRAMP - BYOC meets data residency requirements while Modular handles the infrastructure complexity, so new products ship sooner.

    EX: STARTUPS, NEW AI PRODUCTS, RAPID PROTOTYPING

  • Companies optimizing inference economics

    $/minute pricing + AMD GPU option + forward-deployed engineering creates a cost structure no competitor can match. Your engineer finds the optimal GPU, batch size, and kernel configuration for your workload.

    EX: HIGH-VOLUME INFERENCE, COST-SENSITIVE AI

  • AI-native companies with custom models

    Deploy fine-tuned or proprietary models on managed infrastructure with custom Mojo kernel support. Same capability as self-hosted, but without the operational overhead. Compiler optimization and GPU portability baked in.

    EX: CUSTOM MODEL SERVING AT SCALE

  • Teams transitioning from proprietary APIs

    Moving from OpenAI or Anthropic APIs to open-source models? Modular Cloud's OpenAI-compatible endpoints make the switch seamless. Same API contract. Better economics. Forward-deployed engineers help tune the replacement. (See the client example after this list.)

    EX: GPT TO LLAMA/QWEN MIGRATION
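
The migration described in the last item is mostly a configuration change, assuming an OpenAI-compatible Modular endpoint. The base URL, API key, and model name below are placeholders; everything else is the unmodified OpenAI Python client.

```python
from openai import OpenAI

# Same client, same API contract; only base_url and model change.
# Endpoint URL, key, and model name are placeholders for illustration.
client = OpenAI(
    base_url="https://your-endpoint.example.com/v1",
    api_key="YOUR_MODULAR_API_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # replaces e.g. "gpt-4o"
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```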

Modular vs. the competition

  • Hardware Portability vs. Vendor Lock-In

    Modular: GPU portability. NVIDIA + AMD in the same deployment, meaning more options and lower TCO.

    Alternatives: NVIDIA-only. Zero GPU vendor choice across every managed cloud competitor.

  • Embedded Performance Engineering vs. Generic Platform Optimizations

    Modular: Forward-deployed engineers who write custom Mojo kernels, on top of BentoCloud’s proven, scalable operations.

    Alternatives: No per-customer engineering. No dedicated engineers on your account. Generic optimizations applied everywhere.

  • Unified GPU Pricing vs. Black-Box Infrastructure and Pricing

    Modular: Simple pricing: $/token for shared endpoints, $/minute for dedicated ones.

    Alternatives: No visibility into quantization, batching, or what's been done to your model. You're paying for a black box.

  • Vertically Integrated Stack vs. Runtime Wrappers

    Modular: SOTA dynamic cloud orchestration. Compiler-aware auto-scaling: MAX understands model memory, batching state, and KV-cache utilization. Mojo provides portable SOTA kernels.

    Alternatives: CUDA research projects (ATLAS, Megakernel) and vLLM/TensorRT wrappers. Runtime optimization, not compilation.

  • 10x Lighter Runtime vs. Multi-GB Runtimes

    Modular: <700MB runtime. 10x faster cold starts. Simpler operations.

    Alternatives: 7GB+ runtimes. Slow cold starts. Heavy container overhead.

Compare deployment options

  Support
    Self-Hosted: Active community and fast responses on Discord, Discourse, and GitHub
    Our Cloud: Dedicated support, engineering team, standard and custom SLAs/SLOs
    Your Cloud: Dedicated support, engineering team, standard and custom SLAs/SLOs

  Models
    Self-Hosted: Hundreds of models in our model repo; view top performers
    Our Cloud: Top performers available for dedicated endpoints, plus custom model deployment
    Your Cloud: Top performers available for dedicated endpoints, plus custom model deployment

  Platform access
    Self-Hosted: Deploy MAX and Mojo yourself, anywhere you want. Build with open source
    Our Cloud: Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints
    Your Cloud: Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints

  Scalability
    Self-Hosted: Scale on your own with the MAX container
    Our Cloud: Auto-scaling, scale to zero, burst capacity
    Your Cloud: Auto-scaling, proven at Fortune 500 scale

  Deployment location
    Self-Hosted: Self-deployed, anywhere
    Our Cloud: Our cloud
    Your Cloud: Your cloud or hybrid

  Compute hardware
    Self-Hosted: NVIDIA, AMD, and Apple Silicon and more, on hardware you own
    Our Cloud: NVIDIA and AMD GPUs in our cloud; more hardware types coming soon
    Your Cloud: NVIDIA and AMD GPUs; Intel, AMD, and ARM CPUs; deployed in your cloud

  Custom kernels
    Self-Hosted: Your engineers write custom kernels for your workloads
    Our Cloud: Modular engineers tune kernels for your workloads
    Your Cloud: Modular engineers write custom kernels for your workloads

  Forward-deployed engineers
    Self-Hosted: Available with a support plan
    Our Cloud: Included
    Your Cloud: Included; working in your environment

  Security & compliance
    Self-Hosted: SOC 2 Type I certified
    Our Cloud: SOC 2 Type I certified (Type II in progress)
    Your Cloud: SOC 2 Type I certified (Type II in progress)

  Billing & pricing
    Self-Hosted: Free
    Our Cloud: Per token (shared); per minute (dedicated)
    Your Cloud: Per minute deployed; use your AWS/GCP/Azure credits and commits

  License
    Enterprise Contract

    Get started with Modular

    • Request a demo

      Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

      • Distributed, large-scale online inference endpoints

      • Highest performance to maximize ROI and minimize latency

      • Deploy in Modular cloud or your cloud

      • View all features with a custom demo

      Book a demo

      Talk with our sales lead, Jay!

      A 30-minute demo. Evaluate with your workloads. Ask us anything.