Self-hosted
Full Control, Full Performance, Full Engineering.
Deploy MAX and Mojo for free on your own infrastructure. Your hardware, your cloud, your rules. Built on open-source infrastructure, you control the deployment: profile workloads, write custom models and kernels, optimize every layer of your AI stack, and contribute back.

Why self-host with Modular?
A vertically integrated AI software stack
MAX controls the full inference path - from model graph through MLIR compilation to kernel execution on hardware. No stitching together vLLM, TensorRT, and CUDA. One stack, one container, one team to work with. Every layer is optimized together, not bolted on.
FROM MODEL GRAPH TO GPU KERNEL, ONE UNIFIED STACK
Run on compute you own
NVIDIA, AMD, Apple Silicon GPUs along with Intel, ARM, AMD CPUs - MAX compiles and runs on all of them from the same container image. Use your existing reserved instances, cloud credits, and committed use discounts. AWS, GCP, Azure, OCI, on-prem. Your hardware, your cloud account.
ANY GPU. ANY CLOUD. YOUR CLOUD CREDITS.
Write your own custom models
MAX's PyTorch-like model APIs let you define and deploy proprietary architectures - novel attention mechanisms, state-space models, hybrid designs. Compile them for any supported hardware without maintaining separate codebases. Your model is your moat. Keep it yours.
YOUR ARCHITECTURE. ANY HARDWARE. NO VENDOR DEPENDENCY.
Deploy custom kernels in Mojo
Need a novel attention pattern? Custom quantization scheme? Write it in Mojo with Python-compatible syntax and get better-than-CUDA performance across NVIDIA, AMD, and Apple Silicon. 20-30 lines of Mojo replace hundreds of lines of CUDA - and they run on every accelerator.
20-30 LINES OF MOJO VS HUNDREDS OF LINES OF CUDA
Production-grade serving built in
Continuous batching, KV-cache optimization, speculative decoding, automatic hardware detection, and OpenAI-compatible API endpoints - all compiled into a single container under 700MB. No assembly required: `pip install modular`, then `max serve`.
PIP INSTALL MODULAR. MAX SERVE. YOU'RE LIVE.
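The two commands above are the whole flow. As a sketch of what that looks like end to end (the `--model-path` flag, port 8000, and the model name are illustrative assumptions; the OpenAI-compatible chat-completions route follows from the text):

```shell
# Install the Modular package (includes MAX and Mojo)
pip install modular

# Start an OpenAI-compatible inference server
# (model path shown here is a placeholder for any supported model)
max serve --model-path <huggingface-model-id>

# Query it with the standard OpenAI chat-completions route
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<huggingface-model-id>",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Because the endpoint speaks the OpenAI API, existing OpenAI client SDKs can point at it by changing only the base URL.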
Join the open-source AI revolution
MAX is open source. Mojo is open source. Inspect the code, contribute to the project, build on top of it. No black-box runtime, no proprietary lock-in, no trust-us performance claims. You can read every kernel, every optimization pass, every line of the serving infrastructure.
OPEN SOURCE. OPEN MODELS. OPEN KERNELS. NO BLACK BOXES.
Who self-hosts with Modular?
If you’ve invested in NVIDIA and AMD hardware, Modular is the only inference platform that runs on both from one container. Maximize utilization across your entire fleet.
Defense, intelligence, and regulated financial services need inference that runs without any external connectivity. Self-hosted MAX is a single container with zero call-home dependencies.
Teams running NVIDIA, AMD, and custom-silicon clusters for inference. No other self-hosted platform runs GPU-accelerated inference across hardware pools this diverse.
If your moat is a proprietary model architecture, you need self-hosted inference with kernel-level programmability. Mojo lets you write custom ops that compile for any hardware target.
Modular vs. the competition
- Hardware Portability
One container runs on NVIDIA, AMD, and Apple Silicon. Same model binary, any GPU.
- Embedded Performance Engineers
Forward-deployed engineers embedded in your team. Custom Mojo kernels for your workloads. Hands-on optimization, not support tickets.
- Lightweight Runtime
<700MB container. 10x faster cold starts. Dramatically simpler container orchestration.
- Kernel-Level Programmability
Mojo kernel programmability. Write custom GPU ops for novel architectures.
- True Air-Gapped Deployment
Full air-gap support. Zero external dependencies. No control plane.
- Alternatives
- Vendor Lock-In
You deploy separate infrastructure stacks to support NVIDIA, AMD, Apple, Intel, ARM & more. The complexity multiplies with every hardware target.
- Config Tuning Support
“Dedicated engineers” who tune vLLM/TensorRT configs remotely.
- Heavy Runtime Stack
7GB+ runtime. Slow cold starts. Heavy storage and bandwidth overhead at scale.
- Framework Configs Only
No kernel-level access. Limited to framework-level model configs.
- Fragmented infrastructure
Relying on vLLM, SGLang, and others means a tangle of interlocking dependencies with no unified support.
Compare deployment options
| | Self-Hosted | Our Cloud | Your Cloud |
|---|---|---|---|
| Support | Active community and fast responses in Discord, Discourse, and GitHub | Dedicated support, engineering team, standard and custom SLAs/SLOs | Dedicated support, engineering team, standard and custom SLAs/SLOs |
| Models | Hundreds of models in our model repo; view top performers | Top performers available for dedicated endpoints, custom model deployment | Top performers available for dedicated endpoints, custom model deployment |
| Platform access | Deploy MAX and Mojo yourself, anywhere you want. Build with open source | Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints | Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints |
| Scalability | Scale on your own with the MAX container | Auto-scaling, scale to zero, burst capacity | Auto-scaling, proven at Fortune 500 scale |
| Deployment location | Self-deployed, anywhere | Our cloud | Your cloud or hybrid |
| Compute hardware | NVIDIA, AMD, Apple Silicon, and more on hardware you own | NVIDIA and AMD GPUs in our cloud; more hardware types coming soon | NVIDIA and AMD GPUs; Intel, AMD, and ARM CPUs, deployed in your cloud |
| Custom kernels | Your engineers write custom kernels for your workloads | Modular engineers tune kernels for your workloads | Modular engineers write custom kernels for your workloads |
| Forward Deployed Engineers | Available with support plan | Included | Included, working in your environment |
| Security & Compliance | SOC 2 Type I certified | SOC 2 Type I certified (Type II in progress) | SOC 2 Type I certified (Type II in progress) |
| Billing & Pricing | Free | Per token (shared); per minute (dedicated) | Per minute deployed; use your AWS/GCP/Azure credits and commits |
| Enterprise Contract | | | |
Get started with Modular
Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.
Distributed, large-scale online inference endpoints
Highest performance to maximize ROI and minimize latency
Deploy in Modular cloud or your cloud
View all features with a custom demo

Book a demo
Talk with our sales lead, Jay!
30min demo. Evaluate with your workloads. Ask us anything.
