
Self-hosted

Full Control, Full Performance, Full Engineering.

Deploy MAX and Mojo for free on your own infrastructure. Your hardware, your cloud, your rules. You control a deployment built on open-source infrastructure - profile workloads, write custom models and kernels, and optimize every layer of your AI stack. Contribute back.

Diagram: your application in the cloud uses MAX for high-performance inference and Mojo for systems-level performance across GPU, CPU, and xPU hardware.

Why self-host with Modular?

  • A vertically integrated AI software stack

    MAX controls the full inference path - from model graph through MLIR compilation to kernel execution on hardware. No stitching together vLLM, TensorRT, and CUDA. One stack, one container, one team to work with. Every layer is optimized together, not bolted on.

    FROM MODEL GRAPH TO GPU KERNEL, ONE UNIFIED STACK

  • Run on compute you own

    NVIDIA, AMD, and Apple Silicon GPUs, along with Intel, ARM, and AMD CPUs - MAX compiles and runs on all of them from the same container image. Use your existing reserved instances, cloud credits, and committed-use discounts. AWS, GCP, Azure, OCI, or on-prem. Your hardware, your cloud account.

    ANY GPU. ANY CLOUD. YOUR CLOUD CREDITS.

  • Write your own custom models

    MAX's PyTorch-like model APIs let you define and deploy proprietary architectures - novel attention mechanisms, state-space models, hybrid designs. Compile them for any supported hardware without maintaining separate codebases. Your model is your moat. Keep it yours.

    YOUR ARCHITECTURE. ANY HARDWARE. NO VENDOR DEPENDENCY.

  • Deploy custom kernels in Mojo

    Need a novel attention pattern? Custom quantization scheme? Write it in Mojo with Python-compatible syntax and get better-than-CUDA performance across NVIDIA, AMD, and Apple Silicon. 20-30 lines of Mojo replace hundreds of lines of CUDA - and they run on every accelerator.

    20-30 LINES OF MOJO VS HUNDREDS OF LINES OF CUDA

  • Production-grade serving built in

    Continuous batching, KV-cache optimization, speculative decoding, automatic hardware detection, and OpenAI-compatible API endpoints - all compiled into a single container under 700MB. No assembly required: pip install modular, then max serve. A minimal client sketch follows this list.

    PIP INSTALL MODULAR. MAX SERVE. YOU'RE LIVE.

  • Join the open-source AI revolution

    MAX is open source. Mojo is open source. Inspect the code, contribute to the project, build on top of it. No black-box runtime, no proprietary lock-in, no "trust us" performance claims. You can read every kernel, every optimization pass, every line of the serving infrastructure.

    OPEN SOURCE. OPEN MODELS. OPEN KERNELS. NO BLACK BOXES.
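
Because the serving layer exposes OpenAI-compatible endpoints, any standard OpenAI client can talk to a self-hosted MAX server. Below is a minimal sketch, assuming a server started with max serve is listening locally; the base URL, port, and model name are illustrative assumptions, not fixed values.

```python
from openai import OpenAI

# Point the standard OpenAI client at the self-hosted MAX endpoint.
# The base URL and model id below are assumptions for illustration; use
# whatever host/port you started `max serve` with and the model you loaded.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="EMPTY",                      # no key required for a local deployment
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize why self-hosting matters."}],
)

print(response.choices[0].message.content)
```

The same request works from curl or any other OpenAI-compatible SDK, since the server speaks the standard chat-completions protocol.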

Who self-hosts with Modular?

    • Companies with existing GPU fleets

      If you’ve invested in NVIDIA and AMD hardware, Modular is the only inference platform that runs on both from one container. Maximize utilization across your entire fleet.

    • Air-gapped and classified environments

      Defense, intelligence, and regulated financial services need inference that runs without any external connectivity. Self-hosted MAX is a single container with zero call-home dependencies.

    • Teams with diverse silicon infrastructure

      NVIDIA, AMD, and custom-silicon clusters serving inference side by side. No other self-hosted platform runs GPU-accelerated inference across such diverse hardware pools.

    • AI-native companies with custom models

      If your moat is a proprietary model architecture, you need self-hosted inference with kernel-level programmability. Mojo lets you write custom ops that compile for any hardware target.

    Modular vs. the competition

      • Hardware Portability

        One container runs on NVIDIA, AMD, and Apple Silicon. Same model binary, any GPU.

      • Embedded Performance Engineers

        Forward-deployed engineers embedded in your team. Custom Mojo kernels for your workloads. Hands-on optimization, not support tickets.

      • Lightweight Runtime

        <700MB container. 10x faster cold starts. Dramatically simpler container orchestration.

      • Kernel-Level Programmability

        Mojo kernel programmability. Write custom GPU ops for novel architectures.

      • True Air-Gapped Deployment

        Full air-gap support. Zero external dependencies. No control plane.

    Alternatives
      • Vendor Lock-In

        You deploy separate infrastructure to support NVIDIA, AMD, Apple, Intel, ARM, and more. The complexity compounds quickly.

      • Config Tuning Support

        “Dedicated engineers” who tune vLLM/TensorRT configs remotely.

      • Heavy Runtime Stack

        7GB+ runtime. Slow cold starts. Heavy storage and bandwidth overhead at scale.

      • Framework Configs Only

        No kernel-level access. Limited to framework-level model configs.

      • Fragmented Infrastructure

        Relying on vLLM, SGLang, and others means depending on a spaghetti of dependencies with no unified support.

    Compare deployment options

    Support
      Self-Hosted: Active community and fast responses on Discord, Discourse, and GitHub
      Our Cloud: Dedicated support, engineering team, standard and custom SLAs/SLOs
      Your Cloud: Dedicated support, engineering team, standard and custom SLAs/SLOs

    Models
      Self-Hosted: Hundreds of models in our model repo; view top performers
      Our Cloud: Top performers available for dedicated endpoints, plus custom model deployment
      Your Cloud: Top performers available for dedicated endpoints, plus custom model deployment

    Platform access
      Self-Hosted: Deploy MAX and Mojo yourself, anywhere you want. Build with open source
      Our Cloud: Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints
      Your Cloud: Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints

    Scalability
      Self-Hosted: Scale on your own with the MAX container
      Our Cloud: Auto-scaling, scale to zero, burst capacity
      Your Cloud: Auto-scaling, proven at Fortune 500 scale

    Deployment location
      Self-Hosted: Self-deployed, anywhere
      Our Cloud: Our cloud
      Your Cloud: Your cloud or hybrid

    Compute hardware
      Self-Hosted: NVIDIA, AMD, Apple Silicon, and more, on hardware you own
      Our Cloud: NVIDIA and AMD GPUs in our cloud; more hardware types coming soon
      Your Cloud: NVIDIA and AMD GPUs plus Intel, AMD, and ARM CPUs, deployed in your cloud

    Custom kernels
      Self-Hosted: Your engineers write custom kernels for your workloads
      Our Cloud: Modular engineers tune kernels for your workloads
      Your Cloud: Modular engineers write custom kernels for your workloads

    Forward Deployed Engineers
      Self-Hosted: Available with a support plan
      Our Cloud: Included
      Your Cloud: Included, working in your environment

    Security & Compliance
      Self-Hosted: SOC 2 Type I certified
      Our Cloud: SOC 2 Type I certified (Type II in progress)
      Your Cloud: SOC 2 Type I certified (Type II in progress)

    Billing & Pricing
      Self-Hosted: Free
      Our Cloud: Per token (shared) or per minute (dedicated)
      Your Cloud: Per minute deployed; use your AWS/GCP/Azure credits and commits

    License
      Enterprise Contract

    Get started with Modular

    • Request a demo

      Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

      • Distributed, large-scale online inference endpoints

      • Highest performance to maximize ROI and minimize latency

      • Deploy in Modular cloud or your cloud

      • View all features with a custom demo

      Book a demo

      Talk with our sales lead Jay!

      30-minute demo. Evaluate with your workloads. Ask us anything.