
Bring your own cloud (BYOC)

Your Cloud, Our Engineers, Any GPU.

Built on BentoCloud’s production-proven BYOC infrastructure - now powered by MAX’s compiler and Mojo’s custom kernels. Modular orchestrates the stack while inference runs in your VPC. You own the hardware, data, and cloud credits.

Diagram: Modular's control plane (scheduling, monitoring, scaling, endpoint management) connects to a cloud/VPC environment running the MAX Runtime, compiled models, and Mojo kernels on NVIDIA, AMD, or Apple GPUs.

How Modular BYOC works

Built on BentoCloud’s battle-tested architecture: Modular’s control plane (outside your VPC) handles endpoint management, scaling policies, monitoring dashboards, and the model registry. Your data plane (inside your VPC) runs MAX containers with compiled models. Inference inputs and outputs never leave your network.

Data stays in your VPC. Always. Proven at Fortune 500 scale.
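
What the split means in practice, as a minimal sketch: the inference endpoint resolves only inside your network, so prompts and completions never cross the control plane. MAX serves an OpenAI-compatible API; the hostname, port, and model name below are illustrative placeholders, not Modular's published defaults.

```python
import requests

# Hypothetical in-VPC endpoint: this DNS name resolves only inside your
# private network. The control plane sees health, scaling, and usage
# metadata; it never sees the request or response payloads below.
DATA_PLANE_URL = "http://inference.internal.example.vpc:8000/v1/chat/completions"

resp = requests.post(
    DATA_PLANE_URL,
    json={
        "model": "llama-3.1-8b-instruct",  # placeholder model name
        "messages": [{"role": "user", "content": "Summarize our Q3 numbers."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```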

Why choose Modular in your cloud?

  • Modular Compute Unit (MCU) pricing

    One billing metric that normalizes across GPU types, model sizes, and workload patterns. MCUs make it simple to forecast costs whether you’re running on NVIDIA H100s or AMD MI300Xs. Transparent, predictable, cloud-credit compatible. (A worked cost sketch follows this list.)

    One metric. Any GPU. Predictable costs.

  • Production-proven IaC setup (via BentoCloud)

    Modular provisions infrastructure in your cloud account using BentoCloud’s battle-tested IaC automation: K8s clusters, container registries, networking, storage. Already running at scale for Fortune 500 companies. Works with AWS, GCP, Azure, and OCI. SOC 2 Type II certified. (An illustrative IaC sketch follows this list.)

    Proven at scale. SOC 2 Type II. Your account.

  • Forward-deployed engineers on your account

    Your dedicated Modular engineers don’t just monitor dashboards. They profile your inference workloads, identify bottlenecks, write custom Mojo kernels, and push optimizations directly to your deployment. Continuous performance improvement, not break-fix support.

    Engineers who ship code to your stack

  • GPU portability inside your VPC

    Run NVIDIA H100s, AMD MI300Xs, or both in the same BYOC deployment. MAX compiles for each target automatically. If AMD offers better spot pricing in your region, shift workloads without rewriting anything. No other BYOC provider supports multi-vendor GPU deployments. (The cost sketch after this list shows the vendor comparison.)

    NVIDIA + AMD in the same deployment

  • Auto-scaling with compiler awareness

    BYOC auto-scaling isn’t generic K8s HPA. MAX-aware scaling understands model memory requirements, continuous batching state, and KV-cache utilization to make smarter scale-up and scale-down decisions. Scale to zero when idle. Burst to meet demand. (A toy sketch of the idea follows this list.)

    Compiler-aware, not just CPU/memory-aware

  • Use your cloud credits and commits

    BYOC runs in your cloud account. AWS reserved instances, GCP committed use discounts, Azure reservations, startup credits - all apply directly to your BYOC inference spend. MCU pricing layers on top. No double-billing.

    Cloud credits + MCU pricing
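
As promised above, a worked sketch of MCU-style cost forecasting. The per-GPU MCU rates and the $/MCU price below are invented placeholders; actual rates come from your Modular contract. The point is that one metric spans vendors, which also makes the H100-vs-MI300X comparison from the portability bullet a two-line calculation.

```python
# Illustrative MCU cost forecast. All rates are assumed placeholder
# values, not Modular's published pricing.
MCU_PER_GPU_HOUR = {"nvidia-h100": 10.0, "amd-mi300x": 8.0}  # assumed
PRICE_PER_MCU = 0.05  # assumed $/MCU

def monthly_cost(gpu: str, count: int, hours: float = 730.0) -> float:
    """One metric regardless of vendor: GPU-hours -> MCUs -> dollars."""
    return MCU_PER_GPU_HOUR[gpu] * count * hours * PRICE_PER_MCU

print(f"4x H100:   ${monthly_cost('nvidia-h100', 4):,.0f}/mo")  # $1,460/mo
print(f"4x MI300X: ${monthly_cost('amd-mi300x', 4):,.0f}/mo")   # $1,168/mo
```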
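
Modular runs the provisioning for you through BentoCloud's automation; you never write it yourself. Purely to illustrate the kind of resources involved, here is a Pulumi-style sketch of an AWS variant. Resource names are placeholders, and this is not Modular's actual template.

```python
import pulumi
import pulumi_aws as aws
import pulumi_eks as eks

# Illustrative only: the categories of resources BYOC provisioning
# creates in your account (cluster, registry, storage).
cluster = eks.Cluster("byoc-inference")               # managed K8s cluster
registry = aws.ecr.Repository("byoc-max-containers")  # container registry
models = aws.s3.Bucket("byoc-model-artifacts")        # model/object storage

pulumi.export("kubeconfig", cluster.kubeconfig)
pulumi.export("registry_url", registry.repository_url)
```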
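
And a toy sketch of the compiler-aware scaling idea: the decision consumes signals the runtime can expose, such as KV-cache utilization and batch queue depth, rather than CPU and memory alone. The signal names and thresholds here are assumptions, not Modular's scheduler.

```python
from dataclasses import dataclass

@dataclass
class ReplicaStats:
    kv_cache_utilization: float  # fraction of KV-cache memory in use
    queued_requests: int         # requests waiting for a batch slot

def desired_replicas(stats: list[ReplicaStats], current: int) -> int:
    """Toy scale decision driven by runtime signals, not CPU/memory."""
    if not stats:
        return current
    avg_kv = sum(s.kv_cache_utilization for s in stats) / len(stats)
    queued = sum(s.queued_requests for s in stats)
    if avg_kv > 0.85 or queued > 8 * current:
        return current + 1          # KV-cache pressure or backlog: scale up
    if avg_kv < 0.30 and queued == 0:
        return max(current - 1, 0)  # idle: scale down, to zero if needed
    return current

print(desired_replicas([ReplicaStats(0.91, 12), ReplicaStats(0.88, 9)], 2))  # 3
```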

Who is Modular BYOC for?

  • Fastest time to production

    Compliance stops being the long pole: inference data stays in your VPC, in your region, under your compliance policies. HIPAA, GDPR, SOC 2, FedRAMP - BYOC meets data residency requirements while Modular handles the infrastructure complexity, so new products ship sooner.

    EX: STARTUPS, NEW AI PRODUCTS, RAPID PROTOTYPING

  • Companies optimizing inference economics

    $/minute pricing + AMD GPU option + forward-deployed engineering creates a cost structure no competitor can match. Your engineer finds the optimal GPU, batch size, and kernel configuration for your workload.

    EX: HIGH-VOLUME INFERENCE, COST-SENSITIVE AI

  • AI-native companies with custom models

    Deploy fine-tuned or proprietary models on managed infrastructure with custom Mojo kernel support. Same capability as self-hosted, but without the operational overhead. Compiler optimization and GPU portability baked in.

    EX: CUSTOM MODEL SERVING AT SCALE

  • Teams transitioning from proprietary APIs

    Moving from OpenAI or Anthropic APIs to open-source models? Modular Cloud's OpenAI-compatible endpoints make the switch seamless. Same API contract. Better economics. Forward-deployed engineers help tune the replacement. (See the client example after this list.)

    EX: GPT TO LLAMA/QWEN MIGRATION
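
The migration described in the last item is mostly a configuration change, assuming an OpenAI-compatible Modular endpoint. The base URL, API key, and model name below are placeholders; everything else is the unmodified OpenAI Python client.

```python
from openai import OpenAI

# Same client, same API contract; only base_url and model change.
# Endpoint URL, key, and model name are placeholders for illustration.
client = OpenAI(
    base_url="https://your-endpoint.example.com/v1",
    api_key="YOUR_MODULAR_API_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # replaces e.g. "gpt-4o"
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```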

Modular vs. the competition

  • Hardware Portability vs. Vendor Lock-In

    Modular: GPU portability. NVIDIA + AMD in the same deployment, meaning more options and lower TCO.

    Alternatives: NVIDIA-only. Zero GPU vendor choice across every managed cloud competitor.

  • Embedded Performance Engineering vs. Generic Platform Optimizations

    Modular: Forward-deployed engineers who write custom Mojo kernels, on top of BentoCloud’s proven, scalable operations.

    Alternatives: No per-customer engineering. No dedicated engineers on your account. Generic optimizations applied everywhere.

  • Unified GPU Pricing vs. Black-Box Infrastructure and Pricing

    Modular: Simple pricing: $/token for shared endpoints, $/minute for dedicated ones.

    Alternatives: No visibility into quantization, batching, or what's been done to your model. You're paying for a black box.

  • Vertically Integrated Stack vs. Runtime Wrappers

    Modular: SOTA dynamic cloud orchestration. Compiler-aware auto-scaling: MAX understands model memory, batching state, and KV-cache utilization. Mojo provides portable SOTA kernels.

    Alternatives: CUDA research projects (ATLAS, Megakernel) and vLLM/TensorRT wrappers. Runtime optimization, not compilation.

  • 10x Lighter Runtime vs. Multi-GB Runtimes

    Modular: <700MB runtime. 10x faster cold starts. Simpler operations.

    Alternatives: 7GB+ runtimes. Slow cold starts. Heavy container overhead.

Compare deployment options

  Support
    Self-Hosted: Active community and fast responses on Discord, Discourse, and GitHub
    Our Cloud: Dedicated support, engineering team, standard and custom SLAs/SLOs
    Your Cloud: Dedicated support, engineering team, standard and custom SLAs/SLOs

  Models
    Self-Hosted: Hundreds of models in our model repo; view top performers
    Our Cloud: Top performers available for dedicated endpoints, plus custom model deployment
    Your Cloud: Top performers available for dedicated endpoints, plus custom model deployment

  Platform access
    Self-Hosted: Deploy MAX and Mojo yourself, anywhere you want. Build with open source
    Our Cloud: Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints
    Your Cloud: Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints

  Scalability
    Self-Hosted: Scale on your own with the MAX container
    Our Cloud: Auto-scaling, scale to zero, burst capacity
    Your Cloud: Auto-scaling, proven at Fortune 500 scale

  Deployment location
    Self-Hosted: Self-deployed, anywhere
    Our Cloud: Our cloud
    Your Cloud: Your cloud or hybrid

  Compute hardware
    Self-Hosted: NVIDIA, AMD, and Apple Silicon and more, on hardware you own
    Our Cloud: NVIDIA and AMD GPUs in our cloud; more hardware types coming soon
    Your Cloud: NVIDIA and AMD GPUs; Intel, AMD, and ARM CPUs; deployed in your cloud

  Custom kernels
    Self-Hosted: Your engineers write custom kernels for your workloads
    Our Cloud: Modular engineers tune kernels for your workloads
    Your Cloud: Modular engineers write custom kernels for your workloads

  Forward-deployed engineers
    Self-Hosted: Available with a support plan
    Our Cloud: Included
    Your Cloud: Included; working in your environment

  Security & compliance
    Self-Hosted: SOC 2 Type I certified
    Our Cloud: SOC 2 Type I certified (Type II in progress)
    Your Cloud: SOC 2 Type I certified (Type II in progress)

  Billing & pricing
    Self-Hosted: Free
    Our Cloud: Per token (shared); per minute (dedicated)
    Your Cloud: Per minute deployed; use your AWS/GCP/Azure credits and commits

  License
    Enterprise Contract

    Get started with Modular

    • Request a demo

      Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

      • Distributed, large-scale online inference endpoints

      • Highest performance to maximize ROI and minimize latency

      • Deploy in Modular cloud or your cloud

      • View all features with a custom demo

      Book a demo

      Talk with our sales lead, Jay!

      A 30-minute demo. Evaluate with your workloads. Ask us anything.