
INFERENCE SOLUTIONS

Dedicated Inference, Dedicated Engineers.

Our compute, our infrastructure, our GPUs - but your dedicated forward-deployed Modular engineers. Dedicated endpoints on $/minute pricing with the full power of the Modular stack.

Diagram: your application connects to a dedicated inference endpoint serving open-source or custom models, with Modular Cloud handling orchestration and optimization, deployed on NVIDIA B200 or AMD MI355X hardware.

Why choose Dedicated Endpoints?

  • Dedicated endpoints on $/minute pricing

    Reserved GPU capacity dedicated to your workloads. Simple per-minute billing that makes cost forecasting straightforward. No per-token surprises. No cold start penalties. Your endpoints, always warm, always compiled, always optimized.

    DEDICATED COMPUTE. PAY PER MINUTE.
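Per-minute billing makes cost forecasting a simple multiplication. A minimal sketch of that forecast, assuming a purely hypothetical rate (the $/minute figure below is illustrative, not a quoted Modular price):

```python
# Hypothetical dedicated-endpoint cost forecast.
# RATE_PER_MINUTE is an illustrative placeholder, not actual Modular pricing.
RATE_PER_MINUTE = 0.50  # $/min for one reserved GPU (assumed)

def monthly_cost(gpus: int, hours_per_day: float, days: int = 30) -> float:
    """Forecast spend for `gpus` dedicated endpoints running `hours_per_day`."""
    minutes = hours_per_day * 60 * days
    return gpus * minutes * RATE_PER_MINUTE

# Two always-on GPUs for a 30-day month:
print(f"${monthly_cost(2, 24):,.2f}")  # → $43,200.00 at the assumed rate
```

Because the endpoint is reserved rather than metered per token, the forecast depends only on GPU count and uptime, not on traffic volume.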

  • NVIDIA and AMD GPU selection

    Choose the GPU that fits your workload's price-performance profile. MAX compiles natively for both NVIDIA and AMD - switch between vendors as pricing and availability shift. No other shared inference endpoint offers AMD. That's a pricing lever only Modular can give you.

    GPU VENDOR CHOICE = PRICING LEVERAGE

  • Forward-deployed engineers

    Your dedicated Modular engineer profiles your production traffic, identifies latency bottlenecks, writes custom MAX architectures and Mojo kernels, and pushes optimizations to your deployment. Not quarterly business reviews - weekly optimization cycles. Not support tickets - engineers who ship code to your stack.

    CUSTOM ENGINEERING, NOT GENERIC OPTIMIZATION

  • Custom model deployment

Bring your own model - fine-tuned, custom architecture, or proprietary weights - and we convert it to a highly optimized MAX graph. Upload it, and Modular Cloud compiles and serves it with the same $/minute pricing. Custom Mojo kernels available for novel architectures. OpenAI-compatible API endpoint out of the box.

    ANY MODEL. CUSTOM KERNELS. MANAGED INFRA.
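Because the endpoint is OpenAI-compatible, any OpenAI-style client can target it by pointing at a different base URL. A minimal sketch of the request shape, assuming a hypothetical endpoint URL and model name (neither is confirmed by this page); it builds the request rather than sending it:

```python
import json

# Hypothetical values - the real base URL and model name come from your
# Modular Cloud console; these are placeholders for illustration only.
BASE_URL = "https://example-endpoint.modular.example/v1"
MODEL = "my-finetuned-model"

# Standard OpenAI chat-completions request shape.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Summarize our Q3 report."}],
    "max_tokens": 256,
}
request = {
    "url": f"{BASE_URL}/chat/completions",
    "headers": {
        "Authorization": "Bearer $MODULAR_API_KEY",  # placeholder token
        "Content-Type": "application/json",
    },
    "body": json.dumps(payload),
}
print(request["url"])
```

Existing OpenAI SDK integrations would only need the base URL and API key swapped; the message format and response schema stay the same.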

  • Compiler-optimized, not wrapper-optimized

    Other providers wrap vLLM or TensorRT and call it optimization. Modular's MLIR compiler fuses the entire inference path - graph, runtime, memory, scheduling - into a single compiled unit. Compilation is a deeper lever than configuration. That's why MAX is 2x faster.

    FULL GRAPH COMPILATION VS. RUNTIME TUNING

  • 90% smaller runtime, faster scaling

    MAX runtime is under 700MB. Alternatives ship 7GB+. That means new replicas start in seconds, not minutes. Model swaps are near-instant. Storage and bandwidth costs drop dramatically at scale. Cold starts that feel warm.

    <700MB VS 7GB+. 10X FASTER COLD STARTS.

  • Top AI models, or your custom ones

    Our forward-deployed engineers optimize every deployment for SOTA performance - whether you're running a top open model or a custom model.

    • Build with popular models
    • Build by specific use case

    Deployment options

    • Shared Endpoints

      Auto-scaling API endpoints billed per token. Best for variable traffic, rapid prototyping, and early-stage production. No capacity planning. Scale to zero when idle. Compiler-optimized performance even on shared infrastructure.

      Best for: Variable traffic, prototyping, dev/test

    • Dedicated Endpoints

      Reserved GPU endpoints billed per minute. Isolated compute, consistent latency, guaranteed throughput. Forward-deployed engineering included. Best for production workloads that need SLA-grade reliability.

      Best for: Production, latency-sensitive, high throughput

    • Custom Model Endpoints

      Bring any model - fine-tuned, custom architecture, or proprietary weights. Modular compiles and serves on dedicated infrastructure with the same $/minute pricing. Custom Mojo kernels available for novel architectures.

      Best for: Proprietary models, custom architectures

    • Batch Inference

      Process large datasets asynchronously at reduced cost. Synthetic data generation, classification, offline summarization. Same compiler optimization, lower priority scheduling for better economics.

      Best for: Data processing, eval, synthetic data

    Modular vs. the competition

      • Hardware Portability

NVIDIA and AMD GPUs in the same deployment, meaning more options and lower TCO.

      • Embedded Performance Engineering

        Forward-deployed engineers who write custom Mojo kernels, on top of BentoCloud’s proven scalable operations.

      • Unified GPU Pricing

Simple $/token pricing for shared endpoints and $/minute pricing for dedicated ones.

      • Vertically Integrated Stack

        SOTA dynamic cloud orchestration. Compiler-aware auto-scaling. MAX understands model memory, batching state, KV-cache. Mojo provides portable SOTA kernels.

      • 10x Lighter Runtime

        <700MB runtime. 10x faster cold starts. Simpler operations.

    • Alternatives
      • Vendor Lock-In

        NVIDIA-only. Zero GPU vendor choice across every managed cloud competitor.

      • Generic Platform Optimizations

        No per-customer engineering. No dedicated engineers on your account. Generic optimizations applied everywhere.

      • Blackbox infrastructure & pricing

        No visibility into quantization, batching, or what's been done to your model. You're paying for a black box.

      • Runtime Wrappers

        CUDA research (ATLAS, Megakernel). vLLM/TensorRT wrappers. Runtime optimization, not compilation.

      • Multi-GB Runtime

        7GB+ runtimes. Slow cold starts. Heavy container overhead.

Compare deployment options

Support

• Self-Hosted: Active community and fast responses on Discord, Discourse, and GitHub
• Our Cloud: Dedicated support, engineering team, standard and custom SLAs/SLOs
• Your Cloud: Dedicated support, engineering team, standard and custom SLAs/SLOs

Models

• Self-Hosted: Hundreds of models in our model repo; view top performers
• Our Cloud: Top performers available for dedicated endpoints and custom model deployment
• Your Cloud: Top performers available for dedicated endpoints and custom model deployment

Platform access

• Self-Hosted: Deploy MAX and Mojo yourself, anywhere you want. Build with open source
• Our Cloud: Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints
• Your Cloud: Access the Modular Platform with a console for deploying, scaling, and managing your AI endpoints

Scalability

• Self-Hosted: Scale on your own with the MAX container
• Our Cloud: Auto-scaling, scale to zero, burst capacity
• Your Cloud: Auto-scaling, proven at Fortune 500 scale

Deployment location

• Self-Hosted: Self-deployed, anywhere
• Our Cloud: Our cloud
• Your Cloud: Your cloud or hybrid

Compute hardware

• Self-Hosted: NVIDIA, AMD, Apple Silicon & more, on hardware you own
• Our Cloud: NVIDIA & AMD GPUs in our cloud; more hardware types coming soon
• Your Cloud: NVIDIA & AMD GPUs; Intel, AMD & ARM CPUs, deployed in your cloud

Custom kernels

• Self-Hosted: Your engineers write custom kernels for your workloads
• Our Cloud: Modular engineers tune kernels for your workloads
• Your Cloud: Modular engineers write custom kernels for your workloads

Forward Deployed Engineers

• Self-Hosted: Available with support plan
• Our Cloud: Included
• Your Cloud: Included, working in your environment

Security & Compliance

• Self-Hosted: SOC 2 Type I certified
• Our Cloud: SOC 2 Type I certified (Type II in progress)
• Your Cloud: SOC 2 Type I certified (Type II in progress)

Billing & Pricing

• Self-Hosted: Free
• Our Cloud: Per token (shared), per minute (dedicated)
• Your Cloud: Per minute deployed; use your AWS/GCP/Azure credits and commits

License

• Enterprise Contract

    Get started with Modular

    • Request a demo

      Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

      • Distributed, large-scale online inference endpoints

• Highest performance to maximize ROI and minimize latency

      • Deploy in Modular cloud or your cloud

      • View all features with a custom demo

      Book a demo

      Talk with our sales lead Jay!

30-minute demo. Evaluate with your workloads. Ask us anything.