Image Generation at Full Speed on any hardware

Generate production-quality images with compiler-optimized inference. FLUX, Stable Diffusion, and custom diffusion models — running natively on NVIDIA, AMD, and Apple Silicon.

Three images showing a cute astronaut and a small fiery character, MAX and Mojo, in space: flying at lightspeed, sitting in a cockpit, and fixing futuristic equipment.

Model Spotlight

Flux-2

High-speed open-source image generation without sacrificing quality, optimized for rapid iteration and LoRA fine-tuning workflows. Generate images in under 1s with Modular.

  • In-context image editing with simple text instructions

  • Character consistency preservation across multiple edits

  • Local and global editing capabilities in one unified model

  • Text-to-image generation with state-of-the-art prompt following

  • Typography handling for text editing within images

Run Flux-2 in Modular Cloud
Short astronaut in a futuristic space suit inside a sleek small cockpit of a spaceship hovering above a snowy distant planet.

Why run image generation on Modular?

Every other image generation platform is locked to NVIDIA. Modular compiles diffusion models to run natively on any GPU — same model, same code, different silicon.

  • Native image and video support - not an afterthought

    vLLM and SGLang were built for text. They don't support diffusion models out of the box. Running FLUX, Stable Diffusion, or video generation on those stacks means bolting on separate infrastructure, separate pipelines, separate ops. MAX serves image and video models natively alongside LLMs - same container, same API, same compiled performance.

    VLLM AND SGLANG DON'T DO THIS. MAX DOES.

  • Up to 4x PyTorch performance

    MAX's MLIR compiler fuses the entire diffusion pipeline - UNet/DiT, VAE, text encoder, scheduler - into a single optimized graph. The result is up to 4x faster inference than native PyTorch. Not from wrapping a runtime. From compiling every stage together and eliminating the overhead between them.

    4X PYTORCH. COMPILED, NOT WRAPPED.

  • Compiler-fused diffusion pipelines

    Diffusion inference isn't a single model call - it's dozens of denoising steps through a UNet or DiT, plus VAE decoding, text encoding, and scheduling. MAX compiles the full pipeline as one graph, fusing operations across steps and eliminating memory round-trips between stages. That's where the 4x comes from.

    FULL PIPELINE COMPILED AS ONE GRAPH. THAT'S THE 4X.
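
The stages named above can be sketched as a toy loop. This is an illustration only: `encode_text`, `denoise`, `scheduler_step`, `vae_decode`, and `generate` are hypothetical names, the arithmetic inside them is a stand-in for real models, and none of it is a MAX API. It shows how one image request fans out into many separate stage calls, which is the overhead a whole-pipeline compiler has the chance to fuse away.

```python
import random

def encode_text(prompt):
    # Placeholder text encoder: reduce the prompt to a tiny conditioning vector.
    return [len(prompt) / 100.0, prompt.count(" ") / 10.0]

def denoise(latent, cond, t):
    # Placeholder for the UNet/DiT: treat everything except a prompt-derived
    # bias as noise. A real denoiser is a large neural network.
    bias = sum(cond) / len(cond)
    return [x - bias for x in latent]

def scheduler_step(latent, noise, t, num_steps):
    # Placeholder scheduler: remove a growing fraction of the predicted noise,
    # so the final step (t == num_steps - 1) removes all that remains.
    fraction = 1.0 / (num_steps - t)
    return [x - fraction * n for x, n in zip(latent, noise)]

def vae_decode(latent):
    # Placeholder VAE decoder: map latent values to 8-bit pixel intensities.
    return [max(0, min(255, round(128 + 127 * x))) for x in latent]

def generate(prompt, num_steps=28, latent_size=4, seed=0):
    # Count how often each stage runs for a single image.
    calls = {"text_encoder": 0, "denoiser": 0, "scheduler": 0, "vae": 0}
    rng = random.Random(seed)

    cond = encode_text(prompt)
    calls["text_encoder"] += 1

    # Start from random noise, then refine it step by step.
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    for t in range(num_steps):
        noise = denoise(latent, cond, t)
        calls["denoiser"] += 1
        latent = scheduler_step(latent, noise, t, num_steps)
        calls["scheduler"] += 1

    image = vae_decode(latent)
    calls["vae"] += 1
    return image, calls

image, calls = generate("an astronaut in a cockpit", num_steps=28)
print(calls)
```

With 28 denoising steps, a single image takes 58 stage calls in this sketch (1 text encode, 28 denoiser calls, 28 scheduler steps, 1 VAE decode). An eager runtime dispatches each of those separately.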

  • Hardware-portable diffusion

    Every other image generation platform is locked to NVIDIA. MAX compiles diffusion models to run natively on NVIDIA and AMD from the same container. Same model, same code, different silicon. When AMD offers better spot pricing for your batch generation workloads, shift without changing anything.

    NVIDIA + AMD. SAME DIFFUSION PIPELINE. SAME CONTAINER.
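
The decision to shift batch workloads between vendors comes down to cost per image. A minimal sketch of that arithmetic follows, assuming you have measured throughput on each target yourself. The function names, the target labels `gpu_a` and `gpu_b`, and every number are placeholders invented for illustration, not Modular pricing or benchmark data.

```python
def cost_per_image(price_per_hour, images_per_second):
    # Dollars per image = hourly price / images produced in one hour.
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    return price_per_hour / (images_per_second * 3600)

def cheapest_target(offers):
    # offers maps a target name to (price_per_hour, images_per_second).
    return min(offers, key=lambda name: cost_per_image(*offers[name]))

# Placeholder numbers, invented for illustration only.
offers = {
    "gpu_a": (4.00, 2.0),  # faster but pricier
    "gpu_b": (2.50, 1.5),  # slower but cheaper per hour
}
best = cheapest_target(offers)
print(best, cost_per_image(*offers[best]))
```

The faster target is not automatically the cheaper one. In this made-up example the slower GPU wins on cost per image, which is why portability across vendors gives pricing leverage.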

  • Scale image generation in Modular Cloud

    Image generation traffic is bursty - a design tool launch, an API integration going live, a campaign spike. Modular Cloud's compiler-aware auto-scaling spins up replicas in seconds with a <700MB runtime. Scale to zero when idle. Dedicated endpoints with per-minute pricing for production; serverless for prototyping. Forward-deployed engineers tune batch sizes and scheduling for your specific traffic patterns.

    BURST TO MEET DEMAND. SCALE TO ZERO WHEN QUIET.
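
To make the burst and scale-to-zero behavior concrete, here is a minimal sketch of a queue-based scaling rule and per-minute billing. It is a hypothetical policy, not Modular Cloud's actual autoscaler: `desired_replicas`, `replica_minutes`, the capacity of 20 requests per replica, the cap of 8 replicas, and the traffic trace are all assumptions made up for the example.

```python
import math

def desired_replicas(queued_requests, per_replica_capacity, max_replicas):
    # Scale to zero when nothing is waiting; otherwise run just enough
    # replicas to cover the queue, up to a hard cap.
    if queued_requests <= 0:
        return 0
    needed = math.ceil(queued_requests / per_replica_capacity)
    return min(needed, max_replicas)

def replica_minutes(trace, per_replica_capacity, max_replicas):
    # trace holds the number of queued requests observed in each minute.
    plan = [desired_replicas(q, per_replica_capacity, max_replicas) for q in trace]
    return plan, sum(plan)

# Hypothetical traffic: idle, a spike, then idle again.
trace = [0, 0, 40, 120, 35, 0]
plan, used = replica_minutes(trace, per_replica_capacity=20, max_replicas=8)
always_on = max(plan) * len(trace)  # provisioning for the peak all the time

print(plan)       # replicas running in each minute
print(used)       # replica-minutes billed with scale-to-zero
print(always_on)  # replica-minutes billed if peak capacity never scales down
```

Under per-minute pricing, cost is proportional to replica-minutes. In this invented trace the autoscaled plan bills 10 replica-minutes against 36 for a fleet fixed at peak size. The comparison only holds if replicas start quickly enough to follow the spike.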

  • FLUX, Stable Diffusion, and custom models ready to serve

    Pre-optimized support for FLUX, Stable Diffusion XL, SD3, and more - deploy on Modular Cloud in minutes. Fine-tuned a diffusion model on your own data? Bring custom LoRAs, custom architectures, or proprietary models. MAX compiles and serves them with the same performance advantage of up to 4x.

    FLUX AND SDXL TODAY. YOUR CUSTOM MODEL TOMORROW.

  • Built for production image applications

    • Real-time content generation

      Generate marketing visuals, product mockups, and social content at sub-second speeds. Serve from the GPU vendor with the best price-performance ratio at any given moment.

      Cute orange flame character giving thumbs up inside a futuristic spaceship cockpit floating over a cloudy landscape.
    • Avatar and character generation

      Consistent, low-latency avatar generation for gaming, virtual worlds, and social platforms. Run the same pipeline on NVIDIA in the cloud and Apple Silicon on-device.

      Half astronaut in a white spacesuit with black visor, half digital blue hologram made of glowing squares.
    • Custom model deployment

      Fine-tuned LoRAs, custom architectures, novel schedulers — deploy any diffusion model with Mojo kernels for maximum performance on any hardware target.

      Smartphone screen showing a white whale outline on a grid background with a small orange app icon of a white flame in the top right corner.

    Modular vs. the competition

      • Up to 4x PyTorch Performance

        MLIR compiler fuses the full diffusion pipeline - UNet/DiT, VAE, text encoder, scheduler - into one optimized graph. Up to 4x faster than native PyTorch. Compiled, not wrapped.

      • Native Image + Video Serving

        MAX serves diffusion models natively alongside LLMs in the same container and API. vLLM and SGLang don't support image or video generation at all. No bolted-on pipelines.

      • Hardware Portability

        Run FLUX, SDXL, and custom diffusion models on NVIDIA and AMD from the same container. Shift workloads based on price-performance. No other platform offers this level of portability.

      • Faster & Lighter Runtime

        With a <700MB runtime, replicas spin up in seconds, not minutes. At image generation scale - where traffic is bursty and cold starts kill UX - this is the difference between usable and unusable.

      • Fast, reliable cloud serving

        Image generation traffic is bursty - a design tool launch, an API integration going live, a campaign spike. Modular Cloud's compiler-aware auto-scaling spins up replicas in seconds.

    • Alternatives
      • 1x PyTorch Performance

        Runtime wrappers over PyTorch and ComfyUI. No compiler-level optimization. No graph fusion across pipeline stages. You get whatever PyTorch gives you.

      • Text-Only Inference Stacks

        vLLM and SGLang were built for LLMs. Diffusion models require separate infrastructure, separate pipelines, separate ops. Two stacks to maintain. Two surfaces to break.

      • Hardware lock-in

        Every image generation alternative is locked to a single GPU vendor. Switching hardware means rewriting your stack. No pricing leverage. No supply flexibility.

      • 7GB+ Runtime, Slow Cold Starts

        Heavy dependency chains. Minutes to cold start. When a traffic spike hits your image API, users wait. At scale, that overhead compounds into real cost and real churn.

      • No Kernel Access

        Limited to framework-level configuration. Can't customize schedulers, attention, or quantization at the kernel level. What you get is what you get.

    Get started with Modular

    • Request a demo

      Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.

      • Distributed, large-scale online inference endpoints

      • Highest performance to maximize ROI and minimize latency

      • Deploy in Modular Cloud or your cloud

      • View all features with a custom demo

      Book a demo

      Talk with our sales lead Jay!

      30-minute demo. Evaluate with your workloads. Ask us anything.