GenAI-native serving and modeling, built for performance.

Build once, deploy anywhere with a single programmable stack for high-performance GenAI on any hardware.

Get started

Request a demo

Serve Models
Deploy DeepSeek, Gemma, Qwen, and hundreds more with a high-speed OpenAI-compatible endpoint, on NVIDIA or AMD and on any cloud.
View Models
Customize or Build Models
Load fine-tuned weights to existing models or build custom models with a PyTorch-like Python API that’s uniquely built to deliver top performance in production.
Optimize Kernels with Ease
Get the best GPU kernel performance without the GPU programming pain of the past. Write in Mojo🔥, a Pythonic language purpose-built for portability across any hardware.
Full Guide
Zero Vendor Lock-in
MAX doesn't depend on PyTorch, CUDA, or ROCm, so there's nothing to bundle, patch, or keep in sync. The result: dramatically smaller containers and faster cold starts.

Your entire AI infrastructure in a single dependency

GPU agnostic

The same code runs on NVIDIA, AMD, and Apple Silicon. When new generations of hardware enter the datacenter, MAX is the fastest to take advantage and deliver top performance. Hardware will only get more exciting - be ready for it with MAX.

sync.🔥

  # Compile-time warp synchronization per hardware
  
  @always_inline("nodebug")
  fn syncwarp(mask: Int = -1):
      """Synchronizes threads within a warp using a barrier."""
  
      @parameter
      if is_nvidia_gpu():
          __mlir_op.`nvvm.bar.warp.sync`(
              __mlir_op.`index.casts`[_type = __mlir_type.i32](
                  mask._mlir_value
              )
          )
  
      elif is_amd_gpu():
          # In AMD GPU this is a nop (everything executed in lock-step).
          return

Open source & extensible

All of the MAX Python API, all of the model pipelines, and all the GPU kernels (for NVIDIA, AMD, and Apple) are open sourced for you to learn from and contribute to.

Modular GitHub

arch.py

  # Registers Qwen2 models with composable components
  
  qwen2_arch = SupportedArchitecture(
      name="Qwen2ForCausalLM",  # Supports HuggingFace model class names
      task=PipelineTask.TEXT_GENERATION,
      example_repo_ids=["Qwen/Qwen2.5-7B-Instruct", "Qwen/QwQ-32B"],
      default_weights_format=WeightsFormat.safetensors,
      default_encoding=SupportedEncoding.bfloat16,
      supported_encodings={
          SupportedEncoding.float32: [KVCacheStrategy.PAGED],
          SupportedEncoding.bfloat16: [KVCacheStrategy.PAGED],
      },
      pipeline_model=Qwen2Model,  # Implement a custom model
      tokenizer=TextTokenizer,
      rope_type=RopeType.normal,
      weight_adapters={  # Reuse composable converters & utilities
          WeightsFormat.safetensors: llama3.weight_adapters.convert_safetensor_state_dict,
          WeightsFormat.gguf: llama3.weight_adapters.convert_gguf_state_dict,
      },
  )
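
To make the idea of registering an architecture under a HuggingFace model class name more concrete, here is a toy sketch of how such a lookup could work. It is not the MAX implementation: ToyArchitecture, ToyRegistry, and resolve are hypothetical names invented for illustration. The only outside fact it relies on is that a HuggingFace config.json lists model class names under the "architectures" key.

```python
# Toy illustration only: ToyArchitecture and ToyRegistry are hypothetical
# names and are NOT part of the MAX API.
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyArchitecture:
    name: str  # HuggingFace model class name, e.g. "Qwen2ForCausalLM"
    default_encoding: str
    supported_encodings: tuple


class ToyRegistry:
    def __init__(self) -> None:
        self._by_name: dict = {}

    def register(self, arch: ToyArchitecture) -> None:
        """Store an architecture under its HuggingFace class name."""
        if arch.name in self._by_name:
            raise ValueError(f"architecture already registered: {arch.name}")
        self._by_name[arch.name] = arch

    def resolve(self, hf_config: dict, encoding: Optional[str] = None):
        """Match the "architectures" list of a config.json to a registered entry."""
        for class_name in hf_config.get("architectures", []):
            arch = self._by_name.get(class_name)
            if arch is None:
                continue
            # Fall back to the architecture's default when no encoding is requested.
            chosen = encoding or arch.default_encoding
            if chosen not in arch.supported_encodings:
                raise ValueError(f"{class_name} does not support encoding {chosen}")
            return arch, chosen
        raise KeyError("no registered architecture matches this config")
```

With a registry like this, supporting a new model family comes down to one register call; the serving code only ever calls resolve.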

Measurable performance
See the numbers for yourself. MAX includes the max benchmark command, an open-source benchmarking tool adapted from vLLM. Run it against your endpoint with datasets like ShareGPT or arxiv-summarization, or bring your own. Export shareable YAML configs for reproducible results.
Benchmark tutorial
171%
Improved throughput
Gemma3-27B | AMD-MI355x | Sonnet decode heavy
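
For readers who want to sanity-check numbers like these, the sketch below shows one common way to turn recorded request timings into throughput and latency figures, and how a percentage improvement relates to a speedup factor. It is not the max benchmark implementation: RequestResult, summarize, and percent_improvement are hypothetical names, and the nearest-rank percentile is an assumed choice. The page does not state the baseline behind the 171% figure. Under the usual "relative to baseline" reading, a 171% improvement would mean 2.71x the baseline throughput.

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT the
# "max benchmark" implementation.
import math
from dataclasses import dataclass


@dataclass
class RequestResult:
    start_s: float       # time the request was sent, in seconds
    end_s: float         # time the full response arrived, in seconds
    output_tokens: int   # tokens generated for this request


def summarize(results):
    """Compute throughput and latency stats over a set of completed requests."""
    if not results:
        raise ValueError("need at least one result")
    # Wall-clock window covering every request, so concurrency is accounted for.
    wall_s = max(r.end_s for r in results) - min(r.start_s for r in results)
    if wall_s <= 0:
        raise ValueError("results must span a positive amount of time")
    latencies = sorted(r.end_s - r.start_s for r in results)

    def percentile(p):
        # Nearest-rank percentile over the sorted latencies.
        rank = max(1, math.ceil(p / 100 * len(latencies)))
        return latencies[rank - 1]

    return {
        "requests_per_s": len(results) / wall_s,
        "output_tokens_per_s": sum(r.output_tokens for r in results) / wall_s,
        "p50_latency_s": percentile(50),
        "p99_latency_s": percentile(99),
    }


def percent_improvement(baseline, candidate):
    """Improvement of candidate over baseline, as a percentage of baseline."""
    return (candidate - baseline) * 100 / baseline
```

Comparing two runs then reduces to calling summarize on each and passing the two output_tokens_per_s values to percent_improvement.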

500+ Models

Instant access to the most popular OSS models - optimized for cost, speed, and quality. Search our library of open source models and deploy in seconds.

Run a model in 3 minutes

max_quickstart.py

  # OpenAI-compatible API for OSS models
  
  from openai import OpenAI
  
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
  
  completion = client.chat.completions.create(
      model="[choose from library]",
      messages=[
          {
              "role": "user",
              "content": "Who won the world series in 2020?"
          },
      ],
  )
  
  print(completion.choices[0].message.content)
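
If the openai package is not available, the same request can be sketched with only the Python standard library. This assumes the server exposes the standard chat completions route under the base URL, which is what "OpenAI-compatible" normally implies. build_chat_request and send_chat_request are hypothetical helper names, and the model name is still a placeholder you choose from the library.

```python
# Sketch under stated assumptions: helper names are hypothetical, and the
# endpoint is assumed to follow the OpenAI chat completions request shape.
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="EMPTY"):
    """Build (but do not send) a POST request for a single-turn chat completion."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request, timeout_s=60.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With a server running locally, send_chat_request(build_chat_request("http://localhost:8000/v1", "[choose from library]", "Who won the world series in 2020?")) would return the reply text. Keeping request construction separate from sending makes the payload easy to inspect without a live endpoint.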

Get started with MAX

Start using MAX
( FREE )
Run any open source model in 5 minutes, then benchmark it. Scale it to millions yourself (for free!), or book a demo anytime you want to upgrade to our managed enterprise product.
Install MAX now
MAX for Enterprise
( Book a demo )
Schedule a demo of MAX Enterprise and explore a custom end-to-end deployment built around your models, hardware, and performance goals.
- Distributed, large-scale online inference endpoints
- Highest performance to maximize ROI and minimize latency
- Deploy in Modular cloud or your cloud
- View all features with a custom demo
Book a demo
Talk with our sales lead Jay!
30min demo. Evaluate with your workloads. Ask us anything.

Popular MAX tech talks

(Re)introducing MAX with Chris Lattner
14:31
MAX Graph Compilation to Execution
46:56
MAX’s Graph Compiler Internals with Feras Boulala
30:01

Developer Approved

12x faster without even trying

svpino

“Mojo destroys Python in speed. 12x faster without even trying. The future is bright!”

impressive speed

Adalseno

"It worked like a charm, with impressive speed. Now my version is about twice as fast as Julia's (7 ms vs. 12 ms for a 10 million vector; 7 ms on the playground. I guess on my computer, it might be even faster). Amazing."

actually flies on the GPU

Sanika

"after wrestling with CUDA drivers for years, it felt surprisingly… smooth. No, really: for once I wasn’t battling obscure libstdc++ errors at midnight or re-compiling kernels to coax out speed. Instead, I got a peek at writing almost-Pythonic code that compiles down to something that actually flies on the GPU."

impressed

justin_76273

“The more I benchmark, the more impressed I am with the MAX Engine.”

high performance code

jeremyphoward

"Mojo is Python++. It will be, when complete, a strict superset of the Python language. But it also has additional functionality so we can write high performance code that takes advantage of modern accelerators."

performance is insane

drdude81

“I tried MAX builds last night, impressive indeed. I couldn't believe what I was seeing... performance is insane.”

Community is incredible

benny.n

“The Community is incredible and so supportive. It’s awesome to be part of.”

surest bet for longterm

pagilgukey

“Mojo and the MAX Graph API are the surest bet for longterm multi-arch future-substrate NN compilation”

works across the stack

scrumtuous

“Mojo can replace the C programs too. It works across the stack. It’s not glue code. It’s the whole ecosystem.”

one language all the way through

fnands

“Tired of the two language problem. I have one foot in the ML world and one foot in the geospatial world, and both struggle with the 'two-language' problem. Having Mojo - as one language all the way through is be awesome.”

The future is bright!

mytechnotalent

Mojo destroys Python in speed. 12x faster without even trying. The future is bright!

potential to take over

svpino

“A few weeks ago, I started learning Mojo 🔥 and MAX. Mojo has the potential to take over AI development. It's Python++. Simple to learn, and extremely fast.”

easy to optimize

dorjeduck

“It’s fast which is awesome. And it’s easy. It’s not CUDA programming...easy to optimize.”

huge increase in performance

Aydyn

"C is known for being as fast as assembly, but when we implemented the same logic on Mojo and used some of the out-of-the-box features, it showed a huge increase in performance... It was amazing."

feeling of superpowers

Aydyn

"Mojo gives me the feeling of superpowers. I did not expect it to outperform a well-known solution like llama.cpp."

completely different ballgame

scrumtuous

“What @modular is doing with Mojo and the MaxPlatform is a completely different ballgame.”

amazing achievements

Eprahim

“I'm excited, you're excited, everyone is excited to see what's new in Mojo and MAX and the amazing achievements of the team at Modular.”

very excited

strangemonad

“I'm very excited to see this coming together and what it represents, not just for MAX, but my hope for what it could also mean for the broader ecosystem that mojo could interact with.”

pure iteration power

Jayesh

"This is about unlocking freedom for devs like me, no more vendor traps or rewrites, just pure iteration power. As someone working on challenging ML problems, this is a big thing."

was a breeze!

“Max installation on Mac M2 and running llama3 in (q6_k and q4_k) was a breeze! Thank you Modular team!”
