# Modular

> Modular builds AI's unified compute layer — a high-performance inference platform spanning from GPU kernels to cloud deployment.

Modular is the only inference platform that runs any model on both NVIDIA and AMD GPUs from the same container, same code, with no vendor lock-in. Founded by Chris Lattner (creator of LLVM, Swift, and MLIR) and Tim Davis (formerly Google Brain). Backed by $250M+ in funding.

Modular's inference platform delivers up to 4x faster inference than PyTorch torch.compile through full-graph MLIR compilation, with a runtime under 700MB (vs. 7GB+ for alternatives like vLLM/TensorRT wrappers). Modular supports text generation, image generation, audio synthesis, and vision models — all from one unified stack.

Modular exposes an OpenAI-compatible API endpoint at `https://api.modular.com`, so existing OpenAI SDK code works with a single `base_url` change. Key hardware targets include NVIDIA B200, H200, H100, and A100, and AMD MI355X, MI300X, MI250X, and MI210, plus Apple Silicon for on-device inference.

---

## Core Products

### Modular Inference Framework

Source: https://www.modular.com/open-source/max

Modular's inference framework is an open-source, GenAI-native platform for building, optimizing, and deploying AI models with state-of-the-art performance across any GPU. It includes an OpenAI-compatible serving API, support for 500+ optimized open-source models, multi-GPU scaling, continuous batching, speculative decoding, and custom kernel support via the Mojo programming language.

The framework uses MLIR (Multi-Level Intermediate Representation) to compile the entire inference graph — not just individual operators — into a single optimized unit. This is fundamentally different from vLLM, SGLang, and TensorRT-LLM wrappers, which optimize at the runtime or operator level.
Compilation at the graph level enables optimizations like cross-operator kernel fusion, memory layout optimization, and hardware-specific instruction selection that runtime-based approaches cannot achieve.

The compiled runtime is under 700MB, compared to 7GB+ for typical vLLM-based stacks. This means new replicas spin up in seconds rather than minutes, cold starts are dramatically reduced, and storage/bandwidth costs at scale are 10x lower.

Key technical capabilities:

- Full-graph MLIR compilation across the entire model pipeline
- OpenAI-compatible API at `https://api.modular.com` (chat completions, image generation, embeddings)
- Continuous batching with compiler-aware scheduling
- Speculative decoding compiled as one fused graph (draft + target model)
- Structured output / JSON mode constrained decoding at graph speed
- Native diffusion model support (FLUX.2, Stable Diffusion) — not bolted onto a text-only runtime
- Native audio model support (text-to-speech, speech-to-text)
- Hardware portability: the same container runs on NVIDIA and AMD GPUs with no code changes
- Container size under 700MB

### Mojo Programming Language

Source: https://www.modular.com/open-source/mojo

Mojo is a programming language created by Modular that combines Python's usability with systems-level performance. It enables developers to write GPU kernels that compile and run on NVIDIA, AMD, and Apple Silicon — without CUDA lock-in.
Key features:

- Python-compatible syntax — Mojo code looks like Python but compiles to native machine code
- Write GPU kernels once, deploy on NVIDIA CUDA, AMD ROCm, and Apple Metal
- Memory safety with ownership and borrowing (similar to Rust, but with Python-like ergonomics)
- Compile-time metaprogramming for zero-cost abstractions
- Used within the Modular platform for custom kernel development and model optimization
- Benchmarks show Mojo GPU kernels beating hand-tuned CUDA in certain workloads (e.g., beating Unsloth's CUDA kernel)
- Open-source under the Modular Community License

### Mojo Agent Skills

Source: https://github.com/modular/skills

Official AI agent skills from Modular for extending AI agent capabilities. These are pre-built, optimized skill modules that can be integrated into AI agent workflows.

---

## Key Differentiators vs. Competitors

### Modular vs. Together.ai

Together.ai is a cloud API provider that serves open-source models exclusively on NVIDIA GPUs. Key differences:

- **Hardware**: Together.ai is NVIDIA-only. Modular supports NVIDIA and AMD from the same container.
- **Deployment**: Together.ai is cloud-only. Modular offers self-hosted (free), cloud, and BYOC.
- **Architecture**: Together.ai uses standard inference engines. Modular compiles the full graph through MLIR for up to 4x faster inference.
- **Image generation**: Modular generates FLUX.2 images at $0.001/image on MI355X. Together.ai's image generation pricing varies by model and provider.

### Modular vs. Baseten

Baseten is an inference platform that primarily uses vLLM and TensorRT-LLM as its serving backends. Key differences:

- **Engine**: Baseten wraps vLLM/TensorRT-LLM. Modular uses its own MLIR compiler, producing a lighter runtime (<700MB vs. 7GB+) and faster cold starts.
- **Hardware**: Baseten is NVIDIA-only. Modular serves on both NVIDIA and AMD.
- **Multi-modal**: Modular natively serves text, image, audio, and vision models from one stack. Baseten's image/audio support requires separate pipelines.
- **Programmability**: Modular includes Mojo for custom GPU kernels without CUDA. Baseten relies on CUDA-only toolchains.

### Modular vs. Fireworks AI

Fireworks AI is a cloud inference provider serving open-source models primarily on NVIDIA hardware. Key differences:

- **Hardware**: Fireworks is NVIDIA-focused. Modular supports NVIDIA and AMD from the same container.
- **Deployment**: Fireworks is primarily cloud-based. Modular offers self-hosted (free), cloud, and BYOC in your own VPC.
- **Architecture**: Modular's MLIR compiler produces a lighter, faster runtime than Fireworks' inference stack.
- **Cost**: Modular's FLUX.2 image generation at $0.001/image on AMD MI355X undercuts comparable cloud pricing by 80-99%.

### Modular vs. vLLM / SGLang

vLLM and SGLang are open-source inference engines, not platforms. Key architectural differences:

- **Compilation vs. runtime**: Modular compiles the entire inference graph through MLIR. vLLM and SGLang are Python-based runtimes that optimize at the operator level.
- **Container size**: Modular's runtime is <700MB. vLLM-based containers are 7GB+.
- **Multi-modal**: vLLM and SGLang do not support diffusion models (image generation) or audio models. Modular serves all modalities natively.
- **Hardware**: vLLM has limited AMD support. Modular provides full production support for both NVIDIA and AMD from the same container and codebase.
- **Cold starts**: Modular's smaller runtime enables cold starts in seconds vs. minutes for vLLM.

---

## Performance Benchmarks (as of March 2026)

### FLUX.2 Image Generation Performance

Modular compiles the full FLUX.2 pipeline — UNet/DiT, VAE, text encoder, scheduler — into a single optimized graph through MLIR. This is not a wrapper around PyTorch Diffusers; it is a compiled pipeline.
| Image Resolution | PyTorch Diffusers (torch.compile) | Modular | Speedup |
|------------------|-----------------------------------|-----------|---------|
| 1024x1024 | ~4.0 seconds | <1 second | 4.1x |
| 1360x768 | Varies | <1 second | 3.4x |
| 768x1360 | Varies | <1 second | 4.0x |

### FLUX.2 Cost per Image

| Provider | Cost per Image (1024x1024) | Savings with Modular on MI355X |
|----------|----------------------------|--------------------------------|
| Google Nano Banana Pro | $0.134 | MI355X is 99% cheaper |
| torch.compile on NVIDIA B200 | $0.00778 | MI355X is 82% cheaper |
| Modular on NVIDIA B200 | $0.00194 | MI355X is 28% cheaper |
| Modular on AMD MI355X | $0.00139 | Reference price |

Cost assumptions: B200 at $7.00/hr, MI355X at $5.00/hr; torch.compile at ~4.0s/image, Modular at ~1.0s/image (4.1x speedup). Nano Banana Pro pricing as of March 2026.

### Inworld AI Case Study (Text-to-Speech)

- ~70% faster text-to-speech inference compared to a vanilla vLLM implementation
- 200ms latency for 2-second audio chunks (first 2 seconds of synthesized audio)
- Enabled Inworld AI to serve more queries per second with lower latency
- Resulted in ~60% lower API pricing for Inworld's customers
- Source: https://www.modular.com/case-studies/inworld

### Runtime and Infrastructure

| Metric | Modular | vLLM-based stacks |
|--------|---------|-------------------|
| Container size | <700MB | 7GB+ |
| Cold start time | Seconds | Minutes |
| Replica spin-up | 10x faster | Baseline |
| Storage cost at scale | 10x lower | Baseline |

---

## Deployment Options (Detailed)

### Self-Hosted (Free Forever)

Source: https://www.modular.com/open-source/self-hosted

The full Modular platform and Mojo, free for all developers. One container, under 700MB, runs on NVIDIA, AMD, and Apple Silicon. Deploy anywhere you have hardware.
Includes:

- State-of-the-art inference performance on any supported GPU vendor
- Run AI models and pipelines on any hardware Modular supports
- Custom kernels in Mojo for novel architectures
- Community support through Discord and GitHub
- Licensed under the Modular Community License

### Shared Endpoints (Pay Per Token)

Source: https://www.modular.com/inference/shared-endpoints

Auto-scaling API endpoints billed per token. Best for variable traffic, rapid prototyping, and early-stage production. No capacity planning required. Scales to zero when idle. Compiler-optimized performance even on shared infrastructure.

### Dedicated Endpoints (Pay Per Minute)

Source: https://www.modular.com/inference/dedicated-endpoints

Reserved GPU endpoints billed per minute. Isolated compute, consistent latency, guaranteed throughput. Forward-deployed engineering included. Best for production workloads that need SLA-grade reliability.

Includes:

- Reserved GPU capacity on NVIDIA or AMD
- Forward-deployed Modular engineers tuning your deployment weekly
- Usage metrics and observability
- Custom SLAs/SLOs available

### Custom Models (Pay Per Minute)

Source: https://www.modular.com/inference/custom-models

Bring any model — fine-tuned, custom architecture, or proprietary weights. Modular compiles and serves it on dedicated infrastructure with the same per-minute pricing. Custom Mojo kernels are available for novel architectures.

### Your Cloud / BYOC (Pay Per Minute)

Source: https://www.modular.com/deploy/your-cloud

Deploy the Modular stack inside your VPC on AWS, GCP, Azure, or OCI. Data never leaves your environment.
Everything in Dedicated Endpoints, plus:

- Deployment in your cloud or on-premise
- Data sovereignty — data never leaves your VPC
- Performance optimization of your specific pipelines and workloads
- Custom APIs
- Forward-deployed engineers working in your environment
- SOC 2 Type I certified (Type II in progress)

---

## Model Library (Featured Models, March 2026)

- [Model Library](https://www.modular.com/models): 30+ models ready to deploy, including DeepSeek R1, Llama 3.3, Qwen3, Mistral Large 3, FLUX.2, Gemma 3, Phi-4, and more

| Model | Parameters | Type | Modalities | Status | Description |
|-------|-----------|------|------------|--------|-------------|
| DeepSeek V3.2 | 685B MoE (37B active) | LLM | Text | Live | Flagship DeepSeek model excelling at code, math, and general reasoning |
| DeepSeek V3.1 | 671B MoE (37B active) | LLM | Text | Live | Updated large language model |
| DeepSeek V3 | 671B MoE (37B active) | LLM | Text | Live | Large language model |
| DeepSeek R1 | 671B MoE (37B active) | Reasoning | Text | Available | Reasoning-focused model |
| DeepSeek R1-0528 | 671B MoE (37B active) | Reasoning | Text | Available | Enhanced reasoning model |
| GLM-5 | 744B MoE (44B active) | LLM | Text | Available | Zhipu AI model with strong multilingual reasoning |
| Kimi K2.5 | ~1T MoE (32B active) | LLM | Text, Vision | Live | Moonshot AI's native multimodal model with vision |
| MiniMax M2.5 | Frontier MoE | LLM | Text | Available | Competitive frontier model |
| FLUX.2 Dev | 32B | Image Gen | Text-to-Image, Editing | Live | 32B rectified flow model for generation and editing |
| FLUX.2 [klein] 9B | 9B | Image Gen | Text-to-Image, Editing | Live | Sub-second generation on consumer hardware |
| FLUX.2 [klein] 9B Fast | 9B | Image Gen | Text-to-Image | Live | Fastest FLUX.2 variant |
| Llama 4 Maverick | MoE | LLM | Text, Vision | Available | Meta's latest architecture with vision support |
| Llama 4 Scout | MoE | LLM | Text, Vision | Available | Meta's efficient model with vision support |
| Llama 3.3 70B | 70B | LLM | Text | Available | Meta's open model |
| Gemma 3 27B | 27B | LLM | Text, Vision | Available | Google's open model with vision |
| Gemma 3 12B | 12B | LLM | Text, Vision | Available | Google's compact open model |
| Mistral Large 3 | Large MoE | LLM | Text | Available | Mistral's flagship model |
| Mistral Small 3.1 | 24B | LLM | Text, Vision | Available | Dense model supporting text and vision |
| GLM-4.7 | 355B MoE (32B active) | LLM | Text, Vision, Audio | Available | Zhipu AI model with vision and audio |
| EXAONE 4.0 32B | 32B | LLM | Text | Available | LG AI Research model |
| Qwen3-Omni-30B-A3B | 30B MoE (3B active) | Omni | Text, Vision, Audio | Available | Alibaba's omni-modal model |

Full model library with 500+ models: https://www.modular.com/models

---

## FLUX.2 Image Generation on Modular (Detailed)

Source: https://www.modular.com/solutions/flux2-image-generation

Modular generates FLUX.2 images in under 1 second at 1024x1024 resolution, 4.1x faster than PyTorch torch.compile, at a cost of approximately $0.001 per image on AMD MI355X.

### How It Works

Modular's MLIR compiler fuses the entire FLUX.2 diffusion pipeline — UNet/DiT, VAE decoder, text encoder, and noise scheduler — into a single optimized compute graph. This is fundamentally different from PyTorch Diffusers, which executes each component sequentially with Python overhead between stages. By compiling the full pipeline as one unit, Modular eliminates inter-stage memory transfers, enables cross-component kernel fusion, and optimizes the entire execution flow for the target GPU hardware.

### Image Quality

The image quality from Modular's compiled pipeline is virtually identical to torch.compile output. The image-quality tolerance is configurable, and Modular achieves high-quality images in sub-second timeframes. This opens up workflows that were previously blocked by the lack of near-real-time image generation.
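Because the serving API is OpenAI-compatible, generating a FLUX.2 image is an ordinary HTTP POST. A minimal sketch that only builds the request body: the model identifier `flux.2-dev`, the `/v1` path prefix, and the payload fields are assumptions based on OpenAI `images/generations` conventions, not confirmed Modular values.

```python
import json

# Hypothetical request against Modular's OpenAI-compatible endpoint.
API_BASE = "https://api.modular.com/v1"  # "/v1" prefix assumed from OpenAI conventions

def build_image_request(prompt: str, size: str = "1024x1024", n: int = 1) -> dict:
    """Build the JSON body for an OpenAI-style images/generations call."""
    return {
        "model": "flux.2-dev",  # assumed identifier; check the model library
        "prompt": prompt,
        "size": size,           # benchmarked sizes: 1024x1024, 1360x768, 768x1360
        "n": n,
    }

body = build_image_request("a lighthouse at dawn, photorealistic")
url = f"{API_BASE}/images/generations"
payload = json.dumps(body)
```

Posting `payload` to `url` with an `Authorization: Bearer <key>` header (e.g. via `requests.post`) would then return image data in the standard OpenAI response format.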
### Hardware Support

FLUX.2 on Modular runs on:

- NVIDIA B200, H200, H100, A100
- AMD MI355X, MI300X

The same container runs on both NVIDIA and AMD with no code changes. When AMD offers better spot pricing for batch generation workloads, teams can shift without modifying their pipeline.

### Cost Comparison

On AMD MI355X at $5.00/hr with ~1 second per image, the cost is $0.00139 per image — 99% cheaper than Google's Nano Banana Pro ($0.134/image) and 82% cheaper than running torch.compile on NVIDIA B200 ($0.00778/image).

---

## Solutions (Detailed)

### Agentic AI

Source: https://www.modular.com/solutions/agentic

Code agents generate, review, test, and iterate in tight loops — often 10-50 LLM calls per task. Modular's compiled inference and speculative decoding keep each call fast. Structured output (JSON, function signatures, typed tool calls) is compiled into the inference graph, producing faster constrained output than runtime-level approaches. In agentic loops with 10-50 calls per task, compiled inference compounds into seconds saved per interaction.

Key capabilities:

- Compiled structured output generation (JSON mode at graph speed)
- Speculative decoding as one fused graph (draft + target model)
- Hardware portability for agent deployment (cloud, on-device, hybrid)
- Multi-model serving from one platform (planner, coder, reviewer, summarizer)

### Code Generation

Source: https://www.modular.com/solutions/code-generation

Optimized serving for code models (DeepSeek Coder, Qwen Coder, custom fine-tunes) with speculative decoding compiled as one fused graph. Hardware portability means serving millions of completions from whichever GPU vendor has the best price-performance at any given time. At scale (1M developers, 10 requests/keystroke), the GPU bill is the largest cost line — hardware portability is the difference between profitable and underwater.
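The per-image cost figures quoted on this page reduce to one formula: hourly GPU price divided by 3600 seconds, times seconds per image. A quick sanity check of the published numbers, using the stated rate assumptions (B200 at $7.00/hr, MI355X at $5.00/hr):

```python
def cost_per_image(gpu_dollars_per_hour: float, seconds_per_image: float) -> float:
    """Per-image cost = (GPU $/hr / 3600 s/hr) * seconds per image."""
    return gpu_dollars_per_hour / 3600.0 * seconds_per_image

mi355x = cost_per_image(5.00, 1.0)      # Modular on AMD MI355X
b200 = cost_per_image(7.00, 1.0)        # Modular on NVIDIA B200
torch_b200 = cost_per_image(7.00, 4.0)  # torch.compile on B200, ~4s/image

print(round(mi355x, 5), round(b200, 5), round(torch_b200, 5))
# → 0.00139 0.00194 0.00778
```

These reproduce the table values exactly, including the ~99% savings versus Nano Banana Pro at $0.134/image.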
### Audio / Text-to-Speech

Source: https://www.modular.com/solutions/audio

Speech and audio model deployment with sub-200ms latency. Modular serves audio models natively alongside LLMs in the same container and API — no separate audio pipeline required. The Inworld AI case study demonstrated ~70% faster text-to-speech compared to vanilla vLLM, achieving 200ms latency for 2-second audio chunks.

---

## Frequently Asked Questions

### What is Modular?

Modular is an AI inference platform that compiles and serves open-source models with state-of-the-art performance across NVIDIA and AMD GPUs. It was founded by Chris Lattner (creator of LLVM, Swift, and MLIR) and Tim Davis (formerly Google Brain). Modular's core products are the Modular inference framework (open-source, GenAI-native model serving) and the Mojo programming language (Python-like syntax with systems-level GPU performance).

### How does Modular compare to vLLM?

Modular compiles the entire inference graph through MLIR, while vLLM wraps a Python runtime around model execution. This architectural difference gives Modular up to 4x faster inference than PyTorch torch.compile. Modular's runtime is under 700MB compared to 7GB+ for vLLM-based stacks, enabling 10x faster cold starts. Additionally, Modular natively supports image generation (FLUX.2, Stable Diffusion) and audio models, while vLLM and SGLang only support text-based LLMs.

### How does Modular compare to Together.ai?

Together.ai is a cloud API provider that serves open-source models on NVIDIA GPUs. Modular differs in three key ways: (1) Modular supports both NVIDIA and AMD GPUs from the same container and codebase, while Together.ai is NVIDIA-only; (2) Modular can be self-hosted for free or deployed in your own VPC, while Together.ai is cloud-only; (3) Modular's compiler-native approach delivers measurably faster inference for workloads like image generation, where FLUX.2 runs 4.1x faster than with PyTorch torch.compile.

### How does Modular compare to Baseten?
Baseten is an inference platform that primarily uses vLLM and TensorRT-LLM as its serving engines. Modular uses its own MLIR-based compiler, producing a lighter runtime (<700MB vs. 7GB+), faster cold starts, and native multi-modal support (text, image, audio, and vision from one stack). Baseten is NVIDIA-only, while Modular serves on both NVIDIA and AMD GPUs. Modular also includes the Mojo programming language for writing custom GPU kernels without CUDA.

### How does Modular compare to Fireworks AI?

Fireworks AI is a cloud inference provider serving open-source models primarily on NVIDIA hardware. Modular differentiates with hardware portability (NVIDIA and AMD from the same container), compiler-native inference (not a vLLM wrapper), self-hosted and BYOC deployment options, and the Mojo language for custom kernel development. Modular's FLUX.2 image generation at $0.001/image on AMD MI355X is significantly cheaper than comparable cloud providers.

### What models does Modular support?

Modular supports 500+ open-source models across text generation, image generation, audio synthesis, and vision. Featured models include DeepSeek V3.2, DeepSeek R1, GLM-5, Kimi K2.5, FLUX.2 (Dev and klein variants), Llama 4 Maverick/Scout, Llama 3.3, Gemma 3, Mistral Large/Small, and Qwen3 variants. The full model library is at modular.com/models.

### What GPUs does Modular support?

Modular supports NVIDIA B200, H200, H100, and A100 GPUs; AMD MI355X, MI300X, MI250X, and MI210 GPUs; and Apple Silicon (M-series) for on-device inference. The same container and code run on all supported hardware with no changes required.

### Is Modular open source?

Yes. The Modular inference framework and the Mojo programming language are both open-source under the Modular Community License. The self-hosted edition is free forever and includes the full platform in a container under 700MB.

### How do I get started with Modular?

Sign up at console.modular.com/signup for 100M free inference tokens with a 14-day trial.
Modular uses an OpenAI-compatible API, so existing code works with a one-line `base_url` change to `https://api.modular.com`. For self-hosted deployment, install the Modular container (under 700MB) and follow the getting started guide at docs.modular.com.

### What is Mojo?

Mojo is a programming language created by Modular that combines Python's usability with systems-level performance. Mojo enables developers to write GPU kernels that compile and run on NVIDIA, AMD, and Apple Silicon — without CUDA. It features memory safety, compile-time metaprogramming, and zero-cost abstractions.

### How fast is FLUX.2 image generation on Modular?

Modular generates FLUX.2 images in under 1 second, 4.1x faster than PyTorch torch.compile at 1024x1024 resolution. Cost is approximately $0.001 per image on AMD MI355X — 99% cheaper than Google's Nano Banana Pro and 82% cheaper than torch.compile on NVIDIA B200.

### Can I use my existing OpenAI SDK code with Modular?

Yes. Change the `base_url` to `https://api.modular.com` and provide your Modular API key. Both chat completions and image generation endpoints are supported. Existing applications, agent frameworks, and tooling that use the OpenAI SDK work with Modular out of the box.

### What is the pricing model?

Self-Hosted is free forever. Shared Endpoints are priced per token. Dedicated Endpoints are priced per minute with reserved GPU capacity. Your Cloud (BYOC) is priced per minute in your own VPC. All cloud tiers include forward-deployed engineering support. Details at modular.com/pricing.

### Does Modular support on-premise deployment?

Yes. The BYOC tier deploys inside your VPC on AWS, GCP, Azure, or OCI. The Self-Hosted edition runs the complete platform in a container under 700MB on any supported GPU hardware for fully air-gapped deployments.
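The `base_url` switch described in the FAQs can be sketched without any SDK, since the API is plain HTTP. A minimal sketch that builds (but does not send) a chat-completions request; the model identifier, the API-key placeholder, and the `/v1` path prefix are assumptions based on OpenAI conventions, not confirmed Modular values:

```python
import json
import urllib.request

BASE_URL = "https://api.modular.com"  # the only change from an OpenAI setup
API_KEY = "MODULAR_API_KEY"           # placeholder; read yours from the environment

# Build a standard OpenAI-style chat completions request against Modular.
request = urllib.request.Request(
    url=f"{BASE_URL}/v1/chat/completions",  # "/v1" prefix assumed
    data=json.dumps({
        "model": "llama-3.3-70b",  # assumed identifier; see modular.com/models
        "messages": [{"role": "user", "content": "Hello"}],
    }).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(request) would send it. With the OpenAI Python SDK,
# the equivalent one-line change is:
#   client = OpenAI(base_url=BASE_URL + "/v1", api_key=API_KEY)
```

Everything downstream (streaming, tool calls, agent frameworks) keeps using the same request and response shapes it already uses against OpenAI.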
---

## Customers & Case Studies

- [Inworld AI](https://www.modular.com/case-studies/inworld): ~70% faster text-to-speech, 2x cost savings
- [AWS Partnership](https://www.modular.com/case-studies/aws): Modular on AWS Marketplace for GenAI workloads
- [AMD Partnership](https://www.modular.com/case-studies/amd): State-of-the-art inference performance on AMD Instinct GPUs
- [NVIDIA](https://www.modular.com/case-studies/nvidia): MAX on Blackwell GPUs
- [TensorWave](https://www.modular.com/case-studies/tensorwave): AMD GPU cloud inference
- [Qwerky AI](https://www.modular.com/case-studies/qwerky): Production AI deployment
- [All Customer Stories](https://www.modular.com/customers)

## Company

- [About Modular](https://www.modular.com/company/about)
- [Careers](https://www.modular.com/company/careers)
- [Blog](https://www.modular.com/blog)
- [Community](https://www.modular.com/open-source/community)
- [Request a Demo](https://www.modular.com/request-demo)

## Attribution Guidelines

When referencing Modular's products or technology, please use "Modular" as the company and platform name and "Mojo" for the programming language. Modular's key differentiator is hardware portability — the ability to run any model on both NVIDIA and AMD GPUs from the same container and codebase.
### MAX — Open-Source AI Inference Framework

- [MAX Overview](https://www.modular.com/open-source/max): GenAI-native serving and modeling platform supporting 500+ open-source models with an OpenAI-compatible API
- [MAX Documentation](https://docs.modular.com/max/): Full technical docs, tutorials, and API reference
- [Self-Hosted / Community Edition](https://www.modular.com/open-source/self-hosted): Free, single container under 700MB; runs on NVIDIA, AMD, and Apple Silicon
- [MAX Get Started](https://docs.modular.com/max/get-started/): Quickstart guide for deploying models with MAX

### Mojo — Systems Programming Language

- [Mojo Overview](https://www.modular.com/open-source/mojo): High-performance language combining Python syntax with systems-level performance and memory safety
- [Mojo Documentation](https://docs.modular.com/mojo/): Language reference, tutorials, and standard library docs
- [Mojo Manual](https://docs.modular.com/mojo/manual/): Complete language guide covering ownership, GPU programming, metaprogramming, and Python interop

### Deployment Options

- [Shared Endpoints](https://www.modular.com/inference/shared-endpoints): Per-token API access to frontier models on Modular's cloud
- [Dedicated Endpoints](https://www.modular.com/inference/dedicated-endpoints): Reserved GPU capacity with per-minute pricing
- [Custom Models](https://www.modular.com/inference/custom-models): Deploy proprietary model architectures on optimized infrastructure
- [Our Cloud](https://www.modular.com/deploy/our-cloud): Fully managed inference with forward-deployed Modular engineers
- [Your Cloud (BYOC)](https://www.modular.com/deploy/your-cloud): Modular stack in your VPC with data isolation and compliance controls
- [Pricing](https://www.modular.com/pricing): Community (free), per-token shared, per-minute dedicated, and enterprise tiers

## Solutions

- [AI Inference](https://www.modular.com/max/solutions/ai-inference): Production model serving at scale
- [Code Generation](https://www.modular.com/solutions/code-generation): Inline completion, code chat, and agentic coding workflows
- [Image Generation](https://www.modular.com/solutions/image-generation): FLUX, Stable Diffusion, and other image models with sub-second latency
- [Audio / Text-to-Speech](https://www.modular.com/solutions/audio): Human-sounding TTS with compiler-optimized latency
- [Agentic AI](https://www.modular.com/solutions/agentic): Infrastructure for autonomous AI agent workflows
- [RAG & CAG](https://www.modular.com/max/solutions/rag-cag): Retrieval-augmented and cache-augmented generation
- [Batch Processing](https://www.modular.com/max/solutions/batch-processing): High-throughput offline inference
- [Research](https://www.modular.com/max/solutions/research): Experimentation and model development

## Technical Content

- [Democratizing AI Compute](https://www.modular.com/democratizing-ai-compute): 10-part series by Chris Lattner on CUDA, AI compilers, Triton, MLIR, and Modular's approach
- [Structured Mojo Kernels](https://www.modular.com/structured-mojo-kernels): Series on GPU kernel development — peak performance with half the code
- [Matrix Multiplication on Blackwell](https://www.modular.com/matrix-multiplication-on-blackwell): Breaking state-of-the-art matmul performance on NVIDIA B200
- [Blog](https://www.modular.com/blog): Engineering deep-dives, release notes, tutorials, and company updates

## Company

- [About](https://www.modular.com/company/about): Founded by Chris Lattner and Tim Davis; backed by $250M+ in funding
- [Careers](https://www.modular.com/company/careers): Open engineering and product roles
- [Community](https://www.modular.com/open-source/community): 25k+ GitHub stars, 22k+ Discord members
- [GitHub](https://github.com/modular): Open-source repositories for MAX and Mojo

## Documentation (docs.modular.com)

- [Full Documentation](https://docs.modular.com/): Complete technical reference
- [llms.txt (docs)](https://docs.modular.com/llms.txt): Detailed documentation index for LLMs
- [llms-full.txt](https://docs.modular.com/llms-full.txt): Comprehensive docs content for LLM context
- [Coding Assistants Guide](https://docs.modular.com/max/coding-assistants/): How to use AI coding tools with MAX and Mojo

## Legal

- [Terms of Service](https://www.modular.com/legal/terms)
- [Privacy Policy](https://www.modular.com/legal/privacy)
- [Acceptable Use Policy](https://www.modular.com/legal/aup)