
May 29, 2026

Three trends from MLSys 2026

Michael Dunn-OConnor

Brian Zhang

Shouzheng Liu

Engineering

MLSys 2026 provided an excellent overview of the current state of inference across research and industry. With six sessions on LLM serving this year (twice as many as last year), the program covered opportunities and challenges at the core of Modular’s recent work. Modular was glad to sponsor the conference, and our team noted three trends that stood out across the talks, posters, and keynotes. Modular has been addressing all three from first principles, with the advantage of our unique stack.

Trend 1: Agents are writing everything from kernels to systems

Monday’s keynote set the tone. Mark Saroufim's When AI Starts Writing Systems Code showed examples of novice kernel developers using AI agents to write kernels that could place them near the top of competitive hackathons. He then comically undercut some of these agentic achievements by demonstrating how agents would cheat the benchmarks and optimize for results that would never generalize outside of the test cases provided. Rather than waiting for a generation of agents that always plays by the rules, Saroufim argued for “zero trust” verification: benchmarks comprehensive enough that they do not depend on good-faith submissions.
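One way to picture the “zero trust” idea is a harness that checks a submission against a trusted reference on fresh random inputs the submitter never sees. The sketch below is illustrative only and is not taken from the keynote: the softmax example, the `verify` function, and the deliberately cheating submission are all hypothetical.

```python
import math
import random

def reference_softmax(xs):
    """Trusted (slow) reference implementation."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# A public test case that a submission could memorize.
PUBLIC_CASES = {(1.0, 2.0, 3.0): reference_softmax([1.0, 2.0, 3.0])}

def cheating_softmax(xs):
    """Hard-codes the public case and returns a uniform distribution otherwise."""
    return PUBLIC_CASES.get(tuple(xs), [1.0 / len(xs)] * len(xs))

def verify(candidate, reference, trials=200, seed=None, tol=1e-9):
    """Compare candidate and reference on randomly generated inputs.

    Input lengths and values are drawn at verification time, so passing
    the published test cases is not enough to pass this check.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 64)
        xs = [rng.uniform(-50.0, 50.0) for _ in range(n)]
        got = candidate(list(xs))
        want = reference(list(xs))
        if len(got) != len(want):
            return False
        if any(abs(g - w) > tol for g, w in zip(got, want)):
            return False
    return True
```

The cheating submission reproduces the public case exactly but fails `verify`, while the reference passes. A real kernel benchmark would also need to randomize shapes, dtypes, and memory layouts, and to measure performance on the same hidden inputs.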

Lidong Zhou's Tuesday keynote The Next Horizon of Systems: From MLSys to System Intelligence argued that the systems community needs to plan for AI agents as the primary authors of low-level code. Zhou presented a Rust microkernel (Nanvix) where AI-generated specifications and proofs are verified module by module, with a pass rate on a 150-task proof generation benchmark that climbed from 2 percent (GPT-4o, prompt-based) to 91.3 percent (fine-tuned LLaMA-3.1 8B with self-debugging). The talk also documented the shortcuts the model takes when it cannot complete a proof: wrapping code in external_body to bypass the verifier, planting false postconditions, and shifting proof burden to callers.

The shared conclusion of these talks was that agentic engineering requires substantially greater rigor in specification, design, and validation.

Subsequent talks showcased specific applications of agentic engineering across kernels and systems. AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization describes a closed-loop system where an LLM proposes accelerator kernel variants, profiles them, and feeds the results back to itself. FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems in the ML for Systems session frames this as a feedback loop: the benchmark exists to give the agents something to optimize against.

The kernel-author pain that motivates all of this also showed up directly. FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling moved its implementation off CUDA C++ templates and into CuTe-DSL embedded in Python with the explicit goal of letting downstream developers extend the kernel without modifying the core framework. HipKittens: Fast and Furious AMD Kernels and ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels both argued for a simpler set of abstractions for kernel engineering. The shared assumption is that human kernel authors are not going to write a new template forest for every accelerator generation, and that abstraction benefits agents at least as much as human developers.

Modular’s Solution

Modular engineers and the community are already writing Mojo code with agents and finding that many of its features are well suited to agentic engineering. Mojo’s robust type system, efficient compilation, and clear error messages support the tight feedback loops of agentic development with human verification. Our blog post on Translating to Mojo via AI Agents provides a practical guide to this workflow. The official modular/skills package plugs into Claude Code, Cursor, and other coding agents and corrects misconceptions and out-of-date patterns that models may produce. In the post, Brad Larson walks an agent from a CUDA softmax kernel (Szymon Ożóg's FastSoftmax) to a portable Mojo version that runs on NVIDIA, AMD, and Apple silicon in a single session. Automatika Robotics did the same for the autonomous-navigation kernels in their EMOS / kompass-core workload and reported 15.973 ms versus a 16.358 ms SYCL/CUDA baseline on the agent's first pass, with no Mojo-side optimization. Another blog post from Ehsan Kermani demonstrates effective agentic engineering to create Mojo libraries that are thoroughly tested and meet a real community need.

The second connection is on the kernel side. The composable abstractions the Modular kernel team created are documented in our Structured Mojo Kernels blog series. This pattern breaks production kernels into three components (TileIO, TilePipeline, TileOp) with context managers that make incorrect synchronization unrepresentable. Rewriting the B200 matmul cut 14,683 lines down to 7,634 with consistent performance at 1770 TFLOPS. The Structured Mojo Kernels patterns apply to both NVIDIA and AMD and they are open source in the Modular repository. This is the direct answer to the FlashAttention-4 maintenance argument: a kernel layer that is concise enough for a human (or an agent) to reason about, with abstractions that hold up under hardware changes, without a tradeoff in performance.
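To illustrate how a context manager can tie synchronization to scope, here is a minimal Python sketch. It is not the Structured Mojo Kernels API: `TilePipelineSketch`, its methods, and the `run` helper are hypothetical names, and Python can only enforce the rule at run time, whereas the blog series describes patterns that rule out the mistake by construction.

```python
from contextlib import contextmanager

class TilePipelineSketch:
    """Hypothetical ring of buffer slots shared by a producer and a consumer."""

    def __init__(self, num_stages):
        self.slots = [None] * num_stages
        self.full = [False] * num_stages
        self.head = 0  # next slot the producer fills
        self.tail = 0  # next slot the consumer drains

    @contextmanager
    def produce(self):
        idx = self.head
        if self.full[idx]:
            raise RuntimeError("producer would overwrite an unconsumed tile")
        writer = {}
        yield writer
        # Leaving the block is the only way to publish the tile,
        # so "wrote but forgot to signal" cannot be expressed.
        self.slots[idx] = writer["tile"]
        self.full[idx] = True
        self.head = (idx + 1) % len(self.slots)

    @contextmanager
    def consume(self):
        idx = self.tail
        if not self.full[idx]:
            raise RuntimeError("consumer would read a tile that was never produced")
        yield self.slots[idx]
        # Leaving the block releases the slot back to the producer.
        self.full[idx] = False
        self.tail = (idx + 1) % len(self.slots)

def run(tiles, num_stages=2):
    """Push each tile through the pipeline and double it on the consumer side."""
    pipe = TilePipelineSketch(num_stages)
    out = []
    for tile in tiles:
        with pipe.produce() as writer:
            writer["tile"] = tile
        with pipe.consume() as tile_in:
            out.append(tile_in * 2)
    return out
```

Because acquire and release live in the entry and exit of the `with` block, kernel code never calls a signal or wait primitive directly, which is the property the structured patterns aim for.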

Trend 2: KV cache became the dominant subsystem

KV cache was the single most discussed topic at the conference. The LLM Serving 3 session on Wednesday afternoon was effectively a KV cache session, with six papers in a row presented on the topic.

Yuhan Liu's talk on LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference made the architectural argument directly. The paper treats the KV cache as a first-class data structure rather than an internal byproduct of inference, and reports that real-world deployments have made KV cache usage outgrow GPU memory: over five weeks of telemetry, the large majority of stored KV cache no longer fit on the GPUs, while reuses per token grew by more than 19 percent across users. The system now supports eight storage backends (NFS, WEKA, GPU-Direct Storage, Mooncake Store, NIXL, S3, InfiniStore, Valkey) across four processor types (NVIDIA, AMD, Ascend, TPU) and two inference engines (vLLM, SGLang). That breadth of coverage is essential: the cache layer has to be portable because the storage and compute hardware underneath it is not homogeneous.

Read together, these papers describe the same shift. The KV cache is increasingly distributed, heterogeneous, and complicated, with placement, transfer, eviction, and reuse policies that span GPU memory, host DRAM, local disk, and distributed storage. The trend has gone far enough that there is now a market for custom silicon dedicated to the cache: Netpreme is building a Memory Processing Unit with networked memory tiering whose pitch is extending XPU memory by 100x. When a subsystem starts pulling in its own hardware vendors, it is no longer an implementation detail.
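As a rough picture of what placement and eviction across tiers involves, the sketch below models a cache with ordered tiers, least-recently-used demotion, and promotion on reuse. It is a toy under stated assumptions: the class name, the tier names, and the entry-count capacities are hypothetical, and it does not represent LMCache or any other system from the session.

```python
from collections import OrderedDict

class TieredKVCacheSketch:
    """Toy multi-tier cache: fastest tier first, LRU demotion to slower tiers."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), ordered fastest to slowest.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is dropped
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        while len(store) > cap:
            old_key, old_value = store.popitem(last=False)  # evict the LRU entry
            self.put(old_key, old_value, level + 1)  # demote it one tier down

    def get(self, key):
        """Return (tier_name, value) for a hit, or (None, None) for a miss."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self.put(key, value, 0)  # promote reused entries to the fastest tier
                return name, value
        return None, None
```

For example, with `TieredKVCacheSketch([("gpu", 1), ("host", 1)])`, inserting a second entry demotes the first to the host tier, and a later `get` on that first entry reports the tier it was found in and promotes it back. Production systems track bytes rather than entry counts and overlap transfers with compute, which this sketch ignores.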

The other techniques for optimizing the cache are quantization and sparsity, used together to tame the memory demand directly. Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost improves throughput by enabling 8x larger batches under the same memory budget. The SGLang team's HiSparse: Turbocharging Sparse Attention with Hierarchical Memory (which Christos Kozyrakis flagged in his Friday keynote on inference efficiency) pairs sparse attention with a hierarchical memory tier: the GPU keeps a hot device buffer of frequently accessed KV regions while inactive entries offload to host memory. The result is up to 5x throughput on long-context workloads with GLM-5.1-FP8. Quantization shrinks what the cache costs to hold, sparsity shrinks what the engine has to attend over, and pairing them is the practical answer for inference at long context.
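The memory arithmetic behind the quantization claim can be checked with a back-of-the-envelope calculation. The sketch below assumes a 16-bit baseline and a hypothetical model shape (32 layers, 8 KV heads, head dimension 128); it ignores quantization metadata such as scales and any channels kept at higher precision, so it gives an idealized upper bound rather than a figure from the paper.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_value):
    """Bytes of KV cache one token occupies (factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bits_per_value / 8

def max_cached_tokens(budget_bytes, **model):
    """How many tokens fit in a fixed memory budget."""
    return int(budget_bytes // kv_bytes_per_token(**model))

# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 16 * 1024**3  # 16 GiB set aside for KV cache

tokens_fp16 = max_cached_tokens(budget, bits_per_value=16, **shape)
tokens_2bit = max_cached_tokens(budget, bits_per_value=2, **shape)
ratio = tokens_2bit / tokens_fp16
```

Under these assumptions each token costs 131,072 bytes at 16 bits and 16,384 bytes at 2 bits, so the same budget holds 8 times as many tokens, which is consistent with the batch-size gain reported above.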

The common thread is that the cache is being pulled out of the engine and turned into a distributed system in its own right, with its own scheduling, its own storage tiering, and its own reuse semantics.

Modular’s Solution

This is the trend Brian Zhang described in his blog post The Five Eras of KVCache. The post traces the cache's evolution from a local in-engine optimization in 2023, through PagedAttention, prefix caching, and offloading, ending in an era where the cache is unified, distributed, and composable across heterogeneous infrastructure. The MLSys 2026 papers collectively document this ongoing effort.

For context on how distributed KV cache works at the cluster level, check out the inference routing series on the Modular blog. Why LLM Inference Needs a New Kind of Router makes the case that routing decisions and KV cache state are inseparable, that the router has to be cache-aware, and that cumulative chaining and bitmap indexing are what make that practical at scale. Zooming out, Kyle Caverly’s walkthrough of MAX Serve from prompt to response is a great primer on where the KV cache fits into the overall inference platform. The MLSys papers describe a multitude of approaches to efficient KV cache management, and they validate the urgency of what Modular has been building for years. Modular Cloud implements a composable approach to large-scale distributed inference that works across hardware and adapts to ever-changing optimizations.
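One plausible reading of cumulative chaining and bitmap indexing is sketched below: each block hash folds in the previous block's hash, so a single hash identifies an entire prefix, and an index maps each hash to a bitmap of the workers that hold it. This is an assumption-laden illustration, not Modular's router: the block size, the class and function names, and the tie-breaking rule are all hypothetical.

```python
import hashlib

BLOCK = 4  # tokens per cache block (hypothetical value)

def chained_hashes(tokens):
    """Hash each full block together with the previous block's hash."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        block = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes

class CacheAwareRouterSketch:
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.index = {}  # chained hash -> bitmap of workers caching that prefix

    def record(self, worker, tokens):
        """Note that `worker` now caches every full-block prefix of `tokens`."""
        for h in chained_hashes(tokens):
            self.index[h] = self.index.get(h, 0) | (1 << worker)

    def route(self, tokens):
        """Return (worker, matched_blocks) for the longest cached prefix."""
        candidates = (1 << self.num_workers) - 1  # start with every worker
        matched = 0
        for h in chained_hashes(tokens):
            narrowed = candidates & self.index.get(h, 0)
            if narrowed == 0:
                break  # no remaining worker holds this longer prefix
            candidates, matched = narrowed, matched + 1
        # Tie-break by lowest worker id; a real router would also weigh load.
        worker = (candidates & -candidates).bit_length() - 1
        return worker, matched
```

Narrowing the candidate set is a single bitwise AND per block, which is why a bitmap is attractive when the router has to make this decision for every request.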

Trend 3: Inference workloads are leveraging heterogeneous hardware

Heterogeneity at MLSys 2026 ran through all six LLM Serving sessions and both Industry Track sessions on serving. Esha Choukse's Day 1 invited talk Beyond Model Serving: Cross-Stack Co-Design for Agentic Systems set the frame: hardware diversity is a prerequisite for efficiently serving interactive, multimodal, and agentic systems.

Meta's Industry Track paper Optimizing Deployment Configurations for LLM Inference showed the benefits of heterogeneity even in single-model deployments. The paper documents 15 to 25 percent TCO (total cost of ownership) improvements from running prefill on one accelerator type and decode on another, because prefill is compute-bound (favoring high FLOP/s) and decode is memory-bandwidth-bound (favoring high HBM bandwidth). NVIDIA's Beyond the Buzz: A Pragmatic Take on Inference Disaggregation in the same session conceded that prefill-decode disaggregation is real and valuable, but only if rate matching, KV transfer, cache routing, and elastic scaling are solved simultaneously. BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization treated the choice of which model runs on which accelerator as a search problem. TriInfer: Hybrid EPD Disaggregation for Efficient Multimodal Large Language Model Inference in the Multimodal and Generative Models session extended disaggregation past two phases into encode-prefill-decode for multimodal workloads.
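The compute-bound versus bandwidth-bound distinction follows from a simple roofline estimate. The sketch below considers only the weight matrix multiplications, assumes 16-bit weights read once per step, and uses a hypothetical accelerator with 1 PFLOP/s of compute and 4 TB/s of memory bandwidth; attention, KV cache reads, and activations are ignored, and none of the numbers come from the papers.

```python
def arithmetic_intensity(tokens_in_flight, bytes_per_param=2):
    """FLOPs per byte of weights read.

    Each parameter contributes about 2 FLOPs per token (multiply and add)
    and is read from memory once per step, so the parameter count cancels.
    """
    return 2 * tokens_in_flight / bytes_per_param

def bound(tokens_in_flight, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Classify a step as 'compute' or 'memory' bound on a given accelerator."""
    balance = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    intensity = arithmetic_intensity(tokens_in_flight, bytes_per_param)
    return "compute" if intensity >= balance else "memory"

# Hypothetical accelerator: 1 PFLOP/s and 4 TB/s, so balance = 250 FLOP/byte.
prefill = bound(2048, peak_flops=1e15, mem_bandwidth=4e12)  # 2048-token prompt
decode = bound(8, peak_flops=1e15, mem_bandwidth=4e12)      # 8 sequences, 1 token each
```

With these numbers the prefill step has an intensity of 2048 FLOP/byte and is compute-bound, while the decode step has an intensity of 8 and is memory-bound. That gap is the reason an accelerator with more FLOP/s suits prefill and one with more HBM bandwidth suits decode.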

The hardware-specific kernel work showed why this matters. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips found that existing offloading frameworks utilize the NVLink-C2C interconnect on GH200 at less than 5 percent of its 900 GB/s capacity because they treat it as if it were PCIe. The bottleneck, the paper concluded, is in the software stack. SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving described Groq's LPU-based serving stack, where the entire model fits in SRAM and the compiler statically schedules collective communication at cycle granularity. TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference depends on PyTorch's SymmetricMemory API and NVLink4's NVSHARP engines, both of which exist precisely because vendor-specific collective communication paths no longer fit how kernel authors want to work.

Heterogeneous hardware provides an edge in multimodal inference and cost optimization, but it can be constrained by vendor-specific software stacks.

Modular’s Solution

MAX is built from the ground up for hardware plurality. One container runs on NVIDIA GPUs, AMD GPUs, and CPUs. The kernel layer is written in Mojo, which compiles to each hardware target rather than wrapping vendor-specific libraries. The runtime is hardware-aware but not hardware-specific.

When run on real workloads (e.g. Hippocratic AI), MAX delivers sub-500ms mean TTFT, roughly 30 percent faster P99 end-to-end, and 22 percent faster mean end-to-end against SGLang on NVIDIA B300 GPUs for 400B-plus parameter models. When compared against vLLM on B200, the Modular stack is 5.5x faster P50 TTFT on Kimi-K2.5 and 2.5x faster P99 TTFT on Gemma-4-31B-it, with 1.5x throughput on both. The Flux.2-dev image generation workload runs 6.9x faster than PyTorch Diffusers with torch.compile on B200 and 3.8x faster on AMD MI355X, in the same software stack. You can read more about MAX’s state-of-the-art performance in Modular’s MLSys lightning talk.

Efficient inference serving requires getting the full potential of all available hardware to meet ever-changing industry needs. The MLSys 2026 sessions look at components of the inference stack and optimize them in isolation: kernels, KV cache, serving, and so on. Modular rewrote the entire stack as a single system, which lets us perform holistic optimizations from kernel to cloud, stay at the leading edge of research, and achieve optimizations that cut across the entire stack.

Thanks to all those who stopped at the Modular booth to discuss our stack and open roles at Modular. Please reach out on our forums if you’re interested in contributing to our open-source projects or collaborating on research. We look forward to MLSys 2027, both to see how research has developed and to share our own ongoing innovations. See you next year!

