February 5, 2026

The Five Eras of KVCache

Brian Zhang

Engineering

Introduction

Key–Value Cache (KVCache) is a foundational building block of modern LLM serving systems. It stores past attention states so the model can generate new tokens efficiently without excessive re-computation.

There are two phases to LLM inference: Prefill and Decode. In the Prefill phase, the attention states are computed for each token in the input prompt. In the subsequent Decode phase, new tokens are generated one by one in an autoregressive fashion by attending to the keys and values associated with the previous tokens.

Source: https://www.nature.com/articles/s41586-023-06647-8
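To make the two phases concrete, here is a minimal single-head, single-layer sketch in NumPy (random values stand in for the real projections, and the dimensions are made up): Prefill computes and caches K/V for every prompt token once, and each Decode step appends a single row and attends over the cache instead of recomputing the entire history.

```python
import numpy as np

HEAD_DIM = 64
rng = np.random.default_rng(0)

def attend(q, k_cache, v_cache):
    """One query token attending over all cached keys/values (single head)."""
    scores = k_cache @ q / np.sqrt(HEAD_DIM)      # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over cached tokens
    return weights @ v_cache                      # (HEAD_DIM,)

# Prefill: compute and cache K/V for all 8 prompt tokens in one pass.
k_cache = rng.standard_normal((8, HEAD_DIM))      # stand-in for W_k projections
v_cache = rng.standard_normal((8, HEAD_DIM))

# Decode: each generated token adds exactly one K/V row to the cache.
for _ in range(4):
    q_new = rng.standard_normal(HEAD_DIM)         # query for the newest token
    _ = attend(q_new, k_cache, v_cache)
    k_cache = np.vstack([k_cache, rng.standard_normal((1, HEAD_DIM))])
    v_cache = np.vstack([v_cache, rng.standard_normal((1, HEAD_DIM))])
```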

vLLM, SGLang, TensorRT-LLM, and MAX Serve are all built on top of increasingly sophisticated KVCache management. This blog explores the evolution and role of the KVCache in these inference engines.

Era 0: Pre-GenAI (<2017)

Before transformers took over, deep learning was dominated by stateless, feed-forward architectures like ResNet, YOLO, VGG, and Inception. These models did not require persistent state across inference steps, so the concept of a KVCache simply didn’t exist even in inference frameworks like ONNX or TensorRT.

Era 1: Continuous KVCache (2017)

The original transformer (2017) established the architecture that would eventually dominate ML. This design was a departure from prior models, requiring a KVCache to efficiently keep track of the state associated with each request. Nevertheless, the major step-change in intelligence enabled by transformers more than justified their added complexity.

At the time, early LLM serving engines implemented KV caches naively:

  • For each request, they preallocated a contiguous KV tensor with max_seq_len tokens.
  • The storage was 2 × num_layers × num_heads × head_dim × max_seq_len per request.
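To see how aggressively that formula scales, here is a back-of-the-envelope calculation for a hypothetical Llama-7B-class configuration (these numbers are illustrative, not taken from the post):

```python
num_layers, num_kv_heads, head_dim = 32, 32, 128    # hypothetical 7B-class model
max_seq_len = 4096
bytes_per_elem = 2                                  # FP16 keys and values

# 2x accounts for storing both keys and values.
per_request_bytes = 2 * num_layers * num_kv_heads * head_dim * max_seq_len * bytes_per_elem
print(per_request_bytes / 2**30, "GiB per request")  # -> 2.0 GiB, reserved up front

# A 200-token chat request still reserves all 4096 slots, so roughly 95%
# of those 2 GiB sit idle under the contiguous design.
```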

This Contiguous KVCache design was extremely wasteful, but still offered huge performance gains over recomputing attention keys/values for each token:

  • ✔ Simple
  • ✘ Memory usage scales aggressively due to the max_seq_len × batch_size factor
  • ✘ Constrained max_batch_size due to limited memory capacity
  • ✘ High memory fragmentation due to variable-length requests
  • ✘ Most requests are far shorter than max_seq_len, leaving much wasted capacity

This was the approach of early inference engines like HuggingFace Transformers.

Era 2: PagedAttention (2023)

A breakthrough arrived with PagedAttention, introduced by vLLM. The key idea was to borrow paging from operating systems: instead of one contiguous reservation per request, KV is stored in fixed-size pages that are allocated dynamically as each sequence grows.
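A minimal sketch of the bookkeeping behind this idea, with a made-up block size and pool size (real engines add reference counting, copy-on-write, eviction, and GPU-resident block tables): each sequence owns a block table mapping its logical blocks to physical pages handed out on demand.

```python
BLOCK_SIZE = 16  # tokens per KV page (illustrative)

class PagedKVAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # pool of physical pages
        self.block_tables = {}                        # seq_id -> list of physical page ids

    def slot_for_token(self, seq_id, token_index):
        """Return (physical page, slot) for a token, allocating a page at block boundaries."""
        table = self.block_tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:             # previous pages are full
            table.append(self.free.pop())
        return table[-1], token_index % BLOCK_SIZE

    def release(self, seq_id):
        """Return a finished sequence's pages to the shared pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_physical_blocks=1024)
for t in range(40):                                   # a 40-token sequence only touches 3 pages
    page, slot = alloc.slot_for_token("req-0", t)
alloc.release("req-0")
```

Because pages are fixed-size and pooled, a short request holds only the pages it actually filled, which is what eliminates the max_seq_len-sized reservations of Era 1.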

Benefits:

  • ✔ Dramatically improves memory utilization and reduces fragmentation
  • ✔ Enables hundreds / thousands of concurrent requests
  • ✔ Drives up throughput via larger batch sizes
  • ✔ Allows for efficient KVCache reuse via Prefix Caching, a huge throughput multiplier for multi-turn chat workloads
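The prefix-caching piece typically works by keying each full KV block on the tokens it holds, chained with its parent block's key, so a hit on block N implies the whole prefix up to N matches. A simplified sketch of that keying scheme (not vLLM's exact implementation):

```python
import hashlib

BLOCK_SIZE = 16
cache = {}  # block hash -> physical page id (shared across requests)

def block_hashes(token_ids):
    """Hash each *full* block, chaining in the parent hash so equal hashes
    imply the entire prefix up to and including that block is identical."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        parent = hashlib.sha256(parent + repr(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(parent)
    return hashes

def matched_prefix_blocks(token_ids):
    """How many leading KV blocks of a new request can be reused as-is."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hits += 1
    return hits
```

In a multi-turn chat, every new turn re-sends the growing conversation, so most of its leading blocks hit the cache and only the new suffix needs a real Prefill.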

PagedAttention became the de-facto standard for LLM serving, leading to new inference engines like TensorRT-LLM and SGLang.

Era 3: Heterogeneous KVCaches (2024)

The ML world and the LLM serving landscape are far more complex now. New optimizations, along with modern multimodal and hybrid models, require multiple different kinds of state, each with separate caching requirements. In this era, the term “KVCache” is being stretched far beyond its original meaning.

  1. Speculative decoding accelerates LLM inference by having a small draft model generate multiple tokens ahead and then using a larger target model to verify and accept those tokens in a single pass. With this technique, a separate KVCache needs to be maintained for the draft and target model.
  2. Vision encoders in Vision–Language Models (VLMs) generate large image embeddings that can be cached and reused across requests. While this differs from the traditional notion of a “KVCache” or prefix caching, it follows the same underlying principle of memoizing expensive intermediate states. Models which benefit from this cache include QwenVL and InternVL.
  3. Quantized KVCache: Low-precision datatypes like FP8 help reduce the storage requirements of the KVCache and rely on per-tensor/row/block scaling factors to preserve numerical range. This requires the KVCache implementation to also manage memory for these scaling factors (see the first sketch after this list).
  4. Sliding Window Attention (SWA) limits each token to attend only to the preceding window_size tokens instead of the entire sequence, reducing memory and compute. As a result, KVCache management and prefix caching must track which tokens fall within the current window, making cache hits and evictions more complex than in full attention (see the second sketch after this list).
    Fig 11. https://arxiv.org/pdf/2503.18292
  5. Mamba / State Space Models replace attention with a recurrent state that updates a single large vector for each new token. This makes prefix caching more complex because serving systems must decide when and how to checkpoint or store the evolving state vector for future reuse.
  6. Composite Models are composed of multiple sub-models. For example, it is a common pattern to combine an LLM backbone with an audio decoder. Each of these sub-models may require maintaining separate KV caches.
  7. Hybrid Models combine multiple layer types within a single model, which often necessitates maintaining multiple KV caches to handle each layer’s distinct attention or state mechanism. Examples include:
    1. Sliding Window Attention + Full Attention (Gemma2/3, Ministral, GPT-OSS, Cohere)
    2. Mamba + Full Attention (Jamba, Bamba, Minimax)
    3. Local Chunked + Full Attention (Llama4)
Fig 1. https://arxiv.org/pdf/2503.18292
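For the quantized KVCache in item 3, the essential change is that each block of keys/values carries its own scaling factor, which the cache must also store and keep paired with the data. The sketch below uses int8 as a stand-in, since NumPy has no FP8 dtype; the per-block scale bookkeeping is the point, not the exact number format.

```python
import numpy as np

BLOCK_SIZE, HEAD_DIM = 16, 64

def quantize_kv_block(kv_block):
    """Per-block scaling: one scale per KV block, stored alongside the quantized data."""
    scale = np.abs(kv_block).max() / 127.0 + 1e-12
    q = np.round(kv_block / scale).astype(np.int8)
    return q, np.float32(scale)

def dequantize_kv_block(q, scale):
    return q.astype(np.float32) * scale

kv_block = np.random.randn(BLOCK_SIZE, HEAD_DIM).astype(np.float32)
q, scale = quantize_kv_block(kv_block)
# The cache must budget memory for both buffers: BLOCK_SIZE * HEAD_DIM bytes for q
# (4x smaller than FP32) plus the scale itself, and keep them paired on eviction/reuse.
```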
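And for the sliding window attention in item 4, only the last window_size tokens ever need to stay resident, so the per-sequence cache can be a fixed-size ring buffer; the subtlety for prefix caching is knowing which token positions are still inside the window. A minimal single-layer sketch (the window size is illustrative):

```python
import numpy as np

WINDOW, HEAD_DIM = 128, 64

class SlidingWindowKV:
    """Ring buffer holding K/V for only the last WINDOW tokens of one sequence."""
    def __init__(self):
        self.k = np.zeros((WINDOW, HEAD_DIM))
        self.v = np.zeros((WINDOW, HEAD_DIM))
        self.pos = 0                                # total tokens seen so far

    def append(self, k_t, v_t):
        slot = self.pos % WINDOW                    # overwrite the oldest entry
        self.k[slot], self.v[slot] = k_t, v_t
        self.pos += 1

    def visible(self):
        """K/V rows still attendable (slot order is fine once positions are encoded into K, e.g. RoPE)."""
        n = min(self.pos, WINDOW)
        return self.k[:n], self.v[:n]

    def contains(self, token_index):
        """A prefix-cache hit is only valid if that token is still inside the window."""
        return self.pos - WINDOW <= token_index < self.pos
```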
  • This list is non-exhaustive. There are a ton of other ideas like Transfusion for joint text–image generation, dynamic KVCache compression, Cross-Attention (not to be confused with Cross-Layer Attention), etc.
  • This heterogeneity and diversity of KVCaches with different shapes, lifetimes, and properties led to the creation of specialized managers in modern LLM serving engines. For example, vLLM has the Vision Encoding Cache, Mamba Cache, etc., in addition to its normal KVCache.

    There are several challenges emerging with this design:

    • ✘ Memory fragmentation due to multiple KVCache managers can lead to small batch sizes
    • ✘ Challenging to predict at server startup how much memory to allocate per KVCache
    • ✘ Disjoint Prefix Caching implementations lead to suboptimal cache hit rates
    • ✘ Diversity makes feature composition challenging

    Era 4: Distributed KVCache (2025+)

    As LLMs grow in size and handle increasing workloads, a single GPU or node becomes insufficient. LLM serving and the KVCache are now becoming multi-node and distributed, often spanning an entire datacenter. Managing the KVCache at this scale requires new techniques, such as:

    • Disaggregated Inference: LLM inference is divided into Prefill and Decode phases, deployed and scaled on separate model instances to reduce interference and optimize resource usage. A key challenge is efficiently transferring the KVCache from Prefill nodes to Decode nodes. Recently, new variants of disaggregation have emerged, such as Encoder Disaggregation.
    • KVCache-aware Load Balancing: Request routing prioritizes instances that already hold the relevant KVCache, maximizing prefix cache hits. This requires a cluster-wide view of the current state of the KVCache on each of the individual instances (sketched below).
    • Hierarchical KVCache: To increase KVCache capacity, cold pages can be spilled from GPU memory to more abundant CPU RAM or SSD. This extends the effective KVCache size while keeping the hot, frequently accessed pages in GPU memory for low-latency access. The higher latency of loading or storing one layer's KVCache from a lower tier can be hidden by overlapping the transfer with GPU execution of the previous layer.
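A toy version of the KVCache-aware routing decision (the scoring weights and the instance bookkeeping here are invented for illustration; production routers such as Dynamo or llm-d track this through their own metadata services):

```python
def route(prompt_block_hashes, instances):
    """Pick the instance balancing prefix-cache reuse against current load.

    prompt_block_hashes: ordered block hashes of the incoming prompt.
    instances: {name: {"cached": set of block hashes, "queued": pending requests}}
    """
    def score(inst):
        hits = 0
        for h in prompt_block_hashes:          # prefix reuse must be contiguous from the start
            if h not in inst["cached"]:
                break
            hits += 1
        return hits - 0.5 * inst["queued"]     # illustrative trade-off weighting

    return max(instances, key=lambda name: score(instances[name]))

instances = {
    "decode-pod-a": {"cached": {"h0", "h1", "h2"}, "queued": 4},
    "decode-pod-b": {"cached": {"h0"}, "queued": 1},
}
print(route(["h0", "h1", "h2", "h3"], instances))  # -> decode-pod-a: 3 cached blocks outweigh its queue
```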

    Many new Kubernetes-native inference solutions like NVIDIA Dynamo, vLLM Production Stack, llm-d, and AIBrix have emerged to tame this complexity. However, distributed LLM inference is still very hard:

    • ✘ Many existing optimizations and architectures, such as speculative decoding and VLMs, are still incompatible with distributed inference
    • ✘ Despite the wide availability of open-source solutions, deploying them still requires expert knowledge and a lot of patience
    • ✘ Inter-node GPU networking over InfiniBand or RoCE is challenging, and many libraries like NIXL are nascent
    • ✘ There are many inherent problems for large-scale distributed systems such as managing failover, stragglers, hardware defects, auto-scaling, etc

    Era 5: Unified Hybrid KVCaches (2025+)

    The next stage is building unified KV memory systems where many heterogeneous KV types share a common memory pool rather than isolated allocators. Another overarching theme in this era is striving for composability between all available optimizations.

    This evolution is happening today!

    Emerging approaches:

    1. vLLM / Jenga – Huge Pages + LCM Sizing
      1. Use huge pages with sizes chosen as the least common multiple of smaller page formats so different KV shapes can co-exist efficiently (see the sizing sketch at the end of this list).
      2. Unified Prefix Caching design that takes into consideration many KVCaches at once to improve balance and hit rate.


    2. SGLang – CUDA Virtual Memory
      1. SGLang uses CUDA Virtual Memory APIs to dynamically remap device memory and unify different KV regions
      2. This enables virtually contiguous but physically scattered KV pages
    https://pytorch.org/blog/hybrid-models-meet-sglang-more-than-full-attention/
    3. Feature Composability – Significant effort is also being invested in feature composability. In fact, this is one of the critical tenets of the 2025Q4 SGLang roadmap. For instance, one should be able to run a VLM with Speculative Decoding across multiple nodes in a disaggregated setup. This will require long-term software investment and re-architecting core components of the inference engine.
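To make the LCM page-sizing idea from the Jenga-style approach above concrete (the per-token byte counts below are invented for illustration), choosing a shared page size that every layer type's per-token KV footprint divides evenly lets all types draw from one pool without internal fragmentation:

```python
import math

# Hypothetical per-token KV/state bytes for each layer type in a hybrid model.
per_token_bytes = {
    "full_attention": 4096,
    "sliding_window": 1024,
    "mamba_state": 3072,
}

page_bytes = math.lcm(*per_token_bytes.values())       # shared huge-page size: 12288 bytes
for kind, b in per_token_bytes.items():
    print(f"{kind}: {page_bytes // b} tokens per shared page")
# Every layer type tiles a page exactly, so one allocator and one prefix cache
# can serve all of them instead of maintaining per-type memory pools.
```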

    Conclusion

    What began as a simple optimization—caching attention states to avoid recomputation—has evolved into one of the most complex subsystems in modern AI infrastructure. Each era has brought new challenges: memory fragmentation, heterogeneous model architectures, distributed coordination, and now the need for unified systems that compose cleanly across all these dimensions. As new models, optimizations, and hardware emerge, KVCache management will require innovation across all layers of the LLM inference stack from GPU kernels to cluster-scheduling.

    This complexity is precisely why we built MAX with a ground-up approach to KVCache management. Combined with Mojo's performance and flexibility, we're building infrastructure that handles today's models while adapting to tomorrow's innovations.

    Interested in how MAX handles KVCache for your workloads? Get started here or join our community to discuss with the team.

