February 5, 2026

The Five Eras of KVCache

Brian Zhang

Engineering

Introduction

Key–Value Cache (KVCache) is a foundational building block of modern LLM serving systems. It stores past attention states so the model can generate new tokens efficiently without excessive re-computation.

There are two phases to LLM inference: Prefill and Decode. In the Prefill phase, the attention states are computed for each token in the input prompt. In the subsequent Decode phase, new tokens are generated one by one in an autoregressive fashion by attending to the keys and values associated with the previous tokens.

Source: https://www.nature.com/articles/s41586-023-06647-8
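To make the two phases concrete, here is a minimal single-head, single-layer sketch in NumPy (random values stand in for the real projections, and the dimensions are made up): Prefill computes and caches K/V for every prompt token once, and each Decode step appends a single row and attends over the cache instead of recomputing the entire history.

```python
import numpy as np

HEAD_DIM = 64
rng = np.random.default_rng(0)

def attend(q, k_cache, v_cache):
    """One query token attending over all cached keys/values (single head)."""
    scores = k_cache @ q / np.sqrt(HEAD_DIM)      # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over cached tokens
    return weights @ v_cache                      # (HEAD_DIM,)

# Prefill: compute and cache K/V for all 8 prompt tokens in one pass.
k_cache = rng.standard_normal((8, HEAD_DIM))      # stand-in for W_k projections
v_cache = rng.standard_normal((8, HEAD_DIM))

# Decode: each generated token adds exactly one K/V row to the cache.
for _ in range(4):
    q_new = rng.standard_normal(HEAD_DIM)         # query for the newest token
    _ = attend(q_new, k_cache, v_cache)
    k_cache = np.vstack([k_cache, rng.standard_normal((1, HEAD_DIM))])
    v_cache = np.vstack([v_cache, rng.standard_normal((1, HEAD_DIM))])
```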

vLLM, SGLang, TensorRT-LLM, and MAX Serve are all built on top of increasingly sophisticated KVCache management. This blog explores the evolution and role of the KVCache in these inference engines.

Era 0: Pre-GenAI (<2017)

Before transformers took over, deep learning was dominated by stateless, feed-forward architectures like ResNet, YOLO, VGG, and Inception. These models did not require persistent state across inference steps, so the concept of a KVCache simply didn’t exist even in inference frameworks like ONNX or TensorRT.

Era 1: Continuous KVCache (2017)

The original transformer (2017) established the architecture that would eventually dominate ML. This design was a departure from prior models, requiring a KVCache to efficiently keep track of the state associated with each request. Nevertheless, the major step-change in intelligence enabled by transformers more than justified their added complexity.

At the time, early LLM serving engines implemented KV caches naively:

  • For each request, they preallocated a contiguous KV tensor with max_seq_len tokens.
  • The storage was 2 × num_layers × num_heads × head_dim × max_seq_len per request.
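To see how aggressively that formula scales, here is a back-of-the-envelope calculation for a hypothetical Llama-7B-class configuration (these numbers are illustrative, not taken from the post):

```python
num_layers, num_kv_heads, head_dim = 32, 32, 128    # hypothetical 7B-class model
max_seq_len = 4096
bytes_per_elem = 2                                  # FP16 keys and values

# 2x accounts for storing both keys and values.
per_request_bytes = 2 * num_layers * num_kv_heads * head_dim * max_seq_len * bytes_per_elem
print(per_request_bytes / 2**30, "GiB per request")  # -> 2.0 GiB, reserved up front

# A 200-token chat request still reserves all 4096 slots, so roughly 95%
# of those 2 GiB sit idle under the contiguous design.
```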

This Contiguous KVCache design was extremely wasteful, but still offered huge performance gains over recomputing attention keys/values for each token:

  • ✔ Simple
  • ✘ Memory usage scales aggressively due to the max_seq_len × batch_size factor
  • ✘ Constrained max_batch_size due to limited memory capacity
  • ✘ High memory fragmentation due to variable-length requests
  • ✘ Most requests are far shorter than max_seq_len, leaving much wasted capacity

This was the approach of early inference engines like HuggingFace Transformers.

Era 2: PagedAttention (2023)

A breakthrough arrived with PagedAttention, introduced by vLLM. The key idea was to borrow paging from operating systems: instead of one contiguous reservation per request, KV is stored in fixed-size pages that are allocated dynamically as each sequence grows.
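A minimal sketch of the bookkeeping behind this idea, with a made-up block size and pool size (real engines add reference counting, copy-on-write, eviction, and GPU-resident block tables): each sequence owns a block table mapping its logical blocks to physical pages handed out on demand.

```python
BLOCK_SIZE = 16  # tokens per KV page (illustrative)

class PagedKVAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # pool of physical pages
        self.block_tables = {}                        # seq_id -> list of physical page ids

    def slot_for_token(self, seq_id, token_index):
        """Return (physical page, slot) for a token, allocating a page at block boundaries."""
        table = self.block_tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:             # previous pages are full
            table.append(self.free.pop())
        return table[-1], token_index % BLOCK_SIZE

    def release(self, seq_id):
        """Return a finished sequence's pages to the shared pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_physical_blocks=1024)
for t in range(40):                                   # a 40-token sequence only touches 3 pages
    page, slot = alloc.slot_for_token("req-0", t)
alloc.release("req-0")
```

Because pages are fixed-size and pooled, a short request holds only the pages it actually filled, which is what eliminates the max_seq_len-sized reservations of Era 1.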

Benefits:

  • ✔ Dramatically improves memory utilization and reduces fragmentation
  • ✔ Enables hundreds / thousands of concurrent requests
  • ✔ Drives up throughput via larger batch sizes
  • ✔ Allows for efficient KVCache reuse via Prefix Caching, a huge throughput multiplier for multi-turn chat workloads
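The prefix-caching piece typically works by keying each full KV block on the tokens it holds, chained with its parent block's key, so a hit on block N implies the whole prefix up to N matches. A simplified sketch of that keying scheme (not vLLM's exact implementation):

```python
import hashlib

BLOCK_SIZE = 16
cache = {}  # block hash -> physical page id (shared across requests)

def block_hashes(token_ids):
    """Hash each *full* block, chaining in the parent hash so equal hashes
    imply the entire prefix up to and including that block is identical."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        parent = hashlib.sha256(parent + repr(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(parent)
    return hashes

def matched_prefix_blocks(token_ids):
    """How many leading KV blocks of a new request can be reused as-is."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hits += 1
    return hits
```

In a multi-turn chat, every new turn re-sends the growing conversation, so most of its leading blocks hit the cache and only the new suffix needs a real Prefill.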

PagedAttention became the de-facto standard for LLM serving, leading to new inference engines like TensorRT-LLM and SGLang.

Era 3: Heterogeneous KVCaches (2024)

The ML world and the LLM serving landscape are far more complex now. New optimizations, along with modern multimodal and hybrid models, require multiple different kinds of state, each with separate caching requirements. In this era, the term “KVCache” is being stretched far beyond its original meaning.

  1. Speculative decoding accelerates LLM inference by having a small draft model generate multiple tokens ahead and then using a larger target model to verify and accept those tokens in a single pass. With this technique, a separate KVCache needs to be maintained for the draft and target model.
  2. Vision encoders in Vision–Language Models (VLMs) generate large image embeddings that can be cached and reused across requests. While this differs from the traditional notion of a “KVCache” or prefix caching, it follows the same underlying principle of memoizing expensive intermediate states. Models which benefit from this cache include QwenVL and InternVL.
  3. Quantized KVCache: Low-precision datatypes like FP8 help reduce the storage requirements of the KVCache and rely on per-tensor/row/block scaling factors to preserve numerical range. This requires the KVCache implementation to also manage memory for these scaling factors (see the first sketch after this list).
  4. Sliding Window Attention (SWA) limits each token to attend only to the preceding window_size tokens instead of the entire sequence, reducing memory and compute. As a result, KVCache management and prefix caching must track which tokens fall within the current window, making cache hits and evictions more complex than in full attention (see the second sketch after this list).
    Fig 11. https://arxiv.org/pdf/2503.18292
  5. Mamba / State Space Models replace attention with a recurrent state that updates a single large vector for each new token. This makes prefix caching more complex because serving systems must decide when and how to checkpoint or store the evolving state vector for future reuse.
  6. Composite Models are composed of multiple sub-models. For example, it is a common pattern to combine an LLM backbone with an audio decoder. Each of these sub-models may require maintaining separate KV caches.
  7. Hybrid Models combine multiple layer types within a single model, which often necessitates maintaining multiple KV caches to handle each layer’s distinct attention or state mechanism. Examples include:
    1. Sliding Window Attention + Full Attention (Gemma2/3, Ministral, GPT-OSS, Cohere)
    2. Mamba + Full Attention (Jamba, Bamba, Minimax)
    3. Local Chunked + Full Attention (Llama4)
Fig 1. https://arxiv.org/pdf/2503.18292
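For the quantized KVCache in item 3, the essential change is that each block of keys/values carries its own scaling factor, which the cache must also store and keep paired with the data. The sketch below uses int8 as a stand-in, since NumPy has no FP8 dtype; the per-block scale bookkeeping is the point, not the exact number format.

```python
import numpy as np

BLOCK_SIZE, HEAD_DIM = 16, 64

def quantize_kv_block(kv_block):
    """Per-block scaling: one scale per KV block, stored alongside the quantized data."""
    scale = np.abs(kv_block).max() / 127.0 + 1e-12
    q = np.round(kv_block / scale).astype(np.int8)
    return q, np.float32(scale)

def dequantize_kv_block(q, scale):
    return q.astype(np.float32) * scale

kv_block = np.random.randn(BLOCK_SIZE, HEAD_DIM).astype(np.float32)
q, scale = quantize_kv_block(kv_block)
# The cache must budget memory for both buffers: BLOCK_SIZE * HEAD_DIM bytes for q
# (4x smaller than FP32) plus the scale itself, and keep them paired on eviction/reuse.
```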
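And for the sliding window attention in item 4, only the last window_size tokens ever need to stay resident, so the per-sequence cache can be a fixed-size ring buffer; the subtlety for prefix caching is knowing which token positions are still inside the window. A minimal single-layer sketch (the window size is illustrative):

```python
import numpy as np

WINDOW, HEAD_DIM = 128, 64

class SlidingWindowKV:
    """Ring buffer holding K/V for only the last WINDOW tokens of one sequence."""
    def __init__(self):
        self.k = np.zeros((WINDOW, HEAD_DIM))
        self.v = np.zeros((WINDOW, HEAD_DIM))
        self.pos = 0                                # total tokens seen so far

    def append(self, k_t, v_t):
        slot = self.pos % WINDOW                    # overwrite the oldest entry
        self.k[slot], self.v[slot] = k_t, v_t
        self.pos += 1

    def visible(self):
        """K/V rows still attendable (slot order is fine once positions are encoded into K, e.g. RoPE)."""
        n = min(self.pos, WINDOW)
        return self.k[:n], self.v[:n]

    def contains(self, token_index):
        """A prefix-cache hit is only valid if that token is still inside the window."""
        return self.pos - WINDOW <= token_index < self.pos
```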
  • This list is non-exhaustive. There are a ton of other ideas like Transfusion for joint text–image generation, dynamic KVCache compression, Cross-Attention (not to be confused with Cross-Layer Attention), etc.
  • This heterogeneity and diversity of KVCaches with different shapes, lifetimes, and properties led to the creation of specialized managers in modern LLM serving engines. For example, vLLM has the Vision Encoding Cache, Mamba Cache, etc., in addition to its normal KVCache.

    There are several challenges emerging with this design:

    • ✘ Memory fragmentation due to multiple KVCache managers can lead to small batch sizes
    • ✘ Challenging to predict at server startup how much memory to allocate per KVCache
    • ✘ Disjoint Prefix Caching implementations lead to suboptimal cache hit rates
    • ✘ Diversity makes feature composition challenging

    Era 4: Distributed KVCache (2025+)

    As LLMs grow in size and handle increasing workloads, a single GPU or node becomes insufficient. LLM serving and the KVCache are now becoming multi-node and distributed, often spanning an entire datacenter. Managing the KVCache at this scale requires new techniques, such as:

    • Disaggregated Inference: LLM inference is divided into Prefill and Decode phases, deployed and scaled on separate model instances to reduce interference and optimize resource usage. A key challenge is efficiently transferring the KVCache from Prefill nodes to Decode nodes. Recently, new variants of disaggregation have emerged, such as Encoder Disaggregation.
    • KVCache-aware Load Balancing: Request routing prioritizes instances that already hold the relevant KVCache, maximizing prefix cache hits. This requires a cluster-wide view of the current state of the KVCache on each of the individual instances (sketched below).
    • Hierarchical KVCache: To increase KVCache capacity, cold pages can be spilled from GPU memory to more abundant CPU RAM or SSD. This extends the effective KVCache size while keeping the hot, frequently accessed pages in GPU memory for low-latency access. The higher latency of loading or storing one layer's KVCache from a lower tier can be hidden by overlapping the transfer with GPU execution of the previous layer.
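A toy version of the KVCache-aware routing decision (the scoring weights and the instance bookkeeping here are invented for illustration; production routers such as Dynamo or llm-d track this through their own metadata services):

```python
def route(prompt_block_hashes, instances):
    """Pick the instance balancing prefix-cache reuse against current load.

    prompt_block_hashes: ordered block hashes of the incoming prompt.
    instances: {name: {"cached": set of block hashes, "queued": pending requests}}
    """
    def score(inst):
        hits = 0
        for h in prompt_block_hashes:          # prefix reuse must be contiguous from the start
            if h not in inst["cached"]:
                break
            hits += 1
        return hits - 0.5 * inst["queued"]     # illustrative trade-off weighting

    return max(instances, key=lambda name: score(instances[name]))

instances = {
    "decode-pod-a": {"cached": {"h0", "h1", "h2"}, "queued": 4},
    "decode-pod-b": {"cached": {"h0"}, "queued": 1},
}
print(route(["h0", "h1", "h2", "h3"], instances))  # -> decode-pod-a: 3 cached blocks outweigh its queue
```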

    Many new Kubernetes-native inference solutions like NVIDIA Dynamo, vLLM Production Stack, llm-d, and AIBrix have emerged to tame this complexity. However, distributed LLM inference is still very hard:

    • ✘ Many existing optimizations and architectures, such as speculative decoding and VLMs, are still incompatible with distributed inference
    • ✘ Despite the wide availability of open-source solutions, deploying them still requires expert knowledge and a lot of patience
    • ✘ Inter-node GPU networking over InfiniBand or RoCE is challenging, and many libraries like NIXL are nascent
    • ✘ There are many inherent problems for large-scale distributed systems such as managing failover, stragglers, hardware defects, auto-scaling, etc

    Era 5: Unified Hybrid KVCaches (2025+)

    The next stage is building unified KV memory systems where many heterogeneous KV types share a common memory pool rather than isolated allocators. Another overarching theme in this era is striving for composability between all available optimizations.

    This evolution is happening today!

    Emerging approaches:

    1. vLLM / Jenga – Huge Pages + LCM Sizing
      1. Use huge pages with sizes chosen as the least common multiple of smaller page formats so different KV shapes can co-exist efficiently (see the sizing sketch at the end of this list).
      2. Unified Prefix Caching design that takes into consideration many KVCaches at once to improve balance and hit rate.


    2. SGLang – CUDA Virtual Memory
      1. SGLang uses CUDA Virtual Memory APIs to dynamically remap device memory and unify different KV regions
      2. This enables virtually contiguous but physically scattered KV pages
    https://pytorch.org/blog/hybrid-models-meet-sglang-more-than-full-attention/
    3. Feature Composability – Significant effort is also being invested in feature composability. In fact, this is one of the critical tenets of the 2025Q4 SGLang roadmap. For instance, one should be able to run a VLM with Speculative Decoding across multiple nodes in a disaggregated setup. This will require long-term software investment and re-architecting core components of the inference engine.
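To make the LCM page-sizing idea from the Jenga-style approach above concrete (the per-token byte counts below are invented for illustration), choosing a shared page size that every layer type's per-token KV footprint divides evenly lets all types draw from one pool without internal fragmentation:

```python
import math

# Hypothetical per-token KV/state bytes for each layer type in a hybrid model.
per_token_bytes = {
    "full_attention": 4096,
    "sliding_window": 1024,
    "mamba_state": 3072,
}

page_bytes = math.lcm(*per_token_bytes.values())       # shared huge-page size: 12288 bytes
for kind, b in per_token_bytes.items():
    print(f"{kind}: {page_bytes // b} tokens per shared page")
# Every layer type tiles a page exactly, so one allocator and one prefix cache
# can serve all of them instead of maintaining per-type memory pools.
```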

    Conclusion

    What began as a simple optimization—caching attention states to avoid recomputation—has evolved into one of the most complex subsystems in modern AI infrastructure. Each era has brought new challenges: memory fragmentation, heterogeneous model architectures, distributed coordination, and now the need for unified systems that compose cleanly across all these dimensions. As new models, optimizations, and hardware emerge, KVCache management will require innovation across all layers of the LLM inference stack from GPU kernels to cluster-scheduling.

    This complexity is precisely why we built MAX with a ground-up approach to KVCache management. Combined with Mojo's performance and flexibility, we're building infrastructure that handles today's models while adapting to tomorrow's innovations.

    Interested in how MAX handles KVCache for your workloads? Get started here or join our community to discuss with the team.

