Summary
KV Cache and Prompt Caching are two complementary inference acceleration techniques for LLMs. KV Cache avoids redundant K/V computation within a single request (intra-request optimization). Prompt Caching reuses KV Cache across requests sharing the same prefix (inter-request optimization). The prefix-exact-matching constraint of Prompt Caching has deep implications for Agent system architecture.
Key Points
KV Cache (Intra-Request)
- In autoregressive generation with causal masking, the K and V vectors of earlier tokens never change once computed; each step contributes only the new token’s Q, K, and V
- Core idea: cache K/V vectors for all previous tokens, compute only the new token’s Q/K/V each step
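- Per step and per layer, K/V projection work drops from all t tokens seen so far (O(t × d²)) to the new token alone (O(d²)); attention of the new query over the cached keys still costs O(t × d)
- Memory cost formula: 4blh(s + n) bytes (b=batch, l=layers, h=hidden, s=input len, n=output len), assuming FP16 and standard multi-head attention: 2 tensors (K and V) × 2 bytes per value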
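The caching idea above can be illustrated with a minimal sketch. It assumes a single layer and a single attention head, uses plain Python lists instead of tensors, and the names `KVCache` and `step_without_cache` are hypothetical. It only checks that a decode step using cached K/V gives the same output as re-projecting every token.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

class KVCache:
    """Hypothetical single-layer, single-head cache: one K/V row is added per step."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        # Project only the new token; earlier K/V rows are reused unchanged.
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)

def step_without_cache(xs, Wq, Wk, Wv):
    """Baseline: re-project K/V for every token seen so far."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

def rand_matrix(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

random.seed(0)
d = 4
Wq, Wk, Wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]

cache = KVCache()
for t, x in enumerate(tokens):
    cached = cache.step(x, Wq, Wk, Wv)
    baseline = step_without_cache(tokens[: t + 1], Wq, Wk, Wv)
    assert all(abs(a - b) < 1e-9 for a, b in zip(cached, baseline))
```

The baseline performs t projections at step t, while the cached path performs one; both still attend over t keys.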
Two Inference Phases
- Prefill (Compute Bound): all input tokens processed in parallel, high arithmetic intensity, determines TTFT
- Decode (Memory Bound): one token per step, small computation but reads entire KV Cache from GPU memory, determines TPOT
- Key insight: Prefill must compute Q for all positions because, in a multi-layer decoder, each layer’s K/V are computed from the previous layer’s complete output, and producing that output requires attention (and therefore Q) at every position
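A rough way to see why the two phases land on opposite sides of the roofline is to estimate arithmetic intensity for one weight projection. The sketch below is a back-of-the-envelope model, not a profiler: it counts only weight traffic (ignoring activations and KV Cache reads), and `ridge_flops_per_byte` is an assumed, hardware-dependent threshold chosen for illustration.

```python
def matmul_intensity(tokens: int, d_model: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weight traffic for a (tokens x d) @ (d x d) projection."""
    flops = 2 * tokens * d_model * d_model            # one multiply-add = 2 FLOPs
    weight_bytes = bytes_per_weight * d_model * d_model
    return flops / weight_bytes

def classify(intensity: float, ridge_flops_per_byte: float = 150.0) -> str:
    """Compare against an assumed ridge point (hypothetical value)."""
    return "compute-bound" if intensity > ridge_flops_per_byte else "memory-bound"

prefill = matmul_intensity(tokens=2048, d_model=4096)  # all prompt tokens at once
decode = matmul_intensity(tokens=1, d_model=4096)      # one token per step

print(prefill, classify(prefill))  # 2048.0 compute-bound
print(decode, classify(decode))    # 1.0 memory-bound
```

With FP16 weights the intensity simplifies to the number of tokens processed per weight read, which is why batching decode requests raises utilization.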
Prompt Caching (Inter-Request)
- If two API calls share an identical prompt prefix, the second reuses the first’s KV Cache and skips Prefill for that prefix entirely
- Prefix matching rule: must match exactly from the first token; any divergence invalidates everything after that point
- Use cases: multi-turn conversations, shared system prompts, Agent systems
- Production implementations: vLLM (--enable-prefix-caching, hash-based blocks), SGLang (RadixAttention with Radix Trees, enabled by default)
- Architectural implication: the prefix-exact-matching constraint fundamentally shapes how Agent systems structure their prompts (system prompt first, shared context next, variable parts last)
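The prefix matching rule can be sketched with chained block hashes. This is an illustration of the general idea only, not vLLM's or SGLang's actual code: `PrefixCache`, `block_hashes`, and `BLOCK_SIZE` are hypothetical names, and a real engine stores KV tensors per block rather than just hashes.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical number of tokens per cache block

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of everything before it."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest  # chaining makes a block valid only after an identical prefix
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = set()

    def insert(self, tokens):
        self.blocks.update(block_hashes(tokens))

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose blocks are already cached."""
        hits = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break  # first miss invalidates everything after it
            hits += 1
        return hits * BLOCK_SIZE

system_prompt = list(range(100, 108))      # 8 stable tokens
turn1 = system_prompt + [1, 2, 3, 4]
turn2 = turn1 + [5, 6, 7, 8]               # variable part appended last
edited = [999] + turn2[1:]                 # first token changed

cache = PrefixCache()
cache.insert(turn1)
print(cache.cached_prefix_len(turn2))   # 12
print(cache.cached_prefix_len(edited))  # 0
```

Appending new content at the end keeps all 12 earlier tokens cacheable, while a single change at position 0 (for example a timestamp at the top of a system prompt) loses every block.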
Trade-offs
KV cache is not free — key constraints:
- Memory usage grows with sequence length; long contexts consume significant GPU memory
- Batch complexity — managing KV cache across parallel sequences requires careful memory management
- Context limits — cache size is tied to max context window; longer contexts = larger cache
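To make the memory constraint concrete, the 4blh(s + n) formula from the KV Cache section can be evaluated directly. The model shape below (32 layers, hidden size 4096) is a hypothetical 7B-class configuration chosen for round numbers; the formula assumes FP16 and full multi-head attention, so models using grouped-query or multi-query attention would need less.

```python
def kv_cache_bytes(batch, layers, hidden, input_len, output_len, bytes_per_value=2):
    """2 tensors (K and V) x bytes_per_value x b x l x h x (s + n)."""
    return 2 * bytes_per_value * batch * layers * hidden * (input_len + output_len)

GIB = 2**30

# Hypothetical shape: 32 layers, hidden size 4096, FP16 values
one_request = kv_cache_bytes(batch=1, layers=32, hidden=4096,
                             input_len=3072, output_len=1024)
print(one_request / GIB)   # 2.0

# Cost scales linearly with batch size and with total sequence length
batched = kv_cache_bytes(batch=16, layers=32, hidden=4096,
                         input_len=3072, output_len=1024)
print(batched / GIB)       # 32.0
```

Under these assumptions each token costs 0.5 MiB of cache, so a 4096-token sequence needs 2 GiB and a batch of 16 such sequences needs 32 GiB.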
Optimizations
Modern systems employ several techniques to improve KV cache efficiency:
- Paged KV Cache — memory allocated in chunks/pages rather than contiguously (vLLM’s core innovation)
- Quantized KV Cache — reduce precision (FP16 → INT8) to shrink memory footprint
- Eviction strategies — drop less important tokens in long contexts (e.g., H2O, StreamingLLM)
- Flash Attention — fused attention kernel with better memory locality, reduces HBM reads
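The paged approach can be sketched as a block table that maps each sequence's logical blocks to physical block ids drawn from a shared free list. This is a toy model of the bookkeeping only, with hypothetical names (`PagedKVAllocator`, `append_token`, `release`); it does not store any tensors and is not vLLM's implementation.

```python
class PagedKVAllocator:
    """Toy block table: sequences map logical blocks to physical block ids."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical blocks not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")      # 3 tokens need 2 blocks
alloc.append_token("b")          # a second sequence takes 1 block
print(len(alloc.tables["a"]), len(alloc.free))  # 2 1
alloc.release("a")
print(len(alloc.free))                          # 3
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block instead of reserving space for the maximum context length up front.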
Real-World Use Cases
KV cache is critical for: chatbots (multi-turn conversations), code generation tools, autocomplete systems, and streaming API responses. Without it, real-time LLM deployment would be prohibitively expensive.
Open Questions
- How does Prompt Caching interact with techniques like DeepSeek MLA (Multi-head Latent Attention)?
- What are the optimal prompt structuring strategies for maximizing cache hit rates in Agent systems?
- How do different providers (OpenAI, Anthropic, open-source) differ in their caching implementations and constraints?
Evidence Timeline
- 2026-05-05: Jayanth Sanku’s introductory Twitter Notes article added the KV Cache trade-offs (memory, batch complexity, context limits) and optimization techniques (Paged KV Cache, quantization, eviction, Flash Attention)
- 2026-04-10: “Claude Code from Source”, Ch 9: fork agents achieve a 95% prompt cache hit rate via a byte-identical prefix trick. Slot reservation saves context in 99% of requests (Ch 17).
- 2026-04-07: Initial compilation from Chaofa Yuan’s article (published 2026-02-21, updated 2026-03-22)