KV Cache and Prompt Caching

Summary

KV Cache and Prompt Caching are two complementary inference acceleration techniques for LLMs. KV Cache avoids redundant K/V computation within a single request (intra-request optimization). Prompt Caching reuses KV Cache across requests sharing the same prefix (inter-request optimization). The prefix-exact-matching constraint of Prompt Caching has deep implications for Agent system architecture.

Key Points

KV Cache (Intra-Request)

In autoregressive generation with causal attention, the K and V values of past tokens never change; each step only needs the new token’s Q, K, and V
Core idea: cache K/V vectors for all previous tokens, compute only the new token’s Q/K/V each step
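Reduces per-step K/V projection work from all t tokens to the single new token (O(t) → O(1) tokens projected per step); the new query still attends over all t cached keys, which costs O(t × d) per layer
Memory cost formula: 4blh(s + n) bytes, assuming FP16 and standard multi-head attention: 2 tensors (K and V) × 2 bytes × b × l × h × (s + n), where b = batch size, l = layers, h = hidden size, s = input length, n = output length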
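
The core idea above can be checked with a small sketch. It assumes a single attention head in a single layer with random weights, and the names `KVCache`, `step_without_cache`, and `demo` are hypothetical helpers for illustration, not part of any library. It compares a decode step that recomputes K/V for every token against one that only projects the new token and appends to a cache.

```python
import math
import random

def matvec(x, W):
    """Row vector x (length d) times matrix W (d x d) -> list of length d."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def attend(q, K, V):
    """Softmax attention of one query over the rows of K and V."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(len(V[0]))]

def step_without_cache(X, Wq, Wk, Wv):
    """Recompute K and V for every token seen so far, then attend from the last one."""
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    return attend(matvec(X[-1], Wq), K, V)

class KVCache:
    """Keeps K/V rows of past tokens so each step projects only the new token."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, x_new, Wq, Wk, Wv):
        self.K.append(matvec(x_new, Wk))  # one new K row
        self.V.append(matvec(x_new, Wv))  # one new V row
        return attend(matvec(x_new, Wq), self.K, self.V)

def demo(d=4, t=6, seed=0):
    """Return the largest difference between cached and uncached outputs over t steps."""
    rng = random.Random(seed)
    rand_matrix = lambda r, c: [[rng.gauss(0, 1) for _ in range(c)] for _ in range(r)]
    Wq, Wk, Wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
    X = rand_matrix(t, d)
    cache, max_diff = KVCache(), 0.0
    for i in range(t):
        cached = cache.step(X[i], Wq, Wk, Wv)
        full = step_without_cache(X[: i + 1], Wq, Wk, Wv)
        max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, full)))
    return max_diff
```

Calling `demo()` should return a value at or near zero, since both paths compute the same attention output; only the amount of repeated projection work differs.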

Two Inference Phases

Prefill (Compute Bound): all input tokens processed in parallel, high arithmetic intensity, determines time to first token (TTFT)
Decode (Memory Bound): one token per step, small computation but reads the entire KV Cache from GPU memory, determines time per output token (TPOT)
Key insight: Prefill must compute Q for all positions because, in a multi-layer decoder, each layer’s K/V are computed from the previous layer’s complete output, and producing that output requires attention (and therefore Q) at every position
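
A rough back-of-the-envelope sketch of why the two phases behave differently. It looks at one d × d projection only, counts only weight traffic (activations and KV Cache reads are ignored), and assumes 2-byte weights; `linear_layer_intensity` and the sizes used are hypothetical choices for illustration.

```python
def linear_layer_intensity(num_tokens, d_model, bytes_per_param=2):
    """Approximate FLOPs per byte of weight traffic for one d_model x d_model projection."""
    # One multiply and one add per weight, per token.
    flops = 2 * num_tokens * d_model * d_model
    # The weight matrix is read once per forward step, however many tokens share it.
    weight_bytes = bytes_per_param * d_model * d_model
    return flops / weight_bytes

# Prefill: all prompt tokens share one pass over the weights.
prefill = linear_layer_intensity(num_tokens=2048, d_model=4096)
# Decode: a single new token per pass over the same weights.
decode = linear_layer_intensity(num_tokens=1, d_model=4096)
```

Under these assumptions the intensity scales with the number of tokens processed per step, so Prefill keeps the compute units busy while Decode mostly waits on memory. Reading the KV Cache during Decode adds further memory traffic that this sketch leaves out.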

Prompt Caching (Inter-Request)

If two API calls share an identical prompt prefix, the second reuses the first’s KV Cache and skips Prefill for that prefix entirely
Prefix matching rule: the match must be exact starting from the first token; at the first divergence, everything from that position onward must be recomputed
Use cases: multi-turn conversations, shared system prompts, Agent systems
Production implementations: vLLM (--enable-prefix-caching, hash-based blocks), SGLang (RadixAttention with Radix Trees, enabled by default)
Architectural implication: the prefix-exact-matching constraint fundamentally shapes how Agent systems structure their prompts (system prompt first, shared context next, variable parts last)
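
The prefix matching rule can be sketched with chained block hashes. This is a simplified illustration in the spirit of hash-based block caching, not vLLM’s or SGLang’s actual code; `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are hypothetical names, and token IDs are plain integers.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (hypothetical value)

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to that block."""
    hashes, parent = [], b""
    full_len = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes

class PrefixCache:
    """Tracks which prefix blocks have been seen; stands in for stored K/V blocks."""
    def __init__(self):
        self.seen = set()

    def lookup_and_insert(self, tokens):
        """Return how many prefix tokens hit the cache, then register this request's blocks."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self.seen:
                break  # first miss ends the reusable prefix
            hit_blocks += 1
        self.seen.update(hashes)
        return hit_blocks * BLOCK_SIZE

system_prompt = list(range(100, 108))  # 8 shared tokens
cache = PrefixCache()
first = cache.lookup_and_insert(system_prompt + [1, 2, 3, 4])         # cold cache
second = cache.lookup_and_insert(system_prompt + [5, 6, 7, 8])        # shared prefix reused
shifted = cache.lookup_and_insert([9] + system_prompt + [1, 2, 3, 4]) # one token prepended
```

Here `second` reuses the 8 system-prompt tokens, while `shifted` reuses nothing even though it contains the same tokens, because the first block already differs. This is why variable content placed before the shared prefix defeats the cache.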

Trade-offs

KV cache is not free. Key constraints:

Memory usage grows with sequence length; long contexts consume significant GPU memory
Batch complexity — managing KV cache across parallel sequences requires careful memory management
Context limits — cache size is tied to max context window; longer contexts = larger cache
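
To make the memory growth concrete, here is a small worked calculation using the 4blh(s + n) formula from the KV Cache section. It assumes FP16 (2 bytes per value) and standard multi-head attention where K and V are each as wide as the hidden size; models using grouped-query or multi-query attention store less. The model dimensions are hypothetical.

```python
def kv_cache_bytes(batch, layers, hidden, input_len, output_len, bytes_per_value=2):
    """KV cache size: 2 tensors (K and V) x bytes_per_value x b x l x h x (s + n)."""
    return 2 * bytes_per_value * batch * layers * hidden * (input_len + output_len)

# Hypothetical model: 32 layers, hidden size 4096, one sequence of 4096 total tokens.
size = kv_cache_bytes(batch=1, layers=32, hidden=4096, input_len=3072, output_len=1024)
gib = size / 2**30  # 2.0 GiB for this configuration
```

Doubling either the batch size or the total sequence length doubles the result, which is why long contexts and large batches compete for the same GPU memory.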

Optimizations

Modern systems employ several techniques to improve KV cache efficiency:

Paged KV Cache — memory allocated in chunks/pages rather than contiguously (vLLM’s core innovation)
Quantized KV Cache — reduce precision (FP16 → INT8) to shrink memory footprint
Eviction strategies — drop less important tokens in long contexts (e.g., H2O, StreamingLLM)
Flash Attention — fused attention kernel with better memory locality, reduces HBM reads
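
The paged allocation idea can be sketched as a per-sequence block table that maps logical token positions to fixed-size physical pages. This is a minimal illustration under assumed names (`PagedKVAllocator`, `append_token`, `release`), not vLLM’s implementation; it tracks page bookkeeping only and stores no actual K/V tensors.

```python
class PagedKVAllocator:
    """Hands out fixed-size KV pages on demand instead of one contiguous buffer per sequence."""
    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical page, offset in page)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.page_size:  # last page is full, or no page yet
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_pages=4, page_size=2)
for _ in range(3):
    alloc.append_token("a")  # 3 tokens need 2 pages
alloc.append_token("b")      # 1 token needs 1 page
free_while_running = len(alloc.free_pages)
alloc.release("a")
free_after_release = len(alloc.free_pages)
```

Because pages are allocated only when a sequence actually grows into them, at most one partially filled page per sequence is wasted, and freed pages can be reused by any other sequence.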

Real-World Use Cases

KV cache is critical for: chatbots (multi-turn conversations), code generation tools, autocomplete systems, and streaming API responses. Without it, real-time LLM deployment would be prohibitively expensive.

Open Questions

How does Prompt Caching interact with techniques like DeepSeek MLA (Multi-head Latent Attention)?
What are the optimal prompt structuring strategies for maximizing cache hit rates in Agent systems?
How do different providers (OpenAI, Anthropic, open-source) differ in their caching implementations and constraints?

Evidence Timeline

2026-05-05: Jayanth Sanku’s introductory Twitter Notes article added the KV Cache trade-offs (memory, batch complexity, context limits) and optimization techniques (Paged KV Cache, quantization, eviction, Flash Attention)
2026-04-10: “Claude Code from Source” Ch 9 — Fork agents achieve 95% prompt cache hit rate via byte-identical prefix trick. Slot reservation saves context in 99% of requests (Ch 17).
2026-04-07: Initial compilation from Chaofa Yuan’s article (published 2026-02-21, updated 2026-03-22)
