Summary

KV Cache and Prompt Caching are two complementary inference acceleration techniques for LLMs. KV Cache avoids redundant K/V computation within a single request (intra-request optimization). Prompt Caching reuses KV Cache across requests sharing the same prefix (inter-request optimization). The prefix-exact-matching constraint of Prompt Caching has deep implications for Agent system architecture.

Key Points

KV Cache (Intra-Request)

  • In autoregressive generation, historical tokens’ K and V values never change — only the new token’s Q/K/V matters
  • Core idea: cache K/V vectors for all previous tokens, compute only the new token’s Q/K/V each step
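  • Reduces per-step K/V computation from O(t × d) to O(d), since only one token is projected instead of t; attention for the new token still reads all t cached K/V entries
  • Memory cost formula: 4blh(s + n) bytes, i.e. 2 tensors (K and V) × 2 bytes per FP16 value × b × l × h × (s + n), where b=batch size, l=layers, h=hidden size, s=input length, n=output length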
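
The core idea above can be sketched in a few lines. This is a minimal single-head, single-layer illustration in pure Python, not any framework's implementation; the helper names (matvec, attend, decode_without_cache, decode_with_cache) and the toy weight matrices are assumptions made for the example. Both decode loops produce the same outputs, but the cached version projects K/V for only the newest token at each step.

```python
import math

def matvec(W, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def decode_without_cache(tokens, Wq, Wk, Wv):
    """Recompute K and V for every visible token at every step."""
    outputs = []
    for t in range(1, len(tokens) + 1):
        keys = [matvec(Wk, x) for x in tokens[:t]]    # t projections
        values = [matvec(Wv, x) for x in tokens[:t]]  # t projections
        q = matvec(Wq, tokens[t - 1])
        outputs.append(attend(q, keys, values))
    return outputs

def decode_with_cache(tokens, Wq, Wk, Wv):
    """Project only the newest token and append its K/V to the cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(matvec(Wk, x))  # one projection per step
        v_cache.append(matvec(Wv, x))
        q = matvec(Wq, x)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs
```

Note that attend still loops over every cached entry, which is why the decode phase remains dominated by reading the cache.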

Two Inference Phases

  • Prefill (Compute Bound): all input tokens processed in parallel, high arithmetic intensity, determines TTFT
  • Decode (Memory Bound): one token per step, small computation but reads entire KV Cache from GPU memory, determines TPOT
  • Key insight: Prefill must compute Q for all positions, because in a multi-layer decoder each layer's K/V is computed from the previous layer's complete output
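
The two latency metrics named above can be computed from token arrival timestamps. The sketch below assumes TTFT means time to first token and TPOT means average time per output token after the first; exact definitions vary between benchmarks, and the function name latency_metrics is hypothetical.

```python
def latency_metrics(request_time, token_times):
    """Return (ttft, tpot) in the same time unit as the inputs.

    ttft: delay from sending the request to receiving the first token
          (dominated by Prefill).
    tpot: mean gap between consecutive output tokens (dominated by Decode).
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_time
    n = len(token_times)
    # With a single token there are no inter-token gaps to average.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot
```

For example, a request sent at t=0.0 whose tokens arrive at 0.5, 0.75, 1.0 and 1.25 seconds has a TTFT of 0.5 s and a TPOT of 0.25 s.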

Prompt Caching (Inter-Request)

  • If two API calls share an identical prompt prefix, the second reuses the first's KV Cache and skips Prefill for the shared prefix entirely
  • Prefix matching rule: must match exactly from the first token; any divergence invalidates everything after that point
  • Use cases: multi-turn conversations, shared system prompts, Agent systems
  • Production implementations: vLLM (--enable-prefix-caching, hash-based blocks), SGLang (RadixAttention with Radix Trees, enabled by default)
  • Architectural implication: the prefix-exact-matching constraint fundamentally shapes how Agent systems structure their prompts (system prompt first, shared context next, variable parts last)
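
The prefix matching rule can be illustrated with a chained block hash, where each block's hash also covers the hash of the block before it. This is a simplified sketch of the hash-based idea, not vLLM's or SGLang's actual code; BLOCK_SIZE, block_hashes and PrefixCache are hypothetical names, and the sketch only tracks which blocks were seen rather than storing real K/V tensors.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; illustrative value only

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of its predecessor."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

class PrefixCache:
    """Tracks which prefix blocks have already been prefilled."""

    def __init__(self):
        self.seen = set()

    def lookup_and_insert(self, token_ids):
        """Return the number of leading tokens served from cache."""
        hashes = block_hashes(token_ids)
        hit_blocks = 0
        for h in hashes:
            if h not in self.seen:
                break  # first miss: everything after it must be recomputed
            hit_blocks += 1
        self.seen.update(hashes)
        return hit_blocks * BLOCK_SIZE

system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # stand-in token ids
cache = PrefixCache()
first = cache.lookup_and_insert(system_prompt + [9, 10, 11, 12])
second = cache.lookup_and_insert(system_prompt + [20, 21, 22, 23])
# Changing the very first token breaks the chain for every later block.
third = cache.lookup_and_insert([99] + system_prompt[1:] + [9, 10, 11, 12])
```

Here the first call hits nothing, the second reuses the 8 shared system-prompt tokens, and the third reuses nothing even though most of its tokens are unchanged. This is why stable content belongs at the front of the prompt and variable content at the end.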

Trade-offs

KV cache is not free — key constraints:

  • Memory usage grows with sequence length; long contexts consume significant GPU memory
  • Batch complexity — managing KV cache across parallel sequences requires careful memory management
  • Context limits — cache size is tied to max context window; longer contexts = larger cache
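
To make the memory growth concrete, the 4blh(s + n) formula given earlier can be evaluated directly. The sketch below assumes standard multi-head attention with FP16 values (2 bytes each); the model shape in the usage note is a hypothetical configuration, and architectures that share K/V across heads would need a smaller figure.

```python
def kv_cache_bytes(batch, layers, hidden, input_len, output_len,
                   bytes_per_value=2):
    """KV cache size in bytes.

    2 tensors (K and V) * bytes_per_value * b * l * h * (s + n);
    with bytes_per_value=2 (FP16) this is 4blh(s + n).
    """
    return (2 * bytes_per_value * batch * layers * hidden
            * (input_len + output_len))

# Hypothetical shape: 32 layers, hidden size 4096, 3072 input + 1024 output tokens.
size = kv_cache_bytes(batch=1, layers=32, hidden=4096,
                      input_len=3072, output_len=1024)
gib = size / 2**30
```

With these assumed numbers a single sequence needs 2 GiB of cache, and the total scales linearly with both batch size and sequence length.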

Optimizations

Modern systems employ several techniques to improve KV cache efficiency:

  • Paged KV Cache — memory allocated in chunks/pages rather than contiguously (vLLM’s core innovation)
  • Quantized KV Cache — reduce precision (FP16 → INT8) to shrink memory footprint
  • Eviction strategies — drop less important tokens in long contexts (e.g., H2O, StreamingLLM)
  • Flash Attention — fused attention kernel with better memory locality, reduces HBM reads
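
The paged approach can be sketched as a small allocator that hands out fixed-size pages on demand and keeps a per-sequence page table. This is a toy illustration of the idea under assumed names (PagedKVAllocator, append_token, release), not vLLM's implementation; it tracks slot bookkeeping only and stores no tensors.

```python
class PagedKVAllocator:
    """Maps each sequence's logical token positions to physical pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_page, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.page_tables.setdefault(seq_id, [])
        if length % self.page_size == 0:  # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because pages are allocated only when a sequence actually grows into them, a short sequence does not reserve memory for the full context window, and pages from finished sequences can be reused immediately by others.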

Real-World Use Cases

KV cache is critical for: chatbots (multi-turn conversations), code generation tools, autocomplete systems, and streaming API responses. Without it, real-time LLM deployment would be prohibitively expensive.

Open Questions

  • How does Prompt Caching interact with techniques like DeepSeek MLA (Multi-head Latent Attention)?
  • What are the optimal prompt structuring strategies for maximizing cache hit rates in Agent systems?
  • How do different providers (OpenAI, Anthropic, open-source) differ in their caching implementations and constraints?

Evidence Timeline

  • 2026-05-05: Jayanth Sanku's introductory Twitter Notes article added the KV Cache trade-offs (memory, batch complexity, context limits) and optimization techniques (Paged KV Cache, quantization, eviction, Flash Attention)

  • 2026-04-10: “Claude Code from Source” Ch 9 — Fork agents achieve 95% prompt cache hit rate via byte-identical prefix trick. Slot reservation saves context in 99% of requests (Ch 17).

  • 2026-04-07: Initial compilation from Chaofa Yuan’s article (published 2026-02-21, updated 2026-03-22)

Related Pages

chaofa-yuan