[Image: dark metal lockers with red-glowing rows, representing cached key-value state reused during generation. Caption: Each locker holds work already done, ready to reuse instead of redoing it, the same idea behind the KV cache.]

The KV cache (key-value cache) is memory a transformer keeps on the GPU while it generates text. When a model reads a prompt and produces one token at a time, the attention mechanism computes a key and a value vector for every token. The KV cache stores those keys and values so the model does not recompute them for the next token. This makes generation much faster, but the cache grows with every token, and that growth is one of the main limits on serving cost and context length.

A plain analogy

Imagine reading a long book and writing a new sentence that has to fit the whole story so far. Without notes, you would reread every earlier page before writing each new word. That is slow and repetitive.

Instead, you keep running notes in the margin. Each time you finish a page, you jot down what matters. To write the next word, you glance at your notes rather than reread the book. The KV cache is those notes. The keys tell the model where to look, and the values are what it finds there. The model builds the notes once per token, then reuses them for every token that follows.

How it works

A transformer generates text one token at a time, a process called autoregressive decoding. Each new token attends to all previous tokens. Attention needs three projections per token: a query, a key, and a value. The query comes from the current token. The keys and values come from every token seen so far.

The insight is that keys and values for past tokens never change once computed. Only the new token adds a new key and value. So the model caches them. Without a cache, generating token number 1,000 would recompute keys and values for the previous 999 tokens. With the cache, it computes them once and reads them back.

Step 1: Process the prompt. Compute keys and values for every prompt token and store them in the cache.
Step 2: Generate a token. The new token attends to cached keys and values, plus its own.
Step 3: Append and repeat. Add the new token's key and value to the cache, then generate the next.
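
The three steps can be sketched for a single attention head in NumPy. This is a minimal illustration, not how any particular framework implements it: the weights `W_q`, `W_k`, `W_v`, the `KVCache` class, and the random stand-in embeddings are all assumptions made for the example. It checks that attending over cached keys and values gives the same result as recomputing them from scratch at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)
# Hypothetical projection weights for one attention head.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention of one query vector (d,) over keys and values of shape (t, d)."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

class KVCache:
    """Toy cache: keys and values for every token seen so far."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, x):
        x = np.atleast_2d(x)  # accept one token (d,) or many (n, d)
        self.K = np.vstack([self.K, x @ W_k])
        self.V = np.vstack([self.V, x @ W_v])

# Stand-in token embeddings: 4 prompt tokens, then 2 "generated" tokens.
embeddings = rng.standard_normal((6, d))
prompt, generated = embeddings[:4], embeddings[4:]

cache = KVCache(d)
cache.append(prompt)  # Step 1: process the prompt once

outputs_cached, outputs_full = [], []
for i, x_new in enumerate(generated, start=4):
    # Steps 2 and 3: add the new token's key and value, then attend over
    # everything in the cache (past tokens plus the new one).
    cache.append(x_new)
    outputs_cached.append(attend(x_new @ W_q, cache.K, cache.V))

    # Without a cache: recompute keys and values for all tokens so far.
    seen = embeddings[: i + 1]
    outputs_full.append(attend(x_new @ W_q, seen @ W_k, seen @ W_v))

print(all(np.allclose(a, b) for a, b in zip(outputs_cached, outputs_full)))
```

The cached path does one key and one value projection per new token, while the uncached path repeats the projection for every earlier token at every step.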

The trade-off is memory. The cache grows linearly with the number of tokens, and it scales with the number of layers and attention heads in the model. A single long conversation can hold gigabytes of keys and values on the GPU. When many users share one GPU, their caches compete for the same limited memory. This is why a longer context window costs more to serve, and why the KV cache is central to both serving cost and long-context limits.
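
A back-of-the-envelope calculation shows how quickly this adds up. The sketch below assumes standard multi-head attention, 16-bit values, and an illustrative model shape (32 layers, 32 heads, head dimension 128); these numbers are assumptions chosen for the example, not figures from any specific deployment.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading factor 2 counts one key vector and one value vector
    per token, per layer, per key-value head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Illustrative model shape (an assumption for this example).
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=1)
full_context = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=4096)

print(per_token / 2**20, "MiB per token")           # 0.5 MiB per token
print(full_context / 2**30, "GiB for 4096 tokens")  # 2.0 GiB for 4096 tokens
```

Under these assumptions a single 4,096-token sequence holds 2 GiB of cache, and the cost doubles whenever the sequence length or the number of concurrent sequences doubles.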

Managing cache memory

Because the cache dominates GPU memory during serving, how it is stored matters. The vLLM system, introduced by Kwon and colleagues in 2023, describes the KV cache as memory that grows and shrinks dynamically per request and that naive systems waste through fragmentation. Their PagedAttention method borrows the paging idea from operating systems: it splits each cache into fixed-size blocks that need not sit next to each other in memory. The paper reports near-zero waste in KV cache memory and throughput 2 to 4 times higher than prior systems at the same latency, with larger gains on longer sequences.
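
The paging idea can be illustrated with a toy block allocator. This is a simplified sketch of the general technique, not vLLM's actual code or API: the class `PagedKVAllocator`, its methods, and the block sizes are hypothetical, and it tracks only block bookkeeping rather than real tensors.

```python
class PagedKVAllocator:
    """Toy allocator that hands out fixed-size cache blocks from a shared pool."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, request_id):
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=4)
# Two requests grow at the same time: "a" reaches 5 tokens, "b" reaches 3.
for i in range(5):
    alloc.append_token("a")
    if i < 3:
        alloc.append_token("b")

print(alloc.block_tables)  # {'a': [7, 5], 'b': [6]}
alloc.free("a")
print(len(alloc.free_blocks))  # 7
```

Request "a" ends up in blocks 7 and 5, which are not adjacent, and each request wastes at most one partly filled block. Memory is claimed only as tokens arrive and returns to the pool as soon as a request finishes.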

Other techniques trim the cache itself. Grouped-query attention lets several query heads share one set of keys and values, so the cache is smaller by design, with no separate optimization step at serving time. Quantization stores keys and values in lower precision. Eviction and compression drop or summarize older entries. Each method trades a little accuracy or complexity for room to serve longer contexts or more users.
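
As an illustration of the first of these, here is a minimal NumPy sketch of grouped-query attention for one decoding step. The head counts, shapes, and random data are assumptions for the example. The cache stores only `n_kv_heads` sets of keys and values, and each group of query heads reads the set it shares.

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, t = 8, 2, 16, 10  # illustrative sizes
group = n_q_heads // n_kv_heads                     # query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, head_dim))            # current token's queries
K_cache = rng.standard_normal((n_kv_heads, t, head_dim))  # stored keys
V_cache = rng.standard_normal((n_kv_heads, t, head_dim))  # stored values

# Map each query head to the KV head its group shares.
kv_index = np.arange(n_q_heads) // group
# Expanded copies exist only at compute time; the stored cache stays small.
K = K_cache[kv_index]  # shape (n_q_heads, t, head_dim)
V = V_cache[kv_index]

scores = np.einsum("hd,htd->ht", q, K) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum("ht,htd->hd", weights, V)  # one output vector per query head

print(kv_index.tolist())       # [0, 0, 0, 0, 1, 1, 1, 1]
print(K.size // K_cache.size)  # 4
```

With 8 query heads sharing 2 key-value heads, the stored cache is a quarter of the size that standard multi-head attention would need for the same number of query heads.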

The KV cache is the engine of inference, the phase where a trained model produces output. It is what makes an LLM fast enough to respond token by token in real time.

Two serving techniques build directly on it. Continuous batching packs many requests onto one GPU, and their KV caches share that GPU’s memory, so efficient cache management decides how many requests fit. Speculative decoding has a small draft model propose several tokens at once; the full model then verifies them in a single pass that reuses its KV cache, cutting the number of full forward steps.
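
The propose-then-verify loop of speculative decoding can be sketched with toy stand-ins for the two models. This is the simple greedy variant, shown for clarity; practical systems typically use a sampling-based acceptance rule. The functions `draft_next` and `target_next` are hypothetical next-token functions invented for this example, and the verification loop only emulates what a real system would do in one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding over lists of token ids."""
    # The cheap draft model proposes k tokens, one after another.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # The target model checks each proposed position. A real system scores
    # all k positions in one forward pass, reusing the KV cache for `prefix`.
    accepted = []
    for i in range(k):
        expected = target_next(prefix + proposal[:i])
        if proposal[i] != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            break
        accepted.append(proposal[i])
    else:
        # Every proposal matched, so the target's next token comes for free.
        accepted.append(target_next(prefix + proposal))
    return prefix + accepted

# Toy models: the target counts upward modulo 10; the draft agrees except
# right after token 3, where it wrongly predicts 0.
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10

result = speculative_step([0], draft_next, target_next, k=4)
print(result)  # [0, 1, 2, 3, 4]
```

Here one verification round yields four new tokens, and the output matches what the target model alone would have produced greedily. In a real system, cache entries written for rejected draft tokens would need to be discarded before the next round.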

Further reading