The model's weights are fixed; what grows while the model writes is the KV cache — the working memory of generation. It holds the attention states (key and value) of every token already seen, so the model doesn't recompute the whole past for each new token. This lesson shows where those bytes come from, why they blow up RAM on long context, and the compressed sparse attention trick that lets DeepSeek-V4 fit 1M tokens where an ordinary model couldn't.
When the model generates the next token, attention has to look at every prior token. Without a cache, it would recompute the key and value projections of the entire history at each step — quadratic, wasted work. The KV cache solves this by storing those attention states for past tokens, so each new token only computes its own K and V and reads the rest from memory.
"The KV cache is the model's working memory during generation. It stores the key/value attention states of prior tokens so the model doesn't recompute history at every token." (Ahmad Osman, "LLMs 101 (2026)")
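As a rough illustration of the saving, the sketch below counts how many per-token key/value projections a decoder performs with and without a cache. The function names are made up for this example, and one "projection" stands in for computing K and V for one token; it is a toy count, not a model implementation.

```python
def projections_without_cache(num_tokens: int) -> int:
    """Every step re-projects K and V for all tokens seen so far."""
    return sum(range(1, num_tokens + 1))  # 1 + 2 + ... + n, quadratic in n


def projections_with_cache(num_tokens: int) -> int:
    """Every step projects only the new token; older K/V come from the cache."""
    cache = []  # stands in for the stored (key, value) pairs
    computed = 0
    for token_id in range(num_tokens):
        cache.append((f"k{token_id}", f"v{token_id}"))  # one new projection
        computed += 1
    return computed


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, projections_without_cache(n), projections_with_cache(n))
```

The cached version does linear work in projections but pays for it in memory: the `cache` list grows by one entry per token, which is exactly the growth the rest of this lesson measures.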
The cost is predictable. Ahmad's rule of thumb: about 0.5 MiB per token for an FP16 KV cache. It scales linearly with context length, and that's exactly where the problem lives.
The red line (dense attention) grows in a straight line and runs off the chart before 256K tokens. The orange one (the compressed sparse attention of DeepSeek-V4, DS4 for short) bends and levels off near 26 GiB even at 1M tokens.
KV size isn't magic; it's a multiplication: tokens × layers × kv_heads × head_dim × precision × 2, where precision is the number of bytes per stored value and the ×2 is there because we store both K and V. Hence the rule of thumb: in FP16, this works out to ≈ 0.5 MiB per token. Multiply by context: 4K ≈ 2 GiB, 32K ≈ 16 GiB. Every lever in the formula (fewer kv_heads, lower precision, fewer rows stored) is a way to cut that bill.
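The multiplication can be checked in a few lines. The model shape below (32 layers, 32 KV heads, head dimension 128) is an assumption chosen because it reproduces the 0.5 MiB rule of thumb; the lesson does not say which model the rule was derived from.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of K plus V stored for a single token across all layers."""
    return layers * kv_heads * head_dim * bytes_per_value * 2  # x2: K and V


def kv_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
             bytes_per_value: int) -> int:
    """Total cache size: linear in the number of tokens."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)


# Assumed shape (hypothetical): 32 layers, 32 KV heads, head_dim 128, FP16 = 2 bytes.
SHAPE = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_value=2)

if __name__ == "__main__":
    print(kv_bytes_per_token(**SHAPE) / MIB, "MiB per token")
    for tokens in (4 * 1024, 32 * 1024):
        print(tokens, "tokens:", kv_bytes(tokens, **SHAPE) / GIB, "GiB")
```

With this shape the per-token cost is 524,288 bytes, which is 0.5 MiB, and the 4K and 32K contexts come out at 2 GiB and 16 GiB, matching the figures in the text.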
The first lever in the formula is precision. Reducing the bytes per number in the cache drops memory directly. But unlike quantizing weights, the KV is numerically sensitive — it's the active memory of generation, and corrupting it degrades coherence on the spot. Ahmad gives the practical tiers:
| KV precision | Status | Practical takeaway |
|---|---|---|
| FP16 / BF16 | Clean baseline | No loss. It's the reference for the 0.5 MiB/token rule. |
| FP8 / INT8 | Practical floor | Half the memory, acceptable quality. As far as the casual user should go. |
| Sub-8-bit (KIVI, KVQuant) | Research territory | Heavy, with dedicated methods. Not a casual flip-the-switch. |
"FP16/BF16 is the clean baseline; FP8/INT8 is the practical floor; below 8 bits is research territory (KIVI, KVQuant), not a casual switch." (Ahmad Osman, "LLMs 101 (2026)")
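The effect of the precision tiers on memory is plain scaling. This sketch starts from the 0.5 MiB/token FP16 rule and rescales it by bytes per value. It deliberately ignores the extra scale factors or metadata that real quantized caches may store, so treat the 8-bit figures as a lower bound; the helper name is made up for the example.

```python
# Bytes per stored value for the tiers in the table above.
BYTES_PER_VALUE = {"FP16": 2, "BF16": 2, "FP8": 1, "INT8": 1}

FP16_MIB_PER_TOKEN = 0.5  # rule of thumb quoted in the lesson


def kv_gib(tokens: int, precision: str) -> float:
    """Approximate cache size in GiB, rescaled from the FP16 rule of thumb."""
    scale = BYTES_PER_VALUE[precision] / BYTES_PER_VALUE["FP16"]
    return tokens * FP16_MIB_PER_TOKEN * scale / 1024


if __name__ == "__main__":
    for precision in BYTES_PER_VALUE:
        print(precision, kv_gib(32 * 1024, precision), "GiB at 32K tokens")
```

Halving the bytes per value halves the bill, but the curve stays linear in the number of tokens.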
The formula has a kv_heads term. The fewer KV heads, the smaller the cache — and that's exactly the idea behind the attention variants. A model can have many query heads but share few key/value heads.
Left to right, fewer KV heads: MHA (one KV head per query head, costly at long context), GQA (query heads in groups, each group sharing one KV head), MQA (a single KV head for all query heads, the most memory-efficient).
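One way to picture the three variants is as a mapping from query heads to KV heads. The sketch below assumes the common convention of grouping consecutive query heads; the function name and the 8-head example are illustrative, not taken from any particular model.

```python
def kv_head_for_query(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Index of the KV head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def mapping(num_query_heads: int, num_kv_heads: int) -> list:
    """KV head index for every query head, in order."""
    return [kv_head_for_query(q, num_query_heads, num_kv_heads)
            for q in range(num_query_heads)]


if __name__ == "__main__":
    print("MHA", mapping(8, 8))  # one KV head per query head
    print("GQA", mapping(8, 2))  # two groups of four query heads
    print("MQA", mapping(8, 1))  # every query head shares one KV head
```

Only the KV heads are cached, so going from 8 to 2 to 1 KV heads shrinks the `kv_heads` term of the formula by the same factor.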
Quantizing and using GQA cut the KV by a constant factor. But the curve is still linear in the number of tokens, and at 1M tokens no constant factor saves you. DeepSeek-V4 attacks the term that really matters: how many KV rows each layer keeps. The technique is Compressed Sparse Attention (CSA), and it works per layer.
Each layer keeps a raw sliding window of the 128 most recent tokens — dense, exact attention where it matters most, on the immediate past. Anything older than that becomes compressed rows, and how much it compresses depends on the layer type.
The raw window gives precision on the immediate past; the indexer (64 heads, head-dim 128) runs a sparse search over the compressed history and returns only the 512 most relevant rows — instead of attending to millions of tokens, the layer attends to hundreds.
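The selection step can be sketched as a plain top-k over relevance scores. How the indexer actually scores compressed rows is not described here, so the scores below are random placeholders and the helper names are hypothetical; only the window size (128) and the top-k (512) come from the text.

```python
import heapq
import random

RAW_WINDOW = 128  # most recent tokens, attended densely
TOP_K = 512       # compressed rows returned by the indexer


def select_rows(scores: list, top_k: int = TOP_K) -> list:
    """Indices of the top_k highest-scoring compressed rows, in positional order."""
    k = min(top_k, len(scores))
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


def rows_attended(num_compressed_rows: int) -> int:
    """Rows one indexed layer attends to: raw window plus selected compressed rows."""
    return RAW_WINDOW + min(TOP_K, num_compressed_rows)


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [rng.random() for _ in range(250_000)]  # placeholder relevance scores
    chosen = select_rows(scores)
    print(len(chosen), "compressed rows selected;", rows_attended(len(scores)), "rows attended")
```

However long the compressed history gets, the attended set is capped at 128 + 512 = 640 rows per indexed layer.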
CSA isn't uniform. DS4 has 43 layers, and each one's type alternates in a fixed pattern. The first two layers are raw and simple; after that, even and odd layers specialize — one aggressive on detail, the other aggressive on compression.
| Layer | CSA type | Compression ratio | Indexer? |
|---|---|---|---|
| 0 and 1 | Raw window only | — | No |
| Even from 2 (2, 4, 6 …) | Compressed + search | ratio-4 (1 row per 4 tokens) | Yes — selects up to 512 rows |
| Odd from 3 (3, 5, 7 …) | Aggressively compressed | ratio-128 (1 row per 128 tokens) | No |
43 layers · 128-token raw window · indexer with 64 heads · indexer head-dim 128 · indexer top-k 512. Even layers ≥2 use ratio-4 with an indexer; odd ≥3 use ratio-128 without one.
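The layer schedule can be written down as a small function, which also gives a feel for how many rows survive compression. This is a sketch under simplifying assumptions: it counts rows only (not bytes), treats every token older than the raw window as compressed at the layer's ratio with floor division, and does not model the indexer's own storage. The function names are illustrative.

```python
NUM_LAYERS = 43
RAW_WINDOW = 128


def compression_ratio(layer: int):
    """Tokens per compressed row, or None for the raw-window-only layers 0 and 1."""
    if layer < 2:
        return None
    return 4 if layer % 2 == 0 else 128


def rows_kept(layer: int, tokens: int) -> int:
    """Rows this layer keeps: the raw window plus compressed older history."""
    raw = min(tokens, RAW_WINDOW)
    ratio = compression_ratio(layer)
    if ratio is None:
        return raw
    older = max(0, tokens - RAW_WINDOW)
    return raw + older // ratio  # assumption: floor division per layer


def total_rows(tokens: int) -> int:
    """Rows kept across all layers under the schedule above."""
    return sum(rows_kept(layer, tokens) for layer in range(NUM_LAYERS))


if __name__ == "__main__":
    tokens = 1024 * 1024
    dense = NUM_LAYERS * tokens  # a dense model keeps one row per token per layer
    print("compressed rows:", total_rows(tokens))
    print("dense rows:     ", dense)
```

Under these assumptions the schedule has 2 raw layers, 21 ratio-4 layers and 20 ratio-128 layers, and the ratio-4 layers dominate the row count.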
Put it all together and the math closes where dense attention never would. According to DS4's README (antirez), the KV of a full 1M-token context takes ≈ 26 GB of memory — and the compressed indexer alone accounts for ≈ 22 GB of it. It's the orange curve from the first diagram, now in concrete numbers.
The 81 GB of weights plus the ~26 GB of KV (of which ~22 GB is the indexer) add up to ~107 GB — it fits in 128 GB, but scrapes the headroom Lesson 08 will demand.
"A full 1M-token context consumes about 26 GB of memory; the compressed indexer alone accounts for ~22 GB. With 128 GB running the 81 GB q2 weights, a 100–300k context is the wiser choice." (antirez, DeepSeek-V4 (DS4) README)
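The budget arithmetic is simple enough to script. The figures are the ones quoted above (128 GB of RAM, 81 GB of weights, ~26 GB of KV at 1M tokens, ~22 GB of that in the indexer); the helper name is made up for this example, and the sketch does not try to estimate the KV size at shorter contexts.

```python
def headroom_gb(total_ram_gb: int, weights_gb: int, kv_gb: int) -> int:
    """Memory left for the OS and everything else once weights and KV are resident."""
    return total_ram_gb - (weights_gb + kv_gb)


TOTAL_RAM_GB = 128
WEIGHTS_GB = 81
KV_1M_GB = 26       # full 1M-token context
INDEXER_1M_GB = 22  # portion of the KV attributed to the indexer

if __name__ == "__main__":
    print("weights + KV:", WEIGHTS_GB + KV_1M_GB, "GB")
    print("headroom:    ", headroom_gb(TOTAL_RAM_GB, WEIGHTS_GB, KV_1M_GB), "GB")
    print("indexer share of KV:", round(INDEXER_1M_GB / KV_1M_GB, 2))
```

About 21 GB of headroom is all that remains at 1M tokens, which is why the README recommends a shorter context on a 128 GB machine.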
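The KV cache is the piece that turns "how many tokens fit" into a concrete RAM bill. Three levers control it, all of them terms in the size formula: precision (bytes per stored value), the number of KV heads, and the number of rows each layer keeps. The first two cut the cache by a constant factor; the third changes how the cache grows with context.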
That's why DeepSeek-V4 fits 1M tokens in ~26 GB where a dense model would need far more and simply wouldn't fit. The next lesson shows how this concrete model runs on your Mac.