Lesson 06 · The working memory

The KV cache

The model's weights are fixed; what grows while the model writes is the KV cache — the working memory of generation. It holds the attention states (key and value) of every token already seen, so the model doesn't recompute the whole past for each new token. This lesson shows where those bytes come from, why they blow up RAM on long context, and the compressed sparse attention trick that lets DeepSeek-V4 fit 1M tokens where an ordinary model couldn't.

0.5 MiB · per token in FP16 KV (rule of thumb)
128 · DS4's raw sliding window (tokens)
512 · top-k of visible compressed rows (indexer)
~26 GB · DS4 KV at a 1M-token context

01 · What the KV cache really is

When the model generates the next token, attention has to look at every prior token. Without a cache, it would recompute the key and value projections of the entire history at each step — quadratic, wasted work. The KV cache solves this by storing those attention states for past tokens, so each new token only computes its own K and V and reads the rest from memory.
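To make the saving concrete, here is a minimal counting sketch. It performs no real attention; it only counts how many per-token K/V projections a decoder would run with and without a cache. The function names are illustrative and do not come from any library.

```python
def kv_projections_without_cache(n_tokens: int) -> int:
    """Count K/V projections when every step recomputes the whole history."""
    total = 0
    for step in range(1, n_tokens + 1):
        total += step  # step t re-projects all t tokens seen so far
    return total


def kv_projections_with_cache(n_tokens: int) -> int:
    """Count K/V projections when past K/V rows are stored and reused."""
    cache = []  # one (K, V) placeholder per token already processed
    computed = 0
    for token_id in range(n_tokens):
        cache.append((f"K{token_id}", f"V{token_id}"))  # only the new token
        computed += 1
    return computed


n = 4096
print(kv_projections_without_cache(n))  # grows like n * (n + 1) / 2
print(kv_projections_with_cache(n))     # grows like n
```

The price of the second version is the `cache` list itself: one stored row per token, which is the memory the rest of this lesson accounts for.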

"The KV cache is the model's working memory during generation. It stores the key/value attention states of prior tokens so the model doesn't recompute history at every token." (Ahmad Osman, "LLMs 101 (2026)")

The cost is predictable. Ahmad's rule of thumb: 0.5 MiB per token in an FP16 KV. It scales linearly with context length — and that's exactly where the problem lives.

[Figure: KV memory (GiB) vs context length in tokens, FP16 at ~0.5 MiB/token. Normal (dense) model: linear, 4K ≈ 2 GiB, 32K ≈ 16 GiB, passes the RAM ceiling and leaves the chart; 1M is impossible. DeepSeek compressed: flattens, 1M tokens ≈ 26 GiB (indexer ≈ 22).]

The red line (dense attention) grows straight and runs off the chart before 256K. The orange one (DS4's compressed sparse attention) bends and levels near 26 GiB even at 1M tokens.

The byte formula

KV size isn't magic; it's a multiplication:

KV bytes = tokens × layers × kv_heads × head_dim × precision × 2

Here precision is the number of bytes per stored value (2 for FP16), and the ×2 is because we store both K and V. Hence the rule of thumb: in FP16, for a typical dense model shape, this works out to ≈ 0.5 MiB per token. Multiply by context: 4K ≈ 2 GiB, 32K ≈ 16 GiB. Every lever in the formula (fewer kv_heads, lower precision, fewer rows stored) is a way to cut that bill.
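The formula can be checked in a few lines. The sketch below assumes a hypothetical dense model shape (32 layers, 32 KV heads, head dimension 128) picked because it lands exactly on the 0.5 MiB/token rule of thumb; it is not the shape of DeepSeek-V4 or of any specific model named in this lesson.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """tokens x layers x kv_heads x head_dim x precision x 2 (K and V)."""
    return tokens * layers * kv_heads * head_dim * bytes_per_value * 2


# Hypothetical shape, chosen only to reproduce the rule of thumb.
SHAPE = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_value=2)  # FP16

per_token = kv_bytes(1, **SHAPE)
print(f"per token: {per_token / MIB} MiB")

for context in (4096, 32768):
    print(f"{context} tokens: {kv_bytes(context, **SHAPE) / GIB} GiB")
```

Changing any single argument shows the corresponding lever: halving `bytes_per_value` or `kv_heads` halves every line of the output.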

02 · Quantizing the KV — how far it goes

The first lever in the formula is precision. Reducing the bytes per number in the cache drops memory directly. But unlike quantizing weights, the KV is numerically sensitive — it's the active memory of generation, and corrupting it degrades coherence on the spot. Ahmad gives the practical tiers:

KV precision | Status | Practical takeaway
FP16 / BF16 | Clean baseline | No loss. It's the reference for the 0.5 MiB/token rule.
FP8 / INT8 | Practical floor | Half the memory, acceptable quality. As far as the casual user should go.
Sub-8-bit (KIVI, KVQuant) | Research territory | Heavy, with dedicated methods. Not a casual flip-the-switch.

"FP16/BF16 is the clean baseline; FP8/INT8 is the practical floor; below 8 bits is research territory (KIVI, KVQuant), not a casual switch." (Ahmad Osman, "LLMs 101 (2026)")

Why the KV is more fragile than the weights: the weights are read; the KV is the active history. A quantization error in the cache propagates through the entire following sequence. That's why the safe floor is 8 bits, and going deeper demands algorithms like KIVI/KVQuant, not a simple flag.
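To see what "one byte per value" costs in accuracy, here is a minimal sketch of symmetric per-row INT8 quantization on a made-up key row. This is a generic textbook scheme used for illustration only; it is an assumption, not the method of KIVI, KVQuant, or any particular runtime, and the sample values are invented.

```python
def quantize_int8(values):
    """Symmetric per-row INT8 quantization: returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Invented example row standing in for one cached key vector.
key_row = [0.12, -1.50, 0.03, 0.75, -0.40, 2.54, -0.01, 0.98]

codes, scale = quantize_int8(key_row)
restored = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(key_row, restored))

print("codes:", codes)
print("scale:", scale)
print("max round-trip error:", max_err)  # bounded by scale / 2
```

The per-value error is bounded by half the scale, but every later token attends over these slightly shifted rows, which is why the error matters more here than in a one-off weight read.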

03 · The other lever — attention types

The formula has a kv_heads term. The fewer KV heads, the smaller the cache — and that's exactly the idea behind the attention variants. A model can have many query heads but share few key/value heads.

[Figure: query heads (Q) and KV heads (memory) under three attention types. MHA (multi-head): one KV per query, Q1 to Q4 with KV1 to KV4, 4 KV, costly at long context. GQA (grouped-query): one KV per group, Q1 to Q4 with KV1 and KV2, 2 KV, the common middle ground. MQA (multi-query): one shared KV for Q1 to Q4, 1 KV, memory-thrifty.]

Left to right, fewer KV heads: MHA (one KV per query, costly at long context), GQA (query heads in groups, each sharing one KV), MQA (a single KV for all queries, the most memory-efficient).
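The three variants differ only in how query heads map onto KV heads. The sketch below uses the 4-query-head example from the figure and assumes contiguous grouping (query heads 0 and 1 share KV head 0, and so on), which is a common convention but an assumption here; the helper name is illustrative.

```python
def kv_head_for_query(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


N_Q_HEADS = 4
VARIANTS = {"MHA": 4, "GQA": 2, "MQA": 1}  # number of KV heads per variant

for name, n_kv in VARIANTS.items():
    mapping = [kv_head_for_query(q, N_Q_HEADS, n_kv) for q in range(N_Q_HEADS)]
    share = n_kv / N_Q_HEADS  # cache size relative to MHA
    print(f"{name}: query->KV {mapping}, cache = {share:.0%} of MHA")
```

Because the cache stores KV heads and not query heads, the `share` value is exactly the factor by which the kv_heads term of the formula shrinks.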

04 · The DeepSeek-V4 leap — compressed sparse attention

Quantizing and using GQA cut the KV by a constant factor. But the curve is still linear in the number of tokens, and at 1M tokens no constant factor saves you. DeepSeek-V4 attacks the term that really matters: how many KV rows each layer keeps. The technique is Compressed Sparse Attention (CSA), and it works per layer.

Each layer keeps a raw sliding window of the 128 most recent tokens — dense, exact attention where it matters most, on the immediate past. Anything older than that becomes compressed rows, and how much it compresses depends on the layer type.

[Figure: one CSA layer (even, ≥2) over the token stream, past to present. Compressed history: 1 row per 4 tokens (ratio-4), thousands of compressed rows. Raw window: latest 128 tokens, dense. Indexer: 64 heads, head-dim 128, selects up to 512 visible rows. The current token (query) reads the raw window directly and asks the indexer. Result: the token sees the ≤512 most relevant old chunks plus the 128 raw recent ones.]

The raw window gives precision on the immediate past; the indexer (64 heads, head-dim 128) runs a sparse search over the compressed history and returns at most the 512 most relevant rows. Instead of attending to every token of a million-token context, the layer attends to a few hundred rows.

The central insight: long context doesn't require looking at everything all the time. Almost every token only depends on the immediate past (the raw window) plus a few specific distant chunks (what the indexer fishes out). CSA encodes that intuition straight into the architecture.
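A rough sketch of that bookkeeping for one even CSA layer follows. The constants (128, ratio-4, top-512) come from this lesson; everything else is an assumption made for illustration: that incomplete groups of 4 are not yet compressed, and that the indexer reduces to picking the highest plain scores (the real indexer computes them with 64 heads of dimension 128, which is not modeled here).

```python
import heapq

RAW_WINDOW = 128  # latest tokens kept uncompressed
RATIO = 4         # even layers: 1 compressed row per 4 older tokens
TOP_K = 512       # rows the indexer may make visible


def csa_row_counts(n_tokens: int):
    """Return (raw rows, compressed rows stored, compressed rows visible)."""
    raw = min(n_tokens, RAW_WINDOW)
    older = max(0, n_tokens - RAW_WINDOW)
    compressed = older // RATIO  # assumption: partial groups not emitted yet
    visible = min(compressed, TOP_K)
    return raw, compressed, visible


def select_rows(scores, k=TOP_K):
    """Toy stand-in for the indexer: indices of the k highest scores."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


raw, stored, visible = csa_row_counts(1_000_000)
print(f"stored: {raw} raw + {stored} compressed rows")
print(f"attended per step: {raw + visible} rows")
print(select_rows([0.1, 0.9, 0.3, 0.7], k=2))
```

Note the two different numbers: the layer still stores every compressed row, but each new token attends to only the raw window plus the selected rows.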

05 · The 43-layer pattern

CSA isn't uniform. DS4 has 43 layers, and each one's type alternates in a fixed pattern. The first two layers are raw and simple; after that, even and odd layers specialize — one aggressive on detail, the other aggressive on compression.

[Figure: DeepSeek-V4's 43 layers (0 to 42), by CSA type. Layers 0 and 1: raw window only. Even layers ≥2: ratio-4 + indexer; they keep a lot (1 row per 4 tokens) and search. Odd layers ≥3: ratio-128, no indexer; they keep little (1 row per 128 tokens). The costly even layers and the cheap odd layers alternate, splitting the depth between detail and compression.]

Layer | CSA type | Compression ratio | Indexer?
0 and 1 | Raw window only | (none) | No
Even from 2 (2, 4, 6 …) | Compressed + search | ratio-4 (1 row per 4 tokens) | Yes, selects up to 512 rows
Odd from 3 (3, 5, 7 …) | Aggressively compressed | ratio-128 (1 row per 128 tokens) | No

Fixed CSA constants (from DS4's MODEL_CARD)

43 layers · 128-token raw window · indexer with 64 heads · indexer head-dim 128 · indexer top-k 512. Even layers ≥2 use ratio-4 with an indexer; odd ≥3 use ratio-128 without one.
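The layer pattern is simple enough to write as a function. This sketch assumes layers are numbered 0 to 42, as in the figure; the function name and the string labels are illustrative.

```python
from collections import Counter

N_LAYERS = 43


def layer_kind(index: int) -> str:
    """CSA type of a layer, following the fixed pattern described above."""
    if index < 2:
        return "raw window only"
    if index % 2 == 0:
        return "ratio-4 + indexer"
    return "ratio-128, no indexer"


counts = Counter(layer_kind(i) for i in range(N_LAYERS))
for kind, n in counts.items():
    print(f"{n:2d} layers: {kind}")
```

Counting them this way shows the split: 2 raw layers, 21 even layers with an indexer, and 20 heavily compressed odd layers.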

06 · The result — 1M context in ~26 GB

Put it all together and the math closes where dense attention never would. According to DS4's README (antirez), the KV of a full 1M-token context takes ≈ 26 GB of memory — and the compressed indexer alone accounts for ≈ 22 GB of it. It's the orange curve from the first diagram, now in concrete numbers.

[Figure: memory at a 1M-token context, on a 0 to 128 GB scale. DeepSeek-V4 q2 weights: 81 GB, resident. 1M-context KV: ≈ 26 GB (indexer ≈ 22 GB + 4 GB other KV). Total ≈ 107 GB, leaving ≈ 21 GB of headroom under the 128 GB ceiling. Reference: a dense KV (≈ 0.5 MiB/token) for 1M tokens would overflow the chart at hundreds of GB, impossible in 128 GB; this is what CSA avoids.]

The 81 GB of weights plus the ~26 GB of KV (of which ~22 GB is the indexer) add up to ~107 GB — it fits in 128 GB, but scrapes the headroom Lesson 08 will demand.

"A full 1M-token context consumes about 26 GB of memory; the compressed indexer alone accounts for ~22 GB. With 128 GB running the 81 GB q2 weights, a 100–300k context is the wiser choice." (antirez, DeepSeek-V4 (DS4) README)

The practical choice: 1M tokens is possible, not comfortable. Running q2 (81 GB) and still reserving ~26 GB for a 1M KV leaves the system with only ~21 GB of margin. That's why antirez's recommendation is to aim for 100–300k of context: a fraction of the KV, headroom preserved, and still far beyond what a dense model would deliver.
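The budget above is plain arithmetic, sketched here with the figures quoted in this lesson. The dense comparison reuses the 0.5 MiB/token rule of thumb and mixes GiB with GB, so treat it as an order-of-magnitude check, not a measurement.

```python
RAM_GB = 128      # machine memory assumed in this lesson
WEIGHTS_GB = 81   # DeepSeek-V4 q2 weights
KV_GB = 26        # KV at a 1M-token context (DS4 README figure)
INDEXER_GB = 22   # share of the KV taken by the compressed indexer

total_gb = WEIGHTS_GB + KV_GB
headroom_gb = RAM_GB - total_gb
other_kv_gb = KV_GB - INDEXER_GB

# Dense reference: 0.5 MiB per token, converted from MiB to GiB.
dense_kv_gib = 1_000_000 * 0.5 / 1024

print(f"weights + KV: {total_gb} GB, headroom: {headroom_gb} GB")
print(f"KV outside the indexer: {other_kv_gb} GB")
print(f"dense KV at 1M tokens: about {dense_kv_gib:.0f} GiB")
```

The last line is the whole argument for CSA: the dense figure alone is several times the machine's memory before the weights are even loaded.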

07 · Closing the argument

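The KV cache is the piece that turns "how many tokens fit" into a concrete RAM bill. Three levers control it, listed here in the order this lesson covered them:

1. Precision: FP16/BF16 is the baseline; FP8/INT8 halves the cache and is the practical floor (section 02).
2. KV heads: GQA and MQA share key/value heads across query heads, shrinking the kv_heads term (section 03).
3. Rows stored: CSA keeps a 128-token raw window and compresses everything older, so each layer stores far fewer rows than tokens (section 04).

The first two cut the KV by a constant factor; the third changes how many rows each layer keeps. That's why DeepSeek-V4 fits 1M tokens in ~26 GB where a dense model would need far more and simply wouldn't fit. The next lesson shows how this concrete model runs on your Mac.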

1. Why does the KV cache exist — what problem does it solve in generation?
Correct: b. The KV cache is the working memory of generation: each new token only computes its own K/V and reads the rest from the cache, avoiding recomputing the past. Rule-of-thumb cost: ~0.5 MiB/token in FP16 (4K ≈ 2 GiB, 32K ≈ 16 GiB).
2. In DeepSeek-V4's CSA, what lets an EVEN layer from 2 onward get long context without keeping everything?
Correct: c. The even layer ≥2 combines a 128 raw window (dense, recent) + ratio-4 compressed rows + a top-512 indexer (sparse search over history). The odd ones ≥3 go more aggressive: ratio-128 and no indexer. Result: 1M tokens ≈ 26 GB of KV (indexer ≈ 22 GB).