Lesson 06 · The working memory

The KV cache

The model's weights are fixed; what grows while the model writes is the KV cache — the working memory of generation. It holds the attention states (key and value) of every token already seen, so the model doesn't recompute the whole past for each new token. This lesson shows where those bytes come from, why they blow up RAM on long context, and the compressed sparse attention trick that lets DeepSeek-V4 fit 1M tokens where an ordinary model couldn't.

0.5 MiB: per token in FP16 KV (rule of thumb)

128: DS4's raw sliding window (tokens)

512: top-k of visible compressed rows (indexer)

~26 GB: DS4 KV at a 1M-token context

01 · What the KV cache really is

When the model generates the next token, attention has to look at every prior token. Without a cache, it would recompute the key and value projections of the entire history at each step — quadratic, wasted work. The KV cache solves this by storing those attention states for past tokens, so each new token only computes its own K and V and reads the rest from memory.
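The saving described above can be made concrete with a toy single-head attention loop. This is a minimal sketch under stated assumptions, not any real model's code: the width `DIM`, the random weights, and the helper names (`matvec`, `attend`, `generate_with_cache`) are all hypothetical. It counts how many token positions get projected into K/V with and without a cache and checks that both paths produce the same outputs.

```python
import math
import random

def matvec(x, w):
    """Return x @ w for a vector x and a square matrix w stored as a list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Softmax attention of one query over every key/value seen so far."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def generate_without_cache(tokens, wq, wk, wv):
    """Re-project the whole history into K and V at every step."""
    outputs, kv_projections = [], 0
    for t, x in enumerate(tokens):
        history = tokens[: t + 1]
        keys = [matvec(h, wk) for h in history]
        values = [matvec(h, wv) for h in history]
        kv_projections += len(history)
        outputs.append(attend(matvec(x, wq), keys, values))
    return outputs, kv_projections

def generate_with_cache(tokens, wq, wk, wv):
    """Project only the new token and append its K and V to the cache."""
    outputs, kv_projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(matvec(x, wk))
        v_cache.append(matvec(x, wv))
        kv_projections += 1
        outputs.append(attend(matvec(x, wq), k_cache, v_cache))
    return outputs, kv_projections

# Toy sizes and random data, chosen only for illustration.
rng = random.Random(0)
DIM, STEPS = 4, 6

def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

wq, wk, wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(STEPS)]

slow_out, slow_count = generate_without_cache(tokens, wq, wk, wv)
fast_out, fast_count = generate_with_cache(tokens, wq, wk, wv)
print(slow_count, fast_count)  # 21 6
print(slow_out == fast_out)    # True
```

For 6 steps the uncached loop projects 1 + 2 + ... + 6 = 21 positions, the cached loop only 6. The price of that saving is the memory the cache occupies, which the rest of the lesson measures.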

The KV cache is the model's working memory during generation. It stores the key/value attention states of prior tokens so the model doesn't recompute history at every token. (Ahmad Osman, "LLMs 101 (2026)")

The cost is predictable. Ahmad's rule of thumb is 0.5 MiB per token for an FP16 KV cache. That cost scales linearly with context length, and that is exactly where the problem lives.

The red line (dense attention) grows straight and runs off the chart before 256K. The orange one (DS4's compressed sparse attention) bends and levels off near 26 GB even at 1M tokens.

The byte formula

KV size isn't magic; it's a multiplication: tokens × layers × kv_heads × head_dim × precision × 2, where precision is the number of bytes per stored value (2 for FP16) and the ×2 is there because we store both K and V. Hence the rule of thumb: in FP16, this works out to ≈ 0.5 MiB per token. Multiply by context: 4K ≈ 2 GiB, 32K ≈ 16 GiB. Every lever in the formula (fewer kv_heads, lower precision, fewer rows stored) is a way to cut that bill.
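The multiplication can be checked directly. The sketch below assumes an illustrative model shape (32 layers, 32 KV heads, head_dim 128) that is not stated in this lesson; it is chosen because it reproduces the 0.5 MiB rule exactly in FP16. The function name `kv_cache_bytes` is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """tokens x layers x kv_heads x head_dim x precision x 2 (K and V)."""
    return tokens * layers * kv_heads * head_dim * bytes_per_value * 2

# Assumed shape for illustration only; FP16 means 2 bytes per value.
SHAPE = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **SHAPE)
print(per_token / MIB)                          # 0.5
print(kv_cache_bytes(4 * 1024, **SHAPE) / GIB)  # 2.0
print(kv_cache_bytes(32 * 1024, **SHAPE) / GIB) # 16.0
```

Swapping any single value in `SHAPE` shows how much each lever moves the total.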

02 · Quantizing the KV — how far it goes

The first lever in the formula is precision. Reducing the bytes per number in the cache drops memory directly. But unlike quantizing weights, the KV is numerically sensitive — it's the active memory of generation, and corrupting it degrades coherence on the spot. Ahmad gives the practical tiers:

KV precision	Status	Practical takeaway
FP16 / BF16	Clean baseline	No loss. It's the reference for the 0.5 MiB/token rule.
FP8 / INT8	Practical floor	Half the memory, acceptable quality. As far as the casual user should go.
Sub-8-bit (KIVI, KVQuant)	Research territory	Heavy, with dedicated methods. Not a casual flip-the-switch.

FP16/BF16 is the clean baseline; FP8/INT8 is the practical floor; below 8 bits is research territory (KIVI, KVQuant), not a casual switch. (Ahmad Osman, "LLMs 101 (2026)")

Why the KV is more fragile than the weights: the weights are static parameters that are only read, while the KV is the active history that every later token attends to. A quantization error in the cache therefore propagates through the entire following sequence. That's why the safe floor is 8 bits, and going deeper demands algorithms like KIVI/KVQuant, not a simple flag.
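To see what 8-bit storage costs in accuracy, here is a minimal sketch of generic symmetric per-row INT8 quantization. It is an assumption-laden illustration: it is not KIVI, not KVQuant, and not the scheme of any particular runtime, and the row of random values merely stands in for one cached key vector.

```python
import random

def quantize_int8(row):
    """Symmetric per-row quantization: floats -> integers in [-127, 127] plus one scale."""
    scale = max(abs(x) for x in row) / 127 or 1.0  # fall back to 1.0 for an all-zero row
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

rng = random.Random(0)
row = [rng.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for one key vector (head_dim 128)

quantized, scale = quantize_int8(row)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))

fp16_bytes = len(row) * 2
int8_bytes = len(row) * 1  # plus one scale value per row, ignored here
print(fp16_bytes, int8_bytes)   # 256 128
print(max_error <= scale / 2 + 1e-12)
```

Each value is off by at most half a quantization step. Halving the bit width again would double that step several times over, which is one way to read why sub-8-bit caches need dedicated methods.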

03 · The other lever — attention types

The formula has a kv_heads term. The fewer KV heads, the smaller the cache — and that's exactly the idea behind the attention variants. A model can have many query heads but share few key/value heads.

Left to right, fewer KV heads: MHA (one KV per query, costly at long context), GQA (query heads in groups, each sharing one KV), MQA (a single KV for all queries, the most memory-efficient).

MQA — a single KV head shared by all queries. Minimal memory.
GQA — query heads grouped, each group with its own KV. The middle ground between cost and quality.
MHA — full attention, one KV per query head. More capacity, but costly at long context.
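The effect of sharing KV heads follows from the byte formula. The sketch below assumes a hypothetical model with 32 layers, 32 query heads, and head_dim 128 in FP16, and an assumed GQA group count of 8; none of these numbers come from a specific model in this lesson.

```python
KIB = 1024

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Per-token KV size: layers x kv_heads x head_dim x precision x 2 (K and V)."""
    return layers * kv_heads * head_dim * bytes_per_value * 2

# Number of KV heads per variant, for a model with 32 query heads (assumed).
KV_HEADS = {"MHA": 32, "GQA": 8, "MQA": 1}

per_token_kib = {
    name: kv_bytes_per_token(layers=32, kv_heads=heads, head_dim=128) // KIB
    for name, heads in KV_HEADS.items()
}
print(per_token_kib)  # {'MHA': 512, 'GQA': 128, 'MQA': 16}
```

The query-head count never enters the formula: only the KV heads are cached, which is why sharing them shrinks the cache without touching the number of queries.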

04 · The DeepSeek-V4 leap — compressed sparse attention

Quantizing and using GQA cut the KV by a constant. But the curve is still linear in the number of tokens — and at 1M tokens no constant saves you. DeepSeek-V4 attacks the term that really matters: how many KV rows each layer keeps. The technique is Compressed Sparse Attention (CSA), and it works per layer.
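The claim that no constant saves you at 1M tokens can be checked with the byte formula. This is a rough sketch under assumptions: the dense model shape (32 layers, 32 KV heads, head_dim 128) and the GQA group count of 8 are illustrative choices, not figures from the lesson; only the ~26 GB value is the one reported for DS4.

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Dense KV size in decimal GB: every token keeps a full K and V row per layer."""
    return tokens * layers * kv_heads * head_dim * bytes_per_value * 2 / 1e9

TOKENS = 1_000_000
DS4_REPORTED_GB = 26  # figure quoted in this lesson

# Assumed dense baseline: FP16, one KV head per query head.
baseline_gb = kv_cache_gb(TOKENS, layers=32, kv_heads=32, head_dim=128, bytes_per_value=2)
# Both constant-factor levers applied: 8-bit values and 8 shared KV heads.
reduced_gb = kv_cache_gb(TOKENS, layers=32, kv_heads=8, head_dim=128, bytes_per_value=1)

print(f"{baseline_gb:.1f} GB dense FP16")
print(f"{reduced_gb:.1f} GB with 8-bit values and 8 KV heads")
print(reduced_gb > DS4_REPORTED_GB)
```

Under these assumptions an 8x reduction still leaves a cache more than twice the size reported for DS4, because the tokens term is untouched.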

Each layer keeps a raw sliding window of the 128 most recent tokens — dense, exact attention where it matters most, on the immediate past. Anything older than that becomes compressed rows, and how much it compresses depends on the layer type.

The raw window gives precision on the immediate past; the indexer (64 heads, head-dim 128) runs a sparse search over the compressed history and returns only the 512 most relevant rows. Instead of attending to as many as a million tokens, the layer attends to a few hundred rows: at most 128 raw tokens plus 512 selected ones.

The central insight: long context doesn't require looking at everything all the time. Almost every token only depends on the immediate past (the raw window) plus a few specific distant chunks (what the indexer fishes out). CSA encodes that intuition straight into the architecture.
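The "raw window plus a few distant chunks" idea can be sketched as a selection step. This is a hypothetical illustration, not DS4's implementation: how the indexer actually scores rows is not described here, so random numbers stand in for its relevance scores, and the names `select_rows` and `visible_rows` are invented for the example.

```python
import heapq
import random

RAW_WINDOW = 128  # most recent tokens kept uncompressed
TOP_K = 512       # compressed rows the indexer may return

def select_rows(index_scores, top_k=TOP_K):
    """Indices of the top_k highest-scoring compressed rows, in position order."""
    k = min(top_k, len(index_scores))
    best = heapq.nlargest(k, range(len(index_scores)), key=index_scores.__getitem__)
    return sorted(best)

def visible_rows(context_len, index_scores):
    """What one indexed layer attends to: the raw window plus the selected rows."""
    raw_positions = list(range(max(0, context_len - RAW_WINDOW), context_len))
    return raw_positions, select_rows(index_scores)

# Stand-in scores for a 1M-token context compressed 4:1 (assumed row count).
rng = random.Random(0)
context_len = 1_000_000
scores = [rng.random() for _ in range(context_len // 4)]

raw_positions, selected = visible_rows(context_len, scores)
print(len(raw_positions), len(selected), len(raw_positions) + len(selected))  # 128 512 640
```

However long the history grows, the number of rows entering attention stays capped at 640 in this sketch; only the search over scores grows with context.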

05 · The 43-layer pattern

CSA isn't uniform. DS4 has 43 layers, and the layer type follows a fixed pattern. The first two layers keep only the raw window; after that, even and odd layers specialize: even layers keep more detail (ratio-4 compression plus the indexer), while odd layers compress far harder (ratio-128, no indexer).

Layer	CSA type	Compression ratio	Indexer?
0 and 1	Raw window only	—	No
Even from 2 (2, 4, 6 …)	Compressed + search	ratio-4 (1 row per 4 tokens)	Yes — selects up to 512 rows
Odd from 3 (3, 5, 7 …)	Aggressively compressed	ratio-128 (1 row per 128 tokens)	No

Fixed CSA constants (from DS4's MODEL_CARD)

43 layers · 128-token raw window · indexer with 64 heads · indexer head-dim 128 · indexer top-k 512. Even layers ≥2 use ratio-4 with an indexer; odd ≥3 use ratio-128 without one.
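The pattern in the table can be written as a small lookup. This is a sketch that assumes layers are indexed 0 to 42 and that a ratio-r layer holds roughly one compressed row per r tokens; the function name `layer_config` and the type labels are hypothetical.

```python
from collections import Counter

NUM_LAYERS = 43

def layer_config(index):
    """Map a layer index to its CSA settings, following the table above."""
    if index < 2:
        return {"type": "raw", "ratio": None, "indexer": False}
    if index % 2 == 0:
        return {"type": "ratio-4", "ratio": 4, "indexer": True}
    return {"type": "ratio-128", "ratio": 128, "indexer": False}

counts = Counter(layer_config(i)["type"] for i in range(NUM_LAYERS))
print(dict(counts))  # {'raw': 2, 'ratio-4': 21, 'ratio-128': 20}

# Approximate compressed rows per layer at a 1M-token context (assumed: tokens // ratio).
tokens = 1_000_000
rows_ratio_4 = tokens // 4
rows_ratio_128 = tokens // 128
print(rows_ratio_4, rows_ratio_128)  # 250000 7812
```

The split shows where the bulk of the stored rows sits: the 21 ratio-4 layers hold about 32 times more rows each than the 20 ratio-128 layers.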

06 · The result — 1M context in ~26 GB

Put it all together and the math closes where dense attention never would. According to DS4's README (antirez), the KV of a full 1M-token context takes ≈ 26 GB of memory — and the compressed indexer alone accounts for ≈ 22 GB of it. It's the orange curve from the first diagram, now in concrete numbers.

The 81 GB of weights plus the ~26 GB of KV (of which ~22 GB is the indexer) add up to ~107 GB — it fits in 128 GB, but scrapes the headroom Lesson 08 will demand.

A full 1M-token context consumes about 26 GB of memory; the compressed indexer alone accounts for ~22 GB. With 128 GB running the 81 GB q2 weights, a 100–300k context is the wiser choice. (antirez, DeepSeek-V4 (DS4) README)

The practical choice: 1M tokens is possible, not comfortable. Running q2 (81 GB) and still reserving ~26 GB for a 1M KV leaves the system without margin. That's why antirez's recommendation is to aim for 100–300k of context — a fraction of the KV, headroom preserved, and still far beyond what a dense model would deliver.
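The budget behind that recommendation is plain arithmetic. The sketch below uses only the figures quoted in this lesson (128 GB of RAM, 81 GB of q2 weights, ~26 GB of KV at 1M tokens); the function name `memory_budget` is hypothetical, and the KV size for shorter contexts is left as an input to be measured rather than guessed.

```python
TOTAL_RAM_GB = 128
WEIGHTS_GB = 81    # q2 weights, as quoted above
KV_AT_1M_GB = 26   # KV cache at a 1M-token context, as quoted above

def memory_budget(kv_gb, total_gb=TOTAL_RAM_GB, weights_gb=WEIGHTS_GB):
    """Return (used_gb, headroom_gb) for a given KV cache size."""
    used = weights_gb + kv_gb
    return used, total_gb - used

used_gb, headroom_gb = memory_budget(KV_AT_1M_GB)
print(used_gb, headroom_gb)  # 107 21
```

Whatever is left after weights and KV has to cover the operating system, activations, and every other process, which is why a smaller context, and therefore a smaller `kv_gb`, is the more comfortable setting.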

07 · Closing the argument

The KV cache is the piece that turns "how many tokens fit" into a concrete RAM bill. Three levers control it, listed from smallest to largest impact:

Precision — FP16 is the baseline; FP8/INT8 is the practical floor; sub-8-bit is research. Cuts by a constant.
KV heads — MQA/GQA share KV across queries. Cuts by a constant.
Rows stored per layer — DS4's CSA (128-token raw window + compressed + top-512 indexer) attacks the linear term. It's what bends the curve from "overflows" to "flattens".

That's why DeepSeek-V4 fits 1M tokens in ~26 GB where a dense model would need far more and simply wouldn't fit. The next lesson shows how this concrete model runs on your Mac.

1. Why does the KV cache exist — what problem does it solve in generation?

Correct: b. The KV cache is the working memory of generation: each new token only computes its own K/V and reads the rest from the cache, avoiding recomputing the past. Rule-of-thumb cost: ~0.5 MiB/token in FP16 (4K ≈ 2 GiB, 32K ≈ 16 GiB).

2. In DeepSeek-V4's CSA, what lets an EVEN layer from 2 onward get long context without keeping everything?

Correct: c. The even layer ≥2 combines a 128 raw window (dense, recent) + ratio-4 compressed rows + a top-512 indexer (sparse search over history). The odd ones ≥3 go more aggressive: ratio-128 and no indexer. Result: 1M tokens ≈ 26 GB of KV (indexer ≈ 22 GB).
