Lesson 04 · Quality

Quantization & quality

Lesson 02 showed that every bit less per weight shrinks the model in memory. But shrinking has a price: at some point the intelligence starts to crumble — and it doesn't crumble evenly. This lesson shows the quality ladder bit by bit, what breaks first when you squeeze too hard, and the asymmetric recipe that fits a huge model into 2 bits without losing the ability to use tools.

At a glance: Q4 is the consumer sweet spot (quality × size); Q8 is near-lossless; 2 bits goes only to the MoE routed experts (most of the model); shared experts, projections and routing stay untouched.

01 · The quality ladder

Quantizing is trading numerical precision for space. At the top, FP16/BF16 is the baseline: the full quality the model was trained at. Each step down cuts bits and, with them, a little fidelity. The key point is that the loss is not linear: the first steps cost almost nothing; the last ones cost nearly all of the model's intelligence. The diagram below draws the real curve and marks the cliff.

From FP16 to Q4 the curve barely dips — quality is "cheap" to keep. Below Q4 it plummets: every bit less costs more and more intelligence. That's why Q3/Q2 are a last resort, not a default choice.
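To make "trading precision for space" concrete, here is a minimal sketch that rounds a block of random weights to a given bit width and measures the round-trip error. It assumes plain symmetric uniform quantization with one scale per block, which is simpler than the formats real quantizers use, and the random weights are a stand-in rather than real model data. It shows only how numerical error grows as bits are removed; numerical error is not the same thing as the model-quality curve described above.

```python
import math
import random


def quantize_roundtrip(weights, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1  # largest integer code: 127 for 8 bits, 1 for 2 bits
    scale = max(abs(w) for w in weights) / qmax  # one scale for the whole block
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code, clamped
        restored.append(q * scale)
    return restored


def rms_error(original, restored):
    """Root-mean-square difference between two equally long lists."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return math.sqrt(total / len(original))


random.seed(0)
# Assumption: Gaussian values as a stand-in for one block of weights.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

errors = {
    bits: rms_error(weights, quantize_roundtrip(weights, bits))
    for bits in (8, 6, 4, 3, 2)
}
for bits, err in errors.items():
    print(f"{bits} bits: RMS error {err:.4f}")
```

In this toy setup the grid spacing roughly doubles with every bit removed, so the rounding error grows geometrically rather than by a fixed amount per step.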

"A smaller model at higher precision can beat a larger model crushed into too few bits." (Ahmad Osman, "LLMs 101 (2026)")

02 · The named ladder, step by step

Each level has a character of its own. This is the practical read — paired with the cost in GB per billion parameters we derived in lesson 02.

Level	Character	~GB / 1B	When to choose
FP16 / BF16	baseline · full quality	~2.0	Reference / training. Rarely for local inference.
Q8 / INT8	near-lossless	~1.0	When it fits and you want the local quality ceiling.
Q6 / Q5	excellent · strong middle ground	~0.7	Balance when Q8 won't fit but you don't want to risk it.
Q4 ★	consumer sweet spot	~0.5	General default. Best quality × size in practice.
Q3 / Q2	last resort	~0.3	Only to fit a larger model that otherwise wouldn't fit.

The master rule: choosing between "a larger model in few bits" and "a smaller model in more bits" is not obvious, and the safe default runs against intuition. In general, a smaller model at more bits beats a larger model at few bits. The exception (section 05) requires a special recipe.
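The table's GB-per-billion figures turn into a quick fit check. The sketch below lists which combinations of model size and level fit a memory budget, counting weights only (no KV cache, no runtime overhead). The 24 GB budget and the 32B and 70B model sizes are hypothetical values chosen for illustration.

```python
# Approximate GB per billion parameters, copied from the table above.
GB_PER_BILLION = {"FP16": 2.0, "Q8": 1.0, "Q6/Q5": 0.7, "Q4": 0.5, "Q3/Q2": 0.3}


def weights_gb(params_billion, level):
    """Approximate weight footprint in GB for a model size and ladder level."""
    return params_billion * GB_PER_BILLION[level]


def options_within(budget_gb, model_sizes_billion):
    """List every (size, level, GB) combination whose weights fit the budget."""
    fits = []
    for size in model_sizes_billion:
        for level in GB_PER_BILLION:
            gb = weights_gb(size, level)
            if gb <= budget_gb:
                fits.append((size, level, gb))
    return fits


# Hypothetical: a 24 GB budget and two candidate model sizes.
for size, level, gb in options_within(24.0, [32, 70]):
    print(f"{size}B at {level}: ~{gb:.1f} GB")
```

With these numbers the 70B model only fits at Q3/Q2, while the 32B model still fits at Q4 or Q6/Q5. That is exactly the situation where the rule favors the smaller model at more bits.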

03 · What breaks first

When you tighten the quantization, the degradation doesn't show up everywhere at once. It attacks in order: the most fragile capabilities fall first. That order is your early warning: if your model starts botching arithmetic and violating the JSON schema, you've cut too many bits.

The order is your diagnostic panel. Fluent chat and generic prose resist a long time — that's why "it seems fine" fools you. The real failure shows up in math, reasoning, code, schema and tool-use, exactly what an agent needs.
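One way to use that order as a diagnostic is a small smoke test that probes the fragile capabilities directly instead of judging by chat fluency. The sketch below is only an illustration: `generate` is a hypothetical callable that takes a prompt and returns the model's text, `fake_model` is a stand-in so the example runs on its own, and the two probes are arbitrary examples rather than a benchmark.

```python
import json


def check_arithmetic(reply, expected):
    """True if the reply, stripped of whitespace, is exactly the expected integer."""
    try:
        return int(reply.strip()) == expected
    except ValueError:
        return False


def check_json_schema(reply, required):
    """True if the reply parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def smoke_test(generate):
    """Run fragile-capability probes against `generate(prompt) -> str`."""
    return {
        "arithmetic": check_arithmetic(
            generate("Reply with only the number: 17 * 23"), 391
        ),
        "json_schema": check_json_schema(
            generate('Reply with only JSON: {"tool": <string>, "args": <object>}'),
            {"tool": str, "args": dict},
        ),
    }


# Stand-in for a real model call; replace with your own inference function.
def fake_model(prompt):
    if "17 * 23" in prompt:
        return "391"
    return '{"tool": "search", "args": {"q": "x"}}'


print(smoke_test(fake_model))
```

Running the same probes against the same model at two quantization levels gives a cheap comparison: if the lower-bit build starts failing checks that the higher-bit build passes, that is the warning described above.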

"Q3/Q2: math, code, structured output and tool use degrade FIRST." (Ahmad Osman, "LLMs 101 (2026)")

04 · The KV cache is a separate quantization

Watch out for a trap: quantizing the weights and quantizing the KV cache are two different knobs. The KV cache (lesson 03) has its own ladder, and it's much shorter. Turning this knob without testing is how long-context quality silently "rots."

Don't confuse the knobs: you can run weights in Q4 and KV cache in FP16 — they're independent. For the KV, FP16 is the safe default and FP8/INT8 is the practical floor. Below 8 bits is heavy research (KIVI, KVQuant), not a toggle to flip without testing.
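The independence of the two knobs is easy to see in a back-of-the-envelope calculation. The sketch below assumes a standard attention layout in which every layer stores one key and one value vector per KV head per token; architectures that compress the cache differently will not match it. The model shape (32 layers, 8 KV heads, head dimension 128, 32,768 tokens of context, 8B parameters) is hypothetical.

```python
# Bytes per stored element for the KV cache precisions named in this section.
BYTES_PER_ELEMENT = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0}


def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, kv_dtype):
    """Keys and values (the factor 2) for every layer, KV head and token."""
    elements = 2 * layers * kv_heads * head_dim * context_tokens
    return elements * BYTES_PER_ELEMENT[kv_dtype] / 1e9


def weights_gb(params_billion, bits_per_weight):
    """Weight footprint in GB at a nominal bit width."""
    return params_billion * bits_per_weight / 8


# Hypothetical model shape, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128, context_tokens=32_768)

weights = weights_gb(8, 4)  # 8B parameters at a nominal 4 bits
for kv_dtype in ("FP16", "FP8"):
    kv = kv_cache_gb(**shape, kv_dtype=kv_dtype)
    print(f"weights Q4 {weights:.1f} GB + KV {kv_dtype} {kv:.2f} GB")
```

The weight term never changes when the KV precision does, and the KV term grows with context length regardless of how the weights were quantized.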

05 · The asymmetric recipe — the resolution

So far the conclusion seems disheartening: Q2 breaks tool-use, so large models in 2 bits are out. But there's an elegant way out, and it's the heart of this lesson. Instead of crushing everything to 2 bits, you crush only the part that can take it — and leave the sensitive parts intact. The diagram below is to scale: each block's area is proportional to that slice of the model.

The trick: the routed experts are most of the bytes, so compressing them yields nearly all the memory savings. The parts that orchestrate (shared, projections, routing) are small — keeping them in Q8 costs little space and saves the intelligence.

Naive vs. asymmetric, side by side

The difference between a Q2 that fails and a Q2 that works is where the 2 bits land. On the left, the naive way (everything at 2 bits) — exactly the scenario where Ahmad is right. On the right, the asymmetric way — where antirez's recipe wins.

The two take up almost the same space — because the routed experts (most of the bytes) are at 2 bits in both. The difference is surgical: the asymmetric one pays a few GB more to keep the deciding parts in Q8, and that's what saves tool-use.
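The arithmetic behind "almost the same space" can be sketched in a few lines. The numbers here are hypothetical (a 100B-parameter MoE with 95% of its weights in routed experts), and the bit widths are nominal: real quantization formats store per-block scales, so their effective bits per weight are somewhat higher.

```python
def mixed_quant_gb(total_params_billion, routed_fraction, routed_bits, other_bits):
    """Weight footprint when routed experts and the rest use different bit widths."""
    routed = total_params_billion * routed_fraction
    other = total_params_billion - routed
    return (routed * routed_bits + other * other_bits) / 8


# Hypothetical numbers: a 100B-parameter MoE with 95% of weights in routed experts.
total, routed_fraction = 100, 0.95

naive = mixed_quant_gb(total, routed_fraction, routed_bits=2, other_bits=2)
asymmetric = mixed_quant_gb(total, routed_fraction, routed_bits=2, other_bits=8)
all_q8 = mixed_quant_gb(total, routed_fraction, routed_bits=8, other_bits=8)

print(f"naive Q2:      {naive:.2f} GB")
print(f"asymmetric:    {asymmetric:.2f} GB")
print(f"everything Q8: {all_q8:.2f} GB")
```

Under these assumptions the asymmetric layout costs 28.75 GB against 25 GB for the naive one: a few GB extra to keep the small orchestrating parts at 8 bits, far from the 100 GB of quantizing everything to Q8.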

antirez · "no joke" — the DeepSeek-V4-Flash recipe

"The 2-bit quants are no joke: they behave well, they work under coding agents, they call tools reliably. Only the MoE routed experts are quantized (up/gate in IQ2_XXS, down in Q2_K); they are MOST of the model's space; shared experts, projections and routing stay UNTOUCHED (Q8) to guarantee the quality." (antirez, DeepSeek-V4-Flash README (DS4))

And there's a second ingredient: the model itself. DeepSeek-V4-Flash "holds up very well to 2-bit quantization." Asymmetric recipe + a Q2-resistant model = the rare case where 2 bits is good for production.

06 · The synthesis — who's right?

Both. It's not a contradiction, it's context. Ahmad speaks of the general case; antirez, of a case built on purpose to escape it.

The unified read: Ahmad's heuristic ("distrust few bits") remains your default; it holds for almost every model and every naive quant. antirez's case is the engineered exception: asymmetric quantization applied to a model that resists 2 bits. When the two combine, 2 bits becomes production-ready. Outside of that, raise the precision.

the concrete plan B

If in practice the q2 disappoints on hard math or code, the next step is q2-q4-imatrix (98 GB) — the last layers go up to q4 and recover the edge on the hard cases. It's the direct application of the ladder: you stepped up one notch of precision where it hurts. (The memory cost of that choice is the topic of lesson 08.)

1. As you tighten the quantization, which capability tends to degrade FIRST?

Correct: b. Fluent prose resists a long time (that's why "it seems fine" fools you); the real failure shows up first in math, reasoning, code, schema and tool-use. Speed and footprint are governed by size/bandwidth, not by quality degradation.

2. Why does antirez's asymmetric recipe let a Q2 call tools reliably?

Correct: c. Compressing the experts (huge and redundant) yields almost all the savings; keeping the small orchestrating parts in Q8 costs little space and saves the intelligence. Add to that a model that "holds up very well to 2 bits" and the result works under coding agents. The KV cache is a separate knob (section 04).
