Lesson 04 · Quality

Quantization & quality

Lesson 02 showed that every bit less per weight shrinks the model in memory. But shrinking has a price: at some point the intelligence starts to crumble — and it doesn't crumble evenly. This lesson shows the quality ladder bit by bit, what breaks first when you squeeze too hard, and the asymmetric recipe that fits a huge model into 2 bits without losing the ability to use tools.

Q4: the consumer sweet spot (quality × size)
Q8: near-lossless
2 bits: only the MoE routed experts (most of the model)
Q8: what stays untouched (shared experts, projections, routing)

01 · The quality ladder

Quantizing is trading numerical precision for space. At the top, FP16/BF16 is the baseline — the full quality the model was trained at. Each step down cuts bits and, with them, a bit of fidelity. The secret is that the loss is not linear: the first steps cost almost nothing; the last ones cost the entire intelligence. The diagram below draws the real curve and marks the cliff.

[Figure: quality retained (%) against bits per weight, from FP16 through Q8, Q6, Q5, Q4, Q3 and Q2. The loss is not linear: Q8 is near-lossless, Q6/Q5 are a strong middle ground, Q4 is the sweet spot (★), and below it lies the cliff (⚠) where math, code and tool-use break.]

From FP16 to Q4 the curve barely dips — quality is "cheap" to keep. Below Q4 it plummets: every bit less costs more and more intelligence. That's why Q3/Q2 are a last resort, not a default choice.
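To make "trading numerical precision for space" concrete, here is a minimal sketch of symmetric round-to-nearest quantization applied to random weights. It measures raw numerical error only, not model quality, and it uses a single scale for the whole tensor, whereas real formats use per-block scales. The function names, the seed and the Gaussian weights are all assumptions made for illustration.

```python
import math
import random

def quantize_dequantize(weights, bits):
    """Symmetric absmax round-to-nearest quantization, then dequantization."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits, 1 for 2 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def rms_error(weights, bits):
    """Root-mean-square difference between original and restored weights."""
    restored = quantize_dequantize(weights, bits)
    squared = sum((w - r) ** 2 for w, r in zip(weights, restored))
    return math.sqrt(squared / len(weights))

random.seed(0)
# Hypothetical stand-in for a weight tensor: 4096 Gaussian samples.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

errors = {bits: rms_error(weights, bits) for bits in (8, 6, 5, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits} bits: RMS error {err:.4f}")
```

Each bit removed roughly halves the number of representable levels, so the rounding error grows multiplicatively rather than by a fixed amount per step. How that numerical error turns into lost capability depends on the model and the quantization scheme, which this sketch does not measure.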

A smaller model at higher precision can beat a larger model crushed into too few bits. (Ahmad Osman, "LLMs 101 (2026)")

02 · The named ladder, step by step

Each level has a character of its own. This is the practical read — paired with the cost in GB per billion parameters we derived in lesson 02.

[Figure: the ladder from best (top) to last resort (bottom), with GB per 1B parameters. Each step's width hints at the quality retained; the number on the right is the memory cost from lesson 02. Going down, the model gets smaller in memory and less intelligent.]
Rule: drop only as far as needed to fit, plus headroom. Never drop "for sport."
Level | Character | ~GB / 1B | When to choose
FP16 / BF16 | baseline · full quality | ~2.0 | Reference / training. Rarely for local inference.
Q8 / INT8 | near-lossless | ~1.0 | When it fits and you want the local quality ceiling.
Q6 / Q5 | excellent · strong middle ground | ~0.7 | Balance when Q8 won't fit but you don't want to risk it.
Q4 | consumer sweet spot | ~0.5 | General default. Best quality × size in practice.
Q3 / Q2 | last resort | ~0.3 | Only to fit a larger model that otherwise wouldn't fit.
The master rule: the choice between a larger model in few bits and a smaller model in more bits is not obvious, and the default beats intuition. In general, more bits in a smaller model wins over few bits in a large model. The exception (section 05) requires a special recipe.
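The "drop only as far as needed to fit, plus headroom" rule can be sketched as a walk down the ladder. The GB-per-billion figures are the approximate values from the table in this lesson; the function names and the headroom parameter (space left for the KV cache and runtime) are assumptions for illustration.

```python
# Approximate GB per 1B parameters, taken from the table in this lesson.
GB_PER_BILLION = {
    "FP16": 2.0,
    "Q8": 1.0,
    "Q6/Q5": 0.7,
    "Q4": 0.5,
    "Q3/Q2": 0.3,
}
LADDER = ["FP16", "Q8", "Q6/Q5", "Q4", "Q3/Q2"]  # best quality first

def weights_gb(params_billion, level):
    """Estimated memory for the weights alone, in GB."""
    return params_billion * GB_PER_BILLION[level]

def highest_level_that_fits(params_billion, memory_gb, headroom_gb):
    """Walk the ladder from the top; stop at the first level that fits."""
    budget = memory_gb - headroom_gb
    for level in LADDER:
        if weights_gb(params_billion, level) <= budget:
            return level
    return None  # does not fit even at the last resort

# Hypothetical machine: 24 GB of memory, 6 GB kept free as headroom.
for size in (8, 32, 70):
    print(f"{size}B -> {highest_level_that_fits(size, 24, 6)}")
```

With these assumed numbers, the 8B model fits at FP16, the 32B model lands on Q4, and the 70B model returns `None`. That last case is where the rule above points to a smaller model at more bits instead of squeezing harder.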

03 · What breaks first

When you squeeze the quantization, the degradation doesn't show up everywhere at once. It attacks in order — the most fragile capabilities fall first. Knowing that order is the alarm signal: if your model starts botching arithmetic and disobeying the JSON schema, you've squeezed too many bits.

[Figure: as quantization tightens, these capabilities fall in this order, most fragile first.]
1. Math: exact arithmetic is the canary, wrong first (✗ breaks early)
2. Multi-step reasoning: long chains lose the thread
3. Code correctness: syntax survives; logic and edge cases fail
4. JSON / schema adherence: missing fields, wrong types
5. Tool-use reliability: calls the wrong tool / malformed arguments
6. Long-context retrieval: forgets details way back at the start of the window
Last: fluent prose and casual chat (✓ resists), which is why "it seems fine" fools you

The order is your diagnostic panel. Fluent chat and generic prose resist a long time — that's why "it seems fine" fools you. The real failure shows up in math, reasoning, code, schema and tool-use, exactly what an agent needs.
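One way to turn that diagnostic panel into a routine check is a small canary suite that probes exact arithmetic and schema adherence. This is only a sketch: `generate` stands for whatever function calls your model and returns its text, and the prompts, the two checks and the `fake_model` stub are hypothetical.

```python
import json

def check_arithmetic(generate):
    """Fraction of exact-arithmetic prompts answered correctly."""
    cases = [
        ("What is 4817 * 23? Reply with the number only.", 4817 * 23),
        ("What is 90210 - 4567? Reply with the number only.", 90210 - 4567),
    ]
    passed = 0
    for prompt, expected in cases:
        reply = generate(prompt).strip().replace(",", "")
        passed += int(reply == str(expected))
    return passed / len(cases)

def check_json_schema(generate):
    """1.0 if the reply is valid JSON with the requested fields and types."""
    prompt = ('Reply with JSON only, with keys "name" (string) and '
              '"quantity" (integer), describing one shopping list item.')
    try:
        data = json.loads(generate(prompt))
    except json.JSONDecodeError:
        return 0.0
    quantity = data.get("quantity") if isinstance(data, dict) else None
    ok = (isinstance(data, dict)
          and isinstance(data.get("name"), str)
          and isinstance(quantity, int)
          and not isinstance(quantity, bool))  # bool is a subclass of int
    return 1.0 if ok else 0.0

def fake_model(prompt):
    """Stand-in for a real model: fine at JSON, wrong on one product."""
    if "4817" in prompt:
        return "110790"  # wrong on purpose: the true product is 110791
    if "90210" in prompt:
        return "85643"
    return '{"name": "apples", "quantity": 6}'

print("arithmetic:", check_arithmetic(fake_model))
print("json schema:", check_json_schema(fake_model))
```

Running the same suite against the same model at two quantization levels gives a before/after comparison that does not rely on how fluent the chat feels. A real suite would need many more cases per capability to be meaningful.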

Q3/Q2: math, code, structured output and tool use degrade FIRST. (Ahmad Osman, "LLMs 101 (2026)")

04 · The KV cache is a separate quantization

Watch out for a trap: quantizing the weights and quantizing the KV cache are two different knobs. The KV cache (lesson 03) has its own ladder — and it's much shorter. Fiddling here without knowing is where long context silently "rots."

[Figure: KV cache quantization, a separate and much shorter ladder, independent of the weights. More aggressive means more fragile.]
1. FP16: baseline, safe and lossless (✓ use by default). Cache footprint: 1×.
2. FP8 / INT8: practical floor (○ acceptable with care). Footprint ≈ ½×: the same window in half the RAM, or double the usable window.
3. Sub-8-bit: research territory such as KIVI and KVQuant (⚠ not a casual toggle). Footprint below ½×, but quality uncertain.
Don't confuse the knobs: you can run weights in Q4 and KV cache in FP16 — they're independent. For the KV, FP16 is the safe default and FP8/INT8 is the practical floor. Below 8 bits is heavy research (KIVI, KVQuant), not a toggle to flip without testing.
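The "half the RAM, or double the window" trade-off follows from the usual KV cache size estimate for standard attention: two tensors (K and V) per layer, each holding `kv_heads × head_dim` values per token. The sketch below applies that estimate to a hypothetical model with 32 layers, 8 KV heads and a head dimension of 128. It ignores the small overhead of quantization scales, and the function names are assumptions.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache one token occupies (the factor 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element

def max_tokens(budget_bytes, layers, kv_heads, head_dim, bytes_per_element):
    """How many tokens of context fit in a given cache budget."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element)
    return budget_bytes // per_token

# Hypothetical model shape and a 4 GiB budget for the cache.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 4 * 1024 ** 3

for label, width in (("FP16", 2), ("FP8/INT8", 1)):
    tokens = max_tokens(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, width)
    print(f"{label}: {tokens} tokens in 4 GiB")
```

Under these assumptions the cache costs 128 KiB per token at FP16, so 4 GiB holds 32,768 tokens; at one byte per element the same budget holds 65,536. Nothing in this calculation touches the weight quantization, which is the point of treating the two as separate knobs.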

05 · The asymmetric recipe — the resolution

So far the conclusion seems disheartening: Q2 breaks tool-use, so large models in 2 bits are out. But there's an elegant way out, and it's the heart of this lesson. Instead of crushing everything to 2 bits, you crush only the part that can take it — and leave the sensitive parts intact. The diagram below is to scale: each block's area is proportional to that slice of the model.

[Figure: asymmetric anatomy, to scale. What goes 2-bit vs. what stays in Q8, as a share of the model from 0% to 100%.]
MoE routed experts: MOST of the space, ≈78% of the bytes → 2 bits (up/gate in IQ2_XXS, down in Q2_K). Compress here: it yields almost all the GB saved.
Shared experts + projections + routing: ≈22% → UNTOUCHED, Q8. Preserve here: it guarantees the quality.
Why it works
• The experts are huge but redundant: they tolerate 2 bits without collapsing.
• Projections and routing are tiny but decide WHICH expert to call.
• Keeping them in Q8 costs relatively few bytes and keeps the decision right.

The trick: the routed experts are most of the bytes, so compressing them yields nearly all the memory savings. The parts that orchestrate (shared, projections, routing) are small — keeping them in Q8 costs little space and saves the intelligence.
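The byte split in the diagram (about 78% for the routed experts, about 22% for everything kept in Q8) can be converted into a parameter split and an average bit width. The sketch below uses nominal widths of 2 and 8 bits per weight as a simplifying assumption. The real formats carry some per-block overhead, so treat the results as rough, and the names used are illustrative.

```python
def parameter_shares(byte_shares, bits_per_weight):
    """Convert shares of bytes into shares of parameters.

    A part's parameter count is proportional to its bytes divided by
    the bits each of its weights occupies.
    """
    raw = {part: byte_shares[part] / bits_per_weight[part] for part in byte_shares}
    total = sum(raw.values())
    return {part: value / total for part, value in raw.items()}

# Byte shares from the lesson's diagram; bit widths are nominal assumptions.
byte_shares = {"routed_experts": 0.78, "kept_in_q8": 0.22}
nominal_bits = {"routed_experts": 2.0, "kept_in_q8": 8.0}

shares = parameter_shares(byte_shares, nominal_bits)
avg_bits = sum(shares[part] * nominal_bits[part] for part in shares)

for part, share in shares.items():
    print(f"{part}: {share:.1%} of parameters")
print(f"average: {avg_bits:.2f} bits per weight")
```

Under these assumptions roughly 93% of the parameters sit at 2 bits, and the average comes out near 2.4 bits per weight. The Q8 parts are a small fraction of the parameters even though they take about a fifth of the bytes.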

Naive vs. asymmetric, side by side

The difference between a Q2 that fails and a Q2 that works is where the 2 bits land. On the left, the naive way (everything at 2 bits) — exactly the scenario where Ahmad is right. On the right, the asymmetric way — where antirez's recipe wins.

[Figure: naive Q2 vs. asymmetric Q2, side by side.]
Naive Q2 (everything crushed to 2 bits): routed experts 2b, shared experts 2b, projections 2b, routing 2b. ✗ Tool-use fails; math / code / schema also fall; the WHICH-expert decision turns to noise.
Asymmetric Q2 (only the routed experts at 2 bits): routed experts (the majority) 2b, shared experts ✓ Q8, projections ✓ Q8, routing ✓ Q8. ✓ Reliable tool-use; ~same size, quality preserved; the decision stays intact in Q8.

The two take up almost the same space — because the routed experts (most of the bytes) are at 2 bits in both. The difference is surgical: the asymmetric one pays a few GB more to keep the deciding parts in Q8, and that's what saves tool-use.

antirez · "no joke" — the DeepSeek-V4-Flash recipe
The 2-bit quants are no joke: they behave well, they work under coding agents, they call tools reliably. Only the MoE routed experts are quantized (up/gate in IQ2_XXS, down in Q2_K); they are MOST of the model's space. Shared experts, projections and routing stay UNTOUCHED (Q8) to guarantee the quality. (antirez, DeepSeek-V4-Flash README (DS4))

And there's a second ingredient: the model itself. DeepSeek-V4-Flash "holds up very well to 2-bit quantization." Asymmetric recipe + a Q2-resistant model = the rare case where 2 bits is good for production.
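The asymmetric recipe boils down to a per-tensor policy: pick the quantization type from what the tensor is, not one type for the whole file. Here is a minimal sketch of such a policy. The tensor names are hypothetical and do not follow any particular file format's naming; only the three type labels come from the recipe quoted above.

```python
def quant_for_tensor(name):
    """Choose a quantization type for one tensor, following the recipe's split.

    Routed-expert up/gate matrices go to IQ2_XXS and down matrices to Q2_K;
    everything else (shared experts, projections, routing) stays in Q8.
    """
    if ".routed_expert." in name:
        if name.endswith((".up", ".gate")):
            return "IQ2_XXS"
        if name.endswith(".down"):
            return "Q2_K"
    return "Q8"

# Hypothetical tensor names, for illustration only.
tensor_names = [
    "layer3.routed_expert.17.up",
    "layer3.routed_expert.17.gate",
    "layer3.routed_expert.17.down",
    "layer3.shared_expert.up",
    "layer3.attention.q_proj",
    "layer3.router",
]

plan = {name: quant_for_tensor(name) for name in tensor_names}
for name, quant in plan.items():
    print(f"{name:32s} -> {quant}")
```

The default branch is deliberately the conservative one: any tensor the policy does not recognize as a routed expert stays in Q8, so a naming surprise costs space rather than quality.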

06 · The synthesis — who's right?

Both. It's not a contradiction, it's context. Ahmad speaks of the general case; antirez, of a case built on purpose to escape it.

[Figure: not a contradiction, but the general rule and the engineered exception.]
THE RULE (Ahmad, right in GENERAL): naive Q2 (everything at 2 bits) → math / code / tool-use break → default: prefer more bits. Condition: naive quant on any model.
THE EXCEPTION (antirez, right HERE): asymmetric Q2 + resistant model → sensitive parts in Q8 → tool-use preserved. Condition: surgical recipe + Q2-resistant model, both together → 2 bits in production.
Plan B: if q2 disappoints on hard math/code → step up to q2-q4-imatrix (98 GB).
The unified read: Ahmad's heuristic ("distrust few bits") remains your default; it holds for almost every model and every naive quant. antirez's case is the engineered exception: asymmetric quantization applied to a model that resists 2 bits. When the two combine, 2 bits becomes viable for production. Outside of that, raise the precision.
The concrete plan B

If in practice the q2 disappoints on hard math or code, the next step is q2-q4-imatrix (98 GB) — the last layers go up to q4 and recover the edge on the hard cases. It's the direct application of the ladder: you stepped up one notch of precision where it hurts. (The memory cost of that choice is the topic of lesson 08.)

1. As you tighten the quantization, which capability tends to degrade FIRST?
Correct: b. Fluent prose resists a long time (that's why "it seems fine" fools you); the real failure shows up first in math, reasoning, code, schema and tool-use. Speed and footprint are governed by size/bandwidth, not by quality degradation.
2. Why does antirez's asymmetric recipe let a Q2 call tools reliably?
Correct: c. Compressing the experts (huge and redundant) yields almost all the savings; keeping the small orchestrating parts in Q8 costs little space and saves the intelligence. Add to that a model that "holds up very well to 2 bits" and the result works under coding agents. The KV cache is a separate knob (section 04).