Course / Lesson 02  ·  Português →
Lesson 02 · Capacity

Memory math

The whole question "does this model fit on my machine?" boils down to a multiplication you can do in your head. This lesson installs Ahmad Osman's VRAM formula, the quantization ladder that falls out of it, and the reflex of doing the math before downloading 81 GB. By the end, you look at "284B" and know — in seconds — that only ~2 bits fit in your 128 GB.

P × bits/8
the entire formula, in GB
568 GB
DeepSeek-V4-Flash 284B in FP16
81 GB
the same model as GGUF q2 (on disk)
10–20%
mandatory headroom over RAM

01 · The formula that fits in your head

There is no mystery to a model's size. Each parameter is a number, and how many bytes it takes depends only on how many bits you use to store it. Sum every parameter and you have the memory of the weights. Ahmad reduces this to one line:

VRAM (in GB) ≈ Parameters (billions) × (effective bits ÷ 8).— Ahmad Osman, "GPU Memory Math (2026)"

The ÷ 8 converts bits to bytes; "billions of parameters" lines up with "gigabytes" with no extra factors. FP16 uses 16 bits → 16/8 = 2 bytes per parameter, so a model of N billion weighs 2N GB. That's it. The diagram treats the formula as a conveyor belt: parameters go in, precision multiplies, GB come out.

The formula as a belt — count goes in, memory comes out Parameters counted in billions (B) P × Precision effective bits ÷ 8 = bytes per parameter = Memory gigabytes of weights GB bytes/param: FP16 = 2.0 Q4_K ≈ 0.56 Q2_K ≈ 0.33 Worked example — DeepSeek-V4-Flash in FP16 284 billion params × 16 ÷ 8 = 2.0 bytes/param (FP16) = ≈ 568 GB weights only Swap 2.0 for ~0.25 (≈ 2 bits) and the same model becomes ~71→81 GB.

Precision is the only knob that moves the result. The parameter count is fixed; dropping from FP16 to ~2 bits is what sheds 568 GB and hands back something that fits.

Why ~71 and not exactly 81? The formula gives the theoretical floor (284 × 0.25 ≈ 71 GB for pure 2 bits). The real GGUF uses mixed per-layer schemes (k-quants) plus metadata, so the actual q2 of DeepSeek-V4-Flash lands at 81 GB on disk. The formula is the back-of-the-envelope estimate; the file is the truth.

02 · The quantization ladder

Quantizing means storing each parameter with fewer bits. Every rung below FP16 cuts memory — and bites off a little quality. This is the ladder in GB per billion parameters; multiply by your model's size to get the estimate. Note the k-quants (Q6_K…Q2_K) aren't round numbers: they include the format's real overhead.

PrecisionEffective bitsGB per 1B paramsCost
FP16162.00reference (top quality)
FP8 / INT881.00nearly imperceptible loss
Q6_K~6.6≈ 0.82minimal difference vs FP16
Q5_K~5.5≈ 0.69very good
Q4_K~4.5≈ 0.56the usual "sweet spot"
Q3_K~3.4≈ 0.43noticeable degradation
Q2_K~2.6≈ 0.33last resort to fit

Now the ladder applied to this machine's real case: the 284B DeepSeek-V4-Flash. The chart below is to scale within the 0–128 GB window; anything that overflows is clipped at the edge with an arrow and the true value. The vertical line marks the 128 GB of unified memory.

DeepSeek-V4-Flash 284B — weights by precision (scale 0–128 GB, 4.4 px/GB) 128 GB (RAM) 32 64 96 FP16 568 GB · 4.4× RAM FP8 284 GB · 2.2× RAM Q4_K 159 GB · over by 31 Q2 (formula) ≈ 94 GB ✓ fits Q2 GGUF (real) 81 GB on disk ✓ fits w/ headroom Only the ~2-bit rung enters the 128 GB. Everything else demands more machine.

The three red/orange bars run off the window: FP16, FP8 and even Q4 don't fit in 128 GB for a 284B model. Only ~2 bits stays left of the line — and the real q2 GGUF (81 GB) is even leaner than the formula's estimate (94 GB).

GGUF isn't magic… "fits in 6 GB" isn't a universal truth. It's a runtime-specific truth.— Ahmad Osman, "LLMs 101 (2026)"

03 · What fits in 128 GB

Pair the formula with your RAM and a decision map is born. Rows are model sizes; columns are precisions. Each cell answers "do these weights fit in 128 GB?" (leaving ~20% headroom, i.e. a practical target ≈ 102 GB). Green fits, red doesn't.

Fits in 128 GB? (✓ within headroom · ◑ tight · ✗ overflows) FP16 (2.0) Q8 (1.0) Q4 (0.56) Q2 (0.33) model ↓ 7B ✓ 14 GBlots to spare ✓ 7 GB ✓ 4 GB ✓ 2 GB 30B ✓ 60 GB ✓ 30 GB ✓ 17 GB ✓ 10 GB 70B ✗ 140 GB ✓ 70 GB ✓ 39 GB ✓ 23 GB 284B(DS4-Flash) ✗ 568 GB ✗ 284 GB ✗ 159 GB ✓ 81 GBthe only one that fits 671B ✗ 1342 GB ✗ 671 GB ✗ 376 GB ◑ 221 GBneeds more RAM

Read it on the diagonal: the bigger the model, the further right (the more aggressive the quantization) you must go just to enter the window. For 284B, the only green column is Q2 — and even that thanks to the real 81 GB GGUF, below the 94 estimate.

The practical read

The grid doesn't say "always run in Q2." It says what is physically possible. The leftmost green cell in each row is your best quality option that still fits — and Lesson 03 (memory bandwidth) will explain why, even when it fits, size still decides the speed.

04 · The headroom rule

"Fits" isn't the same as "runs well." The operating system, your apps, the context KV cache and memory fragmentation all need room. Filling RAM to the brim is the shortest path to an out-of-memory crash mid-generation.

Leave 10 to 20 percent of headroom. Running at 99% of VRAM is begging for out-of-memory and fragmentation failures.— Ahmad Osman, "GPU Memory Math (2026)"
Headroom gauge — 128 GB unified memory (6.25 px/GB) macOS ~18 brain weights safe line 80% · 102 GB danger zone OOM + fragmentation 0 32 64 96 128 GB DeepSeek q2 · 81 GB inside the safe zone ✓ ~21 GB for KV + system 99% = crash

The brain's 81 GB land left of the 80% line, leaving ~21 GB for the context KV cache and the system. Bumping up against 128 (the shaded region) is where fragmentation drops the session.

05 · Precision > size: the trade-off that matters

The wrong reflex is "bigger is always better." It isn't. Crushing a large model into very few bits can destroy more quality than you gain from the parameter count. A smaller model at decent precision often beats a larger one squeezed too hard.

A smaller model at higher precision can beat a larger model crushed into few bits — a 7B in Q6 can beat a 13B in Q2 at reasoning.— Ahmad Osman, "GPU Memory Math (2026)"
Similar memory, different quality — who wins at reasoning? 7B in Q6_K smaller model · high precision (~6.6 bits) 7B params memory ≈ 5.7 GB reasoning weights nearly intact ★ wins at reasoning fewer params, but preserved 13B in Q2_K larger model · crushed (~2.6 bits) 13B params memory ≈ 4.3 GB reasoning weights degraded loses despite +6B params 2 bits throw away the edge vs memory axis (same scale) → 5.7 GB 4.3 GB ∆ < 1.5 GB — memory tie

Both take ~5 GB — a memory tie. But the 13B lost so much precision per parameter that the well-preserved 7B reasons better. Size buys capacity only if the bits survive.

Applied to DS4: this is why picking q2 for DeepSeek-V4-Flash isn't an obvious "because it's big." 284B in q2 works because the model is a robust MoE and q2-imatrix preserves what matters — but the same logic says that if a smaller model in Q4/Q6 solves your task, it may win. Do the math on both sides.

06 · The file-safety rule

Quantization comes with a provenance catch. Formats like GGUF and safetensors store only tensors — data. PyTorch's old pickle-based .bin format can embed code that runs on load. Downloading weights is downloading data; it shouldn't run anything.

Avoid random .bin files from untrusted sources (pickle = code execution). Prefer GGUF/safetensors.— Ahmad Osman, "GPU Memory Math (2026)"
Download checklist

1) Prefer GGUF (runtimes like llama.cpp/MLX) or safetensors — both are tensors only, no execution. 2) Treat .bin/.pt/.pth from an unknown source as suspect. 3) Check the publisher and the hash when there is one. The formula tells you if it fits; provenance tells you if it's safe to load.

1. Applying the formula, how much do the weights of a 284B model in FP16 weigh — and why does only ~2 bits fit in 128 GB?
Correct: c. FP16 = 2 bytes/param → 568 GB. The ladder halves at each big jump (FP8 284, Q4 159), and only ~2 bits (94 GB estimated / 81 GB real on disk) enters the 128 GB. The formula's ~71 GB holds for theoretical pure 2 bits; the real GGUF is 81.
2. You need reasoning and have room for ~5 GB of weights. What does Ahmad's rule recommend?
Correct: b. Similar memory, different quality: 2 bits discard too much information per parameter. And never touch 99% of RAM — leave 10–20% headroom for the KV cache and fragmentation.