The whole question "does this model fit on my machine?" boils down to a multiplication you can do in your head. This lesson installs Ahmad Osman's VRAM formula, the quantization ladder that falls out of it, and the reflex of doing the math before downloading 81 GB. By the end, you can look at "284B" and know within seconds that only a ~2-bit quantization fits in your 128 GB.
There is no mystery to a model's size. Each parameter is a number, and how many bytes it takes depends only on how many bits you use to store it. Sum every parameter and you have the memory of the weights. Ahmad reduces this to one line:
VRAM (in GB) ≈ Parameters (billions) × (effective bits ÷ 8).— Ahmad Osman, "GPU Memory Math (2026)"
The ÷ 8 converts bits to bytes, and because a billion bytes is a gigabyte, "billions of parameters" lines up with "gigabytes" with no extra factors. FP16 uses 16 bits → 16/8 = 2 bytes per parameter, so a model with N billion parameters weighs about 2N GB. That's it. The diagram treats the formula as a conveyor belt: parameters go in, precision multiplies, GB come out.
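As a minimal sketch, the formula can be written as a one-line function. The name `weights_gb` is hypothetical, chosen here for illustration; it estimates the weights only and ignores everything else that lives in memory.

```python
def weights_gb(params_billions: float, effective_bits: float) -> float:
    """Estimate weight memory in GB: billions of parameters x bytes per parameter."""
    bytes_per_param = effective_bits / 8  # the "÷ 8" step: bits -> bytes
    return params_billions * bytes_per_param


print(weights_gb(7, 16))    # a 7B model at FP16: 14.0
print(weights_gb(284, 16))  # a 284B model at FP16: 568.0
```

Passing a lower `effective_bits` value is the only way to shrink the result for a given model, which is the point the formula makes.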
Precision is the only knob that moves the result. The parameter count is fixed; for a 284B model, dropping from FP16 to ~2 bits is what shrinks 568 GB of weights to something that fits.
Quantizing means storing each parameter with fewer bits. Every rung below FP16 cuts memory and bites off a little quality. This is the ladder in GB per billion parameters; multiply by your model's size in billions to get the estimate. Note that the k-quants (Q6_K…Q2_K) aren't round numbers: their effective bits include the format's real overhead, such as the per-block scales stored alongside the weights.
| Precision | Effective bits | GB per 1B params | Cost |
|---|---|---|---|
| FP16 | 16 | 2.00 | reference (top quality) |
| FP8 / INT8 | 8 | 1.00 | nearly imperceptible loss |
| Q6_K | ~6.6 | ≈ 0.82 | minimal difference vs FP16 |
| Q5_K | ~5.5 | ≈ 0.69 | very good |
| Q4_K | ~4.5 | ≈ 0.56 | the usual "sweet spot" |
| Q3_K | ~3.4 | ≈ 0.43 | noticeable degradation |
| Q2_K | ~2.6 | ≈ 0.33 | last resort to fit |
Now the ladder applied to this machine's real case: the 284B DeepSeek-V4-Flash. The chart below is to scale within the 0–128 GB window; anything that overflows is clipped at the edge with an arrow and the true value. The vertical line marks the 128 GB of unified memory.
The three red/orange bars run off the window: FP16 (568 GB), FP8 (284 GB) and even Q4 (≈ 159 GB) don't fit in 128 GB for a 284B model. Only the ~2-bit bar stays left of the line, and the real q2 GGUF (81 GB) is even leaner than the formula's estimate (94 GB).
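The same bars can be reproduced numerically. This is a rough sketch that uses the GB-per-billion values from the ladder table for the four precisions the chart names; the function name `ladder_for` and the dictionary are illustrative, not part of any library.

```python
# GB per billion parameters, copied from the ladder table (chart bars only).
CHART_BARS = {
    "FP16": 2.00,
    "FP8": 1.00,
    "Q4_K": 0.56,
    "Q2_K": 0.33,
}


def ladder_for(params_billions, memory_gb):
    """Return (precision, estimated GB, fits?) for each bar of the chart."""
    rows = []
    for precision, gb_per_billion in CHART_BARS.items():
        estimate = params_billions * gb_per_billion
        rows.append((precision, estimate, estimate <= memory_gb))
    return rows


for precision, estimate, fits in ladder_for(284, 128):
    verdict = "fits" if fits else "overflows"
    print(f"{precision:<5} {estimate:6.0f} GB  {verdict}")
```

For 284B against a 128 GB window, only the `Q2_K` row comes back as fitting, at an estimate of about 94 GB. The formula gives an estimate, so the size of the actual file (81 GB here) remains the number to trust.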
GGUF isn't magic… "fits in 6 GB" isn't a universal truth. It's a runtime-specific truth.— Ahmad Osman, "LLMs 101 (2026)"
Pair the formula with your RAM and a decision map is born. Rows are model sizes; columns are precisions. Each cell answers "do these weights fit in 128 GB?" (leaving ~20% headroom, i.e. a practical target ≈ 102 GB). Green fits, red doesn't.
Read it on the diagonal: the bigger the model, the further right (the more aggressive the quantization) you must go just to enter the window. For 284B, the only green column is Q2: the formula's 94 GB estimate clears the 102 GB target, and the real 81 GB GGUF clears it with more room to spare.
The grid doesn't say "always run in Q2." It says what is physically possible. The leftmost green cell in each row is your best quality option that still fits — and Lesson 03 (memory bandwidth) will explain why, even when it fits, size still decides the speed.
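The decision map can be sketched in a few lines. The model sizes in the loop (7, 13, 70, 284) are assumptions picked for illustration, since the original grid's rows are not listed in the text; `fits_row` and `best_fit` are hypothetical helper names. The 128 GB total and 20% headroom come from the lesson.

```python
# GB per billion parameters, ordered from highest precision to lowest (ladder table).
GB_PER_BILLION = {
    "FP16": 2.00,
    "FP8 / INT8": 1.00,
    "Q6_K": 0.82,
    "Q5_K": 0.69,
    "Q4_K": 0.56,
    "Q3_K": 0.43,
    "Q2_K": 0.33,
}


def fits_row(params_billions, memory_gb=128, headroom=0.20):
    """One grid row: precision -> True (green) or False (red)."""
    target_gb = memory_gb * (1 - headroom)  # practical target, about 102 GB for 128 GB
    return {
        precision: params_billions * gb_per_billion <= target_gb
        for precision, gb_per_billion in GB_PER_BILLION.items()
    }


def best_fit(params_billions, memory_gb=128, headroom=0.20):
    """Leftmost green cell: the highest precision that still fits, or None."""
    for precision, fits in fits_row(params_billions, memory_gb, headroom).items():
        if fits:
            return precision
    return None


for size in (7, 13, 70, 284):  # illustrative model sizes, in billions
    print(f"{size}B -> {best_fit(size)}")
```

Under these assumptions the 284B row has a single green cell, `Q2_K`, while the smaller rows turn green much further left. A model too large for even the last rung returns `None`.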
"Fits" isn't the same as "runs well." The operating system, your apps, the context KV cache and memory fragmentation all need room. Filling RAM to the brim is the shortest path to an out-of-memory crash mid-generation.
Leave 10 to 20 percent of headroom. Running at 99% of VRAM is begging for out-of-memory and fragmentation failures.— Ahmad Osman, "GPU Memory Math (2026)"
The brain's 81 GB land left of the 80% line (≈ 102 GB), leaving ~21 GB below it for the context KV cache and the system. Bumping up against 128 (the shaded region) is where fragmentation drops the session.
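The headroom arithmetic is easy to check. The sketch below assumes a 20% reserve, the upper end of the quoted range; `headroom_report` is a hypothetical helper, and it budgets memory without measuring real KV cache usage.

```python
def headroom_report(total_gb, weights_gb, reserve_fraction=0.20):
    """Split total memory into the practical target, the room left under it, and the reserve."""
    target_gb = total_gb * (1 - reserve_fraction)
    return {
        "target_gb": target_gb,                         # the "80% line"
        "room_under_target_gb": target_gb - weights_gb,  # left for KV cache and friends
        "reserve_gb": total_gb - target_gb,              # the region to stay out of
        "within_target": weights_gb <= target_gb,
    }


report = headroom_report(total_gb=128, weights_gb=81)
print(f"target: {report['target_gb']:.1f} GB")                       # 102.4
print(f"room under target: {report['room_under_target_gb']:.1f} GB")  # 21.4
print(f"within target: {report['within_target']}")                    # True
```

Swapping in the formula's 94 GB estimate instead of the real 81 GB file shrinks the room under the target to under 9 GB, which shows how much the measured file size matters.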
The wrong reflex is "bigger is always better." It isn't. Crushing a large model into very few bits can destroy more quality than you gain from the parameter count. A smaller model at decent precision often beats a larger one squeezed too hard.
A smaller model at higher precision can beat a larger model crushed into few bits — a 7B in Q6 can beat a 13B in Q2 at reasoning.— Ahmad Osman, "GPU Memory Math (2026)"
Both take roughly 5 GB by the ladder (7 × 0.82 ≈ 5.7 GB versus 13 × 0.33 ≈ 4.3 GB), close enough to call a memory tie. But the 13B lost so much precision per parameter that the well-preserved 7B reasons better. Size buys capacity only if the bits survive.
Quantization comes with a provenance catch. Formats like GGUF and safetensors store only tensors — data. PyTorch's old pickle-based .bin format can embed code that runs on load. Downloading weights is downloading data; it shouldn't run anything.
Avoid random .bin files from untrusted sources (pickle = code execution). Prefer GGUF/safetensors.— Ahmad Osman, "GPU Memory Math (2026)"
1) Prefer GGUF (runtimes like llama.cpp/MLX) or safetensors; both are tensors only, no execution.
2) Treat .bin/.pt/.pth from an unknown source as suspect.
3) Check the publisher and the hash when there is one.
The formula tells you if it fits; provenance tells you if it's safe to load.
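The checklist above can be partly automated. The sketch below uses only Python's standard library. It triages a file by extension, which is a heuristic and not proof of what the file contains, and compares a SHA-256 digest against one the publisher provides. The function names and the example filename are hypothetical.

```python
import hashlib
from pathlib import Path

TENSOR_ONLY = {".gguf", ".safetensors"}  # formats the lesson recommends
PICKLE_BASED = {".bin", ".pt", ".pth"}   # may carry pickled code


def classify_weights_file(path):
    """Rough triage by file extension only; it does not inspect the contents."""
    suffix = Path(path).suffix.lower()
    if suffix in TENSOR_ONLY:
        return "tensor-only"
    if suffix in PICKLE_BASED:
        return "suspect"
    return "unknown"


def sha256_matches(path, expected_hex):
    """Hash the file in 1 MiB chunks and compare with the published digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


print(classify_weights_file("model-q2.gguf"))      # tensor-only
print(classify_weights_file("pytorch_model.bin"))  # suspect
```

Reading in chunks keeps memory use flat even for an 81 GB download. A matching hash only shows the file is the one the publisher listed, so the publisher still has to be someone you trust.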