Lesson 03 · Speed

Bandwidth = speed

You don't feel the GPU's FLOPs when you chat with a local model — you feel memory bandwidth. How fast the model spits out a token depends on how quickly the hardware can read the weights from memory, not on how much raw math it does. This lesson shows why bandwidth decides whether the box feels alive or feels like it's decoding through wet cement, why a few-active MoE is the perfect match for a Mac, and exactly where your M5 Max lands on that map.

460–614

GB/s — MacBook Pro M5 Max bandwidth

3350

GB/s — H100 SXM (3.35 TB/s, the ceiling)

~7×

how much faster the H100 reads memory vs M5 Max

13B

active params/token of DeepSeek-V4-Flash (of 284B)

01 · Decode tracks bandwidth, not peak compute

There are two regimes in every generation. In prefill, the model reads your entire prompt at once — it's parallel, dense work that saturates the compute units (compute-bound). In decode, it generates one token at a time, and for each token it must sweep all the active weights out of memory. That's where bandwidth rules. Short prompt and long answer? Decode dominates, and the lever is bandwidth + batching. Long prompt and short answer? Prefill dominates, and the levers are attention kernels and chunked prefill.

Decode speed tracks memory bandwidth more than peak compute. Short prompt → long answer: decode dominates → bandwidth + batching. Long prompt → short answer: prefill dominates → attention kernels + chunked prefill.— Ahmad Osman, "Memory Bandwidth (2026)"

Same machine, two worlds. The typical coding agent (short prompt, long answer) lives on the left side — so for that use, bandwidth is destiny.

02 · The spectrum: alive or wet cement

Bandwidth isn't an abstract datasheet number — it's the sensory difference between a box that answers instantly and one that seems to be thinking underwater. Ahmad puts the discomfort floor at ~3 tokens/s: below that, you wait. The axis below is memory bandwidth; the farther right, the more "alive" the box feels during decode.

Bandwidth decides whether the box feels alive or like it's decoding through wet cement at 3 tokens per second.— Ahmad Osman, "Memory Bandwidth (2026)"

Positions to scale on a 0–3500 GB/s axis. The M5 Max shows up as a band (460–614) because bandwidth varies by SKU/binning; even the 460 floor is in comfortable, responsive territory.

03 · The bandwidth table, to scale

Here are the raw numbers that feed everything in this lesson, and the same data drawn as proportional bars. The scale is real: each bar pixel is the same amount of GB/s across every row (ceiling = H100, 3350 GB/s).

Hardware	Class	Memory bandwidth
H100 SXM	Datacenter	3350 GB/s (3.35 TB/s)
RTX 5090 / PRO 6000	Desktop GPU	1792 GB/s
Mac Studio M3 Ultra	Apple unified	819 GB/s
MacBook Pro M5 Max ★	Apple unified (laptop)	460–614 GB/s
Mac Studio M4 Max	Apple unified	546 GB/s
DGX Spark	Mini-AI	273 GB/s
Strix Halo / Ryzen AI Max	x86 APU	256 GB/s

Quick read: the H100 reads memory ~7× faster than the M5 Max floor and ~5.5× faster than its top. But the H100 has ~80 GB; the M5 Max has 128 GB of unified memory. It's exactly this trade-off — bandwidth vs an absurd amount of memory in a single body — that the next section settles.

04 · Where Apple wins (and where it loses)

Raw bandwidth isn't the only dimension. Apple ships an amount of memory that consumer GPUs come nowhere near, in a single silent body, with no sharding across cards. The cost is peak tokens/s and concurrency. Ahmad sums up the trade-off plainly:

The "when Apple wins" rule

Apple wins when: one box, silence, stupid amounts of memory, no sharding. It loses when raw tokens/sec & concurrency matter most.
— Ahmad Osman, "Inference Engines"

Translating to your decision: if you want one box, no server noise, that loads a giant model in one shot and serves you (not 200 concurrent users), Apple is unbeatable. If you need maximum aggregate throughput and many requests in parallel, an NVIDIA cluster wins. For a local dev running a coding agent, the first case is exactly yours.

Why this matters for bandwidth: with little concurrency (one user), you can't hide the memory-read latency behind big batches. Single-stream decode is a pure bandwidth test — which makes the M5 Max's 460–614 GB/s the most honest number in your daily experience.

05 · Why a few-active MoE matches the Mac

Here's the trick that makes it all fit together. Bandwidth limits how many weight bytes you read per token. So the obvious move is: what if the model only needed to read a small fraction of its weights per token? That's exactly what a few-active Mixture-of-Experts (MoE) does. DeepSeek-V4-Flash has 284B parameters in total, but activates only ~13B per token — only ~4.6% of the model is "lit up" at each step.

The 284B must fit in memory — and here the Mac's 128 GB shine. But only the ~13B active pay the bandwidth toll every token. That's why this MoE class measures 25–34 tok/s on this Mac, despite the bandwidth not being datacenter-grade.

It all comes together: the Mac's weakness is raw bandwidth; its strength is total memory. A few-active MoE flips the equation — it demands lots of memory to reside (✓ the Mac's strength) and reads little per token (✓ sidesteps the weakness). It's the exact match the next lesson deepens with quantization, and Lesson 08 turns into your concrete config.

The M5 Max doesn't win on raw bandwidth — it wins on the combination of enough-bandwidth + huge-memory in a silent laptop. To serve a single dev with a few-active MoE, it's the sweet spot.

1. You send a short prompt ("write a function") and the model generates a long answer. Which regime dominates, and what's the speed lever?

Correct: b. Short prompt → long answer means many token-by-token generation steps: decode dominates, and decode "tracks memory bandwidth more than peak compute". Long prompt → short answer is the opposite case, dominated by prefill (attention kernels + chunked prefill).

2. DeepSeek-V4-Flash has 284B total parameters but measures 25–34 tok/s on the M5 Max, whose bandwidth (460–614 GB/s) is far lower than an H100's. Why does it still run fast?

Correct: c. The 284B must reside in memory (and the Mac's 128 GB unified handles that), but only ~13B are "lit up" per token. Bandwidth limits bytes read per token; reading ~4.6% of the model per step is what keeps speed high despite the bandwidth not being datacenter-grade.

← Lesson 02 Lesson 04 →