Lesson 03 · Speed

Bandwidth = speed

You don't feel the GPU's FLOPs when you chat with a local model — you feel memory bandwidth. How fast the model spits out a token depends on how quickly the hardware can read the weights from memory, not on how much raw math it does. This lesson shows why bandwidth decides whether the box feels alive or feels like it's decoding through wet cement, why a few-active MoE is the perfect match for a Mac, and exactly where your M5 Max lands on that map.

460–614 GB/s · MacBook Pro M5 Max bandwidth
3350 GB/s · H100 SXM (3.35 TB/s, the ceiling)
~7× · how much faster the H100 reads memory vs M5 Max
13B · active params/token of DeepSeek-V4-Flash (of 284B)

01 · Decode tracks bandwidth, not peak compute

There are two regimes in every generation. In prefill, the model reads your entire prompt at once — it's parallel, dense work that saturates the compute units (compute-bound). In decode, it generates one token at a time, and for each token it must sweep all the active weights out of memory. That's where bandwidth rules. Short prompt and long answer? Decode dominates, and the lever is bandwidth + batching. Long prompt and short answer? Prefill dominates, and the levers are attention kernels and chunked prefill.

Decode speed tracks memory bandwidth more than peak compute. Short prompt → long answer: decode dominates → bandwidth + batching. Long prompt → short answer: prefill dominates → attention kernels + chunked prefill.
— Ahmad Osman, "Memory Bandwidth (2026)"
[Figure: Two generation regimes; which one dominates decides which lever matters. Left: short prompt → long answer (e.g. "write 800 lines of code"): each token re-reads all active weights from memory, decode dominates, and the lever is memory bandwidth + batching. Right: long prompt → short answer (e.g. "summarize this 100k-token repo"): the prompt is read in parallel with attention over all tokens at once (O(n²) work), prefill dominates, and the lever is attention kernels + chunked prefill.]

Same machine, two worlds. The typical coding agent (short prompt, long answer) lives on the left side — so for that use, bandwidth is destiny.
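To make the two regimes concrete, here is a minimal sketch that splits a request's time into a prefill part and a decode part. The throughput values are hypothetical placeholders, not measurements, and treating prefill as a constant tokens/s rate is a simplification (attention cost grows faster than linearly with prompt length).

```python
def generation_time_split(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_seconds, decode_seconds, dominant_regime).

    Simplified model: each phase runs at a constant tokens/s rate.
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    dominant = "decode" if decode_s >= prefill_s else "prefill"
    return prefill_s, decode_s, dominant


# ASSUMPTION: illustrative throughputs only, not benchmarks of any machine.
PREFILL_TPS = 500.0  # prompt tokens processed per second (hypothetical)
DECODE_TPS = 30.0    # generated tokens per second (hypothetical)

# Short prompt, long answer (typical coding-agent turn).
short_prompt = generation_time_split(50, 2000, PREFILL_TPS, DECODE_TPS)
# Long prompt, short answer (summarize a big repo).
long_prompt = generation_time_split(100_000, 200, PREFILL_TPS, DECODE_TPS)

print(short_prompt[2])  # decode
print(long_prompt[2])   # prefill
```

Swap in your own measured rates; the point is only that the ratio of output tokens to prompt tokens decides which phase you are waiting on.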

02 · The spectrum: alive or wet cement

Bandwidth isn't an abstract datasheet number — it's the sensory difference between a box that answers instantly and one that seems to be thinking underwater. Ahmad puts the discomfort floor at ~3 tokens/s: below that, you wait. The axis below is memory bandwidth; the farther right, the more "alive" the box feels during decode.

Bandwidth decides whether the box feels alive or like it's decoding through wet cement at 3 tokens per second.
— Ahmad Osman, "Memory Bandwidth (2026)"
[Figure: Decode "aliveness" spectrum on a memory-bandwidth axis (GB/s), from "wet cement" (~3 tok/s) on the left to "alive, answers instantly" on the right. Marked positions: Strix 256, Spark 273, M4 Max 546, M5 Max 460–614, M3 Ultra 819, RTX 5090 / PRO 6000 1792, H100 SXM 3350. Apple isn't the fastest on the axis, but every Mac class here sits well above the wet-cement floor.]

Positions to scale on a 0–3500 GB/s axis. The M5 Max shows up as a band (460–614) because bandwidth varies by SKU/binning; even the 460 floor is in comfortable, responsive territory.

03 · The bandwidth table, to scale

Here are the raw numbers that feed everything in this lesson, and the same data drawn as proportional bars. The scale is real: each bar pixel is the same amount of GB/s across every row (ceiling = H100, 3350 GB/s).

[Figure: Memory bandwidth by class, bars to scale (0.191 px per GB/s) on a 0–3350 GB/s axis, with the MacBook Pro M5 Max (★ your machine) drawn as a 460–614 band. The values are listed in the table that follows.]
Hardware | Class | Memory bandwidth
H100 SXM | Datacenter | 3350 GB/s (3.35 TB/s)
RTX 5090 / PRO 6000 | Desktop GPU | 1792 GB/s
Mac Studio M3 Ultra | Apple unified | 819 GB/s
MacBook Pro M5 Max | Apple unified (laptop) | 460–614 GB/s
Mac Studio M4 Max | Apple unified | 546 GB/s
DGX Spark | Mini-AI | 273 GB/s
Strix Halo / Ryzen AI Max | x86 APU | 256 GB/s
Quick read: the H100 reads memory ~7× faster than the M5 Max floor and ~5.5× faster than its top. But the H100 has ~80 GB; the M5 Max has 128 GB of unified memory. It's exactly this trade-off — bandwidth vs an absurd amount of memory in a single body — that the next section settles.
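The ratios in the quick read are plain division over the table. A small sketch, using only the figures listed in this lesson (the dictionary name and the "floor"/"top" labels are just illustrative):

```python
# Memory bandwidth in GB/s, copied from the table in this lesson.
BANDWIDTH_GBPS = {
    "H100 SXM": 3350,
    "RTX 5090 / PRO 6000": 1792,
    "Mac Studio M3 Ultra": 819,
    "MacBook Pro M5 Max (top)": 614,
    "Mac Studio M4 Max": 546,
    "MacBook Pro M5 Max (floor)": 460,
    "DGX Spark": 273,
    "Strix Halo / Ryzen AI Max": 256,
}


def speedup_vs(reference, other):
    """How many times faster `reference` reads memory than `other`."""
    return BANDWIDTH_GBPS[reference] / BANDWIDTH_GBPS[other]


h100 = BANDWIDTH_GBPS["H100 SXM"]
for name, gbps in BANDWIDTH_GBPS.items():
    # One row per machine: its bandwidth and the H100's advantage over it.
    print(f"{name:28s} {gbps:5d} GB/s   H100 is {h100 / gbps:.1f}x")
```

For the M5 Max this gives roughly 7.3× against the 460 GB/s floor and roughly 5.5× against the 614 GB/s top, which is where the "~7×" and "~5.5×" come from.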

04 · Where Apple wins (and where it loses)

Raw bandwidth isn't the only dimension. Apple ships an amount of memory that consumer GPUs come nowhere near, in a single silent body, with no sharding across cards. The cost is peak tokens/s and concurrency. Ahmad sums up the trade-off plainly:

The "when Apple wins" rule

Apple wins when: one box, silence, stupid amounts of memory, no sharding. It loses when raw tokens/sec & concurrency matter most.
— Ahmad Osman, "Inference Engines"

Translating to your decision: if you want one box, no server noise, that loads a giant model in one shot and serves you (not 200 concurrent users), Apple is unbeatable. If you need maximum aggregate throughput and many requests in parallel, an NVIDIA cluster wins. For a local dev running a coding agent, the first case is exactly yours.

Why this matters for bandwidth: with little concurrency (one user), there is no big batch to amortize each weight read across, so every generated token pays for the full read. Single-stream decode is a pure bandwidth test, which makes the M5 Max's 460–614 GB/s the most honest number in your daily experience.
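A rough way to see why batching helps servers but not a single user: in an idealized model, one sweep over the weights produces one token for every request in the batch. The sketch below assumes a dense model, ignores KV-cache traffic and compute limits, and uses a hypothetical 10 GB of weights read per step; with an MoE, different requests may light up different experts, so the real gain can be smaller.

```python
def decode_ceilings(bandwidth_gbps, weight_gb_per_step, batch_size):
    """Idealized bandwidth-bound ceilings: (per-stream tok/s, aggregate tok/s).

    One decode step reads the weights once and yields one token per request
    in the batch, so batching raises aggregate throughput but not the speed
    any single stream sees.
    """
    steps_per_second = bandwidth_gbps / weight_gb_per_step
    per_stream_tps = steps_per_second
    aggregate_tps = steps_per_second * batch_size
    return per_stream_tps, aggregate_tps


# ASSUMPTION: 10 GB read per step is a placeholder, not a figure from the lesson.
WEIGHT_GB_PER_STEP = 10.0

solo = decode_ceilings(460, WEIGHT_GB_PER_STEP, batch_size=1)
batched = decode_ceilings(460, WEIGHT_GB_PER_STEP, batch_size=8)

print(solo)     # (46.0, 46.0)
print(batched)  # (46.0, 368.0)
```

With one user the two numbers coincide: nothing is amortized, and what you feel is bandwidth divided by bytes read.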

05 · Why a few-active MoE matches the Mac

Here's the trick that makes it all fit together. Bandwidth limits how many weight bytes you read per token. So the obvious move is: what if the model only needed to read a small fraction of its weights per token? That's exactly what a few-active Mixture-of-Experts (MoE) does. DeepSeek-V4-Flash has 284B parameters in total, but activates only ~13B per token — only ~4.6% of the model is "lit up" at each step.

[Figure: Few-active MoE: large total, tiny fraction read per token. DeepSeek-V4-Flash has 284B total params, all resident in memory but not all read per token; the figure shows two experts lit and dozens asleep (0 bytes read this token), so only ~13B of 284B are read and computed. Per-token consequence: little bandwidth pressure and little compute, so the Mac's bandwidth keeps up (25–34 tok/s on M5 Max). The total fits in unified memory; only the active fraction pays bandwidth.]

The 284B must fit in memory, and here the Mac's 128 GB shines. But only the ~13B active parameters pay the bandwidth toll on every token. That's why this MoE class measures 25–34 tok/s on this Mac, even though its bandwidth isn't datacenter-grade.
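The arithmetic behind this can be sketched as a bandwidth-bound ceiling: tokens/s can't exceed bandwidth divided by the bytes read per token. The 284B, 13B and 460–614 GB/s figures are from this lesson; the 0.5 bytes per parameter (4-bit weights) is an assumption made only for illustration, and the result is an upper bound, not a prediction, since real decode also reads the KV cache and pays other overheads.

```python
def active_fraction(active_params_b, total_params_b):
    """Share of the model's parameters that are read for each token."""
    return active_params_b / total_params_b


def decode_ceiling_tps(bandwidth_gbps, params_read_b, bytes_per_param):
    """Upper bound on tokens/s if weight reads were the only cost."""
    # Billions of params * bytes per param = GB read per token.
    gb_read_per_token = params_read_b * bytes_per_param
    return bandwidth_gbps / gb_read_per_token


TOTAL_B, ACTIVE_B = 284, 13   # from the lesson (billions of parameters)
BYTES_PER_PARAM = 0.5         # ASSUMPTION: 4-bit quantized weights

print(f"active fraction: {active_fraction(ACTIVE_B, TOTAL_B):.1%}")
for bandwidth in (460, 614):  # M5 Max band, GB/s
    moe = decode_ceiling_tps(bandwidth, ACTIVE_B, BYTES_PER_PARAM)
    dense = decode_ceiling_tps(bandwidth, TOTAL_B, BYTES_PER_PARAM)
    print(f"{bandwidth} GB/s: ceiling {moe:.0f} tok/s reading ~13B, "
          f"{dense:.1f} tok/s if all 284B were read")
```

Under that assumed weight width, reading only the active ~13B leaves a ceiling comfortably above the measured 25–34 tok/s, while reading all 284B per token would cap out in the low single digits, right around the wet-cement floor.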

It all comes together: the Mac's weakness is raw bandwidth; its strength is total memory. A few-active MoE flips the equation — it demands lots of memory to reside (✓ the Mac's strength) and reads little per token (✓ sidesteps the weakness). It's the exact match the next lesson deepens with quantization, and Lesson 08 turns into your concrete config.

[Figure: Where the M5 Max lands, and why it's enough for this workload. Bandwidth axis from 0 to 3000+ GB/s with Strix 256, M4 Max 546, M3 Ultra 819, RTX 1792, H100 3350, and the M5 Max as a 460–614 GB/s band. Annotations: usable for a few-active MoE (reads ~13B/token, not 284B); measured 25–34 tok/s, well above the ~3 tok/s floor; not the longest bar on the axis, but 128 GB in one body, silent, no sharding = the right box for 1 dev.]

The M5 Max doesn't win on raw bandwidth — it wins on the combination of enough-bandwidth + huge-memory in a silent laptop. To serve a single dev with a few-active MoE, it's the sweet spot.

1. You send a short prompt ("write a function") and the model generates a long answer. Which regime dominates, and what's the speed lever?
Correct: b. Short prompt → long answer means many token-by-token generation steps: decode dominates, and decode "tracks memory bandwidth more than peak compute". Long prompt → short answer is the opposite case, dominated by prefill (attention kernels + chunked prefill).
2. DeepSeek-V4-Flash has 284B total parameters but measures 25–34 tok/s on the M5 Max, whose bandwidth (460–614 GB/s) is far lower than an H100's. Why does it still run fast?
Correct: c. The 284B must reside in memory (and the Mac's 128 GB unified handles that), but only ~13B are "lit up" per token. Bandwidth limits bytes read per token; reading ~4.6% of the model per step is what keeps speed high despite the bandwidth not being datacenter-grade.