Lesson 08 · Where it all comes together

Your M5 Max config

The previous seven lessons derived the principles; this one turns them into a concrete machine. You have 128 GB of unified memory and two workloads — a coding/agent brain and occasional vision. This lesson shows exactly what to load, how much each piece takes, and the three ways to orchestrate both models without blowing past the RAM.

81 GB · DeepSeek-V4-Flash q2 (resident weights)
~20 GB · Qwen3-VL-30B-A3B (vision, on demand)
~18 GB · macOS + apps (floor to preserve)
10–20% · mandatory headroom (Ahmad's rule)

01 · The memory budget

Everything in this config is decided by one sum. DeepSeek q2 is 81 GB of weights that stay resident in GPU-addressable memory. The KV cache grows with context: at 100k tokens it sits at about 8 GB; at the 1M extreme it reaches ~26 GB (the compressed indexer alone is ~22 GB). Add the ~18 GB system floor and the 10–20% headroom, and 128 GB gets tight. The diagram below is drawn to scale at 6.4 px per GB.

[Figure: Unified memory, 128 GB, drawn to scale (6.4 px/GB; axis 0 to 128 GB).
Bar 1, brain only (ctx 100k): DeepSeek q2 81 GB · KV 8 GB · macOS 18 GB · headroom 21 GB ✓
Bar 2, brain + vision co-resident: DeepSeek q2 81 GB · KV 8 GB · Qwen3-VL 20 GB · macOS 18 GB ⚠ crosses the 80% safe ceiling (102 GB) and blows the headroom]

The top bar (brain only) leaves 21 GB free — comfortable. The bottom one (permanent co-residency) crosses the red 80% line: unstable. Hence the three modes below.

"Leave 10 to 20 percent headroom. Running at 99% of VRAM is begging for out-of-memory and fragmentation failures."
Ahmad Osman, "LLMs 101 (2026)"
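The sum behind the two bars can be checked in a few lines. This is a minimal sketch, not a measurement tool: the component sizes are the figures quoted in this lesson, while the names (`budget`, `BRAIN_ONLY`, `CO_RESIDENT`) and the choice of 10% as the pass/fail threshold (the lower bound of the 10–20% rule) are assumptions made for illustration.

```python
from dataclasses import dataclass

TOTAL_GB = 128        # unified memory on the M5 Max (figure from the lesson)
MIN_HEADROOM = 0.10   # assumption: lower bound of the 10-20% headroom rule


@dataclass(frozen=True)
class Budget:
    used_gb: float
    free_gb: float
    headroom: float   # free memory as a fraction of the total
    ok: bool          # True when headroom meets the minimum


def budget(components: dict, total_gb: float = TOTAL_GB,
           min_headroom: float = MIN_HEADROOM) -> Budget:
    """Sum the resident components and compare what is left to the rule."""
    used = sum(components.values())
    free = total_gb - used
    frac = free / total_gb
    return Budget(used, free, frac, frac >= min_headroom)


# Sizes in GB, taken from the lesson's diagram (ctx 100k).
BRAIN_ONLY = {"deepseek_q2": 81, "kv_100k": 8, "macos": 18}
CO_RESIDENT = {**BRAIN_ONLY, "qwen3_vl": 20}

if __name__ == "__main__":
    for name, parts in (("brain only", BRAIN_ONLY), ("co-resident", CO_RESIDENT)):
        b = budget(parts)
        print(f"{name}: used {b.used_gb} GB, free {b.free_gb} GB "
              f"({b.headroom:.1%}), ok={b.ok}")
```

With these inputs the brain-only layout keeps 21 GB free (about 16% of 128 GB), while the co-resident layout keeps 1 GB free, well under any reading of the headroom rule.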

02 · The three deploy modes

How do you run brain (81 GB) + vision (20 GB) without crossing the ceiling? Three arrangements, from simplest to no-compromise. The diagram compares what stays resident in each.

[Figure: the three deploy modes.
★ On-demand with swap (recommended · 1 Mac): DeepSeek serving on :8000, always on. When an image arrives: pause DS4, run Qwen3-VL, then mmap-reload DS4 (seconds). Peak ≤ 101 GB · zero contention.
Two Macs (no-compromise · TB5): M5 Max 128 GB runs DeepSeek (81 GB · :8000); M4 Max 36 GB runs Qwen3-VL (20 GB · :8081). Vision always ready · zero swap.
Co-residency (risky): DeepSeek (ctx ≤ 64k) with Qwen3-VL co-resident. ⚠ Lives at the 128 GB edge; only if vision is very frequent and you accept occasional OOM.]

Mode | RAM peak | Vision latency | When to use
--- | --- | --- | ---
On-demand swap | ≤ 101 GB | + DS4 reload (seconds) | Default. Sporadic vision, a single Mac.
Two Macs | 81 GB (M5) · 20 GB (M4) | instant | Frequent vision and you have the 2nd Mac.
Co-residency | ~125 GB ⚠ | instant | Rarely. Only with low ctx.
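The "When to use" column reads as a small decision rule. The sketch below encodes it under stated assumptions: the function name `choose_mode`, its parameters, and the exact ordering of the checks are illustrative, and the 64k context limit is taken from the co-residency panel of the diagram.

```python
from enum import Enum


class Mode(Enum):
    ON_DEMAND_SWAP = "on-demand swap"
    TWO_MACS = "two Macs"
    CO_RESIDENCY = "co-residency"


def choose_mode(vision_frequent: bool, has_second_mac: bool,
                ctx_tokens: int = 100_000, accept_oom: bool = False) -> Mode:
    """Pick a deploy mode following the table's 'When to use' column."""
    # Frequent vision plus a second Mac: give vision its own node.
    if vision_frequent and has_second_mac:
        return Mode.TWO_MACS
    # Co-residency only with low context and an accepted OOM risk.
    if vision_frequent and accept_oom and ctx_tokens <= 64_000:
        return Mode.CO_RESIDENCY
    # Default: sporadic vision on a single Mac.
    return Mode.ON_DEMAND_SWAP
```

For example, `choose_mode(vision_frequent=False, has_second_mac=False)` returns `Mode.ON_DEMAND_SWAP`, the lesson's default.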

03 · The 2-Mac pool over Thunderbolt 5

The 2nd Mac (M4 Max, 36 GB) doesn't speed up the brain — but it solves vision for good. Over Thunderbolt 5 (80 Gb/s, with RDMA on macOS 26.x), each Mac runs an independent endpoint; your app talks to both. It's Ahmad's reading ("use both as independent nodes") put into practice.

[Figure: the 2-Mac pool.
MacBook Pro M5 Max (128 GB · 460–614 GB/s): DS4 · DeepSeek-V4-Flash q2 · 127.0.0.1:8000 · OpenAI+Anthropic
MacBook Pro M4 Max (36 GB · vision node): mlx_vlm.server · Qwen3-VL · <mac-ip>:8081 · OpenAI
Link: TB5 · 80 Gb/s · RDMA (macOS 26.x). Client: your app / Claude Code / Codex, talking to both.]

Important: distributing one model across both Macs (exo/mlx.distributed) helps capacity, not generation speed — and DeepSeek-Flash already fits on a single Mac. That's why the recommendation is independent nodes (brain on one, vision on the other), not sharding.
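On the app side, "independent nodes" comes down to choosing a URL per request. Here is a minimal sketch, assuming both servers accept OpenAI-style chat messages in which images appear as `image_url` content parts, and assuming the brain exposes the usual `/v1/chat/completions` path on :8000. The helper names are hypothetical, and `<mac-ip>` is left as the lesson's placeholder.

```python
# Endpoints from the lesson; <mac-ip> is a placeholder to fill in yourself.
BRAIN_URL = "http://127.0.0.1:8000/v1/chat/completions"   # path is an assumption
VISION_URL = "http://<mac-ip>:8081/v1/chat/completions"


def has_image(messages: list) -> bool:
    """True if any message carries an OpenAI-style image_url content part."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # multimodal messages use a list of parts
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_endpoint(messages: list) -> str:
    """Send image-bearing requests to the vision node, everything else to the brain."""
    return VISION_URL if has_image(messages) else BRAIN_URL
```

Text-only conversations never touch the vision node, so the brain's resident weights and KV cache are left undisturbed.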

04 · Real examples (measured in this session)

This isn't theory. Below are the exact commands and the real output captured on your M5 Max.

Serve the brain (DS4)

# build already done and proven on this Mac (make → 5 Metal binaries, exit 0)
./ds4-server --ctx 100000 --kv-disk-dir ~/.ds4/kv --kv-disk-space-mb 8192
# measured perf (official README, M5 Max class): gen 25–34 t/s · prefill 87–463 t/s

Serve vision (MLX) — running now on :8081

# real call to the OpenAI-compatible endpoint (JSON body, so send a JSON Content-Type)
curl -s 127.0.0.1:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"...Qwen3-VL...","messages":[
  {"role":"user","content":"Why does a low-active MoE run fast on a unified-memory Mac?"}]}'
# real response:
"An MoE with few active parameters runs quickly because only a small
 part of the model is activated at each step... the Mac's unified memory enables
 fast and efficient access to the data."
# timings: prompt 88 t/s · generation 62 t/s · peak 18.4 GB
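The same call can be made from Python with only the standard library. This is a sketch under assumptions: the model id stays elided exactly as in the lesson, the helper names `build_request` and `ask` are hypothetical, and the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

VISION_URL = "http://127.0.0.1:8081/v1/chat/completions"
MODEL = "...Qwen3-VL..."  # elided in the lesson; use the id your server reports


def build_request(prompt: str, url: str = VISION_URL,
                  model: str = MODEL) -> urllib.request.Request:
    """Build a POST request carrying a single-turn chat payload as JSON."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response schema.
    return body["choices"][0]["message"]["content"]
```

Keeping `build_request` separate from `ask` lets you inspect or log the payload without a server running.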

q2 (81 GB) vs q2-q4 (98 GB): the choice

The q2-q4-imatrix build raises quality on hard math/code (its last 6 layers are q4), but the extra 17 GB eats the headroom vision needs: 98 + 20 + 18 = 136 GB, which exceeds 128 GB before any KV cache is counted. For a setup that balances speed, intelligence, and vision, q2-imatrix is the better fit. Pick q2-q4 only if you will NOT co-reside vision.

1. Why is permanent co-residency of DeepSeek + Qwen3-VL unstable on the M5 Max?
Correct: b. The sum brushes up against 128 GB and breaks the 10–20% headroom rule. That's why the default is on-demand with swap (peak ≤101 GB) or splitting vision onto the 2nd Mac.
2. What is the 2nd Mac (M4 Max 36 GB) for in this config?
Correct: c. Distributing one model across both gives capacity, not speed; the real win is using them as independent nodes (brain + vision).