The previous seven lessons derived the principles; this one turns them into a concrete machine. You have 128 GB of unified memory and two workloads — a coding/agent brain and occasional vision. This lesson shows exactly what to load, how much each piece takes, and the three ways to orchestrate both models without blowing past the RAM.
81 GB · DeepSeek-V4-Flash q2 (resident weights)
~20 GB · Qwen3-VL-30B-A3B (vision, on demand)
~18 GB · macOS + apps (floor to preserve)
10–20% · mandatory headroom (Ahmad's rule)
01 · The memory budget
Everything in this config is decided by one sum. DeepSeek q2 is 81 GB of weights that stay resident in GPU-addressable memory. The KV cache grows with context: at 100k tokens it sits at a few GB; at the 1M extreme it reaches ~26 GB (the compressed indexer alone is ~22 GB). Add the ~18 GB system floor and the 10–20% headroom, and 128 GB gets tight. The diagram below is to scale: every pixel represents the same number of bytes.
The top bar (brain only) leaves 21 GB free — comfortable. The bottom one (permanent co-residency) crosses the red 80% line: unstable. Hence the three modes below.
"Leave 10 to 20 percent headroom. Running at 99% of VRAM is begging for out-of-memory and fragmentation failures."
Ahmad Osman, "LLMs 101 (2026)"
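The sum can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `check_budget` is hypothetical, the sizes are the figures quoted in this lesson, and the 80% ceiling assumes the strict end of the 10–20% headroom rule. The first two cases leave the KV cache out, so real usage will be higher by whatever the chosen context length costs.

```python
TOTAL_GB = 128.0  # unified memory on the machine described in this lesson


def check_budget(parts_gb, total_gb=TOTAL_GB, headroom=0.20):
    """Sum the resident pieces and compare them against the headroom ceiling."""
    used = sum(parts_gb.values())
    ceiling = total_gb * (1.0 - headroom)  # 20% headroom leaves 80% usable
    return {
        "used_gb": used,
        "free_gb": total_gb - used,
        "ceiling_gb": ceiling,
        "within_headroom": used <= ceiling,
    }


# Figures from this lesson; KV cache is excluded unless listed explicitly.
brain_only = check_budget({"deepseek_q2": 81, "macos_apps": 18})
co_resident = check_budget({"deepseek_q2": 81, "qwen3_vl": 20, "macos_apps": 18})
long_context = check_budget({"deepseek_q2": 81, "kv_cache_1m": 26, "macos_apps": 18})

for name, result in [
    ("brain only", brain_only),
    ("co-resident", co_resident),
    ("brain at 1M context", long_context),
]:
    print(name, result)
```

Passing the parts as a dictionary keeps the check reusable: swapping in a different quantization or adding a KV-cache entry only changes the input, not the rule.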
02 · The three deploy modes
How do you run brain (81 GB) + vision (20 GB) without crossing the ceiling? Three arrangements, from simplest to no-compromise. The diagram compares what stays resident in each.
Mode | RAM peak | Vision latency | When to use
On-demand swap ★ | ≤ 101 GB | + DS4 reload (seconds) | Default. Sporadic vision, a single Mac.
Two Macs | 81 GB (M5) · 20 GB (M4) | instant | Frequent vision and you have the 2nd Mac.
Co-residency | ~125 GB ⚠ | instant | Rarely. Only with low ctx.
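The "When to use" column can be read as a small decision rule. The sketch below is one possible encoding, not a prescribed policy: `choose_mode` and its argument names are hypothetical, and treating co-residency as the fallback for frequent vision on a single Mac with low context is an assumption drawn from the table's wording.

```python
def choose_mode(has_second_mac, vision_is_frequent, low_context=False):
    """Pick a deploy mode following the table's 'When to use' column."""
    if vision_is_frequent and has_second_mac:
        # Brain on one Mac, vision on the other: instant vision, no swap.
        return "two-macs"
    if vision_is_frequent and low_context:
        # Assumption: co-residency is only considered when context stays small.
        return "co-residency"
    # Default: load vision on demand and accept the reload cost.
    return "on-demand-swap"


print(choose_mode(has_second_mac=False, vision_is_frequent=False))
print(choose_mode(has_second_mac=True, vision_is_frequent=True))
```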
03 · The 2-Mac pool over Thunderbolt 5
The 2nd Mac (M4 Max, 36 GB) doesn't speed up the brain — but it solves vision for good. Over Thunderbolt 5 (80 Gb/s, with RDMA on macOS 26.x), each Mac runs an independent endpoint; your app talks to both. It's Ahmad's reading ("use both as independent nodes") put into practice.
Important: distributing one model across both Macs (exo/mlx.distributed) helps capacity, not generation speed — and DeepSeek-Flash already fits on a single Mac. That's why the recommendation is independent nodes (brain on one, vision on the other), not sharding.
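With independent nodes, the only orchestration the app needs is a routing decision per request. The sketch below assumes both servers expose an OpenAI-compatible chat endpoint and that image inputs arrive as OpenAI-style content parts with `"type": "image_url"`. The host names, the brain's port, and the helper names are hypothetical placeholders.

```python
# Hypothetical addresses: replace with the real host and port of each Mac.
BRAIN_URL = "http://m5-max.local:8080/v1/chat/completions"
VISION_URL = "http://m4-max.local:8081/v1/chat/completions"


def has_image(messages):
    """Return True if any message carries an OpenAI-style image content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # multimodal content is a list of parts
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_endpoint(messages):
    """Send image requests to the vision node, everything else to the brain."""
    return VISION_URL if has_image(messages) else BRAIN_URL


text_only = [{"role": "user", "content": "Refactor this function."}]
with_image = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this screenshot show?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}]

print(pick_endpoint(text_only))
print(pick_endpoint(with_image))
```

Keeping the routing in the client means neither server needs to know the other exists, which matches the independent-nodes recommendation.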
04 · Real examples (measured in this session)
This isn't theory. Below are the exact commands and the real output captured on your M5 Max.
Serve the brain (DS4)
# build already done and proven on this Mac (make → 5 Metal binaries, exit 0)
./ds4-server --ctx 100000 --kv-disk-dir ~/.ds4/kv --kv-disk-space-mb 8192
# measured perf (official README, M5 Max class): gen 25–34 t/s · prefill 87–463 t/s
Serve vision (MLX) — running now on :8081
# real call to the OpenAI-compatible endpoint
curl -s 127.0.0.1:8081/v1/chat/completions -d '{"model":"...Qwen3-VL...","messages":[
{"role":"user","content":"Why does a low-active MoE run fast on a unified-memory Mac?"}]}'
# real response:
#   "An MoE with few active parameters runs quickly because only a small
#   part of the model is activated at each step... the Mac's unified memory enables
#   fast and efficient access to the data."
# timings: prompt 88 t/s · generation 62 t/s · peak 18.4 GB
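The same call can be made from application code. The sketch below uses only the Python standard library and assumes the server follows the OpenAI chat-completions response shape (`choices[0].message.content`). The helper names are hypothetical, and the model string is left as a parameter because the command above elides it.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(request, timeout=120):
    """Send the request and return the assistant's text (needs a live server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# Usage, with the vision server from this lesson listening on :8081:
#   req = build_chat_request("http://127.0.0.1:8081", "<your model id>", "Hello")
#   print(ask(req))
```

Building the request separately from sending it makes the payload easy to inspect without a running server.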
q2 (81 GB) vs q2-q4 (98 GB) — the choice
The q2-q4-imatrix build raises quality on hard math/code (last 6 layers in q4), but its extra 17 GB eats the headroom vision needs: 98 + 20 + 18 = 136 GB, which exceeds 128 GB. For a setup that balances speed, intelligence, and vision, q2-imatrix is the better fit. Pick q2-q4 only if you will NOT co-reside vision.
1. Why is permanent co-residency of DeepSeek + Qwen3-VL unstable on the M5 Max?
Correct: b. The sum brushes up against 128 GB and breaks the 10–20% headroom rule. That's why the default is on-demand with swap (peak ≤101 GB) or splitting vision onto the 2nd Mac.
2. What is the 2nd Mac (M4 Max 36 GB) for in this config?
Correct: c. Distributing one model across both gives capacity, not speed; the real win is using them as independent nodes (brain + vision).