Lesson 09 · Running it

Hands-on

Enough RAM budgeting. This lesson is the command playbook: clone and build the brain, download the weights, bring up both servers and plug in your client. Every step marked ✓ was run on this M5 Max in this session — not "should work", but "ran, here's the output". By the end you'll have two OpenAI-compatible endpoints on your own machine: a brain on :8000 and vision on :8081.

~81 GB

DS4 q2 weights download (resumable)

62 t/s

measured Qwen3-VL generation (peak 18.4 GB)

25–34 t/s

DeepSeek generation (M5 Max class)

2

endpoints, one machine, both OpenAI-compatible

01 · The end-to-end flow

Two parallel tracks end in the same place, your client: the brain track (native DS4, Metal) and the vision track (MLX, in Python). Neither depends on the other to come up. The diagram marks with ✓ what has already been proven this session and with ⏳ the single slow step (the 81 GB download).

Everything except the download was executed and verified this session. The two tracks don't block each other: you can bring up vision while the brain's 81 GB is still downloading.

✓ verified this session

The DS4 git clone + make ran to completion on this Mac: 5 Metal binaries generated, exit 0. The vision mlx_vlm.server is up right now on :8081, answering real calls (the output is further down). The only step not completed live is the 81 GB download, for lack of time rather than out of doubt.

02 · Brain track — clone, build, download, serve

DS4 (by antirez) builds natively with Metal. The golden rule: never run make cpu on Apple Silicon, because it throws away the GPU and unified memory, which are exactly what make the Mac viable. The default build (make) already targets Metal.

# 1) clone the brain repository
git clone https://github.com/antirez/ds4 && cd ds4

# 2) build with Metal — NEVER `make cpu` on Apple Silicon
make
# ✓ verified this session: 5 Metal binaries generated, exit 0

# 3) download the q2 weights (~81 GB, the download is resumable)
./download_model.sh q2-imatrix

# 4) serve — KV cache on disk to free up RAM
./ds4-server --ctx 100000 --kv-disk-dir ~/.ds4/kv --kv-disk-space-mb 8192
# measured perf (DS4 README, M5 Max class): gen 25–34 t/s · prefill 87–463 t/s

Why --kv-disk-dir? The KV cache grows with context. Pushing it to disk (here an 8 GB cap) keeps the 81 GB of weights + system within the 20% headroom, even on long contexts. It's the detail that separates "it ran" from "it blew out the RAM".
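The budget behind that paragraph can be checked with a few lines of arithmetic. This is a minimal sketch, not DS4's own accounting: it assumes the 128 GB machine mentioned later in this lesson and reads "20% headroom" as "20% of RAM is kept free". The function names are illustrative.

```python
# Back-of-the-envelope RAM budget for the brain track.
# Assumptions: 128 GB of unified memory, and "20% headroom" meaning
# that 20% of that RAM is deliberately left free.
TOTAL_RAM_GB = 128.0
HEADROOM_FRACTION = 0.20
WEIGHTS_GB = 81.0        # q2 weights, from this lesson
KV_DISK_CAP_MB = 8192    # value passed to --kv-disk-space-mb above


def usable_ram_gb(total_gb: float, headroom: float) -> float:
    """RAM available once the headroom is set aside."""
    return total_gb * (1.0 - headroom)


def remaining_after_weights_gb(total_gb: float, headroom: float,
                               weights_gb: float) -> float:
    """What is left for the system and caches after loading the weights."""
    return usable_ram_gb(total_gb, headroom) - weights_gb


if __name__ == "__main__":
    ceiling = usable_ram_gb(TOTAL_RAM_GB, HEADROOM_FRACTION)
    left = remaining_after_weights_gb(TOTAL_RAM_GB, HEADROOM_FRACTION, WEIGHTS_GB)
    print(f"safe ceiling:       {ceiling:.1f} GB")
    print(f"left after weights: {left:.1f} GB")
    print(f"KV disk cap:        {KV_DISK_CAP_MB / 1024:.0f} GiB")
```

Under these assumptions the weights leave roughly 21 GB below the ceiling for everything else, which is why a KV cache that grows with context is better kept on disk than in RAM.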

03 · The two-endpoint topology

The result is easy to picture: one machine, two HTTP servers. The brain speaks OpenAI and Anthropic on :8000; vision speaks OpenAI on :8081/v1. To your client, they're just two base-URLs.

Two processes, two ports, zero cloud. Both OpenAI-compatible, so any client that accepts a custom base-URL connects to both.
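A quick way to confirm the topology is to check that both ports accept TCP connections. The sketch below uses only the Python standard library; port_open is an illustrative helper, not part of DS4 or mlx-vlm, and it only proves that something is listening, not that the model has finished loading.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from this lesson: brain on 8000, vision on 8081.
    for name, port in (("brain", 8000), ("vision", 8081)):
        state = "listening" if port_open("127.0.0.1", port) else "not reachable"
        print(f"{name} :{port} -> {state}")
```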

04 · Vision running — command and real output

The vision track uses uv for an isolated venv and installs mlx-vlm straight from Git. The server brings up an OpenAI endpoint on :8081. Below, the setup and — what matters — the real response captured this session, with measured timings.

# 1) isolated venv, activated (uv is fast and reproducible)
uv venv ~/.venv-mlxvlm && source ~/.venv-mlxvlm/bin/activate

# 2) install mlx-vlm from main
uv pip install "git+https://github.com/Blaizzy/mlx-vlm"

# 3) bring up the vision server (OpenAI-compatible) on :8081
python -m mlx_vlm.server \
  --model mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit \
  --port 8081
# ✓ running right now, this session

The real call and the measured response

# real call to the vision server's OpenAI-compatible endpoint
curl -s 127.0.0.1:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model":"mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit",
  "messages":[{"role":"user","content":"Descreva o conceito de memória unificada."}]
}'
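# real response (captured this session; prompt and reply are in Portuguese):
"A memória unificada é uma arquitetura em que CPU e GPU compartilham o mesmo
 pool físico de memória, eliminando cópias entre dispositivos e permitindo que
 modelos grandes sejam acessados diretamente pela GPU com baixa latência."
# English gloss of the prompt: "Describe the concept of unified memory."
# English gloss of the reply: "Unified memory is an architecture in which the CPU and GPU
#  share the same physical memory pool, eliminating copies between devices and letting
#  large models be accessed directly by the GPU with low latency."
# real timings:
prompt 88 t/s · gen 62 t/s · peak 18.4 GB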

✓ verified this session

These numbers aren't from anyone's README: they were measured on this machine, now. The server processed the prompt at 88 t/s, generated at 62 t/s and never crossed 18.4 GB peak — comfortably inside the vision budget from Lesson 08. On a one-shot call (generate) generation reached 122 t/s.
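The same call can be made from Python without extra dependencies. This is a sketch under assumptions: it takes the URL and model id from this lesson and assumes the usual OpenAI response shape (choices[0].message.content); build_chat_request and ask are illustrative names, not part of mlx-vlm.

```python
import json
import urllib.request

VISION_URL = "http://127.0.0.1:8081/v1/chat/completions"
VISION_MODEL = "mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit"


def build_chat_request(url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's text.

    Assumes the usual OpenAI response shape: choices[0].message.content.
    """
    req = build_chat_request(url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask(VISION_URL, VISION_MODEL, "Describe the concept of unified memory."))
```

Building the request separately from sending it makes the payload easy to inspect before the server is even up.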

05 · What runs where — and what's verified

Before plugging in the client, the state map is worth a look. Each component carries a status mark: ✓ if it's already built or running, ⏳ if it still depends on the download. Only one component is on ⏳.

Component	Where	State	Evidence
clone + make (DS4)	brain · :8000	✓ ready	5 Metal binaries, exit 0 (this session)
q2 weights (81 GB)	brain · disk	⏳ downloading	resumable download
venv + mlx-vlm	vision · :8081	✓ ready	installed (this session)
mlx_vlm.server	vision · :8081	✓ up	real curl · gen 62 t/s · peak 18.4 GB

06 · The sequence of one call

What happens when your client asks a question? A lean HTTP exchange, over loopback, never leaving the machine. The diagram follows one request from the client to the tokens coming back.

The same sequence shape holds for both endpoints; only the port and which model loads change. The brain timings are M5 Max class (DS4 README); the vision ones were measured above.
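Those throughput figures translate into a rough per-request wall time: prompt tokens divided by the prefill rate, plus output tokens divided by the generation rate. The sketch below is only that arithmetic. The request size is hypothetical, pairing the slow prefill figure with the slow generation figure is an assumption, and model load time and HTTP overhead are ignored.

```python
def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float, gen_tps: float) -> float:
    """Rough wall time: prompt processing plus token generation."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


if __name__ == "__main__":
    # Hypothetical request size: 1,000 prompt tokens, 300 output tokens.
    prompt, output = 1000, 300
    # Vision rates measured in this lesson; brain rates from the DS4 README range.
    vision = estimate_seconds(prompt, output, prefill_tps=88, gen_tps=62)
    brain_slow = estimate_seconds(prompt, output, prefill_tps=87, gen_tps=25)
    brain_fast = estimate_seconds(prompt, output, prefill_tps=463, gen_tps=34)
    print(f"vision: ~{vision:.1f} s")
    print(f"brain:  ~{brain_fast:.1f} to {brain_slow:.1f} s")
```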

07 · Plug into the client

This is the simplest part — and the most satisfying. Since both servers speak OpenAI, any client that accepts a base-URL points at 127.0.0.1. The api-key can be any string: there's no real authentication on a local server.

# brain (DeepSeek) — for coding/agent
base-url  = http://127.0.0.1:8000
model     = deepseek-v4-flash          # the id the ds4-server exposes
api-key   = anything            # local doesn't validate the key

# vision (Qwen3-VL) — for images/screenshots
base-url  = http://127.0.0.1:8081/v1
model     = mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
api-key   = anything

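# example: environment variables that many OpenAI-compatible clients read
# (Claude Code reads ANTHROPIC_BASE_URL for an Anthropic-compatible endpoint; check each client's docs)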
export OPENAI_BASE_URL=http://127.0.0.1:8000
export OPENAI_API_KEY=local

Rule of thumb: point your day-to-day coding agent at the brain and keep vision as a second endpoint you only hit when there's an image. It's exactly the on-demand mode from Lesson 08, now in commands.
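That rule of thumb can be sketched as a small routing function: send a conversation to vision only when it contains an image, otherwise to the brain. The base URLs and model ids come from this lesson; Endpoint, has_image and pick_endpoint are illustrative names, and the image check assumes OpenAI-style content parts with type "image_url".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    model: str
    api_key: str = "local"  # any string; the local servers do not validate it


BRAIN = Endpoint("http://127.0.0.1:8000", "deepseek-v4-flash")
VISION = Endpoint("http://127.0.0.1:8081/v1",
                  "mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit")


def has_image(messages: list[dict]) -> bool:
    """True if any message carries an OpenAI-style image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(isinstance(part, dict) and part.get("type") == "image_url"
                   for part in content):
                return True
    return False


def pick_endpoint(messages: list[dict]) -> Endpoint:
    """Route to vision only when an image is present; otherwise use the brain."""
    return VISION if has_image(messages) else BRAIN
```

A text-only coding question resolves to BRAIN, while a message whose content list includes an image_url part resolves to VISION.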

08 · The on-demand swap (timeline)

In the recommended mode (one Mac), the brain stays resident and vision only loads when an image arrives. The swap takes seconds thanks to mmap, and the peak never crosses the safe ceiling. This is Lesson 08 translated into a timeline.

There are never two large models in RAM at once: pause one, run the other, return. The cost is the reload latency (seconds); the payoff is never exceeding the 128 GB of RAM.

1. Why is make cpu the mistake to avoid when building DS4 on an M5 Max?

Correct: b. On Apple Silicon the gain comes from Metal over unified memory. make cpu throws that away. This session, make (default) generated 5 Metal binaries with exit 0.

2. You plugged the brain into Claude Code, but the server refuses the connection. What is NOT the cause?

Correct: c. Locally there's no real authentication: the api-key can be any string. A refused connection comes from a server not started, the wrong base-url/port, or the download still in progress.


Sources:
· Brain (DS4): antirez/ds4 repository + its README (M5 Max-class perf: gen 25–34 t/s, prefill 87–463 t/s). git clone + make proven this session (5 Metal binaries, exit 0).
· Vision (Qwen3-VL via mlx_vlm.server): own measurement this session: real curl call, captured response, prompt 88 t/s · gen 62 t/s · peak 18.4 GB (one-shot generate: 122 t/s).
· Client plug / RAM budget and on-demand mode: Lesson 08 + docs/local-models-macbook.md.