Lesson 09 · Running it

Hands-on

Enough RAM budgeting. This lesson is the command playbook: clone and build the brain, download the weights, bring up both servers and plug in your client. Every step marked ✓ was run on this M5 Max in this session — not "should work", but "ran, here's the output". By the end you'll have two OpenAI-compatible endpoints on your own machine: a brain on :8000 and vision on :8081.

~81 GB: DS4 q2 weights download (resumable)
62 t/s: measured Qwen3-VL generation (peak 18.4 GB)
25–34 t/s: DeepSeek generation (M5 Max class)
2 endpoints: one machine, both OpenAI

01 · The end-to-end flow

Two parallel tracks end in the same place, your client: the brain track (native DS4, Metal) and the vision track (MLX, in Python). Neither depends on the other to come up. The diagram marks with ✓ what has already been proven this session and with ⏳ the single slow step (the 81 GB download).

[Diagram: end-to-end flow. Brain track, DS4 (Metal): git clone antirez/ds4 → make (5 Metal binaries, ✓ exit 0) → download_model q2 · 81 GB (⏳ resumable) → ds4-server on 127.0.0.1:8000 (brain · OpenAI + Anthropic). Vision track, MLX (Python): uv venv ~/.venv-mlxvlm → uv pip install mlx-vlm → mlx_vlm.server on 127.0.0.1:8081 (✓ running). Both tracks end at your client (Claude Code · Codex · opencode): base-url :8000 → brain, base-url :8081/v1 → vision. Legend: ✓ proven this session · ⏳ only slow step (network) · independent tracks that come up in any order.]

Everything except the download, and the ds4-server start that waits on it, was executed and verified this session. The two tracks don't block each other: you can bring up vision while the brain's 81 GB is still downloading.

✓ verified this session

The DS4 git clone + make ran to completion on this Mac: 5 Metal binaries generated, exit 0. The vision mlx_vlm.server is up right now on :8081, answering real calls (the output is further down). The only step not done live is the 81 GB download, and with it the ds4-server start that needs the weights: a matter of time, not of doubt.

02 · Brain track — clone, build, download, serve

DS4 (by antirez) builds natively with Metal. The golden rule: never run make cpu on Apple Silicon. It throws away the GPU and unified memory, exactly what makes the Mac viable. The default build (make) already targets Metal.

# 1) clone the brain repository
git clone https://github.com/antirez/ds4 && cd ds4

# 2) build with Metal — NEVER `make cpu` on Apple Silicon
make
# ✓ proven this session: 5 Metal binaries generated, exit 0

# 3) download the q2 weights (~81 GB, the download is resumable)
./download_model.sh q2-imatrix

# 4) serve — KV cache on disk to free up RAM
./ds4-server --ctx 100000 --kv-disk-dir ~/.ds4/kv --kv-disk-space-mb 8192
# measured perf (DS4 README, M5 Max class): gen 25–34 t/s · prefill 87–463 t/s
Why --kv-disk-dir? The KV cache grows with context. Pushing it to disk (here an 8 GB cap) keeps the 81 GB of weights + system within the 20% headroom, even on long contexts. It's the detail that separates "it ran" from "it blew out the RAM".
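The headroom argument is plain arithmetic, so a small sketch may make it concrete. It uses the figures from this course (128 GB of unified memory, 81 GB of q2 weights) and reads "20% headroom" as "keep resident memory under 80% of the total". That reading and the function names are illustrative assumptions, not something DS4 defines.

```python
# Sketch: how much RAM is left once the brain's weights are resident.
# Assumption: "20% headroom" means staying under 80% of total unified memory.

TOTAL_RAM_GB = 128.0       # unified memory of the machine in this course
HEADROOM_FRACTION = 0.20   # share of RAM kept free
WEIGHTS_GB = 81.0          # DS4 q2 weights


def safe_ceiling_gb(total_gb: float = TOTAL_RAM_GB,
                    headroom: float = HEADROOM_FRACTION) -> float:
    """Largest resident footprint that still respects the headroom."""
    return total_gb * (1.0 - headroom)


def ram_left_gb(resident_gb: float) -> float:
    """RAM still available under the ceiling for the system and the KV cache."""
    return safe_ceiling_gb() - resident_gb


if __name__ == "__main__":
    print(f"ceiling: {safe_ceiling_gb():.1f} GB")
    print(f"left with weights resident: {ram_left_gb(WEIGHTS_GB):.1f} GB")
```

Under these assumptions the ceiling is 102.4 GB and roughly 21.4 GB remain once the weights are loaded. That remainder has to cover the system as well, which is why a KV cache that grows with context is better off capped on disk.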

03 · The two-endpoint topology

The result is easy to picture: one machine, two HTTP servers. The brain speaks OpenAI and Anthropic on :8000; vision speaks OpenAI on :8081/v1. To your client, they're just two base-URLs.

[Diagram: one MacBook Pro M5 Max · 128 GB unified memory. Brain: DS4 · DeepSeek-V4-Flash q2 · 127.0.0.1:8000 · OpenAI + Anthropic · 81 GB (Metal). Vision: mlx_vlm.server · Qwen3-VL-30B-A3B · 127.0.0.1:8081/v1 · OpenAI · ~20 GB (MLX 4-bit). The client talks to both over loopback: just swap the base-url and the model.]

Two processes, two ports, zero cloud. Both OpenAI-compatible, so any client that accepts a custom base-URL connects to both.
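To show what "just two base-URLs" means for a client, here is a minimal sketch that builds (but does not send) a chat request for either server using only the Python standard library. The helper names are hypothetical. It also assumes the brain serves the OpenAI route under /v1/chat/completions, as the sequence diagram in section 06 indicates, so a base-url without the /v1 suffix is normalized before the path is appended.

```python
import json
import urllib.request

# Base URLs and model ids as listed in this lesson.
ENDPOINTS = {
    "brain": {
        "base_url": "http://127.0.0.1:8000",
        "model": "deepseek-v4-flash",
    },
    "vision": {
        "base_url": "http://127.0.0.1:8081/v1",
        "model": "mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit",
    },
}


def chat_url(base_url: str) -> str:
    """Return the chat-completions URL, adding /v1 when the base URL lacks it.

    Assumption: both servers expose POST /v1/chat/completions.
    """
    base = base_url.rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"
    return base + "/chat/completions"


def build_chat_request(target: str, prompt: str,
                       api_key: str = "local") -> urllib.request.Request:
    """Build an OpenAI-style chat request for 'brain' or 'vision' (not sent)."""
    cfg = ENDPOINTS[target]
    body = {
        "model": cfg["model"],
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        chat_url(cfg["base_url"]),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Any string works here: the local servers do not validate the key.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would perform the call once the matching server is up. Keeping construction separate from sending makes the only real difference between the two targets visible: the URL and the model id.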

04 · Vision running — command and real output

The vision track uses uv for an isolated venv and installs mlx-vlm straight from Git. The server brings up an OpenAI endpoint on :8081. Below, the setup and — what matters — the real response captured this session, with measured timings.

# 1) isolated venv, activated (uv is fast and reproducible)
uv venv ~/.venv-mlxvlm && source ~/.venv-mlxvlm/bin/activate

# 2) install mlx-vlm from main
uv pip install "git+https://github.com/Blaizzy/mlx-vlm"

# 3) bring up the vision server (OpenAI-compatible) on :8081
python -m mlx_vlm.server \
  --model mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit \
  --port 8081
# ✓ running right now, this session

The real call and the measured response

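# real call to the vision server's OpenAI-compatible endpoint
# (the prompt is in Portuguese: "Describe the concept of unified memory.")
# the explicit JSON header is a safeguard: curl -d alone sends form-encoded data
curl -s 127.0.0.1:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
  "model":"mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit",
  "messages":[{"role":"user","content":"Descreva o conceito de memória unificada."}]
}'
# real response (measured this session), verbatim:
"A memória unificada é uma arquitetura em que CPU e GPU compartilham o mesmo
 pool físico de memória, eliminando cópias entre dispositivos e permitindo que
 modelos grandes sejam acessados diretamente pela GPU com baixa latência."
# in English: "Unified memory is an architecture in which the CPU and GPU share the
#  same physical memory pool, eliminating copies between devices and letting large
#  models be accessed directly by the GPU with low latency."
# real timings:
prompt 88 t/s · gen 62 t/s · peak 18.4 GB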
✓ verified this session

These numbers aren't from anyone's README: they were measured on this machine, now. The server processed the prompt at 88 t/s, generated at 62 t/s and never crossed an 18.4 GB peak, comfortably inside the vision budget from Lesson 08. On a one-shot generate call, generation reached 122 t/s.
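To turn those throughput figures into a feel for wall-clock time, a rough estimate can divide token counts by the measured rates. The sketch below assumes throughput stays constant over the whole call and ignores model loading and HTTP overhead, so treat the result as a lower bound. The function name and the token counts in the example are hypothetical.

```python
def estimated_seconds(prompt_tokens: int, output_tokens: int,
                      prefill_tps: float = 88.0, gen_tps: float = 62.0) -> float:
    """Rough call duration: prefill time plus generation time.

    Defaults are the vision-server rates measured in this lesson.
    Assumption: both rates are constant for the whole request.
    """
    if prefill_tps <= 0 or gen_tps <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


if __name__ == "__main__":
    # Hypothetical request: 880 prompt tokens in, 620 tokens out.
    print(f"{estimated_seconds(880, 620):.1f} s")
```

With the hypothetical 880-token prompt and 620-token answer, each phase takes about 10 seconds, for roughly 20 seconds in total.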

05 · What runs where — and what's verified

Before plugging in the client, the state map is worth a look. Each piece of the config carries a seal: ✓ if it's already built/running, ⏳ if it still depends on the download. Only one piece is on ⏳.

[Diagram: config state (✓ ready · ⏳ downloading). Brain (DS4 · Metal): clone of antirez/ds4; make, 5 Metal binaries (✓ exit 0); q2 weights · 81 GB (⏳ resumable); ds4-server :8000 (ready after ⏳). Vision (MLX · Python): uv venv ~/.venv-mlxvlm; uv pip install mlx-vlm; mlx_vlm.server :8081 (✓ up); curl → real response (✓ 62 t/s · 18.4 GB).]

Vision is 100% verified and serving. The brain is built; only the bytes still need to download.

Component | Where | State | Evidence
clone + make (DS4) | brain · :8000 | ✓ ready | 5 Metal binaries, exit 0 (this session)
q2 weights (81 GB) | brain · disk | ⏳ downloading | resumable download
venv + mlx-vlm | vision · :8081 | ✓ ready | installed (this session)
mlx_vlm.server | vision · :8081 | ✓ up | real curl · gen 62 t/s · peak 18.4 GB
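The "up" column of that state map can be re-checked at any time by probing the two ports. The following is a minimal sketch using a plain TCP connect on loopback. It only tells you that something is listening, not that the model behind it has finished loading. The function names are hypothetical, and the probe is injectable so the mapping logic can be exercised without any server running.

```python
import socket
from typing import Callable, Dict

# Ports used in this lesson.
SERVERS = {"brain": 8000, "vision": 8081}


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def server_states(probe: Callable[[str, int], bool] = port_is_open,
                  host: str = "127.0.0.1") -> Dict[str, str]:
    """Map each server name to 'up' or 'down' using the given probe."""
    return {name: ("up" if probe(host, port) else "down")
            for name, port in SERVERS.items()}


if __name__ == "__main__":
    for name, state in server_states().items():
        print(f"{name}: {state}")
```

While the 81 GB download is still running you would expect the brain to report down and vision to report up, which matches the table above.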

06 · The sequence of one call

What happens when your client asks a question? A lean HTTP exchange, over loopback, never leaving the machine. The diagram follows one request from the client to the tokens coming back.

[Diagram: sequence of one call between the client (Claude Code / Codex) and an endpoint (:8000 brain · :8081 vision). Step 1: the client sends POST /v1/chat/completions (messages, model, api-key). The server then loads the KV / context, runs prefill (87–463 t/s) and generates tokens (25–34 t/s), with only the active MoE experts kicking in. Step 2: streamed tokens (SSE) come back word by word. Step 3: [DONE], the connection closes. All on 127.0.0.1: no byte leaves the machine.]

The same sequence shape holds for both endpoints; only the port and which model loads change. The brain timings are M5 Max class (DS4 README); the vision ones were measured above.
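On the client side, the streamed part of that exchange comes down to reading "data:" lines until the [DONE] marker. The sketch below assumes the servers emit chunks in the usual OpenAI streaming shape, with the text under choices[].delta.content. That shape is an assumption about these servers, and the sample lines in the usage example are made up for illustration.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Join the text deltas of an SSE chat stream, stopping at [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker: ignore anything after it
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            text = delta.get("content")
            if text:
                parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"delta":{"role":"assistant"}}]}',
        'data: {"choices":[{"delta":{"content":"Unified "}}]}',
        "",
        'data: {"choices":[{"delta":{"content":"memory"}}]}',
        "data: [DONE]",
    ]
    print(collect_stream_text(sample))
```

The function takes any iterable of lines, so it works the same on a captured log or on lines read from a live HTTP response.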

07 · Plug into the client

This is the simplest part — and the most satisfying. Since both servers speak OpenAI, any client that accepts a base-URL points at 127.0.0.1. The api-key can be any string: there's no real authentication on a local server.

# brain (DeepSeek) — for coding/agent
base-url  = http://127.0.0.1:8000
model     = deepseek-v4-flash          # the id the ds4-server exposes
api-key   = anything            # local doesn't validate the key

# vision (Qwen3-VL) — for images/screenshots
base-url  = http://127.0.0.1:8081/v1
model     = mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
api-key   = anything

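# example: the OpenAI-style environment variables that many clients read
# (Claude Code talks to the Anthropic side and reads ANTHROPIC_BASE_URL instead;
#  check your client's docs for the exact names and whether it expects a /v1 suffix)
export OPENAI_BASE_URL=http://127.0.0.1:8000
export OPENAI_API_KEY=local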
Rule of thumb: point your day-to-day coding agent at the brain and keep vision as a second endpoint you only hit when there's an image. It's exactly the on-demand mode from Lesson 08, now in commands.
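If you script your own client, that rule of thumb fits in a few lines: send the request to vision only when a message carries an image, otherwise to the brain. The sketch assumes messages follow the OpenAI content-parts convention, where an image appears as a part with type "image_url". The helper names are hypothetical.

```python
from typing import Any, Dict, List

# Base URLs and model ids as listed in this lesson.
ROUTES = {
    "brain": ("http://127.0.0.1:8000", "deepseek-v4-flash"),
    "vision": ("http://127.0.0.1:8081/v1",
               "mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit"),
}


def has_image(messages: List[Dict[str, Any]]) -> bool:
    """True if any message contains an OpenAI-style image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # multi-part content
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_route(messages: List[Dict[str, Any]]) -> Dict[str, str]:
    """Choose vision for image requests and the brain for everything else."""
    name = "vision" if has_image(messages) else "brain"
    base_url, model = ROUTES[name]
    return {"name": name, "base_url": base_url, "model": model}
```

A text-only conversation therefore never touches :8081, which keeps the vision model out of the picture until an image actually arrives.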

08 · The on-demand swap (timeline)

In the recommended mode (one Mac), the brain stays resident and vision only loads when an image arrives. The swap takes seconds thanks to mmap, and the peak never crosses the safe ceiling. This is Lesson 08 translated into a timeline.

[Diagram: on-demand swap timeline, t0 → t3. DeepSeek serving (:8000 · resident · 81 GB) → an image arrives → pause DS4, which frees RAM → Qwen3-VL answers (:8081 · ~20 GB · gen 62 t/s; loads in seconds via mmap) → restart DS4, the brain returns on :8000. The brain is on at the start and at the end. During the swap: only one model in RAM → peak ≤ 101 GB.]

There are never two large models in RAM at once: pause one, run the other, return. The cost is the reload latency (seconds); the prize is never blowing past the 128 GB.
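The timeline can be checked with a few lines of arithmetic. The sketch below models each step as the set of models resident in RAM and takes the maximum. The footprints (81 GB and ~20 GB) come from this lesson, while the step names and the 80% ceiling (20% headroom on 128 GB) are assumptions. The 101 GB bound quoted in the timeline equals the two footprints added together, which the sketch treats as the worst case of a brief overlap. That interpretation is also an assumption.

```python
# Approximate resident footprints from this lesson, in GB.
MODEL_GB = {"brain": 81.0, "vision": 20.0}

# Assumption: safe ceiling = 80% of 128 GB unified memory.
SAFE_CEILING_GB = 128.0 * 0.8

# Each step: (label, set of models resident in RAM at that step).
SWAP_TIMELINE = [
    ("brain serving", {"brain"}),
    ("brain paused", set()),
    ("vision answers", {"vision"}),
    ("brain restarted", {"brain"}),
]


def resident_gb(models) -> float:
    """Total footprint of the given set of resident models."""
    return sum(MODEL_GB[name] for name in models)


def peak_gb(timeline) -> float:
    """Highest resident footprint across all steps of the timeline."""
    return max(resident_gb(models) for _, models in timeline)


if __name__ == "__main__":
    print(f"swap peak: {peak_gb(SWAP_TIMELINE):.0f} GB")
    print(f"worst-case overlap: {resident_gb({'brain', 'vision'}):.0f} GB")
    print(f"ceiling: {SAFE_CEILING_GB:.1f} GB")
```

With a clean swap the peak is the brain alone at 81 GB. Even if both models overlapped for a moment, the 101 GB total would still sit under the assumed 102.4 GB ceiling.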

1. Why is make cpu the mistake to avoid when building DS4 on an M5 Max?
Correct: b. On Apple Silicon the gain comes from Metal over unified memory. make cpu throws that away. This session, make (default) generated 5 Metal binaries with exit 0.
2. You plugged the brain into Claude Code, but the server refuses the connection. What is NOT the cause?
Correct: c. Locally there's no real authentication: the api-key can be any string. A refused connection comes from a server not started, the wrong base-url/port, or the download still in progress.