Enough RAM budgeting. This lesson is the command playbook: clone and build the brain, download the weights, bring up both servers and plug in your client. Every step marked ✓ was run on this M5 Max in this session — not "should work", but "ran, here's the output". By the end you'll have two OpenAI-compatible endpoints on your own machine: a brain on :8000 and vision on :8081.
Two parallel tracks end in the same place, your client: the brain track (native DS4, Metal) and the vision track (MLX, in Python). Neither depends on the other to come up. The diagram marks what has already been proven this session with ✓ and the single slow step (the 81 GB download) with ⏳.
Everything except the download was executed and verified this session. The two tracks don't block each other: you can bring up vision while the brain's 81 GB is still downloading.
The DS4 git clone + make ran to completion on this Mac: 5 Metal binaries generated, exit 0. And the vision mlx_vlm.server is up right now on :8081, answering real calls (the output is further down). The only thing not done live is the 81 GB download, for lack of time rather than out of doubt.
DS4 (by antirez) builds natively with Metal. The golden rule: never run make cpu on Apple Silicon. A CPU build throws away the GPU and unified memory, which are exactly what makes the Mac viable. The default build (make) already targets Metal.

    # 1) clone the brain repository
    git clone https://github.com/antirez/ds4 && cd ds4

    # 2) build with Metal: NEVER `make cpu` on Apple Silicon
    make   # ✓ proven this session: 5 Metal binaries generated, exit 0

    # 3) download the q2 weights (~81 GB, the download is resumable)
    ./download_model.sh q2-imatrix

    # 4) serve, with the KV cache on disk to free up RAM
    ./ds4-server --ctx 100000 --kv-disk-dir ~/.ds4/kv --kv-disk-space-mb 8192
    # measured perf (DS4 README, M5 Max class): gen 25–34 t/s · prefill 87–463 t/s

Why --kv-disk-dir? The KV cache grows with context. Pushing it to disk (here with an 8 GB cap) keeps the 81 GB of weights plus the system within the 20% headroom, even on long contexts. It's the detail that separates "it ran" from "it blew out the RAM".
The result is easy to picture: one machine, two HTTP servers. The brain speaks OpenAI and Anthropic on :8000; vision speaks OpenAI on :8081/v1. To your client, they're just two base-URLs.
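
The headroom argument behind --kv-disk-dir can be put in numbers. The sketch below is only arithmetic on figures quoted in this lesson (128 GB of RAM, 81 GB of weights, 20% headroom). It assumes "20% headroom" means keeping 20% of total RAM free; the function name `ram_left_gb` is made up for illustration.

```python
def ram_left_gb(total_gb: float, headroom_frac: float, resident_gb: float) -> float:
    """RAM still available under the safe ceiling once the weights are resident."""
    # Assumption: the safe ceiling is total RAM minus the headroom fraction.
    ceiling_gb = total_gb * (1.0 - headroom_frac)
    return ceiling_gb - resident_gb


# Figures quoted in this lesson: 128 GB machine, 20% headroom, 81 GB of q2 weights.
left = ram_left_gb(total_gb=128, headroom_frac=0.20, resident_gb=81)
print(f"left for system + in-RAM KV cache: {left:.1f} GB")  # 21.4 GB
```

Whatever the system and an in-RAM KV cache need has to fit in that remainder, which is why capping the cache on disk matters on long contexts.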
Two processes, two ports, zero cloud. Both OpenAI-compatible, so any client that accepts a custom base-URL connects to both.
The vision track uses uv for an isolated venv and installs mlx-vlm straight from Git. The server brings up an OpenAI endpoint on :8081. Below, the setup and — what matters — the real response captured this session, with measured timings.

    # 1) isolated venv, activated (uv is fast and reproducible)
    uv venv ~/.venv-mlxvlm && source ~/.venv-mlxvlm/bin/activate

    # 2) install mlx-vlm from main
    uv pip install "git+https://github.com/Blaizzy/mlx-vlm"

    # 3) bring up the vision server (OpenAI-compatible) on :8081
    python -m mlx_vlm.server \
      --model mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit \
      --port 8081
    # ✓ running right now in this session

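
    # real call to the vision server's OpenAI-compatible endpoint
    curl -s 127.0.0.1:8081/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Descreva o conceito de memória unificada."}]
      }'

    # real response (measured this session; the prompt asks, in Portuguese,
    # "Describe the concept of unified memory."):
    #   "A memória unificada é uma arquitetura em que CPU e GPU compartilham o mesmo pool
    #    físico de memória, eliminando cópias entre dispositivos e permitindo que modelos
    #    grandes sejam acessados diretamente pela GPU com baixa latência."
    # in English:
    #   "Unified memory is an architecture in which the CPU and GPU share the same physical
    #    memory pool, eliminating copies between devices and letting large models be
    #    accessed directly by the GPU with low latency."
    # real timings: prompt 88 t/s · gen 62 t/s · peak 18.4 GB
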
These numbers aren't from anyone's README: they were measured on this machine, now. The server processed the prompt at 88 t/s, generated at 62 t/s and never crossed 18.4 GB peak — comfortably inside the vision budget from Lesson 08. On a one-shot call (generate) generation reached 122 t/s.
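
To turn those rates into a feel for wall-clock time, here is a rough sketch. It assumes the measured 88 t/s (prompt) and 62 t/s (generation) stay constant, and it ignores model load time and HTTP overhead. The function name and the token counts in the example are illustrative, not measurements.

```python
def estimate_seconds(prompt_tokens: int, gen_tokens: int,
                     prefill_tps: float = 88.0, gen_tps: float = 62.0) -> float:
    """Rough latency: time to read the prompt plus time to generate the answer."""
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps


# Hypothetical request: 880 prompt tokens in, 620 tokens out.
print(f"{estimate_seconds(880, 620):.1f} s")  # 20.0 s
```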
Before plugging in the client, the state map is worth a look. Each piece of the config carries a seal: ✓ if it's already built/running, ⏳ if it still depends on the download. Only one piece is on ⏳.
| Component | Where | State | Evidence |
|---|---|---|---|
| clone + make (DS4) | brain · :8000 | ✓ ready | 5 Metal binaries, exit 0 (this session) |
| q2 weights (81 GB) | brain · disk | ⏳ downloading | resumable download |
| venv + mlx-vlm | vision · :8081 | ✓ ready | installed (this session) |
| mlx_vlm.server | vision · :8081 | ✓ up | real curl · gen 62 t/s · peak 18.4 GB |
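
The "up" seals for the two servers can be re-checked at any time with a plain TCP connect. This is a minimal sketch using only the standard library. It tells you that something is listening on the port, not that a model has finished loading; `is_listening` is a name made up for this example.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable
        return False


if __name__ == "__main__":
    # Ports taken from this lesson: brain on :8000, vision on :8081.
    for name, port in (("brain", 8000), ("vision", 8081)):
        seal = "✓ up" if is_listening("127.0.0.1", port) else "⏳ not listening"
        print(f"{name} :{port} {seal}")
```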
What happens when your client asks a question? A lean HTTP exchange, over loopback, never leaving the machine. The diagram follows one request from the client to the tokens coming back.
The same sequence shape holds for both endpoints; only the port and which model loads change. The brain timings are M5 Max class (DS4 README); the vision ones were measured above.
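
On the way back, the client only needs to pull the text out of the JSON body. The sketch below assumes a non-streaming reply in the usual OpenAI chat-completions shape (`choices[0].message.content`). The sample body is hand-written for illustration and is not captured output from either server.

```python
import json


def extract_text(response_body: str) -> str:
    """Pull the assistant's text out of an OpenAI-style chat completion body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Hand-written sample in the standard shape (not real server output).
sample = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "hello"}}]
})
print(extract_text(sample))  # hello
```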
This is the simplest part — and the most satisfying. Since both servers speak OpenAI, any client that accepts a base-URL points at 127.0.0.1. The api-key can be any string: there's no real authentication on a local server.

    # brain (DeepSeek): for coding/agent
    base-url = http://127.0.0.1:8000
    model    = deepseek-v4-flash   # the id the ds4-server exposes
    api-key  = anything            # the local server doesn't validate the key

    # vision (Qwen3-VL): for images/screenshots
    base-url = http://127.0.0.1:8081/v1
    model    = mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
    api-key  = anything

    # example: OpenAI-style environment variables; the exact names vary by client
    # (Claude Code, Codex, opencode), so check each client's docs
    export OPENAI_BASE_URL=http://127.0.0.1:8000
    export OPENAI_API_KEY=local

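
If your client is a script rather than an app, the same settings can be used directly from Python's standard library. This is a sketch under assumptions: it follows the common OpenAI-client convention of appending `/chat/completions` to the base-URL, which matches the vision curl call in this lesson but is not confirmed here for the brain on :8000. `ENDPOINTS`, `build_request` and `ask` are illustrative names.

```python
import json
import urllib.request

# Base-URLs and model ids as given in this lesson.
ENDPOINTS = {
    "brain": {"base_url": "http://127.0.0.1:8000",
              "model": "deepseek-v4-flash"},
    "vision": {"base_url": "http://127.0.0.1:8081/v1",
               "model": "mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit"},
}


def build_request(target: str, prompt: str, api_key: str = "local") -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST for one of the two servers."""
    cfg = ENDPOINTS[target]
    body = {"model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=cfg["base_url"] + "/chat/completions",  # assumed path convention
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},  # any string; not validated locally
        method="POST",
    )


def ask(target: str, prompt: str) -> str:
    """Send the request and return the reply text (requires the server to be up)."""
    with urllib.request.urlopen(build_request(target, prompt), timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the vision server running, `ask("vision", "Describe unified memory in one sentence.")` would return the reply as a string. A `URLError` here usually means the server is not listening on that port.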
In the recommended mode (one Mac), the brain stays resident and vision only loads when an image arrives. The swap takes seconds thanks to mmap, and the peak never crosses the safe ceiling. This is Lesson 08 translated into a timeline.
There are never two large models in RAM at once: pause one, run the other, return. The cost is the reload latency (seconds); the prize is never blowing past the 128 GB.
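
The "vision only when an image arrives" rule can be sketched as a tiny router on the client side. It assumes images are sent as OpenAI-style content parts with `"type": "image_url"`, and it only picks the base-URL; pausing or unloading a model is not shown. `has_image` and `pick_base_url` are illustrative names.

```python
BRAIN_URL = "http://127.0.0.1:8000"       # brain base-URL from this lesson
VISION_URL = "http://127.0.0.1:8081/v1"   # vision base-URL from this lesson


def has_image(messages: list) -> bool:
    """True if any message carries an OpenAI-style image_url content part."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # multimodal content is a list of parts
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_base_url(messages: list) -> str:
    """Send image-bearing conversations to vision, everything else to the brain."""
    return VISION_URL if has_image(messages) else BRAIN_URL
```

A text-only conversation stays on :8000, so the vision model is only touched when a screenshot or image actually shows up.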
Why is make cpu the mistake to avoid when building DS4 on an M5 Max? Because make cpu throws away the GPU and unified memory. This session, make (default) generated 5 Metal binaries with exit 0.
The api-key can be any string. A refused connection comes from a server that was not started, the wrong base-url/port, or the download still being in progress.