The earlier lessons gave you the vocabulary: MoE, unified memory, quantization, KV cache. DS4 is where all of it becomes a binary that runs. It's no longer a generic GGUF loader — it's a native engine in C written by antirez with a single target in mind: running DeepSeek-V4 on a Mac. This lesson opens the box: the architecture, the build we've already proven on this Mac, the SSD streaming that dissolves the RAM ceiling, and how to wire up a client.
make (Metal)DS4 — nicknamed "DwarfStar" — is a small, native, self-contained inference engine written in C. It doesn't try to run any model: it was optimized first for DeepSeek-V4-Flash (284B total parameters, 13B active per token, 1M context). The Pro variant (1.6T total / 49B active) exists for giant-memory machines. Think of it as a racing engine built for one track — not a passenger car that goes anywhere.
It's the first time I'm using a local model for serious work that I'd normally hand to Claude or GPT... DS4 is much more B [frontier] than A [small local model].— antirez, antirez.com/news/165
DS4 is a short, direct stack: a GGUF with asymmetric quantization on disk, a C core with Metal kernels that does the math, a KV cache that lives across RAM and disk, and a server that speaks the protocols of the commercial APIs. The clients (Claude Code, Codex, your app) don't even notice they're talking to something local.
The asymmetric GGUF (1) feeds the C core with Metal kernels (2); the KV cache (3) lives across RAM and disk; ds4-server (4) exposes four API families and the clients (5) speak the protocol they already know.
In this session, on the founder's M5 Max: git clone && make compiled 5 Metal binaries — ds4, ds4-server, ds4-bench, ds4-eval and ds4-agent — with exit 0, and the binaries run. This isn't "should compile": it compiled and ran.
Here's the trick that fits 284B into 81 GB without turning it to garbage. DS4 doesn't quantize everything at the same level. The routed experts — most of the parameters, but only a handful active per token — go to 2-bit. Everything else (attention, embeddings, router, dense layers) stays in Q8. Precision is spent where it matters for fidelity, and the savings are made where the volume is.
The 2-bit quants are no joke; it calls tools reliably.— antirez, DS4 MODEL_CARD
This is DS4's most original mechanism. Instead of treating RAM as a hard cut — "it fits or it doesn't" — DS4 treats the SSD as a continuous extension. The non-routed weights (attention, router, dense) stay resident in memory. The routed experts are read from the GGUF on demand: when the router asks for an expert that isn't in the cache, it's brought in from the SSD (cache-miss). And the KV cache is, in antirez's words, "a first-class disk citizen".
Non-routed weights stay pinned in RAM; the hot experts live in a cache; a cold expert (E88) is brought in from the SSD only when the router asks for it. More RAM = more experts fit resident = fewer misses = faster — a spectrum, not a cut.
The KV cache is a first-class disk citizen. SSD streaming turns the available RAM from a hard cut into a continuous spectrum of speed tiers.— antirez, DS4 README
The download isn't one size fits all. Each quant pairs with a memory range: the more RAM, the more experts fit resident and the higher the quality you can afford. They all come from hf.co/antirez/deepseek-v4-gguf.
On a 128 GB M5 Max, q2-imatrix (81 GB) fits with room to spare and q2-q4-imatrix (98 GB) fits tight; q4 and pro-q2 call for bigger machines. SSD streaming softens the edges, but the table is the starting point.
| Download target | Disk ≈ | RAM class | Variant |
|---|---|---|---|
| q2-imatrix | 81 GB | 96 / 128 GB Macs | V4-Flash |
| q2-q4-imatrix | 98 GB | 128 GB (more quality) | V4-Flash |
| q4-imatrix | 153 GB | ≥ 256 GB | V4-Flash |
| pro-q2 | 430 GB | 512 GB | V4-Pro |
All from hf.co/antirez/deepseek-v4-gguf. q2-imatrix is the entry point for coding/agent Macs.
Four steps take you from zero to a local endpoint serving. The first two — clone and make — are already proven on this Mac. The other two depend only on disk and on pointing the client.
# 1+2 — clone and build (proven on this Mac: 5 Metal binaries, exit 0) git clone <ds4-repo> && cd ds4 && make # → ds4 ds4-server ds4-bench ds4-eval ds4-agent # 3 — download the quant that fits in 128 GB (≈ 81 GB) ./download_model.sh q2-imatrix # 4 — serve (NEVER make cpu on macOS: kernel panic; Metal is the default) ./ds4-server # OpenAI + Anthropic + Responses + completions on 127.0.0.1:8000 # 5 — plug in a client (e.g., an Anthropic-compatible call) curl -s 127.0.0.1:8000/v1/messages -d '{"model":"deepseek-v4-flash","messages":[...]}'
antirez's take is the point of the whole lesson: using a local model for the work you'd normally outsource to Claude or GPT. On the A→B axis (A = small local model, B = frontier), DS4 is "much more B than A". That's why it's worth learning the engine, and not just running a toy chat.
The ds4-server accepts all four protocols, but it has one property that defines how you operate it: it's single-stream — one request in flight at a time. There's no request parallelism; the server processes one graph at a time and serializes the rest. In the diagram, three clients arrive, but only one crosses the worker; the others wait their turn.
Three requests arrive; the queue serializes B and C; the single graph worker processes A and returns tokens as a stream. "Single-stream" = one in flight — plan your usage (and vision) around it.
ds4-agent is alpha. Treat it as a moving frontier tool: great for serious work now, but not stable production infra yet. And the rule that can cost you the system: on macOS, never make cpu — a VM bug causes a kernel panic. Always Metal (the default).