Lesson 07 · The engine

DS4 / DwarfStar

The earlier lessons gave you the vocabulary: MoE, unified memory, quantization, KV cache. DS4 is where all of it becomes a binary that runs. It's no longer a generic GGUF loader — it's a native engine in C written by antirez with a single target in mind: running DeepSeek-V4 on a Mac. This lesson opens the box: the architecture, the build we've already proven on this Mac, the SSD streaming that dissolves the RAM ceiling, and how to wire up a client.

284B / 13B

DeepSeek-V4-Flash · total / active

81 GB

q2-imatrix on disk (96/128 GB Macs)

supported context (tokens)

5 binaries

produced by make (Metal)

DS4 — nicknamed "DwarfStar" — is a small, native, self-contained inference engine written in C. It doesn't try to run any model: it was optimized first for DeepSeek-V4-Flash (284B total parameters, 13B active per token, 1M context). The Pro variant (1.6T total / 49B active) exists for giant-memory machines. Think of it as a racing engine built for one track — not a passenger car that goes anywhere.

It's the first time I'm using a local model for serious work that I'd normally hand to Claude or GPT... DS4 is much more B [frontier] than A [small local model].— antirez, antirez.com/news/165

01 · The engine architecture

DS4 is a short, direct stack: a GGUF with asymmetric quantization on disk, a C core with Metal kernels that does the math, a KV cache that lives across RAM and disk, and a server that speaks the protocols of the commercial APIs. The clients (Claude Code, Codex, your app) don't even notice they're talking to something local.

The asymmetric GGUF (1) feeds the C core with Metal kernels (2); the KV cache (3) lives across RAM and disk; ds4-server (4) exposes four API families and the clients (5) speak the protocol they already know.

✓ proven on this Mac

In this session, on the founder's M5 Max: git clone && make compiled 5 Metal binaries — ds4, ds4-server, ds4-bench, ds4-eval and ds4-agent — with exit 0, and the binaries run. This isn't "should compile": it compiled and ran.

02 · The asymmetric 2/8-bit quantization

Here's the trick that fits 284B into 81 GB without turning it to garbage. DS4 doesn't quantize everything at the same level. The routed experts — most of the parameters, but only a handful active per token — go to 2-bit. Everything else (attention, embeddings, router, dense layers) stays in Q8. Precision is spent where it matters for fidelity, and the savings are made where the volume is.

The 2-bit quants are no joke; it calls tools reliably.— antirez, DS4 MODEL_CARD

Why this works: in an MoE, the routed experts are where the parameter volume lives, but since only ~13B activate per token, each one's quantization error weighs little on the result. The parts that touch every token (attention, router) stay in Q8 so error doesn't accumulate. It's the thesis of Lesson 03 — quantization — applied with a scalpel, not a hammer.

03 · SSD streaming: RAM becomes a spectrum

This is DS4's most original mechanism. Instead of treating RAM as a hard cut — "it fits or it doesn't" — DS4 treats the SSD as a continuous extension. The non-routed weights (attention, router, dense) stay resident in memory. The routed experts are read from the GGUF on demand: when the router asks for an expert that isn't in the cache, it's brought in from the SSD (cache-miss). And the KV cache is, in antirez's words, "a first-class disk citizen".

Non-routed weights stay pinned in RAM; the hot experts live in a cache; a cold expert (E88) is brought in from the SSD only when the router asks for it. More RAM = more experts fit resident = fewer misses = faster — a spectrum, not a cut.

The KV cache is a first-class disk citizen. SSD streaming turns the available RAM from a hard cut into a continuous spectrum of speed tiers.— antirez, DS4 README

04 · How much to download for each RAM class

The download isn't one size fits all. Each quant pairs with a memory range: the more RAM, the more experts fit resident and the higher the quality you can afford. They all come from hf.co/antirez/deepseek-v4-gguf.

On a 128 GB M5 Max, q2-imatrix (81 GB) fits with room to spare and q2-q4-imatrix (98 GB) fits tight; q4 and pro-q2 call for bigger machines. SSD streaming softens the edges, but the table is the starting point.

Download target	Disk ≈	RAM class	Variant
q2-imatrix	81 GB	96 / 128 GB Macs	V4-Flash
q2-q4-imatrix	98 GB	128 GB (more quality)	V4-Flash
q4-imatrix	153 GB	≥ 256 GB	V4-Flash
pro-q2	430 GB	512 GB	V4-Pro

All from hf.co/antirez/deepseek-v4-gguf. q2-imatrix is the entry point for coding/agent Macs.

05 · From clone to client: the flow

Four steps take you from zero to a local endpoint serving. The first two — clone and make — are already proven on this Mac. The other two depend only on disk and on pointing the client.

The command, in practice

# 1+2 — clone and build (proven on this Mac: 5 Metal binaries, exit 0)
git clone <ds4-repo> && cd ds4 && make
# → ds4  ds4-server  ds4-bench  ds4-eval  ds4-agent

# 3 — download the quant that fits in 128 GB (≈ 81 GB)
./download_model.sh q2-imatrix

# 4 — serve (NEVER make cpu on macOS: kernel panic; Metal is the default)
./ds4-server   # OpenAI + Anthropic + Responses + completions on 127.0.0.1:8000

# 5 — plug in a client (e.g., an Anthropic-compatible call)
curl -s 127.0.0.1:8000/v1/messages -d '{"model":"deepseek-v4-flash","messages":[...]}'

a frontier model, local

antirez's take is the point of the whole lesson: using a local model for the work you'd normally outsource to Claude or GPT. On the A→B axis (A = small local model, B = frontier), DS4 is "much more B than A". That's why it's worth learning the engine, and not just running a toy chat.

06 · What "single-stream" means

The ds4-server accepts all four protocols, but it has one property that defines how you operate it: it's single-stream — one request in flight at a time. There's no request parallelism; the server processes one graph at a time and serializes the rest. In the diagram, three clients arrive, but only one crosses the worker; the others wait their turn.

Three requests arrive; the queue serializes B and C; the single graph worker processes A and returns tokens as a stream. "Single-stream" = one in flight — plan your usage (and vision) around it.

BETA — read before depending on this: DS4 "has been around for a few days" and is in beta; ds4-agent is alpha. Treat it as a moving frontier tool: great for serious work now, but not stable production infra yet. And the rule that can cost you the system: on macOS, never make cpu — a VM bug causes a kernel panic. Always Metal (the default).

1. What does DS4's asymmetric 2/8-bit quantization do, exactly?

Correct: c. The routed experts concentrate the parameters but only ~13B activate per token, so the 2-bit error weighs little; the parts that touch every token stay in Q8. That's why "the 2-bit quants are no joke" and it calls tools reliably (MODEL_CARD).

2. What does DS4's SSD streaming change in the relationship between RAM and model?

Correct: b. Non-routed weights stay resident; cold experts are brought in from the SSD on demand; the KV cache is a "first-class disk citizen". More RAM = more resident experts = fewer misses = faster — a spectrum, not a cutoff (README).

← Lesson 06 Lesson 08 →

Sources:
· Architecture, asymmetric 2/8-bit quant, SSD streaming ("KV cache is a first-class disk citizen"; "RAM becomes a spectrum"), download targets (q2 81 / q2-q4 98 / q4 153 / pro-q2 430 GB), server APIs (OpenAI + Anthropic + Responses + completions, :8000, single-stream): DS4 README + MODEL_CARD (hf.co/antirez/deepseek-v4-gguf).
· "2-bit is no joke; calls tools reliably": DS4 MODEL_CARD.
· "first time I'm using a local model for serious work... much more B than A"; BETA status / ds4-agent ALPHA; Metal primary target; NEVER make cpu (kernel panic): antirez.com/news/165.
· git clone && make → 5 Metal binaries (ds4, ds4-server, ds4-bench, ds4-eval, ds4-agent), exit 0: proven in this session, founder's M5 Max.
← Course hub · Português