Course / Lesson 07  ·  Português →
Lesson 07 · The engine

DS4 / DwarfStar

The earlier lessons gave you the vocabulary: MoE, unified memory, quantization, KV cache. DS4 is where all of it becomes a binary that runs. It's no longer a generic GGUF loader — it's a native engine in C written by antirez with a single target in mind: running DeepSeek-V4 on a Mac. This lesson opens the box: the architecture, the build we've already proven on this Mac, the SSD streaming that dissolves the RAM ceiling, and how to wire up a client.

284B / 13B
DeepSeek-V4-Flash · total / active
81 GB
q2-imatrix on disk (96/128 GB Macs)
1M
supported context (tokens)
5 binaries
produced by make (Metal)

DS4 — nicknamed "DwarfStar" — is a small, native, self-contained inference engine written in C. It doesn't try to run any model: it was optimized first for DeepSeek-V4-Flash (284B total parameters, 13B active per token, 1M context). The Pro variant (1.6T total / 49B active) exists for giant-memory machines. Think of it as a racing engine built for one track — not a passenger car that goes anywhere.

It's the first time I'm using a local model for serious work that I'd normally hand to Claude or GPT... DS4 is much more B [frontier] than A [small local model].— antirez, antirez.com/news/165

01 · The engine architecture

DS4 is a short, direct stack: a GGUF with asymmetric quantization on disk, a C core with Metal kernels that does the math, a KV cache that lives across RAM and disk, and a server that speaks the protocols of the commercial APIs. The clients (Claude Code, Codex, your app) don't even notice they're talking to something local.

1 · GGUF on disk hf.co/antirez/deepseek-v4-gguf Routed experts 2-bit quant (asymmetric) Rest of the weights Q8 quant (8-bit) q2-imatrix ≈ 81 GB DeepSeek-V4-Flash 284B total · 13B active 1M token ctx 2 · DS4 core (C) native engine · self-contained Metal kernels primary target · unified GPU CUDA / DGX-Spark · ROCm / Strix MoE router → 13B active per token ⚠ macOS: NEVER make cpu (VM bug causes kernel panic) → use Metal (default) 3 · KV cache RAM + disk (first-class citizen) 4 · ds4-server 127.0.0.1:8000 · single-stream /v1/chat/completions · OpenAI /v1/messages · Anthropic /v1/responses · Responses /v1/completions · completions 5 · Clients Claude Code (via /v1/messages) Codex / your app (via /v1/chat)

The asymmetric GGUF (1) feeds the C core with Metal kernels (2); the KV cache (3) lives across RAM and disk; ds4-server (4) exposes four API families and the clients (5) speak the protocol they already know.

✓ proven on this Mac

In this session, on the founder's M5 Max: git clone && make compiled 5 Metal binariesds4, ds4-server, ds4-bench, ds4-eval and ds4-agent — with exit 0, and the binaries run. This isn't "should compile": it compiled and ran.

02 · The asymmetric 2/8-bit quantization

Here's the trick that fits 284B into 81 GB without turning it to garbage. DS4 doesn't quantize everything at the same level. The routed experts — most of the parameters, but only a handful active per token — go to 2-bit. Everything else (attention, embeddings, router, dense layers) stays in Q8. Precision is spent where it matters for fidelity, and the savings are made where the volume is.

The 2-bit quants are no joke; it calls tools reliably.— antirez, DS4 MODEL_CARD
Why this works: in an MoE, the routed experts are where the parameter volume lives, but since only ~13B activate per token, each one's quantization error weighs little on the result. The parts that touch every token (attention, router) stay in Q8 so error doesn't accumulate. It's the thesis of Lesson 03 — quantization — applied with a scalpel, not a hammer.

03 · SSD streaming: RAM becomes a spectrum

This is DS4's most original mechanism. Instead of treating RAM as a hard cut — "it fits or it doesn't" — DS4 treats the SSD as a continuous extension. The non-routed weights (attention, router, dense) stay resident in memory. The routed experts are read from the GGUF on demand: when the router asks for an expert that isn't in the cache, it's brought in from the SSD (cache-miss). And the KV cache is, in antirez's words, "a first-class disk citizen".

Unified RAM (resident) high-bandwidth GPU access NON-routed weights — always resident attention · router · dense layers · Q8 Routed-expert cache the hot experts stay here E12 E47 E03 free free KV cache — starts in RAM, spills to disk SSD (GGUF · 81 GB) first-class disk citizen All routed experts (2-bit) E00 E01 E88 E89 only the requested ones rise to RAM streaming = continuous speed tiers cache-miss → stream expert E88 cache-hit → zero disk (E03/E12/E47 already in RAM)

Non-routed weights stay pinned in RAM; the hot experts live in a cache; a cold expert (E88) is brought in from the SSD only when the router asks for it. More RAM = more experts fit resident = fewer misses = faster — a spectrum, not a cut.

The KV cache is a first-class disk citizen. SSD streaming turns the available RAM from a hard cut into a continuous spectrum of speed tiers.— antirez, DS4 README

04 · How much to download for each RAM class

The download isn't one size fits all. Each quant pairs with a memory range: the more RAM, the more experts fit resident and the higher the quality you can afford. They all come from hf.co/antirez/deepseek-v4-gguf.

Download (disk) × RAM class — ✓ fits · ✗ doesn't fit 96 GB 128 GB 256 GB 512 GB q2-imatrix ≈ 81 GB q2-q4-imatrix ≈ 98 GB q4-imatrix ≈ 153 GB pro-q2 (V4-Pro) ≈ 430 GB ↓ your M5 Max

On a 128 GB M5 Max, q2-imatrix (81 GB) fits with room to spare and q2-q4-imatrix (98 GB) fits tight; q4 and pro-q2 call for bigger machines. SSD streaming softens the edges, but the table is the starting point.

Download targetDisk ≈RAM classVariant
q2-imatrix81 GB96 / 128 GB MacsV4-Flash
q2-q4-imatrix98 GB128 GB (more quality)V4-Flash
q4-imatrix153 GB≥ 256 GBV4-Flash
pro-q2430 GB512 GBV4-Pro

All from hf.co/antirez/deepseek-v4-gguf. q2-imatrix is the entry point for coding/agent Macs.

05 · From clone to client: the flow

Four steps take you from zero to a local endpoint serving. The first two — clone and make — are already proven on this Mac. The other two depend only on disk and on pointing the client.

1 · git clone DS4 repo ✓ proven on this Mac 2 · make (Metal) 5 binaries · exit 0 ✓ proven on this Mac 3 · download_model.sh q2-imatrix ≈ 81 GB · from HF 4 · ds4-server 127.0.0.1:8000 single-stream 5 · client Claude Code / Codex From zero to a local endpoint serving — 5 steps Status: DS4 is in BETA ("it's been around for a few days"); ds4-agent is in ALPHA. ⚠ on macOS, NEVER run make cpu — VM bug causes kernel panic. Use Metal (default).

The command, in practice

# 1+2 — clone and build (proven on this Mac: 5 Metal binaries, exit 0)
git clone <ds4-repo> && cd ds4 && make
# → ds4  ds4-server  ds4-bench  ds4-eval  ds4-agent

# 3 — download the quant that fits in 128 GB (≈ 81 GB)
./download_model.sh q2-imatrix

# 4 — serve (NEVER make cpu on macOS: kernel panic; Metal is the default)
./ds4-server   # OpenAI + Anthropic + Responses + completions on 127.0.0.1:8000

# 5 — plug in a client (e.g., an Anthropic-compatible call)
curl -s 127.0.0.1:8000/v1/messages -d '{"model":"deepseek-v4-flash","messages":[...]}'
a frontier model, local

antirez's take is the point of the whole lesson: using a local model for the work you'd normally outsource to Claude or GPT. On the A→B axis (A = small local model, B = frontier), DS4 is "much more B than A". That's why it's worth learning the engine, and not just running a toy chat.

06 · What "single-stream" means

The ds4-server accepts all four protocols, but it has one property that defines how you operate it: it's single-stream — one request in flight at a time. There's no request parallelism; the server processes one graph at a time and serializes the rest. In the diagram, three clients arrive, but only one crosses the worker; the others wait their turn.

One request at a time — the graph worker is serialized (single-stream) HTTP requests req A · /v1/messages req B · /v1/chat (waits) req C · /v1/responses (waits) queue serializes B, C 1 in flight single graph worker DeepSeek-V4-Flash · Metal kernels router → 13B active · KV cache tokens (stream) → req A's response B and C only start when A finishes — no request concurrency on the server. Practical consequence: 1 serious client at a time; for concurrent vision, separate the endpoint (Lesson 08).

Three requests arrive; the queue serializes B and C; the single graph worker processes A and returns tokens as a stream. "Single-stream" = one in flight — plan your usage (and vision) around it.

BETA — read before depending on this: DS4 "has been around for a few days" and is in beta; ds4-agent is alpha. Treat it as a moving frontier tool: great for serious work now, but not stable production infra yet. And the rule that can cost you the system: on macOS, never make cpu — a VM bug causes a kernel panic. Always Metal (the default).
1. What does DS4's asymmetric 2/8-bit quantization do, exactly?
Correct: c. The routed experts concentrate the parameters but only ~13B activate per token, so the 2-bit error weighs little; the parts that touch every token stay in Q8. That's why "the 2-bit quants are no joke" and it calls tools reliably (MODEL_CARD).
2. What does DS4's SSD streaming change in the relationship between RAM and model?
Correct: b. Non-routed weights stay resident; cold experts are brought in from the SSD on demand; the KV cache is a "first-class disk citizen". More RAM = more resident experts = fewer misses = faster — a spectrum, not a cutoff (README).