Course / Lesson 11  ·  Português →
Lesson 11 · Capstone

The decade setup

The previous ten lessons were loose pieces: capacity, bandwidth, MoE, KV cache, quantization, engines, your Mac's config, hands-on, vision. This one welds them into a single machine and answers the question that drives everything — why build this? The answer isn't "to run a model." It's to own the infrastructure: near-frontier inference, offline, at $0 per token, with the total privacy that serious documents demand. It's not a dev toy. It's the foundation of a product.

2 tiers
brain (DS4 :8000) + vision (MLX :8081)
$0 / token
marginal cost of inference
100% offline
no data leaves the machine
11 lessons
from capacity to capstone
I'm not optimizing for the next release. I'm optimizing for the next decade.— Ahmad Osman

01 · The full stack assembled, end to end

Everything you derived converges here. A 128 GB M5 Max hosts two local endpoints: Tier 1 is the brain — DeepSeek-V4-Flash q2 served by DS4 on port :8000, 81 GB resident, 25–34 tok/s, near-frontier quality. Tier 2 is vision — Qwen3-VL via MLX on port :8081, ~20 GB, 110–122 tok/s. Clients (Claude Code, Codex) speak OpenAI/Anthropic to 127.0.0.1 and have no idea the backend is local. The diagram below is this course's master map.

100% OFFLINE $0 / TOKEN MacBook Pro M5 Max 128 GB unified memory · 460–614 GB/s · GPU+CPU+ANE on one die TIER 1 · The brain DS4 — DeepSeek-V4-Flash q2 81 GB resident · 25–34 tok/s near-frontier quality 127.0.0.1:8000 OpenAI + Anthropic API TIER 2 · Vision MLX — Qwen3-VL ~20 GB · 110–122 tok/s on demand (swap) 127.0.0.1:8081 OpenAI-compatible API Your clients Claude Code Codex base_url → localhost One power cord. No network cable required. Weights live on the SSD, inference on the GPU, data never leaves.

Two tiers, two ports, one die. Clients think they're talking to a cloud — but the "cloud" is on your desk, offline and at zero marginal cost.

No, I didn't duplicate my homelab to the cloud. My homelab became the cloud I use.— Ahmad Osman

02 · Rented cloud vs. homelab-as-cloud

The underlying choice isn't technical, it's about ownership. An online service charges per token, and every request ships your data out — to someone else's datacenter, under someone else's retention policy. The homelab flips all of it: marginal cost drops to zero, it works with no network, and the data never crosses the door. The diagram contrasts the two paths side by side.

Online service (rented) you pay per token · the data travels your machine prompt + data data leaves ↗ datacenter someone else's response ⏱ meter: $ per 1k tokens ✗ requires an always-on network ✗ third-party retention/policy ✗ cost grows with usage Homelab-as-cloud (owned) $0 marginal · offline · private M5 Max is the client AND the server :8000 :8081 ↺ the data never crosses the door ✓ works with no network ✓ total privacy (nothing leaves) ✓ fixed hardware cost, $0 per use
The inversion that defines the decade

Ahmad's line isn't poetry, it's architecture: "my homelab became the cloud I use". Instead of renting inference and paying per token while your data travels, you own the inference. The hardware is a fixed cost paid once; every token after that is free, offline and private. It's the difference between renting and owning.

03 · The per-domain model: ds4-legal

The brain doesn't have to be a single one. antirez's observation points to this stack's immediate future: instead of a generic model, you load the specialist the question calls for. A ds4-legal fine-tuned on a legal corpus gives frontier-grade legal inference — offline, at $0 per token, with the total privacy that benefits-law documents demand. This isn't abstract: it points straight at the founder's Previdência Factory. The diagram shows the path from corpus to product.

It makes a lot of sense to have ds4-coding, ds4-legal, ds4-medical models. You load the one you need depending on the question.— antirez, news/165
Load the specialist the question calls for — the generic brain becomes ds4-legal DS4 family (loadable) ds4-coding ★ ds4-legal ds4-medical domain corpus statutes · opinions · case law benefits-law documents fine-tunes → local inference DS4 :8000 · M5 Max offline · $0/token total privacy Previdência Factory legal-document factory generates legal filings per firm client data never leaves the stack BECOMES a product Why local matters for legal: sensitive document + confidentiality requirement = inference must be offline and private. The homelab delivers exactly that.
The portfolio bridge: why this isn't a toy

Join the two ends. A local ds4-legal = frontier-grade legal inference, offline, at $0 per token, with the total privacy that benefits-law documents demand. That speaks directly to the founder's Previdência Factory — the legal-document factory personalized per firm. This stack isn't a dev experiment: it's the foundation of a product. The same machine you built in Lessons 08–10 is the one that serves a real client without sending a single byte out.

04 · The decade vision: models pass, the discipline stays

Why "the decade" and not "the release"? Because what you bought in this course isn't a specific model — it's a discipline for reading the hardware. DeepSeek-V4-Flash is opportunistic: today it's the best piece that fits the budget, tomorrow it'll be another. What persists is the stack, the method, and the capacity reasoning. The timeline below shows models swapping while the foundation doesn't move.

The next decade — models pass (opportunistic swap), the discipline remains PERSISTS · the foundation capacity × bandwidth × stack · memory math · engine choice · KV/quantization · M5 Max 128 GB 2026 2028 2030 2032 2034 DeepSeek-V4 today · fits and flies next MoE swap when better ds4-legal v2 domain fine-tunes whatever's next same budget ↑ MODELS · disposable you just re-download the new weights Optimizing for the decade = investing in the foundation that survives every model swap
The durable asset isn't the weights, it's the reasoning. When the next MoE drops, you don't start over: you already know how to read capacity×bandwidth, do the memory math, pick the engine, and fit it into the 128 GB budget. Swapping models becomes re-downloading weights. That's why you optimize for the decade, not the release.

05 · Recap: the 6 things you can now do

Eleven lessons, one arc. From raw capacity (Lesson 01) to the capstone (Lesson 11), you went from "local models are magic" to a set of operational, proven skills. The map below is your working certificate — each node is something you can do now that you couldn't before.

The 01 → 11 arc · six capabilities you now own YOU local operator 1 read capacity × bandwidth does it fit? does it run fast? (L01–02) 2 do the memory math weights + KV + headroom (L03,08) 3 pick the engine MLX / DS4 by workload (L06) 4 reason about KV / quantization what to cut without losing it (L04–05) 5 serve a local frontier model build + serve DS4 :8000 (L07,09) 6 evaluate vision for real real samples, no demo (L10) 01 ──────────────────────────── course arc ──────────────────────────── 11
CapabilityLessonsWhat you can do now
Read capacity × bandwidth01–02Look at a model and say whether it fits in RAM and whether it'll generate fast — both axes, not just size.
Memory math03, 08Add up weights + KV cache + system + headroom and prove the 128 GB budget closes.
Pick the engine06Decide MLX vs. DS4 by workload, knowing what each one trades.
KV & quantization04–05Reason about what to cut (quantization, cache) without killing quality.
Serve a local frontier model07, 09Build and serve DeepSeek on :8000 speaking OpenAI/Anthropic.
Evaluate vision10Measure VLMs on your real images and choose by the data, not the demo.
1. What is the central thesis that makes this stack "the decade setup" rather than "the setup of the month"?
Correct: b. "I'm not optimizing for the next release, I'm optimizing for the next decade" (Ahmad). Models pass — DS4 is the best piece that fits today; tomorrow it swaps. What persists is the reasoning and the stack. That's why swapping models becomes just re-downloading weights.
2. Why is a ds4-legal running locally the direct bridge to the founder's Previdência Factory?
Correct: c. "It makes sense to have ds4-coding, ds4-legal, ds4-medical; you load the one you need" (antirez, news/165). Combine it with homelab-as-cloud: frontier legal inference, offline, $0/token, private. It's the foundation of a real product — not a dev toy.