Lesson 11 · Capstone

The decade setup

The previous ten lessons were loose pieces: capacity, bandwidth, MoE, KV cache, quantization, engines, your Mac's config, hands-on, vision. This one welds them into a single machine and answers the question that drives everything — why build this? The answer isn't "to run a model." It's to own the infrastructure: near-frontier inference, offline, at $0 per token, with the total privacy that serious documents demand. It's not a dev toy. It's the foundation of a product.

2 tiers

brain (DS4 :8000) + vision (MLX :8081)

$0 / token

marginal cost of inference

100% offline

no data leaves the machine

11 lessons

from capacity to capstone

I'm not optimizing for the next release. I'm optimizing for the next decade.— Ahmad Osman

01 · The full stack assembled, end to end

Everything you derived converges here. A 128 GB M5 Max hosts two local endpoints: Tier 1 is the brain — DeepSeek-V4-Flash q2 served by DS4 on port :8000, 81 GB resident, 25–34 tok/s, near-frontier quality. Tier 2 is vision — Qwen3-VL via MLX on port :8081, ~20 GB, 110–122 tok/s. Clients (Claude Code, Codex) speak OpenAI/Anthropic to 127.0.0.1 and have no idea the backend is local. The diagram below is this course's master map.

Two tiers, two ports, one die. Clients think they're talking to a cloud — but the "cloud" is on your desk, offline and at zero marginal cost.

No, I didn't duplicate my homelab to the cloud. My homelab became the cloud I use.— Ahmad Osman

02 · Rented cloud vs. homelab-as-cloud

The underlying choice isn't technical, it's about ownership. An online service charges per token, and every request ships your data out — to someone else's datacenter, under someone else's retention policy. The homelab flips all of it: marginal cost drops to zero, it works with no network, and the data never crosses the door. The diagram contrasts the two paths side by side.

The inversion that defines the decade

Ahmad's line isn't poetry, it's architecture: "my homelab became the cloud I use". Instead of renting inference and paying per token while your data travels, you own the inference. The hardware is a fixed cost paid once; every token after that is free, offline and private. It's the difference between renting and owning.

03 · The per-domain model: ds4-legal

The brain doesn't have to be a single one. antirez's observation points to this stack's immediate future: instead of a generic model, you load the specialist the question calls for. A ds4-legal fine-tuned on a legal corpus gives frontier-grade legal inference — offline, at $0 per token, with the total privacy that benefits-law documents demand. This isn't abstract: it points straight at the founder's Previdência Factory. The diagram shows the path from corpus to product.

It makes a lot of sense to have ds4-coding, ds4-legal, ds4-medical models. You load the one you need depending on the question.— antirez, news/165

The portfolio bridge: why this isn't a toy

Join the two ends. A local ds4-legal = frontier-grade legal inference, offline, at $0 per token, with the total privacy that benefits-law documents demand. That speaks directly to the founder's Previdência Factory — the legal-document factory personalized per firm. This stack isn't a dev experiment: it's the foundation of a product. The same machine you built in Lessons 08–10 is the one that serves a real client without sending a single byte out.

04 · The decade vision: models pass, the discipline stays

Why "the decade" and not "the release"? Because what you bought in this course isn't a specific model — it's a discipline for reading the hardware. DeepSeek-V4-Flash is opportunistic: today it's the best piece that fits the budget, tomorrow it'll be another. What persists is the stack, the method, and the capacity reasoning. The timeline below shows models swapping while the foundation doesn't move.

The durable asset isn't the weights, it's the reasoning. When the next MoE drops, you don't start over: you already know how to read capacity×bandwidth, do the memory math, pick the engine, and fit it into the 128 GB budget. Swapping models becomes re-downloading weights. That's why you optimize for the decade, not the release.

05 · Recap: the 6 things you can now do

Eleven lessons, one arc. From raw capacity (Lesson 01) to the capstone (Lesson 11), you went from "local models are magic" to a set of operational, proven skills. The map below is your working certificate — each node is something you can do now that you couldn't before.

Capability	Lessons	What you can do now
Read capacity × bandwidth	01–02	Look at a model and say whether it fits in RAM and whether it'll generate fast — both axes, not just size.
Memory math	03, 08	Add up weights + KV cache + system + headroom and prove the 128 GB budget closes.
Pick the engine	06	Decide MLX vs. DS4 by workload, knowing what each one trades.
KV & quantization	04–05	Reason about what to cut (quantization, cache) without killing quality.
Serve a local frontier model	07, 09	Build and serve DeepSeek on `:8000` speaking OpenAI/Anthropic.
Evaluate vision	10	Measure VLMs on your real images and choose by the data, not the demo.

1. What is the central thesis that makes this stack "the decade setup" rather than "the setup of the month"?

Correct: b. "I'm not optimizing for the next release, I'm optimizing for the next decade" (Ahmad). Models pass — DS4 is the best piece that fits today; tomorrow it swaps. What persists is the reasoning and the stack. That's why swapping models becomes just re-downloading weights.

2. Why is a ds4-legal running locally the direct bridge to the founder's Previdência Factory?

Correct: c. "It makes sense to have ds4-coding, ds4-legal, ds4-medical; you load the one you need" (antirez, news/165). Combine it with homelab-as-cloud: frontier legal inference, offline, $0/token, private. It's the foundation of a real product — not a dev toy.

← Lesson 10 Course hub →