Course · Local AI  ·  Português →
Alembic · visual-teach · EN

Local AI, done right

Eleven lessons that build a real system on your MacBook Pro M5 Max 128 GB: from @TheAhmadOsman's playbook to a quasi-frontier brain (DeepSeek-V4-Flash via DS4) + local vision (Qwen3-VL), offline and $0/token. Every number was measured on your machine or cited from source; every lesson ships diagrams, flows, and real examples.

25–34
tok/s · DeepSeek q2 brain (DS4)
122
tok/s · Qwen3-VL vision (MLX)
11
lessons · EN + PT-BR
$0
per token · 100% offline
The stack you'll build — two tiers, one Mac TIER 1 · Intelligence + Speed DeepSeek-V4-Flash q2 · DS4 (Metal) 81 GB · 25–34 tok/s · 1M ctx 127.0.0.1:8000 · OpenAI + Anthropic TIER 2 · Vision (on-demand) Qwen3-VL-30B-A3B · MLX-VLM ~20 GB · 110–122 tok/s · text+image 127.0.0.1:8081 · OpenAI capacity × bandwidth × software stack — derived lesson by lesson
How to read this. Lessons 01–04 are the foundations (mental model, memory, bandwidth, quantization); 05–07 are the tools (engines, KV cache, DS4); 08 assembles your Mac's config; 09–11 put it to work (hands-on, vision, capstone). Each lesson is a self-contained HTML file — no network, no build — with several diagrams, real examples, and quizzes. Toggle light/dark with the ◐ Theme button.

The eleven lessons

LESSON 01 · FOUNDATION

The mental model

capacity × bandwidth × software stack. Why you pick the bottleneck, and the engine comes last.

capacity · bandwidth · stack
LESSON 02 · FOUNDATION

Memory math

VRAM ≈ params × bits/8. The quantization ladder and what fits in 128 GB.

VRAM · quant ladder · 128 GB
LESSON 03 · FOUNDATION

Bandwidth = speed

Decode tracks bandwidth, prefill tracks compute. Where the M5 Max lands and why it's usable.

bandwidth · decode · prefill
LESSON 04 · FOUNDATION

Quantization & quality

Where Q2 degrades — and antirez's asymmetric recipe that preserves tool-use.

Q2 · asymmetric · tool-use
LESSON 05 · TOOLS

Inference engines

The decision guide, "DO NOT USE Ollama", and why kernels are the real work.

MLX · kernels · vLLM/SGLang
LESSON 06 · TOOLS

The KV cache

The model's working memory — and DeepSeek-V4's compressed attention that makes 1M context viable.

KV · CSA · 1M ctx
LESSON 07 · TOOLS

DS4 / DwarfStar

antirez's C engine for DeepSeek-V4: Metal, 2/8-bit, SSD streaming — compiled on your Mac.

DS4 · Metal · SSD streaming
LESSON 08 · THE CONFIG

Your M5 Max config

The memory budget to scale, the 3 deployment modes, and the 2-Mac pool over TB5.

budget · 3 modes · 2 Macs
LESSON 09 · RUNNING IT

Hands-on

Build, download, serve and plug into Claude Code/Codex — with real outputs measured this session.

build · serve · plug
LESSON 10 · RUNNING IT

Vision, carefully

Ahmad's multimodal cautions and the 4-VLM benchmark on your real images.

VLM · benchmark · on-demand
LESSON 11 · CAPSTONE

The decade setup

The full assembled stack, homelab-as-cloud, and the ds4-legal → Previdência Factory bridge.

capstone · ds4-legal · decade