Lesson 01 · The foundation

The mental model

Before you install anything, you need a way of thinking. This lesson installs Ahmad Osman's framework: local AI hardware is capacity × bandwidth × software stack, and you never pick an inference engine first. Internalize this and every decision in the lessons ahead stops being a guess and becomes a deduction.

3 factors that define everything: capacity, bandwidth, stack
4th: the inference engine comes LAST
128 GB: your fixed capacity (M5 Max)
~460–614 GB/s: your fixed bandwidth

01 · The three-factor formula

Everything else in the course hangs on a single sentence. Local AI hardware is the product of three things, and each one answers a different question. It is not a metaphor: it is the equation that tells you how much of the spec sheet you can actually draw on.

Local AI hardware = capacity × bandwidth × software stack. Capacity says what fits. Bandwidth says how hard the box can breathe. The software stack says how much of the spec sheet you can actually extract.
Ahmad Osman, "Memory Bandwidth for Local AI Hardware (2026)"
[Figure: Local AI hardware = capacity × bandwidth × stack. CAPACITY ("what fits"): VRAM / unified RAM; sets the model size + context it holds. BANDWIDTH ("how hard it breathes"): memory GB/s; sets the tokens/s at generation. STACK ("how much you extract"): engine + quantization + drivers + kernels; turns spec into real performance. The product is what you actually get: real performance, not the spec on the box.]

It is a product, not a sum: if any factor is zero, the output is zero. Huge bandwidth with little capacity won't run the model; huge capacity with a poor stack wastes the silicon.

Why "×" and not "+": a weak factor is not made up for by the others; it drags the whole product down. 512 GB of capacity on a stack that draws only 30% of the bandwidth delivers less real performance than 128 GB served by a stack that uses its bandwidth well. This is the lens that separates people who buy spec sheets from people who buy results.
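One way to make the product concrete is a back-of-the-envelope calculation. The sketch below is not from Osman's article; it assumes a dense model whose weights are read once per generated token, so the generation ceiling is roughly effective bandwidth divided by model size, and it treats capacity as a gate. The `Box` class, the `stack_efficiency` field and every number are hypothetical, chosen only to illustrate the shape of the argument.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A local AI machine described by the lesson's three factors."""
    capacity_gb: float       # VRAM / unified RAM available to the model
    bandwidth_gbs: float     # peak memory bandwidth in GB/s
    stack_efficiency: float  # fraction of peak bandwidth the stack extracts (0..1)


def decode_ceiling_tps(box: Box, model_gb: float) -> float:
    """Rough upper bound on generation speed in tokens/s.

    Assumption: dense model, weights read once per token, so
    tokens/s <= effective bandwidth / model size. If the model does not
    fit, the result is 0.0: one zero factor zeroes the whole product.
    """
    if model_gb > box.capacity_gb:
        return 0.0
    return box.bandwidth_gbs * box.stack_efficiency / model_gb


# Hypothetical boxes: same peak bandwidth, different capacity and stack quality.
big_but_wasteful = Box(capacity_gb=512, bandwidth_gbs=500, stack_efficiency=0.30)
small_but_tuned = Box(capacity_gb=128, bandwidth_gbs=500, stack_efficiency=0.80)

model_gb = 60.0  # hypothetical quantized model size
print(round(decode_ceiling_tps(big_but_wasteful, model_gb), 1))  # 2.5
print(round(decode_ceiling_tps(small_but_tuned, model_gb), 1))   # 6.7
```

The larger box wins only when the model does not fit in the smaller one. For anything that fits in both, the better-served bandwidth decides.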

02 · The shift in the question

The framework is not just descriptive — it changes the question you ask when buying or configuring. Going from "what is the best hardware?" to "which bottleneck am I buying?" is the difference between choosing by hype and choosing by engineering.

Once you internalize this, you stop asking "Which hardware is best?". You start asking "Which bottleneck am I buying?".
Ahmad Osman, "Memory Bandwidth for Local AI Hardware (2026)"
[Figure: the NAIVE QUESTION, "Which hardware is best?", has no answer because it depends on the workload, and it leads to hype and regret. Internalizing the formula turns it into the ENGINEERING QUESTION, "Which bottleneck am I buying?", which has an answer (pick the trade-off) and leads to a defensible decision.]

03 · Which bottleneck are you buying?

Every local AI machine is dominated by one of the three factors. There is no machine without a bottleneck — there is the machine whose bottleneck matches your workload. Knowing how to read the symptom is knowing which knob to turn.

Every setup is dominated by ONE bottleneck. Find out which:

CAPACITY-BOUND ("won't fit")
Symptom: the model won't even load, or spills to disk swap (slow); OOM when raising the context.
Fix: more RAM/VRAM, more aggressive quantization, or a smaller model / MoE.

BANDWIDTH-BOUND ("won't breathe")
Symptom: it fits and runs, but generation is slow (few t/s); the GPU sits idle waiting on RAM.
Fix: more bandwidth (fast HBM/GDDR); a smaller model, which means fewer bytes to read; MoE (reads only the active experts).

STACK-BOUND ("won't extract")
Symptom: powerful hardware, but the stack draws only a fraction of the bandwidth; poorly optimized engine/driver.
Fix: the right engine (MLX, llama.cpp), native kernels / Metal, a supported quantization.
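Reading the symptom can be turned into a rough triage rule. The sketch below is an illustration, not a method from the source. It assumes a dense model (the bytes read per token are about the model size), ignores KV cache and runtime overhead in the fit check, and uses a hypothetical `good_efficiency` threshold of 0.6 to separate "the stack is wasting bandwidth" from "the hardware is simply this fast".

```python
from typing import Optional


def classify_bottleneck(
    model_gb: float,
    capacity_gb: float,
    peak_bandwidth_gbs: float,
    measured_tps: float,
    bytes_per_token_gb: Optional[float] = None,
    good_efficiency: float = 0.6,  # hypothetical cut-off, tune to taste
) -> str:
    """Name the dominant bottleneck from a few numbers you can measure."""
    # 1. Capacity: if the model does not fit, nothing else matters.
    if model_gb > capacity_gb:
        return "capacity-bound"

    # 2. Ideal ceiling: peak bandwidth / bytes read per generated token.
    #    For a dense model that is roughly the whole model; for MoE pass
    #    the size of the active parameters instead.
    per_token = bytes_per_token_gb if bytes_per_token_gb is not None else model_gb
    ceiling_tps = peak_bandwidth_gbs / per_token

    # 3. Stack: how much of that ceiling do you actually get?
    efficiency = measured_tps / ceiling_tps
    if efficiency < good_efficiency:
        return "stack-bound"

    # 4. Otherwise the stack is doing its job and bandwidth is the limit.
    return "bandwidth-bound"


print(classify_bottleneck(200, 128, 460, 0.0))  # capacity-bound
print(classify_bottleneck(60, 128, 460, 3.0))   # stack-bound
print(classify_bottleneck(60, 128, 460, 6.5))   # bandwidth-bound
```

In the second call the ceiling is about 7.7 t/s and the measurement reaches 39% of it, so the knob to turn is the engine, not the hardware.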
Your read, founder

On the M5 Max, capacity (128 GB) and bandwidth (~460–614 GB/s) are fixed — you don't swap the hardware. That leaves one factor under your control: the stack. That's why the rest of the course is, in practice, a hunt for the stack that extracts the most from the bandwidth you already have. The bottleneck you "bought" was mid-bandwidth; your lever is not wasting it.

04 · The M5 Max as a fixed point

Each class of hardware sits in a different place on the capacity × bandwidth plane. Plotting the candidates makes obvious what Apple offers: high capacity, mid bandwidth. A dedicated GPU inverts it: extreme bandwidth, small capacity. There is no absolute winner — there are positions, and yours is already chosen.

[Figure: candidates plotted on the capacity × bandwidth plane. Vertical axis: capacity (~32 GB, ~128 GB, ~512 GB). Horizontal axis: bandwidth (~270 GB/s, ~500 GB/s, ~820 GB/s, ~1.8 TB/s). DGX Spark: ~128 GB, low bandwidth. Mac Studio: 512 GB, mid-high bandwidth. RTX 5090: 32 GB, extreme bandwidth. ★ M5 Max: 128 GB, ~460–614 GB/s, YOUR fixed point. Upper left: high capacity (runs big models). Lower right: high bandwidth (fast, little fits). Apple zone: high capacity · mid bandwidth.]

Approximate values, for positioning — not exact benchmarks. The point is the shape of the map: Apple trades top-end bandwidth for generous capacity in a single portable box. The RTX 5090 does the opposite. The Mac Studio 512 GB pushes capacity to the extreme.

| Hardware   | Capacity       | Bandwidth (approx.) | Dominant profile                                           |
|------------|----------------|---------------------|------------------------------------------------------------|
| M5 Max     | 128 GB unified | ~460–614 GB/s       | High capacity · mid bandwidth: big model in a laptop       |
| RTX 5090   | 32 GB GDDR7    | ~1.8 TB/s           | Extreme bandwidth · small capacity: fast, but little fits  |
| Mac Studio | 512 GB unified | mid-high bandwidth  | Extreme capacity: runs almost anything, not portable       |
| DGX Spark  | ~128 GB        | low-mid bandwidth   | Capacity ok · weak bandwidth: bandwidth bottleneck early   |
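To see the two positions side by side, the sketch below takes the two rows of the table that carry numeric bandwidth (M5 Max and RTX 5090) and asks, for a given model size, whether it fits and what the ideal generation ceiling would be. It assumes a dense model, 100% stack efficiency and no KV cache or runtime overhead, so the outputs are positioning numbers in the same spirit as the table, not benchmarks. The `position` helper and the model sizes are hypothetical.

```python
# Approximate figures copied from the table above (positioning, not benchmarks).
HARDWARE = {
    "M5 Max": {"capacity_gb": 128, "bandwidth_gbs": (460, 614)},
    "RTX 5090": {"capacity_gb": 32, "bandwidth_gbs": (1800, 1800)},
}


def position(model_gb):
    """For each machine: does the model fit, and what is the ideal t/s range?"""
    report = {}
    for name, hw in HARDWARE.items():
        fits = model_gb <= hw["capacity_gb"]
        low, high = hw["bandwidth_gbs"]
        # Ideal ceiling: bandwidth / bytes read per token (dense model assumed).
        ceiling = (low / model_gb, high / model_gb) if fits else None
        report[name] = {"fits": fits, "ceiling_tps": ceiling}
    return report


for size_gb in (20, 70):  # hypothetical model sizes in GB
    print(size_gb, position(size_gb))
```

At 20 GB both machines can run the model and the RTX 5090's ceiling is several times higher. At 70 GB only the M5 Max has a ceiling at all, which is the trade the map describes.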

05 · The decision order: the engine comes last

The most common beginner mistake is to open the forum and ask "do I use Ollama, llama.cpp or MLX?". That is the wrong place to start: not because any of them is a bad tool, but because it is the last question to ask. First you fix the hardware strategy, the workload shape and the serving model. The engine is a consequence.

You don't choose an inference engine first. You choose a hardware strategy, a workload shape and a serving model. The engine comes after.
Ahmad Osman, "Inference Engines (2026)"
[Figure: the right order, with the engine as the LAST step, not the first. 1. Hardware strategy (capacity × bandwidth). 2. Workload shape (coding? vision? batch?). 3. Serving model (1 request? concurrency?). 4. Inference engine (follows from the 3 above). Start at step 1: what you run decides everything downstream.]
Translation for you: the hardware is already fixed (M5 Max). There are two workload shapes: a coding/agent brain and occasional vision. You still have to choose the serving model (and that's where the next distinction lives). Only after that does the engine (MLX, llama.cpp, etc.) show up, and by then it nearly picks itself.
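The ordering can be expressed as a dependency: the engine decision is a function whose inputs are the three earlier decisions, so it cannot be made first. The sketch below is only an illustration of that idea. All class and function names are hypothetical, the candidate list is limited to the two engines this lesson names for Apple hardware, and the `must_support` rules are assumptions, not a selection procedure from the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class HardwareStrategy:  # step 1
    name: str
    capacity_gb: int
    apple_silicon: bool


@dataclass(frozen=True)
class WorkloadShape:  # step 2
    kinds: Tuple[str, ...]


@dataclass(frozen=True)
class ServingModel:  # step 3
    concurrent_requests: int


@dataclass
class EngineBrief:  # step 4: derived, never chosen first
    candidates: List[str] = field(default_factory=list)
    must_support: List[str] = field(default_factory=list)


def derive_engine_brief(hw: HardwareStrategy,
                        workload: WorkloadShape,
                        serving: ServingModel) -> EngineBrief:
    """Turn the three upstream decisions into requirements for the engine."""
    brief = EngineBrief()
    if hw.apple_silicon:
        # The two engines the lesson mentions for Metal-class hardware.
        brief.candidates = ["MLX", "llama.cpp"]
    if "vision" in workload.kinds:
        brief.must_support.append("vision-capable models")
    if serving.concurrent_requests > 1:
        brief.must_support += ["request queue", "batching", "KV cache management"]
    return brief


brief = derive_engine_brief(
    HardwareStrategy("M5 Max", 128, apple_silicon=True),
    WorkloadShape(("coding/agent", "vision")),
    ServingModel(concurrent_requests=1),
)
print(brief.candidates)    # ['MLX', 'llama.cpp']
print(brief.must_support)  # ['vision-capable models']
```

Changing only `concurrent_requests` changes what the engine must support, which is why the serving model has to be settled before the engine is.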

06 · "Runs" is not "serves"

The last piece of the mental model is the most expensive one to learn in practice. Making a model answer once in the terminal is trivial. Making it serve — handle concurrency, predictable latency and cost under load — is systems work. Confusing the two is the source of half the frustration with local AI.
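The gap between the two is easy to see in code. "Runs" is a single call to the model. "Serves" needs at least a queue and a batching loop in front of it. The sketch below shows that minimal serving shape with Python's `asyncio`. The `fake_model` coroutine stands in for a real engine, and `batcher`, `submit` and `serve_demo` are hypothetical names. Real servers add scheduling, KV cache management and memory limits on top of this.

```python
import asyncio


async def fake_model(prompts):
    """Stand-in for a real engine call: handles a whole batch in one pass."""
    await asyncio.sleep(0.01)  # pretend one forward pass costs 10 ms
    return [f"echo:{p}" for p in prompts]


async def batcher(queue, max_batch):
    """Pull requests off the queue and run them in batches of up to max_batch."""
    while True:
        batch = [await queue.get()]  # wait for at least one request
        while len(batch) < max_batch and not queue.empty():
            batch.append(queue.get_nowait())  # take whatever is already waiting
        outputs = await fake_model([prompt for prompt, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)  # wake up the client that asked


async def submit(queue, prompt):
    """What one client does: enqueue a prompt and wait for its answer."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future


async def serve_demo(prompts, max_batch=4):
    """Run several concurrent clients against one batching worker."""
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue, max_batch))
    try:
        return await asyncio.gather(*(submit(queue, p) for p in prompts))
    finally:
        worker.cancel()
        await asyncio.gather(worker, return_exceptions=True)


results = asyncio.run(serve_demo(["a", "b", "c", "d", "e"], max_batch=2))
print(results)  # ['echo:a', 'echo:b', 'echo:c', 'echo:d', 'echo:e']
```

Five clients share three forward passes here instead of five. Latency under load and throughput now depend on `max_batch` and on how the queue drains, which is the systems work the demo never touches.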

runs = demo; serves = systems work.
Ahmad Osman, "Inference Engines (2026)"
[Figure: "RUNS" = demo: proves the model loads and answers 1 request; no concurrency, latency doesn't matter, cost irrelevant; "it runs" is enough. "SERVES" = system: holds up in production under load, with requests flowing through a queue + batch, KV cache and scheduler; concurrency (N requests at once), latency (predictable under load), cost ($/token, energy, throughput).]

The demo fits on one terminal line. The system demands a queue, batching, KV cache management and a respected RAM ceiling. This course takes you from "runs" to "serves" — that's why the final config (Lesson 08) talks about ports, RAM peaks and headroom, not just "which model to download".
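A respected RAM ceiling can be checked with arithmetic before anything is launched. The sketch below uses the standard KV cache estimate (two tensors, K and V, per layer, per token, per sequence) and compares the peak against capacity minus headroom. The model shape, the 60 GB of weights and the 16 GB of headroom are hypothetical values for illustration. Only the 128 GB capacity comes from this lesson.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, sequences,
                bytes_per_value=2):
    """KV cache size in GB: K and V stored per layer, per token, per sequence."""
    total_bytes = (2 * layers * kv_heads * head_dim * bytes_per_value
                   * context_tokens * sequences)
    return total_bytes / 1e9


def fits_with_headroom(weights_gb, kv_gb, capacity_gb=128, headroom_gb=16):
    """Return (ok, peak_gb): does weights + KV cache stay under the ceiling?"""
    peak_gb = weights_gb + kv_gb
    return peak_gb <= capacity_gb - headroom_gb, peak_gb


# Hypothetical model shape: 64 layers, 8 KV heads, head_dim 128, 16-bit cache.
weights_gb = 60.0
for sequences in (1, 4, 8):
    kv = kv_cache_gb(layers=64, kv_heads=8, head_dim=128,
                     context_tokens=32768, sequences=sequences)
    ok, peak = fits_with_headroom(weights_gb, kv)
    print(sequences, round(kv, 1), round(peak, 1), ok)
```

With these numbers, one or four concurrent 32k-token sequences stay under the ceiling and eight do not. The same model that "runs" can stop "serving" purely because of concurrency.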

This course's contract

For you, the hardware is given: M5 Max, 128 GB, ~460–614 GB/s. There are two workloads: a coding/agent brain and occasional vision. Everything that follows (memory, bandwidth, quantization, MoE, the engine, the final config) is derived from those facts with this lesson's framework. You won't memorize recipes; you'll deduce your own.

1. According to Ahmad Osman's framework, what is the right question when evaluating local AI hardware?
Correct: c. "Which is best?" has no answer — it depends on the workload. The shift is to swap it for "which bottleneck am I buying?", which is decidable. The engine (option d) comes last, not first.
2. In the correct decision order, when does the choice of inference engine (MLX, llama.cpp, etc.) come in?
Correct: b. "You don't choose an inference engine first." The sequence is hardware → workload → serving model → engine. And remember: "runs = demo; serves = systems work" — serving under load is what justifies that whole order.