Lesson 01 · The foundation

The mental model

Before you install anything, you need a way of thinking. This lesson installs Ahmad Osman's framework: local AI hardware is capacity × bandwidth × software stack, and you never pick an inference engine first. Internalize this and every decision in the lessons ahead stops being a guess and becomes a deduction.

3 factors that define everything: capacity, bandwidth, stack

4th: the inference engine comes LAST

128 GB: your fixed capacity (M5 Max)

~460–614 GB/s: your fixed bandwidth

01 · The three-factor formula

Everything else in the course hangs on a single sentence. Local AI hardware is the product of three things, and each one answers a different question. It is not a metaphor: it is the equation that tells you how much of the spec sheet you can actually draw on.

Local AI hardware = capacity × bandwidth × software stack. Capacity says what fits. Bandwidth says how hard the box can breathe. The software stack says how much of the spec sheet you can actually extract.
Ahmad Osman, "Memory Bandwidth for Local AI Hardware (2026)"

It is a product, not a sum: if any factor is zero, the output is zero. Huge bandwidth with little capacity won't run the model; huge capacity with a poor stack wastes the silicon.

Why "×" and not "+": a weak factor is not made up for by the others; it scales the whole product down. At the same peak bandwidth, 512 GB of capacity behind a stack that draws only 30% of that bandwidth delivers less than 128 GB behind a stack that draws most of it. This is the lens that separates people who buy spec sheets from people who buy results.
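To make the "product, not sum" point concrete, here is a minimal sketch. It treats capacity as a gate (the model fits or it does not) and the stack as the fraction of peak bandwidth actually drawn. The function name, the 60 GB model and the efficiency figures are illustrative assumptions, not numbers from Osman's article or from any benchmark.

```python
def effective_bandwidth_gbs(model_gb, capacity_gb, peak_bandwidth_gbs, stack_efficiency):
    """Bandwidth the model can actually use, in GB/s.

    Capacity acts as a gate: if the model does not fit, the product is zero.
    stack_efficiency is the fraction (0.0 to 1.0) of peak bandwidth the
    software stack manages to draw. All inputs here are hypothetical.
    """
    fits = 1.0 if model_gb <= capacity_gb else 0.0
    return fits * peak_bandwidth_gbs * stack_efficiency


# Hypothetical 60 GB model on three hypothetical boxes.
big_box_poor_stack = effective_bandwidth_gbs(60, 512, 600, 0.30)
small_box_good_stack = effective_bandwidth_gbs(60, 128, 600, 0.85)
fast_box_too_small = effective_bandwidth_gbs(60, 32, 1800, 0.85)

print("512 GB, 30% stack:", big_box_poor_stack)
print("128 GB, 85% stack:", small_box_good_stack)
print("32 GB, model does not fit:", fast_box_too_small)
```

The third case is the one a sum would hide: the fastest memory on the list contributes nothing once the model does not fit.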

02 · The shift in the question

The framework is not just descriptive — it changes the question you ask when buying or configuring. Going from "what is the best hardware?" to "which bottleneck am I buying?" is the difference between choosing by hype and choosing by engineering.

Once you internalize this, you stop asking "Which hardware is best?" You start asking "Which bottleneck am I buying?"
Ahmad Osman, "Memory Bandwidth for Local AI Hardware (2026)"

03 · Which bottleneck are you buying?

Every local AI machine is dominated by one of the three factors. There is no machine without a bottleneck — there is the machine whose bottleneck matches your workload. Knowing how to read the symptom is knowing which knob to turn.

Your read, founder

On the M5 Max, capacity (128 GB) and bandwidth (~460–614 GB/s) are fixed — you don't swap the hardware. That leaves one factor under your control: the stack. That's why the rest of the course is, in practice, a hunt for the stack that extracts the most from the bandwidth you already have. The bottleneck you "bought" was mid-bandwidth; your lever is not wasting it.
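A rough way to see how much the stack lever is worth: for single-stream decoding that is memory-bound, a common rule of thumb (an assumption here, not a claim from the source) is that tokens per second cannot exceed usable bandwidth divided by the bytes read per generated token. The sketch below applies that ceiling to the lesson's ~460–614 GB/s range. The 40 GB read per token and the efficiency values are hypothetical.

```python
def decode_tokens_per_s_ceiling(bandwidth_gbs, gb_read_per_token, stack_efficiency):
    """Rule-of-thumb upper bound on decode speed for memory-bound generation.

    Assumes each generated token requires reading gb_read_per_token from
    memory once, and that the stack draws stack_efficiency of peak bandwidth.
    Real throughput will be lower; this is a ceiling, not a prediction.
    """
    return bandwidth_gbs * stack_efficiency / gb_read_per_token


M5_MAX_BANDWIDTH_GBS = (460.0, 614.0)  # range quoted in this lesson
GB_READ_PER_TOKEN = 40.0               # hypothetical model footprint

for efficiency in (0.5, 0.7, 0.9):     # hypothetical stack efficiencies
    low, high = (
        decode_tokens_per_s_ceiling(bw, GB_READ_PER_TOKEN, efficiency)
        for bw in M5_MAX_BANDWIDTH_GBS
    )
    print(f"stack at {efficiency:.0%}: ceiling of {low:.1f} to {high:.1f} tokens/s")
```

Capacity and bandwidth stay constant across the loop; only the stack term moves, and the ceiling moves with it.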

04 · The M5 Max as a fixed point

Each class of hardware sits in a different place on the capacity × bandwidth plane. Plotting the candidates makes obvious what Apple offers: high capacity, mid bandwidth. A dedicated GPU inverts it: extreme bandwidth, small capacity. There is no absolute winner — there are positions, and yours is already chosen.

Approximate values, for positioning — not exact benchmarks. The point is the shape of the map: Apple trades top-end bandwidth for generous capacity in a single portable box. The RTX 5090 does the opposite. The Mac Studio 512 GB pushes capacity to the extreme.

Hardware	Capacity	Bandwidth (approx.)	Dominant profile
M5 Max ★	128 GB unified	~460–614 GB/s	High capacity · mid bandwidth — big model in a laptop
RTX 5090	32 GB GDDR7	~1.8 TB/s	Extreme bandwidth · small capacity — fast, but little fits
Mac Studio	512 GB unified	mid-high bandwidth	Extreme capacity — runs almost anything, not portable
DGX Spark	~128 GB	low-mid bandwidth	Capacity ok · weak bandwidth — bandwidth bottleneck early
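One hedged way to put a number on "position" is the time it takes to read the whole memory once: capacity divided by bandwidth. The sketch below uses only the two rows of the table that carry numeric bandwidth; the Mac Studio and DGX Spark rows are left out because the table gives them only qualitatively. The metric and its name are an illustration, not something the source defines.

```python
# Approximate values copied from the table above; positioning only, not benchmarks.
HARDWARE = {
    "M5 Max": {"capacity_gb": 128, "bandwidth_gbs": (460, 614)},
    "RTX 5090": {"capacity_gb": 32, "bandwidth_gbs": (1800, 1800)},
}


def sweep_seconds(capacity_gb, bandwidth_gbs):
    """Seconds needed to read the full memory once at the given bandwidth."""
    return capacity_gb / bandwidth_gbs


for name, spec in HARDWARE.items():
    slow_bw, fast_bw = spec["bandwidth_gbs"]
    best = sweep_seconds(spec["capacity_gb"], fast_bw)
    worst = sweep_seconds(spec["capacity_gb"], slow_bw)
    print(f"{name}: {best * 1000:.0f} to {worst * 1000:.0f} ms per full-memory sweep")
```

A long sweep time means lots of room relative to bandwidth (the Apple position); a short one means fast memory that fills up quickly (the dedicated GPU position).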

05 · The decision order: the engine comes last

The most common beginner mistake is to open a forum and ask "do I use Ollama, llama.cpp or MLX?". That is the wrong place to start, not because any of them is a bad tool, but because it is the last question. First you fix the hardware strategy, the workload shape and the serving model. The engine is a consequence.

You don't choose an inference engine first. You choose a hardware strategy, a workload shape and a serving model. The engine comes after.
Ahmad Osman, "Inference Engines (2026)"

Translation for you: the hardware is already fixed (M5 Max). There are two workload shapes: a coding/agent brain and occasional vision. The serving model is still yours to choose (and that's where the next distinction lives). Only after that does the engine (MLX, llama.cpp, etc.) show up, and by then it nearly picks itself.
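The decision order can be written down as a tiny checklist that refuses to reach the engine until the three earlier questions are answered. This is only a sketch of the ordering; the names `DECISION_ORDER`, `next_question` and `can_choose_engine` are made up for illustration.

```python
# The order from this lesson: the engine is the fourth and last decision.
DECISION_ORDER = (
    "hardware strategy",
    "workload shape",
    "serving model",
    "inference engine",
)


def next_question(answers):
    """Return the first unanswered decision, or None when all are settled."""
    for question in DECISION_ORDER:
        if question not in answers:
            return question
    return None


def can_choose_engine(answers):
    """The engine may be chosen only after every earlier decision is made."""
    return all(question in answers for question in DECISION_ORDER[:-1])


# Where this course stands after Lesson 01.
answers = {
    "hardware strategy": "M5 Max, 128 GB, ~460-614 GB/s (fixed)",
    "workload shape": "coding/agent brain + occasional vision",
}

print("next question:", next_question(answers))
print("ready to pick an engine:", can_choose_engine(answers))
```

With hardware and workload settled, the open question is the serving model, which is why the forum question about engines is still premature.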

06 · "Runs" is not "serves"

The last piece of the mental model is the most expensive one to learn in practice. Making a model answer once in the terminal is trivial. Making it serve, which means handling concurrency and keeping latency and cost predictable under load, is systems work. Confusing the two is the source of half the frustration with local AI.

runs = demo; serves = systems work.
Ahmad Osman, "Inference Engines (2026)"

The demo fits on one terminal line. The system demands a queue, batching, KV cache management and a respected RAM ceiling. This course takes you from "runs" to "serves" — that's why the final config (Lesson 08) talks about ports, RAM peaks and headroom, not just "which model to download".
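A "respected RAM ceiling" can be sketched as simple arithmetic: weights plus KV cache for every concurrent request must stay under total RAM minus headroom. Every number below except the 128 GB total is a hypothetical placeholder, and the function names are invented for this example; real KV cache size depends on the model, context length and engine.

```python
def ram_budget(ram_gb, headroom_fraction, weights_gb, kv_gb_per_request, concurrent_requests):
    """Compare peak RAM under load against a ceiling that reserves headroom."""
    ceiling_gb = ram_gb * (1 - headroom_fraction)
    peak_gb = weights_gb + kv_gb_per_request * concurrent_requests
    return {"ceiling_gb": ceiling_gb, "peak_gb": peak_gb, "fits": peak_gb <= ceiling_gb}


def max_concurrency(ram_gb, headroom_fraction, weights_gb, kv_gb_per_request):
    """Largest number of concurrent requests that stays under the ceiling."""
    spare_gb = ram_gb * (1 - headroom_fraction) - weights_gb
    if spare_gb < 0:
        return 0
    return int(spare_gb // kv_gb_per_request)


# 128 GB is the lesson's fixed capacity; the rest are hypothetical inputs.
RAM_GB = 128
HEADROOM = 0.25         # keep a quarter of RAM free for the OS and other apps
WEIGHTS_GB = 60         # hypothetical model weights
KV_GB_PER_REQUEST = 4   # hypothetical KV cache per in-flight request

print("one request (the demo):", ram_budget(RAM_GB, HEADROOM, WEIGHTS_GB, KV_GB_PER_REQUEST, 1))
print("ten requests (the load):", ram_budget(RAM_GB, HEADROOM, WEIGHTS_GB, KV_GB_PER_REQUEST, 10))
print("max concurrency:", max_concurrency(RAM_GB, HEADROOM, WEIGHTS_GB, KV_GB_PER_REQUEST))
```

The demo case passes easily. The budget only starts to bind once concurrency enters the picture, which is the gap between "runs" and "serves".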

This course's contract

For you, the hardware is given: M5 Max, 128 GB, ~460–614 GB/s. The workloads are two: a coding/agent brain and occasional vision. Everything that follows — memory, bandwidth, quantization, MoE, the engine, the final config — is derived from those facts with this lesson's framework. You won't memorize recipes; you'll deduce your own.

1. According to Ahmad Osman's framework, what is the right question when evaluating local AI hardware?

Correct: c. "Which is best?" has no answer — it depends on the workload. The shift is to swap it for "which bottleneck am I buying?", which is decidable. The engine (option d) comes last, not first.

2. In the correct decision order, when does the choice of inference engine (MLX, llama.cpp, etc.) come in?

Correct: b. "You don't choose an inference engine first." The sequence is hardware → workload → serving model → engine. And remember: "runs = demo; serves = systems work" — serving under load is what justifies that whole order.
