Lesson 05 · The load-bearing layer

Inference engines

The model is just a graph of operations and weights. What decides whether it runs at 14 or 64 tokens/s on the SAME hardware is the inference engine: the scheduler, the optimizer and the kernels that execute the math. This lesson shows how to pick the engine by your workload, why the same GPU can triple or quadruple in speed just by swapping the software, and why, deep down, you don't run a model: you run kernels.

~4.4× gain on 2× RTX 3090 when migrating to vLLM (TP=2)

~3.4× gain on the RTX PRO 6000 when migrating to SGLang

kinds of kernel where the work actually happens

⛔ 0 times you should use Ollama

01 · Pick the engine by your workload

There's no "best engine" in the abstract — there's the best engine for your hardware and your workload. The guide below is a decision tree: answer the first question that applies and follow the branch. Mac first? MLX. Weird or edge hardware? llama.cpp. A single RTX? ExLlamaV2. And so on, up to production and cluster.

The tree is exhaustive but you rarely go all the way down: the first true condition already decides. For a Mac (this course), the answer is MLX before any other question.

"Use MLX for Mac-first ML and LLM workflows." (Ahmad Osman, "Inference Engines for LLMs (2026)")

02 · You don't run a model. You run kernels.

This is the mental flip of the lesson. The "model" is a graph — a description of operations. What executes that graph is the inference engine: it schedules (the scheduler), optimizes and executes. But the real work — where the GPU cycles are spent — happens in the kernels: the low-level routines that compute each operation. Bad kernels make you think "this model is slow". Good kernels make you think "wait, how is this running locally?".

Why this matters in practice: the same model, the same hardware, two different engines = two different sets of kernels = different speeds. Switching engine is switching kernels — that's why the next section shows 3–4× gains without swapping a single physical part.
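A toy illustration of that point, in plain Python. The graph is only a list of operation names, and two hypothetical "engines" run it with different kernel tables. Everything here (`GRAPH`, `run`, both kernel sets) is invented for the example and is not how MLX, vLLM or any real engine is structured; it only shows that the output stays identical while the routines doing the work change.

```python
import operator
import time

# The "model": a description of operations plus weights. No execution logic.
GRAPH = ["matvec", "relu"]
WEIGHTS = [[(r * 7 + c) % 5 - 2 for c in range(256)] for r in range(256)]


def matvec_loop(w, x):
    """Kernel set A: explicit index-based loops."""
    out = []
    for row in w:
        acc = 0
        for i in range(len(x)):
            acc += row[i] * x[i]
        out.append(acc)
    return out


def matvec_builtin(w, x):
    """Kernel set B: same math, pushed into C-implemented builtins."""
    return [sum(map(operator.mul, row, x)) for row in w]


def relu_loop(v):
    out = []
    for value in v:
        out.append(value if value > 0 else 0)
    return out


def relu_builtin(v):
    return [max(value, 0) for value in v]


KERNELS_A = {"matvec": matvec_loop, "relu": relu_loop}
KERNELS_B = {"matvec": matvec_builtin, "relu": relu_builtin}


def run(graph, kernels, weights, x):
    """A minimal "engine": walk the graph and dispatch each op to a kernel."""
    for op in graph:
        x = kernels[op](weights, x) if op == "matvec" else kernels[op](x)
    return x


def timed(kernels, x, repeats=20):
    """Run the graph several times and return the last result plus elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = run(GRAPH, kernels, WEIGHTS, x)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    x = [(i % 9) - 4 for i in range(256)]
    out_a, secs_a = timed(KERNELS_A, x)
    out_b, secs_b = timed(KERNELS_B, x)
    print("same output:", out_a == out_b)
    print(f"kernel set A: {secs_a:.4f}s, kernel set B: {secs_b:.4f}s")
```

Integer weights are used so both kernel sets produce exactly equal results. The timings depend on the machine and the Python version, so treat them as an illustration of the effect, not a benchmark.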

03 · The same hardware, just swapping the software

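Proof that the engine is what's in charge: take the GPU exactly as it is and swap only the engine, and throughput (tokens/s) jumps. Two real measurements from Ahmad's guide: on 2× RTX 3090, migrating to vLLM (TP=2) took throughput from ~14.5 to ~64 tok/s (~4.4×); on the RTX PRO 6000, migrating to SGLang took it from 32 to 110 tok/s (~3.4×).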

TP=2 = tensor-parallelism across 2 GPUs. No card was added — only the engine changed. The "slow model" was, in fact, poorly exploited kernels.
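The two headline ratios can be checked directly from the before/after figures this lesson quotes (~14.5 → ~64 tok/s on the 2× RTX 3090, 32 → 110 tok/s on the RTX PRO 6000). The helper name `speedup` exists only for this sketch:

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Ratio of throughput after the engine swap to throughput before it."""
    if before_tok_s <= 0:
        raise ValueError("before_tok_s must be positive")
    return after_tok_s / before_tok_s


# Figures quoted in this lesson (approximate, from Ahmad's guide).
measurements = {
    "2x RTX 3090 -> vLLM (TP=2)": (14.5, 64.0),
    "RTX PRO 6000 -> SGLang": (32.0, 110.0),
}

for label, (before, after) in measurements.items():
    print(f"{label}: {speedup(before, after):.1f}x")
# 2x RTX 3090 -> vLLM (TP=2): 4.4x
# RTX PRO 6000 -> SGLang: 3.4x
```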

04 · The map of the territory (and what NOT to use)

Grouping the engines by target makes it clear that each was born for a scenario. And there's one non-negotiable point in Ahmad's guide: Ollama, as pleasant as it is, stays out.

⛔ Plain warning — don't use Ollama

The guide is categorical: "DO NOT USE Ollama. Ollama is pleasant yet should not be used." The convenience comes at a price: it abstracts away the engine and the kernels — exactly the layer this lesson teaches you to control. Prefer MLX on Mac, llama.cpp at the edge, or vLLM/SGLang in production.

Take care in production, even with MLX: MLX-LM's own server "is not recommended for production (basic security checks only)". It is great for development and local use; for an exposed endpoint, put a gateway in front or use a production engine.

05 · Decode vs. prefill: which kernel dominates

An inference run has two phases, and each one stresses different kernels. Knowing which phase dominates your use case tells you which lever to optimize — and even which engine to pick.

Rule of thumb: a chat or agent workload that generates long responses lives in decode → prioritize memory bandwidth (and, on the Mac, the fast unified memory). Summarization or RAG with giant prompts lives in prefill → prioritize attention kernels and chunked prefill (strong in SGLang).

Decode tracks memory bandwidth; prefill tracks attention kernels + chunked prefill. It's the same principle from Lesson 03 (bandwidth) and Lesson 06 (KV-cache) seen through the engine's lens.
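A back-of-the-envelope way to see which phase dominates, as a rough sketch. It relies on two common approximations that are assumptions here, not figures from the guide: each decoded token streams roughly all model weights from memory once (so decode time per token is about weight bytes divided by memory bandwidth), and prefill costs about 2 FLOPs per parameter per prompt token. The prefill estimate ignores the attention term that grows with prompt length, so it understates giant prompts. Function names and the example numbers are hypothetical.

```python
def decode_seconds(output_tokens: int, weight_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound estimate: every generated token re-reads the weights."""
    return output_tokens * weight_gb / bandwidth_gb_s


def prefill_seconds(prompt_tokens: int, params_billion: float, tflops: float) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    flops_needed = 2.0 * params_billion * 1e9 * prompt_tokens
    return flops_needed / (tflops * 1e12)


def dominant_phase(prompt_tokens: int, output_tokens: int, *, weight_gb: float,
                   params_billion: float, bandwidth_gb_s: float, tflops: float) -> str:
    """Return which phase the estimate says takes longer for this request shape."""
    decode = decode_seconds(output_tokens, weight_gb, bandwidth_gb_s)
    prefill = prefill_seconds(prompt_tokens, params_billion, tflops)
    return "decode" if decode >= prefill else "prefill"


# Hypothetical machine and model, chosen only to make the arithmetic easy.
hw = dict(weight_gb=20.0, params_billion=30.0, bandwidth_gb_s=400.0, tflops=30.0)

# Chat-like request: short prompt, long answer (50 s decode vs 0.4 s prefill).
print(dominant_phase(prompt_tokens=200, output_tokens=1000, **hw))    # decode
# RAG-like request: giant prompt, short answer (10 s decode vs 100 s prefill).
print(dominant_phase(prompt_tokens=50_000, output_tokens=200, **hw))  # prefill
```

Real engines deviate from these ceilings because of batching, KV-cache reads and kernel efficiency, which is exactly where the choice of engine shows up.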

06 · The complete decision guide

The section-01 tree in table form, for quick reference:

Your scenario	Engine	Why
Mac first (ML/LLM)	MLX / MLX-LM	Native Metal + unified memory; this course's path.
Laptop / edge / weird hardware	llama.cpp	Runs on almost anything, mixed CPU/GPU.
One RTX (single-GPU)	ExLlamaV2	High efficiency on a single card.
2–4+ NVIDIA GPUs	ExLlamaV3	Home multi-GPU.
General production	vLLM	Throughput, continuous batching, market standard.
Long context / MoE / routing	SGLang	RadixAttention, prefix cache, chunked prefill.
NVIDIA, maximum performance	TensorRT-LLM	Compiled for the GPU; NVIDIA-only.
Orchestrate a cluster	NVIDIA Dynamo	Multi-node orchestration.
Convenience at any cost	⛔ Ollama	Don't use. Hides the engine and the kernels.
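
Read top to bottom with the first matching row winning, the table can be sketched as an ordered rule list. The scenario keys and the `pick_engine` function are hypothetical names for this example, and using the table's row order as the tree's priority is an assumption; the guide itself may order some branches differently.

```python
# Rows of the table above, in the order they appear. Keys are illustrative labels.
# The Ollama row is left out on purpose: the guide says not to use it.
DECISION_ROWS = [
    ("mac_first", "MLX / MLX-LM"),
    ("edge_or_unusual_hardware", "llama.cpp"),
    ("single_rtx", "ExLlamaV2"),
    ("multi_nvidia_gpu", "ExLlamaV3"),
    ("general_production", "vLLM"),
    ("long_context_moe_routing", "SGLang"),
    ("nvidia_max_performance", "TensorRT-LLM"),
    ("cluster_orchestration", "NVIDIA Dynamo"),
]


def pick_engine(applies: set[str]) -> str:
    """Return the engine for the first row (in table order) whose scenario applies."""
    known = {key for key, _ in DECISION_ROWS}
    unknown = applies - known
    if unknown:
        raise ValueError(f"unknown scenario keys: {sorted(unknown)}")
    for key, engine in DECISION_ROWS:
        if key in applies:
            return engine
    raise LookupError("no row applies; the table does not cover this scenario")


# On a Mac the first row decides, whatever else is also true.
print(pick_engine({"general_production", "mac_first"}))  # MLX / MLX-LM
print(pick_engine({"long_context_moe_routing"}))          # SGLang
```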

07 · In this course: the engine follows the workload (to the extreme)

The doctrine "pick the engine by your workload" is exactly what this config does — and takes to the limit. Two workloads, two choices:

Vision (Qwen3-VL) + fast models → MLX. Mac first, so the tree decides MLX before any other question. It's the same mlx_vlm.server you saw running in the previous lessons.
The brain (DeepSeek) → DS4. Instead of a generic engine, Metal kernels custom-built for one model. It's "the engine follows the workload" taken to the extreme: when the workload is fixed and critical, a dedicated engine that extracts the most from those specific kernels is worth it.
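
As a configuration sketch, the two choices above amount to a small routing table from workload to engine. The dictionary layout and the `engine_for` helper are hypothetical; only the pairings (Qwen3-VL on MLX via `mlx_vlm.server`, DeepSeek on DS4) come from this lesson.

```python
# Workload -> engine pairings described in this lesson. The structure is illustrative.
ENGINE_BY_WORKLOAD = {
    "vision": {"model": "Qwen3-VL", "engine": "MLX", "server": "mlx_vlm.server"},
    "brain": {"model": "DeepSeek", "engine": "DS4", "server": None},
}


def engine_for(workload: str) -> str:
    """Look up the engine for a workload, failing loudly on anything unconfigured."""
    try:
        return ENGINE_BY_WORKLOAD[workload]["engine"]
    except KeyError:
        raise ValueError(f"no engine configured for workload {workload!r}") from None


print(engine_for("vision"))  # MLX
print(engine_for("brain"))   # DS4
```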

The bridge to Lesson 08

That's why the final config uses MLX for vision and DS4 for the brain: two engines, because there are two workloads. You didn't pick "the best engine" — you picked the right engine for each job. Keep that in mind; in Lesson 08 it becomes the concrete memory budget for your M5 Max.

1. On the SAME pair of RTX 3090s, throughput rose from ~14.5 to ~64 tok/s. What changed?

Correct: b. Same hardware, different engine = different kernels = ~4.4× more tok/s. On the RTX PRO 6000, migrating to SGLang went from 32 to 110 tok/s (~3.4×). The engine is what's in charge.

2. You run on a Mac and want the recommended path. What about Ollama?

Correct: c. The tree decides MLX for Mac before any other question. Ollama is categorically discouraged; TensorRT-LLM is NVIDIA-only and doesn't run on the Mac.
