The model is just a graph of weights. What decides whether it runs at 14 or 64 tokens/s on the SAME hardware is the inference engine — the scheduler, the optimizer and the kernels that execute the math. This lesson shows how to pick the engine by your workload, why the same GPU can quadruple in speed just by swapping the software, and why, deep down, you don't run a model: you run kernels.
~4.4× gain on 2× RTX 3090 when migrating to vLLM (TP=2)
~3.4× gain on the RTX PRO 6000 when migrating to SGLang
7 kinds of kernel where the work actually happens
⛔ 0 times you should use Ollama
01 · Pick the engine by your workload
There's no "best engine" in the abstract — there's the best engine for your hardware and your workload. The guide below is a decision tree: answer the first question that applies and follow the branch. Mac first? MLX. Weird or edge hardware? llama.cpp. A single RTX? ExLlamaV2. And so on, up to production and cluster.
The tree is exhaustive but you rarely go all the way down: the first true condition already decides. For a Mac (this course), the answer is MLX before any other question.
"Use MLX for Mac-first ML and LLM workflows." (Ahmad Osman, "Inference Engines for LLMs (2026)")
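The "first true condition decides" logic can be sketched as an ordered list of rules. This is a minimal illustration, not the guide's own code: the `Workload` fields and the `pick_engine` function are hypothetical names, and the rule order is assumed to follow the order in which this lesson lists the scenarios.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Workload:
    """Yes/no answers to the tree's questions (field names are illustrative)."""
    mac_first: bool = False
    edge_or_unusual_hardware: bool = False
    single_rtx: bool = False
    home_multi_gpu: bool = False
    general_production: bool = False
    long_context_moe_or_routing: bool = False
    nvidia_max_performance: bool = False
    cluster_orchestration: bool = False


# Ordered rules: the first predicate that returns True decides the engine.
# The order is an assumption taken from the lesson's list of scenarios.
RULES: List[Tuple[Callable[[Workload], bool], str]] = [
    (lambda w: w.mac_first, "MLX / MLX-LM"),
    (lambda w: w.edge_or_unusual_hardware, "llama.cpp"),
    (lambda w: w.single_rtx, "ExLlamaV2"),
    (lambda w: w.home_multi_gpu, "ExLlamaV3"),
    (lambda w: w.general_production, "vLLM"),
    (lambda w: w.long_context_moe_or_routing, "SGLang"),
    (lambda w: w.nvidia_max_performance, "TensorRT-LLM"),
    (lambda w: w.cluster_orchestration, "NVIDIA Dynamo"),
]


def pick_engine(workload: Workload) -> str:
    """Walk the rules top to bottom and return the first engine that applies."""
    for applies, engine in RULES:
        if applies(workload):
            return engine
    raise ValueError("no rule matched; describe the workload in more detail")


# A Mac user who also wants to serve in production still lands on MLX,
# because the Mac question is asked first.
print(pick_engine(Workload(mac_first=True, general_production=True)))
```

Keeping the rules in a list rather than nested `if` statements makes the priority order explicit and easy to reorder if your own tree differs.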
02 · You don't run a model. You run kernels.
This is the mental flip of the lesson. The "model" is a graph — a description of operations. What executes that graph is the inference engine: it schedules (the scheduler), optimizes and executes. But the real work — where the GPU cycles are spent — happens in the kernels: the low-level routines that compute each operation. Bad kernels make you think "this model is slow". Good kernels make you think "wait, how is this running locally?".
Why this matters in practice: the same model, the same hardware, two different engines = two different sets of kernels = different speeds. Switching engine is switching kernels — that's why the next section shows 3–4× gains without swapping a single physical part.
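To make "same graph, different kernels" concrete, here is a toy sketch in pure Python. It is not how any real engine is written: the graph format, the `run` function and the two "engines" are all hypothetical, and the two matrix-vector kernels differ only in how they loop. The point is that the graph stays the same while the routines that execute it are swapped.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec_naive(W: Matrix, x: Vector) -> Vector:
    """Kernel variant 1: explicit index loops."""
    out = []
    for row in W:
        acc = 0.0
        for j in range(len(x)):
            acc += row[j] * x[j]
        out.append(acc)
    return out


def matvec_zip(W: Matrix, x: Vector) -> Vector:
    """Kernel variant 2: same math, different implementation."""
    return [sum(a * b for a, b in zip(row, x)) for row in W]


def relu(x: Vector) -> Vector:
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


# The "model": a description of operations, with no code attached.
GRAPH: List[Tuple[str, Optional[Matrix]]] = [
    ("matvec", [[1.0, -2.0], [0.5, 0.5]]),
    ("relu", None),
]

# Two toy "engines": same operation names, different kernels behind them.
ENGINE_A: Dict[str, Callable[..., Vector]] = {"matvec": matvec_naive, "relu": relu}
ENGINE_B: Dict[str, Callable[..., Vector]] = {"matvec": matvec_zip, "relu": relu}


def run(graph, kernels, x: Vector) -> Vector:
    """Execute the graph by dispatching each operation to the engine's kernel."""
    for op, weights in graph:
        kernel = kernels[op]
        x = kernel(weights, x) if weights is not None else kernel(x)
    return x


print(run(GRAPH, ENGINE_A, [1.0, 2.0]))
print(run(GRAPH, ENGINE_B, [1.0, 2.0]))
```

Both calls produce the same numbers; on real hardware the difference between kernel sets shows up as speed, not as a different answer.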
03 · The same hardware, just swapping the software
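Proof that the engine is what's in charge: take the GPU exactly as it is and swap only the engine, and throughput (tokens/s) jumps. Two real measurements from Ahmad's guide: on 2× RTX 3090, migrating to vLLM (TP=2) took throughput from ~14.5 to ~64 tok/s (~4.4×); on the RTX PRO 6000, migrating to SGLang took it from 32 to 110 tok/s (~3.4×).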
TP=2 means tensor parallelism across 2 GPUs. No card was added; only the engine changed. The "slow model" was, in fact, poorly exploited kernels.
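The gain factors are just the ratio of throughput after to throughput before. A small sketch using the tok/s figures quoted in this lesson (the `speedup` helper is an illustrative name):

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Ratio of throughput after the engine swap to throughput before it."""
    if before_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return after_tok_s / before_tok_s


# (before, after) in tokens/s, as quoted in this lesson.
measurements = {
    "2x RTX 3090 -> vLLM (TP=2)": (14.5, 64.0),
    "RTX PRO 6000 -> SGLang": (32.0, 110.0),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {speedup(before, after):.1f}x")
```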
04 · The map of the territory (and what NOT to use)
Grouping the engines by target makes it clear that each was born for a scenario. And there's one non-negotiable point in Ahmad's guide: Ollama, as pleasant as it is, stays out.
⛔ Plain warning — don't use Ollama
The guide is categorical: "DO NOT USE Ollama. Ollama is pleasant yet should not be used." The convenience comes at a price: it abstracts away the engine and the kernels — exactly the layer this lesson teaches you to control. Prefer MLX on Mac, llama.cpp at the edge, or vLLM/SGLang in production.
Care in production, even with MLX: MLX-LM's own server "is not recommended for production (basic security checks only)". Great for development and local use; for an exposed endpoint, put a gateway in front or use a production engine.
05 · Decode vs. prefill: which kernel dominates
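An inference run has two phases, and each one stresses different kernels: prefill processes the whole prompt in one pass to build the KV-cache, and decode then generates the response one token at a time. Knowing which phase dominates your use case tells you which lever to optimize, and even which engine to pick.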
Rule of thumb: chat/agent that generates long responses lives in decode → prioritize bandwidth (and, on the Mac, the fast unified memory). Summarization/RAG with giant prompts lives in prefill → prioritize attention and chunked prefill (strong in SGLang).
Decode tracks memory bandwidth; prefill tracks attention kernels + chunked prefill. It's the same principle from Lesson 03 (bandwidth) and Lesson 06 (KV-cache) seen through the engine's lens.
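Why decode tracks bandwidth can be seen with a back-of-the-envelope estimate. At batch size 1, generating each token requires streaming roughly all the active weights from memory once, so tokens/s cannot exceed bandwidth divided by the bytes read per token. The sketch below is that rule of thumb only: it ignores KV-cache reads, batching and kernel overhead, and the numbers in the example are illustrative, not measurements.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_per_token_gb: float) -> float:
    """Upper bound on decode speed when every token streams the weights once.

    bandwidth_gb_s: memory bandwidth in GB/s.
    weights_read_per_token_gb: GB of weights touched per generated token
        (about the quantized model size for a dense model; assumption).
    """
    if bandwidth_gb_s <= 0 or weights_read_per_token_gb <= 0:
        raise ValueError("both arguments must be positive")
    return bandwidth_gb_s / weights_read_per_token_gb


# Illustrative numbers: 800 GB/s of bandwidth and 40 GB of weights per token.
print(decode_ceiling_tok_s(800.0, 40.0))  # 20.0 tok/s at best
```

Real throughput sits below this ceiling; how close an engine gets to it is one way to judge how well its decode kernels use the memory system.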
06 · The complete decision guide
The section-01 tree in table form, for quick reference:
| Your scenario | Engine | Why |
| --- | --- | --- |
| Mac first (ML/LLM) | MLX / MLX-LM | Native Metal + unified memory; this course's path. |
| Laptop / edge / weird hardware | llama.cpp | Runs on almost anything, mixed CPU/GPU. |
| One RTX (single-GPU) | ExLlamaV2 | High efficiency on a single card. |
| 2–4+ NVIDIA GPUs | ExLlamaV3 | Home multi-GPU. |
| General production | vLLM | Throughput, continuous batching, market standard. |
| Long context / MoE / routing | SGLang | RadixAttention, prefix cache, chunked prefill. |
| NVIDIA, maximum performance | TensorRT-LLM | Compiled for the GPU; NVIDIA-only. |
| Orchestrate a cluster | NVIDIA Dynamo | Multi-node orchestration. |
| Convenience at any cost | ⛔ Ollama | Don't use. Hides the engine and the kernels. |
07 · In this course: the engine follows the workload (to the extreme)
The doctrine "pick the engine by your workload" is exactly what this config does — and takes to the limit. Two workloads, two choices:
Vision (Qwen3-VL) + fast models → MLX. Mac first, so the tree decides MLX before any other question. It's the same mlx_vlm.server you saw running in the previous lessons.
The brain (DeepSeek) → DS4. Instead of a generic engine, Metal kernels custom-built for one model. It's "the engine follows the workload" taken to the extreme: when the workload is fixed and critical, a dedicated engine that extracts the most from those specific kernels is worth it.
The bridge to Lesson 08
That's why the final config uses MLX for vision and DS4 for the brain: two engines, because there are two workloads. You didn't pick "the best engine" — you picked the right engine for each job. Keep that in mind; in Lesson 08 it becomes the concrete memory budget for your M5 Max.
1. On the SAME pair of RTX 3090s, throughput rose from ~14.5 to ~64 tok/s. What changed?
Correct: b. Same hardware, different engine = different kernels = ~4.4× more tok/s. On the RTX PRO 6000, migrating to SGLang went from 32 to 110 tok/s (~3.4×). The engine is what's in charge.
2. You run on a Mac and want the recommended path. What about Ollama?
Correct: c. The tree decides MLX for Mac before any other question. Ollama is categorically discouraged; TensorRT-LLM is NVIDIA-only and doesn't run on the Mac.