Lesson 05 · The load-bearing layer

Inference engines

The model is just a graph of weights. What decides whether it runs at 14 or 64 tokens/s on the SAME hardware is the inference engine — the scheduler, the optimizer and the kernels that execute the math. This lesson shows how to pick the engine by your workload, why the same GPU can quadruple in speed just by swapping the software, and why, deep down, you don't run a model: you run kernels.

~4.4× gain on 2× RTX 3090 when migrating to vLLM (TP=2)
~3.4× gain on the RTX PRO 6000 when migrating to SGLang
7 kinds of kernel where the work actually happens
⛔ 0 times you should use Ollama

01 · Pick the engine by your workload

There's no "best engine" in the abstract — there's the best engine for your hardware and your workload. The guide below is a decision tree: answer the first question that applies and follow the branch. Mac first? MLX. Weird or edge hardware? llama.cpp. A single RTX? ExLlamaV2. And so on, up to production and cluster.

What's your workload? Answer from top to bottom:
Mac first? → MLX / MLX-LM (native Apple Silicon)
Laptop / edge / weird HW? → llama.cpp (runs on almost anything)
1 RTX (single-GPU)? → ExLlamaV2 (one card, high efficiency)
2–4+ NVIDIA? → ExLlamaV3 (home multi-GPU)
General production? → vLLM (throughput + batching)
Long ctx / MoE / routing? → SGLang (RadixAttention / cache)
NVIDIA, max perf? → TensorRT-LLM (compiled, NVIDIA-only)
Orchestrate a cluster? → NVIDIA Dynamo (multi-node orchestration)
In this course (M5 Max): Mac first → MLX (vision) + bespoke DS4 for the brain

The tree covers every scenario in the guide, but you rarely go all the way down: the first true condition decides. For a Mac (this course), the answer is MLX before any other question.
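The "first true condition decides" rule can be sketched as an ordered lookup. This is only an illustration: the condition labels (`mac_first`, `single_rtx`, and so on) and the `pick_engine` helper are hypothetical names invented for this sketch, not part of any engine's API.

```python
# Ordered (condition, engine) pairs mirroring the lesson's decision tree.
# The condition labels are hypothetical names made up for this sketch.
DECISION_TREE = [
    ("mac_first", "MLX / MLX-LM"),
    ("laptop_edge_or_weird_hw", "llama.cpp"),
    ("single_rtx", "ExLlamaV2"),
    ("multi_nvidia_2_to_4", "ExLlamaV3"),
    ("general_production", "vLLM"),
    ("long_ctx_moe_routing", "SGLang"),
    ("nvidia_max_perf", "TensorRT-LLM"),
    ("orchestrate_cluster", "NVIDIA Dynamo"),
]


def pick_engine(workload):
    """Return the engine for the first condition that holds, top to bottom."""
    for condition, engine in DECISION_TREE:
        if condition in workload:
            return engine
    return None  # no branch of the tree applies


# A Mac that also serves production traffic still resolves to MLX,
# because "Mac first?" is asked before "general production?".
print(pick_engine({"mac_first", "general_production"}))  # MLX / MLX-LM
```

Keeping the branches in a list rather than a dictionary makes the ordering explicit, which is the whole point of the tree.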

"Use MLX for Mac-first ML and LLM workflows." (Ahmad Osman, "Inference Engines for LLMs (2026)")

02 · You don't run a model. You run kernels.

This is the mental flip of the lesson. The "model" is a graph — a description of operations. What executes that graph is the inference engine: it schedules (the scheduler), optimizes and executes. But the real work — where the GPU cycles are spent — happens in the kernels: the low-level routines that compute each operation. Bad kernels make you think "this model is slow". Good kernels make you think "wait, how is this running locally?".

Model: a GRAPH of operations (weights + topology, declarative)
Inference engine: scheduler · optimizer · executor. Decides the order, fuses ops, manages memory; doesn't do the math.
Kernels (the work happens HERE): MatMul (GEMM) · Attention · RMSNorm · KV-cache · quantized-linear · sampling · fused (e.g. fused-MoE, fused-attn)
Hardware: GPU · Metal (Mac) · CUDA (NVIDIA)
Bad kernel → "this model is slow". Good kernel → "how does it run local?!"
Why this matters in practice: the same model, the same hardware, two different engines = two different sets of kernels = different speeds. Switching engine is switching kernels — that's why the next section shows 3–4× gains without swapping a single physical part.
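To make the graph / engine / kernel split concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: the two-op graph, the 2×2 weights, and the two "engines" are stand-ins, and real kernels are GPU routines rather than Python loops. The point is only that the same graph can be dispatched to different kernel sets and still produce the same result.

```python
import math

# The "model": a declarative graph, here just op names in order (toy example).
GRAPH = ["rmsnorm", "matmul"]
WEIGHTS = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical 2x2 weight matrix


def rmsnorm(x, eps=1e-6):
    # Scale the vector by the inverse of its root mean square.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matmul_naive(x, w):
    # Explicit index loops over columns and rows.
    out = [0.0] * len(w[0])
    for j in range(len(w[0])):
        for i in range(len(x)):
            out[j] += x[i] * w[i][j]
    return out


def matmul_columns(x, w):
    # Same math, different routine: one dot product per weight column.
    return [sum(a * b for a, b in zip(x, col)) for col in zip(*w)]


# Two "engines" = the same ops bound to two different kernel implementations.
ENGINE_A = {"rmsnorm": rmsnorm, "matmul": lambda x: matmul_naive(x, WEIGHTS)}
ENGINE_B = {"rmsnorm": rmsnorm, "matmul": lambda x: matmul_columns(x, WEIGHTS)}


def run(graph, kernels, x):
    """The 'engine': walks the graph and dispatches each op to a kernel."""
    for op in graph:
        x = kernels[op](x)
    return x


out_a = run(GRAPH, ENGINE_A, [3.0, 4.0])
out_b = run(GRAPH, ENGINE_B, [3.0, 4.0])
print(all(math.isclose(p, q) for p, q in zip(out_a, out_b)))  # True
```

The outputs agree because the graph is identical; only the routines doing the arithmetic differ, which is where a real engine gains or loses speed.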

03 · The same hardware, just swapping the software

Proof that the engine is what's in charge: take the GPU exactly as it is and swap only the engine. The throughput (tokens/s) spikes. Two real measurements from Ahmad's guide — bars to scale, before (gray) and after (orange):

Chart: throughput in tokens/s, before (generic engine) vs. after (right engine), zero new hardware.
2× RTX 3090, before → vLLM (TP=2): 14.5 t/s → 64 t/s (≈ 4.4× ↑)
RTX PRO 6000, before → SGLang: 32 t/s → 110 t/s (≈ 3.4× ↑)

TP=2 = tensor-parallelism across 2 GPUs. No card was added — only the engine changed. The "slow model" was, in fact, poorly exploited kernels.
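The multipliers are simply the ratio of the two throughputs. A minimal check, using only the figures quoted in this lesson (the `speedup` helper and the dictionary labels are names made up for this sketch):

```python
# Throughput figures quoted in this lesson, in tokens/s: (before, after).
MEASUREMENTS = {
    "2x RTX 3090 -> vLLM (TP=2)": (14.5, 64.0),
    "RTX PRO 6000 -> SGLang": (32.0, 110.0),
}


def speedup(before, after):
    """Ratio of the new throughput to the old one."""
    return after / before


for label, (before, after) in MEASUREMENTS.items():
    print(f"{label}: {speedup(before, after):.1f}x")
# 2x RTX 3090 -> vLLM (TP=2): 4.4x
# RTX PRO 6000 -> SGLang: 3.4x
```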

04 · The map of the territory (and what NOT to use)

Grouping the engines by target makes it clear that each was born for a scenario. And there's one non-negotiable point in Ahmad's guide: Ollama, as pleasant as it is, stays out.

Mac / Apple Silicon: MLX / MLX-LM (unified memory, native Metal) ★ this course
Single GPU / edge: llama.cpp, ExLlamaV2 (1 card / weird HW)
Multi-GPU / cluster: ExLlamaV3, NVIDIA Dynamo (2–4+ cards · orchestrate nodes)
Production: vLLM, SGLang (throughput · long ctx / MoE)
NVIDIA maximum: TensorRT-LLM (compiled, NVIDIA-only)
⛔ Ollama: DO NOT USE. "pleasant, yet should not be used." Hides the kernels and the real engine from you. Pick a real engine from the list above, not a wrapper.
⛔ Plain warning — don't use Ollama

The guide is categorical: "DO NOT USE Ollama. Ollama is pleasant yet should not be used." The convenience comes at a price: it abstracts away the engine and the kernels — exactly the layer this lesson teaches you to control. Prefer MLX on Mac, llama.cpp at the edge, or vLLM/SGLang in production.

Care in production, even with MLX: MLX-LM's own server "is not recommended for production (basic security checks only)". Great for development and local use; for an exposed endpoint, put a gateway in front or use a production engine.

05 · Decode vs. prefill: which kernel dominates

An inference run has two phases, and each one stresses different kernels. Knowing which phase dominates your use case tells you which lever to optimize — and even which engine to pick.

DECODE dominates: SHORT prompt → LONG response
1 token per step, repeatedly
Lever: MEMORY BANDWIDTH (each step re-reads the weights from RAM → tok/s tracks bandwidth)
Hot kernels: quantized-linear (re-reads weights) · KV-cache (read/write every token) · sampling

PREFILL dominates: LONG prompt → SHORT response
Huge prompt processed all at once, all prompt tokens in parallel
Lever: ATTENTION KERNELS + chunked prefill (slice the prompt into blocks)
Hot kernels: Attention (cost grows with the ctx) · MatMul (GEMM) in batch · fused-attn / chunked prefill

Rule of thumb: chat/agent that generates long responses lives in decode → prioritize bandwidth (and, on the Mac, the fast unified memory). Summarization/RAG with giant prompts lives in prefill → prioritize attention and chunked prefill (strong in SGLang).

Decode tracks memory bandwidth; prefill tracks attention kernels + chunked prefill. It's the same principle from Lesson 03 (bandwidth) and Lesson 06 (KV-cache) seen through the engine's lens.
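"Decode tracks memory bandwidth" can be turned into a rough ceiling: if every generated token has to re-read the weights (and the KV-cache) once, tokens/s cannot exceed bandwidth divided by the bytes read per token. The sketch below assumes a dense model where all weights are read each step; the function name and the numbers in the example are hypothetical, chosen to keep the arithmetic easy, and are not measurements of any device.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s, weights_gb, kv_cache_gb=0.0):
    """Upper bound on decode speed when each token re-reads weights and KV-cache once.

    Assumes a dense model (all weights touched per step). Real engines land
    below this ceiling; how far below depends on their kernels.
    """
    gb_read_per_token = weights_gb + kv_cache_gb
    if gb_read_per_token <= 0:
        raise ValueError("the model must occupy some memory")
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical numbers, not measurements.
print(decode_tokens_per_second_ceiling(400.0, 20.0))       # 20.0
print(decode_tokens_per_second_ceiling(400.0, 20.0, 5.0))  # 16.0
```

The second call shows why a growing KV-cache slows decode even when the weights stay the same size: more bytes per token, same bandwidth.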

06 · The complete decision guide

The section-01 tree in table form, for quick reference:

Your scenario | Engine | Why
Mac first (ML/LLM) | MLX / MLX-LM | Native Metal + unified memory; this course's path.
Laptop / edge / weird hardware | llama.cpp | Runs on almost anything, mixed CPU/GPU.
One RTX (single-GPU) | ExLlamaV2 | High efficiency on a single card.
2–4+ NVIDIA GPUs | ExLlamaV3 | Home multi-GPU.
General production | vLLM | Throughput, continuous batching, market standard.
Long context / MoE / routing | SGLang | RadixAttention, prefix cache, chunked prefill.
NVIDIA, maximum performance | TensorRT-LLM | Compiled for the GPU; NVIDIA-only.
Orchestrate a cluster | NVIDIA Dynamo | Multi-node orchestration.
Convenience at any cost | ⛔ Ollama | Don't use. Hides the engine and the kernels.

07 · In this course: the engine follows the workload (to the extreme)

The doctrine "pick the engine by your workload" is exactly what this config does — and takes to the limit. Two workloads, two choices:

The bridge to Lesson 08

That's why the final config uses MLX for vision and DS4 for the brain: two engines, because there are two workloads. You didn't pick "the best engine" — you picked the right engine for each job. Keep that in mind; in Lesson 08 it becomes the concrete memory budget for your M5 Max.

1. On the SAME pair of RTX 3090s, throughput rose from ~14.5 to ~64 tok/s. What changed?
Correct: b. Same hardware, different engine = different kernels = ~4.4× more tok/s. On the RTX PRO 6000, migrating to SGLang went from 32 to 110 tok/s (~3.4×). The engine is what's in charge.
2. You run on a Mac and want the recommended path. What about Ollama?
Correct: c. The tree decides MLX for Mac before any other question. Ollama is categorically discouraged; TensorRT-LLM is NVIDIA-only and doesn't run on the Mac.