Lesson 10 · Vision

Vision, carefully

Giving your model eyes is tempting — and full of traps. An image becomes tokens, the vision encoder eats memory, a single high-resolution screenshot can burn thousands of context tokens, and small VLMs hallucinate details that aren't there. This lesson shows the hidden costs and, above all, the method we followed in this session: measure 4 VLMs on the real images from the founder's corpus before trusting any of them.

122 tok/s · Qwen3-VL-30B-A3B (the winner, generation)
19.5 GB · winner's RAM (MoE, 3B active)
4 VLMs · tested on the real corpus images
Default · Qwen3-VL is now the @alembic/vision default

01 · The hidden cost: an image becomes tokens too

Vision's treacherous intuition is to think "an image is just an attachment." It isn't. In a VLM's pipeline, the image is sliced into patches, each patch passes through the vision encoder, and the result enters the context window as tokens — the exact same tokens that text consumes. A single high-resolution screenshot can cost thousands of them. The encoder, on top of that, is extra model weight loaded into memory. The diagram below follows a screenshot from disk to context.

[Figure: One image → thousands of tokens (the cost nobody sees). A 2048×1536 screenshot is sliced into patches and passed through the vision encoder (ViT + projection, extra weight in memory); the result is a stream of thousands of visual tokens that takes a slice of the context window shared with the text. Less image resolution = fewer tokens spent = more free context. Ahmad's rule: treat resolution as a cost dial, not as "bigger is better".]

The encoder is extra memory; the patches become tokens that compete for the SAME window as your prompt. Resolution is a cost dial.

"Non-text input becomes tokens too. Vision encoders add memory. Image patches consume context. A single high-resolution image can consume thousands of tokens." (Ahmad Osman, "LLMs 101 (2026)")
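The "cost dial" can be made concrete with a back-of-the-envelope estimate. The sketch below assumes each visual token covers a 28×28 pixel square; that figure is illustrative only, not a measured value for any model in this lesson, and the function names are hypothetical. Real encoders differ in patch size, merging, and resolution caps, so treat the numbers as orders of magnitude.

```python
import math


def estimate_image_tokens(width: int, height: int, pixels_per_token_side: int = 28) -> int:
    """Rough count of visual tokens for an image of the given size.

    Assumption: one token per square of `pixels_per_token_side` pixels
    (28 is an illustrative default, not taken from the lesson).
    """
    cols = math.ceil(width / pixels_per_token_side)
    rows = math.ceil(height / pixels_per_token_side)
    return cols * rows


def downscale_to_budget(width: int, height: int, max_tokens: int,
                        pixels_per_token_side: int = 28) -> tuple[int, int]:
    """Shrink the image, keeping its aspect ratio, until it fits the token budget."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    scale = 1.0
    w, h = width, height
    while estimate_image_tokens(w, h, pixels_per_token_side) > max_tokens:
        scale *= 0.9  # turn the dial down by 10% per step
        w = max(1, round(width * scale))
        h = max(1, round(height * scale))
    return w, h


full = estimate_image_tokens(2048, 1536)   # 74 * 55 = 4070 tokens
half = estimate_image_tokens(1024, 768)    # 37 * 28 = 1036 tokens
print(f"2048x1536 -> ~{full} tokens, 1024x768 -> ~{half} tokens")
print("fits 1000 tokens at:", downscale_to_budget(2048, 1536, max_tokens=1000))
```

Under this assumption, halving each side of the screenshot cuts the token cost to roughly a quarter, which is the whole argument for lowering resolution before worrying about a bigger context window.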

02 · The golden rule: evaluate with REAL samples

Small VLMs hallucinate visual details. OCR reliability varies. Charts and tables stay hard. That's why the only thing that counts for a serious document or image workflow is to test with your real samples — never trust a pretty demo. That's exactly what we did: we ran the 4 candidates on the real images from the founder's corpus and measured. The loop below is the method.

[Figure: The method: why a small VLM fools you, and how to armor the choice. A pretty demo is a misleading starting point: it hides hallucinated visual details, varying OCR confidence, and charts/tables that are still hard. Instead, take real images from the founder's corpus (screenshots, icons, text-in-image), measure accuracy + tok/s + RAM (did it read the exact text?), and repeat per candidate (here, 4 VLMs). The winner becomes the default (Qwen3-VL). The decision is measured on YOUR corpus, not on the README demo.]
The rule that isn't up for negotiation

Multimodal templates are easier to get wrong than text ones. OCR varies. Charts and tables are still hard. So: for any serious document or image flow, evaluate with real samples and do not trust a demo. The accuracy that matters is yours, on your corpus — not the model's README.
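The evaluation loop can be sketched as a small harness. This is a minimal illustration, not the script used in the session: `read_text` is a hypothetical adapter (image path in, model transcription out) that you would write for each candidate, and `expected` holds text a human has confirmed is in each image. The stub "models" at the bottom only exist to make the example runnable.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EvalResult:
    exact_matches: int
    total: int
    seconds: float

    @property
    def accuracy(self) -> float:
        return self.exact_matches / self.total if self.total else 0.0


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences are not errors."""
    return " ".join(text.split()).casefold()


def evaluate(read_text: Callable[[str], str], expected: Dict[str, str]) -> EvalResult:
    """Run one candidate over every labeled image and count exact reads."""
    start = time.perf_counter()
    hits = sum(
        normalize(read_text(path)) == normalize(truth)
        for path, truth in expected.items()
    )
    return EvalResult(hits, len(expected), time.perf_counter() - start)


# Hypothetical labels and stub candidates, standing in for real images and real VLMs.
labels = {"settings.png": "Save  File", "dialog.png": "Cancel"}
candidates = {
    "careful-stub": lambda path: {"settings.png": "save file", "dialog.png": "Cancel"}[path],
    "sloppy-stub": lambda path: "Cancel",  # always answers the same thing
}

for name, reader in candidates.items():
    result = evaluate(reader, labels)
    print(f"{name}: {result.exact_matches}/{result.total} exact reads")
```

Exact match after normalization is a deliberately strict metric: it mirrors the lesson's question "did it read the exact text?" and makes a hallucinated detail count as a plain miss.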

03 · The benchmark: 4 VLMs on the real images

Four candidates, the same corpus images, the same Mac. Each one is plotted by speed (tok/s, X axis) against memory (RAM, Y axis), so its position tells the story. Qwen3-VL-30B-A3B won: it was the fastest and the most accurate (it read the exact text inside the images), at 19.5 GB of RAM, less than half the footprint of the dense 72B.

[Figure: Benchmark this session. X axis: generation speed (tok/s, 0–120); Y axis: RAM (GB, 0–48, lower is better); bottom-right corner = best. Points: Qwen2.5-VL-72B (dense · 11 tok/s · 43.9 GB · less accurate); InternVL3-8B (light · 5.7 GB · fallback node); diffusiongemma (multimodal · 84.7 tok/s · 18.5 GB · ok); ★ Qwen3-VL-30B-A3B (MoE 3B active · 110–122 tok/s · 19.5 GB · read the exact text in the image → WON).]

X axis = speed, Y axis = memory. The winner sits in the good corner (fast and lean) AND was the most accurate — accuracy doesn't show on the axis, but it broke the tie.

| Model | Type | tok/s | RAM | Verdict |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B | MoE · 3B active | 110–122 | 19.5 GB | Won. Most accurate; read the exact text inside the image. Now the default of @alembic/vision. |
| Qwen2.5-VL-72B | dense | 11 | 43.9 GB | Slower and less accurate on an icon. Size didn't buy accuracy. |
| diffusiongemma | dense · multimodal | 84.7 | 18.5 GB | Ok. Genuinely multimodal; runs via mlx-vlm from the main branch. |
| InternVL3-8B | dense | (not given) | 5.7 GB | Light. Good fallback node when headroom gets tight. |
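One way to read the speed × memory numbers is as a Pareto comparison: a candidate is ruled out if another one is at least as fast and at least as lean. The sketch below uses only the figures listed above, takes the low end of Qwen3-VL's 110–122 range, and leaves InternVL3-8B out because no tok/s is given for it. The `Candidate` type and function names are made up for this illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    tok_per_s: float  # higher is better
    ram_gb: float     # lower is better


# Numbers from the benchmark table; Qwen3-VL uses the low end of its range.
CANDIDATES = [
    Candidate("Qwen3-VL-30B-A3B", 110.0, 19.5),
    Candidate("Qwen2.5-VL-72B", 11.0, 43.9),
    Candidate("diffusiongemma", 84.7, 18.5),
]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.tok_per_s >= b.tok_per_s and a.ram_gb <= b.ram_gb
    strictly_better = a.tok_per_s > b.tok_per_s or a.ram_gb < b.ram_gb
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


for c in pareto_front(CANDIDATES):
    print(f"{c.name}: {c.tok_per_s} tok/s, {c.ram_gb} GB")
```

On these numbers the dense 72B drops out, while Qwen3-VL and diffusiongemma both survive (diffusiongemma is 1 GB leaner, Qwen3-VL is faster). Speed and memory alone do not pick a winner here, which is why accuracy on the real images had to decide.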

04 · Why the few-active MoE beat the giant dense model

The result repeats the brain's lesson (Lesson 04): few active parameters beat many dense parameters. Qwen3-VL-30B-A3B has 30B parameters but only ~3B active per token, so it generates fast (110–122 tok/s); all 30B still have to sit in memory, which accounts for most of its 19.5 GB. Qwen2.5-VL-72B is dense: each token pays the full 72B, hence 11 tok/s and 43.9 GB. In vision, as in text, sparsity is what lets a model of this size run at small-model speed on a unified-memory Mac.

[Figure: ★ Qwen3-VL-30B-A3B (MoE): 30B on disk, only ~3B active per token; the router lights up 1 of N experts while the rest sleep at zero cost per token → 110–122 tok/s · 19.5 GB · fits with room to spare. Qwen2.5-VL-72B (dense): 72B on disk, ALL active per token; each token pays the whole model → 11 tok/s · 43.9 GB · slow and heavy.]
Same physics as Lesson 04: what matters for speed on a unified-memory Mac are the active parameters, not the total ones. A 30B/3B-active MoE reads the image with the intelligence of a large model and the speed of a small one. That's why it became the default of @alembic/vision.
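The active-versus-total argument reduces to a few ratios. The snippet below only rearranges numbers already stated in this lesson; the function names are illustrative, and the comparison is a rough sanity check rather than a performance model.

```python
# Parameter counts (billions) and measured speeds, all taken from this lesson.
MOE_TOTAL_B, MOE_ACTIVE_B = 30.0, 3.0
DENSE_TOTAL_B = 72.0
MOE_TOK_S = (110.0, 122.0)   # measured range for Qwen3-VL-30B-A3B
DENSE_TOK_S = 11.0           # measured for Qwen2.5-VL-72B


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that work on each token."""
    return active_b / total_b


def per_token_param_ratio(dense_b: float, moe_active_b: float) -> float:
    """How many times more parameters the dense model uses per token."""
    return dense_b / moe_active_b


def measured_speedup(moe_tok_s: tuple, dense_tok_s: float) -> tuple:
    """Observed speed advantage of the MoE, as a (low, high) range."""
    return tuple(s / dense_tok_s for s in moe_tok_s)


print(f"MoE active fraction: {active_fraction(MOE_ACTIVE_B, MOE_TOTAL_B):.0%}")
print(f"Dense params per token vs MoE: {per_token_param_ratio(DENSE_TOTAL_B, MOE_ACTIVE_B):.0f}x")
low, high = measured_speedup(MOE_TOK_S, DENSE_TOK_S)
print(f"Measured speedup: {low:.1f}x to {high:.1f}x")
```

The dense model uses 24 times more parameters per token, while the measured gap is about 10 to 11 times. The direction matches the active-parameter story; the gap is smaller than the raw ratio, presumably because per-token cost is not made up of weight reads alone.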

05 · Fitting vision into the budget: swap on demand

The brain (DeepSeek q2) already takes ~81 GB. Keeping vision co-resident all the time crowds the machine's 128 GB (Lesson 08). The way out is swap on demand: DeepSeek serves normally; when an image arrives, you pause DeepSeek (DS4 in the diagram) to free the 81 GB, load Qwen3-VL (20 GB), describe the image, and bring DeepSeek back. The sequence below shows the RAM peak at each step; it never crosses the 102 GB safe ceiling.

[Figure: Swap on demand, RAM per step (safe ceiling 102 GB)]
1 · DS4 serving: brain on at :8000, DeepSeek q2 81 GB resident, peak ~99 GB ✓
2 · Image arrives: vision task in the queue, DeepSeek still on, decides to pause
3 · Pause DS4: frees 81 GB, RAM almost free, only macOS (~18 GB), peak ~18 GB ✓
4 · Load Qwen3-VL: 20 GB, describes the image at 110–122 tok/s, peak ~38 GB ✓
5 · Restart DS4: mmap reload in seconds, then back to step 1
At no step does RAM cross the ceiling: the brain and vision are never large at the same time. Cost: a few seconds of DS4 reload per swap. Benefit: zero contention, zero OOM, inside 128 GB.
Why it works: the brain (81 GB) and vision (20 GB) never coexist at full size. At any moment only one of them is large, so the real peak stays below the safe 102 GB (about 99 GB at worst, while DeepSeek is resident). It's Lesson 08's default mode applied to sporadic vision: you trade seconds of reload for total stability.
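The budget behind the swap can be checked with simple addition. The sketch below uses the figures from this section (macOS ~18 GB, DeepSeek 81 GB, Qwen3-VL 20 GB, safe ceiling 102 GB); the step names and helper functions are illustrative, and real peaks will vary with whatever else is running.

```python
SAFE_CEILING_GB = 102
MACOS_GB = 18       # baseline when no model is loaded
DEEPSEEK_GB = 81    # the brain (DeepSeek q2)
QWEN3_VL_GB = 20    # the vision model

# What is resident in RAM at each step of the swap.
SWAP_STEPS = {
    "1 DS4 serving": [MACOS_GB, DEEPSEEK_GB],
    "3 DS4 paused": [MACOS_GB],
    "4 Qwen3-VL loaded": [MACOS_GB, QWEN3_VL_GB],
}

# The alternative the lesson avoids: both models resident at once.
CO_RESIDENT = [MACOS_GB, DEEPSEEK_GB, QWEN3_VL_GB]


def peak_gb(resident: list) -> int:
    """Total RAM in use when everything in `resident` is loaded."""
    return sum(resident)


def fits(resident: list, ceiling_gb: int = SAFE_CEILING_GB) -> bool:
    """True if the resident set stays at or below the safe ceiling."""
    return peak_gb(resident) <= ceiling_gb


for step, resident in SWAP_STEPS.items():
    status = "ok" if fits(resident) else "OVER"
    print(f"{step}: {peak_gb(resident)} GB ({status})")
print(f"co-resident: {peak_gb(CO_RESIDENT)} GB "
      f"({'ok' if fits(CO_RESIDENT) else 'OVER'})")
```

Keeping both models loaded would need about 119 GB, which is over the 102 GB safe ceiling even though it is under the 128 GB total. Swapping keeps the worst step at 99 GB.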
1. Why is a single high-resolution image expensive for a VLM's context?
Answer: Non-text input becomes tokens too; the encoder is also extra weight in memory. More resolution = more tokens spent = less free context. Treat resolution as a cost dial (Ahmad, "LLMs 101").
2. In this session's benchmark, why did Qwen3-VL-30B-A3B beat Qwen2.5-VL-72B (dense)?
Answer: Same physics as Lesson 04: ACTIVE parameters count, not total ones. The dense 72B pays the whole model per token (11 tok/s, 43.9 GB) and still got an icon wrong. Size didn't buy accuracy, and the decision came from measuring on the real corpus, not from trusting a demo.