Giving your model eyes is tempting — and full of traps. An image becomes tokens, the vision encoder eats memory, a single high-resolution screenshot can burn thousands of context tokens, and small VLMs hallucinate details that aren't there. This lesson shows the hidden costs and, above all, the method we followed in this session: measure 4 VLMs on the real images from the founder's corpus before trusting any of them.
122 tok/s · Qwen3-VL-30B-A3B (the winner, generation)
19.5 GB · winner's RAM (MoE, 3B active)
4 VLMs · tested on the real corpus images
default · Qwen3-VL is now the @alembic/vision default
01 · The hidden cost: an image becomes tokens too
Vision's treacherous intuition is to think "an image is just an attachment." It isn't. In a VLM's pipeline, the image is sliced into patches, each patch passes through the vision encoder, and the result enters the context window as tokens — the exact same tokens that text consumes. A single high-resolution screenshot can cost thousands of them. The encoder, on top of that, is extra model weight loaded into memory. The diagram below follows a screenshot from disk to context.
The encoder is extra memory; the patches become tokens that compete for the SAME window as your prompt. Resolution is a cost dial.
"Non-text input becomes tokens too. Vision encoders add memory. Image patches consume context. A single high-resolution image can consume thousands of tokens." (Ahmad Osman, "LLMs 101 (2026)")
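To put a rough number on that dial, here is a minimal sketch of the patch arithmetic. The 14-pixel patch and the 2×2 merge of neighbouring patches into one token are assumptions chosen for illustration, and `estimate_image_tokens` is a hypothetical helper. Real encoders differ in patch size, resizing and tiling, so check your model's processor for the true count.

```python
import math


def estimate_image_tokens(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Rough token count for one image (illustrative assumptions, not a model spec)."""
    # Slice the image into a grid of patch x patch squares.
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    # Assume merge x merge neighbouring patches collapse into one context token.
    return math.ceil(cols / merge) * math.ceil(rows / merge)


if __name__ == "__main__":
    for w, h in [(1920, 1080), (960, 540)]:
        print(f"{w}x{h}: ~{estimate_image_tokens(w, h)} tokens")
```

Under these assumptions a 1920×1080 screenshot costs about 2,691 tokens, and halving each side drops it to about 700. That is the sense in which resolution is a cost dial.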
02 · The golden rule: evaluate with REAL samples
Small VLMs hallucinate visual details. OCR reliability varies. Charts and tables stay hard. That's why the only thing that counts for a serious document or image workflow is to test with your real samples — never trust a pretty demo. That's exactly what we did: we ran the 4 candidates on the real images from the founder's corpus and measured. The loop below is the method.
The rule that isn't up for negotiation
Multimodal templates are easier to get wrong than text ones. OCR varies. Charts and tables are still hard. So: for any serious document or image flow, evaluate with real samples and do not trust a demo. The accuracy that matters is yours, on your corpus — not the model's README.
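The loop can be sketched as a small harness. This is not the code used in the session: `evaluate`, `normalize` and the pass criterion (the expected text must appear in the model's answer) are assumptions for illustration, and `describe` stands for whatever function calls your VLM on one image path.

```python
from typing import Callable, Iterable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not counted as an error."""
    return " ".join(text.lower().split())


def evaluate(
    describe: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> Tuple[float, List[str]]:
    """Return (accuracy, failed image paths) for one model on labelled samples."""
    failures: List[str] = []
    total = 0
    for image_path, expected_text in samples:
        total += 1
        answer = describe(image_path)
        # A sample passes only if the expected text appears in the model's answer.
        if normalize(expected_text) not in normalize(answer):
            failures.append(image_path)
    accuracy = (total - len(failures)) / total if total else 0.0
    return accuracy, failures


if __name__ == "__main__":
    # Stand-in model and made-up samples, only to show the loop running.
    canned = {"a.png": "The button says  Save Draft", "b.png": "A blue icon"}
    samples = [("a.png", "save draft"), ("b.png", "export")]
    print(evaluate(lambda path: canned[path], samples))
```

Run the same `samples` through every candidate and compare both the accuracy and the list of failures. The failures tell you which kinds of image a model cannot be trusted with.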
03 · The benchmark: 4 VLMs on the real images
Four candidates, the same corpus images, the same Mac. We plot each one by speed (tok/s, X axis) against memory (RAM, Y axis); its position tells the story. Qwen3-VL-30B-A3B won: it was the fastest and the most accurate (it read the exact text inside the images), and it needed only 19.5 GB.
X axis = speed, Y axis = memory. The winner sits in the good corner (fast and lean) and was also the most accurate. Accuracy doesn't show on either axis, but it broke the tie.
| Model | Type | tok/s | RAM | Verdict |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B ★ | MoE · 3B active | 110–122 | 19.5 GB | Won. Most accurate; read the exact text inside the image. Now the default of @alembic/vision. |
| Qwen2.5-VL-72B | dense | 11 | 43.9 GB | Slower and less accurate on an icon. Size didn't buy accuracy. |
| diffusiongemma | dense · multimodal | 84.7 | 18.5 GB | Ok. Genuinely multimodal; runs via mlx-vlm from the main branch. |
| InternVL3-8B | dense | — | 5.7 GB | Light. Good fallback node when headroom gets tight. |
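The table can also be read as a lookup: given the memory that is free right now, which candidate fits? The sketch below is one hypothetical way to encode that, not the selection logic of @alembic/vision. `Candidate` and `pick_vlm` are illustrative names, the rule (fastest measured model that fits, otherwise the lightest) is an assumption, and it deliberately ignores accuracy, which is what actually decided the benchmark.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    tok_per_s: Optional[float]  # None where the table has no measurement
    ram_gb: float


# Figures copied from the table above; 110 is the low end of the 110-122 range.
CANDIDATES: List[Candidate] = [
    Candidate("Qwen3-VL-30B-A3B", 110.0, 19.5),
    Candidate("Qwen2.5-VL-72B", 11.0, 43.9),
    Candidate("diffusiongemma", 84.7, 18.5),
    Candidate("InternVL3-8B", None, 5.7),
]


def pick_vlm(free_gb: float) -> Optional[Candidate]:
    """Fastest measured model that fits in free_gb, else the lightest that fits."""
    fitting = [c for c in CANDIDATES if c.ram_gb <= free_gb]
    if not fitting:
        return None
    measured = [c for c in fitting if c.tok_per_s is not None]
    if measured:
        return max(measured, key=lambda c: c.tok_per_s)
    # Nothing with a measured speed fits: fall back to the smallest footprint.
    return min(fitting, key=lambda c: c.ram_gb)


if __name__ == "__main__":
    for free in (25.0, 10.0, 4.0):
        choice = pick_vlm(free)
        print(free, "GB free ->", choice.name if choice else "nothing fits")
```

With 25 GB free this returns Qwen3-VL-30B-A3B; with 10 GB free only InternVL3-8B fits, which matches its role as the fallback when headroom gets tight.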
04 · Why the few-active MoE beat the giant dense model
The result repeats the brain's lesson (Lesson 04): few active parameters beat many dense parameters. Qwen3-VL-30B-A3B has 30B parameters but only ~3B active per token, so it generates fast (110–122 tok/s). Its 19.5 GB footprint is set by the 30B total, not by the 3B active, and is still less than half of the dense model's. Qwen2.5-VL-72B is dense: each token pays the full 72B, hence 11 tok/s and 43.9 GB. In vision, as in text, sparsity is what makes a large model run fast on a unified-memory Mac.
Same physics as Lesson 04: what matters for speed on a unified-memory Mac are the active parameters, not the total ones. A 30B/3B-active MoE reads the image with the intelligence of a large model and the speed of a small one. That's why it became the default of @alembic/vision.
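The ratios behind that claim are simple arithmetic on the figures already quoted. The sketch below only restates them; the dictionary and function names are illustrative, and the dense model is treated as having all 72B parameters active per token.

```python
from typing import Dict, Tuple

# Parameter counts in billions, as stated in the lesson.
TOTAL_B: Dict[str, int] = {"Qwen3-VL-30B-A3B": 30, "Qwen2.5-VL-72B": 72}
ACTIVE_B: Dict[str, int] = {"Qwen3-VL-30B-A3B": 3, "Qwen2.5-VL-72B": 72}
# Measured generation speed as a (low, high) range in tok/s.
TOK_PER_S: Dict[str, Tuple[int, int]] = {
    "Qwen3-VL-30B-A3B": (110, 122),
    "Qwen2.5-VL-72B": (11, 11),
}


def active_fraction(model: str) -> float:
    """Share of the model's parameters used for each token."""
    return ACTIVE_B[model] / TOTAL_B[model]


def active_param_ratio(dense: str, moe: str) -> float:
    """How many times more parameters the dense model touches per token."""
    return ACTIVE_B[dense] / ACTIVE_B[moe]


def measured_speedup(fast: str, slow: str) -> Tuple[float, float]:
    """(low, high) speed ratio from the measured tok/s ranges."""
    fast_lo, fast_hi = TOK_PER_S[fast]
    slow_lo, slow_hi = TOK_PER_S[slow]
    return fast_lo / slow_hi, fast_hi / slow_lo


if __name__ == "__main__":
    moe, dense = "Qwen3-VL-30B-A3B", "Qwen2.5-VL-72B"
    print(f"active fraction of {moe}: {active_fraction(moe):.0%}")
    print(f"active parameters, dense vs MoE: {active_param_ratio(dense, moe):.0f}x")
    lo, hi = measured_speedup(moe, dense)
    print(f"measured speedup: {lo:.1f}x to {hi:.1f}x")
```

The dense model touches 24 times more parameters per token, while the measured gap is roughly 10 to 11 times. The sketch reports both ratios and makes no claim about where the difference goes.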
05 · Fitting vision into the budget: swap on demand
The brain (DeepSeek q2) already takes ~81 GB. Keeping vision co-resident all the time crowds the 128 GB (Lesson 08). The way out is swap on demand: DeepSeek serves normally; when an image arrives, you pause DS4 to free the 81 GB, load Qwen3-VL (~20 GB), describe the image, unload it, and DeepSeek comes back. The sequence below shows the RAM peak at each step; it never crosses the ceiling.
Why it works: the brain (81 GB) and vision (20 GB) never coexist at full size. At any moment only one of them is large, so the real peak stays well below the safe 102 GB. It's Lesson 08's default mode applied to sporadic vision — you trade seconds of reload for total stability.
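The sequence can be checked with a few lines of bookkeeping. This is a simulation of the memory ledger only, using the lesson's rounded figures (81 GB, 20 GB, 102 GB); it does not start or stop any real server, and `swap_on_demand_trace` is a hypothetical name.

```python
from typing import Dict, List, Tuple

SAFE_CEILING_GB = 102  # safe limit quoted in the lesson
BRAIN_GB = 81          # DeepSeek q2, approximate
VISION_GB = 20         # Qwen3-VL, rounded as in the lesson


def swap_on_demand_trace() -> List[Tuple[str, int]]:
    """Resident RAM (GB) after each step of one image request."""
    resident: Dict[str, int] = {"brain": BRAIN_GB}
    trace: List[Tuple[str, int]] = []

    def record(step: str) -> None:
        trace.append((step, sum(resident.values())))

    record("brain serving")
    resident.pop("brain")            # pause the brain and free its memory
    record("brain paused")
    resident["vision"] = VISION_GB   # load the VLM and describe the image
    record("vision loaded, describing image")
    resident.pop("vision")           # unload the VLM
    record("vision unloaded")
    resident["brain"] = BRAIN_GB     # reload the brain
    record("brain back")
    return trace


if __name__ == "__main__":
    trace = swap_on_demand_trace()
    for step, gb in trace:
        print(f"{step:<34}{gb:>4} GB")
    peak = max(gb for _, gb in trace)
    print(f"peak {peak} GB, co-resident would be {BRAIN_GB + VISION_GB} GB, "
          f"ceiling {SAFE_CEILING_GB} GB")
```

With these figures the swap peaks at 81 GB. Keeping both models resident would sit at 101 GB, only 1 GB under the 102 GB ceiling before any other memory use is counted.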
1. Why is a single high-resolution image expensive for a VLM's context?
Correct: c. Non-text input becomes tokens too; the encoder is also extra weight in memory. More resolution = more tokens spent = less free context. Treat resolution as a cost dial (Ahmad Osman, "LLMs 101 (2026)").
2. In this session's benchmark, why did the Qwen3-VL-30B-A3B beat the Qwen2.5-VL-72B (dense)?
Correct: b. Same physics as Lesson 04: ACTIVE parameters count, not total ones. The dense 72B pays the whole model per token (11 tok/s, 43.9 GB) and still got an icon wrong. Size didn't buy accuracy — and the decision came from measuring on the real corpus, not from trusting a demo.