Running Gemma 4 26B on a 2016 Xeon with no GPU — and making it fast
A recycled server with a 2016 Intel Xeon E5-2620 v4, 128 GB of DDR3 RAM, and no GPU has no business running a 26B-parameter mixture-of-experts model — but with the right flags on ik_llama.cpp, it does. The author lays out why off-the-shelf tools like ollama or stock llama.cpp won’t cut it: LLM decoding is memory-bandwidth-bound, and naive configs leave most of the available optimizations on the table. On hardware this slow, every lever matters.
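To make "memory-bandwidth-bound" concrete, here is a back-of-envelope sketch of the ceiling it implies: if each decoded token has to stream the active weights from RAM once, tokens per second cannot exceed bandwidth divided by bytes read per token. The function name and both numbers below are illustrative assumptions, not measurements or figures from the article.

```python
def decode_tokens_per_second_ceiling(active_weight_bytes: float,
                                     memory_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when each token streams the active weights once.

    Ignores compute, cache hits, and KV-cache traffic, so real throughput
    will sit below this number.
    """
    if active_weight_bytes <= 0 or memory_bandwidth_bytes_per_s <= 0:
        raise ValueError("both inputs must be positive")
    return memory_bandwidth_bytes_per_s / active_weight_bytes


# Hypothetical inputs chosen only to show the arithmetic.
ASSUMED_BANDWIDTH_BYTES_PER_S = 40e9   # assumed sustained RAM bandwidth
ASSUMED_ACTIVE_WEIGHT_BYTES = 4e9      # assumed bytes touched per token (MoE: active experts only)

ceiling = decode_tokens_per_second_ceiling(ASSUMED_ACTIVE_WEIGHT_BYTES,
                                           ASSUMED_BANDWIDTH_BYTES_PER_S)
print(f"ceiling: {ceiling:.1f} tokens/s")
```

Under this simple model, only two things move the ceiling: reading fewer bytes per token, or getting more than one token out of each pass over the weights.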
The core tricks are speculative decoding (pairing the 26B verifier with a tiny MTP drafter whose working set fits in L3 cache, with autotuned chain length), MoE-aware CPU routing that keeps expert weights resident in cache to avoid thrashing, and fusing the up- and gate-projections into a single matmul so the layer input crosses the memory bus once instead of twice. Thread count is set to the number of physical cores rather than SMT logical cores, because oversubscription only adds scheduling overhead on a memory-bound workload. Runtime weight repacking rearranges the tensors into the memory layout the CPU kernels consume most efficiently, and mlock pins the 27 GB of weights in physical RAM so the kernel can’t page them to disk mid-inference.
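The draft-and-verify loop behind speculative decoding can be sketched in a few lines. This is a minimal greedy version with toy stand-in models, not ik_llama.cpp's implementation: the `NextToken` interface, `speculative_decode`, and both toy functions are hypothetical names introduced for illustration. A real runtime scores all draft positions in one batched verifier pass; the sketch calls the verifier per position but counts one pass per cycle.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given the tokens so far, return the greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(verifier: NextToken, drafter: NextToken, prompt: List[int],
                       max_new_tokens: int, chain_length: int) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of verifier passes)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    verifier_passes = 0
    while len(tokens) < target_len:
        # 1. The cheap drafter proposes a short chain of tokens.
        draft: List[int] = []
        for _ in range(chain_length):
            draft.append(drafter(tokens + draft))
        # 2. The verifier checks the chain (one batched pass in a real runtime).
        verifier_passes += 1
        accepted: List[int] = []
        for proposed in draft:
            expected = verifier(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # first mismatch: keep the verifier's token
                break
            accepted.append(proposed)
        else:
            # Whole chain accepted: the same pass also yields one bonus token.
            accepted.append(verifier(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len], verifier_passes


def toy_verifier(context: List[int]) -> int:
    return (context[-1] * 3 + 1) % 17


def toy_drafter(context: List[int]) -> int:
    # Agrees with the verifier except when the last token is a multiple of 5.
    guess = toy_verifier(context)
    return (guess + 1) % 17 if context[-1] % 5 == 0 else guess


output, passes = speculative_decode(toy_verifier, toy_drafter, [1],
                                    max_new_tokens=16, chain_length=4)
print(output, passes)
```

The output is identical to what the verifier alone would produce under greedy decoding; only the number of verifier passes changes, and on a memory-bound machine that count is what determines speed.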
The broader point: speculative decoding is a stronger win on CPUs than on GPUs, because CPU compute is cheap relative to the cost of streaming verifier weights through cache, so cycles spent on a small drafter buy accepted tokens almost for free. Black-box runners hide all of this. Running modern models well on aging silicon means understanding each flag, watching the logs to confirm it took effect, and treating the memory wall as the real adversary.
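The "almost for free" argument can be checked with a simple cost model. This is a sketch under two stated assumptions: each draft token is accepted independently with the same probability, and verifying a whole chain costs about the same as one ordinary decode step because the weights are streamed once either way. The function names are hypothetical, and `best_chain_length` is only a stand-in for whatever autotuning the article's setup performs.

```python
def expected_tokens_per_pass(acceptance_rate: float, chain_length: int) -> float:
    """Expected tokens emitted per verifier pass: 1 + a + a^2 + ... + a^k."""
    return sum(acceptance_rate ** i for i in range(chain_length + 1))


def estimated_speedup(acceptance_rate: float, chain_length: int,
                      drafter_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    drafter_cost_ratio = (time for one drafter step) / (time for one verifier pass).
    """
    cost_per_cycle = 1.0 + chain_length * drafter_cost_ratio
    return expected_tokens_per_pass(acceptance_rate, chain_length) / cost_per_cycle


def best_chain_length(acceptance_rate: float, drafter_cost_ratio: float,
                      max_chain_length: int = 16) -> int:
    """Chain length in [0, max_chain_length] that maximizes the estimated speedup."""
    return max(range(max_chain_length + 1),
               key=lambda k: estimated_speedup(acceptance_rate, k, drafter_cost_ratio))


# Hypothetical inputs: 80% acceptance, drafter step costs 5% of a verifier pass.
k = best_chain_length(0.8, 0.05)
print(k, round(estimated_speedup(0.8, k, 0.05), 2))
```

The smaller the drafter's cost ratio, the longer the chain that pays off, which is why a drafter that fits in cache pairs well with a verifier that has to be streamed from RAM.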
This is an AI-generated summary. Read the original for the full story.