
Tiny-vLLM: Build a CUDA LLM Inference Engine from Scratch in C++

· via Hacker News

Original source

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA


Tiny-vLLM is an open-source project that pairs a working LLM inference server with a step-by-step course teaching readers how to build one themselves. The engine loads Llama 3.2 1B Instruct from Safetensors and runs a full forward pass entirely through custom CUDA kernels, implementing the techniques that make production serving fast: KV cache, static and continuous batching, online softmax with FlashAttention-style computation, and PagedAttention.
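To make the online softmax idea concrete, here is a minimal CPU-side sketch in plain C++. It is not taken from the Tiny-vLLM source: the function name `online_softmax` and the use of `std::vector<float>` are assumptions for illustration, and a real kernel would run this recurrence per attention row on the GPU. The loop tracks a running maximum and a running sum, rescaling the sum whenever the maximum grows, so the normalizer is known after a single pass over the scores.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not from the project): numerically stable softmax
// using the online recurrence. Assumes all inputs are finite.
std::vector<float> online_softmax(const std::vector<float>& scores) {
    assert(!scores.empty());

    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;

    // Single pass: update the max and rescale the partial sum together.
    for (float s : scores) {
        const float new_max = std::max(running_max, s);
        // Rescale the old sum to the new max, then add the new term.
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(s - new_max);
        running_max = new_max;
    }

    // Normalize using the final max and sum.
    std::vector<float> probs(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - running_max) / running_sum;
    }
    return probs;
}
```

Because the maximum and the sum are updated together, the scores can be consumed block by block without first scanning the whole row for its maximum, which is the property FlashAttention-style kernels rely on.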

The accompanying curriculum derives the underlying math and systems concepts from scratch, covering bfloat16 numerics, RMSNorm and parallel reduction, RoPE, GQA, causal masking, cuBLAS GEMM and the tricks for using its column-major layout with row-major data, and the prefill-versus-decode split that motivates the KV cache. The author frames C++ and CUDA as the right tools because LLM workloads reduce to large volumes of matrix multiplication that demand GPU parallelism.
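As one example of the math the curriculum covers, RMSNorm can be sketched in a few lines of plain C++. This is an illustrative CPU version under stated assumptions, not the project's kernel: the name `rms_norm`, the float32 vectors, and the default `eps` of 1e-5 are all hypothetical choices. Each element is divided by the root mean square of the vector and multiplied by a learned per-channel weight.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the project):
// y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-5f) {
    assert(!x.empty() && x.size() == weight.size());

    // Sum of squares. This sequential loop is the step a CUDA kernel
    // would typically implement as a parallel reduction.
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }

    const float mean_sq = sum_sq / static_cast<float>(x.size());
    const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

    // Scale every element by the shared factor and its own weight.
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * inv_rms * weight[i];
    }
    return y;
}
```

The sum of squares is the only step that needs every element of the vector, which is why RMSNorm pairs naturally with a lesson on parallel reduction.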

Training and model design are explicitly out of scope, with pointers to Karpathy’s nanoGPT and llm.c, tinygrad, and GPU MODE for adjacent territory. The project targets NVIDIA hardware on Linux with CUDA 13.1 and invites readers to fork, adapt build paths, and upstream fixes, positioning itself as both a learning resource and a teaching aid for university courses.


This is an AI-generated summary. Read the original for the full story.