Go from zero understanding of LLMs to reading model code and reasoning about inference performance. A strict, linear curriculum for engineers who want practical understanding, not hand-waving.
By the end, you will have a practical, systems-level understanding of how large language models work at inference time.
Define tokens, embeddings, hidden states, logits, softmax, attention, FFN, MoE, KV cache, prefill, and decode — without bluffing.
Draw a decoder transformer block from memory. Explain what Q, K, and V do. Distinguish dense attention, GQA, and MoE.
Read model code without panic. Track tensor shapes through projections, attention, and FFN.
Explain why prefill and decode are different workloads. Explain what the KV cache stores and why it matters. Explain batching policy.
Walk through llama.cpp model builders and connect theory to implementation artifact by artifact.
Look at a profile and produce a plausible first hypothesis. Know when a speedup needs validation.
9 modules. 48 lessons. One strict, linear sequence.
Ready?
Start the Course