Go from no exposure to LLM internals to reading model code and reasoning about inference performance. A strict, linear curriculum for engineers who want real understanding, not hand-waving.
By the end, you will have a practical, systems-level understanding of how large language models work at inference time.
Define tokens, embeddings, hidden states, logits, softmax, attention, FFN, MoE, KV cache, prefill, and decode — without bluffing.
Draw a decoder transformer block from memory. Explain what Q, K, and V do. Distinguish dense, GQA, and MoE variants.
Read model code without panic. Track tensor shapes through projections, attention, and FFN.
Explain why prefill and decode are different workloads. Explain the KV cache. Explain batching policy.
Walk through llama.cpp model builders and connect theory to implementation artifact by artifact.
Look at a profile and produce a plausible first hypothesis. Know when a speedup needs validation.
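For a taste of the level of detail, here is a minimal sketch of the KV cache idea in pure Python. It is a toy, not code from the course or from llama.cpp: the names `KVCache`, `prefill`, `decode_step`, and `project` are hypothetical, the Q/K/V projections are assumed to be the identity, and there is a single attention head with no batching.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def project(x):
    """Toy Q, K, V projections (assumption: identity, no learned weights)."""
    return x, x, x


class KVCache:
    """Stores the key and value vector of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def prefill(cache, prompt_vectors):
    """Process the whole prompt and fill the cache.

    Real engines do this for all positions in parallel; the loop here
    only keeps the sketch short. Each position attends to itself and
    to earlier positions (causal attention).
    """
    out = None
    for x in prompt_vectors:
        q, k, v = project(x)
        cache.append(k, v)
        out = attend(q, cache.keys, cache.values)
    return out


def decode_step(cache, x):
    """Process one new token: compute its K/V once, reuse all cached K/V."""
    q, k, v = project(x)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


cache = KVCache()
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prefill(cache, prompt)                      # many tokens in one pass
step_out = decode_step(cache, [0.5, -0.5])  # one token per step
print(len(cache.keys), step_out)
```

Prefill touches every prompt token in one pass, while each decode step computes keys and values for a single new token and rereads everything already cached. That asymmetry is why the two phases behave like different workloads.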
9 modules. 56 guided steps: 51 lessons, 4 quizzes, 1 capstone.
Ready?
Start the Course