What This Course Teaches
A large language model receives token IDs derived from text and produces logits: raw scores over the vocabulary for what token should come next. A decoding policy then turns those scores into an actual chosen token. Everything else — the architecture, the math, the engineering — is machinery for producing good scores quickly and reliably.
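A minimal sketch of the split between scores and choice described above, in Python. The five-entry vocabulary and the logit values are made up for illustration, and greedy selection (take the highest score) is only one possible decoding policy. The softmax helper, which turns raw scores into probabilities, is covered properly in the Probability module.

```python
import math

# Hypothetical 5-entry vocabulary; real vocabularies have tens of thousands of entries.
VOCAB = [" on", " the", " mat", " down", "."]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(logits):
    """One simple decoding policy: pick the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Made-up logits standing in for real model output: one score per vocabulary entry.
logits = [3.2, 1.1, 0.3, 2.0, -1.0]

probs = softmax(logits)          # probabilities, same ordering as the logits
next_id = greedy_decode(logits)  # index of the chosen token
next_token = VOCAB[next_id]      # the chosen token as text
```

The model's job ends at `logits`. Everything after that line belongs to the decoding policy, which could be swapped for a sampling strategy without touching the model.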
This course teaches you the machinery. Not as abstract theory, but as something you can trace through real code.
The path is cumulative. You will start from how text becomes numbers, move through the math and the model structure, and end by reading real llama.cpp code and reasoning about performance.
This course focuses on inference and systems understanding. It teaches only the minimum training and evaluation concepts needed for that purpose. It is not a full machine learning curriculum.
Treat the course as a guided code-reading apprenticeship, not as a glossary dump. Each lesson introduces one durable mental model, one toy example, one code hook, and one performance consequence. The gates are not there to make the course feel academic. They are there to stop you from drifting past a concept without being able to say it back clearly.
Use the running example aggressively. When a lesson mentions "The cat sat" and the token IDs [791, 2368, 3290], do not dismiss it as childish. That tiny example is the scaffold the later modules keep reusing.
The point is not the sentence. The point is that one stable object lets you compare ideas across lessons without constantly changing context.
One more practical rule: whenever you see a shape, narrate it in plain English. Whenever you see a file path, ask what role that file plays in the pipeline. Those two habits carry a large fraction of this curriculum.
Here is the simplest possible picture of what an LLM does:
The model received three token IDs and produced scores over the vocabulary. A decoding policy then chose the next token, shown here as " on". That is one inference step. Generation repeats this loop: evaluate the model, decode one token, append it to the sequence, and evaluate again.
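The loop can be sketched in a few lines of Python. Everything here is a stand-in: `toy_model` is a hypothetical function that returns meaningless but deterministic scores, and the vocabulary size of 4000 is an arbitrary assumption chosen so the running example's IDs fit. The shape of the loop is the point, not the scores.

```python
VOCAB_SIZE = 4000  # assumed size, for illustration only

def toy_model(token_ids):
    """Stand-in for a real model: returns one score per vocabulary entry.

    It simply favors the ID that follows the last input ID.
    """
    favored = (token_ids[-1] + 1) % VOCAB_SIZE
    return [1.0 if i == favored else 0.0 for i in range(VOCAB_SIZE)]

def greedy(logits):
    """Decoding policy: choose the highest-scoring token ID."""
    return max(range(len(logits)), key=logits.__getitem__)

def generate(prompt_ids, n_new, model=toy_model, decode=greedy):
    """Repeat evaluate -> decode -> append for n_new steps."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model(ids)        # model evaluation: IDs in, scores out
        ids.append(decode(logits)) # decoding: scores in, one ID out
    return ids

out = generate([791, 2368, 3290], n_new=3)
```

Each pass through the loop is one inference step. The sequence grows by exactly one token ID per step.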
You will move through these stages, in order:
- 01 Orientation — what an LLM does, the end-to-end pipeline
- 02 Tokens — how text becomes integer IDs
- 03 Math — vectors, matrices, projections
- 04 Probability — logits, softmax, sampling
- 05 Transformer Block — attention, FFN, residuals, the full decoder
- 06 Variants — GQA, SWA, shared-KV, MoE
- 07 Inference — prefill, decode, KV cache, batching
- 08 Performance — bottlenecks, quantization, validation
- 09 Case Studies — real llama.cpp code walkthroughs
We have not introduced tensor shapes yet. For now, notice only that the model's input is a sequence of integers and its direct output is a set of scores — one per possible next token.
The model code you will eventually read lives in llama.cpp. Specifically:
What to notice later: this file wires together one full model architecture out of repeated layer-building patterns. It is where the abstract idea of "a decoder model" becomes a concrete graph.
What to notice later: this file contains the reusable graph-building helpers that implement the major model stages. You do not need to understand them yet. The important thing is to know that the machinery we are describing does exist in concrete code.
Most lessons in this course follow a consistent pattern. To build lasting understanding, keep a short journal. After each lesson, fill in whichever of these apply:
Review your journal before each module quiz. By the capstone, you will have a personal reference that maps every concept to a shape, a function, and a cost — the exact thinking pattern this course builds.
Everything you learn in this course matters because inference is real-time work. The model must produce tokens fast enough to be useful. One important preview: tokenization happens once on the input text, but model evaluation and token selection repeat for every generated token. Understanding that boundary is the first step toward understanding where time goes.
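One way to make that boundary concrete is to count calls. The sketch below uses hypothetical stand-ins throughout (a whitespace "tokenizer" and a fixed-score "model") and exists only to show which stages run once and which run once per generated token.

```python
from collections import Counter

def run(text, n_new):
    """Generate n_new tokens and report how often each stage ran."""
    counts = Counter()

    def tokenize(t):
        counts["tokenize"] += 1
        # Hypothetical tokenizer: one fake ID per whitespace-separated word.
        return [len(word) for word in t.split()]

    def evaluate(ids):
        counts["evaluate"] += 1
        # Fixed, made-up scores over a 3-entry vocabulary.
        return [0.0, 1.0, 0.5]

    def select(logits):
        counts["select"] += 1
        return logits.index(max(logits))

    ids = tokenize(text)          # runs once, on the input text
    for _ in range(n_new):
        ids.append(select(evaluate(ids)))  # runs once per generated token
    return ids, counts

ids, counts = run("The cat sat", n_new=5)
```

With five generated tokens, `counts` records one tokenization call against five evaluation calls and five selection calls. Costs inside the loop are multiplied by output length, while costs outside it are paid once.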
Which of these is the direct input to the model?
What does the model itself produce as output, before decoding chooses a token?