L00

What This Course Teaches

5 min

Question

What will this course actually teach me?

Intuition

A large language model receives token IDs derived from text and produces logits: raw scores over the vocabulary for what token should come next. A decoding policy then turns those scores into an actual chosen token. Everything else — the architecture, the math, the engineering — is machinery for producing good scores quickly and reliably.

This course teaches you the machinery. Not as abstract theory, but as something you can trace through real code.

The path is cumulative. You will start from how text becomes numbers, move through the math and the model structure, and end by reading real llama.cpp code and reasoning about performance.

This course focuses on inference and systems understanding. It teaches only the minimum training and evaluation concepts needed for that purpose. It is not a full machine learning curriculum.

How to Use This Course

Treat the course as a guided code-reading apprenticeship, not as a glossary dump. Each lesson introduces one durable mental model, one toy example, one code hook, and one performance consequence. The gates are not there to make the course feel academic. They are there to stop you from drifting past a concept without being able to say it back clearly.

Use the running example aggressively. When a lesson mentions "The cat sat" and the token IDs [791, 2368, 3290], do not dismiss it as childish. That tiny example is the scaffold the later modules keep reusing. The point is not the sentence. The point is that one stable object lets you compare ideas across lessons without constantly changing context.

One more practical rule: whenever you see a shape, narrate it in plain English. Whenever you see a file path, ask what role that file plays in the pipeline. Those two habits carry a large fraction of this curriculum.

Toy Example

Here is the simplest possible picture of what an LLM does:

"The cat sat" → tokenize → [791, 2368, 3290] → model → logits → decode → " on"

The model received three token IDs and produced scores over the vocabulary. A decoding policy then chose the next token, shown here as " on". That is one inference step. Generation repeats this step: the chosen token is appended to the sequence, and the model is evaluated again on the longer sequence.
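The pipeline above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how llama.cpp works: the vocabulary, the ID assigned to " on", the space-splitting tokenizer, and the hard-coded scores in toy_model are all invented for illustration. Only the three token IDs and the tokenize, model, decode flow come from the lesson.

```python
from typing import Dict, List

# Hypothetical toy vocabulary. Only 791, 2368, 3290 and the string " on"
# come from the lesson; the IDs 1, 2, 3 and their strings are invented.
ID_TO_TEXT: Dict[int, str] = {
    791: "The", 2368: " cat", 3290: " sat",
    1: " on", 2: " the", 3: " mat",
}
TEXT_TO_ID: Dict[str, int] = {text: tid for tid, text in ID_TO_TEXT.items()}


def tokenize(text: str) -> List[int]:
    """Toy tokenizer: split before each space. Real tokenizers are subword-based."""
    pieces = text.replace(" ", "\n ").split("\n")
    return [TEXT_TO_ID[piece] for piece in pieces]


def toy_model(token_ids: List[int]) -> Dict[int, float]:
    """Stand-in for the model: return one score (logit) per vocabulary entry.

    The scores are hard-coded to favor a fixed continuation of the last token.
    """
    follow = {3290: 1, 1: 2, 2: 3}  # " sat" -> " on" -> " the" -> " mat"
    favored = follow.get(token_ids[-1])
    return {tid: (5.0 if tid == favored else 0.0) for tid in ID_TO_TEXT}


def greedy_decode(logits: Dict[int, float]) -> int:
    """One possible decoding policy: pick the ID with the highest score."""
    return max(logits, key=logits.get)


def generate(prompt: str, n_new: int) -> str:
    token_ids = tokenize(prompt)          # text -> token IDs
    for _ in range(n_new):
        logits = toy_model(token_ids)     # token IDs -> scores over the vocabulary
        next_id = greedy_decode(logits)   # scores -> one chosen token ID
        token_ids.append(next_id)         # the choice becomes part of the next input
    return "".join(ID_TO_TEXT[tid] for tid in token_ids)


print(generate("The cat sat", 1))
```

With n_new set to 1 this performs the single inference step from the picture above and prints "The cat sat on". Larger values of n_new repeat the evaluate-then-decode loop, each time feeding the chosen token back in.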

The Journey

You will move through these stages, in order:

01 Orientation — what an LLM does, the end-to-end pipeline
02 Tokens — how text becomes integer IDs
03 Math — vectors, matrices, projections
04 Probability — logits, softmax, sampling
05 Transformer Block — attention, FFN, residuals, the full decoder
06 Variants — GQA, SWA, shared-KV, MoE
07 Inference — prefill, decode, KV cache, batching
08 Performance — bottlenecks, quantization, validation
09 Case Studies — real llama.cpp code walkthroughs

Shapes

We have not introduced tensor shapes yet. For now, notice only that the model's input is a sequence of integers and its direct output is a set of scores — one per possible next token.

input: a sequence of token IDs, e.g. [791, 2368, 3290]

model output: one score per vocabulary entry, e.g. [2.1, 0.8, 5.4, ...]

decoded result: one chosen next token, e.g. " on"
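As a way of narrating these three shapes, here is a minimal sketch. It assumes a hypothetical three-entry vocabulary so that the first three example scores line up with it; a real vocabulary is far larger. The name greedy_pick is illustrative, and taking the highest score is only one possible decoding policy.

```python
from typing import List

# Input: a sequence of token IDs (values from the running example).
token_ids: List[int] = [791, 2368, 3290]

# Hypothetical 3-entry vocabulary, invented so the example scores fit it.
vocab: List[str] = [" in", " up", " on"]

# Model output: one score per vocabulary entry (first three example values).
logits: List[float] = [2.1, 0.8, 5.4]


def greedy_pick(scores: List[float], vocabulary: List[str]) -> str:
    """Return the vocabulary entry with the highest score."""
    if len(scores) != len(vocabulary):
        raise ValueError("expected exactly one score per vocabulary entry")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return vocabulary[best_index]


# Decoded result: one chosen next token.
next_token = greedy_pick(logits, vocab)
print(f"{len(token_ids)} token IDs in, {len(logits)} scores out, chosen token: {next_token!r}")
```

The length check states the key invariant in plain English: the number of scores always equals the vocabulary size, regardless of how many token IDs went in.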

Implementation Hook

The model code you will eventually read lives in llama.cpp. Specifically:

src/models/gemma.cpp — layer loop (L22)

What to notice later: this file wires together one full model architecture out of repeated layer-building patterns. It is where the abstract idea of "a decoder model" becomes a concrete graph.

src/llama-graph.cpp

What to notice later: this file contains the reusable graph-building helpers that implement the major model stages. You do not need to understand them yet. The important thing is to know that the machinery we are describing does exist in concrete code.

Study Habit: The Lesson Journal

Most lessons in this course follow a consistent pattern. To build lasting understanding, keep a short journal. After each lesson, fill in whichever of these apply:

Lesson: ___

Mental model: ___ (the one-sentence intuition)

Shape in plain English: ___ (e.g., "[n_tokens, d_model] = one row per token")

Code hook: ___ (which function, which file)

Performance consequence: ___ (what makes this fast or slow)

Review your journal before each module quiz. By the capstone, you will have a personal reference that maps every concept to a shape, a function, and a cost — the exact thinking pattern this course builds.

Performance Hook

Everything you learn in this course matters because inference is real-time work. The model must produce tokens fast enough to be useful. One important preview: tokenization happens once on the input text, but model evaluation and token selection repeat for every generated token. Understanding that boundary is the first step toward understanding where time goes.
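The once-versus-every-token boundary can be made visible by counting calls. The sketch below is an illustration only: tokenize, evaluate_model, and select_token are hypothetical stand-ins that return fixed values, and the counts say nothing about how long each call takes in a real runtime.

```python
from collections import Counter
from typing import List

calls: Counter = Counter()


def tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: always returns the running example's IDs."""
    calls["tokenize"] += 1
    return [791, 2368, 3290]


def evaluate_model(token_ids: List[int]) -> List[float]:
    """Stand-in model: fixed scores over a pretend 3-entry vocabulary."""
    calls["evaluate_model"] += 1
    return [2.1, 0.8, 5.4]


def select_token(logits: List[float]) -> int:
    """Stand-in decoding policy: index of the highest score."""
    calls["select_token"] += 1
    return logits.index(max(logits))


def run(prompt: str, n_new_tokens: int) -> Counter:
    calls.clear()
    token_ids = tokenize(prompt)                  # happens once, on the input text
    for _ in range(n_new_tokens):                 # repeats for every generated token
        logits = evaluate_model(token_ids)
        token_ids.append(select_token(logits))    # toy: the index stands in for a token ID
    return Counter(calls)


counts = run("The cat sat", 50)
print(counts["tokenize"], counts["evaluate_model"], counts["select_token"])
```

For 50 generated tokens the counter reports one tokenize call against 50 model evaluations and 50 token selections, which is why later modules spend most of their attention on the repeated part.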

Check Yourself

Q1 (conceptual)

Which of these is the direct input to the model?

Raw text characters
A sequence of token IDs (integers)
A probability distribution

Q2 (conceptual)

What does the model itself produce as output, before decoding chooses a token?

The next word in the sentence
A set of scores (logits), one per vocabulary entry
An embedding vector