What This Course Teaches

5 min
What will this course actually teach me?

A large language model receives token IDs derived from text and produces logits: raw scores over the vocabulary for what token should come next. A decoding policy then turns those scores into an actual chosen token. Everything else — the architecture, the math, the engineering — is machinery for producing good scores quickly and reliably.

This course teaches you the machinery. Not as abstract theory, but as something you can trace through real code.

The path is cumulative. You will start from how text becomes numbers, move through the math and the model structure, and end by reading real llama.cpp code and reasoning about performance.

This course focuses on inference and systems understanding. It teaches only the minimum training and evaluation concepts needed for that purpose. It is not a full machine learning curriculum.

Treat the course as a guided code-reading apprenticeship, not as a glossary dump. Each lesson introduces one durable mental model, one toy example, one code hook, and one performance consequence. The gates are not there to make the course feel academic. They are there to stop you from drifting past a concept without being able to say it back clearly.

Use the running example aggressively. When a lesson mentions "The cat sat" and the token IDs [791, 2368, 3290], do not dismiss it as childish. That tiny example is the scaffold the later modules keep reusing. The point is not the sentence. The point is that one stable object lets you compare ideas across lessons without constantly changing context.

One more practical rule: whenever you see a shape, narrate it in plain English. Whenever you see a file path, ask what role that file plays in the pipeline. Those two habits carry a large fraction of this curriculum.

Here is the simplest possible picture of what an LLM does:

"The cat sat" → tokenize → [791, 2368, 3290] → model → logits → decode → " on"

The model received three token IDs and produced scores over the vocabulary. A decoding policy then chose the next token, shown here as " on". That is one inference step. Generation repeats this evaluate-then-decode loop, appending each chosen token to the sequence before the next step.

You will move through these stages, in order:

  1. 01 Orientation: what an LLM does, the end-to-end pipeline
  2. 02 Tokens: how text becomes integer IDs
  3. 03 Math: vectors, matrices, projections
  4. 04 Probability: logits, softmax, sampling
  5. 05 Transformer Block: attention, FFN, residuals, the full decoder
  6. 06 Variants: GQA, SWA, shared-KV, MoE
  7. 07 Inference: prefill, decode, KV cache, batching
  8. 08 Performance: bottlenecks, quantization, validation
  9. 09 Case Studies: real llama.cpp code walkthroughs

We have not introduced tensor shapes yet. For now, notice only that the model's input is a sequence of integers and its direct output is a set of scores — one per possible next token.

input: sequence of token IDs   e.g. [791, 2368, 3290]
model output: one score per vocabulary entry   e.g. [2.1, 0.8, 5.4, ...]
decoded result: one chosen next token   e.g. " on"
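
The three-line contract above can be sketched in Python. This is a toy illustration, not llama.cpp code: `VOCAB`, `toy_model`, and `greedy_decode` are hypothetical names, the five-entry vocabulary is invented so the whole score list fits on one line, and the scores are hard-coded rather than computed. Greedy selection (take the highest score) is only one possible decoding policy.

```python
# Hypothetical 5-entry vocabulary; a real one has tens of thousands of entries.
VOCAB = [" the", " mat", " on", " dog", "."]


def toy_model(token_ids: list[int]) -> list[float]:
    """Stand-in for the model: returns one raw score (logit) per vocabulary entry.

    A real model computes these scores from token_ids; here they are hard-coded.
    """
    assert all(isinstance(t, int) for t in token_ids)
    return [2.1, 0.8, 5.4, -1.0, 0.3]


def greedy_decode(logits: list[float]) -> int:
    """One simple decoding policy: pick the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


token_ids = [791, 2368, 3290]     # input: sequence of token IDs
logits = toy_model(token_ids)     # model output: one score per vocabulary entry
next_id = greedy_decode(logits)   # decoded result: index into the toy vocabulary
next_text = VOCAB[next_id]

print(len(logits) == len(VOCAB))  # True
print(next_id, repr(next_text))   # 2 ' on'
```

Note that the length of `logits` depends on the vocabulary size, not on how many token IDs went in.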

The model code you will eventually read lives in llama.cpp. Specifically:

What to notice later: this file wires together one full model architecture out of repeated layer-building patterns. It is where the abstract idea of "a decoder model" becomes a concrete graph.

What to notice later: this file contains the reusable graph-building helpers that implement the major model stages. You do not need to understand them yet. The important thing is to know that the machinery we are describing does exist in concrete code.

Most lessons in this course follow a consistent pattern. To build lasting understanding, keep a short journal. After each lesson, fill in whichever of these apply:

Lesson: ___
Mental model: ___ (the one-sentence intuition)
Shape in plain English: ___ (e.g., "[n_tokens, d_model] = one row per token")
Code hook: ___ (which function, which file)
Performance consequence: ___ (what makes this fast or slow)

Review your journal before each module quiz. By the capstone, you will have a personal reference that maps every concept to a shape, a function, and a cost — the exact thinking pattern this course builds.

Everything you learn in this course matters because inference is real-time work. The model must produce tokens fast enough to be useful. One important preview: tokenization happens once on the input text, but model evaluation and token selection repeat for every generated token. Understanding that boundary is the first step toward understanding where time goes.
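
A minimal sketch of that boundary, assuming toy stand-ins for every stage: `tokenize`, `evaluate`, `select`, and `generate` are hypothetical names, the token IDs are hard-coded from the running example, and the four-entry vocabulary and scoring rule are invented. The counters show which stages run once and which run once per generated token.

```python
calls = {"tokenize": 0, "evaluate": 0, "select": 0}


def tokenize(text: str) -> list[int]:
    """Stand-in tokenizer: always returns the running example's IDs."""
    calls["tokenize"] += 1
    return [791, 2368, 3290]


def evaluate(token_ids: list[int]) -> list[float]:
    """Stand-in model: one score per entry of an invented 4-entry vocabulary."""
    calls["evaluate"] += 1
    vocab_size = 4
    favored = (token_ids[-1] + 1) % vocab_size  # arbitrary toy rule
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]


def select(logits: list[float]) -> int:
    """Stand-in decoding policy: greedy, highest score wins."""
    calls["select"] += 1
    return max(range(len(logits)), key=lambda i: logits[i])


def generate(prompt: str, n_new: int) -> list[int]:
    ids = tokenize(prompt)            # happens once, on the input text
    for _ in range(n_new):            # repeats for every generated token
        logits = evaluate(ids)
        ids.append(select(logits))
    return ids


out = generate("The cat sat", n_new=5)
print(calls)     # {'tokenize': 1, 'evaluate': 5, 'select': 5}
print(len(out))  # 8
```

For simplicity this sketch passes the whole sequence to `evaluate` on every step. How real inference avoids redoing that work is the subject of the later Inference module (prefill, decode, KV cache).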

Check Yourself
Q1 (conceptual)

Which of these is the direct input to the model?

Q2 (conceptual)

What does the model itself produce as output, before decoding chooses a token?