
What This Course Teaches

5 min · soft gate
What will this course actually teach me?

A large language model receives token IDs derived from text and predicts the next token. Everything else — the architecture, the math, the engineering — is machinery for doing that prediction well.

This course teaches you the machinery. Not as abstract theory, but as something you can trace through real code.

The path is cumulative. You will start from how text becomes numbers, move through the math and the model structure, and end by reading real llama.cpp code and reasoning about performance.

This course focuses on inference and systems understanding. It teaches only the minimum training and evaluation concepts needed for that purpose. It is not a full machine learning curriculum.

Here is the simplest possible picture of what an LLM does:

"The cat sat" → tokenize → [791, 2368, 3290] → model → logits → "on"

The model received three token IDs and produced scores over the vocabulary. The highest-scoring token is "on". That is one inference step. Generation repeats this process.
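The step above can be sketched in a few lines of Python. This is a minimal illustration, not real model code: `toy_model` is a hypothetical stand-in that hard-codes scores, where a real LLM would run a transformer to compute them.

```python
# Toy sketch of one inference step and of generation as repeated steps.
# `toy_model` is a made-up stand-in: it returns one score per vocabulary
# entry, derived from the input IDs by an arbitrary rule.

def toy_model(token_ids):
    vocab_size = 6
    logits = [0.0] * vocab_size
    logits[sum(token_ids) % vocab_size] = 5.0  # pretend computation
    return logits

def greedy_step(token_ids):
    # One inference step: score every vocabulary entry,
    # then pick the ID with the highest score.
    logits = toy_model(token_ids)
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(token_ids, n_tokens):
    # Generation: append the chosen token and repeat the step.
    ids = list(token_ids)
    for _ in range(n_tokens):
        ids.append(greedy_step(ids))
    return ids
```

Everything in this course refines what happens inside `toy_model`; the outer loop stays this simple.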

You will move through these stages, in order:

  1. 01 Orientation: what an LLM does, the end-to-end pipeline
  2. 02 Tokens: how text becomes integer IDs
  3. 03 Math: vectors, matrices, projections
  4. 04 Probability: logits, softmax, sampling
  5. 05 Transformer Block: attention, FFN, residuals, the full decoder
  6. 06 Variants: GQA, SWA, shared-KV, MoE
  7. 07 Inference: prefill, decode, KV cache, batching
  8. 08 Performance: bottlenecks, quantization, validation
  9. 09 Case Studies: real llama.cpp code walkthroughs

We have not introduced tensor shapes yet. For now, notice only that the model's input is a sequence of integers and its output is a set of scores — one per possible next token.

input: sequence of token IDs   e.g. [791, 2368, 3290]
output: one score per vocabulary entry   e.g. [2.1, 0.8, 5.4, ...]
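That interface can be stated as a two-line sketch (the scores here are illustrative numbers, not real model output):

```python
# The model's interface, independent of its internals:
# a list of token IDs in, one score per vocabulary entry out.
scores = [2.1, 0.8, 5.4, 0.3]  # one score per vocabulary entry

# The highest score marks the predicted next token ID.
next_token_id = max(range(len(scores)), key=lambda i: scores[i])
```

Here `next_token_id` is 2, because index 2 holds the highest score, 5.4.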

The model code you will eventually read lives in llama.cpp. Specifically:

src/models/gemma.cpp

That file builds the computation graph for a dense Gemma model — one of the case studies at the end of this course. You do not need to understand it yet. You will.

Everything you learn in this course matters because inference is real-time work. The model must produce tokens fast enough to be useful. Understanding the machinery is the first step toward understanding where time goes.

Check Yourself (soft)

Q1 (conceptual)

Which of these is the direct input to the model?

Q2 (conceptual)

What does the model produce as output?