L01A

One Request, End to End

10 min

Question

What does one real inference request look like end to end?

Intuition

A single inference step is a pipeline with clear stages. Text goes in. A next-token prediction comes out. There is nothing magical about any individual stage — each one is a well-defined transformation.

This lesson shows you the whole pipeline at a glance. You do not need to understand each stage deeply yet. That is what the rest of the course is for. The point right now is to see the map.

We will use this same pipeline as a running example throughout the course. Each module will revisit it from a deeper angle.

One boundary matters immediately: tokenization happens once on the incoming text, but the model-evaluate-then-decode loop repeats every time the system generates another token. The rest of the course will keep unpacking that split.
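The split described above can be sketched as a short loop. This is a minimal illustration, not how any real runtime is written: the vocabulary, the fake `model_step`, and every function name here are hypothetical stand-ins chosen only to make the control flow visible.

```python
# Toy stand-ins for the pipeline stages; none of this is a real model.
TOY_VOCAB = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4, " mat": 5}
ID_TO_PIECE = {i: p for p, i in TOY_VOCAB.items()}


def tokenize(text):
    """Greedy longest-match split against TOY_VOCAB (hypothetical tokenizer)."""
    pieces = sorted(TOY_VOCAB, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        piece = next(p for p in pieces if text.startswith(p, pos))
        ids.append(TOY_VOCAB[piece])
        pos += len(piece)
    return ids


def model_step(ids):
    """Fake model: returns logits that favor (last_id + 1) mod vocab size."""
    target = (ids[-1] + 1) % len(TOY_VOCAB)
    return [1.0 if i == target else 0.0 for i in range(len(TOY_VOCAB))]


def decode(logits):
    """Argmax decoding: pick the index with the highest score."""
    return max(range(len(logits)), key=logits.__getitem__)


def generate(text, n_new):
    ids = tokenize(text)        # runs once, on the incoming text
    for _ in range(n_new):      # runs once per generated token
        logits = model_step(ids)
        ids.append(decode(logits))
    return "".join(ID_TO_PIECE[i] for i in ids)


result = generate("The cat sat", 3)
print(result)
```

Notice that `tokenize` sits outside the loop while `model_step` and `decode` sit inside it. That placement is the whole point of the sketch.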

Interactive — Pipeline Stepper

Raw Text

A string of characters. This is what the user types.

"The cat sat"

The model cannot process raw text directly. It must first be converted into numbers.

Toy Example

Our running example — one step of predicting the next token after "The cat sat":

Raw text: "The cat sat"

Tokenized: ["The", " cat", " sat"] → [791, 2368, 3290]

Embedded: 3 vectors, one per token, each of length d_model (the model's internal width)

Layers: N layers transform the hidden states

Logits: one score per vocabulary entry for the last token position

Decode: argmax or sampling chooses one token from those logits

Chosen token: " on" (the selected next token)
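To make the running example concrete, here is a toy version of one step that you can run. Everything numeric in it is an assumption made for illustration: the width of 4, the two layers, the hand-picked weights, the "add the running mean" layer that stands in for real transformer layers, and the token ID used for " on". Only the first three token IDs come from the example above.

```python
# Toy vocabulary. The first three IDs come from the running example; the ID
# for " on" is a made-up placeholder, and so are all weights below.
PIECES = ["The", " cat", " sat", " on"]
IDS = [791, 2368, 3290, 389]
PIECE_TO_ID = dict(zip(PIECES, IDS))

N_LAYERS = 2  # toy depth; the vectors below have a toy width of 4

# Embedding table: one hand-picked vector per token ID.
EMBED = {
    791: [1.0, 0.0, 0.0, 0.0],
    2368: [0.0, 1.0, 0.0, 0.0],
    3290: [0.0, 0.0, 1.0, 0.0],
    389: [0.0, 0.0, 0.0, 1.0],
}

# Output projection: one row per vocabulary entry, in the order of PIECES.
W_OUT = [
    [0.0, 0.0, 0.0, 1.0],  # "The"
    [1.0, 0.0, 0.0, 0.0],  # " cat"
    [0.0, 1.0, 0.0, 0.0],  # " sat"
    [0.0, 0.0, 1.0, 0.0],  # " on"
]


def tokenize(text):
    """Split text into known pieces and map each piece to its ID."""
    ids, pos = [], 0
    while pos < len(text):
        piece = next(p for p in PIECES if text.startswith(p, pos))
        ids.append(PIECE_TO_ID[piece])
        pos += len(piece)
    return ids


def layer(hidden):
    """Stand-in for a transformer layer: add the mean of positions 0..t."""
    out = []
    for t, h in enumerate(hidden):
        prefix = hidden[: t + 1]
        mean = [sum(col) / len(prefix) for col in zip(*prefix)]
        out.append([a + b for a, b in zip(h, mean)])
    return out


def run_step(text):
    ids = tokenize(text)                    # raw text -> token IDs
    hidden = [EMBED[i] for i in ids]        # IDs -> one vector per token
    for _ in range(N_LAYERS):               # layers keep the shape
        hidden = layer(hidden)
    last = hidden[-1]                       # only the last position is scored
    logits = [sum(w * x for w, x in zip(row, last)) for row in W_OUT]
    best = max(range(len(logits)), key=logits.__getitem__)  # argmax decode
    return ids, logits, PIECES[best]


ids, logits, chosen = run_step("The cat sat")
print(ids, chosen)
```

The weights are rigged so that the last position's vector lines up with the output row for " on". A trained model learns weights that produce useful scores; this sketch only shows the order of the stages.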

Responsibility Map

Stage	Who owns it conceptually	What to remember
raw text → token IDs	tokenizer	done once on the input text
IDs → hidden states → logits	model graph	the transformer machinery lives here
logits → chosen token	decoding/runtime policy	argmax or sampling chooses one token, then the loop repeats
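The last row of the map names two decoding policies. A minimal sketch of both follows, assuming that sampling means drawing from the softmax of the logits. The function names and the example logits are hypothetical, and real runtimes typically layer more controls on top of this.

```python
import math
import random


def argmax_decode(logits):
    """Deterministic: always pick the highest-scoring vocabulary index."""
    return max(range(len(logits)), key=logits.__getitem__)


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_decode(logits, rng):
    """Stochastic: draw one index with probability given by softmax."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [0.1, 2.0, -1.0, 0.5]               # made-up scores for 4 entries
greedy = argmax_decode(logits)               # index 1 has the top score
rng = random.Random(0)                       # seeded so runs are repeatable
sampled = [sample_decode(logits, rng) for _ in range(5)]
print(greedy, sampled)
```

Both functions take the same logits and return one token ID. That is why the choice between them belongs to the runtime policy and not to the model graph.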

Shapes

A preview of the shapes you will learn to track through each stage:

Token IDs: [3] (3 integers)

After embedding: [3, d_model] (3 vectors)

After N layers: [3, d_model] (same shape, different values)

Logits: [|V|] (one score per vocabulary token)

Chosen next token: scalar token ID (one selected vocabulary entry)
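The shape list above can be checked mechanically. The sketch below uses nested Python lists and a small hypothetical `shape` helper. The values chosen for d_model and |V| are toy assumptions, and the "layer" only adds 1.0 to every entry so that the values change while the shape stays fixed.

```python
def shape(x):
    """Shape of a rectangular nested list; a bare number has shape []."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return dims


D_MODEL = 8       # hypothetical toy width
VOCAB_SIZE = 16   # hypothetical toy |V|
N_LAYERS = 2

token_ids = [791, 2368, 3290]                          # [3]
embedded = [[0.0] * D_MODEL for _ in token_ids]        # [3, d_model]

hidden = embedded
for _ in range(N_LAYERS):
    # Placeholder layer: new values, same shape.
    hidden = [[v + 1.0 for v in row] for row in hidden]

logits = [0.0] * VOCAB_SIZE                            # [|V|]
logits[5] = 1.0                                        # pretend entry 5 wins
next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)  # scalar

shapes = {
    "token_ids": shape(token_ids),
    "embedded": shape(embedded),
    "hidden": shape(hidden),
    "logits": shape(logits),
    "next_id": shape(next_id),
}
print(shapes)
```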

Implementation Hook

This pipeline is not abstract. Here is where it lives in real code:

src/llama-graph.cpp

What to notice later: this file builds the model stages themselves. It is the graph-builder view of the pipeline. The helpers correspond to major parts of the model:

build_attn() — attention stage
build_ffn() — feed-forward stage
build_moe_ffn() — mixture-of-experts stage

Here is one complete model that wires those pieces together:

src/models/gemma.cpp — layer loop (L22)

What to notice later: this file is not the runtime loop. It is the model-definition side.

tools/cli/cli.cpp

What to notice later: this is where user-facing generation control lives. It helps separate "the model produced logits" from "the runtime chose and emitted a token."

Performance Hook

During generation, tokenization happens once on the input text. The model computation (embedding, layers, output projection) and token choice repeat for every generated token. How much each of those repeated stages costs varies dramatically, and understanding where the time goes is the goal of the performance modules later in the course.
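One simple way to see where time goes is to wrap each stage in a timer and accumulate per-stage totals. The sketch below is an illustration only: `timed`, `profile_generation`, and the dummy stages passed into it are hypothetical, and the dummy stages are far too cheap to produce meaningful timings.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

elapsed = defaultdict(float)  # total seconds spent in each stage
counts = defaultdict(int)     # number of times each stage ran


@contextmanager
def timed(stage):
    """Accumulate wall-clock time and a call count for one stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed[stage] += time.perf_counter() - start
        counts[stage] += 1


def profile_generation(tokenize, model_step, decode, text, n_new):
    with timed("tokenize"):
        ids = tokenize(text)
    for _ in range(n_new):
        with timed("model"):
            logits = model_step(ids)
        with timed("decode"):
            ids.append(decode(logits))
    return ids


# Dummy stages, just enough to drive the loop.
ids = profile_generation(
    tokenize=lambda text: [0, 1, 2],
    model_step=lambda ids: [float(i == (ids[-1] + 1) % 8) for i in range(8)],
    decode=lambda logits: logits.index(max(logits)),
    text="The cat sat",
    n_new=4,
)

for stage in ("tokenize", "model", "decode"):
    print(f"{stage}: {counts[stage]} call(s), {elapsed[stage]:.6f} s")
```

With real stages plugged in, the call counts show the once-versus-repeated split and the totals show which stage dominates.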

Check Yourself

Q1 (conceptual)

Place these stages in the correct order: Embedding, Tokenization, Output Logits, Transformer Layers

Tokenization → Embedding → Transformer Layers → Output Logits
Embedding → Tokenization → Output Logits → Transformer Layers
Transformer Layers → Tokenization → Embedding → Output Logits

Q2 (shape)

Which stage is responsible for turning logits into one chosen token?

The tokenizer
The embedding table
The decoding/runtime policy after the model output head