One Request, End to End
A single inference step is a pipeline with clear stages. Text goes in. A next-token prediction comes out. There is nothing magical about any individual stage — each one is a well-defined transformation.
This lesson shows you the whole pipeline at a glance. You do not need to understand each stage deeply yet. That is what the rest of the course is for. The point right now is to see the map.
We will use this same pipeline as a running example throughout the course. Each module will revisit it from a deeper angle.
One boundary matters immediately: tokenization happens once on the incoming text, but the model-evaluate-then-decode loop repeats every time the system generates another token. The rest of the course will keep unpacking that split.
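To make that split concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in: the six-word vocabulary, the `tokenize`, `model_step`, and `decode` functions, and the fake "model" that simply favors the next vocabulary entry. It is not how a real runtime is written. It only counts how often each stage runs.

```python
# Toy stand-ins only: the vocabulary and all three stage functions are hypothetical.
VOCAB = ["The", "cat", "sat", "on", "the", "mat"]
calls = {"tokenize": 0, "model_step": 0, "decode": 0}

def tokenize(text):
    """Raw text -> token IDs (whitespace split, toy vocabulary lookup)."""
    calls["tokenize"] += 1
    return [VOCAB.index(word) for word in text.split()]

def model_step(ids):
    """Token IDs -> logits. Fake model: favors the entry after the last token."""
    calls["model_step"] += 1
    target = (ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]

def decode(logits):
    """Logits -> one chosen token ID (argmax)."""
    calls["decode"] += 1
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(text, n_new):
    ids = tokenize(text)              # runs once, on the incoming text
    for _ in range(n_new):            # runs once per generated token
        logits = model_step(ids)
        ids.append(decode(logits))
    return ids

ids = generate("The cat sat", 3)
print(" ".join(VOCAB[i] for i in ids))  # The cat sat on the mat
print(calls)  # {'tokenize': 1, 'model_step': 3, 'decode': 3}
```

The counters are the point: one tokenizer call, then one model evaluation and one decode per new token.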
Step through each stage of one inference request.
Raw Text
A string of characters. This is what the user types.
"The cat sat"
The model cannot process raw text directly. It must first be converted into numbers.
Our running example — one step of predicting the next token after "The cat sat":
| Stage | Who owns it conceptually | What to remember |
|---|---|---|
| raw text → token IDs | tokenizer | done once on the input text |
| IDs → hidden states → logits | model graph | the transformer machinery lives here |
| logits → chosen token | decoding/runtime policy | argmax or sampling chooses one token, then the loop repeats |
A preview of the shapes you will learn to track through each stage:
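One way to get a feel for those shapes is a tiny pure-Python sketch. The sizes (`VOCAB_SIZE = 6`, `D_MODEL = 4`), the random weights, and the helper names are all assumptions chosen for illustration. The transformer layers are skipped entirely, since they keep the hidden-state shape unchanged.

```python
import random

random.seed(0)
VOCAB_SIZE = 6   # hypothetical toy size
D_MODEL = 4      # hypothetical toy size

def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def shape(x):
    """Shape of a rectangular nested list, as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

def vec_times_matrix(vec, mat):
    """Multiply a length-n row vector by an n x m matrix."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]

embedding_table = rand_matrix(VOCAB_SIZE, D_MODEL)
output_projection = rand_matrix(D_MODEL, VOCAB_SIZE)

token_ids = [0, 1, 2]                                  # "The cat sat" as toy IDs
hidden = [embedding_table[t] for t in token_ids]       # one vector per token
# Transformer layers would run here; they map (3, 4) to (3, 4).
logits = vec_times_matrix(hidden[-1], output_projection)  # last position only

print(shape(token_ids))  # (3,)
print(shape(hidden))     # (3, 4)
print(shape(logits))     # (6,)
```

Only the last position's hidden state is projected here, because that is the position whose logits predict the next token.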
This pipeline is not abstract. Here is where it lives in real code:
What to notice later: this file builds the model stages themselves. It is the graph-builder view of the pipeline. The helpers correspond to major parts of the model:
- build_attn() — attention stage
- build_ffn() — feed-forward stage
- build_moe_ffn() — mixture-of-experts stage
Here is one complete model that wires those pieces together:
What to notice later: this file is not the runtime loop. It is the model-definition side.
What to notice later: this is where user-facing generation control lives. It helps separate "the model produced logits" from "the runtime chose and emitted a token."
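The separation between "logits" and "chosen token" can be sketched in a few lines. This is an illustrative example, not the code the lesson refers to: the function names are hypothetical, and temperature is used only as one common example of a generation control.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def choose_greedy(logits):
    """Argmax: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def choose_sampled(logits, temperature=1.0, rng=random):
    """Sampling: draw one token ID in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [0.1, 2.0, -1.0, 0.5]               # the model's output for one step
print(choose_greedy(logits))                 # 1
print(choose_sampled(logits, temperature=0.8, rng=random.Random(0)))
```

The same logits go into both functions. Which token comes out depends on the policy, which is why that choice belongs to the runtime and not to the model graph.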
During generation, tokenization happens once on the input text. The model computation (embedding, layers, output projection) and token choice repeat for every generated token. The cost of those repeated stages varies dramatically. Understanding where time goes is the goal of the performance modules later in the course.
Place these stages in the correct order: Embedding, Tokenization, Output Logits, Transformer Layers
Which stage is responsible for turning logits into one chosen token?