One Request, End to End

10 min
What does one real inference request look like end to end?

A single inference step is a pipeline with clear stages. Text goes in. A next-token prediction comes out. There is nothing magical about any individual stage — each one is a well-defined transformation.

This lesson shows you the whole pipeline at a glance. You do not need to understand each stage deeply yet. That is what the rest of the course is for. The point right now is to see the map.

We will use this same pipeline as a running example throughout the course. Each module will revisit it from a deeper angle.

One boundary matters immediately: tokenization happens once on the incoming text, but the loop of evaluating the model and then decoding repeats for every token the system generates. The rest of the course will keep unpacking that split.

Raw Text

A string of characters. This is what the user types.

"The cat sat"

The model cannot process raw text directly. It must first be converted into numbers.

Our running example, one step of predicting the next token after "The cat sat":

Raw text       "The cat sat"
Tokenized      ["The", " cat", " sat"] → [791, 2368, 3290]
Embedded       3 vectors, each of length d_model (the model's internal width)
Layers         N layers transform the hidden states
Logits         One score per vocabulary entry for the last token position
Decode         argmax or sampling chooses one token from those logits
Chosen token   " on" (the selected next token)

Stage                          Who owns it conceptually   What to remember
raw text → token IDs           tokenizer                  done once on the input text
IDs → hidden states → logits   model graph                the transformer machinery lives here
logits → chosen token          decoding/runtime policy    argmax or sampling chooses one token, then the loop repeats
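
The last row, turning logits into one chosen token, can be sketched in a few lines. This is a minimal illustration and not any particular runtime's decoder: the five-entry vocabulary, the logit values, and the helper names (softmax, decode_argmax, decode_sample) are all made up for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def decode_argmax(logits):
    """Greedy decoding: pick the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def decode_sample(logits, temperature=1.0, rng=random):
    """Sampling: draw one index in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical 5-entry vocabulary and made-up logits for the last position.
vocab = [" on", " down", " up", " there", "."]
logits = [3.2, 1.1, 0.3, -0.5, -1.0]

greedy_id = decode_argmax(logits)
sampled_id = decode_sample(logits, temperature=0.8, rng=random.Random(0))
print("argmax :", repr(vocab[greedy_id]))   # always ' on' for these logits
print("sampled:", repr(vocab[sampled_id]))  # depends on the random draw
```

Both functions take the same logits and return one token ID. The model produced the scores; the decoding policy decides which token is emitted.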

A preview of the shapes you will learn to track through each stage:

Token IDs: [3]   (3 integers)
After embedding: [3, d_model]   (3 vectors)
After N layers: [3, d_model]   (same shape, different values)
Logits: [|V|]   (one score per vocabulary token)
Chosen next token: scalar token ID   (one selected vocabulary entry)
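
To make those shapes concrete, here is a toy sketch that pushes three token IDs through stand-in stages and prints the shape after each one. Everything in it is an assumption for illustration: the sizes (d_model = 8, a 50-entry vocabulary, 2 layers), the random weights, the stand-in token IDs, and the use of a plain matrix multiply in place of a real transformer layer. Only the shapes are meant to match the list above.

```python
import random

rng = random.Random(0)
d_model = 8      # hypothetical width; real models are far wider
vocab_size = 50  # hypothetical |V|
n_layers = 2     # hypothetical N

def rand_matrix(rows, cols):
    """Build a rows x cols nested list of random weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply an [n, k] matrix by a [k, m] matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shape(m):
    return [len(m), len(m[0])]

embedding_table = rand_matrix(vocab_size, d_model)
layer_weights = [rand_matrix(d_model, d_model) for _ in range(n_layers)]
output_proj = rand_matrix(d_model, vocab_size)

token_ids = [7, 23, 41]  # stand-in IDs that fit the toy vocabulary
print("token IDs      :", [len(token_ids)])

hidden = [embedding_table[t] for t in token_ids]  # one row per token
print("after embedding:", shape(hidden))

for w in layer_weights:
    hidden = matmul(hidden, w)  # stand-in for a layer: same shape, new values
print("after N layers :", shape(hidden))

logits = matmul([hidden[-1]], output_proj)[0]  # last position only
print("logits         :", [len(logits)])

next_id = max(range(vocab_size), key=lambda i: logits[i])  # argmax decode
print("chosen token ID:", next_id)
```

The hidden states keep the shape [3, d_model] from embedding through every layer. The shape only changes at the output projection, where the last position is mapped to one score per vocabulary entry.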

This pipeline is not abstract. Here is where it lives in real code:

What to notice later: this file builds the model stages themselves. It is the graph-builder view of the pipeline. The helpers correspond to major parts of the model:

  • build_attn() — attention stage
  • build_ffn() — feed-forward stage
  • build_moe_ffn() — mixture-of-experts stage

Here is one complete model that wires those pieces together:

What to notice later: this file is not the runtime loop. It is the model-definition side.

What to notice later: this is where user-facing generation control lives. It helps separate "the model produced logits" from "the runtime chose and emitted a token."

During generation, tokenization happens once on the input text. The model computation (embedding, layers, output projection) and token choice repeat for every generated token. The cost of those repeated stages varies dramatically. Understanding where time goes is the goal of the performance modules later in the course.
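
The once-versus-repeated split can be seen by counting calls in a toy generation loop. This is a sketch under loud assumptions: the seven-entry vocabulary, the longest-match tokenizer, and the "model" that simply favors the next vocabulary entry are all invented here and do not reflect how a real tokenizer or transformer works. Only the control flow is the point.

```python
calls = {"tokenize": 0, "model": 0, "decode": 0}

VOCAB = ["The", " cat", " sat", " on", " the", " mat", "."]  # toy vocabulary

def tokenize(text):
    """Toy tokenizer: greedy longest match against VOCAB (not a real BPE)."""
    calls["tokenize"] += 1
    ids = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        ids.append(VOCAB.index(match))
        text = text[len(match):]
    return ids

def model(ids):
    """Toy model: logits that favor the vocabulary entry after the last token."""
    calls["model"] += 1
    target = (ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]

def decode(logits):
    """Greedy decode: pick the highest-scoring token ID."""
    calls["decode"] += 1
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(prompt, max_new_tokens):
    ids = tokenize(prompt)           # runs once per request
    for _ in range(max_new_tokens):  # runs once per generated token
        logits = model(ids)
        ids.append(decode(logits))
    return "".join(VOCAB[i] for i in ids)

text = generate("The cat sat", max_new_tokens=4)
print(text)   # The cat sat on the mat.
print(calls)  # {'tokenize': 1, 'model': 4, 'decode': 4}
```

Generating four tokens calls the tokenizer once and the model and decoder four times each, which is why the repeated stages dominate the cost of longer outputs.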

Check Yourself
Q1 (conceptual)

Place these stages in the correct order: Embedding, Tokenization, Output Logits, Transformer Layers

Q2 (shape)

Which stage is responsible for turning logits into one chosen token?