L01

An LLM Predicts the Next Token

8 min

Question

What does an LLM actually predict?

Intuition

An LLM does one core computation: given a sequence of previous tokens, it produces scores over which token should come next.

It does not emit a finished word or sentence directly. On every step it computes a score for every token in its vocabulary. These raw scores are called logits. In the standard setup, the last step is a single matrix multiply that maps the model's hidden state to one score per vocabulary entry. The hidden state already encodes the information, but the scores make it explicit and comparable. A decoding rule such as argmax or sampling then turns those scores into one chosen next token.
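The final projection can be sketched in a few lines of plain Python. This is only an illustration under made-up assumptions: the four-token vocabulary, the weight values in W_OUT, the hidden state, and the helper names project_to_logits and argmax are all hypothetical and not taken from any real model.

```python
# Toy sizes: hidden state of width 3, vocabulary of 4 tokens (all values made up).
VOCAB = ["The", " cat", " on", " mat"]

# Hypothetical output projection: one row of weights per vocabulary entry.
W_OUT = [
    [0.1, 0.0, -0.2],   # row for "The"
    [0.0, 0.3, 0.1],    # row for " cat"
    [0.5, 0.2, 0.4],    # row for " on"
    [-0.1, 0.1, 0.2],   # row for " mat"
]

def project_to_logits(hidden, w_out):
    """One matrix-vector multiply: one dot product, hence one score, per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

hidden = [1.0, 2.0, 3.0]                 # assumed final hidden state
logits = project_to_logits(hidden, W_OUT)  # one score per vocabulary entry
next_id = argmax(logits)                   # decoding rule: pick the top score
next_token = VOCAB[next_id]
```

Note that the projection always yields exactly len(VOCAB) scores, regardless of which token ends up being chosen.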

Generation — the process of producing long outputs — is just this one-step prediction repeated. The model produces scores, one token is chosen and appended to the sequence, and the model runs again.

This is called autoregressive generation. The model's output becomes its own input on the next step.

Model vs Decoding Policy

This distinction is worth making explicit early because it survives the entire course. The model produces logits. The runtime decoding policy turns those logits into a token choice.

If the runtime uses greedy decoding, it picks the highest-scoring token. If it uses sampling, it can choose a lower-scoring token on purpose according to a probability distribution derived from the logits. Either way, the transformer layers did not directly output text. They output scores.
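A minimal sketch of the two decoding policies, assuming the logits have already been produced by the model. The function names and the three example logits are hypothetical; real runtimes usually add further sampling settings that are not shown here.

```python
import math
import random

def softmax(logits):
    """Turn logits into a probability distribution (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_greedy(logits):
    """Greedy decoding: always pick the highest-scoring token ID."""
    return max(range(len(logits)), key=lambda i: logits[i])

def decode_sample(logits, rng):
    """Sampling: draw a token ID at random, weighted by softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]        # made-up scores for a 3-token vocabulary
greedy_id = decode_greedy(logits)

rng = random.Random(0)          # seeded so the run is repeatable
sampled_ids = [decode_sample(logits, rng) for _ in range(1000)]
```

Both functions receive the same logits. Only the policy differs, which is why the choice of token belongs to the runtime rather than to the transformer layers.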

Toy Example

One-step prediction:

input: ["The", " cat", " sat"]

model: logits over |V| (scores for every possible next token)

decode: " on" (chosen next token)

Autoregressive generation repeats this:

Step 1: model logits over |V| → decode chooses " on"

Step 2: append that token, run model again → decode chooses " the"

Step 3: append again, run model again → decode chooses " mat"

Step 4: append again, run model again → decode chooses "."
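The four steps above can be sketched as a loop. The "model" here is a hypothetical lookup table that only looks at the last token, which a real LLM does not do; it exists purely so the loop has something to call. All names (FAVORED_NEXT, toy_model, generate) are illustrative assumptions.

```python
VOCAB = ["The", " cat", " sat", " on", " the", " mat", "."]

# Stand-in for a real model: maps the last token to the token it favors next.
# A real LLM conditions on the whole sequence; this table is only for illustration.
FAVORED_NEXT = {" sat": " on", " on": " the", " the": " mat", " mat": "."}

def toy_model(tokens):
    """Return one logit per vocabulary entry for the next position."""
    favored = FAVORED_NEXT.get(tokens[-1])
    return [1.0 if tok == favored else 0.0 for tok in VOCAB]

def decode_greedy(logits):
    """Pick the token with the highest logit."""
    return VOCAB[max(range(len(logits)), key=lambda i: logits[i])]

def generate(prompt, n_steps):
    tokens = list(prompt)
    for _ in range(n_steps):
        logits = toy_model(tokens)            # model: scores over the vocabulary
        tokens.append(decode_greedy(logits))  # decode, then append the chosen token
    return tokens

output = generate(["The", " cat", " sat"], n_steps=4)
```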

What Repeats and What Does Not

Not every stage of inference repeats at the same cadence. Tokenization is done once for the incoming prompt. After that, the core generation loop repeats: model evaluation produces logits, decoding chooses one token, that token is appended, and the next step begins.

That split matters later when you learn prefill and decode. For now, keep just this invariant: prompt preprocessing happens once, next-token prediction happens over and over.

Shapes

We are not working with full tensor math yet. The key idea here is one-step input/output structure and sequence growth:

one model step input: [n_tokens] a token sequence

one model step output: [|V|] logits over the vocabulary

step 1 input length: 3 tokens

step 2 input length: 4 tokens

step 3 input length: 5 tokens

The sequence grows by one token at each step.
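To make the shapes concrete, the following sketch records the input length and output length at each step. The vocabulary size of 7, the integer token IDs, and toy_model_step are assumptions chosen for illustration only.

```python
VOCAB_SIZE = 7  # |V| for this toy example (assumed)

def toy_model_step(token_ids):
    """Hypothetical model step: takes [n_tokens] IDs, returns [|V|] logits."""
    favored = (token_ids[-1] + 1) % VOCAB_SIZE
    return [1.0 if v == favored else 0.0 for v in range(VOCAB_SIZE)]

token_ids = [0, 1, 2]   # a 3-token prompt
shape_trace = []        # (step, input length, output length)

for step in range(1, 4):
    logits = toy_model_step(token_ids)
    shape_trace.append((step, len(token_ids), len(logits)))
    next_id = max(range(VOCAB_SIZE), key=lambda v: logits[v])
    token_ids.append(next_id)  # the sequence grows by one token
```

In this trace the input length changes from step to step while the output length stays fixed at the vocabulary size.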

Math

No heavy math is required yet. The durable statement to keep is:

the model takes a sequence of token IDs and produces logits for every possible next token
the runtime chooses one token from those logits
that chosen token is appended, and the process repeats
the prediction step does not depend on future tokens — only on what came before
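The last statement can be checked on a toy example: if the prediction at a position only sees the tokens up to that position, then editing later tokens cannot change it. The scoring rule in toy_logits is a made-up stand-in, not how a transformer computes logits.

```python
def toy_logits(prefix, vocab_size=5):
    """Hypothetical stand-in model: scores depend only on the prefix it is given."""
    total = sum((pos + 1) * tok for pos, tok in enumerate(prefix))
    return [float((total + v) % vocab_size) for v in range(vocab_size)]

def predictions_at_every_position(token_ids):
    # The prediction after position i sees token_ids[0..i] and nothing later.
    return [toy_logits(token_ids[: i + 1]) for i in range(len(token_ids))]

original = [3, 1, 4, 1, 5]
edited = [3, 1, 4, 9, 2]   # same first three tokens, different "future"

before = predictions_at_every_position(original)
after = predictions_at_every_position(edited)

# Predictions for the shared prefix are identical; later ones are free to differ.
unchanged_prefix = before[:3] == after[:3]
```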

Implementation Hook

In llama.cpp, the autoregressive loop lives in the runtime tools, not inside the model builder itself.

tools/cli/cli.cpp

What to notice later: this is where a user-facing generation loop lives. The runtime repeatedly evaluates the model, applies sampling settings, and appends new tokens.

src/llama-graph.cpp

What to notice later: this file builds the model computation itself. It produces the hidden states and logits, but it is not the whole outer request loop.

Performance Hook

Because generation is iterative, the time to produce N tokens is roughly N times the time for one generation step. Tokenization happens once, before the loop, so it does not dominate. The per-step cost is model evaluation plus token choice, and model evaluation is usually by far the larger part. Later, you will learn why the first step (processing the prompt) and subsequent steps (generating more tokens) behave very differently.
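As a back-of-the-envelope illustration of the linear estimate, assume a fixed cost per generation step. The 25 ms figure and the function name are made up, and the sketch deliberately ignores prompt processing and any change in per-step cost as the sequence grows.

```python
def estimate_generation_ms(n_new_tokens, ms_per_step):
    """Rough estimate: total time grows linearly with the number of generated tokens."""
    return n_new_tokens * ms_per_step

# Assumed per-step cost of 25 ms (illustrative, not a measurement).
short_reply_ms = estimate_generation_ms(50, 25)
long_reply_ms = estimate_generation_ms(200, 25)
```

Under these assumptions a reply four times as long takes about four times as long to generate.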

Check Yourself

Q1 (conceptual)

What does the model directly produce on one step?

The meaning of the input sentence
Logits over the possible next tokens
All remaining tokens at once

Q2 (conceptual)

Why is generation called "autoregressive"?

Because it uses regression analysis
Because each prediction depends on the model's own previous outputs
Because it runs automatically without human input