An LLM Predicts the Next Token
An LLM does one core computation: given a sequence of previous tokens, it produces scores over which token should come next.
It does not emit a finished word or sentence directly. On every step it computes a score, called a logit, for every token in its vocabulary. In the standard setup, the last step is a single matrix multiply that maps the final hidden state to one score per vocabulary entry. The hidden state already encodes the information, but the scores make it explicit and comparable. A decoding rule such as argmax or sampling then turns those scores into one chosen next token.
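To make that final step concrete, here is a minimal sketch in plain Python. The vocabulary, the hidden-state vector, and the output matrix `W_out` are made-up toy values, not taken from any real model. The only point is that one matrix-vector multiply yields exactly one score per vocabulary entry, and that a separate rule then picks a token.

```python
# Toy values: a 4-token vocabulary and a 3-dimensional hidden state (illustrative only).
vocab = ["the", "cat", "sat", "."]
hidden = [0.5, -1.0, 2.0]

# Hypothetical output projection: one row per vocabulary entry.
W_out = [
    [0.1, 0.0, 0.2],   # row for "the"
    [0.3, -0.5, 0.1],  # row for "cat"
    [0.0, 0.2, 1.0],   # row for "sat"
    [-0.4, 0.1, 0.0],  # row for "."
]

def project_to_logits(W, h):
    """Matrix-vector multiply: one dot product, and so one score, per vocabulary row."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

logits = project_to_logits(W_out, hidden)

# Greedy decoding (argmax) turns the scores into one chosen token ID.
next_id = max(range(len(logits)), key=lambda i: logits[i])
next_token = vocab[next_id]
```

With these toy numbers the row for "sat" has the largest dot product, so argmax selects it. A real model does the same thing with a much larger vocabulary and hidden size.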
Generation — the process of producing long outputs — is just this one-step prediction repeated. The model produces scores, one token is chosen and appended to the sequence, and the model runs again.
This is called autoregressive generation. The model's output becomes its own input on the next step.
The distinction between the model and the decoding policy is worth making explicit early because it holds throughout the course. The model produces logits. The runtime decoding policy turns those logits into a token choice.
If the runtime uses greedy decoding, it picks the highest-scoring token. If it uses sampling, it can choose a lower-scoring token on purpose according to a probability distribution derived from the logits. Either way, the transformer layers did not directly output text. They output scores.
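The two decoding policies can be compared on the same logits. The sketch below assumes the probability distribution is a plain softmax over the logits, with no temperature or top-k filtering. The logit values and the helper names (`softmax`, `greedy`, `sample`) are illustrative only.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, rng):
    """Draw one index at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # made-up scores for a 3-token vocabulary
rng = random.Random(0)     # fixed seed so the run is repeatable

greedy_pick = greedy(logits)                        # always index 0 for these logits
draws = [sample(logits, rng) for _ in range(1000)]  # 1000 independent samples
counts = [draws.count(i) for i in range(len(logits))]
```

Greedy decoding returns index 0 every time, while sampling also returns indices 1 and 2 in proportion to their probabilities. The logits are identical in both cases; only the policy applied to them differs.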
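One-step prediction: token IDs go in, one logit per vocabulary entry comes out, and the decoding rule picks one token.
Autoregressive generation repeats this: the chosen token is appended to the sequence, and the longer sequence becomes the input to the next step.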
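The loop can be sketched in a few lines. `toy_model` below is a hypothetical stand-in for a real model: it returns one logit per vocabulary entry and simply favors the ID that follows the last token. Greedy decoding is used here for simplicity; any decoding rule could take its place.

```python
VOCAB_SIZE = 5

def toy_model(tokens):
    """Stand-in for a real model: returns one logit per vocabulary entry.
    This toy favors the ID that follows the last token in the sequence."""
    favored = (tokens[-1] + 1) % VOCAB_SIZE
    return [1.0 if i == favored else 0.0 for i in range(VOCAB_SIZE)]

def generate(prompt_ids, max_new_tokens):
    """Repeat one-step prediction, appending each chosen token to the sequence."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(tokens)                                  # model evaluation
        next_id = max(range(VOCAB_SIZE), key=lambda i: logits[i])   # greedy decoding
        tokens.append(next_id)                                      # output becomes input
    return tokens

result = generate([0, 1], max_new_tokens=3)
```

Starting from the prompt `[0, 1]`, each step sees a sequence one token longer than the previous step did, which is the whole meaning of "autoregressive" here.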
Not every stage of inference repeats at the same cadence. Tokenization is done once for the incoming prompt. After that, the core generation loop repeats: model evaluation produces logits, decoding chooses one token, that token is appended, and the next step begins.
That split matters later when you learn prefill and decode. For now, keep just this invariant: prompt preprocessing happens once, next-token prediction happens over and over.
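The invariant can be checked by counting calls. In this sketch both `tokenize` and `toy_model` are hypothetical stand-ins (the tokenizer just maps each word to its length), and the counters exist only to show how often each stage runs.

```python
calls = {"tokenize": 0, "model": 0}

def tokenize(text):
    """Hypothetical tokenizer: maps each whitespace-separated word to a toy ID."""
    calls["tokenize"] += 1
    return [len(word) for word in text.split()]

def toy_model(tokens):
    """Stand-in model: logits over a 10-entry vocabulary, favoring len(tokens) % 10."""
    calls["model"] += 1
    favored = len(tokens) % 10
    return [1.0 if i == favored else 0.0 for i in range(10)]

tokens = tokenize("the quick brown fox")   # runs once, before the loop
for _ in range(5):                         # runs once per generated token
    logits = toy_model(tokens)
    tokens.append(max(range(len(logits)), key=lambda i: logits[i]))
```

After the loop, `calls` records one tokenizer call and five model calls: prompt preprocessing ran once, next-token prediction ran once per generated token.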
We are not working with full tensor math yet. The key idea here is the one-step input/output structure and how the sequence grows. The durable statement to keep is:
- the model takes a sequence of token IDs and produces logits for every possible next token
- the runtime chooses one token from those logits
- that chosen token is appended, and the process repeats
- the prediction step does not depend on future tokens — only on what came before
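The last point in the list can be checked on a toy. The sketch below is not a transformer: `toy_scores` is a hypothetical function that, by construction, computes the scores at each position from only the tokens at or before that position. Under that assumption, changing a later token cannot change an earlier position's scores.

```python
def toy_scores(tokens, vocab_size=4):
    """For each position t, return toy logits computed from tokens[: t + 1] only."""
    all_logits = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]   # nothing after position t is visible
        favored = sum(prefix) % vocab_size
        all_logits.append([1.0 if i == favored else 0.0 for i in range(vocab_size)])
    return all_logits

a = toy_scores([1, 2, 3, 0])
b = toy_scores([1, 2, 3, 3])   # same sequence except for the final token

same_prefix_scores = a[:3] == b[:3]   # positions 0..2 never saw the change
last_differs = a[3] != b[3]           # only the final position is affected
```

Both checks come out true: the scores at the first three positions are identical across the two inputs, and only the final position reacts to the changed token.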
In llama.cpp, the autoregressive loop lives in the runtime tools, not inside the model builder itself.
What to notice later: this is where a user-facing generation loop lives. The runtime repeatedly evaluates the model, applies sampling settings, and appends new tokens.
What to notice later: this file builds the model computation itself. It produces the hidden states and logits, but it is not the whole outer request loop.
Because generation is iterative, the time to produce N tokens is roughly N times the time for one generation step. Tokenization runs once, outside that loop, so it does not dominate the cost. Model evaluation and token choice do. Later, you will learn why the first step (processing the prompt) and subsequent steps (generating more tokens) behave very differently.
What does the model directly produce on one step?
Why is generation called "autoregressive"?