The Output Head Produces One Logit Per Vocabulary Entry

What is a logit?

You have learned that token IDs become embedding vectors, and that those vectors can be transformed by projections. Inside the model, a series of processing steps (which you will learn in detail in Module 4) transforms the initial embeddings into a final hidden state for the last token position — a dense vector of dimension d_model that encodes everything the model "knows" about what should come next. For now, treat the model as a black box that takes in token embeddings and outputs a hidden-state vector. The important thing is what happens to that vector after the model finishes.

That hidden state lives in d_model dimensions, but a next-token prediction needs one score for every entry in the vocabulary. The output head bridges this gap. It is one final linear projection (a matrix multiply) that maps the hidden state from d_model dimensions to |V| dimensions, one number per vocabulary entry. The result is a vector of logits: raw, unbounded scores that rank every token in the vocabulary.

You might ask: why not something more complex? The answer is that the model's internal processing layers have already done the heavy lifting. They have transformed the initial embeddings into a rich representation that implicitly encodes the next-token prediction. The output head only needs to "read out" that prediction into vocabulary space. A single matrix multiply with a weight matrix W_out of shape [d_model, |V|] is sufficient for this readout.

There is also a cost argument. The output head weight matrix is large — with d_model = 4096 and |V| = 128,000, it contains over 500 million parameters. Adding nonlinearities or extra layers here would multiply that cost for minimal benefit, since the representation is already rich.
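The parameter count above is a single multiplication, sketched here with the lesson's example sizes. The 2-bytes-per-parameter figure is an assumption (16-bit weights) used only to give a feel for the memory footprint; real storage depends on the weight format.

```python
# Sizes from the lesson's running example.
d_model = 4096
vocab_size = 128_000

# One weight for every (hidden dimension, vocabulary entry) pair.
n_params = d_model * vocab_size

# Assumption: 16-bit weights, i.e. 2 bytes per parameter.
bytes_per_param = 2
size_gb = n_params * bytes_per_param / 1e9

print(f"parameters: {n_params:,}")
print(f"approx. size at 2 bytes/param: {size_gb:.2f} GB")
```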

In many models, W_out is actually the same matrix as the embedding table (transposed). This is called weight tying: the model uses the same learned vectors for reading tokens in (embedding) and reading predictions out (output head). This saves memory and often works as well as or better than using separate matrices.
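A minimal sketch of weight tying, using a made-up embedding table with 4 vocabulary entries and d_model = 3. All values and names here are hypothetical. With tying, each logit is simply the dot product of the hidden state with one row of the embedding table.

```python
# Hypothetical toy embedding table: shape [|V|, d_model] = [4, 3].
embedding_table = [
    [0.5, -1.0, 0.0],   # token 0
    [1.0, 0.5, 0.5],    # token 1
    [-0.5, 0.0, 2.0],   # token 2
    [0.0, 1.0, -1.0],   # token 3
]

# Weight tying: the output head is the embedding table transposed,
# which gives shape [d_model, |V|] = [3, 4].
w_out = [list(column) for column in zip(*embedding_table)]

# Hypothetical final hidden state, shape [d_model].
h = [1.0, 2.0, -1.0]

# Output head: logit i = sum over d of h[d] * w_out[d][i].
logits = [
    sum(h[d] * w_out[d][i] for d in range(len(h)))
    for i in range(len(embedding_table))
]

# Equivalent view: dot product of h with each embedding row.
row_dots = [sum(a * b for a, b in zip(h, row)) for row in embedding_table]

print(logits)
print(logits == row_dots)
```

Only one copy of the matrix needs to be stored; the transposed view is a matter of how it is indexed.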

The word "logit" comes from statistics, where it refers to the log-odds of a probability. In the LLM context, the meaning is looser: a logit is a raw score for one vocabulary entry, before any normalization.

Logits are unbounded real numbers. They can be positive, negative, or zero. A larger logit means the model considers that token more likely. But logits are not probabilities:

  • A logit of 5.4 does not mean "54% likely." It just means "higher score than a logit of 3.2."
  • The absolute magnitude of logits depends on the model, the input, and the training dynamics. There is no universal scale.
  • Raw logits are only meaningful relative to other logits at the same position of the same model.

To turn logits into probabilities, you need softmax — the subject of the next lesson.

After processing "The cat sat", the model's final hidden state has shape [4096]. The output head multiplies it by W_out (shape [4096, 128000]) to produce 128,000 logits, one per vocabulary entry. Here is a tiny slice showing the top 5:

Token     ID      Logit
" on"     373      5.4
" the"    279      2.1
" a"      247      0.8
" dog"    5765     1.2
"."       13      -0.3
... plus 127,995 more logits (one for every other vocabulary entry)
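Since logits only rank tokens, the natural first operation on them is a sort. The sketch below ranks the five example rows from this lesson by logit; the tuple layout and variable names are illustrative choices.

```python
# The five (token, id, logit) rows from the example slice.
rows = [
    (" on", 373, 5.4),
    (" the", 279, 2.1),
    (" a", 247, 0.8),
    (" dog", 5765, 1.2),
    (".", 13, -0.3),
]

# Rank by logit, highest first. No normalization is needed to rank.
ranked = sorted(rows, key=lambda row: row[2], reverse=True)

for token, token_id, logit in ranked:
    print(f"{token!r:8} id={token_id:<5} logit={logit:+.1f}")

best_token = ranked[0][0]
```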

The token " on" has the highest logit (5.4), but that number alone is not a probability. To convert these logits into actual probabilities, you need to run softmax over all 128,000 logits. (You can compare two logits pairwise: the odds ratio between " on" and " the" is e^(5.4 − 2.1) ≈ 27. For absolute probabilities, though, the full distribution matters.) That computation is what the next lesson covers.
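The pairwise comparison can be checked in a couple of lines. This sketch also shifts both logits by the same constant to illustrate that only the difference between them enters the ratio; the shift of 100 is an arbitrary illustrative value.

```python
import math

logit_on = 5.4
logit_the = 2.1

# Odds ratio between the two tokens: e raised to the logit difference.
odds_ratio = math.exp(logit_on - logit_the)
print(round(odds_ratio, 1))

# Adding the same constant to both logits leaves the ratio unchanged
# (up to floating-point rounding).
shifted_ratio = math.exp((logit_on + 100.0) - (logit_the + 100.0))
print(math.isclose(odds_ratio, shifted_ratio, rel_tol=1e-9))
```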

Adjust the logit values below and see how the bar chart changes. For now, focus on the left side (logits). In the next lesson you will learn how softmax converts these into the probability bars on the right.

[Interactive widget: adjustable logits (raw scores) for "the", "a", "cat", "dog", and ".", with a temperature slider ranging from sharp to uniform. At temperature 1.0, the logits 2.1, 0.8, 5.4, 1.2, and -0.3 map through softmax to probabilities 3.5%, 0.9%, 93.9%, 1.4%, and 0.3%, which sum to 1. The argmax is "cat" (93.9%).]

Final hidden state for last position: [d_model]
Output head weight matrix: [d_model, |V|]
Logits: [|V|]   one score per vocabulary entry
Computation: [d_model] × [d_model, |V|] = [|V|]   (one matrix-vector product)

The output head computation is a single matrix-vector product:

z = h · W_out
h ∈ ℝ^(d_model),   W_out ∈ ℝ^(d_model × |V|),   z ∈ ℝ^(|V|)

Each logit z_i is the dot product of the hidden state with the i-th column of W_out. You can think of each column as a "template" for one vocabulary token: the logit measures how well the hidden state matches that template. Higher match → higher logit → model thinks that token is more likely.
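The column-as-template view can be sketched in plain Python. The sizes (d_model = 3, |V| = 4) and all values are made up for illustration, and output_head is a hypothetical helper name, not a function from any library.

```python
def output_head(h, w_out):
    """Map a hidden state [d_model] to logits [|V|] using w_out [d_model, |V|]."""
    d_model = len(h)
    vocab_size = len(w_out[0])
    logits = []
    for i in range(vocab_size):
        # Column i of w_out is the "template" for vocabulary entry i.
        column = [w_out[d][i] for d in range(d_model)]
        logits.append(sum(x * w for x, w in zip(h, column)))
    return logits

# Hypothetical hidden state and weights.
h = [1.0, 0.0, -1.0]
w_out = [
    [2.0, 0.0, -1.0, 0.5],
    [9.0, 9.0, 9.0, 9.0],
    [-2.0, 0.0, 1.0, 0.5],
]

logits = output_head(h, w_out)
print(logits)
```

Column 0 points in the same direction as h and gets the highest logit; column 2 points the opposite way and gets the lowest. The second row of w_out has no effect here because the matching component of h is zero.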

The examples above use made-up numbers. The widget below runs DistilGPT-2 in your browser and shows the actual logit vector the model produces for your prompt. Notice the top predictions — these are the model's real next-token guesses, before softmax.

In llama.cpp, the output head is the final build_lora_mm(model.output, cur) call in the model builder. If weight tying is used, model.output points to the same tensor as model.tok_embd (the embedding table). The resulting logits are the last node in the computation graph — everything after this (softmax, sampling) happens outside the graph.

The output head weight matrix has shape [d_model, |V|]. With d_model = 4096 and |V| = 128,000, this single multiply involves 4096 × 128,000 ≈ 524 million multiply-add operations per token. For models with very large vocabularies, this is a significant fraction of per-token compute, and it can cost more than a single attention or FFN projection. Vocabulary size is not just a tokenizer concern; it is a performance parameter.
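To put the multiply-add count in context, the sketch below compares it with a square [d_model, d_model] projection. The square shape is an assumption chosen for comparison; actual attention and FFN projection shapes vary by model.

```python
d_model = 4096
vocab_size = 128_000

# Multiply-adds for one output-head matrix-vector product.
output_head_macs = d_model * vocab_size

# Assumption: a square [d_model, d_model] projection as the comparison point.
square_proj_macs = d_model * d_model

ratio = output_head_macs / square_proj_macs

print(f"output head:       {output_head_macs:,}")
print(f"square projection: {square_proj_macs:,}")
print(f"ratio:             {ratio:.2f}x")
```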

Logits are raw scores — not probabilities. You cannot say "the model is 70% confident" from a logit of 5.4, because you do not know what the other 127,999 logits look like. The number 5.4 is only meaningful in comparison to the other logits at this position.

The next lesson introduces softmax, which takes all |V| logits at once and converts them into a probability distribution: positive numbers that sum to 1. That is the specific transformation that makes logits interpretable as relative likelihoods.
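As a preview only, here is a minimal softmax sketch applied to the five logits from this lesson's interactive example. Subtracting the maximum before exponentiating is a common numerical-stability step and is an addition here, not something this lesson covers; the next lesson is the authoritative treatment.

```python
import math

def softmax(logits):
    """Convert a list of logits into positive numbers that sum to 1."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["the", "a", "cat", "dog", "."]
logits = [2.1, 0.8, 5.4, 1.2, -0.3]

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token!r:6} {p:.1%}")
```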

Check Yourself
Q1 (math)

The output head weight matrix has shape [4096, 128000]. How many multiply-add operations does one output-head computation require per token?

Q2 (reasoning)

Two models produce logits for the same prompt. Model A gives token "on" a logit of 8.2. Model B gives "on" a logit of 3.1. Which model thinks "on" is more likely?