Softmax Turns Scores into a Distribution
The previous lesson left us with logits: raw, unbounded scores, one per vocabulary entry. Logits are the model's internal ranking of how likely each next token is. For greedy decoding (always pick the highest) or the filtering step of top-k (keep only the k highest), logits alone are sufficient, because only the ranking matters. But for probabilistic sampling, top-p (nucleus) filtering, and calibrated confidence, we need actual probabilities. Logits are not probabilities:
- They can be negative. You cannot interpret a logit of -2.3 as a probability.
- They do not sum to 1. You cannot sample from them or compute cumulative thresholds (needed for top-p). (Logit differences do encode relative odds, $e^{z_i - z_j} = p_i / p_j$, but absolute probabilities require normalization.)
Softmax fixes both problems in one operation. It converts a vector of arbitrary real numbers into a vector of positive numbers that sum to exactly 1 — a proper probability distribution. Crucially, softmax preserves the ranking: the highest logit becomes the highest probability.
Softmax has three stages. Understanding each stage separately makes the whole formula easy to follow.
Stage 1: Exponentiate every logit.
Take each logit $z_i$ and compute $e^{z_i}$. This is the key step: it guarantees every output is positive, because $e^x > 0$ for any real number $x$. It does not matter if the logit is -100 or +100; the exponential always produces a positive result.
Why exponentials and not, say, just adding a constant to make everything positive? Because exponentials preserve the ordering of the logits (larger input → larger output) while also turning differences into ratios. A logit gap of 2, as between 5 and 3, becomes a ratio of $e^5 / e^3 = e^2 \approx 7.4$. This amplification is what lets softmax produce a "peaked" distribution in which the most likely token stands out.
Stage 2: Sum all the exponentials.
Compute $Z = \sum_j e^{z_j}$. This is the normalization constant. It ensures we can divide each exponentiated logit by the total and get numbers that sum to 1.
Stage 3: Divide each exponentiated logit by the total.
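The softmax probability for token i is:
$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{e^{z_i}}{Z}$$
Every $p_i$ is positive (because the numerator is an exponential, and so is every term of $Z$). The sum of all $p_i$ is exactly 1 (because they all share the same denominator $Z$, which is the total of all the numerators).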
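Three logits: z = [2.0, 1.0, 0.1]. Walk through softmax step by step.
Stage 1: $e^{2.0} = 7.389$, $e^{1.0} = 2.718$, $e^{0.1} = 1.105$
Stage 2: $Z = 7.389 + 2.718 + 1.105 = 11.212$
Stage 3 (tokens indexed from 0):
p0 = 7.389 / 11.212 = 0.659
p1 = 2.718 / 11.212 = 0.242
p2 = 1.105 / 11.212 = 0.099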
Check: 0.659 + 0.242 + 0.099 = 1.000. The largest logit (2.0) gets 65.9% of the probability mass. The difference between logits 2.0 and 1.0 was just 1.0, but the probability ratio is 0.659 / 0.242 ≈ 2.7 — the exponential amplified a small difference into a noticeable gap.
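The three stages can be sketched in a few lines of Python. This is a minimal, naive version for illustration only (the function name `softmax_three_stages` is made up here, and no overflow protection is included); it reproduces the worked example above.

```python
import math

def softmax_three_stages(logits):
    """Naive softmax written as three explicit stages (illustrative only)."""
    # Stage 1: exponentiate every logit; each result is strictly positive.
    exps = [math.exp(z) for z in logits]
    # Stage 2: sum the exponentials to get the normalization constant Z.
    total = sum(exps)
    # Stage 3: divide each exponential by Z so the outputs sum to 1.
    probs = [e / total for e in exps]
    return exps, total, probs

exps, total, probs = softmax_three_stages([2.0, 1.0, 0.1])
print([round(e, 3) for e in exps])   # [7.389, 2.718, 1.105]
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(probs[0] / probs[1], 3)) # 2.718, i.e. e ** (2.0 - 1.0)
```

The last line shows that the probability ratio between two tokens depends only on the difference of their logits.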
The amplification effect means softmax behavior depends strongly on the spread of the logits:
- Large spread: If the top logit is much larger than the rest (e.g., top = 10.0, second = 2.0, rest near 0), the top token dominates the probability mass. How much depends on the gap and the number of competitors: with 3 classes (10.0, 2.0, 0), the top token gets >99%. With a top logit of 10 and 128,000 other classes all at 0, it only gets ~15%, because the competitors' exponentials add up to 128,000 against $e^{10} \approx 22026$. The key factor is the top token's exponential relative to the sum of all the others, not its gap to any single competitor or to the average.
- Small spread: If all logits are close together (e.g., 1.0, 1.1, 0.9), the probabilities are nearly uniform. The model is "uncertain": no token stands out strongly.
- Identical logits: If all logits are the same value (e.g., all zeros), every probability equals 1/|V|. Maximum uncertainty. Note that the absolute value does not matter: softmax([0, 0, 0]) and softmax([100, 100, 100]) produce the same result, because only relative differences matter.
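The three regimes in the list above can be checked numerically. The sketch below uses a naive `softmax` helper defined only for this example, and it assumes "128,000 classes at 0" means 128,000 competitors in addition to the top token.

```python
import math

def softmax(logits):
    # Naive softmax; adequate here because the logits are modest in size.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Large spread, few competitors: the top token takes almost everything.
few = softmax([10.0, 2.0, 0.0])
print(round(few[0], 4))   # 0.9996

# Same top logit, but 128,000 competitors at 0: their combined mass adds up.
many = softmax([10.0] + [0.0] * 128_000)
print(round(many[0], 3))  # 0.147

# Small spread: close to uniform.
print([round(p, 3) for p in softmax([1.0, 1.1, 0.9])])

# Identical logits: uniform, whatever the shared value is.
print(softmax([0.0, 0.0, 0.0]))
print(softmax([100.0, 100.0, 100.0]))
```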
Temperature is a scalar T > 0 that divides the logits before softmax:
$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
Dividing by T changes the spread of the logits before the exponential sees them:
- T < 1 (e.g., 0.5): Logit differences are amplified. z/0.5 = 2z, so the spread doubles. The distribution becomes sharper — the model acts more "confident." At T → 0, softmax approaches argmax.
- T = 1: No change. Standard softmax.
- T > 1 (e.g., 2.0): Logit differences are compressed. z/2 halves the spread. The distribution becomes flatter — more tokens have a reasonable chance of being picked. At T → ∞, the distribution approaches uniform (1/|V| for each token).
Temperature does not change the model's logits. It changes how the decoding policy interprets them. This is why temperature is a runtime setting, not a model property — you can adjust it between requests.
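A small sketch of the effect, reusing the logits from the worked example. The helper name `softmax_with_temperature` is hypothetical, and the code is deliberately naive (no overflow protection), so it is only suitable for small logits and moderate T.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by T first, then apply ordinary softmax.
    scaled = [z / temperature for z in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # never modified; only T changes between calls
for t in (0.5, 1.0, 2.0, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
```

The top token's probability falls from roughly 0.86 at T = 0.5 to 0.66 at T = 1 and 0.50 at T = 2, and at T = 100 all three probabilities are close to 1/3.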
Adjust the logits and temperature below. Watch how the probability bars change. Try: setting one logit much higher than the rest, setting all logits equal, and adjusting temperature to see sharpening and flattening.
There is one practical problem with computing softmax directly. If a logit is very large (say, 1000), then $e^{1000}$ overflows: it is too large for floating-point numbers to represent. The result is Inf, the sum in the denominator is Inf as well, and the division Inf / Inf produces NaN.
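The standard fix is to subtract the maximum logit $m = \max_j z_j$ from all logits before exponentiating:
$$p_i = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}$$
Why does this work? Because subtracting a constant from all logits does not change the softmax output. The proof is one line: $e^{z_i - m} / \sum_j e^{z_j - m} = (e^{z_i} \cdot e^{-m}) / (e^{-m} \sum_j e^{z_j}) = e^{z_i} / \sum_j e^{z_j}$. The $e^{-m}$ cancels from numerator and denominator. The result is identical, but now the largest term is $e^0 = 1$, which never overflows. Very negative shifted logits may underflow to 0, which is harmless because the denominator is always at least 1.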
Practically every real softmax implementation uses this trick. If you see max before exp in model code, this is what it is doing.
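The difference is easy to see in a pure-Python sketch. Both function names below are made up for this example. One caveat: Python's `math.exp` raises `OverflowError` on overflow instead of returning Inf, so the naive version fails with an exception here, whereas array libraries that follow IEEE semantics would typically return Inf and then NaN as described above.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(z) for z in logits]  # math.exp(1000.0) raises OverflowError
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    m = max(logits)                           # pass 1: find the maximum logit
    exps = [math.exp(z - m) for z in logits]  # pass 2: exponentiate shifted logits
    total = sum(exps)                         #         and sum them (always >= 1)
    return [e / total for e in exps]          # pass 3: normalize

big = [1000.0, 999.0, 998.0]
try:
    naive_softmax(big)
except OverflowError as err:
    print("naive softmax failed:", err)

# Shifting by the max gives the same answer as softmax([2, 1, 0]).
print([round(p, 3) for p in stable_softmax(big)])              # [0.665, 0.245, 0.09]
print([round(p, 3) for p in stable_softmax([2.0, 1.0, 0.0])])  # [0.665, 0.245, 0.09]
```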
You have seen softmax here at the output head, turning logits into next-token probabilities. In Module 4, you will see it again inside the model's layers: when tokens compare their queries against keys to decide "how much should I attend to each previous token?", softmax converts those raw comparison scores into weights that sum to 1. The same formula, the same properties (positive, sums to 1), the same numerical stability trick — just applied at a different point in the computation. When you reach that lesson, the math will already be familiar.
The full softmax formula, including temperature and the stability trick:
$$p_i = \frac{e^{(z_i - m)/T}}{\sum_j e^{(z_j - m)/T}}, \qquad m = \max_j z_j, \quad T > 0$$
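Put together, temperature and max-subtraction fit in one short function. This is a sketch under stated assumptions rather than a reference implementation: the name `softmax` and the choice to reject T ≤ 0 with an error are decisions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax (illustrative sketch)."""
    if temperature <= 0:
        # T -> 0 approaches argmax; this sketch asks callers to handle that case.
        raise ValueError("temperature must be positive")
    m = max(logits)
    # For T > 0, every exponent (z - m) / T is <= 0, so exp() cannot overflow.
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 999.0, 998.0]
print([round(p, 3) for p in softmax(logits)])                    # T = 1
print([round(p, 3) for p in softmax(logits, temperature=0.1)])   # sharper
print([round(p, 3) for p in softmax(logits, temperature=10.0)])  # flatter
```

Because T is positive, subtracting the max before dividing by T gives the same result as subtracting the max of the already-scaled logits.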
The inspector above uses hand-tuned logits. The widget below runs the same temperature slider on real DistilGPT-2 output. Watch how the probability distribution sharpens at low temperature and flattens at high temperature — on actual model predictions, not synthetic numbers.
In llama.cpp, softmax is implemented via ggml_soft_max. The max-subtraction trick is built into this operator, so you do not need to apply it manually. Temperature scaling is handled separately in the sampling code, not in the model graph: the model always produces raw, unscaled logits (equivalent to T=1), and the runtime applies temperature before sampling.
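The same separation of concerns can be illustrated in Python. This is not llama.cpp's code or API; `sample_next_token` is a hypothetical helper, and treating T = 0 as greedy decoding is a convention chosen for this sketch. The point is only that the sampler receives raw logits and applies temperature itself.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Hypothetical sampler: takes raw logits, returns a token index."""
    if temperature <= 0:
        # Convention for this sketch: T = 0 means greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    m = max(logits)
    weights = [math.exp((z - m) / temperature) for z in logits]
    # random.choices normalizes the weights, so no explicit division is needed.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

raw_logits = [2.0, 1.0, 0.1]  # stand-in for what a model would emit
rng = random.Random(0)        # seeded generator for repeatable runs
print(sample_next_token(raw_logits, temperature=0.0))           # 0 (greedy)
print(sample_next_token(raw_logits, temperature=1.0, rng=rng))  # 0, 1, or 2
```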
Softmax itself is cheap — three passes over the data (max, exp-and-sum, divide). For the output head, the vector has |V| elements (e.g., 128K), which is small compared to the projection that produced it. For attention, the softmax is over the key dimension (sequence length), and each head does its own softmax. With very long sequences, the attention softmax can become noticeable in profiles, but it is rarely the dominant cost.
softmax([5.0, 5.0, 5.0]) produces what distribution?
You lower the temperature from T=1.0 to T=0.1. What happens to the probability of the highest-logit token?
Why do real softmax implementations subtract the max logit before exponentiating?