L12

Softmax Turns Scores into a Distribution

15 min

Question

What does softmax change?

Intuition

The previous lesson left us with logits — raw, unbounded scores, one per vocabulary entry. Logits are the model's internal ranking of how likely each next token is. For greedy decoding (always pick the highest) or top-k filtering (pick from the k highest), logits alone are sufficient — only the ranking matters. But for probabilistic sampling, top-p (nucleus) filtering, and calibrated confidence, we need actual probabilities. Logits are not probabilities:

They can be negative. You cannot interpret a logit of -2.3 as a probability.
They do not sum to 1. You cannot sample from them or compute cumulative thresholds (needed for top-p). (Logit differences do encode relative odds — exp(z_i - z_j) = p_i/p_j — but absolute probabilities require normalization.)

Softmax fixes both problems in one operation. It converts a vector of arbitrary real numbers into a vector of positive numbers that sum to exactly 1 — a proper probability distribution. Crucially, softmax preserves the ranking: the highest logit becomes the highest probability.

The Formula, Step by Step

Softmax has three stages. Understanding each stage separately makes the whole formula easy to follow.

Stage 1: Exponentiate every logit.

Take each logit z_i and compute e^{z_i}. This is the key step: it guarantees every output is positive, because e^x > 0 for any real number x. It does not matter if the logit is -100 or +100; the exponential always produces a positive result.

Why exponentials and not, say, just adding a constant to make everything positive? Because exponentials preserve the ordering of the logits (larger input → larger output) while also amplifying differences: logits of 5 and 3 differ by only 2, yet e⁵ is about 7.4 times e³. This amplification is what makes softmax produce a "peaked" distribution in which the most likely token stands out.

Stage 2: Sum all the exponentials.

Compute Z = ∑_j e^{z_j}. This is the normalization constant. It ensures we can divide each exponentiated logit by the total and get numbers that sum to 1.

Stage 3: Divide each exponentiated logit by the total.

The softmax probability for token i is:

p_i = e^{z_i} / Z, where Z = ∑_j e^{z_j}

Every p_i is positive (because the numerator is an exponential and Z is a sum of exponentials). The sum of all p_i is exactly 1 (because the numerators add up to Z, the shared denominator).

Worked Example

Three logits: z = [2.0, 1.0, 0.1]. Walk through softmax step by step.

Step 1: Exponentiate

e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105

Step 2: Sum

Z = 7.389 + 2.718 + 1.105 = 11.212

Step 3: Divide

p₀ = 7.389 / 11.212 = 0.659
p₁ = 2.718 / 11.212 = 0.242
p₂ = 1.105 / 11.212 = 0.099

Check: 0.659 + 0.242 + 0.099 = 1.000. The largest logit (2.0) gets 65.9% of the probability mass. The difference between logits 2.0 and 1.0 was just 1.0, but the probability ratio is 0.659 / 0.242 ≈ 2.7 — the exponential amplified a small difference into a noticeable gap.
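The three stages and the worked example can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name softmax is chosen here for illustration, and the sketch deliberately omits the overflow protection that real code needs.

```python
import math

def softmax(logits):
    """Plain three-stage softmax (illustrative; no overflow protection)."""
    exps = [math.exp(z) for z in logits]   # Stage 1: exponentiate
    total = sum(exps)                      # Stage 2: sum (the constant Z)
    return [e / total for e in exps]       # Stage 3: divide

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print([round(p, 3) for p in probs])   # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))           # 1.0

# The probability ratio equals exp of the logit difference: e^(2.0 - 1.0).
print(round(probs[0] / probs[1], 3))  # 2.718
```

The last line shows that the ratio of about 2.7 in the worked example is e itself: a logit gap of 1.0 always multiplies the odds by e, whatever the other logits are.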

What Happens at the Extremes

The amplification effect means softmax behavior depends strongly on the spread of the logits:

Large spread: If the top logit is much larger than the second-highest (e.g., top = 10.0, second = 2.0, rest near 0), the top token dominates the probability mass. How much depends on both the gap and the number of competitors: with 3 classes and a gap of 10, the top token gets >99%. With 128,000 classes all at 0, a top logit of 10 only gets ~15%, because the many small exponentials add up. What matters is how the top logit compares with the combined weight of all its competitors (here 10 versus ln(128,000) ≈ 11.8), not its gap to any single competitor or to the average.
Small spread: If all logits are close together (e.g., 1.0, 1.1, 0.9), the probabilities are nearly uniform. The model is "uncertain": no token stands out strongly.
Identical logits: If all logits are the same value (e.g., all zeros), every probability equals 1/|V|. This is maximum uncertainty. Note that the absolute value does not matter: softmax([0, 0, 0]) and softmax([100, 100, 100]) produce the same result, because only relative differences matter.
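The three regimes can be checked numerically. The sketch below is an illustration, not a reference implementation: it assumes a vocabulary of 128,000 entries in total for the large case, and it subtracts the maximum logit before exponentiating, which leaves the result unchanged and keeps inputs such as 100 safe.

```python
import math

def softmax(logits):
    """Softmax with the maximum subtracted first so large logits are safe."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Large spread, few competitors: the top token takes almost everything.
few = softmax([10.0, 0.0, 0.0])
print(round(few[0], 4))            # 0.9999

# Same gap, 128,000 entries in total: the top token gets roughly 15%.
many = softmax([10.0] + [0.0] * 127_999)
print(round(many[0], 3))           # 0.147

# Small spread: close to uniform.
near_uniform = softmax([1.0, 1.1, 0.9])
print([round(p, 3) for p in near_uniform])   # [0.332, 0.367, 0.301]

# Identical logits: uniform, whatever the shared value is.
print(softmax([0.0, 0.0, 0.0]) == softmax([100.0, 100.0, 100.0]))  # True
```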

Temperature: Controlling the Sharpness

Temperature is a scalar T that divides the logits before softmax:

p_i = e^{z_i / T} / ∑_j e^{z_j / T}

Dividing by T changes the spread of the logits before the exponential sees them:

T < 1 (e.g., 0.5): Logit differences are amplified. z/0.5 = 2z, so the spread doubles. The distribution becomes sharper — the model acts more "confident." At T → 0, softmax approaches argmax.
T = 1: No change. Standard softmax.
T > 1 (e.g., 2.0): Logit differences are compressed. z/2 halves the spread. The distribution becomes flatter — more tokens have a reasonable chance of being picked. At T → ∞, the distribution approaches uniform (1/|V| for each token).

Temperature does not change the model's logits. It changes how the decoding policy interprets them. This is why temperature is a runtime setting, not a model property — you can adjust it between requests.
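The effect of T can be seen by running the same logits through softmax at several temperatures. This is a minimal sketch; the function name softmax_with_temperature and the rejection of non-positive T are choices made for this example, not taken from any particular runtime.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply softmax. Assumes temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # shift for numerical safety
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # the logits themselves never change
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Output:
# 0.5 [0.864, 0.117, 0.019]
# 1.0 [0.659, 0.242, 0.099]
# 2.0 [0.502, 0.304, 0.194]
```

The ranking is the same in every row; only the share taken by the top token moves, from about 86% at T = 0.5 down to about 50% at T = 2.0.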

Interactive — Softmax Inspector

Adjust the logits and temperature below. Watch how the probability bars change. Try: setting one logit much higher than the rest, setting all logits equal, and adjusting temperature to see sharpening and flattening.

Default state (Temperature: 1.0; slider runs from sharp to uniform):

Token    Logit    Probability
"the"    2.1      3.5%
"a"      0.8      0.9%
"cat"    5.4      93.9%
"dog"    1.2      1.4%
"."      -0.3     0.3%

argmax → "cat" (93.9%), sum → 1.0000

Numerical Stability: The Max-Subtraction Trick

There is one practical problem with computing softmax directly. If a logit is very large (say, 1000), then e¹⁰⁰⁰ overflows — it is too large for floating-point numbers to represent. The result is Inf, and the division produces NaN.

The standard fix is to subtract the maximum logit from all logits before exponentiating:

m = max(z₀, z₁, ..., z_{n-1})

p_i = e^{(z_i - m)} / ∑_j e^{(z_j - m)}

Why does this work? Because subtracting a constant from all logits does not change the softmax output. The proof is one line: e^{(z_i - m)} / ∑_j e^{(z_j - m)} = (e^{z_i} · e^{-m}) / (e^{-m} · ∑_j e^{z_j}) = e^{z_i} / ∑_j e^{z_j}. The e^{-m} cancels from numerator and denominator. The result is identical, but now the largest exponent is 0, so the largest term is e⁰ = 1, which never overflows.

Practically every production softmax implementation uses this trick. If you see max before exp in model code, this is what it is doing.
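A side-by-side comparison makes the failure and the fix concrete. The sketch below uses only Python's standard library; note that math.exp raises OverflowError instead of returning Inf, whereas array libraries such as NumPy typically return inf and then produce nan in the division. The function names are chosen for this example.

```python
import math

def naive_softmax(logits):
    """Direct formula: fails once a logit is large enough to overflow exp."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    """Same result mathematically, but the largest exponent is 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # every term is at most 1.0
    total = sum(exps)                          # at least 1.0, cannot overflow
    return [e / total for e in exps]

big = [1000.0, 999.0, 998.0]
try:
    naive_softmax(big)
except OverflowError as err:
    print("naive softmax failed:", err)

print([round(p, 3) for p in stable_softmax(big)])  # [0.665, 0.245, 0.09]
```

The stable result for [1000, 999, 998] matches what softmax would give for [2, 1, 0], since only the differences between logits matter.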

Softmax Appears Twice in the Model

You have seen softmax here at the output head, turning logits into next-token probabilities. In Module 4, you will see it again inside the model's layers: when tokens compare their queries against keys to decide "how much should I attend to each previous token?", softmax converts those raw comparison scores into weights that sum to 1. The same formula, the same properties (positive, sums to 1), the same numerical stability trick — just applied at a different point in the computation. When you reach that lesson, the math will already be familiar.

Shapes

Input logits: [|V|]

Output probabilities: [|V|] same shape, but all positive and sum to 1

In attention: input scores [n_tokens, n_tokens] → output weights [n_tokens, n_tokens] (softmax along the key axis)
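For the attention case, "softmax along the key axis" means each row of the score matrix is normalized independently. The sketch below uses nested lists and a made-up 3x3 score matrix; real attention typically also scales and masks the scores, which is left out here.

```python
import math

def softmax(row):
    """Numerically safe softmax over a single list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_over_keys(scores):
    """scores[q][k]: one row per query token, one column per key token."""
    return [softmax(row) for row in scores]

# Hypothetical score matrix for n_tokens = 3 (values are illustrative only).
scores = [
    [1.0, 0.0, 0.0],
    [0.5, 2.0, 0.1],
    [0.3, 0.3, 0.3],
]
weights = softmax_over_keys(scores)
print(len(weights), len(weights[0]))             # 3 3  (shape unchanged)
print([round(sum(row), 6) for row in weights])   # [1.0, 1.0, 1.0]
```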

Math

The full softmax formula, including temperature and the stability trick:

m = max_j(z_j)

p_i = exp((z_i - m) / T) / ∑_j exp((z_j - m) / T)

Properties: p_i > 0 for all i, ∑_i p_i = 1, T > 0 (T = 1 by default)
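To see that the combined formula is still the temperature-scaled softmax, factor the shift out of each exponential. The derivation assumes T > 0, so that (z_i - m)/T ≤ 0 for every i and the largest exponent is 0.

```latex
\begin{aligned}
p_i &= \frac{\exp\big((z_i - m)/T\big)}{\sum_j \exp\big((z_j - m)/T\big)} \\
    &= \frac{\exp(z_i/T)\,\exp(-m/T)}{\exp(-m/T)\,\sum_j \exp(z_j/T)} \\
    &= \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}
\end{aligned}
```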

Try It Live — Temperature on Real Output

The inspector above uses hand-tuned logits. The widget below runs the same temperature slider on real DistilGPT-2 output. Watch how the probability distribution sharpens at low temperature and flattens at high temperature — on actual model predictions, not synthetic numbers.

Implementation Hook

In llama.cpp, softmax is implemented via ggml_soft_max. The max-subtraction trick is built into this operator, so you do not need to apply it manually. Temperature scaling is handled separately in the sampling code, not in the model graph. The model always produces raw, unscaled logits (equivalent to T=1); temperature is applied by the runtime before sampling.

ggml/include/ggml.h — ggml_soft_max() (L1701)

Performance Hook

Softmax itself is cheap — three passes over the data (max, exp-and-sum, divide). For the output head, the vector has |V| elements (e.g., 128K), which is small compared to the projection that produced it. For attention, the softmax is over the key dimension (sequence length), and each head does its own softmax. With very long sequences, the attention softmax can become noticeable in profiles, but it is rarely the dominant cost.

Check Yourself

Q1 (math)

softmax([5.0, 5.0, 5.0]) produces what distribution?

[5.0, 5.0, 5.0]: softmax does not change equal inputs
[1/3, 1/3, 1/3]: equal logits produce a uniform distribution
[1.0, 0.0, 0.0]: softmax always picks the first element

Q2 (reasoning)

You lower the temperature from T=1.0 to T=0.1. What happens to the probability of the highest-logit token?

It stays the same: temperature does not affect which token is most probable
It increases sharply: dividing by 0.1 amplifies logit differences, making the top token dominate
It decreases: lower temperature means less confident

Q3 (reasoning)

Why do real softmax implementations subtract the max logit before exponentiating?

To change the output distribution to be centered at zero
To prevent numerical overflow: subtracting a constant does not change the result but keeps exponents small
To remove the effect of outlier logits on the distribution