L13

Argmax, Sampling, and Next-Token Choice

10 min

Question

How is the next token chosen?

Intuition

After softmax, you have a probability distribution over |V| tokens. Every token has a positive probability, and they sum to 1. Now something must choose one token from that distribution.

This choice is called the decoding strategy. It is not part of the model — it is a policy applied after the model runs. The model produces the same logits regardless of which decoding strategy is used. (The probabilities can change if temperature or filtering is applied, but the raw logits — the model's output — are identical.) This separation is fundamental: changing how you pick tokens changes the quality and diversity of the output, not the model's computation.

Argmax (Greedy Decoding)

The simplest strategy: always pick the token with the highest probability.

next_token = argmax_i(p_i)

Argmax is deterministic: the same input always produces the same output. There is no randomness. This makes it predictable and reproducible, which is useful for testing and debugging.

The downside: greedy decoding can produce repetitive, bland text. Because it always takes the locally highest-probability path, it can get stuck in loops ("the the the...") or choose boring continuations when a slightly less probable token would have led to a better overall sentence.
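As a minimal sketch of greedy decoding, the snippet below picks the index of the largest probability from a small, made-up distribution. The function name `greedy_pick` and the four-token vocabulary are illustrative assumptions, not part of any particular runtime.

```python
def greedy_pick(probs):
    """Return the index of the highest-probability token.

    Ties go to the lowest index, because max() keeps the first maximum it sees.
    """
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical 4-token vocabulary, for illustration only.
probs = [0.10, 0.65, 0.20, 0.05]

# Repeating the call never changes the answer: there is no randomness involved.
picks = [greedy_pick(probs) for _ in range(5)]
print(picks)  # [1, 1, 1, 1, 1]
```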

Sampling

Instead of always picking the top token, sampling randomly draws a token from the probability distribution. A token with probability 0.65 has a 65% chance of being picked; a token with probability 0.05 has a 5% chance.

Sampling is stochastic: the same input can produce different outputs on different runs. This is what makes LLM outputs feel creative and varied — each generation is a different random walk through the probability space.

Raw sampling from the full distribution can produce incoherent text, because very low-probability tokens (the "tail" of the distribution) occasionally get selected. This is where filtering strategies help.
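The sketch below draws from a small, made-up distribution many times and compares the observed frequencies with the probabilities. It uses only Python's standard `random` module. The name `sample_token`, the distribution, and the seed are assumptions chosen for illustration.

```python
import random
from collections import Counter

def sample_token(probs, rng):
    """Draw one index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.65, 0.20, 0.10, 0.05]   # hypothetical distribution
rng = random.Random(0)             # seeded only so the demo is repeatable

n_draws = 10_000
counts = Counter(sample_token(probs, rng) for _ in range(n_draws))
freqs = [counts[i] / n_draws for i in range(len(probs))]

# Frequencies land close to the probabilities, but the low-probability
# tail token (index 3) is still picked a few hundred times.
print(freqs)
```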

Filtering: Top-k and Top-p

In practice, most systems do not sample from the full distribution. They apply filters first:

Top-k sampling

Keep only the k most probable tokens; set all others to probability zero; renormalize so the remaining probabilities sum to 1; then sample. For example, with k=50, sampling draws from the top 50 tokens only. This prevents the model from picking wildly unlikely tokens that would break coherence.
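One way to sketch the three steps (keep, zero out, renormalize) in plain Python is shown below. The helper `top_k_filter` and the five-token distribution are hypothetical. Production runtimes typically work on logits and avoid a full sort.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, then renormalize."""
    # Indices sorted from most to least probable; keep the first k.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass   # renormalize so the kept entries sum to 1
    return filtered

probs = [0.05, 0.40, 0.10, 0.30, 0.15]      # hypothetical distribution
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])      # [0.0, 0.571, 0.0, 0.429, 0.0]
```

The filtered list can then be passed to any categorical sampler; tokens with probability zero can never be drawn.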

Top-p (nucleus) sampling

Instead of a fixed number of tokens, keep the smallest set of tokens whose cumulative probability is at least p. For example, with p=0.9, add tokens from most probable to least probable until their probabilities sum to at least 0.9, renormalize, then sample from only those tokens.

Top-p adapts to the shape of the distribution. If the model is very confident (one token has probability 0.95), top-p might keep just 1-2 tokens. If the model is uncertain (many tokens with similar probabilities), top-p might keep hundreds.
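The adaptive behavior can be illustrated with a short sketch. The helper `top_p_filter` and both distributions are made up for this example. The stopping rule is "cumulative mass is at least p", matching the definition above.

```python
def top_p_filter(probs, p):
    """Keep the smallest most-probable-first set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:          # stop as soon as the threshold is reached
            break
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / cumulative
    return filtered

def kept_count(filtered):
    return sum(1 for x in filtered if x > 0)

confident = [0.95, 0.02, 0.01, 0.01, 0.01]   # hypothetical: one dominant token
uncertain = [0.22, 0.21, 0.20, 0.19, 0.18]   # hypothetical: nearly flat

print(kept_count(top_p_filter(confident, 0.9)))  # 1
print(kept_count(top_p_filter(uncertain, 0.9)))  # 5
```

Because the comparison uses floating-point sums, a cumulative total that lands exactly on p can fall on either side of the threshold; real implementations have to choose how to handle that edge case.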

Top-k and top-p can be combined (apply both filters). These filters, together with temperature (from the softmax lesson), give the runtime fine-grained control over the diversity-quality tradeoff without changing the model at all.

Worked Example

After "The cat sat", softmax produced these probabilities (showing top 6 of 128,000):

Token               Probability
" on"               0.52
" in"               0.18
" down"             0.11
" on the"           0.07
" and"              0.04
" there"            0.02
... 127,994 more    0.06 total

Argmax: Always picks " on"

Top-k=3: Samples from (" on": 0.64, " in": 0.22, " down": 0.14) (renormalized)

Top-p=0.9: Samples from (" on", " in", " down", " on the", " and") — cumulative = 0.52+0.18+0.11+0.07+0.04 = 0.92 ≥ 0.9

Notice how top-k with k=3 always keeps exactly 3 tokens regardless of distribution shape, while top-p adapts: here it keeps 5 because the top 4 only sum to 0.88, which is still below 0.9.
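The renormalized values in the example come from dividing each kept probability by the total kept mass. For top-k=3 the kept mass is 0.52 + 0.18 + 0.11 = 0.81; for top-p=0.9 it is 0.92. Spelled out, with results rounded:

```latex
\[
\text{top-}k = 3:\quad
\frac{0.52}{0.81} \approx 0.64,\qquad
\frac{0.18}{0.81} \approx 0.22,\qquad
\frac{0.11}{0.81} \approx 0.14
\]
\[
\text{top-}p = 0.9:\quad
\frac{0.52}{0.92} \approx 0.565,\quad
\frac{0.18}{0.92} \approx 0.196,\quad
\frac{0.11}{0.92} \approx 0.120,\quad
\frac{0.07}{0.92} \approx 0.076,\quad
\frac{0.04}{0.92} \approx 0.043
\]
```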

The Full Decoding Toolkit

Putting it all together, a typical generation runtime applies these controls in order:

1. Model produces logits: fixed, deterministic, the same regardless of decoding settings.
2. Temperature scales the logits: z_i / T.
3. Softmax converts scaled logits to probabilities.
4. Top-k / top-p filtering removes unlikely tokens.
5. Sampling or argmax picks the final token.

All of steps 2-5 are runtime policy. The model only does step 1. This is why you can serve the same model with different temperature, top-k, and top-p settings for different use cases — coding assistance might use low temperature (more deterministic), while creative writing might use high temperature (more varied).
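Steps 2-5 can be sketched as a single function that takes logits and returns one token ID. This is an illustrative sketch, not any runtime's actual implementation. The name `decode_step`, the convention that temperature 0 means greedy, applying top-k before top-p, and measuring top-p against the renormalized top-k mass are all assumptions made here. Real runtimes may order or define these steps differently.

```python
import math
import random

def decode_step(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """One decoding step: temperature -> softmax -> top-k -> top-p -> pick."""
    if temperature == 0:                       # assumed convention: T=0 means argmax
        return max(range(len(logits)), key=lambda i: logits[i])

    scaled = [z / temperature for z in logits]              # step 2
    m = max(scaled)                                         # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]                       # step 3

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:                                   # step 4a: keep k best
        order = order[:top_k]
    if top_p is not None:                                   # step 4b: keep nucleus
        kept_mass = sum(probs[i] for i in order)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / kept_mass
            if cumulative >= top_p:
                break
        order = kept

    weights = [probs[i] for i in order]     # choices() treats weights as relative
    return rng.choices(order, weights=weights, k=1)[0]      # step 5

logits = [2.0, 1.0, 0.5, -1.0]              # hypothetical model output, |V| = 4
print(decode_step(logits, temperature=0))   # 0 (greedy)
print(decode_step(logits, temperature=0.7, top_k=3, top_p=0.9,
                  rng=random.Random(0)))    # one of 0, 1, 2
```

Note that `logits` is never modified: every setting changes only what happens after the model has produced its output.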

Shapes

Input to decoding: probabilities [|V|] (or logits [|V|] before softmax)

Output of decoding: one token ID (a single integer)

The decoding step reduces |V| probabilities to 1 choice.

Math

Argmax: next = arg max_i p_i

Sampling: next ~ Categorical(p_0, p_1, ..., p_{|V|-1})

Top-k: keep k largest p_i, zero out the rest, renormalize, then sample

Top-p: keep smallest set S where ∑_{i ∈ S} p_i ≥ p, renormalize, then sample

Implementation Hook

In llama.cpp, decoding happens entirely outside the model computation graph. The graph ends at logits. The sampling module then applies temperature, top-k, top-p, and other filters before selecting a token. This clean separation means you can change sampling behavior without rebuilding the model graph.

In the wild: Common CLI flags like --temp 0.7, --top-k 40, and --top-p 0.9 all control this post-model decoding pipeline. They never touch the model's weights or computation.

tools/cli/cli.cpp — sampling loop

Performance Hook

Sampling itself is trivially cheap — a sort (for top-k/top-p) and a random draw. The runtime cost of generating a token is dominated by the model forward pass (attention + FFN + output head), not by the decoding strategy. The performance-relevant insight is that changing temperature or top-k does not change the cost of running the model — it only changes which token gets appended for the next step.
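A rough back-of-envelope comparison makes the gap concrete. The numbers below are assumptions for illustration: the vocabulary size matches the worked example, the hidden size `d_model` is a made-up value, and the operation counts are coarse estimates rather than measurements.

```python
import math

vocab_size = 128_000   # matches the worked example
d_model = 4096         # assumed hidden size, not taken from any specific model

# Output head alone: a [d_model] vector times a [d_model, vocab_size] matrix,
# counted as one multiply and one add per weight.
head_flops = 2 * d_model * vocab_size

# A full sort of the vocabulary, counted as roughly |V| * log2(|V|) comparisons.
sort_ops = vocab_size * math.log2(vocab_size)

print(f"output head: ~{head_flops:.2e} FLOPs")
print(f"full sort:   ~{sort_ops:.2e} comparisons")
print(f"ratio:       ~{head_flops / sort_ops:.0f}x")
```

Under these assumptions the output head alone costs hundreds of times more than the sort, and the attention and FFN layers add far more on top of that.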

Try It Live — Full Generation Pipeline

Everything from this lesson and the previous two in action. Load the model and run the full pipeline: tokenize, forward pass, logits, softmax with temperature, and autoregressive generation — all on real model weights in your browser.

Check Yourself

Q1 (reasoning)

You run the same model on the same prompt twice: once with temperature 0.7, once with temperature 1.5. Are the logits different?

Yes: different temperatures cause the model to compute differently
No: the model produces identical logits; temperature is applied afterward
It depends on the model architecture

Q2 (math)

The top-5 token probabilities are [0.40, 0.25, 0.15, 0.10, 0.05]. With top-p=0.9, how many tokens are in the candidate set?

3 tokens (cumulative: 0.40 + 0.25 + 0.15 = 0.80 < 0.9, need one more)
4 tokens (cumulative: 0.40 + 0.25 + 0.15 + 0.10 = 0.90 ≥ 0.9)
5 tokens (always include all that have nonzero probability)