Glossary

Quick reference for all terms introduced in the guided path.

Orientation

LLM
Large Language Model. A neural network trained to predict the next token in a sequence.
Inference
Running a trained model to produce predictions. Distinct from training.
Autoregressive generation
Producing output by predicting one token at a time, appending it, and repeating.
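The loop described above can be sketched in a few lines. The `model` function here is a stand-in (a real LLM would return logits over the vocabulary, as covered in the Probability section); only the predict-append-repeat structure is the point.

```python
def model(tokens):
    # placeholder "model": a real LLM predicts the next token from
    # logits over the vocabulary; this one just adds 1 to the last token
    return tokens[-1] + 1

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        next_token = model(tokens)   # predict one token...
        tokens.append(next_token)    # ...append it, and repeat
    return tokens

print(generate([5, 6, 7], 3))  # [5, 6, 7, 8, 9, 10]
```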

Discrete Input

Token
A piece of text from the model's vocabulary. Can be a word, subword, character, or punctuation.
Token ID
The integer index of a token in the vocabulary table.
Vocabulary (V)
The finite set of all tokens the model knows. Size often written |V|.
Tokenizer
The algorithm that splits raw text into token pieces and maps them to IDs.
Context length
Maximum number of tokens the model can process in one pass. Prompt + output share this budget.
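A toy sketch of the text → token IDs → text round trip. Real tokenizers (BPE and friends) are far more involved; this one just splits on whitespace against a made-up five-entry vocabulary, with ID 0 as the unknown token.

```python
# toy vocabulary; |V| = 5
vocab = ["<unk>", "the", "cat", "sat", "."]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(text):
    # map each piece to its integer token ID (0 = <unk> if not in vocab)
    return [token_to_id.get(tok, 0) for tok in text.split()]

def decode(ids):
    # reverse lookup: token IDs back to text
    return " ".join(vocab[i] for i in ids)

ids = encode("the cat sat .")
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # "the cat sat ."
```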

Linear Algebra

Scalar
A single number.
Vector
An ordered list of numbers. Dimension = how many numbers.
Embedding
A dense vector representation of a token, obtained by looking up the token ID in a learned table.
Hidden state
The current vector representation of a token at any point in the model. Changes layer by layer.
d_model
The dimension of hidden states in the model. A core hyperparameter.
Dot product
Multiply corresponding elements, sum the results. Measures similarity between two vectors.
Matrix multiplication
[m, k] × [k, n] → [m, n]. Many dot products organized into a grid.
Linear projection
A matrix multiply with a learned weight matrix, often used to change vector dimension.
Tensor shape
The list of dimensions of a multi-dimensional array. E.g. [n_tokens, d_model].
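The dot product, matrix multiplication, and tensor shapes above on toy sizes, using NumPy. The sizes (`n_tokens=3`, `d_model=4`) are made up for illustration.

```python
import numpy as np

# dot product: multiply corresponding elements, sum the results
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a @ b)  # 32.0 = 1*4 + 2*5 + 3*6

# linear projection: [n_tokens, d_model] × [d_model, d_out] → [n_tokens, d_out]
n_tokens, d_model, d_out = 3, 4, 2
hidden = np.ones((n_tokens, d_model))  # hidden states, shape [3, 4]
W = np.ones((d_model, d_out))          # learned weights (here: all ones)
projected = hidden @ W                 # [3, 4] × [4, 2] → [3, 2]
print(projected.shape)  # (3, 2)
```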

Probability

Logits
Raw, unbounded scores output by the model — one per vocabulary entry.
Output head
The final projection layer that maps the last hidden state to logits over the vocabulary.
Softmax
Converts logits to a probability distribution: all positive, sums to 1. Formula: p_i = exp(z_i) / ∑ exp(z_j).
Argmax
Deterministic decoding: always pick the highest-probability token.
Sampling
Stochastic decoding: randomly pick a token according to the probability distribution.
Temperature
A scaling factor applied to logits before softmax. Lower = sharper, higher = more uniform.
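The logits → softmax → pick pipeline, with temperature applied as a divisor on the logits. The logit values are made up; the max-subtraction inside `softmax` is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()        # all positive, sums to 1

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p.sum())            # 1.0
print(int(np.argmax(p)))  # 0 — greedy: always the highest-probability token

# temperature: divide logits before softmax
sharp = softmax(logits / 0.5)     # lower T → sharper distribution
flat = softmax(logits / 2.0)      # higher T → closer to uniform
print(sharp[0] > p[0] > flat[0])  # True
```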

Transformer Block

Transformer layer
One complete block: norm → attention → residual → norm → FFN → residual.
Residual connection
Adding the input of a sublayer back to its output. Preserves information.
RMSNorm
Root Mean Square normalization. Stabilizes hidden state magnitudes between sublayers.
Q (Query)
A learned projection of hidden states that represents "what am I looking for?"
K (Key)
A learned projection of hidden states that represents "what do I contain?"
V (Value)
A learned projection of hidden states that represents "what information do I carry?"
Attention score
The dot product of a Q and a K vector, typically scaled by √d_head. Measures how much one token should attend to another.
Causal mask
Prevents tokens from attending to future tokens. Enforces left-to-right generation.
Multi-head attention
Splitting Q/K/V into multiple heads, each attending in a smaller subspace, then merging results.
FFN (Feed-Forward Network)
Per-token transformation: project up, activate, project down. Does not mix tokens.
RoPE
Rotary Position Embedding. Injects positional information into Q and K vectors.
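Putting the attention pieces together: a single-head scaled dot-product attention with a causal mask, on random toy tensors. The learned Q/K/V projections are skipped; Q, K, and V are just sampled directly, and there is no RoPE.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_head = 4, 8
Q = rng.standard_normal((n_tokens, d_head))
K = rng.standard_normal((n_tokens, d_head))
V = rng.standard_normal((n_tokens, d_head))

scores = Q @ K.T / np.sqrt(d_head)  # attention scores, [n_tokens, n_tokens]

# causal mask: set scores for future positions to -inf
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[future] = -np.inf

# softmax over each row turns scores into attention weights
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V  # weighted sum of values, [n_tokens, d_head]
print(weights[0])  # first token can only attend to itself: [1. 0. 0. 0.]
```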

Advanced

GQA
Grouped-Query Attention. Fewer K/V heads than Q heads, saving memory and bandwidth.
MQA
Multi-Query Attention. A single K/V head shared across all Q heads.
SWA
Sliding-Window Attention. Each layer attends only to a local window of tokens.
MoE
Mixture of Experts. Replaces the FFN with multiple expert FFNs, only a few activated per token.
Router
In MoE, the network that scores which experts to activate for each token.
KV cache
Stored K/V tensors from previous tokens, reused during decode to avoid recomputation.
Prefill
Processing the full prompt through the model. Involves large matrix multiplies.
Decode
Generating tokens one at a time, reusing cached K/V from previous steps.
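A minimal sketch of the decode loop with a KV cache: each step computes K/V only for the new token, appends them to the cache, and attends over everything cached so far. Single head, random toy vectors, no projections.

```python
import numpy as np

d_head = 4
rng = np.random.default_rng(0)

k_cache, v_cache = [], []  # grows by one entry per decoded token

def attend(q, k_new, v_new):
    # store this step's K/V, then attend over the whole cache —
    # earlier tokens' K/V are reused, not recomputed
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = np.stack(k_cache)                 # [n_cached, d_head]
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d_head)      # [n_cached]
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                          # [d_head]

for step in range(3):  # three decode steps
    q, k, v = rng.standard_normal((3, d_head))
    out = attend(q, k, v)
    print(step, len(k_cache))  # cache length grows: 1, 2, 3
```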
Ubatch
Microbatch / physical batch. Operational setting that affects execution without changing math.
Perplexity
A metric of how surprised the model is by the data. Lower = model predicts better.
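Concretely, perplexity is the exponential of the mean negative log-likelihood the model assigns to the true tokens. The probabilities below are made up for illustration.

```python
import math

# probabilities the model assigned to the tokens that actually occurred
probs_of_true_tokens = [0.5, 0.25, 0.8]

# mean negative log-likelihood, then exponentiate
nll = -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)
ppl = math.exp(nll)
print(round(ppl, 3))  # 2.154 — as if choosing among ~2.15 equally likely tokens
```

A model that assigned probability 1.0 to every true token would have perplexity 1, the minimum.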
Quantization
Reducing numerical precision of weights (e.g. FP16 → Q8_0) to save memory and speed up inference.
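A sketch of symmetric int8 quantization: scale weights so the largest fits in [-127, 127], round to integers, and dequantize back. Real schemes such as Q8_0 do this per block of weights, each block with its own scale; this example uses one scale for simplicity.

```python
import numpy as np

w = np.array([0.02, -0.5, 1.3, -1.29], dtype=np.float32)  # toy weights

scale = np.abs(w).max() / 127.0            # one float scale per group
q = np.round(w / scale).astype(np.int8)    # stored as 1 byte per weight
w_hat = q.astype(np.float32) * scale       # dequantized approximation

print(q)
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)  # True: error ≤ half a step
```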
Repack
Rearranging weight memory layout for faster kernel access patterns.