
Glossary

Quick reference for all terms introduced in the guided path.

Orientation

LLM
Large Language Model. A neural network trained to predict the next token in a sequence.
Inference
Running a trained model to produce predictions. Distinct from training.
Autoregressive generation
Producing output by predicting one token at a time, appending it, and repeating.
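
As a minimal sketch of the loop described above, the following uses a hypothetical `next_token` function in place of a real model (it simply returns the previous ID plus one). The names `generate`, `max_new_tokens`, and `eos_id` are illustrative assumptions, not part of the course material.

```python
def next_token(tokens):
    """Hypothetical stand-in for a model: returns the previous ID plus one."""
    return tokens[-1] + 1


def generate(prompt_ids, max_new_tokens, eos_id=None):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        new_id = next_token(tokens)  # one inference step over the sequence so far
        tokens.append(new_id)
        if new_id == eos_id:  # optional stop token ends generation early
            break
    return tokens


print(generate([5, 6], max_new_tokens=3))  # [5, 6, 7, 8, 9]
```

A real model would replace `next_token` with a forward pass followed by a decoding rule such as argmax or sampling.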

Discrete Input

Token
A piece of text from the model's vocabulary. Can be a word, subword, character, or punctuation.
Token ID
The integer index of a token in the vocabulary table.
Vocabulary (V)
The finite set of all tokens the model knows. Size often written |V|.
Tokenizer
The algorithm that splits raw text into token pieces and maps them to IDs.
Context length
Maximum number of tokens the model can process in one pass. Prompt + output share this budget.
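
The relationship between tokens, token IDs, the vocabulary, and the context budget can be sketched with a toy example. This is a deliberate simplification: the vocabulary, the whitespace splitting, and the `CONTEXT_LENGTH` value are all assumptions made for illustration, whereas real tokenizers split text into learned subword pieces.

```python
# Hypothetical toy vocabulary; a token's ID is its index in this table.
VOCAB = ["<unk>", "the", "cat", "sat", "."]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
CONTEXT_LENGTH = 8  # assumed budget shared by prompt and output


def tokenize(text):
    """Split on whitespace and map each piece to its token ID (unknown -> 0)."""
    return [TOKEN_TO_ID.get(piece, 0) for piece in text.lower().split()]


ids = tokenize("The cat sat .")
room_left = CONTEXT_LENGTH - len(ids)  # tokens still available for output
print(ids, room_left)  # [1, 2, 3, 4] 4
```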

Linear Algebra

Scalar
A single number.
Vector
An ordered list of numbers. Dimension = how many numbers.
Embedding
A dense vector representation of a token, obtained by looking up the token ID in a learned table.
Hidden state
The current vector representation of a token at any point in the model. Changes layer by layer.
d_model
The dimension of hidden states in the model. A core hyperparameter.
Dot product
Multiply corresponding elements and sum the results. A larger value means the two vectors point in more similar directions (it also grows with their lengths).
Matrix multiplication
[m, k] × [k, n] → [m, n]. Many dot products organized into a grid.
Linear projection
A matrix multiply with a learned weight matrix that maps a vector into a new space, often changing its dimension.
Tensor shape
The list of dimensions of a multi-dimensional array. E.g. [n_tokens, d_model].
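
The dot product, matrix multiplication, and shape entries above can be tied together in a few lines of plain Python. The helper names and the example numbers are made up for illustration; a real model would use a tensor library rather than nested lists.

```python
def dot(a, b):
    """Multiply corresponding elements and sum the results."""
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))


def matmul(A, B):
    """[m, k] x [k, n] -> [m, n]: entry (i, j) is dot(row i of A, column j of B)."""
    assert all(len(row) == len(B) for row in A)
    cols = list(zip(*B))  # columns of B
    return [[dot(row, col) for col in cols] for row in A]


def shape(M):
    return [len(M), len(M[0])]


hidden = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]  # [n_tokens=2, d_model=3]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # [3, 2] projection weights (made up)
out = matmul(hidden, W)                      # linear projection from 3 to 2 dimensions
print(shape(out), out)  # [2, 2] [[4.0, 5.0], [0.0, 1.0]]
```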

Probability

Logits
Raw, unbounded scores output by the model — one per vocabulary entry.
Output head
The final projection layer that maps the last hidden state to logits over the vocabulary.
Softmax
Converts logits to a probability distribution: every value is positive and the values sum to 1. Formula: p_i = exp(z_i) / ∑_j exp(z_j).
Argmax
Deterministic decoding: always pick the highest-probability token.
Sampling
Stochastic decoding: randomly pick a token according to the probability distribution.
Temperature
A positive scaling factor T; logits are divided by T before softmax. Lower = sharper, higher = more uniform.
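
The decoding terms in this section can be sketched together as follows. This is a minimal illustration using only the standard library; the function names and the example logits are assumptions, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(probs):
    """Deterministic decoding: index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample(probs, rng):
    """Stochastic decoding: draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]               # made-up scores for a 3-token vocabulary
base = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # more mass on the top token
flat = softmax(logits, temperature=2.0)   # closer to uniform
print(argmax(base), sample(base, random.Random(0)))
```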

Transformer Block

Transformer layer
One complete block: norm → attention → residual → norm → FFN → residual.
Residual connection
Adding the input of a sublayer back to its output. Preserves information.
RMSNorm
Root Mean Square normalization. Stabilizes hidden state magnitudes between sublayers.
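
A minimal sketch of RMSNorm on a single hidden-state vector, assuming the common form with a learned per-dimension `weight` and a small `eps` for numerical safety (both names and the `eps` value are assumptions, not taken from this glossary):

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-dimension scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


x = [3.0, 4.0]
y = rms_norm(x, weight=[1.0, 1.0])
rms_after = math.sqrt(sum(v * v for v in y) / len(y))  # close to 1 after normalization
print(y, rms_after)
```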
Q (Query)
A learned projection of hidden states that represents "what am I looking for?"
K (Key)
A learned projection of hidden states that represents "what do I contain?"
V (Value)
A learned projection of hidden states that represents "what information do I carry?"
Attention score
The dot product of one token's Q vector with another token's K vector, typically scaled by 1/√d_head. Measures how much the first token should attend to the second.
Causal mask
Prevents tokens from attending to future tokens. Enforces left-to-right generation.
Multi-head attention
Splitting Q/K/V into multiple heads, each attending in a smaller subspace, then merging results.
FFN (Feed-Forward Network)
Per-token transformation: project up, activate, project down. Does not mix tokens.
RoPE
Rotary Position Embedding. Injects positional information into Q and K vectors.
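
The attention score, causal mask, and V entries above combine as in the sketch below: single-head attention over plain lists, with the mask implemented by only scoring positions up to the current token. It assumes Q, K, and V have already been projected, omits RoPE and multi-head splitting, and uses made-up inputs.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention; Q, K, V are lists of per-token vectors."""
    d_head = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Causal mask: token i only scores tokens 0..i, never future ones.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_head)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        out.append([
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
result = causal_attention(Q, K, V)
print(result)
```

The first token can only see itself, so its output equals its own value vector; the second token mixes both values, weighted toward the key it matches best.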

Advanced

GQA
Grouped-Query Attention. Fewer K/V heads than Q heads, saving memory and bandwidth.
MQA
Multi-Query Attention. A single K/V head shared across all Q heads.
SWA
Sliding-Window Attention. Each token attends only to a fixed-size local window of nearby preceding tokens rather than the whole context.
MoE
Mixture of Experts. Replaces the FFN with multiple expert FFNs, only a few activated per token.
Router
In MoE, the network that scores which experts to activate for each token.
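
A minimal sketch of top-k routing for one token, assuming the router has already produced one logit per expert. Models differ on details such as whether softmax is applied before or after selecting the top k, so the ordering here (softmax, select, renormalize) and the names `route` and `top_k` are assumptions.

```python
import math


def route(router_logits, top_k=2):
    """Pick the top_k experts and renormalize their softmax weights to sum to 1."""
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Made-up router scores for 4 experts; only 2 expert FFNs would run for this token.
selection = route([0.1, 2.0, -1.0, 1.0], top_k=2)
print(selection)
```

The token's FFN output would then be the weighted sum of the selected experts' outputs.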
Shared-KV layers
Layers that skip K/V projections and reuse K/V from an earlier layer's cache.
KV cache
Stored K/V tensors from previous tokens, reused during decode to avoid recomputation.
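
To give a feel for why GQA and MQA matter, the sketch below estimates KV cache size from the shapes involved: two tensors (K and V) per layer, each holding n_kv_heads × d_head values per token. The configuration numbers are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_value=2):
    """Size of the cache: K and V per layer, each [n_tokens, n_kv_heads * d_head]."""
    return 2 * n_layers * n_kv_heads * d_head * n_tokens * bytes_per_value


# Hypothetical 32-layer model, FP16 cache (2 bytes per value), 4096 cached tokens.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, n_tokens=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, n_tokens=4096)
print(mha / 2**30, gqa / 2**30)  # 2.0 0.5 (GiB)
```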
Continuous batching
Server scheduling that inserts new requests into a running batch as slots free up, instead of waiting for all requests to finish.
Prefill
Processing the full prompt through the model. Involves large matrix multiplies.
Decode
Generating tokens one at a time, reusing cached K/V from previous steps.
Ubatch
Microbatch / physical batch. The number of tokens processed in one physical compute pass. An operational setting that affects performance and memory use without changing the math.
GEMM
General Matrix Multiply. Matrix × matrix. Dominates prefill compute.
GEMV
General Matrix-Vector Multiply. Matrix × vector. Dominates single-sequence decode, which is memory-bound because every weight is read from memory to process just one token.
Arithmetic intensity
FLOPs per byte of memory traffic. Determines whether an operation is compute-bound or memory-bound.
Machine balance
Peak FLOP/s ÷ peak bytes/s for a given hardware. The threshold between compute-bound and memory-bound.
Perplexity
A metric of how surprised the model is by the data: the exponential of the average per-token negative log-likelihood. Lower = model predicts better. Typical LLM scores: 5–8 on wikitext-2.
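
Perplexity can be computed from the probabilities a model assigned to the tokens that actually occurred. The sketch below assumes those probabilities are already available as a list; the input values are made up.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-likelihood) over the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always gives the correct token probability 0.25 is as
# uncertain as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```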
Quantization
Reducing numerical precision of weights (e.g. FP16 → Q8_0) to save memory and speed up inference.
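
A simplified sketch of block-wise 8-bit quantization: a block of weights shares one scale, and each weight is stored as a small integer. This illustrates the general idea only; the exact Q8_0 layout (block size, how the scale is stored) is not described in this glossary, so the details below are assumptions.

```python
def quantize_block(values):
    """Symmetric 8-bit quantization of one block with a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    q = [round(v / scale) for v in values]  # integers in [-127, 127]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate weights from the integers and the shared scale."""
    return [scale * x for x in q]


weights = [0.5, -1.0, 0.25, 0.0]  # made-up block of weights
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The rounding error per weight is at most half the scale, which is why larger outliers in a block reduce precision for every other weight in it.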
Repack
Rearranging weight memory layout for faster kernel access patterns.
Roofline model
Performance analysis tool that plots achievable FLOP/s against arithmetic intensity. Divides workloads into memory-bound (arithmetic intensity below the ridge point) and compute-bound (arithmetic intensity above the ridge point) regimes. The ridge point sits at the machine balance.
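
The arithmetic intensity, machine balance, and roofline entries can be connected with a small calculation. The hardware numbers below are hypothetical, and the byte counts assume each operand is read or written exactly once at 2 bytes per value, which is an idealization.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * peak_bandwidth)


def gemv_intensity(m, k, bytes_per_value=2):
    """[m, k] matrix times a k-vector: FLOPs per byte moved."""
    flops = 2 * m * k
    bytes_moved = bytes_per_value * (m * k + k + m)
    return flops / bytes_moved


def gemm_intensity(m, k, n, bytes_per_value=2):
    """[m, k] times [k, n]: FLOPs per byte moved."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_value * (m * k + k * n + m * n)
    return flops / bytes_moved


PEAK_FLOPS = 100e12  # hypothetical: 100 TFLOP/s
PEAK_BW = 1e12       # hypothetical: 1 TB/s
balance = PEAK_FLOPS / PEAK_BW  # ridge point, in FLOPs per byte

decode = gemv_intensity(4096, 4096)        # about 1 FLOP/byte: memory-bound
prefill = gemm_intensity(4096, 4096, 512)  # about 410 FLOPs/byte: compute-bound
print(balance, decode, prefill)
```

With these assumed numbers, the decode-style GEMV can reach only about 1% of peak FLOP/s, while the prefill-style GEMM can reach the compute roof.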
TTFT
Time-To-First-Token. The delay between sending a prompt and receiving the first output token. Dominated by prefill time, plus any queueing delay on a busy server.
Prefill interference
When a large prefill operation stalls decode steps for other users in a serving scenario. Mitigated by chunking prefills.
Expert collapse
In MoE, when training causes most tokens to route to the same few experts while others are underused. Mitigated by an auxiliary load-balancing loss.
Thread affinity
Pinning each thread to a specific CPU core to avoid cache eviction from thread migration.
BPE
Byte Pair Encoding. A widely used tokenization algorithm. Starting from bytes or characters, it repeatedly merges the most frequent adjacent pair of symbols into a new, longer token.
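
One merge step of BPE can be sketched as follows. This toy version works on the characters of a single string; real implementations typically operate on bytes, count pairs across a whole training corpus, and repeat the step until the vocabulary reaches a target size. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


symbols = list("aaabdaaabac")
pair = most_frequent_pair(symbols)   # ('a', 'a')
symbols = merge_pair(symbols, pair)  # 'aa' becomes a single new token
print(pair, symbols)
```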
Weight tying
Sharing the same weight matrix for the embedding table and the output head. Saves parameters; used in models like Gemma.
d_model
The hidden dimension — the width of the residual stream. Every token is a d_model-dimensional vector throughout the model.
d_head
The per-head dimension. Typically d_head = d_model / n_heads, though some models set it independently. Each attention head operates in this subspace.
d_ff
The FFN intermediate dimension, typically 3–4× d_model. The FFN expands to d_ff then contracts back to d_model.
SwiGLU / GELU / SiLU
Activation functions used in FFN layers. SiLU (also called Swish) is x · sigmoid(x). SwiGLU is a gated variant built on SiLU, commonly used in modern LLMs such as Llama. GELU is used in GPT-2 and DistilGPT-2.
Pre-norm
Applying normalization (RMSNorm in the models covered here) before each sub-layer (attention, FFN) rather than after. Keeps the residual path unnormalized, which gives cleaner gradient flow in deep networks. Used by virtually all modern LLMs.