Glossary

Quick reference for all terms introduced in the guided path.

Orientation

LLM: Large Language Model. A neural network trained to predict the next token in a sequence.
Inference: Running a trained model to produce predictions. Distinct from training.
Autoregressive generation: Producing output by predicting one token at a time, appending it, and repeating.
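
The loop described under "Autoregressive generation" can be sketched in a few lines. This is a minimal illustration, not how any particular runtime implements it: `generate`, `toy_next_token`, and `eos_id` are hypothetical names, and the toy function stands in for a real model's inference step.

```python
def generate(next_token, prompt_ids, max_new_tokens, eos_id=None):
    """Autoregressive loop: predict one token, append it, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)  # the model sees everything produced so far
        ids.append(tok)
        if tok == eos_id:      # optional stop token (hypothetical)
            break
    return ids


def toy_next_token(ids):
    """Hypothetical stand-in for a model: next ID is (last ID + 1) mod 5."""
    return (ids[-1] + 1) % 5


out = generate(toy_next_token, [0, 1], max_new_tokens=4)
stopped = generate(toy_next_token, [0, 1], max_new_tokens=10, eos_id=4)
```

Each call to `next_token` corresponds to one inference step; no training happens inside the loop.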

Discrete Input

Token: A piece of text from the model's vocabulary. Can be a word, subword, character, or punctuation.
Token ID: The integer index of a token in the vocabulary table.
Vocabulary (V): The finite set of all tokens the model knows. Size often written |V|.
Tokenizer: The algorithm that splits raw text into token pieces and maps them to IDs.
Context length: Maximum number of tokens the model can process in one pass. Prompt + output share this budget.
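
To make the relationship between tokens, token IDs, the vocabulary, and the context budget concrete, here is a toy sketch. The vocabulary, the greedy longest-match rule, and `CONTEXT_LENGTH` are all assumptions chosen for illustration; real tokenizers use learned rules (for example BPE merges) and far larger vocabularies.

```python
# Hypothetical vocabulary; the index in this list is the token ID.
VOCAB = ["<unk>", "un", "believ", "able", " ", "a", "b", "l", "e"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
CONTEXT_LENGTH = 8  # hypothetical budget shared by prompt and output


def tokenize(text):
    """Greedy longest-match split into vocabulary pieces, mapped to IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            # No piece matched: emit the unknown token and skip one character.
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids


def detokenize(ids):
    """Map IDs back to their text pieces and join them."""
    return "".join(VOCAB[i] for i in ids)


prompt_ids = tokenize("unbelievable")
max_new_tokens = CONTEXT_LENGTH - len(prompt_ids)  # what is left for output
```

Here |V| is `len(VOCAB)`, and a longer prompt leaves fewer tokens for the output.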

Linear Algebra

Scalar: A single number.
Vector: An ordered list of numbers. Dimension = how many numbers.
Embedding: A dense vector representation of a token, obtained by looking up the token ID in a learned table.
Hidden state: The current vector representation of a token at any point in the model. Changes layer by layer.
d_model: The dimension of hidden states in the model. A core hyperparameter.
Dot product: Multiply corresponding elements, sum the results. Measures similarity between two vectors.
Matrix multiplication: [m, k] × [k, n] → [m, n]. Many dot products organized into a grid.
Linear projection: Multiplying a vector by a learned weight matrix. The output dimension may differ from the input dimension.
Tensor shape: The list of dimensions of a multi-dimensional array. E.g. [n_tokens, d_model].
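
The terms in this section fit together as follows: token IDs select rows of an embedding table, and a linear projection is a matrix multiply built from dot products. The sketch below uses plain Python lists; the table values, the weight matrix `W`, and the helper names are made up for illustration.

```python
def dot(u, v):
    """Multiply corresponding elements and sum the results."""
    return sum(a * b for a, b in zip(u, v))


def matmul(A, B):
    """[m, k] x [k, n] -> [m, n]; entry (i, j) is dot(row i of A, column j of B)."""
    k = len(B)
    assert all(len(row) == k for row in A), "inner dimensions must match"
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]


def shape(M):
    """Tensor shape of a 2-D list of lists."""
    return [len(M), len(M[0])]


# Hypothetical embedding table with |V| = 4 and d_model = 3.
embedding_table = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 2.0, 0.0],
]
token_ids = [1, 3]

# Embedding lookup: one row per token, giving shape [n_tokens, d_model].
hidden = [embedding_table[t] for t in token_ids]

# Linear projection from d_model = 3 down to 2 dimensions.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape [3, 2]
projected = matmul(hidden, W)             # shape [2, 2]
```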

Probability

Logits: Raw, unbounded scores output by the model — one per vocabulary entry.
Output head: The final projection layer that maps the last hidden state to logits over the vocabulary.
Softmax: Converts logits to a probability distribution: all entries positive, summing to 1. Formula: p_i = exp(z_i) / ∑_j exp(z_j).
Argmax: The index of the largest value. Used for greedy (deterministic) decoding: always pick the highest-probability token.
Sampling: Stochastic decoding: randomly pick a token according to the probability distribution.
Temperature: A positive value T by which logits are divided before softmax. T < 1 sharpens the distribution; T > 1 flattens it toward uniform.
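
The decoding terms above can be tried out on a three-entry toy vocabulary. This is a sketch using only the standard library; the logit values are arbitrary, and subtracting the maximum before `exp` is a common numerical-stability trick that does not change the result.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # shift by the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (greedy decoding)."""
    return max(range(len(values)), key=lambda i: values[i])


def sample(probs, rng):
    """Draw one index at random according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]                    # one raw score per vocabulary entry
p = softmax(logits)
p_sharp = softmax(logits, temperature=0.5)  # lower T: more peaked
p_flat = softmax(logits, temperature=2.0)   # higher T: closer to uniform

greedy_id = argmax(p)
sampled_id = sample(p, random.Random(0))    # seeded for repeatability
```

Greedy decoding always returns the same token for the same logits, while sampling can return any token with nonzero probability.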

Transformer Block

Transformer layer: One complete block: norm → attention → residual → norm → FFN → residual.
Residual connection: Adding the input of a sublayer back to its output. Preserves information.
RMSNorm: Root Mean Square normalization. Stabilizes hidden state magnitudes between sublayers.
Q (Query): A learned projection of hidden states that represents "what am I looking for?"
K (Key): A learned projection of hidden states that represents "what do I contain?"
V (Value): A learned projection of hidden states that represents "what information do I carry?"
Attention score: The dot product of a Q vector and a K vector, usually scaled by 1/√d_head. Measures how much one token should attend to another.
Causal mask: Prevents tokens from attending to future tokens. Enforces left-to-right generation.
Multi-head attention: Splitting Q/K/V into multiple heads, each attending in a smaller subspace, then merging results.
FFN (Feed-Forward Network): Per-token transformation: project up, activate, project down. Does not mix tokens.
RoPE: Rotary Position Embedding. Injects positional information into Q and K vectors.
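
As a rough sketch of how several of these pieces interact, the code below implements RMSNorm without its learned gain and a single attention head with a causal mask. It assumes Q, K, and V have already been projected, omits RoPE and multi-head splitting, and uses made-up helper names and toy values.

```python
import math


def rms_norm(x, eps=1e-6):
    """Divide a vector by its root mean square (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention; token i attends only to tokens 0..i."""
    d_head = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Scoring only positions 0..i is the causal mask.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_head)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors of the visible tokens.
        out.append([
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return out


# Toy inputs: 2 tokens, d_head = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
attn_out = causal_attention(Q, K, V)

# Residual connection: add the sublayer input back to its output.
x = [[0.5, 0.5], [0.5, 0.5]]
y = [[a + b for a, b in zip(xi, oi)] for xi, oi in zip(x, attn_out)]
```

The first token can only attend to itself, so its output equals its own value vector.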

Advanced

GQA: Grouped-Query Attention. Fewer K/V heads than Q heads, saving memory and bandwidth.
MQA: Multi-Query Attention. A single K/V head shared across all Q heads.
SWA: Sliding-Window Attention. Each token attends only to a fixed-size window of the most recent tokens rather than the full context.
MoE: Mixture of Experts. Replaces the FFN with multiple expert FFNs, only a few activated per token.
Router: In MoE, the network that scores which experts to activate for each token.
Shared-KV layers: Layers that skip K/V projections and reuse K/V from an earlier layer's cache.
KV cache: Stored K/V tensors from previous tokens, reused during decode to avoid recomputation.
Continuous batching: Server scheduling that inserts new requests into a running batch as slots free up, instead of waiting for all requests to finish.
Prefill: Processing the full prompt through the model. Involves large matrix multiplies.
Decode: Generating tokens one at a time, reusing cached K/V from previous steps.
Ubatch: Microbatch / physical batch. The number of tokens processed in one compute pass. An operational setting that affects speed and memory use without changing the math.
GEMM: General Matrix Multiply. Matrix × matrix. Dominates prefill compute.
GEMV: General Matrix-Vector Multiply. Matrix × vector. Dominates single-sequence decode, where it is memory-bound.
Arithmetic intensity: FLOPs per byte of memory traffic. Determines whether an operation is compute-bound or memory-bound.
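Machine balance: Peak FLOP/s ÷ peak bytes/s for a given piece of hardware. The arithmetic intensity at which an operation shifts from memory-bound to compute-bound.
Perplexity: The exponential of the average per-token negative log-likelihood; a measure of how surprised the model is by the data. Lower = model predicts better. Scores of roughly 5–8 on wikitext-2 are common for LLMs, but the value depends on model size, quantization, and context length, and is not directly comparable across tokenizers.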
Quantization: Reducing numerical precision of weights (e.g. FP16 → Q8_0) to save memory and speed up inference.
Repack: Rearranging weight memory layout for faster kernel access patterns.
Roofline model: Performance analysis tool that plots attainable FLOP/s against arithmetic intensity. Workloads whose intensity is below the ridge point (left of it on the plot) are memory-bound; those above it are compute-bound. The ridge point sits at the machine balance.
TTFT: Time-To-First-Token. The delay between sending a prompt and receiving the first output token. Dominated by prefill time, plus any queueing delay on a busy server.
Prefill interference: When a large prefill operation stalls decode steps for other users in a serving scenario. Mitigated by chunking prefills.
Expert collapse: In MoE, when training causes most tokens to route to the same few experts while others are underused. Prevented by load-balancing loss.
Thread affinity: Pinning each thread to a specific CPU core to avoid cache eviction from thread migration.
BPE: Byte Pair Encoding. A widely used tokenization algorithm. Starting from bytes or characters, it repeatedly merges the most frequent adjacent pair of symbols into a new, longer token.
Weight tying: Sharing the same weight matrix for the embedding table and the output head. Saves parameters; used in models like Gemma.
d_model: The hidden dimension — the width of the residual stream. Every token is a d_model-dimensional vector throughout the model.
d_head: The per-head dimension. Typically d_head = d_model / n_heads, though some models set it independently. Each attention head operates in this subspace.
d_ff: The FFN intermediate dimension, typically 3–4× d_model. The FFN expands to d_ff then contracts back to d_model.
SwiGLU / GELU / SiLU: Activation functions used in FFN layers. SiLU (also called Swish) and GELU are smooth nonlinearities. SwiGLU is a gated variant built on SiLU and commonly used in modern LLMs such as Llama; Gemma uses the GELU-based gated variant (GeGLU). Plain GELU is used in GPT-2 and DistilGPT-2.
Pre-norm: Applying normalization (e.g. RMSNorm) before each sub-layer (attention, FFN) rather than after. Provides cleaner gradient flow for deep networks. Used by virtually all modern LLMs.
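
The GEMM, GEMV, arithmetic intensity, and machine balance entries can be tied together with a short calculation. This sketch assumes FP16 operands (2 bytes each), that every operand is moved through memory exactly once, and a hypothetical accelerator with 100 TFLOP/s of compute and 1 TB/s of bandwidth; none of these numbers describe real hardware.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for [m, k] x [k, n], assuming each operand moves once."""
    flops = 2 * m * k * n  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def classify(intensity, peak_flops, peak_bandwidth):
    """Compare arithmetic intensity against machine balance."""
    machine_balance = peak_flops / peak_bandwidth  # FLOPs per byte
    return "compute-bound" if intensity > machine_balance else "memory-bound"


# Hypothetical hardware: 100 TFLOP/s, 1 TB/s, so machine balance = 100.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

# Prefill: 512 prompt tokens through a 4096 x 4096 weight matrix (GEMM).
prefill_ai = matmul_intensity(512, 4096, 4096)
# Decode: a single token through the same matrix (GEMV).
decode_ai = matmul_intensity(1, 4096, 4096)

prefill_regime = classify(prefill_ai, PEAK_FLOPS, PEAK_BANDWIDTH)
decode_regime = classify(decode_ai, PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumptions the prefill multiply lands well above the machine balance, while the decode multiply performs about one FLOP per byte. The weight matrix dominates memory traffic in both cases, but only prefill reuses it across many tokens.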