- GQA
- Grouped-Query Attention. Fewer K/V heads than Q heads, saving memory and bandwidth.
- MQA
- Multi-Query Attention. A single K/V head shared across all Q heads.
- SWA
- Sliding-Window Attention. In layers that use it, each token attends only to a local window of recent tokens rather than the full context.
- MoE
- Mixture of Experts. Replaces the FFN with multiple expert FFNs, only a few activated per token.
- Router
- In MoE, the network that scores which experts to activate for each token.
- Shared-KV layers
- Layers that skip K/V projections and reuse K/V from an earlier layer's cache.
- KV cache
- Stored K/V tensors from previous tokens, reused during decode to avoid recomputation.
- Continuous batching
- Server scheduling that inserts new requests into a running batch as slots free up, instead of waiting for all requests to finish.
- Prefill
- Processing the full prompt through the model. Involves large matrix multiplies.
- Decode
- Generating tokens one at a time, reusing cached K/V from previous steps.
- Ubatch
- Microbatch, or physical batch: the number of tokens processed in a single compute pass. An operational setting that affects execution (speed, memory use) without changing the math.
- GEMM
- General Matrix Multiply. Matrix × matrix. Dominates prefill compute.
- GEMV
- General Matrix-Vector Multiply. Matrix × vector. Dominates decode and is memory-bound, since every weight is read from memory to produce a single token.
- Arithmetic intensity
- FLOPs per byte of memory traffic. Determines whether an operation is compute-bound or memory-bound.
- Machine balance
- Peak FLOP/s ÷ peak bytes/s for a given piece of hardware. The arithmetic intensity at which an operation shifts from memory-bound to compute-bound.
- Perplexity
- A measure of how surprised the model is by the data: the exponential of the average per-token negative log-likelihood. Lower means the model predicts better. Typical LLM scores: 5–8 on wikitext-2.
- Quantization
- Reducing numerical precision of weights (e.g. FP16 → Q8_0) to save memory and speed up inference.
- Repack
- Rearranging weight memory layout for faster kernel access patterns.
- Roofline model
- Performance analysis tool that plots attainable FLOP/s against arithmetic intensity. Workloads whose arithmetic intensity falls below the ridge point are memory-bound; those above it are compute-bound. The ridge point sits at the machine balance.
- TTFT
- Time-To-First-Token. The delay between sending a prompt and receiving the first output token. Dominated by prefill time, plus any queueing delay on a busy server.
- Prefill interference
- When a large prefill operation stalls decode steps for other users in a serving scenario. Mitigated by chunking prefills.
- Expert collapse
- In MoE, when training causes most tokens to route to the same few experts while others are underused. Typically countered with an auxiliary load-balancing loss.
- Thread affinity
- Pinning each thread to a specific CPU core to avoid cache eviction from thread migration.
- BPE
- Byte Pair Encoding. The dominant tokenization algorithm. Iteratively merges frequent byte pairs into longer tokens.
- Weight tying
- Sharing the same weight matrix for the embedding table and the output head. Saves parameters; used in models like Gemma.
- d_model
- The hidden dimension — the width of the residual stream. Every token is a d_model-dimensional vector throughout the model.
- d_head
- The per-head dimension. Commonly d_head = d_model / n_heads, though some models set it independently. Each attention head operates in this subspace.
- d_ff
- The FFN intermediate dimension, typically 3–4× d_model. The FFN expands to d_ff then contracts back to d_model.
- SwiGLU / GELU / SiLU
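- Activation functions used in FFN layers. SiLU (also called Swish) is x · sigmoid(x). SwiGLU is a gated variant built on SiLU, commonly used in modern LLMs such as Llama; Gemma uses the GELU-gated counterpart, GeGLU. GELU is used in GPT-2 and DistilGPT-2.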
- Pre-norm
- Applying normalization (RMSNorm in most modern LLMs) before each sub-layer (attention, FFN) rather than after. Provides cleaner gradient flow for deep networks. Used by virtually all modern LLMs.
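
Several entries above (GEMM, GEMV, arithmetic intensity, machine balance, roofline model) fit together numerically. The sketch below is one rough way to see it, not a measurement: it assumes FP16 operands (2 bytes each), that every matrix crosses the memory bus exactly once, and a made-up device with 100 TFLOP/s peak compute and 1 TB/s peak bandwidth. The function names and hardware figures are illustrative assumptions, not taken from any particular library or chip.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes A, B and C each move through memory exactly once
    (an idealisation; real kernels may re-read tiles).
    """
    flops = 2 * m * n * k  # one multiply + one add per term
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


def attainable_flops(intensity: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * peak_bw)


def regime(intensity: float, peak_flops: float, peak_bw: float) -> str:
    """Compare arithmetic intensity against machine balance (the ridge point)."""
    machine_balance = peak_flops / peak_bw
    return "compute-bound" if intensity >= machine_balance else "memory-bound"


# Hypothetical hardware: 100 TFLOP/s, 1 TB/s -> machine balance of 100 FLOPs/byte.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

d_model = 4096  # illustrative hidden size

# Decode: one token against a d_model x d_model weight (GEMV-shaped).
decode_ai = matmul_intensity(1, d_model, d_model)
# Prefill: 512 prompt tokens against the same weight (GEMM-shaped).
prefill_ai = matmul_intensity(512, d_model, d_model)

for name, ai in [("decode", decode_ai), ("prefill", prefill_ai)]:
    print(
        f"{name}: {ai:.1f} FLOPs/byte, "
        f"{regime(ai, PEAK_FLOPS, PEAK_BW)}, "
        f"attainable {attainable_flops(ai, PEAK_FLOPS, PEAK_BW) / 1e12:.1f} TFLOP/s"
    )
```

Under these assumptions the single-token case lands near 1 FLOP/byte, far below the machine balance, while the 512-token case lands well above it. That is the sense in which decode is memory-bound and prefill is compute-bound.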