- GQA
- Grouped-Query Attention. The Q heads are divided into groups that share a smaller set of K/V heads, saving KV-cache memory and bandwidth.
- MQA
- Multi-Query Attention. A single K/V head shared across all Q heads.
- SWA
- Sliding-Window Attention. Each token attends only to a fixed-size local window of recent tokens instead of the full context.
- MoE
- Mixture of Experts. Replaces the dense FFN with multiple expert FFNs, of which only a few are activated per token.
- Router
- In MoE, the network that scores which experts to activate for each token.
- KV cache
- Stored K/V tensors from previous tokens, reused during decode to avoid recomputation.
- Prefill
- Processing the full prompt through the model in one pass. Dominated by large matrix multiplies and typically compute-bound.
- Decode
- Generating tokens one at a time, reusing cached K/V from previous steps. Typically memory-bandwidth-bound.
- Ubatch
- Microbatch / physical batch: the number of tokens processed in a single forward pass. An operational setting that affects speed and memory use without changing the computed results.
- Perplexity
- The exponential of the average negative log-likelihood per token. Lower means the model predicts the data better.
- Quantization
- Reducing the numerical precision of weights (e.g. FP16 → Q8_0) to save memory and speed up inference.
- Repack
- Rearranging weight memory layout for faster kernel access patterns.
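
The GQA and MQA entries above can be illustrated with a shape-level NumPy sketch. This is a toy single-token-batch example with made-up head counts, not any particular library's API; each group of Q heads indexes into one shared K/V head, and MQA is the special case of a single K/V head.

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    group = n_q_heads // n_kv_heads  # Q heads per shared K/V head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # all Q heads in a group read the same K/V head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With `n_kv_heads == n_q_heads` this is ordinary multi-head attention; with `n_kv_heads == 1` it is MQA. The saving is in the K/V tensors (and hence the KV cache), which shrink by the group factor.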
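
The SWA entry can be made concrete as an attention mask. A minimal sketch (the function name and window convention are assumptions): each position sees at most `window` tokens, causally.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # token i may attend to tokens j with i - window < j <= i (causal + local)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

Applying this mask layer after layer still lets information propagate beyond the window, since each layer's outputs already summarize their own window.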
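
The MoE and Router entries can be sketched together. This is a toy top-k router over per-token expert FFNs (names, shapes, and the gating scheme are illustrative assumptions, not a specific model's design):

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    # the router scores every expert; only the top_k experts run for this token
    logits = x @ router_w                 # (n_experts,)
    chosen = np.argsort(logits)[-top_k:]  # indices of the top_k experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                  # softmax over the chosen experts only
    return sum(g * experts[i](x) for g, i in zip(gates, chosen))
```

Total parameter count grows with the number of experts, but per-token compute grows only with `top_k`.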
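
The KV cache and Decode entries can be sketched as a single-head, single-vector decode step. This is a toy illustration (weight names and the dict-of-lists cache are assumptions): only the new token's K/V are computed, then attention runs over everything cached so far.

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, cache):
    # compute K/V for the new token only and append to the cache,
    # instead of recomputing K/V for the whole prefix
    cache["k"].append(x_t @ W_k)
    cache["v"].append(x_t @ W_v)
    q_t = x_t @ W_q
    K, V = np.stack(cache["k"]), np.stack(cache["v"])  # (t, d)
    scores = K @ q_t / np.sqrt(q_t.shape[-1])          # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V
```

Prefill fills this cache for the whole prompt in one batched pass; decode then extends it one token at a time.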
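
The Perplexity entry corresponds to a one-line formula: the exponential of the average negative log-likelihood. A minimal sketch over per-token probabilities:

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-likelihood of the observed tokens
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

If the model assigns every token probability 1/4, perplexity is exactly 4: the model is as "surprised" as if it were choosing uniformly among four options.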
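
The Quantization entry can be illustrated with symmetric round-to-nearest int8 quantization. This is a simplified per-tensor sketch (real formats such as Q8_0 use per-block scales, and the zero-tensor edge case is ignored here):

```python
import numpy as np

def quantize_int8(w):
    # symmetric round-to-nearest: int8 values plus one float scale per tensor
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The maximum roundtrip error is half a quantization step (`scale / 2`), which is the memory-for-precision trade the glossary entry describes.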