CAPSTONE

Capstone: From Concept to Performance Diagnosis

30 min

Overview

This capstone covers the full arc of the course: from the structure of a transformer block, through attention variants and expert routing, to performance profiling and correctness validation.

There are six tasks below, each testing a different core competency. You need to pass all six.

Task 1 — Dense Transformer Block

A dense transformer block has a specific internal structure. You should be able to name every phase in order and explain what each one does. The pattern is: two sub-blocks (attention and FFN), each wrapped in normalization and a residual connection.
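The two-sub-block pattern described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it assumes a pre-norm layout (normalize, run the sub-block, add the residual), RMSNorm, causal multi-head attention, and a plain ReLU FFN. The names `rms_norm`, `dense_block`, `init_params`, and the parameter keys are hypothetical, and real models differ in norm type, activation, and gating.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Scale each token vector to unit root-mean-square, then apply a learned gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def causal_self_attention(x, wq, wk, wv, wo, n_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split into heads: (n_heads, seq, d_head).
    q = (x @ wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    out = softmax(scores) @ v
    # Merge heads back to (seq, d_model) and apply the output projection.
    return out.transpose(1, 0, 2).reshape(seq, d_model) @ wo

def ffn(x, w_up, w_down):
    # Assumed: a plain two-layer MLP with ReLU (many models use gated variants).
    return np.maximum(x @ w_up, 0.0) @ w_down

def dense_block(x, p, n_heads):
    # Sub-block 1: norm -> attention -> residual add.
    h = x + causal_self_attention(rms_norm(x, p["norm1"]),
                                  p["wq"], p["wk"], p["wv"], p["wo"], n_heads)
    # Sub-block 2: norm -> FFN -> residual add.
    return h + ffn(rms_norm(h, p["norm2"]), p["w_up"], p["w_down"])

def init_params(rng, d_model, d_ff):
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {"norm1": np.ones(d_model), "norm2": np.ones(d_model),
            "wq": w(d_model, d_model), "wk": w(d_model, d_model),
            "wv": w(d_model, d_model), "wo": w(d_model, d_model),
            "w_up": w(d_model, d_ff), "w_down": w(d_ff, d_model)}

rng = np.random.default_rng(0)
params = init_params(rng, d_model=8, d_ff=32)
x = rng.normal(size=(5, 8))          # 5 tokens, d_model = 8
y = dense_block(x, params, n_heads=2)
```

The output has the same shape as the input, which is what lets blocks be stacked. Because normalization and the FFN act per token and attention is causally masked, changing the last token leaves the outputs at earlier positions unchanged.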

Task 2 — MoE Expert Selection

A Mixture-of-Experts layer replaces the dense FFN with a router and multiple expert FFNs. You should be able to explain how the router scores experts, how top-k selection works, and how the selected expert outputs are combined.
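The three steps above (score, select top-k, combine) can be sketched as follows. This assumes one common gating scheme: a linear router, followed by a softmax over only the selected experts' logits. Other schemes apply the softmax over all experts before selection. The function `moe_layer` and the toy `tanh` experts are hypothetical and stand in for real expert FFNs.

```python
import numpy as np

def moe_layer(x, w_router, experts, top_k=2):
    """x: (tokens, d_model); w_router: (d_model, n_experts); experts: list of callables."""
    # 1. Score: one router logit per (token, expert) pair.
    logits = x @ w_router
    # 2. Select: indices of the top_k highest-scoring experts for each token.
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Assumed gating: softmax over the selected logits only, so gates sum to 1.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=-1, keepdims=True)
    # 3. Combine: each expert runs only on the tokens routed to it, and its
    #    output is added to the result, weighted by that token's gate.
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue
        out[token_ids] += gates[token_ids, slot][:, None] * expert(x[token_ids])
    return out, top_idx, gates

rng = np.random.default_rng(0)
d_model, n_experts = 4, 4
expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
# Toy experts: a single tanh layer each, standing in for full FFNs.
experts = [lambda h, w=w: np.tanh(h @ w) for w in expert_weights]
w_router = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=(6, d_model))
out, top_idx, gates = moe_layer(x, w_router, experts, top_k=2)
```

Only `top_k` of the experts run for any given token, which is why an MoE layer can hold many more parameters than a dense FFN at a similar per-token compute cost.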

Task 3 — Positional Information in Attention

Transformers have no inherent notion of position. You should be able to explain how positional information enters the attention computation — whether through absolute embeddings, relative encodings, or rotary position embeddings (RoPE) — and why it matters for the model to know token order.
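As one concrete case, RoPE can be sketched in a few lines. The version below assumes the interleaved pairing of dimensions (0 with 1, 2 with 3, and so on) and a base of 10000. Some implementations pair the first half of the vector with the second half instead. The function name `rope` is hypothetical.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles.

    x: (seq, d_head) with even d_head; positions: (seq,) token positions.
    """
    d = x.shape[-1]
    # One rotation frequency per dimension pair, decreasing geometrically.
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation applied to each (even, odd) pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
# Same relative offset (2) at two different absolute positions.
score_a = (rope(q, np.array([3])) @ rope(k, np.array([1])).T).item()
score_b = (rope(q, np.array([10])) @ rope(k, np.array([8])).T).item()
```

Here `score_a` and `score_b` agree up to floating-point error: after rotation, the query-key dot product depends only on the relative offset between the two positions, which is how RoPE injects position directly into the attention scores.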

Task 4 — Prefill vs Decode

Inference has two distinct phases with different computational characteristics. You should be able to explain what happens during prefill versus decode, why they have different bottlenecks, and how batch size (number of tokens processed at once) determines whether the system is compute-bound or memory-bound.
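A back-of-the-envelope calculation makes the compute-bound versus memory-bound distinction concrete. The sketch below estimates the arithmetic intensity (FLOPs per byte moved) of one weight matrix multiply and compares it with a machine's ratio of peak compute to memory bandwidth. The hardware figures (100 TFLOP/s, 1 TB/s) are hypothetical, the function names are illustrative, and the model ignores caches, attention, and KV cache traffic.

```python
def matmul_intensity(n_tokens, d_in, d_out, bytes_per_weight=2, bytes_per_act=2):
    """FLOPs per byte for multiplying n_tokens activation rows by a (d_in, d_out) weight."""
    flops = 2 * n_tokens * d_in * d_out  # one multiply and one add per weight per token
    # Weights are read once; activations are read (d_in) and written (d_out) per token.
    bytes_moved = (d_in * d_out * bytes_per_weight
                   + n_tokens * (d_in + d_out) * bytes_per_act)
    return flops / bytes_moved

def classify(n_tokens, d_in, d_out, peak_flops, mem_bandwidth):
    # Below the machine's FLOP-per-byte ratio, memory traffic is the limit.
    ridge = peak_flops / mem_bandwidth
    intensity = matmul_intensity(n_tokens, d_in, d_out)
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s bandwidth (ridge = 100 FLOP/byte).
PEAK_FLOPS, MEM_BW = 100e12, 1e12

decode_intensity = matmul_intensity(1, 4096, 4096)     # one new token per step
prefill_intensity = matmul_intensity(512, 4096, 4096)  # 512 prompt tokens at once
decode_class = classify(1, 4096, 4096, PEAK_FLOPS, MEM_BW)
prefill_class = classify(512, 4096, 4096, PEAK_FLOPS, MEM_BW)
```

With these assumed numbers, a single decode token yields roughly 1 FLOP per byte, because the whole weight matrix is read to process one row, while a 512-token prefill yields about 410 FLOPs per byte. The first falls well below the ridge point of 100 and the second well above it.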

Task 5 — Profile Interpretation

Given a profile showing operator-level time distribution, you should be able to identify the bottleneck class and propose a plausible diagnosis. The key is to read the distribution, not just the hottest function.
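The difference between reading the distribution and reading the hottest function can be shown with a toy profile. Everything in this sketch is made up for illustration: the operator names, the timings, and the `OP_CLASS` grouping are hypothetical and do not come from any real profiler.

```python
from collections import defaultdict

# Hypothetical grouping of operator names into bottleneck classes.
OP_CLASS = {
    "matmul_qkv": "matmul", "matmul_ffn": "matmul",
    "attention_scores": "attention", "softmax": "attention",
    "copy_kv": "data_movement", "transpose": "data_movement",
    "concat": "data_movement", "contiguous": "data_movement",
    "add_residual": "elementwise",
}

def summarize(profile):
    """Return each class's share of total time, largest share first."""
    total = sum(profile.values())
    by_class = defaultdict(float)
    for op, ms in profile.items():
        by_class[OP_CLASS.get(op, "other")] += ms
    shares = {c: t / total for c, t in by_class.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))

# Made-up per-operator times in milliseconds.
profile = {
    "matmul_qkv": 3.0, "matmul_ffn": 4.0,
    "attention_scores": 2.0, "softmax": 1.0,
    "copy_kv": 2.5, "transpose": 2.0, "concat": 1.5, "contiguous": 2.0,
    "add_residual": 2.0,
}

hottest_op = max(profile, key=profile.get)
shares = summarize(profile)
for op_class, share in shares.items():
    print(f"{op_class:14s} {share:6.1%}")
```

In this made-up profile the single hottest operator is `matmul_ffn`, yet the largest class is data movement at 40% of total time, spread across four small operators. A plausible diagnosis would be excess layout conversion or copying rather than slow matrix multiplies, something that ranking operators by individual time would miss.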

Task 6 — Correctness Validation

Speed without correctness is a bug, not an optimization. You should be able to explain why correctness validation is required after any optimization and how to detect a regression.
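One way to detect a regression is to compare the optimized path's logits against a trusted reference on the same inputs. The sketch below is an assumption about how such a check might look, not a prescribed method: the metrics (maximum absolute error, top-1 agreement, mean KL divergence) and the thresholds in `is_regression` are hypothetical and would need tuning per model and precision.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def compare_logits(ref, opt):
    """ref, opt: (positions, vocab) logits from the reference and optimized paths."""
    ref_lp, opt_lp = log_softmax(ref), log_softmax(opt)
    # KL(ref || opt) per position: how far the optimized distribution has drifted.
    kl = (np.exp(ref_lp) * (ref_lp - opt_lp)).sum(axis=-1)
    return {
        "max_abs_err": float(np.abs(ref - opt).max()),
        "top1_agreement": float((ref.argmax(-1) == opt.argmax(-1)).mean()),
        "mean_kl": float(kl.mean()),
    }

def is_regression(metrics, max_kl=1e-3, min_top1=0.99):
    # Hypothetical thresholds, chosen only for this illustration.
    return metrics["mean_kl"] > max_kl or metrics["top1_agreement"] < min_top1

rng = np.random.default_rng(0)
ref = rng.normal(size=(32, 100))
ref[np.arange(32), rng.integers(0, 100, size=32)] += 5.0  # give each row a clear favorite

# Simulated benign change: tiny numerical noise, as from a reordered reduction.
good = ref + rng.normal(scale=1e-6, size=ref.shape)
# Simulated bug: an off-by-one shift along the vocabulary axis.
bad = np.roll(ref, 1, axis=-1)

good_metrics = compare_logits(ref, good)
bad_metrics = compare_logits(ref, bad)
```

Under these assumed thresholds the noisy variant passes and the shifted variant is flagged. Exact equality is not a useful test, since legitimate optimizations change floating-point rounding, so the check uses tolerances on distribution-level metrics.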

Course Complete

You have demonstrated understanding of transformer architecture, attention variants, expert routing, inference phases, profile interpretation, and correctness validation. You can now read real model code, interpret performance profiles, and reason about optimizations with confidence.

From zero to hero. Well done.

Where to Go Next

Run

Clone llama.cpp, build it, download a small GGUF model (e.g., Gemma 2B Q4), and run the perplexity tool on wikitext-2. Then try llama-server to serve it locally and send prompts via the API. You now understand every stage of what happens inside.

Read

Start with Attention Is All You Need (Vaswani et al., 2017), the original transformer paper; you now have the vocabulary to read it. Follow it with RoFormer (Su et al., 2021), the paper that introduced RoPE. Then read FlashAttention (Dao et al., 2022) for a widely adopted attention kernel optimization, and Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023) for how vLLM addresses KV cache fragmentation.

Compare

Map the same concepts to other frameworks: vLLM (PagedAttention, continuous batching), SGLang (RadixAttention, compiler-driven serving), Hugging Face Transformers (Python-first, research-oriented). Each makes different tradeoffs — you now have the foundation to understand why.

Follow llama.cpp PRs to see performance optimizations land in real time — you can now read them.