
Quiz: Inference Mechanics

12 min

This quiz covers the inference pipeline: prefill, decode, the KV cache, batching, continuous batching, and context scaling. You need 80% or better to proceed.

Check Yourself
Q1 (reasoning)

A user sends a 500-token prompt and requests a 100-token response. Which phase, prefill or decode, determines the time-to-first-token (TTFT), the delay before any output appears?

Q2 (reasoning)

During decode, the model computes Q for the new token but reads K and V for all earlier tokens from a cache. Why is Q not cached alongside K and V?

Q3 (conceptual)

A server is using continuous batching with 4 slots. Request R2 finishes while R1, R3, and R4 are still decoding. What happens next?

Q4 (shape)

During decode, the attention score vector for the new token has shape [1, n_past + 1] (where n_past is the number of previously cached tokens). If 500 tokens were cached before this step, what is the length of the score vector?
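
If the shape bookkeeping is unclear, the following minimal sketch may help. It uses plain Python lists and a single attention head, with a deliberately different cache size from the one in the question. The names `decode_scores`, `k_cache`, `k_new`, and `d_head` are illustrative assumptions, not part of any particular library.

```python
import math

def decode_scores(q, k_cache, k_new):
    """Scaled dot-product scores for one new token (single head).

    q       : query vector of the new token, length d_head
    k_cache : list of n_past cached key vectors
    k_new   : key vector of the new token itself
    """
    keys = k_cache + [k_new]             # n_past cached keys, then the new key
    scale = 1.0 / math.sqrt(len(q))      # standard 1/sqrt(d_head) scaling
    return [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

d_head = 4
n_past = 7                               # illustrative value only
k_cache = [[0.1 * (i + j) for j in range(d_head)] for i in range(n_past)]
q = [1.0, 0.0, -1.0, 0.5]
k_new = [0.2, 0.4, 0.6, 0.8]

scores = decode_scores(q, k_cache, k_new)
print(len(scores))                       # one score per key: n_past + 1
```

The new token attends to every cached position and to itself, so the score vector has one entry per cached key plus one for the new key.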

Q5 (conceptual)

Why does doubling the context length more than double the total prefill time?