Quiz: Inference Mechanics
This quiz covers the inference pipeline: prefill, decode, the KV cache, batching, continuous batching, and context scaling. You need 80% or better to proceed.
A user sends a 500-token prompt and requests a 100-token response. Which phase determines the time-to-first-token (TTFT), the delay before any output appears?
During decode, the model computes a fresh Q for the new token but reads the K and V of earlier tokens from a cache. Why is Q not cached alongside K and V?
A server is using continuous batching with 4 slots. Request R2 finishes while R1, R3, and R4 are still decoding. What happens next?
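To reason about the scenario above, it can help to step through a toy scheduler. The sketch below is a simplified model, not any particular server's implementation: the `step` function, the remaining-token counts, and the waiting request `R5` are all hypothetical, and a real server would also need to run prefill for a newly admitted request.

```python
from collections import deque

def step(slots, waiting):
    """Advance every active slot by one decode iteration.

    slots   : list of [name, tokens_left] entries, or None for an empty slot
    waiting : deque of (name, tokens_to_generate) requests not yet admitted
    """
    # Each active request emits one token this iteration.
    for i, slot in enumerate(slots):
        if slot is None:
            continue
        slot[1] -= 1
        if slot[1] == 0:
            slots[i] = None  # finished: the slot is released immediately

    # Freed slots are refilled from the queue without waiting for the
    # rest of the batch to finish (the assumption this sketch illustrates).
    for i, slot in enumerate(slots):
        if slot is None and waiting:
            slots[i] = list(waiting.popleft())
    return slots

# Hypothetical state: R2 has one token left, the others have more.
slots = [["R1", 5], ["R2", 1], ["R3", 4], ["R4", 3]]
waiting = deque([("R5", 6)])

step(slots, waiting)
names = [s[0] if s else None for s in slots]
print(names)
```

Comparing `names` before and after the call shows which requests keep decoding and what happens to the slot that R2 occupied.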
During decode, the attention score vector for the new token has shape [1, n_past + 1] (where n_past is the number of previously cached tokens). If 500 tokens were cached before this step, what is the length of the score vector?
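The shape rule can be checked on a small case. The following is a minimal single-head sketch in plain Python, assuming scaled dot-product scores; the helper names, `head_dim`, and the toy values are illustrative assumptions, and it deliberately uses 8 cached tokens rather than the 500 in the question.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def decode_step_scores(q, k_cache, k_new):
    """Raw attention scores for one decode step (single head, pre-softmax).

    q       : query vector of the new token
    k_cache : key vectors of the n_past previously processed tokens
    k_new   : key vector of the new token itself
    """
    # The new token attends to every cached token and to itself.
    keys = k_cache + [k_new]
    scale = len(q) ** 0.5
    return [dot(q, k) / scale for k in keys]

n_past = 8      # toy value, not the 500 from the question
head_dim = 4    # hypothetical head dimension

k_cache = [[0.1 * (i + 1)] * head_dim for i in range(n_past)]
q = [1.0] * head_dim
k_new = [0.5] * head_dim

scores = decode_step_scores(q, k_cache, k_new)
print(len(scores))  # n_past + 1
```

Note that `q` is used once and then discarded, while `k_new` would be appended to the cache for later steps.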
Why does doubling the context length more than double the total prefill time?
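One way to test the premise of this question is a rough cost model. The sketch below assumes causal attention, where token i scores against i keys, plus a per-token term for the projections and feed-forward layers. The function name and the unit cost weights are hypothetical and chosen only to show the trend, not to predict real timings.

```python
def prefill_cost(n_tokens, per_token_cost=1.0, per_pair_cost=1.0):
    """Toy prefill cost: a linear term plus a causal-attention term.

    per_token_cost : hypothetical cost of projections/MLP for one token
    per_pair_cost  : hypothetical cost of one query-key score
    """
    # Under a causal mask, token i attends to i keys (itself included),
    # so the number of query-key pairs is 1 + 2 + ... + n.
    attention_pairs = n_tokens * (n_tokens + 1) // 2
    return per_token_cost * n_tokens + per_pair_cost * attention_pairs

short = prefill_cost(1000)
long = prefill_cost(2000)
ratio = long / short
print(f"cost ratio when the context doubles: {ratio:.2f}")
```

With these weights the ratio lands between 2 and 4. How close it sits to either end depends on how large the per-token term is relative to the attention term, which in practice varies with model size and context length.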