Quiz: Inference Mechanics

12 min

Module 6 Quiz

This quiz covers the inference pipeline: prefill, decode, the KV cache, batching, continuous batching, and context scaling. You need 80% or better to proceed.

Check Yourself

Q1 (reasoning)

A user sends a 500-token prompt and requests a 100-token response. Which phase determines the time-to-first-token (TTFT) — the delay before any output appears?

A. Prefill: the user waits while all 500 prompt tokens are processed before seeing any output
B. Decode: generating 100 tokens takes longer than processing 500
C. Neither: the total time is split evenly between prefill and decode

Q2 (reasoning)

During decode, the new token computes Q but reads K and V from a cache. Why is Q not cached alongside K and V?

A. Q is too large to cache efficiently
B. Q is only needed for the current token: it asks "what am I looking for?" and is never reused by future tokens
C. Q is always identical to K, so caching it would be redundant
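For review, the decode step described in this question can be sketched in plain Python. This is a minimal single-head illustration, not any particular library's implementation: the function name `decode_step`, the helper functions, and the list-of-lists weight matrices are all assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def decode_step(x, Wq, Wk, Wv, k_cache, v_cache):
    """One single-head attention decode step for the new token embedding x.

    Hypothetical sketch: k_cache and v_cache are plain lists that grow by
    one entry per generated token.
    """
    q = matvec(Wq, x)               # computed fresh, used in this step only
    k_cache.append(matvec(Wk, x))   # kept: every later token attends to it
    v_cache.append(matvec(Wv, x))   # kept for the same reason

    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in k_cache]

    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the cached value vectors.
    d = len(v_cache[0])
    out = [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(d)]
    return out, weights

# Usage: two decode steps with 2-dimensional identity projections.
I = [[1.0, 0.0], [0.0, 1.0]]
k_cache, v_cache = [], []
decode_step([1.0, 0.0], I, I, I, k_cache, v_cache)
out, weights = decode_step([0.0, 1.0], I, I, I, k_cache, v_cache)
print(len(k_cache), len(weights))
```

Note that `q` is a local variable that goes out of scope when the function returns, while the caches persist across calls.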

Q3 (conceptual)

A server is using continuous batching with 4 slots. Request R2 finishes while R1, R3, and R4 are still decoding. What happens next?

A. The server waits for all requests to finish before accepting new ones
B. R2's slot is immediately available for a new incoming request
C. The server restarts all requests from scratch
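For review, slot reuse under continuous batching can be simulated with a toy scheduler. This is a simplified sketch under stated assumptions: all requests are already queued, each needs at least one token, every active request emits exactly one token per step, and prefill is ignored. The function name `run_continuous_batching` and the request R5 are hypothetical.

```python
from collections import deque

def run_continuous_batching(requests, num_slots):
    """Simulate slot occupancy per decode step.

    requests: list of (name, tokens_to_generate) in arrival order.
    Returns one tuple per step naming the request in each slot (or None).
    """
    waiting = deque(requests)
    slots = [None] * num_slots   # each slot holds [name, tokens_remaining]
    timeline = []
    while waiting or any(s is not None for s in slots):
        # Fill any free slot from the queue before the next decode step.
        for i in range(num_slots):
            if slots[i] is None and waiting:
                name, n = waiting.popleft()
                slots[i] = [name, n]
        timeline.append(tuple(s[0] if s else None for s in slots))
        # One decode step: every active request emits one token.
        for i, s in enumerate(slots):
            if s is not None:
                s[1] -= 1
                if s[1] <= 0:
                    slots[i] = None   # finished: slot is free next step
    return timeline

# Usage: R2 needs only one token, so its slot frees up after the first step.
steps = run_continuous_batching(
    [("R1", 3), ("R2", 1), ("R3", 3), ("R4", 3), ("R5", 2)], num_slots=4
)
for step in steps:
    print(step)
```

In this toy run the other three requests keep decoding without interruption while the freed slot is handed to the next queued request.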

Q4 (shape)

During decode, the attention score vector for the new token has shape [1, n_past + 1] (where n_past is the number of previously cached tokens). If 500 tokens were cached before this step, what is the length of the score vector?

A. 1
B. 501
C. d_model
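For review, the shape rule in this question can be checked on a deliberately small case. The sketch below assumes unscaled, pre-softmax dot-product scores and uses 3 cached tokens rather than the 500 in the question; the function name `decode_scores` is hypothetical.

```python
def decode_scores(q, cached_keys, new_key):
    """Raw attention scores for the new token during decode.

    The new token attends to every cached key plus its own key,
    so the result has one entry per key: n_past + 1 in total.
    """
    keys = cached_keys + [new_key]
    return [sum(a * b for a, b in zip(q, k)) for k in keys]

# Usage: n_past = 3 cached keys, each of dimension 2.
cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = decode_scores([0.5, 0.5], cached, new_key=[2.0, 0.0])
print(len(scores))  # 4, i.e. n_past + 1
```

The length of the score vector depends only on how many keys exist, not on the key dimension.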

Q5 (conceptual)

Why does doubling the context length more than double the total prefill time?

A. Because the model weights are loaded twice
B. Because FFN cost quadruples when tokens double
C. Because attention cost scales quadratically with sequence length, so doubling tokens roughly quadruples the attention work
D. Because the tokenizer is slower on longer inputs
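For review, the scaling argument can be checked with a rough cost proxy. This is a sketch, not a real FLOP count: constant factors, the Q/K/V projections, and the number of layers are omitted, and the dimensions `d_model=64` and `d_ff=256` are hypothetical values chosen only for illustration.

```python
def prefill_cost(n_tokens, d_model, d_ff):
    """Rough per-layer cost proxies for prefill (constants omitted)."""
    attention = n_tokens * n_tokens * d_model   # every token scores every token
    ffn = n_tokens * d_model * d_ff             # per-token work, linear in n
    return attention, ffn

def growth_when_doubling(n_tokens, d_model=64, d_ff=256):
    """Return (attention, FFN, total) cost ratios when n_tokens doubles."""
    a1, f1 = prefill_cost(n_tokens, d_model, d_ff)
    a2, f2 = prefill_cost(2 * n_tokens, d_model, d_ff)
    return a2 / a1, f2 / f1, (a2 + f2) / (a1 + f1)

# Usage: compare a 1,000-token prompt with a 2,000-token prompt.
attn_ratio, ffn_ratio, total_ratio = growth_when_doubling(1000)
print(attn_ratio, ffn_ratio, total_ratio)
```

Under this proxy the FFN term doubles and the attention term quadruples, so the combined ratio lands between 2 and 4 and moves closer to 4 as the prompt gets longer.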
