L37

Threads, Affinity, and Scheduling

16 min

Question

Why does thread count affect inference?

Intuition

Inference workloads can be split across multiple CPU threads. Each thread processes a portion of the matrix multiply, attention computation, or FFN. More threads means more parallelism — up to a point.

The ideal thread count depends on the hardware and the workload. A machine has a fixed number of physical cores, each with its own execution resources and private L1 cache (L2/L3 may be private or shared depending on the CPU). Many machines also support SMT (Simultaneous Multi-Threading, also called Hyper-Threading on Intel), which exposes two logical cores per physical core. For memory-bound workloads, using SMT threads rarely helps because both logical cores share the same execution resources and memory port — you get more threads competing for the same bandwidth.

Thread affinity pins each thread to a specific core. Without affinity, the operating system may migrate threads between cores, evicting warm caches. With affinity, each thread stays on its core, improving cache locality — though data can still be evicted by capacity misses or other work on that core.
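As a rough illustration of pinning, the sketch below uses Python's os.sched_setaffinity, which is available on Linux only. The helper name pin_current_thread is hypothetical. Inference runtimes do the equivalent in native code, so this shows the mechanism rather than how any particular runtime implements it.

```python
import os


def pin_current_thread(cores):
    """Restrict the caller to `cores`; return the resulting CPU set.

    Returns None on platforms without sched_setaffinity (e.g. macOS, Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)      # CPUs we are currently allowed to use
    target = set(cores) & allowed          # never request a CPU outside that set
    if not target:
        raise ValueError(f"none of {sorted(cores)} is in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, target)        # pid 0 means "the caller"
    return os.sched_getaffinity(0)
```

Calling pin_current_thread([0]) keeps the caller on CPU 0 when that CPU is allowed, so the scheduler can no longer migrate it to another core.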

Different inference phases benefit from different thread counts. Prefill is compute-bound (large GEMM), so using all physical cores helps. Decode is memory-bound (GEMV), so adding more threads beyond the number needed to saturate memory bandwidth can actually slow things down due to synchronization overhead.

Toy Example

Machine: 8 physical cores, 16 logical cores (SMT). Memory bandwidth: 50 GB/s.

Prefill (compute-bound GEMM)

4 threads: 55% of peak FLOPS (cores underutilized)

8 threads: 92% of peak FLOPS (all physical cores busy)

16 threads: 88% of peak FLOPS (SMT contention hurts slightly)

Decode (memory-bound GEMV)

4 threads: 85% of peak bandwidth (already near bandwidth limit)

8 threads: 95% of peak bandwidth (saturated)

16 threads: 90% of peak bandwidth (sync overhead eats gains)

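The best thread count is 8 for both, but for different reasons. Prefill needs every physical core's ALUs, so going from 4 to 8 threads adds 37 points of peak FLOPS. Decode is already at 85% of peak bandwidth with 4 threads, so doubling to 8 adds only 10 points, and going to 16 makes both phases slower.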
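The toy numbers can be checked mechanically. This short sketch copies the percentages from the example above into dictionaries and reports the best thread count and the gain from each doubling. The function names are illustrative only.

```python
# Toy measurements from the example: thread count -> percent of peak.
prefill_pct_peak_flops = {4: 55, 8: 92, 16: 88}
decode_pct_peak_bandwidth = {4: 85, 8: 95, 16: 90}


def best_thread_count(measurements):
    """Thread count with the highest measured utilization."""
    return max(measurements, key=measurements.get)


def gain_per_step(measurements):
    """Change in utilization between consecutive thread counts."""
    counts = sorted(measurements)
    return {(a, b): measurements[b] - measurements[a]
            for a, b in zip(counts, counts[1:])}


print(best_thread_count(prefill_pct_peak_flops), gain_per_step(prefill_pct_peak_flops))
print(best_thread_count(decode_pct_peak_bandwidth), gain_per_step(decode_pct_peak_bandwidth))
```

Both phases peak at 8 threads, but prefill gains far more from the step from 4 to 8 than decode does.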

Shapes

Work partitioning for [m, k] × [k, n] across T threads:

GEMM: split m rows across T threads → each thread does [m/T, k] × [k, n]

GEMV: split n columns across T threads → each thread does [1, k] × [k, n/T]

Threads must synchronize to combine results. More threads means more synchronization.
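The partitioning above can be sketched as follows. This is a minimal illustration that assumes a simple contiguous split with the remainder spread over the first threads. Real kernels may chunk work differently, and the function names are hypothetical.

```python
def split_range(total, n_threads):
    """Return one [start, end) slice per thread; sizes differ by at most 1."""
    base, extra = divmod(total, n_threads)
    slices, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)   # first `extra` threads take one more
        slices.append((start, start + size))
        start += size
    return slices


def per_thread_shapes(m, k, n, n_threads):
    """Shapes each thread multiplies for [m, k] x [k, n]."""
    if m > 1:   # GEMM: split the m rows of the left operand
        return [((hi - lo, k), (k, n)) for lo, hi in split_range(m, n_threads)]
    # GEMV (m == 1): split the n columns of the right operand
    return [((1, k), (k, hi - lo)) for lo, hi in split_range(n, n_threads)]


print(per_thread_shapes(512, 4096, 4096, 8)[0])   # prefill-like GEMM
print(per_thread_shapes(1, 4096, 4096, 8)[0])     # decode-like GEMV
```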

Math

Ideal time with T threads: time = time_single / T (a speedup of T)

Realistic time with T threads: time = (time_single / T) + sync_overhead(T)

For memory-bound work: time ≥ bytes_moved / peak_bandwidth, regardless of T.

Adding threads past the bandwidth saturation point only increases sync_overhead without reducing the memory wait.
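A small numerical model makes the saturation point visible. Every constant here is an assumption chosen for illustration: 4 GB moved per step, the 50 GB/s peak from the toy machine, 15 GB/s achievable by a single thread, and a sync cost that grows linearly with the thread count. Real sync overhead need not be linear.

```python
def decode_step_time(threads,
                     bytes_moved=4e9,        # assumed bytes read per step
                     peak_bw=50e9,           # machine peak bandwidth, bytes/s
                     per_thread_bw=15e9,     # assumed bandwidth one thread can pull
                     sync_per_thread=5e-4):  # assumed sync cost per thread, seconds
    """Modelled time for one memory-bound step with `threads` threads."""
    effective_bw = min(threads * per_thread_bw, peak_bw)   # bandwidth stops scaling at peak
    return bytes_moved / effective_bw + sync_per_thread * threads


floor = 4e9 / 50e9   # bytes_moved / peak_bandwidth: no thread count can beat this
best = min(range(1, 17), key=decode_step_time)
for t in (1, 2, 4, 8, 16):
    print(t, round(decode_step_time(t), 4))
print("best thread count under these assumptions:", best)
```

Under these assumptions the memory term stops shrinking once four threads reach the 50 GB/s ceiling, and every thread after that only adds sync cost.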

Why More Threads Can Be Slower

This is counterintuitive but critical: for memory-bound operations, adding threads past a certain point makes things slower. Why?

Memory bandwidth is shared. All threads share the same memory bus. If 4 threads already saturate the bus, threads 5-16 just compete for the same bandwidth while adding synchronization overhead.
Thread synchronization costs. At the end of each matrix multiply, threads must synchronize (barrier). More threads means more participants at every barrier, and a barrier completes only when the slowest thread arrives, so the wait grows with the thread count.
Cache thrashing. Each thread works on a different portion of the weight matrix. More threads means more concurrent memory regions competing for L2/L3 cache space.

The practical rule: for decode (memory-bound), start at n_physical_cores / 2 and benchmark. For prefill (compute-bound), start at n_physical_cores and benchmark. The optimal value has to be measured on the target machine; it cannot be derived from theory alone.
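The starting points in the rule above can be written down as a tiny helper that also proposes a few neighbouring values to benchmark. The function names and the choice of "half and double" as neighbours are assumptions for illustration. The physical core count is passed in explicitly, because os.cpu_count() reports logical cores.

```python
def starting_threads(phase, n_physical_cores):
    """Starting thread count from the rule of thumb; not a final answer."""
    if phase == "prefill":                     # compute-bound: use every physical core
        return n_physical_cores
    if phase == "decode":                      # memory-bound: start at half
        return max(1, n_physical_cores // 2)
    raise ValueError(f"unknown phase: {phase!r}")


def sweep_candidates(phase, n_physical_cores):
    """Thread counts worth benchmarking around the starting point."""
    start = starting_threads(phase, n_physical_cores)
    return sorted({max(1, start // 2), start, min(n_physical_cores, start * 2)})


print(sweep_candidates("decode", 8))    # candidates for the toy 8-core machine
print(sweep_candidates("prefill", 8))
```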

Implementation Hook

In llama.cpp, the -t flag sets the thread count for computation. The backend threadpool is created at startup and threads are dispatched per operation. Thread affinity can be set via environment variables or OS-level tools like taskset on Linux.

ggml/src/ggml-cpu/ggml-cpu.c — ggml_graph_compute() (L3231)
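To make the flags concrete, this sketch builds (without running) a Linux command line that pins the process with taskset -c and passes thread counts to llama.cpp. The binary name llama-cli, the model path, and the helper name are assumptions. -t sets the generation thread count and -tb sets the thread count for batch and prompt processing.

```python
import shlex


def llama_command(model_path, cores, decode_threads, prefill_threads,
                  binary="llama-cli"):          # binary name is an assumption
    """Build an argv list: taskset pins the process, -t / -tb set thread counts."""
    core_list = ",".join(str(c) for c in cores)
    return ["taskset", "-c", core_list,         # restrict to the listed CPUs
            binary, "-m", model_path,
            "-t", str(decode_threads),          # threads for generation (decode)
            "-tb", str(prefill_threads)]        # threads for prompt processing (prefill)


cmd = llama_command("model.gguf", range(8), decode_threads=4, prefill_threads=8)
print(shlex.join(cmd))
```

Which CPU ids map to distinct physical cores varies by machine, so check the topology (for example with lscpu) before choosing the core list.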

Performance Hook

The best practice is to benchmark with different thread counts rather than assuming "more is better." Start with the number of physical cores, then try fewer. On big.LITTLE architectures (common in ARM), restricting threads to performance cores and avoiding efficiency cores often helps. Separate thread counts for prefill and decode (in llama.cpp, -tb for prompt processing and -t for generation) can yield the best overall throughput.

Check Yourself

Q1 (reasoning)

Why can the best thread count differ between prefill and decode?

Prefill uses different CPU instructions than decode

Prefill is compute-bound and benefits from more threads doing arithmetic; decode is memory-bound and saturates bandwidth with fewer threads, so extra threads just add synchronization cost

Decode requires threads to be on different machines

Q2 (conceptual)

Why does using SMT (Hyper-Threading) threads often not help memory-bound workloads?

SMT threads run on separate physical cores with separate caches

SMT threads share the same physical core and memory port, so they compete for the same bandwidth instead of adding more

SMT threads cannot execute SIMD instructions