Threads, Affinity, and Scheduling
Inference workloads can be split across multiple CPU threads. Each thread processes a portion of the matrix multiply, attention computation, or FFN. More threads means more parallelism — up to a point.
The ideal thread count depends on the hardware and the workload. A machine has a fixed number of physical cores, each with its own execution resources and private L1 cache (L2/L3 may be private or shared depending on the CPU). Many machines also support SMT (Simultaneous Multi-Threading, called Hyper-Threading on Intel), which exposes two logical cores per physical core. For memory-bound workloads, using SMT threads rarely helps: both logical cores share the same execution units, caches, and path to memory, so the extra threads only compete for bandwidth that is already the bottleneck.
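To make the physical/logical distinction concrete, here is a minimal sketch that counts physical cores on Linux by grouping logical CPUs that report the same SMT sibling list. It assumes the sysfs topology files are present; the helper names (parse_cpu_list, group_physical_cores, read_sibling_lists) are made up for this illustration, and other operating systems need a different mechanism.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,8" or "0-3,8-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def group_physical_cores(sibling_lists):
    """Collapse per-CPU sibling lists into one frozenset per physical core."""
    return {frozenset(parse_cpu_list(s)) for s in sibling_lists}


def read_sibling_lists():
    """Read SMT sibling lists from Linux sysfs; returns [] if unavailable."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                lists.append(f.read())
        except OSError:
            continue  # skip CPUs whose topology cannot be read
    return lists


if __name__ == "__main__":
    cores = group_physical_cores(read_sibling_lists())
    print("logical CPUs:", os.cpu_count())
    print("physical cores:", len(cores) or "unknown")
```

On a machine with SMT, the logical count is typically twice the physical count. The physical count is the more useful starting point for the thread-count experiments discussed here.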
Thread affinity pins each thread to a specific core. Without affinity, the operating system may migrate threads between cores, evicting warm caches. With affinity, each thread stays on its core, improving cache locality — though data can still be evicted by capacity misses or other work on that core.
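As a rough illustration of pinning, the sketch below restricts the calling thread to a single CPU using Python's os.sched_setaffinity, which is available on Linux. The function name pin_current_thread and the choice of the lowest allowed CPU are arbitrary assumptions for the example. An inference runtime would do the equivalent for each worker thread in native code.

```python
import os


def pin_current_thread(cpu=None):
    """Pin the calling thread to a single CPU.

    Returns the resulting affinity set, or None where the platform does not
    expose os.sched_setaffinity (it is Linux-specific in practice).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs this thread may currently use
    if cpu is None:
        cpu = min(allowed)              # arbitrary choice for the sketch
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})      # pid 0 means "the calling thread"
    return os.sched_getaffinity(0)
```

After the call, the scheduler can no longer migrate the thread, so whatever it has loaded into that core's private caches stays local until something else evicts it.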
Different inference phases benefit from different thread counts. Prefill is compute-bound (large GEMM), so using all physical cores helps. Decode is memory-bound (GEMV), so adding more threads beyond the number needed to saturate memory bandwidth can actually slow things down due to synchronization overhead.
Machine: 8 physical cores, 16 logical cores (SMT). Memory bandwidth: 50 GB/s.
On this machine, the best thread count is 8 for both phases, but for different reasons. Prefill wants the ALUs of every physical core. Decode saturates memory bandwidth early and gains nothing from additional threads.
This is counterintuitive but critical: for memory-bound operations, adding threads past a certain point makes things slower. Why?
- Memory bandwidth is shared. All threads share the same memory bus. If 4 threads already saturate the bus, threads 5-16 just compete for the same bandwidth while adding synchronization overhead.
- Thread synchronization costs. At the end of each matrix multiply, threads must synchronize (barrier). A barrier completes only when the slowest thread arrives, so more threads make every barrier more expensive, and that latency is paid on every operation.
- Cache thrashing. Each thread works on a different portion of the weight matrix. More threads means more concurrent memory regions competing for L2/L3 cache space.
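The first two effects in the list above can be sketched with a toy cost model. Everything except the 50 GB/s figure is an assumption chosen for illustration: a 4 GB weight file, 12.5 GB/s of achievable bandwidth per thread (so 4 threads saturate the bus), and 0.5 ms of synchronization cost per thread per token. Cache thrashing is not modeled, and the numbers are not measurements.

```python
def decode_time_per_token(n_threads,
                          weight_bytes=4e9,       # assumed model size
                          total_bw=50e9,          # 50 GB/s, from the example machine
                          per_thread_bw=12.5e9,   # assumed: 4 threads saturate the bus
                          sync_cost_s=0.5e-3):    # assumed sync cost per thread per token
    """Seconds per decoded token under a simple bandwidth-plus-barrier model."""
    # Each token streams all weights once; bandwidth stops scaling at the bus limit.
    effective_bw = min(n_threads * per_thread_bw, total_bw)
    stream_time = weight_bytes / effective_bw
    # Synchronization cost grows with the number of participating threads.
    sync_time = n_threads * sync_cost_s
    return stream_time + sync_time


def best_thread_count(max_threads=16):
    """Thread count with the lowest modeled time per token."""
    return min(range(1, max_threads + 1), key=decode_time_per_token)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, "threads:", round(1.0 / decode_time_per_token(n), 2), "tokens/s")
    print("best:", best_thread_count())
```

Under these assumptions the streaming term stops shrinking once the bus is saturated, while the synchronization term keeps growing, so the modeled throughput peaks and then declines. Real hardware will peak at a different thread count, which is why measurement matters.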
The practical rule: for decode (memory-bound), start at n_physical_cores / 2 and benchmark. For prefill (compute-bound), start at n_physical_cores and benchmark. The optimum is always found empirically, not derived from theory.
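One way to organize such a benchmark is sketched below. The names starting_points and sweep are hypothetical, and measure stands for any callable you supply that runs the workload with a given thread count and returns throughput in tokens per second. The sketch does not assume any particular benchmarking tool.

```python
def starting_points(n_physical_cores):
    """Initial thread counts suggested by the rule of thumb."""
    return {
        "decode": max(1, n_physical_cores // 2),
        "prefill": n_physical_cores,
    }


def sweep(measure, candidates):
    """Run measure(n) for each candidate and return (best_n, all_results).

    measure is user-supplied and must return a throughput (higher is better).
    """
    results = {n: measure(n) for n in candidates}
    best = max(results, key=results.get)
    return best, results
```

A typical sweep starts at the suggested value and tries a few neighbors on either side, for example 2, 4, 6, and 8 threads on an 8-core machine. Repeating each measurement several times helps separate real differences from noise.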
In llama.cpp, the -t flag sets the thread count for computation. The backend threadpool is created at startup and threads are dispatched per operation. Thread affinity can be set via environment variables or OS-level tools like taskset on Linux.
The best practice is to benchmark with different thread counts rather than assuming "more is better." Start with the number of physical cores, then try fewer. On big.LITTLE architectures (common on ARM), restricting threads to performance cores and avoiding efficiency cores often helps. Setting separate thread counts for decode and prefill (-t for generation and -tb for prompt and batch processing, in builds that support it) can yield the best overall throughput.
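The pieces above can be combined into a single launch command. The sketch below only assembles an argument list and does not run anything. The binary path ./llama-cli and the model filename are placeholders, and -m is assumed to be the model-path flag. taskset -c restricts the process to the listed CPUs on Linux. Check the help output of your build before relying on any of these flags.

```python
import shlex


def build_command(binary, model_path, decode_threads, prefill_threads, cpu_list=None):
    """Assemble an argv list for a llama.cpp-style binary, optionally under taskset."""
    cmd = []
    if cpu_list is not None:
        # Restrict the whole process to these CPUs (Linux only).
        cmd += ["taskset", "-c", cpu_list]
    cmd += [
        binary,
        "-m", model_path,             # assumed model-path flag
        "-t", str(decode_threads),    # threads for generation (decode)
        "-tb", str(prefill_threads),  # threads for prompt/batch processing, if supported
    ]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; CPUs 0-7 are assumed to be one logical CPU per physical core.
    print(shlex.join(build_command("./llama-cli", "model.gguf", 4, 8, cpu_list="0-7")))
```

Which logical CPU numbers map to distinct physical cores (or to performance cores on a hybrid chip) varies by machine, so the CPU list should come from inspecting the topology rather than from a fixed range.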
Why can the best thread count differ between prefill and decode?
Why does using SMT (Hyper-Threading) threads often not help memory-bound workloads?