Repack and Layout Matter as Much as ISA Tricks
When a CPU or GPU loads data from memory, it does not fetch one number at a time. It loads cache lines — contiguous blocks of 64 or 128 bytes. If the data your kernel needs is laid out contiguously in memory, one cache-line fetch gives you many useful values. If the data is scattered, each fetch brings mostly useless bytes, and you waste bandwidth.
Weight matrices in a model file are stored in whatever order the training framework produced. This layout may not match the access pattern of the fastest inference kernel. Repacking means rearranging the same weight values into a different memory order that matches the kernel's access pattern — typically tiled or interleaved so that SIMD instructions can load full vectors of useful data in each memory access.
The model's numerical values do not change at all. The same weights, the same model, and the same answers (up to floating-point rounding if the kernel happens to sum in a different order), just faster, because the kernel wastes fewer memory accesses.
A 4×4 weight matrix, two layouts:
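One way to see the two layouts is to flatten the same 4×4 matrix both ways. The sketch below is illustrative only: the 2×2 tile size and the helper names `row_major_order` and `tiled_order` are assumptions made for this example, not the layout of any particular kernel.

```python
# Weights are labelled w<row><col> so each value's origin stays visible.
ROWS, COLS, TILE = 4, 4, 2  # TILE = 2 is an illustrative choice

matrix = [[f"w{r}{c}" for c in range(COLS)] for r in range(ROWS)]


def row_major_order(m):
    """Flatten row by row, a common default storage order."""
    return [value for row in m for value in row]


def tiled_order(m, tile):
    """Flatten tile by tile so each tile x tile block is contiguous."""
    out = []
    for r0 in range(0, len(m), tile):
        for c0 in range(0, len(m[0]), tile):
            for r in range(r0, r0 + tile):
                for c in range(c0, c0 + tile):
                    out.append(m[r][c])
    return out


row_major = row_major_order(matrix)
tiled = tiled_order(matrix, TILE)

print("row-major:", " ".join(row_major))
print("tiled 2x2:", " ".join(tiled))
```

Both lists hold the same 16 values. Only the order in memory differs, which is the whole point of a repack.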
Two hardware facts explain why layout matters more than you might expect:
1. Memory loads come in cache lines (64-128 bytes)
The CPU does not fetch individual floats from RAM. It fetches entire cache lines — typically 64 bytes (32 FP16 values) or 128 bytes. If the next 32 values the kernel needs are contiguous in memory, one cache line load gets all of them. If they are scattered (e.g., every 4096th element because you are traversing a column of a row-major matrix), each value requires its own cache line load, wasting 31/32 of the bandwidth.
2. SIMD registers process multiple values at once
Modern CPUs have wide SIMD registers (256-bit AVX2 can process 8 FP32 values or 16 INT16 values simultaneously; 512-bit AVX-512 doubles that). For quantized inference, the kernel loads blocks of quantized weights into SIMD registers and processes multiple values per instruction. A plain vector load, however, fills the register from contiguous memory. If the weight layout does not match the access pattern, the CPU must gather values from scattered addresses into the register, which is much slower than a single aligned load.
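The arithmetic behind both facts can be checked with a few lines. This is a back-of-the-envelope sketch, not a measurement: the 4096-element row length, the line-aligned base address, and the helper names `cache_lines_touched` and `simd_lanes` are assumptions for illustration.

```python
CACHE_LINE_BYTES = 64  # typical CPU cache line size used in the text
ELEM_BYTES = 2         # FP16
ROW_LENGTH = 4096      # elements per row of a row-major matrix (assumed)


def cache_lines_touched(element_indices, elem_bytes=ELEM_BYTES,
                        line_bytes=CACHE_LINE_BYTES):
    """Count distinct cache lines covering the given element indices.

    Assumes the array starts on a cache-line boundary.
    """
    return len({(i * elem_bytes) // line_bytes for i in element_indices})


def simd_lanes(register_bits, element_bits):
    """Number of values of the given width that fit in one SIMD register."""
    return register_bits // element_bits


n = 32
along_row = range(n)                                # 32 consecutive elements
down_column = range(0, n * ROW_LENGTH, ROW_LENGTH)  # every 4096th element

row_lines = cache_lines_touched(along_row)
col_lines = cache_lines_touched(down_column)
useful_fraction = (n * ELEM_BYTES) / (col_lines * CACHE_LINE_BYTES)

print("cache lines, contiguous:", row_lines)        # 1
print("cache lines, strided:", col_lines)           # 32
print("useful bytes when strided:", useful_fraction)  # 0.03125, i.e. 1/32
print("AVX2 FP32 lanes:", simd_lanes(256, 32))      # 8
print("AVX2 INT16 lanes:", simd_lanes(256, 16))     # 16
print("AVX-512 FP32 lanes:", simd_lanes(512, 32))   # 16
```

The strided walk fetches 32 lines to use 64 bytes, which is where the 31/32 wasted bandwidth figure comes from.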
Repacking arranges the weight matrix so that the values the kernel needs next are always contiguous. This turns scattered gathers into sequential loads, and in favorable cases can double or triple the effective memory bandwidth utilization.
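A minimal sketch of the idea, assuming a kernel that produces four output rows at a time and therefore wants the four row values for each column stored side by side. The interleaving scheme, the block size of 4, and all function names here are hypothetical; real kernels choose layouts specific to their instruction set and quantization format.

```python
def repack_interleaved(w, rows_per_block):
    """Return a flat list: for each block of rows, for each column,
    the values of that column from every row in the block."""
    n_rows, n_cols = len(w), len(w[0])
    assert n_rows % rows_per_block == 0, "sketch assumes no ragged last block"
    packed = []
    for r0 in range(0, n_rows, rows_per_block):
        for c in range(n_cols):
            for r in range(r0, r0 + rows_per_block):
                packed.append(w[r][c])
    return packed


def matvec_reference(w, x):
    """Plain matrix-vector product on the original layout."""
    return [sum(wv * xv for wv, xv in zip(row, x)) for row in w]


def matvec_packed(packed, x, n_rows, rows_per_block):
    """Same product, but reading `packed` strictly front to back."""
    y = [0] * n_rows
    pos = 0
    for r0 in range(0, n_rows, rows_per_block):
        for xv in x:
            # These rows_per_block values are adjacent in memory,
            # so a SIMD kernel could fetch them with one vector load.
            for lane in range(rows_per_block):
                y[r0 + lane] += packed[pos] * xv
                pos += 1
    return y


# Integer weights keep the comparison exact.
W = [[r * 10 + c for c in range(6)] for r in range(8)]
x = [1, -2, 3, -4, 5, -6]

packed = repack_interleaved(W, rows_per_block=4)
y_ref = matvec_reference(W, x)
y_packed = matvec_packed(packed, x, n_rows=len(W), rows_per_block=4)

print(y_ref == y_packed)  # True
```

The packed kernel never jumps around in memory: its read position only ever advances by one, yet the result matches the reference product.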
There is no new formula. The same matrix multiply math applies:
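For reference, the product and its repacked form can be written as follows. The symbols are assumptions of this sketch: $W$ is an $m \times n$ weight matrix, $x$ the input vector, $y$ the output, $\tilde{w}$ the repacked flat buffer, and $\pi$ the one-to-one index map the repack applies (for plain row-major storage, $\pi(i, j) = i\,n + j$).

```latex
y_i \;=\; \sum_{j=0}^{n-1} W_{ij}\, x_j
    \;=\; \sum_{j=0}^{n-1} \tilde{w}_{\pi(i,j)}\, x_j,
\qquad \tilde{w}_{\pi(i,j)} = W_{ij}, \quad 0 \le i < m .
```

Because $\pi$ only relabels where each $W_{ij}$ lives, every term of the sum is unchanged; the repack affects the order of memory reads, not the values being multiplied.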
In llama.cpp, the CPU backend can repack weight tensors at model load time for the quantization formats it supports. The repack logic transforms weights into layouts optimized for specific SIMD instruction sets (e.g., ARM NEON, x86 AVX). The kernel then reads the repacked weights in sequential memory order, maximizing cache line utilization.
Repacking can deliver double-digit percentage speedups on memory-bound operations without changing any model values. The repack happens once during model loading (adding a few seconds of startup time), then every subsequent inference call benefits. It is one of the highest-leverage optimizations because it costs nothing at inference time and requires no approximation.
Why can a layout change speed up inference without changing the model file?
When does the repack cost get paid?