L36

Repack and Layout Matter as Much as ISA Tricks

16 min

Question

Why does memory layout matter?

Intuition

When a CPU or GPU loads data from memory, it does not fetch one number at a time. It loads cache lines — contiguous blocks of 64 or 128 bytes. If the data your kernel needs is laid out contiguously in memory, one cache-line fetch gives you many useful values. If the data is scattered, each fetch brings mostly useless bytes, and you waste bandwidth.

Weight matrices in a model file are stored in whatever order the training framework produced. This layout may not match the access pattern of the fastest inference kernel. Repacking means rearranging the same weight values into a different memory order that matches the kernel's access pattern — typically tiled or interleaved so that SIMD instructions can load full vectors of useful data in each memory access.

The model's numerical values do not change at all. The same weights, the same model, the same answers — just faster, because the kernel wastes fewer memory accesses.

Toy Example

A 4×4 weight matrix, two layouts:

Row-major (as stored in file):

[a00 a01 a02 a03] [a10 a11 a12 a13] [a20 a21 a22 a23] [a30 a31 a32 a33]

Repacked in 2×2 tiles (for a kernel that processes 2 rows at a time):

[a00 a01 a10 a11] [a02 a03 a12 a13] [a20 a21 a30 a31] [a22 a23 a32 a33]

Effect:

One cache-line load now gives the kernel exactly the values it needs for a 2-row tile, instead of loading two separate rows from different memory locations.
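The toy repack above can be sketched in a few lines of Python. This is only an illustration of the reordering. The function name `repack_tiles` and its parameters are made up for this example and do not correspond to any real library.

```python
def repack_tiles(flat, rows, cols, tile_r, tile_c):
    """Reorder a row-major matrix (given as a flat list) into tile-major order.

    Tiles are visited left to right, then top to bottom. Inside each tile
    the elements keep their row-major order.
    """
    assert rows % tile_r == 0 and cols % tile_c == 0
    out = []
    for tr in range(0, rows, tile_r):        # top row of the current tile
        for tc in range(0, cols, tile_c):    # left column of the current tile
            for r in range(tr, tr + tile_r):
                for c in range(tc, tc + tile_c):
                    out.append(flat[r * cols + c])
    return out


# The 4x4 matrix from the toy example, stored row-major.
row_major = [f"a{r}{c}" for r in range(4) for c in range(4)]
tiled = repack_tiles(row_major, 4, 4, 2, 2)

print(tiled[:4])  # ['a00', 'a01', 'a10', 'a11']
```

The output contains exactly the same 16 values as the input; only their order in memory differs.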

Why Layout Matters: Cache Lines and SIMD

Two hardware facts explain why layout matters more than you might expect:

1. Memory loads come in cache lines (64 or 128 bytes)

The CPU does not fetch individual floats from RAM. It fetches entire cache lines — typically 64 bytes (32 FP16 values) or 128 bytes. If the next 32 values the kernel needs are contiguous in memory, one cache line load gets all of them. If they are scattered (e.g., every 4096th element because you are traversing a column of a row-major matrix), each value requires its own cache line load, wasting 31/32 of the bandwidth.

2. SIMD registers process multiple values at once

Modern CPUs have wide SIMD registers: a 256-bit AVX2 register holds 8 FP32 values or 16 INT16 values, and a 512-bit AVX-512 register holds twice as many. For quantized inference, the kernel loads blocks of quantized weights into SIMD registers and processes multiple values per instruction. A plain vector load, however, fills the register from contiguous memory. If the weight layout does not match the access pattern, the CPU must instead gather values from scattered addresses (using gather instructions or several scalar loads and shuffles), which is much slower than a single aligned load.

Repacking arranges the weight matrix so that the values the kernel needs next are always contiguous. This turns scattered gathers into sequential loads and can substantially raise effective memory bandwidth utilization; how much depends on the kernel, the data type, and the hardware.
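The 31/32 figure from the cache-line argument can be checked with a small model. This sketch assumes 2-byte FP16 elements, 64-byte cache lines, an array that starts on a cache-line boundary, and a hypothetical row length of 4096. The helper names are made up for the example, and it counts lines only; it does not model a real cache.

```python
def cache_lines_touched(indices, elem_bytes=2, line_bytes=64):
    """Count the distinct cache lines covered by the given element indices."""
    return len({(i * elem_bytes) // line_bytes for i in indices})


def utilization(indices, elem_bytes=2, line_bytes=64):
    """Fraction of the fetched bytes that the kernel actually uses."""
    useful = len(indices) * elem_bytes
    fetched = cache_lines_touched(indices, elem_bytes, line_bytes) * line_bytes
    return useful / fetched


N = 4096                          # assumed row length of a row-major FP16 matrix
row_walk = range(32)              # 32 consecutive elements of one row
col_walk = range(0, 32 * N, N)    # 32 elements down one column

print(cache_lines_touched(row_walk), utilization(row_walk))  # 1 1.0
print(cache_lines_touched(col_walk), utilization(col_walk))  # 32 0.03125
```

Walking the column touches 32 cache lines to use 64 bytes, so 31/32 of the fetched data is wasted, while the contiguous walk uses every byte of its single line.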

Shapes

Logical shape is unchanged: [d_in, d_out]

Physical layout changes: row-major → tiled, interleaved, or block-transposed

The repack is a one-time cost at model load. After that, every inference step benefits from better memory access patterns.
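One way to picture "same logical shape, different physical layout" is as two address functions over the same logical index (i, j). The sketch below assumes 2×2 tiles and tile-major ordering; `row_major_offset` and `tiled_offset` are illustrative names, not functions from any particular library.

```python
def row_major_offset(i, j, cols):
    """Physical position of logical element (i, j) in a row-major layout."""
    return i * cols + j


def tiled_offset(i, j, cols, tile_r=2, tile_c=2):
    """Physical position of logical element (i, j) in a tile-major layout."""
    tiles_per_row = cols // tile_c
    tile_index = (i // tile_r) * tiles_per_row + (j // tile_c)
    within_tile = (i % tile_r) * tile_c + (j % tile_c)
    return tile_index * (tile_r * tile_c) + within_tile


rows, cols = 4, 4
for i in range(rows):
    print([tiled_offset(i, j, cols) for j in range(cols)])
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]
```

Every logical index still maps to exactly one physical slot, so the layout is a permutation of the row-major one and the logical shape is untouched.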

Math

There is no new formula. The same matrix multiply math applies:

C[i,j] = ∑ₖ A[i,k] × B[k,j]

The kernel computes the same result. The difference is how B[k,j] values are arranged in physical memory — tiled layouts align with SIMD register widths so each load fills the register with useful values.
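To make the "same math, different layout" point concrete, here is a minimal sketch that multiplies with B stored two ways: row-major, and transposed so that the sum over k walks contiguous memory. A plain transpose is used only because it is the simplest layout change to show; it is not claimed to be the layout any particular kernel uses. Sizes and values are arbitrary, and integers are used so the comparison is exact.

```python
import random


def matmul_b_row_major(A, B, n, inner, m):
    # B[k, j] lives at B[k * m + j]; the k loop strides by m elements.
    return [[sum(A[i * inner + k] * B[k * m + j] for k in range(inner))
             for j in range(m)] for i in range(n)]


def matmul_b_transposed(A, Bt, n, inner, m):
    # B[k, j] lives at Bt[j * inner + k]; the k loop walks contiguous memory.
    return [[sum(A[i * inner + k] * Bt[j * inner + k] for k in range(inner))
             for j in range(m)] for i in range(n)]


random.seed(0)
n, inner, m = 3, 4, 5
A = [random.randint(-8, 7) for _ in range(n * inner)]
B = [random.randint(-8, 7) for _ in range(inner * m)]

# One-time repack: same values, different physical order.
Bt = [B[k * m + j] for j in range(m) for k in range(inner)]

C_row = matmul_b_row_major(A, B, n, inner, m)
C_t = matmul_b_transposed(A, Bt, n, inner, m)
assert C_row == C_t  # identical results from both layouts
```

Only the index expression used to read B changes between the two functions; the formula for C[i,j] is the same.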

Implementation Hook

In llama.cpp, the CPU backend repacks weight tensors at model load time. The repack logic transforms weights into layouts optimized for specific SIMD instruction sets (e.g., ARM NEON, x86 AVX). The kernel then reads the repacked weights in sequential memory order, maximizing cache line utilization.

ggml/src/ggml-cpu/repack.cpp

Performance Hook

Repacking can deliver double-digit percentage speedups on memory-bound operations without changing any model values. The repack happens once during model loading (adding a few seconds of startup time), then every subsequent inference call benefits. It is one of the highest-leverage optimizations because it costs nothing at inference time and requires no approximation.

Check Yourself

Q1 (reasoning)

Why can a layout change speed up inference without changing the model file?

A. The repack changes weight values to be smaller
B. The repack rearranges the same values in memory so the kernel can load them more efficiently, reducing wasted memory bandwidth
C. The repack skips certain weight rows that are close to zero

Q2 (conceptual)

When does the repack cost get paid?

A. Once during each decode step
B. Once at model load time, before inference begins
C. It is free: there is no repack cost