No, it's faster because the working set of 64 * 64 * 4 * 2 bytes = 32 KiB can (almost) fit in a CPU core's L1 data cache. Further cache levels are slower, and main memory is glacially slow.

The WASM example would speed up as well using the same approach, as would C, Rust or whatever.
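For illustration, here's a minimal sketch of the technique in C, assuming the hot loop is something like a transpose of 32-bit pixels (the operation, names and bounds are illustrative; only the 64 * 64 * 4 * 2 working-set figure comes from the numbers above):

    #include <stdint.h>
    #include <stddef.h>

    #define TILE 64

    /* Naive version: dst is written row by row, but src is read with a
     * stride of w elements, so nearly every read touches a new cache line. */
    void transpose_naive(uint32_t *dst, const uint32_t *src, size_t w, size_t h) {
        for (size_t y = 0; y < h; y++)
            for (size_t x = 0; x < w; x++)
                dst[x * h + y] = src[y * w + x];
    }

    /* Tiled version: the same loops, restricted to 64x64 blocks.
     * One src tile plus one dst tile is 64 * 64 * 4 * 2 bytes = 32 KiB,
     * which (almost) fits in a typical L1 data cache. */
    void transpose_tiled(uint32_t *dst, const uint32_t *src, size_t w, size_t h) {
        for (size_t ty = 0; ty < h; ty += TILE)
            for (size_t tx = 0; tx < w; tx += TILE)
                for (size_t y = ty; y < ty + TILE && y < h; y++)
                    for (size_t x = tx; x < tx + TILE && x < w; x++)
                        dst[x * h + y] = src[y * w + x];
    }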




To add background, this is a standard optimization technique (loop tiling, also called loop blocking) that has been employed in e.g. Fortran compilers since at least the 1980s.


Doesn't this rely on the CPU prefetching the memory into cache? Do current CPUs from Intel and AMD detect access patterns like this successfully, i.e. where you're accessing 64-element slices from a bigger array with a specific stride?


The idea is that the Y dimension is going to have a limited number (here 64) of hot cache lines while a tile is processed. After going through one set of 64 vertical cache lines, the Y accesses are going to land near the Y accesses from the previous outer-tile-loop iteration.

(Stride-detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup.)
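To put numbers on it (assuming 4-byte pixels and 64-byte cache lines, typical figures not stated above):

    hot lines on the strided side: 64 rows * 1 cache line = 64 lines = 4 KiB
    pixels per cache line: 64 / 4 = 16
    so each set of 64 vertical cache lines is reused for 16 consecutive
    x positions before a fresh set has to be pulled in

That easily fits in L1 alongside the row-major side of the tile, which is why the strided accesses stop being misses after the first pass.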

BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality-preserving clumps: https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...
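A minimal sketch of one such layout, Morton (Z-order) swizzling, which interleaves the bits of x and y so that screen-adjacent pixels stay close in memory (16-bit coordinates assumed here; real GPU formats vary, see the linked post):

    #include <stdint.h>

    /* Spread the low 16 bits of v out to the even bit positions. */
    static uint32_t part1by1(uint32_t v) {
        v &= 0x0000ffff;
        v = (v | (v << 8)) & 0x00ff00ff;
        v = (v | (v << 4)) & 0x0f0f0f0f;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    /* Linear index of pixel (x, y) in a Morton-swizzled image:
     * x bits in the even positions, y bits in the odd positions. */
    static uint32_t morton_index(uint32_t x, uint32_t y) {
        return part1by1(x) | (part1by1(y) << 1);
    }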



