Doesn't this rely on the CPU prefetching the memory to cache? Do current CPUs fr...

fulafel · on Feb 15, 2019

The idea is that the Y dimension is going to have a limited nr (here 64) of hot cache lines while a tile is processed. After going through one set of 64 vertical lines, the Y accesses are going to be near the Y accesses from the previous outer-tile-loop iteration.

(Stride detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup).

BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality preserving clumps. https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...