Hacker News new | past | comments | ask | show | jobs | submit login

Doesn't this rely on the CPU prefetching the memory to cache? Do current CPUs from Intel&AMD detect access patterns like this successfully? I.e. where you're accessing 64-element slices from a bigger array with a specific stride.



The idea is that the Y dimension is going to have a limited nr (here 64) of hot cache lines while a tile is processed. After going through one set of 64 vertical lines, the Y accesses are going to be near the Y accesses from the previous outer-tile-loop iteration.

(Stride detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup).

BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality preserving clumps. https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...




Consider applying for YC's first-ever Fall batch! Applications are open till Aug 27.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: