Interesting post! I wonder if the author experimented with different gcc command-line optimisation options as well? gcc might be able to insert some prefetches by itself with some settings?
Thanks! I tried -fprefetch-loop-arrays before figuring out that it only prefetches the array memory itself (i.e., addresses like &ob_item[i]) rather than the memory the array elements point to (i.e., addresses like ob_item[i]). Empirically, adding -fprefetch-loop-arrays got me no speedup.
I didn't try any other GCC command line arguments because I honestly didn't know what other relevant options there were.
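To make the &ob_item[i] vs. ob_item[i] distinction concrete, here is a minimal sketch in C. It is not the code from the post: obj_t, the two function names, and the dist parameter are made up for illustration, and obj_t is only a stand-in for a real Python object. The only real API used is the GCC/Clang builtin __builtin_prefetch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap-allocated object (not CPython's PyObject). */
typedef struct {
    long refcnt;
    long value;
} obj_t;

/* Prefetches the pointer slot &ob_item[i + dist]. Per the reply above, this
   is the kind of address -fprefetch-loop-arrays covers. The slots are
   contiguous, so this does little for the objects they point at. */
long sum_prefetch_slots(obj_t **ob_item, size_t n, size_t dist)
{
    assert(ob_item != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&ob_item[i + dist]);
        total += ob_item[i]->value; /* dereference may still miss the cache */
    }
    return total;
}

/* Prefetches the object itself, ob_item[i + dist], which may live anywhere
   on the heap. This is the indirect access a manual prefetch has to target. */
long sum_prefetch_objects(obj_t **ob_item, size_t n, size_t dist)
{
    assert(ob_item != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(ob_item[i + dist]);
        total += ob_item[i]->value;
    }
    return total;
}
```

Both functions return the same sum; the only difference is which address gets the prefetch hint. Whether the second one is actually faster depends on the heap layout, the prefetch distance, and the hardware, so it would need measuring.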