
A prefetcher reduces cache misses by retrieving the data first; the number of memory accesses needed to get the data remains the same. [Though they may happen through a more efficient pipelined operation.]

How does a prefetcher reduce unneeded cache evictions? It fetches data that may not be needed, increasing cache evictions.

The only way a prefetcher can improve power efficiency, IMO, is by increasing utilization: if the processor stays 99% busy instead of 97% busy, the fixed portion of the power budget is amortized over more operations.
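For concreteness, a minimal sketch of that amortization in C; the power split and throughput figures are entirely made up for illustration:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical numbers: a fixed power floor, a dynamic part that
           scales with utilization, and a peak throughput of 1G ops/s. */
        double fixed_w = 20.0, dynamic_w = 40.0, peak_ops = 1e9;
        for (int pct = 97; pct <= 99; pct += 2) {
            double u = pct / 100.0;
            double j_per_op = (fixed_w + dynamic_w * u) / (peak_ops * u);
            printf("%d%% busy: %.2f nJ/op\n", pct, j_per_op * 1e9);
        }
        return 0;
    }

The gain is small but real: the fixed watts get divided across more retired operations.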




> A prefetcher reduces cache misses by retrieving the data first

I'm struggling with that. To avoid a cache miss, the prefetcher would have to prefetch the data from memory so early, and memory is so slow, that an entire train of in-flight instructions could complete, and the CPU would then likely pause, before the fetch-from-RAM completed.

I may be misunderstanding you though.


The whole point of a prefetcher is to avoid cache misses. What do you think a prefetcher is for?

Memory is very latent, and moderately slow. Hardware prefetching mostly tends to extend memory accesses, so you don't pay the RAS/CAS penalties again and instead just stream more memory words back to fill an additional cache line beyond the one demanded.

E.g., one key part of Intel L2 hardware prefetching works on cache line pairs: if you miss on an even-numbered cache line, and the counters support it, it decides to also retrieve the adjacent odd-numbered cache line at the same time.
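A hedged sketch of what that pairing amounts to (this mirrors the description above, not Intel's actual logic): the "buddy" of a 64-byte line is the line with bit 6 of the address flipped.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64u

    /* Align the address down to its cache line, then flip bit 6: that maps
       an even-numbered line to its odd-numbered partner and vice versa. */
    uint64_t buddy_line(uint64_t addr) {
        uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1);
        return line ^ LINE_BYTES;
    }

    int main(void) {
        printf("miss at 0x%llx -> also fetch 0x%llx\n",
               (unsigned long long)0x1040,
               (unsigned long long)buddy_line(0x1040));
        return 0;
    }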

The downsides of a hardware prefetcher are threefold, all stemming from it perhaps retrieving memory that isn't needed: 1) an errant prefetch can cause useful data to be evicted from cache; 2) it can tie up the memory bus with an unnecessary access and make a necessary one more latent; 3) it can consume power for a fetch that isn't needed.


> What do you think a prefetcher is for?

We may be talking past each other. My understanding is that a prefetcher's job is to get the data ASAP. It typically prefetches from cache, not from memory (if it has to prefetch from RAM it's so slow it's useless).

> Memory is very latent, and moderately slow

Those two terms seem synonymous to me, but memory access is slooow. From the doc in front of me, for Haswell:

L1 hit: 4 or 5 cycles latency;

L2 hit: 12 cycles latency;

L3 hit: 36 to 66 cycles latency;

RAM: 36 cycles + 57 ns / 62 cycles + 100 ns, depending

(found ref, it's https://www.7-cpu.com/cpu/Haswell.html)

For the RAM latencies, those are for different setups (single/dual CPU, I'm not sure) running at ~3.5 GHz. So for 36 cycles + 57 ns, that's 36 + (3.5 GHz × 57 ns) = 36 + ~200 ≈ 235 cycles of latency. And it can get worse. Much worse.

I'm sorry I can't find it now, but in Hennessy & Patterson I recall it being said that prefetchers can hide most of the latency of an L1 or L2 miss if it hits in L3, but if it needs main memory then it's stuffed. Sorry, that's from memory and may be wrong! But I'm pretty sure a prefetcher is wasted if it has to hit RAM.

> If you miss on an even numbered cache line, and counters support it, it decides to also retrieve the next odd-numbered cache line at the same time.

I see, but I think it does better these days: it'll remember a stride and prefetch the next stride ahead. Even striding backwards through memory (cache)! But it won't cross a page boundary.
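To illustrate (a textbook-style sketch, not any real vendor's hardware): a stride detector just compares consecutive miss deltas, and suppresses the prediction at a page boundary.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BYTES 4096u

    typedef struct { uint64_t last; int64_t stride; int confident; } stride_t;

    /* Remember the last miss address; if two consecutive deltas match,
       predict the next address one stride ahead, but never let the
       prediction leave the current 4 KiB page. */
    uint64_t on_miss(stride_t *st, uint64_t addr, int *do_prefetch) {
        int64_t delta = (int64_t)(addr - st->last);
        st->confident = (delta != 0 && delta == st->stride);
        st->stride = delta;
        st->last = addr;
        uint64_t next = addr + (uint64_t)st->stride;
        *do_prefetch = st->confident &&
                       (next / PAGE_BYTES == addr / PAGE_BYTES);
        return next;
    }

    int main(void) {
        stride_t st = {0, 0, 0};
        int go;
        /* Five misses at a constant 128-byte stride: the detector locks
           on after the second matching delta. */
        for (uint64_t a = 0x10000; a < 0x10000 + 5 * 128; a += 128) {
            uint64_t next = on_miss(&st, a, &go);
            printf(go ? "miss 0x%llx -> prefetch 0x%llx\n"
                      : "miss 0x%llx -> no prefetch (0x%llx)\n",
                   (unsigned long long)a, (unsigned long long)next);
        }
        return 0;
    }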

Really need an expert to chime in here, anyone?


For clarity:

https://i.imgur.com/iqJ19W5.png

This is a DDR4 pipelined read timing diagram. The bank select and row select latencies are significant, "off the left of this diagram," and contribute to the latency of random reads.

Further sequential reads (until we hit a boundary) can continue with no interruption of output, and do not pay the latency penalty, because there's no further bank select and row select, and because the prefetch and memory controller logic strobed the new column address before the previous read completed.
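One way to see this effect from software is to compare a streaming pass (prefetch- and burst-friendly) against a dependent pointer chase that pays the random-access latency on every load. A rough C sketch, with made-up sizes; exact numbers will vary by machine:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 24)   /* 16M entries: far larger than any LLC */

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        if (!next) return 1;
        /* Sattolo's algorithm: a single-cycle random permutation, so the
           chase below visits all N entries in one long dependent chain.
           (Assumes RAND_MAX >= N, true on glibc.) */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;   /* note: j < i */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        /* Streaming pass: sequential, prefetch-friendly. */
        double t0 = now();
        size_t sum = 0;
        for (size_t i = 0; i < N; i++) sum += next[i];
        double t1 = now();
        /* Pointer chase: each load depends on the previous one. */
        size_t p = 0;
        double t2 = now();
        for (size_t i = 0; i < N; i++) p = next[p];
        double t3 = now();
        printf("stream: %.1f ns/elem, chase: %.1f ns/elem (sum=%zu p=%zu)\n",
               (t1 - t0) / N * 1e9, (t3 - t2) / N * 1e9, sum, p);
        free(next);
        return 0;
    }

On typical desktop parts the chase lands near the full DRAM latency per element, while the stream is an order of magnitude or more faster.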


> My understanding is that a prefetcher's job is to get the data ASAP. It typically prefetches from cache, not from memory (if it has to prefetch from RAM it's so slow it's useless).

A prefetcher runs at whatever cache level it's attached to (L1/L2) and fetches from whatever level has the data. So the L2 prefetcher may be grabbing from L3, or may be grabbing from SDRAM.

> RAM: 36 cycles + 57 ns / 62 cycles + 100 ns, depending

That's the RAM latency ON RANDOM ACCESS. If you extend an access to fetch the next sequential line because you think it will be used, you don't pay any of the latency penalty: you might need to strobe a column access, but words keep streaming out. For this reason sequential prefetch is particularly powerful. Even if we're just retrieving from a single SDRAM channel, it's only another ~3.3 ns to continue on and retrieve the next line (DDR4-2400 is 2400 MT/s, 8 transfers to fill a 64-byte line, 8 / 2.4e9 s ≈ 3.3 ns).
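Sanity-checking that back-of-envelope number:

    #include <stdio.h>

    int main(void) {
        double transfers_per_sec = 2400e6;     /* DDR4-2400: 2400 MT/s     */
        double beats_per_line = 64.0 / 8.0;    /* 64B line over an 8B bus  */
        printf("%.2f ns to stream one more line\n",
               beats_per_line / transfers_per_sec * 1e9);  /* ~3.33 ns */
        return 0;
    }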

> I see, but I think it does better these days: it'll remember a stride and prefetch the next stride ahead. Even striding backwards through memory (cache)! But it won't cross a page boundary.

Sure, the even/odd access extender is just one very simple prefetcher in modern Intel processors that I included for illustration. And we're completely ignoring software prefetch.
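For completeness, software prefetch looks roughly like this with the GCC/Clang builtin; the distance of 16 is a made-up tuning knob, not a recommendation:

    #include <stddef.h>

    long sum_with_prefetch(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            /* Ask for a[i+16] ahead of time: rw=0 (read), locality=3
               (keep in all cache levels). It's a hint; it may be dropped. */
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 3);
            s += a[i];
        }
        return s;
    }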

Go ahead, do the experiment. Run a memory-heavy workload and look at cache miss rates. Then turn off prefetch and see what you get. Most workloads, you'll get a lot more misses. ;)

https://github.com/deater/uarch-configure/blob/master/intel-...
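If you want to try it, something along these lines should work on Linux with an Intel core. (Hedging: your_workload is a stand-in for whatever you're measuring, and MSR 0x1A4 controls the four core prefetchers on many, not all, recent Intel parts; check Intel's docs for your model before poking it.)

    # count misses with prefetchers on
    perf stat -e cache-references,cache-misses ./your_workload

    # disable the four hardware prefetchers (msr-tools; bits 0-3 of MSR 0x1A4)
    sudo modprobe msr
    sudo wrmsr -a 0x1a4 0xf

    perf stat -e cache-references,cache-misses ./your_workload

    # re-enable
    sudo wrmsr -a 0x1a4 0x0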


All I can do is defer to those who really know their stuff; that's not me.

Please read this thread from last year on exactly this subject: https://news.ycombinator.com/item?id=16172686


That thread is orthogonal, but what's there supports exactly what I'm saying: prefetch improves effective bandwidth to SDRAM at all layers.

A second, successive streamed fetch is basically free from a latency perspective. If you're missing, and have to go to memory, there's a very high chance that L2 is going to prefetch the next line into a stream buffer, and you won't miss to SDRAM next time.

It's reached the point that the stream prefetchers now hint to the memory controller that a queued access is a prefetch, so the memory controller can choose, based on contention, whether or not to service it.

Most of what you're talking about seems to be L1 prefetch; I agree that if an L1 prefetch misses all the way to RAM you are probably screwed. The fancy strategies you mention are mostly L1 prefetch strategies. But L2 has its own prefetcher, and it's there to hide memory latency and increase effective use of memory bandwidth...

While we're talking about it... even the SDRAM itself has a "prefetcher" for burst access ;) though calling it that is kind of an abuse of the term.


> How does a prefetcher reduce unneeded cache evictions? It fetches data that may not be needed, increasing cache evictions.

What I meant to say is that a smarter prefetcher, or rather the entirety of the on-chip logic that works to minimise cache misses, will lower the rate of unneeded evictions.



