
I've always wondered: why do CPU caches only key off the low-order bits of the memory address? By contrast, DDR controllers generally key off a linear hash of most of the address bits.

Is gate count really so tight in L1 that they can't throw a few XOR taps in front of the cache bus? Or is it simply to make cache collisions more predictable?
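
To make the question concrete, here's a toy sketch (not modeled on any real CPU) of the two schemes: plain bit-slice indexing versus a few XOR taps folding higher address bits into the set index.

    /* Toy sketch, not any real CPU's scheme: plain bit-slice indexing
       vs. a few XOR taps mixing higher address bits into the set index. */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6   /* 64-byte cache lines */
    #define SET_BITS  6   /* 64 sets, e.g. a 32 KiB 8-way L1 */

    /* Conventional L1 indexing: the set is just a low-order bit field. */
    static unsigned set_plain(uint64_t addr) {
        return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    }

    /* Hypothetical hashed indexing: XOR the next field up into the index. */
    static unsigned set_hashed(uint64_t addr) {
        uint64_t lo = addr >> LINE_BITS;
        uint64_t hi = addr >> (LINE_BITS + SET_BITS);
        return (lo ^ hi) & ((1u << SET_BITS) - 1);
    }

    int main(void) {
        /* Addresses exactly one index-range stride apart (4 KiB here)
           collide under plain indexing but not under the hashed variant. */
        uint64_t a = 0x10000;
        uint64_t b = a + (1u << (LINE_BITS + SET_BITS));
        printf("plain:  %u vs %u\n", set_plain(a), set_plain(b));
        printf("hashed: %u vs %u\n", set_hashed(a), set_hashed(b));
        return 0;
    }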




Gate count is not the issue - the issue is L1 timing. For clocks at 2 GHz plus, there can be very few stages of logic between flops. Just decoding an address to a one-hot wire (necessary to access the memory bank) takes up a good chunk of that time interval. Reading the bits out and resolving them back to full rail also takes time. Checking for a tag match, more time. Routing the result back to the register file, yet more time. Beyond the L1, say in a shared L3, the time of flight across the chip alone is multiple clock cycles, on top of all the aforementioned penalties and the longer access latency of the much larger shared array.

Relatedly, in the L3 there is a hash function used to distribute different addresses to different regions (slices) of the L3. The cost of doing this is less significant for two reasons: the L3 access latency is already much higher (as elaborated above), and the hash calculation can be done in parallel with other required logic (e.g. in parallel with the L2 access).
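
As a rough illustration only (real L3 slice hashes are undocumented and considerably more involved than this), a slice selector might XOR-fold physical address bits down to a slice id while the L2 lookup is still in flight:

    /* Hypothetical slice hash for illustration; actual hardware hashes
       are undocumented. XOR-fold the line address so that most address
       bits influence the slice id. */
    #include <stdint.h>
    #include <stdio.h>

    #define SLICE_BITS 3   /* e.g. 8 L3 slices */

    static unsigned l3_slice(uint64_t paddr) {
        uint64_t h = paddr >> 6;   /* drop the within-line offset */
        h ^= h >> 17;              /* fold high bits onto low bits */
        h ^= h >> 31;
        return (unsigned)(h & ((1u << SLICE_BITS) - 1));
    }

    int main(void) {
        /* Consecutive lines spread across the slices. */
        for (uint64_t p = 0; p < 8 * 64; p += 64)
            printf("line 0x%03llx -> slice %u\n",
                   (unsigned long long)p, l3_slice(p));
        return 0;
    }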


Thanks, that was a great answer.


The addressing is designed to take advantage of spatial locality - adjacent addresses land in adjacent cache sets, so sequential accesses spread across the whole cache instead of colliding in one set.
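
A minimal demo of that property, assuming 64-byte lines and 64 sets: consecutive cache lines map to consecutive sets, so a streaming scan uses the whole cache.

    /* Minimal demo (assuming 64-byte lines, 64 sets): consecutive lines
       map to consecutive sets under low-order-bit indexing. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        for (uint64_t addr = 0; addr < 4 * 64; addr += 64)
            printf("line at 0x%03llx -> set %llu\n",
                   (unsigned long long)addr,
                   (unsigned long long)((addr >> 6) & 63));
        return 0;
    }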



