
On the other hand, a unified (inclusive) L3 cache helps with maintaining cache coherency, which needs to be handled explicitly in a non-unified design.

I guess a big benefit of the separate caches is that if only half the cores are in use, you can power half of the cache down, saving power and staying within TDP.




A unified L3 is expensive in a number of ways. It is large, which means it is geographically remote, as well as slow (for caches, big == slow). This costs lots of access latency.
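
The big == slow point is easy to see empirically: chase pointers through a randomly permuted buffer and watch the cost per load jump as the working set outgrows each cache level. A minimal C sketch (the sizes are illustrative; real L1/L2/L3 capacities vary by part; build with cc -O2):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        /* Working sets chosen to straddle typical L1/L2/L3 capacities. */
        size_t sizes[] = {16 << 10, 256 << 10, 8 << 20, 64 << 20};
        for (int s = 0; s < 4; s++) {
            size_t n = sizes[s] / sizeof(size_t);
            size_t *next = malloc(n * sizeof(size_t));
            if (!next) return 1;
            for (size_t i = 0; i < n; i++) next[i] = i;
            /* Sattolo shuffle: one big cycle, so the chase visits every
               slot and the hardware prefetcher can't guess the next line. */
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = rand() % i;
                size_t t = next[i]; next[i] = next[j]; next[j] = t;
            }
            const size_t steps = 20 * 1000 * 1000;
            size_t p = 0;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t k = 0; k < steps; k++) p = next[p];
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            /* printing p keeps the compiler from deleting the loop */
            printf("%6zu KiB: %5.1f ns/load (p=%zu)\n",
                   sizes[s] >> 10, ns / steps, p);
            free(next);
        }
        return 0;
    }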

It also has a bandwidth problem. If 64 threads are vying for access, you either build it with few access ports and it gets choked, or you build it with many access ports, which is costly in area, power, & speed.

Two separate peer caches automatically have twice the bandwidth of one similar double-size cache, for the price of NUMA & cache coherency challenges.
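
The coherency price is visible from software, too: two threads writing into the same cache line force that line to ping-pong between their private caches, while the identical work on separate lines runs unimpeded. A sketch with pthreads (assumes the usual 64-byte line; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L

    /* Adjacent slots: 8 bytes apart, same 64-byte line (contended). */
    static volatile long same_line[2] __attribute__((aligned(64)));
    /* Slots 128 bytes apart: different lines (uncontended). */
    static volatile long far_apart[32] __attribute__((aligned(64)));

    static void *bump(void *slot) {
        volatile long *p = slot;
        for (long i = 0; i < ITERS; i++) (*p)++;
        return NULL;
    }

    static double run(volatile long *x, volatile long *y) {
        pthread_t a, b;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, bump, (void *)x);
        pthread_create(&b, NULL, bump, (void *)y);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void) {
        printf("same line:      %.2f s\n", run(&same_line[0], &same_line[1]));
        printf("separate lines: %.2f s\n", run(&far_apart[0], &far_apart[16]));
        return 0;
    }

If the two threads happen to land on cores behind different peer caches rather than one, the line has to make a longer round trip, so the gap in the first case widens further.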

There is no one right answer here. Bandwidth is far more important and coherency much easier in a small L1; as you go down the hierarchy, bandwidth needs shrink and coherency is more expensive.


I remember a rumour about an HPC APU from AMD which would combine 16 Zen cores with a Vega GPU and HBM (High Bandwidth Memory) as an L4 cache. I know an L4 cache would be much slower than an L3 cache, but I'm curious: could HBM as an L4 cache be one of the reasons why they didn't use a unified L3 cache?

Disclaimer: I don't know sh*t about hardware design as you can probably guess from my posting. ;o)


L4 cache is mostly used as embedded memory for the on-die GPU. Last I checked, Intel only included their eDRAM L4 cache on Iris Pro equipped models, since any on-die GPU worth its salt is going to be bandwidth constrained even with a relatively low number of GPU cores.

Same situation with Zen: if they're going to include even a Polaris, it would be highly memory constrained if it had to hit system RAM all the time, so another fat chunk of memory on-die will be necessary to keep it from starving and to keep latency down (as it stands, the RX 480 can pump 256 GB/s).
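
For scale, that 256 GB/s is just the bus math (the figure is for the 8 Gbps GDDR5 variant), versus the dual-channel DDR4 an APU has to share with the CPU cores:

    RX 480:           256-bit bus x 8 Gbps per pin / 8  = 256  GB/s
    DDR4-2400, 2 ch:  2 x 64-bit x 2.4 Gbps per pin / 8 = 38.4 GB/s

So an APU leaning on system RAM gets well under a sixth of what even a midrange discrete card has, before the CPU takes its own cut.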


Yeah, the bandwidth problem is already noticeable with AMD's current APUs, even though their GPUs are small compared to discrete graphics cards. Faster DDR3/4 memory brings noticeable FPS improvements. If they already had HBM, they would run circles around Intel (which they probably already do, unless the competitor is an Iris Pro with eDRAM).

Could the CPU also profit from the HBM memory? The bandwidth is much better than with DDR4 main memory (even with 2 or 4 channels), and I would guess the latency as well, because it would be on the same die?


HBM won't be on-die, but it will be on-package. HBM relies on chip stacking to get the desired throughput in a small surface area. Regardless, the latency and throughput would stomp system DRAM something awful, and if it's a proper L4 cache then the CPU would benefit as well.
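
The width is what the stacking buys. Rough first-generation HBM numbers:

    HBM (gen 1):  1024-bit per stack x 1 Gbps per pin / 8 = 128 GB/s per stack
                  x 4 stacks (as on the Fury X)           = 512 GB/s

A single stack already beats a dual-channel DDR4-2400 setup (~38 GB/s) by better than 3x, and it sits millimeters from the die.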

IBM does something similar (though not for graphics) in recent POWER CPUs with the Centaur memory controller(s): off-chip memory controllers with a bunch of eDRAM acting as an L4 cache (the difference here being that each system has multiple Centaur controllers handling different DIMM slots). They're able to burst to ~96 GB/s to system memory this way; a good amount of on-package HBM would probably yield similar gains :)


Cache coherency might not necessarily be an issue if they treat each cores->L3 pair as a NUMA node. I don't think they are doing that, since we probably would have already heard about it if they were, but AMD has done crazier things before, and they are pretty good at NUMA architectures.
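
For what it's worth, if each pair were exposed as a node, ordinary NUMA-aware code would already know how to keep its memory local. A minimal libnuma sketch (Linux, link with -lnuma; node 0 is just illustrative):

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        printf("%d NUMA node(s) visible\n", numa_num_configured_nodes());

        int node = 0;                 /* illustrative: use the first node */
        numa_run_on_node(node);       /* schedule this thread on that node */
        size_t len = 64 << 20;
        char *buf = numa_alloc_onnode(len, node);  /* memory on that node */
        if (!buf) return 1;
        memset(buf, 0, len);          /* touching it faults pages in locally */
        numa_free(buf, len);
        return 0;
    }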


What do you mean? NUMA nodes are still coherent.


You're right, I don't know what I was thinking.



