Looks like die-to-die latency isn't all that great on EPYC, as expected:

"What does this mean to the end user? The 64 MB L3 on the spec sheet does not really exist. In fact even the 16 MB L3 on a single Zeppelin die consists of two 8 MB L3-caches. There is no cache that truly functions as single, unified L3-cache on the MCM; instead there are eight separate 8 MB L3-caches."

Also:

"AMD's unloaded latency is very competitive under 8 MB, and is a vast improvement over previous AMD server CPUs. Unfortunately, accessing more 8 MB incurs worse latency than a Broadwell core accessing DRAM. Due to the slow L3-cache access, AMD's DRAM access is also the slowest. The importance of unloaded DRAM latency should of course not be exaggerated: in most applications most of the loads are done in the caches. Still, it is bad news for applications with pointer chasing or other latency-sensitive operations."

I was kind of expecting this, but it's still disappointing to see. Looks like if you need a lot of L3, Intel is still the best/only option. Not to say that AMD hasn't made massive improvements, though. And it's also worth noting that while AMD's memory latency is generally worse, its throughput is typically better than Intel's.
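
If anyone wants to reproduce the latency numbers, unloaded latency is normally measured with a dependent pointer chase, where each load's address comes out of the previous load so nothing can be overlapped or prefetched. A minimal sketch (the 16 MB working set and iteration count are my own arbitrary choices; vary the size to step through L1/L2/L3/DRAM):

    /* Dependent pointer-chase latency sketch: each load's address comes
       from the previous load, so misses can't be overlapped. The 16 MB
       working set and iteration count are arbitrary; vary the size to
       step through L1/L2/L3/DRAM. Build with: cc -O2 chase.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (16 * 1024 * 1024 / sizeof(void *))
    #define ITERS 10000000UL

    int main(void) {
        void **buf = malloc(N * sizeof(void *));
        size_t *idx = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++) idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {   /* Fisher-Yates shuffle   */
            size_t j = rand() % (i + 1);       /* defeats the prefetcher */
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++)         /* one big random cycle   */
            buf[idx[i]] = &buf[idx[(i + 1) % N]];

        void **p = buf;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < ITERS; i++)
            p = (void **)*p;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/load (%p)\n", ns / ITERS, (void *)p); /* use p */
        free(buf); free(idx);
        return 0;
    }

If the article is right, on EPYC you'd expect the measured ns/load to jump sharply once the working set exceeds 8 MB.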




This is extremely workload-dependent. If you have a lot of processes and they have good affinity, you don't mind that the L3 is organized like it is on the AMD chip. On the other hand, single threaded workloads suffer, as do applications with lots of threads that move around a lot.
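
If your workload does have that kind of affinity, pinning is cheap to try. A sketch using sched_setaffinity on Linux, assuming cores 0-3 form a single CCX and thus share one 8 MB L3 (the actual core-to-CCX mapping varies by SKU and BIOS, so check lscpu first):

    /* Pin this process to cores 0-3, on the assumption that they form
       one CCX and therefore share one 8 MB L3. The real core-to-CCX
       mapping varies by SKU/BIOS; check lscpu before trusting this. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the cache-sensitive work here ... */
        return 0;
    }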


I'm curious to hear from AMD about the L3 cache latency issue. The article shows L3 access just a few ns quicker than going to DRAM, even to the other CCX on the same die. That's ugly!

This strikes me as either a bug or a benchmarking glitch, though other benchmarks seem to imply that the situation is real.

Assuming it's legit, this gives AMD a great opportunity for a boost in their first respin of the 8C/16T die.


The L3 issue isn't quite that simple. Sure, if your dataset fits in Intel's L3, that's great. The problem is that a single shared L3 (for the same transistor/effort budget) has much lower aggregate bandwidth than several smaller, separate L3s combined.

So a dual-socket AMD system has 8 Zeppelin dies and 16 8 MB L3 caches. I'd be quite surprised if Intel could match the aggregate bandwidth of those 16 L3 caches. Additionally, if there are enough cache misses, AMD has a 33% advantage in both outstanding memory references (16 at a time in a dual-socket system) and bandwidth.

Basically, both architectures are HUGELY complicated. Even minor things like which compiler and which compiler flags you use can make a big difference. Now more than ever it's important to benchmark your workload; any simple rule of thumb is likely to be useless.
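
To make "benchmark your workload" concrete: even a trivial STREAM-style triad, run pinned in different configurations, will show you the bandwidth side of the story. A rough sketch, with the array size and repeat count as assumptions you'd want to tune:

    /* STREAM-style triad sketch for sustained memory bandwidth. Array
       size (must exceed the total L3) and repeat count are assumptions.
       Build with: cc -O2 -fopenmp triad.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (64L * 1024 * 1024)   /* 64M doubles = 512 MB/array */
    #define REPS 10

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++) {
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double bytes = 3.0 * N * sizeof(double) * REPS; /* 2 reads + 1 write */
        printf("triad: %.1f GB/s (a[0]=%g)\n", bytes / sec / 1e9, a[0]);
        free(a); free(b); free(c);
        return 0;
    }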


Intel's new chips allow you to logically dedicate segments of L3 to different programs/VMs.

https://software.intel.com/en-us/articles/introduction-to-ca...
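
On Linux this (Cache Allocation Technology) is exposed through the resctrl filesystem. A sketch of the flow, using a made-up group name, way mask, and PID, and assuming root plus a kernel with resctrl mounted at /sys/fs/resctrl:

    /* Sketch of CAT via the Linux resctrl interface: create a resource
       group, restrict it to two L3 ways, and move a process into it.
       The group name, way mask, and PID are made-up examples; needs
       root and a kernel with resctrl mounted at /sys/fs/resctrl. */
    #include <stdio.h>
    #include <sys/stat.h>

    static int write_str(const char *path, const char *s) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", s);
        return fclose(f);
    }

    int main(void) {
        /* A resource group is just a directory under /sys/fs/resctrl. */
        if (mkdir("/sys/fs/resctrl/low_prio", 0755) != 0)
            perror("mkdir");
        /* Allow this group only 2 ways of L3 on cache domain 0.       */
        write_str("/sys/fs/resctrl/low_prio/schemata", "L3:0=3");
        /* Move a (hypothetical) noisy process, PID 1234, into it.     */
        write_str("/sys/fs/resctrl/low_prio/tasks", "1234");
        return 0;
    }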


Skylake has up to 28 L3 cache slices (one per core), which together should provide significant bandwidth.



