Looks like die-to-die latency isn't all that great on EPYC, as expected:

"What does this mean to the end user? The 64 MB L3 on the spec sheet does not really exist. In fact even the 16 MB L3 on a single Zeppelin die consists of two 8 MB L3-caches. There is no cache that truly functions as single, unified L3-cache on the MCM; instead there are eight separate 8 MB L3-caches."

Also:

"AMD's unloaded latency is very competitive under 8 MB, and is a vast improvement over previous AMD server CPUs. Unfortunately, accessing more 8 MB incurs worse latency than a Broadwell core accessing DRAM. Due to the slow L3-cache access, AMD's DRAM access is also the slowest. The importance of unloaded DRAM latency should of course not be exaggerated: in most applications most of the loads are done in the caches. Still, it is bad news for applications with pointer chasing or other latency-sensitive operations."

I was kind of expecting this, but it's still disappointing to see. Looks like if you need a lot of L3, Intel is still the best/only option. Not to say that AMD hasn't made massive improvements, though. And it's also worth noting that while AMD's memory latency is generally worse, its throughput is typically better than Intel's.
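
If anyone wants to reproduce the latency numbers, unloaded latency is normally measured with a dependent pointer chase, where each load's address comes out of the previous load so nothing can be overlapped or prefetched. A minimal sketch (the 16 MB working set and iteration count are my own arbitrary choices; vary the size to step through L1/L2/L3/DRAM):

    /* Dependent pointer-chase latency sketch: each load's address comes
       from the previous load, so misses can't be overlapped. The 16 MB
       working set and iteration count are arbitrary; vary the size to
       step through L1/L2/L3/DRAM. Build with: cc -O2 chase.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (16 * 1024 * 1024 / sizeof(void *))
    #define ITERS 10000000UL

    int main(void) {
        void **buf = malloc(N * sizeof(void *));
        size_t *idx = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++) idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {   /* Fisher-Yates shuffle   */
            size_t j = rand() % (i + 1);       /* defeats the prefetcher */
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++)         /* one big random cycle   */
            buf[idx[i]] = &buf[idx[(i + 1) % N]];

        void **p = buf;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < ITERS; i++)
            p = (void **)*p;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/load (%p)\n", ns / ITERS, (void *)p); /* use p */
        free(buf); free(idx);
        return 0;
    }

If the article is right, on EPYC you'd expect the measured ns/load to jump sharply once the working set exceeds 8 MB.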




This is extremely workload-dependent. If you have a lot of processes and they have good affinity, you don't mind that the L3 is organized like it is on the AMD chip. On the other hand, single threaded workloads suffer, as do applications with lots of threads that move around a lot.
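
If your workload does have that kind of affinity, pinning is cheap to try. A sketch using sched_setaffinity on Linux, assuming cores 0-3 form a single CCX and thus share one 8 MB L3 (the actual core-to-CCX mapping varies by SKU and BIOS, so check lscpu first):

    /* Pin this process to cores 0-3, on the assumption that they form
       one CCX and therefore share one 8 MB L3. The real core-to-CCX
       mapping varies by SKU/BIOS; check lscpu before trusting this. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the cache-sensitive work here ... */
        return 0;
    }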


I'm curious to hear from AMD about the L3 cache latency issue. The article shows L3 access just a few ns quicker than going to DRAM, even to the other CCX on the same die. That's ugly!

This strikes me as either a bug or a benchmarking glitch, though other benchmarks seem to imply that the situation is real.

Assuming it's legit, this gives AMD a great opportunity for a boost in their first respin of the 8C/16T die.


The L3 issue isn't quite that simple. Sure, if your dataset fits in Intel's L3, that's great. The problem is that a single shared L3 (for the same transistor/effort budget) has much lower aggregate bandwidth than several smaller, separate L3s combined.

So a dual-socket AMD system has 8 Zeppelin dies and 16 8 MB L3 caches. I'd be quite surprised if Intel could match the aggregate bandwidth of those 16 L3 caches. Additionally, if there are enough cache misses, AMD has a 33% advantage in both outstanding memory references (16 at a time in a dual-socket system) and bandwidth.

Basically, both architectures are HUGELY complicated. Even minor things like which compiler and which compiler flags you use can make a big difference. Now more than ever it's important to benchmark your workload; any simple rule of thumb is likely to be useless.
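
To make "benchmark your workload" concrete: even a trivial STREAM-style triad, run pinned in different configurations, will show you the bandwidth side of the story. A rough sketch, with the array size and repeat count as assumptions you'd want to tune:

    /* STREAM-style triad sketch for sustained memory bandwidth. Array
       size (must exceed the total L3) and repeat count are assumptions.
       Build with: cc -O2 -fopenmp triad.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (64L * 1024 * 1024)   /* 64M doubles = 512 MB/array */
    #define REPS 10

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++) {
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double bytes = 3.0 * N * sizeof(double) * REPS; /* 2 reads + 1 write */
        printf("triad: %.1f GB/s (a[0]=%g)\n", bytes / sec / 1e9, a[0]);
        free(a); free(b); free(c);
        return 0;
    }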


Intel's new chips allow you to logically dedicate segments of L3 to different programs/VMs.

https://software.intel.com/en-us/articles/introduction-to-ca...
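
On Linux this (Cache Allocation Technology) is exposed through the resctrl filesystem. A sketch of the flow, using a made-up group name, way mask, and PID, and assuming root plus a kernel with resctrl mounted at /sys/fs/resctrl:

    /* Sketch of CAT via the Linux resctrl interface: create a resource
       group, restrict it to two L3 ways, and move a process into it.
       The group name, way mask, and PID are made-up examples; needs
       root and a kernel with resctrl mounted at /sys/fs/resctrl. */
    #include <stdio.h>
    #include <sys/stat.h>

    static int write_str(const char *path, const char *s) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", s);
        return fclose(f);
    }

    int main(void) {
        /* A resource group is just a directory under /sys/fs/resctrl. */
        if (mkdir("/sys/fs/resctrl/low_prio", 0755) != 0)
            perror("mkdir");
        /* Allow this group only 2 ways of L3 on cache domain 0.       */
        write_str("/sys/fs/resctrl/low_prio/schemata", "L3:0=3");
        /* Move a (hypothetical) noisy process, PID 1234, into it.     */
        write_str("/sys/fs/resctrl/low_prio/tasks", "1234");
        return 0;
    }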


Skylake has up to 28 L3 cache slices (one per core), which together should provide significant bandwidth.



