Thanks for sharing the results and for the write-up. It is another welcome addition to the series of recent reports attesting to SIMD instructions having entered mainstream computing, from high-performance JSON parsing to now quicksort.

The white paper calls out memory bandwidth as a major bottleneck, which is where my question comes in.

The M1 Max sports an unusually wide (512-bit) memory bus coupled with 6.4 GT/s LPDDR5 unified memory, which, according to the available empirical evidence, allows a single CPU core to achieve circa 100Gb/sec memory transfer speed. M1 cores also feature a very large L1 D-cache as well as a large L2 cache (48 MB has been reported). Results, however, are approximately 2.4x lower for the M1 Max. I do realise that NEON SIMD processing will always be slower than AVX-512 based processing due to the 4x difference in vector size, but isn't the M1 Max supposed to perform somewhat faster than the observed figures, given that the faster and much wider memory bus would partially compensate for NEON's inefficiency?
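
For context, here is roughly the kind of test behind such per-core figures; a minimal streaming-read sketch of my own (the loop, sizes, and flags are mine, not the paper's harness):

    // Minimal single-core streaming-read sketch (my own, not the
    // paper's harness). Compile: g++ -O3 -march=native bw.cc && ./bw
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
      // 1 GiB working set, far larger than any cache level.
      const size_t kBytes = size_t{1} << 30;
      std::vector<uint64_t> buf(kBytes / sizeof(uint64_t), 1);

      uint64_t sum = 0;
      const auto t0 = std::chrono::steady_clock::now();
      for (int rep = 0; rep < 4; ++rep)
        for (size_t i = 0; i < buf.size(); ++i) sum += buf[i];
      const auto t1 = std::chrono::steady_clock::now();

      const double secs = std::chrono::duration<double>(t1 - t0).count();
      // Print the checksum so the compiler cannot elide the loop.
      std::printf("%.1f GB/s (sum=%llu)\n", 4.0 * kBytes / 1e9 / secs,
                  (unsigned long long)sum);
      return 0;
    }

At -O3 the sum loop should auto-vectorise, but a single loop like this may still understate what the load units can sustain, so treat the number as a floor.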

Other than the vector size difference between NEON and AVX-512, would you attribute the performance difference to the small test batch size (~4/8/16 MB for the 32/64/128-bit key sizes used in the test), which can fit nearly entirely in the L2 cache; to GCC being unaware of the cache line sizes on the M1 Pro/Max, resulting in cache underutilisation or inefficient instruction scheduling; or would you purely attribute it to NEON not having aged gracefully enough to meet current data processing demands?

Thank you.




:) If I'm reading 100Gb/sec correctly as gigabits, a Skylake core can also reach that bandwidth usage. The issue is that the system can only sustain it for roughly 8 cores. It would be very interesting to see bench_parallel results for an M1 Max, if anyone would like to give that a try? I suspect it will be comparable to Skylake-X, because 8 (performance) cores are not quite enough to utilize all the M1 Max bandwidth in this app. M1 may be more power efficient, though. It is also great to see more bandwidth per core. A hypothetical M2 with, say, 256-bit SVE vectors and 16 cores could be very interesting.
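
If anyone wants a quick stand-in for that scaling experiment without checking out the repo, a minimal per-thread streaming sweep like the following (my own sketch, not the actual bench_parallel) shows where aggregate GB/s stops growing with thread count:

    // Multi-core scaling sketch (a stand-in, not bench_parallel).
    // Compile: g++ -O3 -march=native -pthread scale.cc && ./scale
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Streaming read over a private buffer; returns a checksum so the
    // compiler cannot optimise the loop away.
    static uint64_t Sweep(const uint64_t* p, size_t n) {
      uint64_t sum = 0;
      for (size_t i = 0; i < n; ++i) sum += p[i];
      return sum;
    }

    int main() {
      const size_t kElems = (size_t{1} << 27) / sizeof(uint64_t);  // 128 MiB/thread
      const unsigned max_threads = std::thread::hardware_concurrency();
      for (unsigned t = 1; t <= max_threads; ++t) {
        std::vector<std::vector<uint64_t>> bufs(
            t, std::vector<uint64_t>(kElems, 1));
        std::atomic<uint64_t> check{0};
        const auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < t; ++i)
          workers.emplace_back([&, i] { check += Sweep(bufs[i].data(), kElems); });
        for (auto& w : workers) w.join();
        const double secs =
            std::chrono::duration<double>(std::chrono::steady_clock::now() - t0)
                .count();
        std::printf("%2u threads: %6.1f GB/s (check=%llu)\n", t,
                    t * kElems * sizeof(uint64_t) / 1e9 / secs,
                    (unsigned long long)check.load());
      }
      return 0;
    }

On a system that saturates at roughly 8 cores, the per-thread numbers should drop noticeably once the shared memory controller is the bottleneck.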

L2 caches are typically partitioned and private to a core. If that also applies to the M1 Max, then each core would only access 3 MB, so the working set is larger than the cache, as intended.

I do believe NEON is the limiting factor here. I haven't looked into what IPC we should expect, but even if it is 4 (the number of 128-bit NEON units), Skylake is often reaching 2 (with 512-bit vectors), so the measurement that an M1 core is about half as fast as SKX for this use case seems plausible.
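
Back-of-the-envelope version of that argument, assuming the 4-unit figure above and roughly comparable clock speeds:

    M1 (assumed): 4 NEON ops/cycle x 128-bit vectors =  512 bits/cycle
    SKX:          2 ops/cycle      x 512-bit vectors = 1024 bits/cycle
    512 / 1024 = 0.5  ->  roughly half the per-clock SIMD throughput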



