Thanks for sharing the results and for the write-up. It is another welcome addition to the series of recent reports attesting to SIMD instructions having entered mainstream computing, from high-performance JSON parsing to now quicksort.

The white paper calls out memory bandwidth as a major bottleneck, which is where my question comes in.

The M1 Max sports an unusually wide (512-bit) memory bus coupled with 6.4 GT/s LPDDR5 unified memory, which, according to the available empirical evidence, allows a single CPU core to achieve circa 100Gb/sec memory transfer speed. M1 cores also feature a very large L1 D-cache as well as a large L2 cache (48 MB has been reported). Results, however, are approximately 2.4x lower for the M1 Max. I do realise that NEON SIMD processing will always be slower than AVX-512 based processing due to the 4x difference in vector size, but isn't the M1 Max supposed to perform somewhat faster than the observed figures, given that the faster and much wider memory bus would partially compensate for NEON's inefficiency?
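
For context, here is roughly the kind of test behind such per-core figures; a minimal streaming-read sketch of my own (the loop, sizes, and flags are mine, not the paper's harness):

    // Minimal single-core streaming-read sketch (my own, not the
    // paper's harness). Compile: g++ -O3 -march=native bw.cc && ./bw
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
      // 1 GiB working set, far larger than any cache level.
      const size_t kBytes = size_t{1} << 30;
      std::vector<uint64_t> buf(kBytes / sizeof(uint64_t), 1);

      uint64_t sum = 0;
      const auto t0 = std::chrono::steady_clock::now();
      for (int rep = 0; rep < 4; ++rep)
        for (size_t i = 0; i < buf.size(); ++i) sum += buf[i];
      const auto t1 = std::chrono::steady_clock::now();

      const double secs = std::chrono::duration<double>(t1 - t0).count();
      // Print the checksum so the compiler cannot elide the loop.
      std::printf("%.1f GB/s (sum=%llu)\n", 4.0 * kBytes / 1e9 / secs,
                  (unsigned long long)sum);
      return 0;
    }

At -O3 the sum loop should auto-vectorise, but a single loop like this may still understate what the load units can sustain, so treat the number as a floor.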

Other than the vector size difference between NEON and AVX-512, would you attribute the performance difference to the small test batch size (~4/8/16 MB for the 32/64/128-bit key sizes used in the test), which can fit nearly entirely in the L2 cache; to GCC being unaware of the cache line sizes on the M1 Pro/Max, resulting in cache underutilisation or inefficient instruction scheduling; or would you purely attribute it to NEON not having aged gracefully enough to meet current data processing demands?

Thank you.




:) If I'm reading 100Gb/sec correctly as gigabits, a Skylake core can also reach that bandwidth usage. The issue is that the system can only sustain it for roughly 8 cores. It would be very interesting to see bench_parallel results for an M1 Max, if anyone would like to give that a try? I suspect it will be comparable to Skylake-X, because 8 (performance) cores are not quite enough to utilize all the M1 Max bandwidth in this app. M1 may be more power efficient, though. It is also great to see more bandwidth per core. A hypothetical M2 with, say, 256-bit SVE vectors and 16 cores could be very interesting.
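
If anyone wants a quick stand-in for that scaling experiment without checking out the repo, a minimal per-thread streaming sweep like the following (my own sketch, not the actual bench_parallel) shows where aggregate GB/s stops growing with thread count:

    // Multi-core scaling sketch (a stand-in, not bench_parallel).
    // Compile: g++ -O3 -march=native -pthread scale.cc && ./scale
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Streaming read over a private buffer; returns a checksum so the
    // compiler cannot optimise the loop away.
    static uint64_t Sweep(const uint64_t* p, size_t n) {
      uint64_t sum = 0;
      for (size_t i = 0; i < n; ++i) sum += p[i];
      return sum;
    }

    int main() {
      const size_t kElems = (size_t{1} << 27) / sizeof(uint64_t);  // 128 MiB/thread
      const unsigned max_threads = std::thread::hardware_concurrency();
      for (unsigned t = 1; t <= max_threads; ++t) {
        std::vector<std::vector<uint64_t>> bufs(
            t, std::vector<uint64_t>(kElems, 1));
        std::atomic<uint64_t> check{0};
        const auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < t; ++i)
          workers.emplace_back([&, i] { check += Sweep(bufs[i].data(), kElems); });
        for (auto& w : workers) w.join();
        const double secs =
            std::chrono::duration<double>(std::chrono::steady_clock::now() - t0)
                .count();
        std::printf("%2u threads: %6.1f GB/s (check=%llu)\n", t,
                    t * kElems * sizeof(uint64_t) / 1e9 / secs,
                    (unsigned long long)check.load());
      }
      return 0;
    }

On a system that saturates at roughly 8 cores, the per-thread numbers should drop noticeably once the shared memory controller is the bottleneck.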

L2 caches are typically partitioned and private to a core. If that also applies to the M1 Max, then each core would only access 3 MB, so the working set is larger than the cache, as intended.

I do believe NEON is the limiting factor here. I haven't looked into what IPC we should expect, but even if it is 4 (the number of 128-bit NEON units), Skylake is often reaching 2 (with 512-bit vectors), so the measurement that an M1 core is about half as fast as SKX for this use case seems plausible.
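
Back-of-the-envelope version of that argument, assuming the 4-unit figure above and roughly comparable clock speeds:

    M1 (assumed): 4 NEON ops/cycle x 128-bit vectors =  512 bits/cycle
    SKX:          2 ops/cycle      x 512-bit vectors = 1024 bits/cycle
    512 / 1024 = 0.5  ->  roughly half the per-clock SIMD throughput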



