Yeah, good point. SIMD is also superscalar on most systems. For x86, Intel can d...

moonchild · on June 28, 2022

Only port 5 and port 0/1 can do AVX-512 instructions on Skylake/Ice Lake, so I don't think you can get a better reciprocal throughput than 0.5 (i.e. two instructions per cycle) on current parts. Unless you count load & store as well, like I mentioned.
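To put numbers on that, here's a rough sketch of the arithmetic. It assumes port 0/1 counts as one 512-bit pipe and port 5 as the other, and that lanes are 32 bits wide. The `peak_rates` helper and its parameter names are made up for illustration, not taken from any tool.

```python
from fractions import Fraction


def peak_rates(pipes_512, vector_bits=512, lane_bits=32):
    """Return (reciprocal throughput in cycles per instruction, lanes per cycle).

    pipes_512 is the number of execution pipes that can start one
    512-bit ALU instruction each cycle (an assumption supplied by the caller).
    """
    recip_throughput = Fraction(1, pipes_512)
    lanes_per_cycle = pipes_512 * (vector_bits // lane_bits)
    return recip_throughput, lanes_per_cycle


# Assumed layout: port 0/1 acting as one 512-bit pipe, plus port 5.
recip, lanes = peak_rates(2)
print(float(recip), lanes)  # 0.5 32
```

So with two pipes the best case is 0.5 cycles per instruction, or 32 single-precision lanes per cycle, before counting loads and stores.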

moonchild · on June 28, 2022

(Unless you count an avx512-specific instruction operating on a 32- or 16-byte vector as an 'avx512 instruction'.)

mhh__ · on June 28, 2022

Skylake is 7 years old

moonchild · on June 29, 2022

And Ice Lake is 3 years old. If you have a newer AVX-512-capable chip which can do more than two 512-bit ALU ops per cycle, I would love to take a look at it.

mhh__ · on June 29, 2022

Throughput and latency did improve with Alder Lake (...), but I see your point.