
Yeah, good point. SIMD is also superscalar on most systems.

For x86, Intel can do 3x AVX-512 instructions per clock tick as long as each instruction is simple enough (add, AND, OR, NOT, XOR, maybe even multiply if you're not counting the latency issue).




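Only port 5 and the fused port 0/1 can execute 512-bit AVX-512 instructions on Skylake and Ice Lake, so I don't think you can get a better reciprocal throughput than 0.5 cycles per instruction (i.e. two instructions per cycle) on current parts. Unless you count loads and stores as well, like I mentioned.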


(Unless you count an AVX-512-specific instruction operating on a 32- or 16-byte vector as an 'AVX-512 instruction'.)
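To make the port argument above concrete, here is a rough back-of-the-envelope sketch in Python. The port tables are a simplified assumption for simple vector ALU ops on Skylake-SP-style cores (ports 0 and 1 fuse into one 512-bit unit, port 5 is the second one). The names `PORTS_512`, `PORTS_256` and the helper functions are made up for illustration, not taken from any tool or manual.

```python
from fractions import Fraction

# Assumed (simplified) port model for simple vector ALU ops.
# With 512-bit ops in flight, ports 0 and 1 act as one fused 512-bit unit.
PORTS_512 = ["p01 (fused)", "p5"]
PORTS_256 = ["p0", "p1", "p5"]


def peak_ops_per_cycle(ports):
    """Best case: one simple ALU op issued to each usable port per cycle."""
    return len(ports)


def reciprocal_throughput(ports):
    """Cycles per instruction, the '0.5' style figure quoted in the thread."""
    return Fraction(1, peak_ops_per_cycle(ports))


def peak_bytes_per_cycle(ports, vector_bytes):
    """Bytes of vector data processed per cycle at peak issue rate."""
    return peak_ops_per_cycle(ports) * vector_bytes


for name, ports, width in (("512-bit", PORTS_512, 64), ("256-bit", PORTS_256, 32)):
    print(name,
          "ops/cycle:", peak_ops_per_cycle(ports),
          "reciprocal throughput:", reciprocal_throughput(ports),
          "bytes/cycle:", peak_bytes_per_cycle(ports, width))
```

Under these assumptions the 512-bit case tops out at two ALU ops per cycle, while 256-bit ops can reach three per cycle but still move fewer bytes per cycle. Loads and stores use other ports and are not modelled here.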


Skylake is 7 years old


And Ice Lake is 3 years old. If you have a newer AVX-512-capable chip that can do more than two 512-bit ALU ops per cycle, I would love to take a look at it.


Throughput and latency did improve with Alder Lake (...), but I see your point.



