Unfortunately, using AVX512 instructions only gives a speedup in very specific situations, and for many real-world use cases it actually underperforms because of oddities in frequency scaling and switching delays. Profiling runs longer than a few milliseconds are one place you see phantom gains, so take care not to be deceived.
Edit: I'm not saying this isn't a true benefit here, just that speed claims for AVX512 need to be treated with a fair amount of scepticism for actual use cases.
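To make the "phantom gains" point concrete, here is a toy model of how a fixed switching cost gets amortized away in a long profiling run but dominates a short burst. Everything in it is an assumption for illustration: the function name `effective_speedup`, the 2x vector speedup, and the 20 microsecond switching cost are made-up placeholders, not measurements of any real CPU.

```python
def effective_speedup(work_us, vector_speedup, transition_us):
    """Toy model: observed speedup of a vectorized burst over scalar code.

    work_us        -- time the scalar version needs for the burst (microseconds)
    vector_speedup -- hypothetical raw speedup of the vector kernel itself
    transition_us  -- hypothetical fixed cost paid once per burst for
                      frequency/license switching (an assumed placeholder)
    """
    scalar_time = work_us
    vector_time = transition_us + work_us / vector_speedup
    return scalar_time / vector_time


# Hypothetical numbers only: a 2x kernel with a 20 us switching cost.
short_burst = effective_speedup(work_us=10, vector_speedup=2.0, transition_us=20)
long_run = effective_speedup(work_us=10_000, vector_speedup=2.0, transition_us=20)

print(f"short burst: {short_burst:.2f}x")  # below 1.0, i.e. slower than scalar
print(f"long run:    {long_run:.2f}x")     # close to the raw 2x kernel speedup
```

The model ignores the slowdown of surrounding scalar code while the core runs at a reduced clock, so if anything it flatters the vector path. The point is only that a benchmark that loops for a long time measures the second number, while code that vectorizes in short, scattered bursts lives closer to the first.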
I didn't realise that. My original post was just to warn that AVX512 benchmarks can be highly misleading. Everyone has trouble measuring AVX512 performance:
He is aware, but sidestepped these issues, so this code is only recommended on the newest Cannon Lake processors. But we really want to know which method is best for which CPU. What about AMD Rome, for example?
See
https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...
https://news.ycombinator.com/item?id=21029417