AVX512 has far more features than Intel's earlier SIMD extensions. Feature-wise, it's beginning to be competitive with NVIDIA's PTX (the CUDA instruction set). Like, AVX512 is a really, really good instruction set (or I guess: a really good set of instruction sets).
Assuming AVX512 F, CD, VL, DQ, and BW (the AVX512 subsets expected in Cannon Lake):
* AVX512F -- "Standard" 512-bit arithmetic already has major improvements, above and beyond the 256-bit -> 512-bit upgrade. AVX512 has 32 vector registers (AVX2 and earlier only have 16). The new opmask registers and masked instructions also let way more code turn into "branch-free" code, which is friendly for pipelines. This alone is already a major step forward, with huge implications for multimedia code.
* AVX512-CD: Conflict Detection. These instructions detect duplicate indices within a vector, which lets auto-vectorizers "resolve loop conflicts" (think histogram updates through an index array) and auto-vectorize more code.
* BW, DQ -- Extend AVX512 to Bytes and Shorts (BW), and add more instructions for Longs and Long Longs (DQ).
* VL -- Extend AVX512 to operate on only 256 bits or 128 bits at a time.
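
To make the opmask point concrete, here is a rough sketch of a loop with an `if` in its body turned into branch-free AVX512F code. It assumes GCC or Clang on x86-64; the function names are made up for illustration, and the wrapper falls back to plain scalar code when the CPU does not report AVX512F.

```c
#include <stddef.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Scalar reference: out[i] = a[i] + b[i] where a[i] > 0, else a[i]. */
static void add_where_positive_scalar(const float *a, const float *b,
                                      float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] > 0.0f ? a[i] + b[i] : a[i];
}

#if defined(__x86_64__)
/* Same loop, 16 floats per iteration, with the "if" turned into an opmask. */
__attribute__((target("avx512f")))
static void add_where_positive_avx512(const float *a, const float *b,
                                      float *out, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        /* One mask bit per lane, set where a[i] > 0. */
        __mmask16 k = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
        /* Lanes with the bit set get a + b; the others keep va. No branch. */
        __m512 r = _mm512_mask_add_ps(va, k, va, vb);
        _mm512_storeu_ps(out + i, r);
    }
    /* Fewer than 16 elements left: finish them with the scalar loop. */
    add_where_positive_scalar(a + i, b + i, out + i, n - i);
}
#endif

/* Hypothetical entry point: picks the AVX512 path only if the CPU has it. */
void add_where_positive(const float *a, const float *b, float *out, size_t n) {
#if defined(__x86_64__)
    if (__builtin_cpu_supports("avx512f")) {
        add_where_positive_avx512(a, b, out, n);
        return;
    }
#endif
    add_where_positive_scalar(a, b, out, n);
}
```

With AVX2 the same effect needs a compare into a full vector register plus a separate blend; here the mask lives in its own small register and the add itself is predicated.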
--------------------
I'm certain that some code that could not be vectorized with AVX2 (or lower) will be vectorized with AVX512. Maybe even automatically, as compiler writers implement high-level features / auto-vectorizers.
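
A histogram update is the usual example of such code: `hist[idx[i]]++` cannot be vectorized naively because two lanes may hit the same bin. Below is a deliberately simple sketch of how AVX512-CD helps, assuming GCC or Clang on x86-64. The function names are made up, and the strategy (vectorize a block of 16 only when it has no duplicate indices, otherwise do that block in scalar) is the easiest correct one, not what a tuned library or compiler would necessarily emit.

```c
#include <stddef.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Scalar reference: one increment per index. */
static void histogram_scalar(const int *idx, size_t n, int *hist) {
    for (size_t i = 0; i < n; i++)
        hist[idx[i]]++;
}

#if defined(__x86_64__)
__attribute__((target("avx512f,avx512cd")))
static void histogram_avx512cd(const int *idx, size_t n, int *hist) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i vi = _mm512_loadu_si512(idx + i);
        /* Each lane receives a bitmap of the earlier lanes holding the
           same index, so an all-zero result means 16 distinct bins. */
        __m512i conf = _mm512_conflict_epi32(vi);
        __mmask16 dup = _mm512_test_epi32_mask(conf, conf);
        if (dup == 0) {
            /* No duplicates: gather, add 1, scatter is safe. */
            __m512i h = _mm512_i32gather_epi32(vi, hist, 4);
            h = _mm512_add_epi32(h, _mm512_set1_epi32(1));
            _mm512_i32scatter_epi32(hist, vi, h, 4);
        } else {
            /* Duplicates in this block: fall back to scalar updates. */
            histogram_scalar(idx + i, 16, hist);
        }
    }
    histogram_scalar(idx + i, n - i, hist); /* leftover tail */
}
#endif

/* Hypothetical entry point with a runtime CPU check. */
void histogram(const int *idx, size_t n, int *hist) {
#if defined(__x86_64__)
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512cd")) {
        histogram_avx512cd(idx, n, hist);
        return;
    }
#endif
    histogram_scalar(idx, n, hist);
}
```

Without the conflict check, the gather/add/scatter sequence would silently lose increments whenever two lanes in a block point at the same bin.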
I wonder why they did the VL thing (separate 128-bit and 256-bit variants) instead of just defining a vector length register like other vector ISAs. That would have made it possible to get rid of the remainder loop, leading to less code bloat and more efficient execution for short loops where the number of iterations is not an integer multiple of the ISA vector length.
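
For reference, the opmask registers can stand in for a vector length register in the tail case: the last iteration runs with a partial mask instead of a separate remainder loop. A minimal sketch, assuming GCC or Clang on x86-64, with made-up function names and a scalar fallback for CPUs without AVX512F:

```c
#include <stddef.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Scalar reference: multiply every element by s. */
static void scale_scalar(float *x, size_t n, float s) {
    for (size_t i = 0; i < n; i++)
        x[i] *= s;
}

#if defined(__x86_64__)
__attribute__((target("avx512f")))
static void scale_avx512(float *x, size_t n, float s) {
    const __m512 vs = _mm512_set1_ps(s);
    for (size_t i = 0; i < n; i += 16) {
        size_t left = n - i;
        /* Full mask for whole blocks, low `left` bits for the final block. */
        __mmask16 k = left >= 16 ? (__mmask16)0xFFFF
                                 : (__mmask16)((1u << left) - 1u);
        /* Masked-off lanes are neither read nor written. */
        __m512 v = _mm512_maskz_loadu_ps(k, x + i);
        v = _mm512_mul_ps(v, vs);
        _mm512_mask_storeu_ps(x + i, k, v);
    }
}
#endif

/* Hypothetical entry point with a runtime CPU check. */
void scale(float *x, size_t n, float s) {
#if defined(__x86_64__)
    if (__builtin_cpu_supports("avx512f")) {
        scale_avx512(x, n, s);
        return;
    }
#endif
    scale_scalar(x, n, s);
}
```

The mask still has to be computed by hand on every iteration, so this is not as tidy as a real vector length register, but there is only one loop body.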