Note you can keep int8_t elements in that array and sign-extend the bytes into int32_t while loading. The _mm_loadu_si64 / _mm256_cvtepi8_epi32 combo should compile into a single vpmovsxbd instruction with a memory operand. This way the entire constant array fits in a single cache line, as long as it’s aligned properly with alignas(32).
This is a good fit for the OP’s use case: since they need two masks, the second vpmovsxbd instruction will be a guaranteed L1D cache hit.
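Something along these lines, just to illustrate the combo (the table layout and the helper name here are my own guesses for a tail-mask use case, not necessarily the OP’s actual array):

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical layout: row n has -1 in its first n bytes. With int8_t
   entries the whole table is 9*8 = 72 bytes instead of 288 for int32_t rows. */
alignas(32) static const int8_t mask_rows[9][8] = {
    { 0,  0,  0,  0,  0,  0,  0,  0},
    {-1,  0,  0,  0,  0,  0,  0,  0},
    {-1, -1,  0,  0,  0,  0,  0,  0},
    {-1, -1, -1,  0,  0,  0,  0,  0},
    {-1, -1, -1, -1,  0,  0,  0,  0},
    {-1, -1, -1, -1, -1,  0,  0,  0},
    {-1, -1, -1, -1, -1, -1,  0,  0},
    {-1, -1, -1, -1, -1, -1, -1,  0},
    {-1, -1, -1, -1, -1, -1, -1, -1},
};

/* Load 8 bytes and sign-extend them to 8 x int32 lanes; compilers fold
   this pair into a single vpmovsxbd ymm, qword [mem]. */
static inline __m256i load_mask(int n) {
    return _mm256_cvtepi8_epi32(_mm_loadu_si64(mask_rows[n]));
}
```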
vpmovsxbd ymm,[…] still presumably decomposes back into two uops (definitely does on intel, but uops.info doesn't show memory uops for AMD); still better than broadcast+compare though (which does still have a load for the constant; and, for that matter, the original shifting version also has multiple loads). Additionally, the int8_t elements mean no cacheline-crossing loads. (there's the more compressed option of only having a {8×-1, 8×0} array, at the cost of more scalar offset computation)
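That compressed option would look roughly like this (names are mine, just a sketch; the extra scalar work is the 8 - n offset computation):

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* 16-byte sliding window: eight -1 bytes followed by eight 0 bytes.
   alignas(16) keeps any 8-byte load from it within one cache line. */
alignas(16) static const int8_t window[16] = {
    -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0,
};

/* Mask for the first n of 8 int32 lanes, 0 <= n <= 8: start the 8-byte load
   at offset 8 - n so it covers n bytes of -1 followed by 8 - n zero bytes. */
static inline __m256i first_n_mask(int n) {
    return _mm256_cvtepi8_epi32(_mm_loadu_si64(window + (8 - n)));
}
```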
> definitely does on intel, but uops.info doesn't show memory uops for AMD
Indeed, but it reveals something else interesting. On Zen2 and Zen3 processors, the throughput of vpmovsxbd ymm, [...] is more than twice that of sign extension from another vector register, i.e. vpmovsxbd ymm, xmm.
> the original shifting version also has multiple loads
I believe _mm256_setr_epi32 like that is typically compiled into a sequence of vmovd / vpinsrd / vinserti128 instructions. These involve no loads, just multiple instructions assembling the vector from int32 pieces produced in scalar registers.
Oh yeah, I did forget that, despite being split into uops, a sign-extending load is still more efficient than writing the load and the sign extension as literally separate instructions (some other similar cases include memory insert/extract and, perhaps most significantly, broadcast; with varying results across intel & AMD). I imagine the memory subsystem is able to simultaneously supply ≤128-bit results to both 128-bit ALU halves, thus avoiding the need for cross-128-bit transfers.
The _mm256_setr_epi32 by itself would indeed be very inefficient, but clang manages to vectorize it[1] to vpaddd+vpsllvd, which require some constant loads (also it generates some weird blends, idk).
[1]: https://godbolt.org/z/7jq4z39GT - L833-847 or so in the assembly, or on L67 in the source, right click → "Reveal linked code"
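For anyone curious what kind of scalar pattern can end up as that vpaddd + vpsllvd pair, it’s roughly a per-lane shift whose count varies linearly with the lane index; a made-up example (not the OP’s code) of the shape:

```c
#include <immintrin.h>
#include <stdint.h>

/* Eight lanes of value << (k + lane). Clang can turn the eight scalar
   shifts into broadcasts of value and k, a vpaddd with the constant
   {0,...,7} (hence a constant load), and a single vpsllvd. */
static inline __m256i shifted_lanes(int32_t value, int k) {
    return _mm256_setr_epi32(
        value << (k + 0), value << (k + 1), value << (k + 2), value << (k + 3),
        value << (k + 4), value << (k + 5), value << (k + 6), value << (k + 7));
}
```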