Note you can keep int8_t elements in that array and sign-extend the bytes into int32_t while loading. The _mm_loadu_si64 / _mm256_cvtepi8_epi32 combo should compile into a single vpmovsxbd instruction with a memory operand. This way the entire constant array fits in a single cache line, as long as it’s aligned properly with alignas(32).
This is a good fit for the OP’s use case: since they need two masks, the second vpmovsxbd instruction will be a guaranteed L1D cache hit.
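Something along these lines, just to illustrate the combo (the table layout and the helper name here are my own guesses for a tail-mask use case, not necessarily the OP’s actual array):

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical layout: row n has -1 in its first n bytes. With int8_t
   entries the whole table is 9*8 = 72 bytes instead of 288 for int32_t rows. */
alignas(32) static const int8_t mask_rows[9][8] = {
    { 0,  0,  0,  0,  0,  0,  0,  0},
    {-1,  0,  0,  0,  0,  0,  0,  0},
    {-1, -1,  0,  0,  0,  0,  0,  0},
    {-1, -1, -1,  0,  0,  0,  0,  0},
    {-1, -1, -1, -1,  0,  0,  0,  0},
    {-1, -1, -1, -1, -1,  0,  0,  0},
    {-1, -1, -1, -1, -1, -1,  0,  0},
    {-1, -1, -1, -1, -1, -1, -1,  0},
    {-1, -1, -1, -1, -1, -1, -1, -1},
};

/* Load 8 bytes and sign-extend them to 8 x int32 lanes; compilers fold
   this pair into a single vpmovsxbd ymm, qword [mem]. */
static inline __m256i load_mask(int n) {
    return _mm256_cvtepi8_epi32(_mm_loadu_si64(mask_rows[n]));
}
```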
vpmovsxbd ymm,[…] still presumably decomposes back into two uops (definitely does on intel, but uops.info doesn't show memory uops for AMD); still better than broadcast+compare though (which does still have a load for the constant; and, for that matter, the original shifting version also has multiple loads). Additionally, the int8_t elements mean no cacheline-crossing loads. (there's the more compressed option of only having a {8×-1, 8×0} array, at the cost of more scalar offset computation)
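That compressed option would look roughly like this (names are mine, just a sketch; the extra scalar work is the 8 - n offset computation):

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* 16-byte sliding window: eight -1 bytes followed by eight 0 bytes.
   alignas(16) keeps any 8-byte load from it within one cache line. */
alignas(16) static const int8_t window[16] = {
    -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0,
};

/* Mask for the first n of 8 int32 lanes, 0 <= n <= 8: start the 8-byte load
   at offset 8 - n so it covers n bytes of -1 followed by 8 - n zero bytes. */
static inline __m256i first_n_mask(int n) {
    return _mm256_cvtepi8_epi32(_mm_loadu_si64(window + (8 - n)));
}
```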
> definitely does on intel, but uops.info doesn't show memory uops for AMD
Indeed, but it reveals something else interesting. On Zen2 and Zen3 processors, the throughput of vpmovsxbd ymm, [...] is more than twice that of sign extension from another vector register, i.e. vpmovsxbd ymm, xmm.
> the original shifting version also has multiple loads
I believe _mm256_setr_epi32 like that is typically compiled into a sequence of vmovd / vpinsrd / vinserti128 instructions. These involve no loads, just multiple instructions assembling the vector from int32 pieces produced in scalar registers.
Oh yeah, I did forget that, despite being split into uops, a sign-extending load is still more efficient than writing the load and the sign extension as literally separate instructions (some other similar cases include memory insert/extract and, perhaps most significantly, broadcast; with varying results across intel & AMD). I imagine the memory subsystem is able to simultaneously supply ≤128-bit results to both 128-bit ALU halves, thus avoiding the need for cross-128-bit transfers.
The _mm256_setr_epi32 by itself would indeed be very inefficient, but clang manages to vectorize it[1] to vpaddd+vpsllvd, which require some constant loads (also it generates some weird blends, idk).
[1]: https://godbolt.org/z/7jq4z39GT - L833-847 or so in the assembly, or on L67 in the source, right click → "Reveal linked code"
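For anyone curious what kind of scalar pattern can end up as that vpaddd + vpsllvd pair, it’s roughly a per-lane shift whose count varies linearly with the lane index; a made-up example (not the OP’s code) of the shape:

```c
#include <immintrin.h>
#include <stdint.h>

/* Eight lanes of value << (k + lane). Clang can turn the eight scalar
   shifts into broadcasts of value and k, a vpaddd with the constant
   {0,...,7} (hence a constant load), and a single vpsllvd. */
static inline __m256i shifted_lanes(int32_t value, int k) {
    return _mm256_setr_epi32(
        value << (k + 0), value << (k + 1), value << (k + 2), value << (k + 3),
        value << (k + 4), value << (k + 5), value << (k + 6), value << (k + 7));
}
```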