
Though not at all part of the hot path, the inefficiency of the mask generation ('bit_mask' usage) nags me. Some more efficient methods include creating a global constant array of {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element offsets 16-m and 8-m, or comparing a constant vector {0,1,2,3,4,...} with broadcasted m and m-8.
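
For concreteness, here's a rough sketch of both approaches (the 32-element table layout, helper names, and the "k = number of active lanes in an 8-lane half" convention are mine, not the article's):

    #include <immintrin.h>
    #include <stdint.h>

    /* Option 1: pick an 8-lane int32 mask out of a constant table.
       With this layout, a load at element offset 16 - k yields a mask
       whose first k lanes are -1 and whose remaining lanes are 0. */
    static const int32_t mask_table[32] = {
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0};

    static inline __m256i mask_from_table(int k) {  /* k in 0..8 */
        return _mm256_loadu_si256((const __m256i *)(mask_table + 16 - k));
    }

    /* Option 2: compare the index vector {0,1,...,7} against a broadcast
       count; lane i becomes -1 exactly when i < k. */
    static inline __m256i mask_from_compare(int k) {  /* k in 0..8 */
        const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm256_cmpgt_epi32(_mm256_set1_epi32(k), iota);
    }

    /* For m valid rows (0..16), the two masks would then be built as e.g.
       mask_from_compare(m < 8 ? m : 8) and mask_from_compare(m > 8 ? m - 8 : 0). */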

A fairly moot nitpick though, given that this is for only one column of the matrix: the following loops of maskload/maskstore will take significantly more time (especially the store, which is still slow on Zen 4 [1] even though the AVX-512 instruction, whose only difference is taking the mask in a mask register, is 6x faster), and clang autovectorizes the shifting anyway (ending up maybe 2-3x slower than my suggestions).

[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...




Hi! I'm the author of the article. It's really my first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but I'm willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.

Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.

"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!


> creating a global constant array

Note you can keep int8_t elements in that array and sign-extend the bytes into int32_t while loading. The _mm_loadu_si64 / _mm256_cvtepi8_epi32 combo should compile into a single vpmovsxbd instruction with a memory operand. This way the entire constant array fits in a single cache line, as long as it’s aligned properly with alignas(32).

This is a good fit for the OP’s use case: since they need two masks, the second vpmovsxbd load is a guaranteed L1D cache hit.
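
A minimal sketch of that byte-table idea, under the same assumptions as the earlier sketch (layout and names are illustrative, not the article's code):

    #include <immintrin.h>
    #include <stdalign.h>
    #include <stdint.h>

    /* 32 bytes total, so the whole table fits in one cache line when aligned. */
    alignas(32) static const int8_t mask_bytes[32] = {
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0};

    static inline __m256i mask_from_bytes(int k) {  /* k = active lanes, 0..8 */
        /* 8 bytes loaded, then sign-extended to 8 int32 lanes; compilers
           typically fold this into a single vpmovsxbd ymm, [mem]. */
        return _mm256_cvtepi8_epi32(_mm_loadu_si64(mask_bytes + 16 - k));
    }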


vpmovsxbd ymm,[…] still presumably decomposes back into two uops (definitely does on intel, but uops.info doesn't show memory uops for AMD); still better than broadcast+compare though (which does still have a load for the constant; and, for that matter, the original shifting version also has multiple loads). Additionally, the int8_t elements mean no cacheline-crossing loads. (there's the more compressed option of only having a {8×-1, 8×0} array, at the cost of more scalar offset computation)
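
One possible reading of that compressed {8×-1, 8×0} option, with the extra scalar clamping it implies (again just a sketch, not anyone's actual code):

    #include <immintrin.h>
    #include <stdint.h>

    static const int8_t small_mask[16] = {
        -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0};

    /* half = 0 for rows 0..7, half = 1 for rows 8..15; m = total valid rows. */
    static inline __m256i mask_small_table(int m, int half) {
        int k = m - 8 * half;             /* lanes active in this half...      */
        k = k < 0 ? 0 : (k > 8 ? 8 : k);  /* ...clamped to 0..8 in scalar code */
        return _mm256_cvtepi8_epi32(_mm_loadu_si64(small_mask + 8 - k));
    }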


> definitely does on intel, but uops.info doesn't show memory uops for AMD

Indeed, but it reveals something else interesting: on Zen 2 and Zen 3 processors, the throughput of vpmovsxbd ymm, [...] is more than twice that of sign extension from another vector register, i.e. vpmovsxbd ymm, xmm.

> the original shifting version also has multiple loads

I believe _mm256_setr_epi32 like that is typically compiled into a sequence of vmovd / vpinsrd / vinserti128 instructions. These involve no loads, just multiple instructions assembling the vector from int32 pieces produced in scalar registers.


Oh yeah, I did forget that: despite being separated into uops, sign-extend-from-memory is still more efficient than literally writing it as a separate load plus register sign-extend (some other similar cases include memory insert/extract and, perhaps most significantly, broadcast, with varying results across intel & AMD). I imagine the memory subsystem is able to simultaneously supply ≤128-bit results to both 128-bit ALU halves, thus avoiding the need for cross-128-bit transfers.

The _mm256_setr_epi32 by itself would indeed be very inefficient, but clang manages to vectorize it[1] to vpaddd+vpsllvd, which require some constant loads (also it generates some weird blends, idk).

[1]: https://godbolt.org/z/7jq4z39GT - L833-847 or so in the assembly, or on L67 in the source, right click → "Reveal linked code"



