Oh yeah, I did forget that sign-extend-from-memory, despite being split into uops, is still more efficient than literally doing the load and the sign-extend as separate instructions (some other similar things include memory insert/extract, and, perhaps most significantly, broadcast; results vary across Intel & AMD). I imagine the memory subsystem is able to simultaneously supply ≤128-bit results to both 128-bit ALU halves, thus avoiding the need for cross-128-bit transfers.
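For concreteness, here's a rough sketch of the sign-extend-from-memory pattern I mean, assuming GCC or clang on x86-64. The function names are made up for illustration, and whether the 8-byte load actually gets folded into the memory operand of vpmovsxbd is up to the compiler (it usually is at -O2, but check the asm).

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Load 8 int8 values and sign-extend them to 8 int32 lanes.
   Written as load + extend; compilers can typically fold the load
   into the memory operand of vpmovsxbd (assumption, not guaranteed). */
__attribute__((target("avx2")))
static void sext8_avx2(const int8_t *p, int32_t out[8]) {
    __m128i bytes = _mm_loadl_epi64((const __m128i *)p); /* 8-byte load */
    __m256i wide = _mm256_cvtepi8_epi32(bytes);          /* vpmovsxbd */
    _mm256_storeu_si256((__m256i *)out, wide);
}

/* Scalar reference: the same conversion, one element at a time. */
static void sext8_scalar(const int8_t *p, int32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = p[i];
}

/* Hypothetical self-check: the vector path must match the scalar one.
   The vector path is skipped on CPUs without AVX2. */
static void sext8_check(const int8_t *p) {
    int32_t ref[8], got[8];
    sext8_scalar(p, ref);
    if (!__builtin_cpu_supports("avx2"))
        return;
    sext8_avx2(p, got);
    for (int i = 0; i < 8; i++)
        assert(got[i] == ref[i]);
}
```

This only shows what the operation computes; it says nothing about the uop counts, which you'd have to measure per microarchitecture.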
The _mm256_setr_epi32 by itself would indeed be very inefficient, but clang manages to vectorize it[1] to vpaddd+vpsllvd, which require some constant loads (also it generates some weird blends, idk).
[1]: https://godbolt.org/z/7jq4z39GT - L833-847 or so in the assembly, or on L67 in the source, right click → "Reveal linked code"