I think this is marginally more expensive than my suggestion, but I'm not 100% sure, and I think the critical paths are similar. My suggestion requires (after the load):
1 shift (4 bits right); unfortunately, no byte-granularity shift is available.
2 PAND operations to ensure that the 'magic high bit' (0x80) is switched off (this bit is magic to PSHUFB)
2 PSHUFB operations to look up a table
1 PAND to combine the two table lookups.
... and then rejoins your implementation at the VPMOVMSKB.
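To make the instruction count above concrete, here is a rough scalar C model of the sequence, not real SIMD code. Each helper stands in for one SSE instruction on a 16-byte vector. The `v128` type, the helper names, and the example tables in `make_tables` (which flag the bytes 0x0a, 0x0d and 0x20) are my own assumptions for illustration; your tables will differ.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte vector type standing in for an XMM register. */
typedef struct { uint8_t b[16]; } v128;

static v128 splat(uint8_t x) { v128 r; memset(r.b, x, 16); return r; }

/* Models PSRLW: logical right shift of each 16-bit little-endian lane.
   Bits from the upper byte of a lane leak into the lower byte. */
static v128 psrlw(v128 a, int n) {
    v128 r;
    for (int i = 0; i < 16; i += 2) {
        uint16_t w = (uint16_t)(a.b[i] | (a.b[i + 1] << 8));
        w = (uint16_t)(w >> n);
        r.b[i] = (uint8_t)(w & 0xff);
        r.b[i + 1] = (uint8_t)(w >> 8);
    }
    return r;
}

/* Models PAND: bytewise AND. */
static v128 pand(v128 a, v128 b) {
    v128 r;
    for (int i = 0; i < 16; i++) r.b[i] = a.b[i] & b.b[i];
    return r;
}

/* Models PSHUFB: an index byte with bit 0x80 set yields 0,
   otherwise the low 4 bits select a byte from the table. */
static v128 pshufb(v128 table, v128 idx) {
    v128 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = (idx.b[i] & 0x80) ? 0 : table.b[idx.b[i] & 0x0f];
    return r;
}

/* Example tables (an assumption): class bit 0x01 covers 0x0a and 0x0d,
   class bit 0x02 covers 0x20. */
static void make_tables(v128 *lo_table, v128 *hi_table) {
    *lo_table = splat(0);
    *hi_table = splat(0);
    lo_table->b[0x0a] |= 0x01;  /* low nibble a, high nibble 0 */
    lo_table->b[0x0d] |= 0x01;  /* low nibble d, high nibble 0 */
    hi_table->b[0x00] |= 0x01;
    lo_table->b[0x00] |= 0x02;  /* low nibble 0, high nibble 2 */
    hi_table->b[0x02] |= 0x02;
}

/* 1 shift, 2 PAND, 2 PSHUFB, 1 PAND. A nonzero output byte means
   the input byte is in one of the classes. */
static v128 classify(v128 in, v128 lo_table, v128 hi_table) {
    v128 nib = splat(0x0f);
    v128 lo = pand(in, nib);            /* PAND 1: low nibble, 0x80 cleared */
    v128 hi = pand(psrlw(in, 4), nib);  /* shift, PAND 2: drop leaked bits */
    return pand(pshufb(lo_table, lo),   /* PSHUFB 1 */
                pshufb(hi_table, hi));  /* PSHUFB 2, then the combining PAND */
}
```

The model also shows why the second PAND cannot be dropped: because the shift works on 16-bit lanes, the byte 0x0a followed by 0x0d shifts to 0xd0, which has the 0x80 bit set and would make PSHUFB return zero for that position.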
I don't think your instructions are high latency unless you are using something surprising: VPOR/VPXOR/VPCMP*B are all latency 1 on most recent Intel architectures.
Whether these techniques help depends on context. I'd be surprised if vector operations do that well against short-to-medium writes. We found that a few of our sequences that are designed to 'find the first match' saw no benefit from moving from SSE to AVX2 due to similar effects.