I think this is marginally more expensive than my suggestion, but I'm not 100% s...

I think this is marginally more expensive than my suggestion, but I'm not 100% sure, and I think the critical paths are similar. My suggestion requires (after the load):

1 shift (4 bits right), unfortunately no byte-granularity shift is available.

2 PAND operations to ensure that the 'magic high bit' (0x80) is switched off (this bit is magic to PSHUFB)

2 PSHUFB operations to look up a table

1 PAND to combine the table.

... and then rejoins your implementation at the VPMOVMSKB.

I don't think your instructions are high latency unless you are using something surprising - VPOR/VPXOR/VCMP*B are all latency 1 on most recent Intel architecture operations.

Whether these techniques depends on context. I'd be surprised if vector operations do that well against short-to-medium writes. We found that a few of our sequences that are design to 'find the first match' saw no benefit from moving from SSE to AVX2 due to similar effects.