The VSHRN trick is nice (I used it only two hours ago!), but it really does feel like a crutch. I don't understand why they couldn't simply implement a PMOVMSKB-like instruction to begin with; it cannot possibly be very expensive in silicon, at least not if it wrote its result to a vector register. One bit per byte is really the sweet spot for almost any kind of text manipulation, and it often needs less setup and post-fixup on either side of the PMOVMSKB/VSHRN.
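
For anyone who hasn't seen the two side by side, here is a rough scalar model in C of what each one produces from a byte-wise compare result. It is only a sketch, not real intrinsics: all function names (`eq_mask`, `movemask_model`, `shrn_model`, and so on) are made up for illustration, and it assumes the compare yields 0x00/0xFF per byte and that the 16 bytes are viewed as little-endian 16-bit lanes, shifted right by 4 and narrowed to 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise compare: 0xFF where in[i] == needle, 0x00 elsewhere. */
static inline void eq_mask(const uint8_t in[16], uint8_t needle, uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (in[i] == needle) ? 0xFF : 0x00;
}

/* Scalar model of PMOVMSKB: top bit of each byte, one bit per byte. */
static inline uint16_t movemask_model(const uint8_t v[16]) {
    uint16_t m = 0;
    for (int i = 0; i < 16; i++)
        m |= (uint16_t)((v[i] >> 7) << i);
    return m;
}

/* Scalar model of the VSHRN trick: each 16-bit lane is shifted right
   by 4 and narrowed to 8 bits, giving four bits per input byte. */
static inline uint64_t shrn_model(const uint8_t v[16]) {
    uint64_t m = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t lane = (uint16_t)(v[2 * i] | (v[2 * i + 1] << 8));
        uint8_t narrowed = (uint8_t)(lane >> 4);
        m |= (uint64_t)narrowed << (8 * i);
    }
    return m;
}

/* Portable count-trailing-zeros; returns 64 for x == 0. */
static inline int ctz64(uint64_t x) {
    int n = 0;
    while (n < 64 && !((x >> n) & 1))
        n++;
    return n;
}

/* Index of the first matching byte, or 16 if there is none. */
static inline int first_match_movemask(const uint8_t cmp[16]) {
    uint16_t m = movemask_model(cmp);
    return m ? ctz64(m) : 16;
}

static inline int first_match_shrn(const uint8_t cmp[16]) {
    /* Extra >> 2 converts a bit position into a byte index. */
    return ctz64(shrn_model(cmp)) >> 2;
}

/* Small self-check: both models agree on where the comma is. */
static inline void self_check(void) {
    const uint8_t in[16] = {'h','e','l','l','o',',',' ','w',
                            'o','r','l','d','!','!','!','!'};
    uint8_t cmp[16];
    eq_mask(in, ',', cmp);
    assert(first_match_movemask(cmp) == 5);
    assert(first_match_shrn(cmp) == 5);
}
```

The `>> 2` in `first_match_shrn` is a small instance of the post-fixup in question: with four bits per byte, every bit position has to be scaled back to a byte index, and things like popcount need a similar adjustment.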