
Well of course, they're different trade-offs. Here we don't care about flexibility; we're talking about a specific syntax (JSON).

Also SSE will be beneficial only if your strings are long enough. I'd be curious to benchmark my implementation against an SSE one on a representative dataset.

EDIT: the microbenchmark indicates that the bit-trick implementation spends about 1 cycle per byte (for the entire escaping routine, if there's nothing to escape).
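For concreteness, here's a sketch of the kind of bit trick involved (my guess at the general approach, not necessarily the implementation benchmarked above): classify all 8 bytes of a word at once, flagging the bytes JSON requires escaping ('"', '\', and control bytes below 0x20), using the standard SWAR has-zero/has-less word tricks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* High bit of a byte is set iff that byte is zero. (Bytes above the
   first match can be false positives due to borrow propagation, so
   only the least significant set bit is reliable for positioning.) */
static uint64_t has_zero(uint64_t v) { return (v - ONES) & ~v & HIGHS; }

/* High bit set iff the byte equals c. */
static uint64_t has_byte(uint64_t v, uint8_t c) { return has_zero(v ^ (ONES * c)); }

/* High bit set iff the byte is < n (valid for n <= 128). */
static uint64_t has_less(uint64_t v, uint8_t n) { return (v - ONES * n) & ~v & HIGHS; }

/* Nonzero iff any of the 8 bytes needs JSON escaping. */
static uint64_t needs_escape(uint64_t v) {
    return has_byte(v, '"') | has_byte(v, '\\') | has_less(v, 0x20);
}

/* Index of the first byte needing escaping (little-endian load),
   only meaningful when needs_escape(v) != 0. */
static int first_escape_index(uint64_t v) {
    return __builtin_ctzll(needs_escape(v)) >> 3;
}
```

In the common all-clean case this is a handful of ALU operations and a single branch per 8 bytes, which is at least consistent with a figure on the order of 1 cycle per byte.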




1 cycle per byte is pretty good, and its properties might be very nice for small strings.

Assuming you can prep everything in advance, the SIMD version is 3 loads and about 5 SIMD operations (1 shift, 2 PANDs, 2 PSHUFBs) and a couple transfer/scan operations (PMOVMSKB and whatever bit scan operation you want) to eat 32 bytes. I'm figuring 10 operations. Depending on your workload you might care about reciprocal throughput or latency. Generally our work has been more throughput oriented so those 10 operations can be scheduled along with lots of others, but this isn't always the case.
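For illustration, here is a plainer SSE2 take on the same classification step. It's not the exact shift/PAND/PSHUFB nibble-lookup sequence described above (byte compares instead of table lookups, and 16 bytes rather than 32), but it ends the same way: PMOVMSKB gathers the lane results into a scalar mask, and a bit scan finds the first offending byte.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <string.h>

/* 16-bit mask: bit i is set iff p[i] needs JSON escaping
   ('"', '\\', or a control byte < 0x20). Reads 16 bytes from p. */
static unsigned escape_mask16(const uint8_t *p) {
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i q  = _mm_cmpeq_epi8(v, _mm_set1_epi8('"'));
    __m128i bs = _mm_cmpeq_epi8(v, _mm_set1_epi8('\\'));
    /* unsigned v < 0x20  <=>  min(v, 0x1F) == v */
    __m128i ctl = _mm_cmpeq_epi8(_mm_min_epu8(v, _mm_set1_epi8(0x1F)), v);
    /* PMOVMSKB collapses the per-byte compare results to one bit each;
       a bit scan (__builtin_ctz) on the result locates the first hit. */
    return (unsigned)_mm_movemask_epi8(_mm_or_si128(_mm_or_si128(q, bs), ctl));
}
```

The AVX2 version is the same shape with 256-bit types, eating 32 bytes per iteration as described above.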

The other way SIMD (or, for that matter, bit-twiddling) can be helpful is that you can cover a wider variety of sizes in a branch-free fashion, so if your sizes are distributed below 32 bytes (or below 8 for bit-twiddling) the code can be straight-line with no branch misses. So a SIMD version of the code might be "slower" than a bit-twiddling version even if both versions always get 9 bytes, but if it's a mix of sizes, and thus unpredictable, it might take fewer branch misses (unless some clown starts giving you 33-byte inputs half the time, of course).
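At the 8-byte bit-twiddling scale, the pad-and-mask idea might look like this (a hypothetical helper, little-endian assumed): copy the short input into a zeroed buffer, classify all 8 lanes unconditionally, then discard the lanes past the real length, so the same straight-line code handles every length from 1 to 8 with no content-dependent branches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Branch-free: do any of the first len (1..8) bytes of p need JSON
   escaping? Same straight-line code for every short length. */
static int short_needs_escape(const uint8_t *p, size_t len) {
    uint8_t buf[8] = {0};          /* pad with zeros... */
    memcpy(buf, p, len);
    uint64_t v;
    memcpy(&v, buf, 8);            /* little-endian: byte i -> bits 8i..8i+7 */
    uint64_t xq    = v ^ (ONES * '"');
    uint64_t xbs   = v ^ (ONES * '\\');
    uint64_t eq_q  = (xq  - ONES) & ~xq  & HIGHS;      /* byte == '"'  */
    uint64_t eq_bs = (xbs - ONES) & ~xbs & HIGHS;      /* byte == '\\' */
    uint64_t ctl   = (v - ONES * 0x20) & ~v & HIGHS;   /* byte <  0x20 */
    uint64_t mask  = eq_q | eq_bs | ctl;
    /* ...then drop lanes past len, since the zero padding would
       otherwise register as control bytes. Requires 1 <= len <= 8. */
    mask &= ~0ULL >> (8 * (8 - len));
    return mask != 0;
}
```

Borrows in the SWAR tricks only propagate upward from genuine matches, so the any-match answer within the kept lanes stays exact.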

Again, benchmark data is always a good cure for folkloric beliefs whether pro-SIMD ("SIMD is always fast, yay!") or anti-SIMD ("the overheads are never worth it").



