
I felt moved to comment on the thread. HN user burntsushi was kind enough to write up a trick we developed here on the Hyperscan team that should be considerably faster than this system, assuming there's enough data to make using SIMD pay off (it would be a loss if the sizes are small enough).

It's possible that a SIMD implementation (assuming you routinely pass over enough bytes) can blow this away. I'm not sure where the crossover point is - it depends on whether you fall out of this code quickly or not. Obviously, using a big SIMD sequence to replace a "2 cycles per byte" implementation is pretty stupid if the SIMD sequence runs at "0.5 cycles per byte" but its fixed costs mean you pay >=16 cycles and the typical case is small.

My comment follows:

It's also not clear that you actually need a SIMD instruction - if you've squashed your hotspot to the point that adding the complexity of per-platform SIMD specialization is only a small win, then why bother?

All that being said...

I've been really lazy about writing this up, but someone kindly did a decent write-up for us in the Rust reimplementation of our "Teddy" literal matcher. Please see https://github.com/rust-lang/regex/blob/master/src/simd_acce... if interested, or look up the Hyperscan project at https://github.com/01org/hyperscan for a few different versions of this trick.


The trick, very briefly:

Use two 16-byte values in SIMD registers as 16-entry lookup tables with "8 buckets" (really just bits) each.

Look up those two separate tables with the high nibble (bits 4..7) and the low nibble (bits 0..3) of your input (you can do this in parallel 16/32/64 ways depending on whether you are using SSSE3/AVX2/AVX512).

AND together the table results (or OR them, depending on the sense of the table).

PMOVMSKB (or VPMOVMSKB) to extract the results back into GPR-land.

So if you want to match, say, 0x12 and 0xab at once, assign 0x12 to 'bucket #0' and 0xab to 'bucket #1'.

High-nibble table gives you 0x1 -> 0x1 (that is, a high nibble of 0x1 selects bucket #0) and 0xa -> 0x2.

Low-nibble table gives you 0x2 -> 0x1 and 0xb -> 0x2.

Tables are otherwise zero.

You can now look up lots of different useful character classes (not all! that takes three shuffle tables) in about 6 instructions. Sadly the instructions form a long linear chain of dependencies, so latency is a bit worse than throughput, but we've built some fun stuff around them.
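
To make the worked example concrete, here is a minimal SSSE3 sketch in C (compile with -mssse3). The table contents follow the example above; the names and the test input are my own, and this is a simplification rather than the actual Hyperscan/Teddy code:

    /* Nibble-table trick: match 0x12 (bucket #0) and 0xab (bucket #1). */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* High-nibble table: 0x1 -> bucket #0 (0x01), 0xa -> bucket #1 (0x02). */
        const __m128i hi_table = _mm_setr_epi8(
            0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0, 0);
        /* Low-nibble table: 0x2 -> bucket #0 (0x01), 0xb -> bucket #1 (0x02). */
        const __m128i lo_table = _mm_setr_epi8(
            0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0);

        const uint8_t buf[16] = { 'x','x','x','x', 0x12, 'x','x','x','x',
                                  0xab, 'x','x','x','x','x','x' };
        __m128i input = _mm_loadu_si128((const __m128i *)buf);

        /* Split each byte into nibbles; the shift works on 16-bit lanes,
           so mask off the bits that leak across byte boundaries. */
        const __m128i nib_mask = _mm_set1_epi8(0x0f);
        __m128i lo = _mm_and_si128(input, nib_mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(input, 4), nib_mask);

        /* PSHUFB each nibble through its table, then AND: a byte matches a
           bucket only if both of its nibbles vote for that bucket. */
        __m128i match = _mm_and_si128(_mm_shuffle_epi8(hi_table, hi),
                                      _mm_shuffle_epi8(lo_table, lo));

        /* PMOVMSKB back into GPR-land: one bit per byte of the result. */
        int zeros = _mm_movemask_epi8(
            _mm_cmpeq_epi8(match, _mm_setzero_si128()));
        printf("match mask: 0x%04x\n", (~zeros) & 0xffff); /* 0x0210 */
        return 0;
    }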


You don't even have to go to SIMD and lookup tables: on 64-bit architectures you can process 8 bytes at a time using a few bit tricks. Lookup tables work great in microbenchmarks because the tables are always in cache, but not so great if there's cache pressure from other code.

See this implementation of JSON escaping for example: https://github.com/facebook/folly/commit/2f0cabfb48b8a8df84f...

De-escaping is even simpler because you only have to detect two characters, \ and ".
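
For a flavor of the 8-bytes-at-a-time approach, here is a SWAR sketch in C using the standard exact zero-byte detector - an illustration with made-up names, not the folly code itself:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define ONES 0x0101010101010101ULL
    #define LOWS 0x7f7f7f7f7f7f7f7fULL

    /* High bit set in each byte of v that equals c (exact at every
       position: no carries cross byte boundaries). */
    static uint64_t match_byte(uint64_t v, uint8_t c) {
        uint64_t x = v ^ (ONES * (uint64_t)c); /* bytes equal to c -> 0x00 */
        return ~(((x & LOWS) + LOWS) | x | LOWS);
    }

    int main(void) {
        const char *s = "say \"hi\""; /* exactly 8 bytes */
        uint64_t w;
        memcpy(&w, s, 8);
        uint64_t hits = match_byte(w, '"') | match_byte(w, '\\');
        if (hits) /* little-endian: lowest set bit marks the first hit */
            printf("first escape char at byte %d\n",
                   (int)(__builtin_ctzll(hits) >> 3));
        return 0;
    }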


The bit tricks are OK. We used them extensively back when we supported platforms that lacked SSSE3, but they are way slower than the SIMD techniques because of all the bit-bashing, and they are far less flexible.

The technique I outline can match most character classes you run into (and there's a 3-PSHUFB variant that can handle all predicates on bytes), while the bit-twiddling version gets linearly slower as you add more cases. If there's a way to get at what we did with general-purpose operations and bit-twiddling hacks, I'm too stupid to think of it.


Well, of course: they're different trade-offs. Here we don't care about flexibility; we're talking about a specific syntax (JSON).

Also, SSE will be beneficial only if your strings are long enough. I'd be curious to benchmark my implementation against an SSE one on a representative dataset.

EDIT: the microbenchmark indicates that the bit-trick implementation spends about 1 cycle per byte (for the entire escaping routine, if there's nothing to escape).


1 cycle per byte is pretty good, and its properties might be very nice for small strings.

Assuming you can prep everything in advance, the SIMD version is 3 loads and about 5 SIMD operations (1 shift, 2 PANDs, 2 PSHUFBs), plus a couple of transfer/scan operations (PMOVMSKB and whatever bit-scan operation you want), to eat 32 bytes. I'm figuring 10 operations. Depending on your workload you might care about reciprocal throughput or latency; generally our work has been more throughput-oriented, so those 10 operations can be scheduled along with lots of others, but this isn't always the case.

The other way SIMD (or, for that matter, bit-twiddling) can be helpful is that you can cover a wider variety of sizes in a branch-free fashion, so if your sizes are distributed <32 bytes (or <8 for bit-twiddling) the code can be straight-line with no branch misses. So a SIMD version of the code might be "slower" than a bit-twiddling version even if both sides always get 9 bytes, but if it's a mix of sizes and thus unpredictable, the SIMD version might take fewer branch misses (unless some clown starts giving you 33-byte inputs half the time, of course).
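
To illustrate the branch-free-sizes point, a sketch in C assuming the buffer is safe to over-read out to 16 bytes (e.g. padded); the function name and the two-character predicate are just for illustration:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Bitmask of positions in buf[0..len) holding '"' or '\\'; len <= 16.
       Always scans a full 16-byte block, then masks the result down to the
       real length, so every size from 0..16 takes the same straight-line
       path with no branch on len. */
    static unsigned find_escapes16(const uint8_t *buf, size_t len) {
        __m128i v = _mm_loadu_si128((const __m128i *)buf);
        __m128i hits = _mm_or_si128(
            _mm_cmpeq_epi8(v, _mm_set1_epi8('"')),
            _mm_cmpeq_epi8(v, _mm_set1_epi8('\\')));
        unsigned mask = (unsigned)_mm_movemask_epi8(hits);
        /* Works for len == 16 too, since 1u << 16 == 0x10000. */
        return mask & ((1u << len) - 1);
    }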

Again, benchmark data is always a good cure for folkloric beliefs, whether pro-SIMD ("SIMD is always fast, yay!") or anti-SIMD ("the overheads are never worth it").
