
I felt moved to comment on the thread. HN user burntsushi was kind enough to write up a trick we developed here on the Hyperscan team that should be considerably faster than this system, assuming there's enough data to make using SIMD pay off (it would be a loss if the sizes are small enough).

It's possible that a SIMD implementation (assuming you routinely pass over enough bytes) can blow this away. I'm not sure where the crossover point is - it depends on whether you fall out of this code quickly or not. Obviously, using a big SIMD sequence to replace a "2 cycles per byte" implementation is pretty stupid if the SIMD sequence runs at "0.5 cycles per byte" but its fixed costs mean you pay >=16 cycles and the typical case is small.

My comment follows:

It's also not clear that you actually need a SIMD instruction - if you've squashed your hotspot to the point that adding the complexity of per-platform SIMD specialization is only a small win, then why bother?

All that being said...

I've been really lazy about writing this up, but someone kindly did a decent write-up for us in the Rust reimplementation of our "Teddy" literal matcher. Please see https://github.com/rust-lang/regex/blob/master/src/simd_acce... if interested, or look up the Hyperscan project at https://github.com/01org/hyperscan for a few different versions of this trick.


The trick, very briefly:

Use two 16-byte values in SIMD registers as 16-entry lookup tables with "8 buckets" (really just bits) each.

Look up those two separate tables with the high nibble (bits 4..7) and the low nibble (bits 0..3) of your input (you can do this in parallel 16/32/64 ways depending on whether you are using SSSE3/AVX2/AVX512).

AND together the table results (or OR them, depending on the sense of the table).

PMOVMSKB (or VPMOVMSKB) to extract the results back into GPR-land.

So if you want to match, say, 0x12 and 0xab at once, assign 0x12 to 'bucket #0' and 0xab to 'bucket #1'.

High-nibble table gives you 0x1 -> 0x1 (that is, a high nibble of 0x1 selects bucket #0) and 0xa -> 0x2.

Low-nibble table gives you 0x2 -> 0x1 and 0xb -> 0x2.

Tables are otherwise zero.

You can now look up lots of different useful character classes (not all! that takes three shuffle tables) in about 6 instructions. Sadly the instructions form a long linear chain of dependencies, so latency is a bit worse than throughput, but we've built some fun stuff around them.
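
To make the worked example concrete, here is a minimal SSSE3 sketch in C (compile with -mssse3). The table contents follow the example above; the names and the test input are my own, and this is a simplification rather than the actual Hyperscan/Teddy code:

    /* Nibble-table trick: match 0x12 (bucket #0) and 0xab (bucket #1). */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* High-nibble table: 0x1 -> bucket #0 (0x01), 0xa -> bucket #1 (0x02). */
        const __m128i hi_table = _mm_setr_epi8(
            0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0, 0);
        /* Low-nibble table: 0x2 -> bucket #0 (0x01), 0xb -> bucket #1 (0x02). */
        const __m128i lo_table = _mm_setr_epi8(
            0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0);

        const uint8_t buf[16] = { 'x','x','x','x', 0x12, 'x','x','x','x',
                                  0xab, 'x','x','x','x','x','x' };
        __m128i input = _mm_loadu_si128((const __m128i *)buf);

        /* Split each byte into nibbles; the shift works on 16-bit lanes,
           so mask off the bits that leak across byte boundaries. */
        const __m128i nib_mask = _mm_set1_epi8(0x0f);
        __m128i lo = _mm_and_si128(input, nib_mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(input, 4), nib_mask);

        /* PSHUFB each nibble through its table, then AND: a byte matches a
           bucket only if both of its nibbles vote for that bucket. */
        __m128i match = _mm_and_si128(_mm_shuffle_epi8(hi_table, hi),
                                      _mm_shuffle_epi8(lo_table, lo));

        /* PMOVMSKB back into GPR-land: one bit per byte of the result. */
        int zeros = _mm_movemask_epi8(
            _mm_cmpeq_epi8(match, _mm_setzero_si128()));
        printf("match mask: 0x%04x\n", (~zeros) & 0xffff); /* 0x0210 */
        return 0;
    }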


You don't even have to go to SIMD and lookup tables: on 64-bit architectures you can process 8 bytes at a time using a few bit tricks. Lookup tables work great in microbenchmarks because the tables are always in cache, but not so great if there's cache pressure from other code.

See this implementation of JSON escaping for example: https://github.com/facebook/folly/commit/2f0cabfb48b8a8df84f...

De-escaping is even simpler because you only have to detect two characters, \ and ".
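
For a flavor of the 8-bytes-at-a-time approach, here is a SWAR sketch in C using the standard exact zero-byte detector - an illustration with made-up names, not the folly code itself:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define ONES 0x0101010101010101ULL
    #define LOWS 0x7f7f7f7f7f7f7f7fULL

    /* High bit set in each byte of v that equals c (exact at every
       position: no carries cross byte boundaries). */
    static uint64_t match_byte(uint64_t v, uint8_t c) {
        uint64_t x = v ^ (ONES * (uint64_t)c); /* bytes equal to c -> 0x00 */
        return ~(((x & LOWS) + LOWS) | x | LOWS);
    }

    int main(void) {
        const char *s = "say \"hi\""; /* exactly 8 bytes */
        uint64_t w;
        memcpy(&w, s, 8);
        uint64_t hits = match_byte(w, '"') | match_byte(w, '\\');
        if (hits) /* little-endian: lowest set bit marks the first hit */
            printf("first escape char at byte %d\n",
                   (int)(__builtin_ctzll(hits) >> 3));
        return 0;
    }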


The bit tricks are OK. We used them extensively back when we supported platforms that lacked SSSE3, but they are way slower than the SIMD techniques because of all the bit-bashing, and they are far less flexible.

The technique I outline can match most character classes you run into (and there's a 3-PSHUFB variant that can handle all predicates on bytes), while the bit-twiddling version gets linearly slower as you add more cases. If there's a way to get at what we did with general-purpose operations and bit-twiddling hacks, I'm too stupid to think of it.


Well, of course: they're different trade-offs. Here we don't care about flexibility; we're talking about a specific syntax (JSON).

Also, SSE will be beneficial only if your strings are long enough. I'd be curious to benchmark my implementation against an SSE one on a representative dataset.

EDIT: the microbenchmark indicates that the bit-trick implementation spends about 1 cycle per byte (for the entire escaping routine, if there's nothing to escape).


1 cycle per byte is pretty good, and its properties might be very nice for small strings.

Assuming you can prep everything in advance, the SIMD version is 3 loads and about 5 SIMD operations (1 shift, 2 PANDs, 2 PSHUFBs), plus a couple of transfer/scan operations (PMOVMSKB and whatever bit-scan operation you want), to eat 32 bytes. I'm figuring 10 operations. Depending on your workload you might care about reciprocal throughput or latency; generally our work has been more throughput-oriented, so those 10 operations can be scheduled along with lots of others, but this isn't always the case.

The other way SIMD (or, for that matter, bit-twiddling) can be helpful is that you can cover a wider variety of sizes in a branch-free fashion, so if your sizes are distributed <32 bytes (or <8 for bit-twiddling) the code can be straight-line with no branch misses. So a SIMD version of the code might be "slower" than a bit-twiddling version even if both sides always get 9 bytes, but if it's a mix of sizes and thus unpredictable, the SIMD version might take fewer branch misses (unless some clown starts giving you 33-byte inputs half the time, of course).
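
To illustrate the branch-free-sizes point, a sketch in C assuming the buffer is safe to over-read out to 16 bytes (e.g. padded); the function name and the two-character predicate are just for illustration:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Bitmask of positions in buf[0..len) holding '"' or '\\'; len <= 16.
       Always scans a full 16-byte block, then masks the result down to the
       real length, so every size from 0..16 takes the same straight-line
       path with no branch on len. */
    static unsigned find_escapes16(const uint8_t *buf, size_t len) {
        __m128i v = _mm_loadu_si128((const __m128i *)buf);
        __m128i hits = _mm_or_si128(
            _mm_cmpeq_epi8(v, _mm_set1_epi8('"')),
            _mm_cmpeq_epi8(v, _mm_set1_epi8('\\')));
        unsigned mask = (unsigned)_mm_movemask_epi8(hits);
        /* Works for len == 16 too, since 1u << 16 == 0x10000. */
        return mask & ((1u << len) - 1);
    }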

Again, benchmark data is always a good cure for folkloric beliefs, whether pro-SIMD ("SIMD is always fast, yay!") or anti-SIMD ("the overheads are never worth it").
