Side note: people have been asking how I made the diagrams and whatnot. I'll put a little section about that in the next article. For now, though:
The two main diagrams, depicting struct layout and then the 5-pane register walk through, were done in Visio, then saved as SVG, with fonts edited directly in Vim (took a few cycles to figure out which fonts looked best across all devices).
The charts were done in Excel, connected to the CSV data files produced by the benchmark utility, then wired up using a PivotTable. A PivotChart was created, Save As -> PDF, edit PDF in Inkscape, delete surrounding border, crop canvas to fit diagram, Save As -> SVG. Edit SVG fonts in Vim, voila, nice vectorized graph!
Sounds laborious, and it took forever to get the flow down, but once I'd settled on a process it was pretty easy to whip up new graphs.
Thanks! Joe Walnes should get credit for the underlying HTML and CSS and site design and whatnot... I lovingly pilfered[1] his websocketd[2] site a while back for PyParallel[3] and liked it so much I decided to use it for this personal site as well.
The SIMD units of the x86-64 CPU are not discussed enough, IMO. And every time they are discussed, I'm blown away by their efficiency at solving certain tasks.
---------------
Full disclosure: I don't understand what you've done yet. I've just skimmed the article really quickly, but I already have some questions. Hopefully you haven't already answered them elsewhere, lol.
* Any reason to highly optimize "strings of length 16" instead of "strings of length 32"? The vast majority of your code uses "xmm" registers (16 bytes), but it seems like it would be straightforward to target ymm registers (32 bytes) instead.
Was there any particular reason why you chose 16 bytes instead of 32 bytes? Like maybe targeting AVX but not AVX2? (Then again, I see a rare ymm register here and there... so I guess I'm just a bit confused by the design decision.)
* SSE4.2 has string-specific instructions. Did you look into any of the SSE4.2 string instructions to see if they'd help in your case? (Ironically, they'd force you back down to xmm registers, since it's SSE only -- so, kind of a contradiction with my first question, lol.) I fully admit that I don't actually know how to use the SSE4.2 string instructions, so don't read into my question too hard.
------------
I'll have to spend some time studying your code and explanation. Thank you very much for the writeup! It will be helpful for my SIMD-studies. There really isn't enough material out there on AVX / SSE optimizations.
> Was there any particular reason why you chose 16 bytes instead of 32 bytes? Like maybe targeting AVX but not AVX2? (Then again, I see a rare ymm register here and there... so I guess I'm just a bit confused by the design decision.)
Yeah, my target use case rarely had strings longer than 16 characters (which is quite common for prefix matching problems in the real world), so 16 made the most sense. I'll probably do an AVX2 and AVX-512 version down the track; as I'll have 32-byte and 64-byte registers to play with, it will allow me to do some more interesting things (either longer prefix strings, or more comparisons).
> * SSE4.2 has string-specific instructions. Did you look into any of the SSE4.2 string instructions to see if they'd help in your case? (Ironically, they'd force you back down to xmm registers, since it's SSE only -- so, kind of a contradiction with my first question, lol.) I fully admit that I don't actually know how to use the SSE4.2 string instructions, so don't read into my question too hard.
Instructions like pcmpistri were considered and explicitly rejected :-) They are surprisingly slow, and they only process 16 characters at a time, so you don't get the implicit benefits on longer strings that you'd expect if you could do something like a `rep pcmpistri` (akin to the accelerated `rep stosq`, etc.).
The first twelve instructions of the negative match logic execute in about 6 CPU cycles. A pcmpistri-type instruction will often clock in at around 7-14 cycles. So, with my approach, using "basically free" instructions like vpcmpeqb, vpcmpgtb, etc., I can detect that my input string doesn't have any prefix matches against a table of 16 prefixes in ~6 cycles, which is pretty neat.
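For readers following along, here's a minimal sketch of the shape of that filter in C intrinsics. This is my simplified reconstruction, not the article's actual STRING_TABLE (which keys on a per-slot unique character rather than the first character), and the MINI_TABLE/negative_match names are made up:

    #include <stdint.h>
    #include <emmintrin.h>  /* SSE2 intrinsics; compilers emit the vp*
                               forms when AVX is enabled */

    typedef struct {
        __m128i first_chars;  /* first character of each of the 16 slots */
        __m128i lengths;      /* length of each of the 16 slots */
    } MINI_TABLE;

    /* Returns 0 if no slot can possibly prefix-match the input (the fast
       path); otherwise, a bitmap of candidate slots needing full compares. */
    static int
    negative_match(const MINI_TABLE *t, const char *input, uint8_t input_len)
    {
        __m128i first = _mm_set1_epi8(input[0]);        /* broadcast char   */
        __m128i len   = _mm_set1_epi8((char)input_len); /* broadcast length */
        __m128i char_eq  = _mm_cmpeq_epi8(t->first_chars, first);
        __m128i too_long = _mm_cmpgt_epi8(t->lengths, len); /* len > input? */
        __m128i viable   = _mm_andnot_si128(too_long, char_eq);
        return _mm_movemask_epi8(viable);
    }

A slot survives only if its first character matches and it isn't longer than the input; if the returned bitmap is zero, there is provably no prefix match.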
One benefit of sticking with 16 bytes is that it works on any 64-bit machine, and you also don't run into the weird throttling issues that you do with AVX/AVX2/AVX-512.
So it is a lot easier to make something general purpose, where almost anyone can use it and it will always be faster.
He's using the "v" instructions, which require AVX, and probably AVX2 (I haven't memorized all the instructions... but they're at least AVX). Even if he sticks with 16 bytes, the use of the "v" instructions basically requires a relatively recent processor.
Yeah, it's actually a pretty inconsistent mish-mash of ISA usage. I could go straight SSE4.x, but I'd lose out on the vp* goodies you get with AVX and AVX2, even though I'm only using XMM registers.
The "length of 16, num elements is 16" parity is nice though and simplifies the logic of the algorithm.
Yeah I generally agree with regards to throttling/caps/license issues, especially re: AVX-512. AVX and AVX2 aren't too bad though. I'm not using computationally-intensive AVX instructions, just your standard vectorized and/not/or/test/cmp/gt etc. These are incredibly cheap from a CPU cycle perspective.
Note that some of the AVX instructions will slow down the rest of your CPU -- not a reason to avoid them, but you need to understand your workload (you can't just throw them willy-nilly into the code path).
Fun tidbits: a) I bombed, like, 6 coding interviews whilst writing this article. Two of them were related to string processing, too, which I thought was funny. (In a depressingly ironic kind of way.) No point to this, other than that I seemingly suck at the "leetcode" interview style companies seem to rely upon these days.
And b) there are at least two Easter eggs in this post that no-one seems to have noticed.
Oh, and c) a perk of religiously keeping line length to <= 80 characters... the code is surprisingly readable on mobile devices!
Here is a way to do it with a perfect hash table (so only one lookup).
Find a 128-bit mask that will be used to select bits of the input string. You want to minimize the number of 1 bits in the mask, since the table size is 2^(number of 1 bits). On the other hand, you need to pick enough 1 bits so that there is only one string per table entry.
AND this mask with the input string. COMPRESS these bits using the mask. The result is the index into the table. Each entry has a string to compare and a thermometer-coded mask (the number of 1 bits equals the length of the string in bits). AND the input string with the thermometer-coded mask and compare it with the string. If they match, you have a hit.
COMPRESS (or generalized extract) means to move selected bits to the right so that they are all next to each other and adjacent to the LSB. There is a clever algorithm for this; see section 7-4 of Hacker's Delight.
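To make that concrete, here's a sketch of the lookup half, narrowed from 128 to 64 bits (the idea is identical). The PEXT instruction mentioned downthread does the AND + COMPRESS in a single instruction. The SLOT layout and perfect_hash_match name are mine, and the offline construction of the bit mask and table is left out:

    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>  /* _pext_u64 requires BMI2 */

    typedef struct {
        uint64_t value;  /* the candidate string, zero-padded to 8 bytes   */
        uint64_t therm;  /* thermometer mask: 1 bits cover the string bits */
    } SLOT;

    /* bits selects the distinguishing bit positions; the table must have
       2^popcount(bits) slots, built so each slot holds at most one string. */
    static int
    perfect_hash_match(const SLOT *table, uint64_t bits, const char *input)
    {
        uint64_t word;
        memcpy(&word, input, sizeof(word));   /* assumes 8 readable bytes */
        const SLOT *s = &table[_pext_u64(word, bits)]; /* AND + COMPRESS */
        return (word & s->therm) == s->value; /* verify masked bytes match */
    }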
Excellent! I built a string matcher based on these same principles at Intel while working on Hyperscan, but never released it. I'm pleased to see that you have discovered and publicized this yourself, as now I can say the cat is out of the bag. :-)
You can use PEXT alone without even messing with SIMD as long as there's enough good distinguishing bits in the early (first/last 8) parts of the string. The only thing that complicates this is that, as in the example in the article, sometimes some strings are proper prefixes/suffixes of others (which one is problematic depends on which direction you're using).
But seriously though, I played around with compression a bit initially: https://github.com/tpn/tracer/blob/master/StringTable/String.... Referenced Hacker's Delight when I was working on it. Granted, this was just for compressing the USHORT lengths into BYTE-sized lengths.
All of the compress/expand routines mentioned in that Hacker's Delight section you're quoting seem to require upwards of 160 instructions. My negative match logic fast path only requires 12 instructions.
True. A gperf-like hash table (with special key indices) only works OK with compile-time-known strings. But a simple CMP-alike perfect hash table with double indirection would be worthwhile to test against.
I called it Hanov after http://stevehanov.ca/blog/index.php?id=119
The problem there is that you have to calculate a hash for each string, which is only fast with __builtin_crc/_mm_crc32_u64.
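For reference, the _mm_crc32_u64 path looks something like this -- a generic sketch (the crc_hash name is mine), assuming SSE4.2 hardware and no hash quality beyond "good enough for a table index":

    #include <stdint.h>
    #include <string.h>
    #include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 / _mm_crc32_u8 */

    static uint32_t
    crc_hash(const char *s, size_t len)
    {
        uint64_t crc = 0;
        while (len >= 8) {                /* 8 bytes per crc32 instruction */
            uint64_t chunk;
            memcpy(&chunk, s, sizeof(chunk));
            crc = _mm_crc32_u64(crc, chunk);
            s += 8;
            len -= 8;
        }
        while (len--)                     /* byte-at-a-time tail */
            crc = _mm_crc32_u8((uint32_t)crc, (uint8_t)*s++);
        return (uint32_t)crc;
    }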
A hash alone isn't going to help much; he has a 1-char string in the set, and to top it off, most of the other strings begin with the same string (a distinct one from the 1-char string). There are some tricks (overpopulating the hash table with short strings) but hashing is tricky here.
Source: 11 years of string matching work. :-) We built a multiple hash table string matcher in the early days of Hyperscan (it's gone now).
There is an Intel PEXT instruction, but I don't know its cycle count, and I think it's limited to 64 bits.
With PEXT the whole prefix match is four instructions.
PEXT is pretty cheap... but I still don't think I understand how what you're suggesting could optimally apply to prefix matching against all 16 strings at once, or why it would offer a substantial speed-up over the approach I've used. Can you elaborate?
I can't see how this avoids needing to do a comparison against the string in the slot once you've actually found the index though. I think I need to see this in C or asm to grok what the benefits are.
Is there any way to quickly compute a 32-bit hash code of a string using SIMD? If yes, then we get a very fast perfect hash map: https://github.com/tatumizer/pigeon_map. The hash code in question is not supposed to be of crypto quality; just "good enough" will be good enough :)
Where's all this newfound interest in hash maps coming from? Once you hash something, you lose the prefix information, making it kinda' useless for this activity. Unless you create a hash for every string length variation... but, that seems like a lot of overhead for very little gain.
I also can't see how you'd get that negative match fast path performance if you were reliant on hashes. That owes its speed to the length and unique char filter working in concert.
I was investigating why an earlier assembly version was so slow, then ended up writing a version that unequivocally beat all the C and previous assembly versions across the board (both prefix and negative matching, and worst case false-positive matching (where up to three comparisons needed to be done)).
TL;DR with optimized data structures specifically geared toward SIMD instructions, you can definitely get sizable performance improvements in both C with intrinsics, and then again with raw assembly.
> A reference implementation was written in C as a baseline, which simply looped through an array of strings, comparing each one, byte-by-byte, looking for a prefix match.
This is a poor choice; even the 1970s-era Boyer-Moore is significantly faster for most cases, as it avoids many pointless comparisons. And if you have a dedicated search table you can optimize the search even more (e.g., pre-sorting keys and targets, using a more dedicated structure like a trie, etc.).
I think these approaches would also be amenable to deliberate acceleration with intrinsics, as you did with your brute force search.
The super-simple baseline referred to in the intro was whipped up as a trivial example of the simplest approach you could take for the problem.
Trie was never in contention as I didn't want to rely on any algorithms that involved pointer chasing. There's this comment at the top of StringTable.h:
The design is optimized for relatively short strings (less than or equal to
16 chars), and relatively few of them (less than or equal to 16). These
restrictive size constraints facilitate aggressive SIMD optimizations when
searching for the strings within the table, with the goal to minimize the
maximum possible latency incurred by the lookup mechanism. The trade-off
is usability and flexibility -- two things which can be better served by
prefix trees if the pointer-chasing behavior of said data structures can
be tolerated.
I'm not sure Boyer-Moore is well suited to the specific use case I wanted to address. It is geared toward looking for occurrences of a pattern in a stream of text bytes, which isn't really what I'm doing at all. I have an input string of known length. I have up to 16 strings in a table, and I want to quickly check whether my input string "starts with or is equal to" any of those strings, with special emphasis on negative matching being on the fast path, and on preserving the location of the match in the table (so it can be cast to an enum, for example).
So, this article was specifically about the SIMD-oriented STRING_TABLE structure, not necessarily about the history of string processing. Another article comparing this approach to tries, Boyer-Moore or Aho-Corasick would make sense in the context of larger prefix table data sets (and a larger text corpus being searched, e.g. Wikipedia dumps).
What's the issue with "pointer-chasing"? That it might be inefficient in practice (despite preferable runtime complexity?) due to making poor use of CPU caches?
Consider that his algorithm completes a 16-string search in 20 cycles, or roughly 5 nanoseconds.
A single fetch from main memory takes ~50 nanoseconds, making it ~10x slower than the methodology discussed in this post. Even a fetch from L3 cache is typically 10 nanoseconds, while a fetch from L1 cache is ~1 ns, or ~4 cycles. This SIMD methodology is incredibly FAST.
The "best-case" scenario of this SIMD code is 6 cycles, which is faster than two L1 fetches. (!!!)
Note: two L1 fetches would likely go into the reorder buffer and be executed out of order in an efficient manner. But... ignore that, plz, since it undercuts the point I'm trying to make. Lol. Still, you can see how repeatedly forcing even L1 fetches would slow down the code, and "pointer chasing" is more likely to fill up the reorder buffer, since you cannot rely upon prefetchers to fetch the data ahead of time, or other such CPU tricks.
Algorithmic complexity means that the linear / SIMD methodology will eventually be slower with a big enough dataset. But when looking at small datasets and optimizing them to the maximum ability (i.e., searching 16 strings ASAP), linear / sequential scans are king. It's kind of like why insertion sort ends up winning in a lot of small cases over other sorts. CPUs and RAM are incredibly well optimized for sequential data scans.
I think with SIMD type stuff and small data sets (e.g. <=16 strings), conventional big-O complexity analysis can leave a bit to be desired.
Similarly, with SIMD and small data sets, you're going to be faster than anything involving a pointer-chasing data structure. Pointers tend to mean branchiness; they require more backend resources; they often introduce dependencies that can inhibit out-of-order pipelines.
In itself, pointer chasing does not mean misusing CPU caches (your dataset could fit in cache and you could be pointer chasing inside it).
The main issue that is strictly due to pointer chasing is that you risk having long dependency chains in your instruction stream (the address of each load depends on the result of the previous load, whose address depends on the result of the one before it, and so on).
This is pretty bad for a modern CPU since a lot of the speed comes from overlapping independent computations.
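A generic illustration of that chain (not from the article) -- the list walk serializes on memory latency, while the array scan lets loads overlap:

    #include <stddef.h>

    struct node { long value; struct node *next; };

    /* Pointer chasing: the address of each load comes out of the previous
       load, so iterations serialize at roughly one load latency each. */
    long sum_list(const struct node *n) {
        long sum = 0;
        for (; n != NULL; n = n->next)
            sum += n->value;
        return sum;
    }

    /* Sequential scan: every address is computable up front, so loads can
       overlap and the hardware prefetcher can run ahead. */
    long sum_array(const long *a, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }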
To the person who downvoted this: you're using downvotes wrongly. I was asking a question about computer science. Downvoting people who ask questions about computer science is not the way to foster the sort of conversation we're trying to encourage on HN.
Is this somewhere a Bloom filter (or a similar probabilistic data structure) could be used to improve performance? From a quick survey of cryptographic hash functions, something like BLAKE2 would serve your needs at ~4 ops/byte. If you had a wide enough Bloom filter, you could turn your check, in most cases, into a binary &.
Just a thought. The author here obviously already knows what they're doing and has likely thought of this but I'd be interested in hearing why this ASM-optimized approach was chosen instead.
The initial length + unique character test allows us to negative match in a way that is effectively identical to a Bloom filter.
If the initial negative match test indicates no match, then we can be certain there is not a match. Otherwise, we might have a match, and need to compare the candidate strings in more detail.
I can't imagine how hashing would help in either the fast-path negative match case, or the prefix match case. Hashing is sort of at odds with prefix matching, in my opinion.
Could you get good-enough performance with something simpler and more portable?
The optimized assembly is much more complicated and specialized. Porting across platforms/architectures is very difficult, and you lose out on the compiler's ability to make assumptions and optimize your code.
String length is a very poor probabilistic filter when you're specifically looking for substrings.
Also, if you already have a hash of your input string's prefix, and the hashes of all of your known strings, then the case where you actually need to check against your "known" strings becomes a fast search of a set of numbers, making the fail-through case of your code negligible. If you impose a limit on the strings in your "known" pool, the compiler can have a field day.
> Could you get good-enough performance with something simpler and more portable?
Hmmm. Did you read the article? Any of the C versions are relatively simple and portable. They basically work the same as the assembly ones and give great performance... I'm not sure what you're trying to get at.
> The optimized assembly is much more complicated and specialized.
Well, it's my article, I wanted to see if I could beat a PGO C compiler with all its optimizations turned on :-)
> Porting across platforms/architectures is very difficult, and you lose out on the compiler's ability to make assumptions and optimize your code.
I know the trade-offs that come with assembly. But if you're okay with them, you can get the best possible performance out of the underlying hardware. That was the goal of this article.
> String length is a very poor probabilistic filter when you're specifically looking for substrings.
I don't understand this comment. String length is a great filter: any string in the table that is longer than the input search string can't possibly be a prefix match, so those entries can be eliminated immediately.
> Also, if you already have a hash of your input string's prefix, and the hashes of all of your known strings, then the case where you actually need to check against your "known" strings becomes a fast search of a set of numbers, making the fail-through case of your code negligible. If you impose a limit on the strings in your "known" pool, the compiler can have a field day.
I don't know where this fascination with hashes comes from... have you looked at the implementation? Even when it has to do a string match, the timings are now around 13-14 cycles. The nature of the data structure means you're really only ever doing a max of 2 comparisons... 3 if someone is sending you worst-case data (and even with 3, you're only clocking ~20 cycles).
Hashing and prefix matching just don't play nicely together.
This is fine work, and well presented. The only flaw (which I've discussed on Twitter with Trent) is that the performance analysis against a single string at a time means we can't really analyze the effects of branch prediction on a realistic input (as the branch predictor will converge to 'perfection' almost immediately). I think these effects would be small, but it really does need to be properly analyzed.
I bought a bunch of assembly/MASM books from the 90s recently, if that helps. Writing assembly is a skill that atrophies quickly without constant exercise, so once you spool up to an acceptable level of assembly productivity, you really need to make a conscious effort to keep practicing in order to maintain it.
If I go without it for a couple of weeks, I'll come back and be like "wait, how do I check if this is a NULL pointer again?".
The Computer Architecture books are good; you can pick up second-hand copies of older editions for cheap. The Intel and AMD manuals are good, too.
This seems like a complicated way to speed up a linear scan. What's the advantage of this approach over just sorting the strings and doing a binary search, which is log N?
(Not a naive binary search using strcmp, but the trick where you find the lower and upper bounds given index 0, refine it using index 1, etc. until your input is exhausted.)
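For concreteness, a sketch of that refinement (my reconstruction; the prefix_search name is hypothetical), assuming a lexicographically sorted, NUL-terminated table and returning the index of a table string that the input starts with:

    #include <stddef.h>

    /* At character position i, [lo, hi) holds exactly the slots that agree
       with the input on characters 0..i-1. Returns a matching index or -1. */
    static int
    prefix_search(const char *table[], int n, const char *input)
    {
        int lo = 0, hi = n;
        for (size_t i = 0; lo < hi; i++) {
            if (table[lo][i] == '\0')
                return lo;        /* table[lo] is a prefix of the input */
            if (input[i] == '\0')
                return -1;        /* input exhausted with no match */
            int l = lo, h = hi;   /* lower bound for input[i] in column i */
            while (l < h) {
                int m = l + (h - l) / 2;
                if (table[m][i] < input[i]) l = m + 1; else h = m;
            }
            lo = l;
            l = lo; h = hi;       /* upper bound for input[i] in column i */
            while (l < h) {
                int m = l + (h - l) / 2;
                if (table[m][i] <= input[i]) l = m + 1; else h = m;
            }
            hi = l;
        }
        return -1;
    }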
Chasing pointers is cache death. Many theoretically optimal algorithms have been struck down by the God of Memory Latency.
E.g., if you measure it, tree performance will often be much improved by storing slightly less than one cache line's worth of data at each node. The CPU can pipeline and predict the shit out of the linear search at each node, to the point that all the theoretical O(1) or O(log N) lookups in the world get demolished by a CS101 scan of the array. This makes for a relatively good general-use cache-aware data structure, because small collections tend to fit into a single cache line, while large data structures have relatively few pointers to chase.
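As a rough illustration of that node sizing (the numbers here are hypothetical, not measured):

    #include <stdint.h>

    /* Keep each node's scannable payload just under one 64-byte cache line
       so the per-node linear search touches a single line. */
    #define NODE_KEYS 14   /* 4-byte count + 14 x 4-byte keys = 60 bytes */

    struct tree_node {
        uint32_t count;                   /* keys currently in use */
        uint32_t keys[NODE_KEYS];         /* scanned linearly, CS101-style */
        struct tree_node *children[NODE_KEYS + 1]; /* chased only between
                                                      nodes, not within */
    };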
Of course as the OP demonstrates, optimizing for your specific use case can yield even more dramatic improvements.
Sorting the strings by length, or lexicographically? A length sort wouldn't work for the prefix-oriented nature of the comparisons. Lexicographic sorting plus a binary search involves pointer chasing, and doesn't provide any way to fast-path the negative match lookup.
If I'm getting called a trillion times on a hot-path, but only a million of those times I'm receiving a module name (search string) I'm interested in, then it behooves me to have the fastest negative match logic possible.
As it stands, the final assembly version can negative match in about 6 cycles, prefix match in about 14 cycles, and worst-case false-positive negative match (where three strings need to be compared, but there is no match) in 21 cycles. The CPI for the final routine is reported as around 0.266, which is pretty close to the optimal of 0.25.
I think you'd struggle to meet that performance with anything that relies on pointer chasing. I do comment on the likelihood of a log(n)-based algorithm overtaking performance when the data set is large enough here: http://trent.me/is-prefix-of-string-in-table/#other-applicat...
This particular use case wasn't intended for prefix matching against large data sets though, and thus, wasn't optimized for that.
Ok, I see further down the article does describe some of the assumptions: short strings, very small set size, optimized for negative matching.
Some constructive feedback: move that information up to the "Goal" section, maybe with a motivating example. The fact that the set size is <= 16 is important to know before discussing cycle counts.
> A SIMD-friendly C structure called STRING_TABLE was derived. It is optimized for up to 16 strings, ideally of length less than or equal 16 characters.
He was only searching 16 strings. The overhead of a binary search on such a small N would likely overwhelm the algorithmic improvements of moving from an O(N) to O(log N) implementation.
The table is optimized for 16 strings, but there's a reference to "linear scanning an array of string tables", so I thought the total input was larger. Anyway, the size of the set is important information.
I love tries for larger data sets, and have used them many times for longest prefix matching problems.
However, they inevitably involve pointer chasing, are quite branchy, and are not really amenable to SIMD optimizations. They also don't have any means in place for doing a fast-path negative match, which was important for me.
I think you'd really struggle to get anything based on pointer chasing and binary search to compete with the CPU cycle counts I was seeing toward the end for the assembly versions (e.g. 6 cycles for negative match, 14 for prefix match, 21 for worst-case false-positive negative match).
> The other nice side-effect is that it forces you to pick which table a given string should go in. I made this decision by looking at which types occurred most frequently, and simply put those in the first table. Less frequent types go in subsequent tables.
> I have a hunch there's a lot of mileage in that approach; that is, linear scanning an array of string tables until a match is found. There will be an inflection point where some form of a log(n) binary tree search will perform better overall, but it would be very interesting to see how many strings you need to potentially match against before that point is hit.
> Unless the likelihood of matching any given string in your set is completely random, by ordering the strings in your tables by how frequently they occur, the amortized cost of parsing a chunk of text would be very competitive using this approach, I would think.
I'm optimizing for ultra low latency here; i.e. what's the absolute fastest I can detect a prefix match against this known string table? The final assembly version is down around the 6-20 cycle range, depending on what you're doing.
The overhead of dispatching a kernel to the GPU (and subsequent memory copy of the search string) would take tens of thousands (if not hundreds of thousands) of CPU cycles, so, it's not really applicable in this single-shot prefix lookup case.
Where it would be interesting: much larger prefix data sets (say, the names of all known cities, states, countries, etc.) and identifying where they occur in an even larger text corpus (like a Wikipedia dump).
In that approach, you'd obviously compare against a multi-threaded CPU implementation... but, it'd be interesting to see the comparison.