The initial length + unique character test allows us to negative match in a way that is effectively identical to a bloom filter.
If the initial negative match test indicates no match, then we can be certain there is not a match. Otherwise, we might have a match, and need to compare the candidate strings in more detail.
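The filter described above could be sketched roughly like this. This is an illustrative guess at the shape of the technique, not the article's actual implementation; the struct layout and field names are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the length + unique-character negative match.
 * Like a bloom filter, a failure in the cheap tests is definitive (no
 * match), while a pass only means "maybe", so we fall through to the
 * full comparison. */
typedef struct {
    const char *str;        /* known string in the table                */
    size_t      len;        /* precomputed length of that string        */
    size_t      unique_pos; /* position of a distinguishing byte        */
    char        unique_ch;  /* the distinguishing byte itself           */
} entry;

/* Returns true only if `input` (of length `input_len`) starts with e->str. */
static bool prefix_match(const entry *e, const char *input, size_t input_len)
{
    if (input_len < e->len)                   return false; /* too short  */
    if (input[e->unique_pos] != e->unique_ch) return false; /* wrong byte */
    return memcmp(input, e->str, e->len) == 0; /* full check on "maybe"  */
}
```

The two cheap tests reject most non-matching inputs without touching more than one byte of the candidate string.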
I can't imagine how hashing would help in either the fast-path negative match case, or the prefix match case. Hashing is sort of at odds with prefix matching, in my opinion.
Could you get good-enough performance with something simpler and more portable?
The optimized assembly is much more complicated and specialized. Porting to other platforms/architectures is very difficult, and you lose out on the compiler's ability to make assumptions and optimize your code.
String length is a very poor probabilistic filter as you're specifically looking for substrings.
Also, if you already have a hash of your input string's prefix and the hashes of all of your known strings, then the case where you actually need to check against your "known" strings becomes a fast search of a set of numbers, making the fall-through case of your code negligible. If you impose a limit on the number of strings in your "known" pool, the compiler can have a field day.
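To make the idea concrete, here is one way the hash-first approach could look. This is a sketch of the proposal, not anything from the article, and it assumes every known string has the same fixed length (PREFIX_LEN), since hashing a variable-length prefix would require re-hashing per candidate length:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed fixed prefix length; this restriction is what makes hashing
 * awkward for general prefix matching. */
#define PREFIX_LEN 4

/* FNV-1a: a simple, well-known 32-bit string hash. */
static uint32_t fnv1a(const char *s, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (uint8_t)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Illustrative "known" pool; all entries are PREFIX_LEN bytes. */
static const char *known[] = { "GET ", "PUT ", "POST" };
#define NKNOWN (sizeof known / sizeof known[0])

static bool match_by_hash(const char *input, size_t input_len)
{
    if (input_len < PREFIX_LEN)
        return false;
    uint32_t h = fnv1a(input, PREFIX_LEN); /* hash the input prefix once */
    for (size_t i = 0; i < NKNOWN; i++) {
        /* The known-string hashes could be precomputed at build time;
         * they're computed inline here only to keep the sketch
         * self-contained. memcmp guards against hash collisions. */
        if (h == fnv1a(known[i], PREFIX_LEN) &&
            memcmp(input, known[i], PREFIX_LEN) == 0)
            return true;
    }
    return false;
}
```

With a small bounded pool, the loop over precomputed hashes compiles down to a handful of integer compares.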
> Could you get good-enough performance with something simpler and more portable?
Hmmm. Did you read the article? Any of the C versions are relatively simple and portable. They basically work the same as the assembly ones and give great performance... I'm not sure what you're trying to get at.
> The optimized assembly is much more complicated and specialized.
Well, it's my article, I wanted to see if I could beat a PGO C compiler with all its optimizations turned on :-)
> Porting platforms/architectures is very difficult and you loose out on the compiler's abilities to make assumptions and optimize your code.
I know the trade-offs that come with assembly. But if you're okay with them, you can get the best possible performance out of the underlying hardware. That was the goal of this article.
> String length is a very poor probabilistic filter as you're specifically looking for substrings.
I don't understand this comment. String length is a great filter because if any strings in the table are longer than the input search string, then there's no way we can have a prefix match.
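That rejection test is a single comparison of precomputed lengths; a minimal illustration (names are mine, not the article's):

```c
#include <stdbool.h>
#include <string.h>

/* If a table string is longer than the input, it cannot be a prefix of
 * the input, so it can be rejected without examining a single byte. */
static bool could_be_prefix(const char *table_str, const char *input)
{
    return strlen(table_str) <= strlen(input);
}
```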
> Also, if you already have a hash of your input string's prefix and the hashes of all of your known strings, then the case where you actually need to check against your "known" strings becomes a fast search of a set of numbers, making the fall-through case of your code negligible. If you impose a limit on the number of strings in your "known" pool, the compiler can have a field day.
I don't know where this fascination with hashes comes from... have you looked at the implementation? Even when it has to do a string match, the timings now are around 13-14 cycles. The nature of the data structure means you're really only ever doing max 2 comparisons... 3 if someone is sending you worst case data (and even so, with 3, you're only clocking ~20 cycles).
Hashing and prefix matching just don't play nicely together.