It'd be interesting to see how the 128-bit Murmur versions compare.
First, thanks for all your work on Murmur and SMHasher!
I wonder if you have any recommendations for a one-byte-at-a-time hash that survives SMHasher? I've been using Jenkins OAAT and haven't bumped into any problems but I've seen conflicting reports about its robustness. Any thoughts on Jenkins OAAT or an alternative one-at-a-time hash would be appreciated :)
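For reference, Jenkins's one-at-a-time hash is short enough to include inline; here's a Python sketch (32-bit arithmetic emulated with masking):

```python
MASK32 = 0xFFFFFFFF

def jenkins_oaat(data: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash, 32-bit."""
    h = 0
    for byte in data:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    # Final avalanche mixing.
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h
```

Note that every mixing step is reversible (adds of shifted copies are multiplications by odd constants mod 2^32, and the xor-shift is a bijection), so the final mix can't merge values that were already distinct going into it.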
It's not just a random mush of xors and shifts, it's:
- Add the next character value to the current value;
- Left-shift the current value by 4 (because most of the entropy in plain text ASCII is in the lower 4 bits);
- XOR the top 4 bits into the bottom four bits of the current value, because they're going to get shifted out next time; then clear the top four bits.
It's still not a great hash function (obviously). Part of the issue is that with a 32-bit unsigned int it's effectively only a 28-bit hash.
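Spelled out, the steps above look roughly like this in Python (a sketch assuming 32-bit unsigned arithmetic; the exact fold width varies between PJW/ELF variants):

```python
def shift_xor_hash(data: bytes) -> int:
    """The add / shift-by-4 / fold-top-nibble hash described above
    (PJW/ELF-family). Effectively a 28-bit hash, since the top
    nibble is cleared on every iteration."""
    h = 0
    for byte in data:
        h = ((h + byte) << 4) & 0xFFFFFFFF  # add the character, shift left by 4
        top = h & 0xF0000000                # top 4 bits, about to be shifted out
        h = (h ^ (top >> 28)) & 0x0FFFFFFF  # fold them into the bottom 4 bits, clear the top
    return h
```

The mask on the last line is why it's only a 28-bit hash: the top four bits of the 32-bit word are always zero.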
From the title, I was expecting either a discussion of the CPython implementation or an associative memory system.
It would be amusing to have a system with hardware associative memory, where everything is a key-value tuple. Such things have been built (most of an MMU is such an associative memory) but never caught on.
If anyone isn't familiar with him, Aras is one of the key architects behind the Unity game engine. An excellent hacker. Whenever I see his blog come up, I know it's going to be some awesome technical stuff.
1. This seems to disagree with Rust's tests for FNV vs xxHash.[1] This article has xxHash winning a bit after 10 bytes. The Rust one has FNV winning until after 32 bytes. Rust recently added FNV. Is someone wrong, or am I missing something?
2. Why is 64-bit slower on asm.js? If the host is 64-bit shouldn't it get 64-bit? I know JS doesn't have proper ints, but I thought asm.js-aware browsers were supposed to handle this well?
Curious: is there no uniform, fast way to switch the algorithm for small inputs? Require the caller to hint that the key might be small (<8 or 16 bytes); otherwise, in streaming mode, just restart if it turns out you only got <16 bytes. Or will that branching+code kill the gain?
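To make the idea concrete, a hedged sketch (all names here are made up for illustration, not any real library's API):

```python
SMALL_KEY_LIMIT = 16  # assumed threshold, per the comment

def small_hash(data: bytes) -> int:
    # Stand-in for a cheap per-byte hash (32-bit FNV-1a here).
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def big_hash(data: bytes) -> int:
    # Stand-in for a wide, block-oriented hash; Python's built-in
    # is used purely for illustration.
    return hash(data) & 0xFFFFFFFF

def dispatch_hash(data: bytes) -> int:
    # One predictable branch up front, instead of restarting mid-stream.
    if len(data) < SMALL_KEY_LIMIT:
        return small_hash(data)
    return big_hash(data)
```

The open question from the comment still applies: whether that extra branch (and the extra code pulled into the icache) eats the gain is exactly what you'd have to measure.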
ad 1) SMHasher is not the right performance test for a hash function used in a hash table. In a hash table, FNV will always win over xxHash and all those file digests, which are optimized for larger aligned blocks. SMHasher is a good test for network throughput or hashing files, but not for hash tables.
The reason is that a short, easily inlinable hash function like FNV will always win over a big non-inlinable digest like Murmur, xxHash, Farm, Metro ...
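For scale, 32-bit FNV-1a really is only a few lines (a Python sketch using the standard FNV constants):

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 2166136261
FNV_PRIME = 0x01000193         # 16777619

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: xor each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFF
    return h
```

In C the whole thing is one line inside a loop, which is why compilers happily inline it at every call site.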
Furthermore, even though small, weaker hash functions produce more collisions, their small icache footprint will still let them outperform bigger functions with fewer collisions. More work, but fewer cache misses.
For small hash tables it is even faster to do a linear search, because of cache advantages, and a linked list of buckets is better written as an array. ("Cache-Conscious Collision Resolution in String Hash Tables" by Nikolas Askitis and Justin Zobel, Melbourne 2005)
ad branching) Google successfully experimented with hash variants based on alignment and size. Newer CPUs explore those branches ahead of time, so the overhead was not that big and was worth it. See their sparse hash table.
PS4/XB1 use a fairly low-end AMD CPU (Jaguar core, ~1.6GHz). Apparently it also does not much appreciate heavy use of 64-bit multiplies (as used in xxHash64, for example).
Modern phones (e.g. iPhone 6) in terms of CPU are actually very impressive indeed!
This is evaluating hash functions for use in something like a hash table where the concern is first and foremost for speed at goodish distribution, not collision resistance.
The security (DoS, rather) issue with hash tables is more that 1) attackers often control the input, and H(a) == H(a) obviously, and 2) the hash function output is reduced to a few bits to decide which bucket to put something into. In those cases, making your hash function slower just makes you more susceptible.
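Point 2 is easy to demonstrate: once the output is reduced to a bucket index, only a few low bits matter, so keys with distinct full hashes still collide in buckets (a sketch; `toy_hash` and the bucket count are illustrative):

```python
def toy_hash(data: bytes) -> int:
    # 32-bit FNV-1a, standing in for whatever hash the table uses.
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

NUM_BUCKETS = 8  # bucket index keeps only the low 3 bits of a 32-bit hash

def find_bucket_collisions(keys):
    """Group keys whose bucket index collides."""
    buckets = {}
    for k in keys:
        buckets.setdefault(toy_hash(k) % NUM_BUCKETS, []).append(k)
    return {b: ks for b, ks in buckets.items() if len(ks) > 1}
```

With 20 keys and 8 buckets, the pigeonhole principle alone guarantees bucket collisions, no matter how good the full 32-bit hash is; an attacker just searches for keys sharing the low bits.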
Yeah I forgot to mention that even if most of the time you don't care about "security concerns" with regular hashtables, there are cases where a weakness in your hashtable can be used to bring some of your services down. That has happened!
However, many of the times that is not a concern, but as always, case by case basis.
For hashing, you want the maximum possible throughput. If you want to hash a password, you can always iterate the hash function to add difficulty, or better yet, use a mechanism specifically designed for hashing passwords, such as scrypt, bcrypt, or PBKDF2.
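For the password case, Python's standard library exposes PBKDF2 directly, with the iteration count (the "difficulty") as an explicit parameter (the salt, password, and count below are illustrative):

```python
import hashlib
import os

# Derive a 32-byte key from a password. The iteration count is the knob
# that makes brute force expensive; the random salt must be stored
# alongside the derived key.
salt = os.urandom(16)
key = hashlib.pbkdf2_hmac("sha256", b"correct horse battery staple",
                          salt, 100_000)
```

Verification later is just re-deriving with the stored salt and comparing; raising the iteration count over time is how you keep pace with hardware.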
Something like FNVHash is so small (like a handful of lines), that it's extremely easy to just copy-paste the function into some place, without realizing it's already copy-pasted somewhere else.
Was it? The only issue I remember was with String before Java 1.2, where only a few characters were used in the hash (instead of all of them), leading to abysmal performance in many applications.
Something not mentioned in the "Quality" section is that FNV and djb2 may produce decent results on words/filenames/source code, but they can fail catastrophically (produce far more collisions than expected) on certain patterns of keys - SMHasher is designed not just to check performance, but to try and find patterns of keys that 'break' a hash function.
Murmur3 tries to be short, simple, portable, and efficient on small keys while still passing SMHasher. xxHash, Spooky, and the Farm/City/Metro suite all use a lot of loop unrolling and/or platform-specific optimizations to get the highest possible performance when hashing large chunks of data.
-Austin (author of Murmur and SMHasher)