Hacker News new | past | comments | ask | show | jobs | submit login
Hash Functions all the way down (aras-p.info)
148 points by based2 on Aug 3, 2016 | hide | past | favorite | 30 comments



It'd be interesting to see how the 128-bit Murmur versions compare.

Something not mentioned in the "Quality" section is that FNV and djb2 produce may produce decent results on words/filenames/source code, but they can fail catastrophically (produce far more collisions than expected) on certain patterns of keys - SMHasher is designed not just to check performance, but to try and find patterns of keys that 'break' a hash function.

Murmur3 tries to be short, simple, portable, and efficient on small keys while still passing SMHasher. xxHash, Spooky, and the Farm/City/Metro suite all use a lot of loop unrolling and/or platform-specific optimizations to get the highest possible performance when hashing large chunks of data.

-Austin (author of Murmur and SMHasher)


Thank you for Murmur3! I have been using it in a consistent hashing implementation as part of my bachelor's thesis and it is performing very well.


> It'd be interesting to see how the 128-bit Murmur versions compare.

Added them (and some more hash functions) in a follow-up post: http://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/


First, thanks for all your work on Murmur and SMHasher!

I wonder if you have any recommendations for a one-byte-at-a-time hash that survives SMHasher? I've been using Jenkins OAAT and haven't bumped into any problems but I've seen conflicting reports about its robustness. Any thoughts on Jenkins OAAT or an alternative one-at-a-time hash would be appreciated :)


If you can only hash data a byte at a time, you have bigger problems. ;)

I have no real suggestions, but it would be trivial to feed Murmur one byte per loop iteration and then ditch the 'tail' section.


Yeah, should revisit this and test more hash functions. Just ran out of time I wanted to spend on this (hey, I got some actual work to do too :))


The "ZOMGBadHash" isn't unique to that codebase - it's the hash specified by the ELF ABI (see http://www.sco.com/developers/gabi/latest/ch5.dynamic.html#h... ).

It's not just a random mush of xors and shifts, it's:

  - Add the next character value to the current value;
  - Left-shift the current value by 4 (because most of the entropy in plain text ASCII is in the lower 4 bits);
  - XOR the top 4 bits into the bottom four bits of the current value, because they're going to get shifted out next time; then clear the top four bits.
It's still not a great hash function (obviously). Part of the issue is that with a 32-bit unsigned int it's effectively only a 28 bit hash.



What's the point of linking a Reddit submission of the same article?


The comments, probably.


From the title, I was expecting either a discussion of the CPython implementation or an associative memory system.

It would be amusing to have a system with hardware associative memory, where everything is a key-value tuple. Such things have been built (most of an MMU is such an associative memory) but never caught on.


If anyone isn't familiar with him, Aras is one of the key architects behind the Unity game engine. An excellent hacker. Whenever I see his blog come up, I know it's going to be some awesome technical stuff.


Great article.

1. This seems to disagree with Rust's tests for FVN vs xxHash.[1] This article has xxHash winning a bit after 10 bytes. The Rust one has FVN winning until after 32 bytes. Rust recently added FVN. Is someone wrong or am I missing something?

2. Why is 64-bit slower on asm.js? If the host is 64-bit shouldn't it get 64-bit? I know JS doesn't have proper ints, but I thought asm.js-aware browsers were supposed to handle this well?

Curious: Is there no uniform, fast, way to switch the algorithm for small inputs? Require the caller to hint that the key might be small (<8 or 16 bytes), otherwise in streaming mode, just restart if it turns out you only get <16 bytes. Or will that branching+code kill the gain?

1: http://cglab.ca/~abeinges/blah/hash-rs/


as 1) SMHasher is not the right test for performance for a hash function in a hash table. In a hash table FNV will always win over xxHash and all those file digests, which are optimized for aligned larger blocks. SMHasher is a good test for network throughput or hashing files, but not for hash tables.

The reason is that an easily short inlinable hash function like FNV will always win over a big non-inlinable digest like Murmur, xxHash, Farm, Metro ... Furthermore, even if small worse hash functions produce more collisions, their small icache size will still outperform bigger functions with less collisions. More work, but less cache misses. For small hash tables it is even faster to do a linear search because of cache advantages, and also a linked-list of buckets is better be written as array. ("Cache-Conscious Collision Resolution in String Hash Tables" by Nikolas Askitis and Justin Zobel, Melbourne 2005)

ad branching) Google successfully experimented with hash variants based on alignment and size. Newer CPU's explore those branches beforehead, so the overhead was not that big and worth it. See their sparse hashtable.


asm.js is essentially a "32 bit" compilation target, it does not (to my knowledge) support 64 bit arithmetic efficiently right now.


This really needs to take into account hardware instruction support. The hardware instructions are quite fast.


I'm surprised how fast the iPhone CPU is compared to the Xbox One CPU (for this application).


PX4/XB1 use a fairly low-end AMD CPU (Jaguar core, ~1.6GHz). Apparently it also does not quite appreciate heavy use of 64 bit multiplies (as used in xxHash64 for example).

Modern phones (e.g. iPhone 6) in terms of CPU are actually very impressive indeed!


How about Siphash?


Wait, don't you want minimum acceptable throughput on security-related hashing, so that it becomes less likely that somebody can just brute-force it?


Not everything hashish ends up being salted passwords. Hashing is great for just hashtables and partioning.


Ah, that makes sense. Thanks.


This is evaluating hash functions for use in something like a hash table where the concern is first and foremost for speed at goodish distribution, not collision resistance.

The security (DoS rather) issue with hash tables is more that 1) attackers often control the input, and H(a) == H(a) obviously and 2) the hash function output is reduced to a few bits to decide which bucket to put something into. In those cases making your hash function slower just makes you more susceptible.


Yeah I forgot to mention that even if most of the time you don't care about "security concerns" with regular hashtables, there are cases where a weakness in your hashtable can be used to bring some of your services down. That has happened!

However, many of the times that is not a concern, but as always, case by case basis.


For hashing, you want the maximum possible throughput. If you want to hash a password, you can always iterate the hash function to add difficulty, or better yet, use a mechanism specifically designed for hashing passwords, such as scrypt, bcrypt, or PKBDF2.


Or the most recent Argon2, but I've learned to give crypto hashers a good 5 years before using in production.


How did they get to a place where there are multiple copies of the same external codebase? Shared libraries FTW.


Something like FNVHash is so small (like a handful of lines), that it's extremely easy to just copy-paste the function into some place, without realizing it's already copy-pasted somewhere else.


it would be interesting check against Java hash implementation. I remember that Java hash was "bad"


Was it? The only issue I remember was with String before Java 1.2, where only few characters where used in hash (instead of all) leading to abysmal performance in many applications.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: