Hash Functions all the way down (aras-p.info)
148 points by based2 on Aug 3, 2016 | 30 comments



It'd be interesting to see how the 128-bit Murmur versions compare.

Something not mentioned in the "Quality" section is that FNV and djb2 may produce decent results on words/filenames/source code, but they can fail catastrophically (produce far more collisions than expected) on certain patterns of keys - SMHasher is designed not just to check performance, but to try to find patterns of keys that 'break' a hash function.

Murmur3 tries to be short, simple, portable, and efficient on small keys while still passing SMHasher. xxHash, Spooky, and the Farm/City/Metro suite all use a lot of loop unrolling and/or platform-specific optimizations to get the highest possible performance when hashing large chunks of data.

-Austin (author of Murmur and SMHasher)


Thank you for Murmur3! I have been using it in a consistent hashing implementation as part of my bachelor's thesis and it is performing very well.


> It'd be interesting to see how the 128-bit Murmur versions compare.

Added them (and some more hash functions) in a follow-up post: http://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/


First, thanks for all your work on Murmur and SMHasher!

I wonder if you have any recommendations for a one-byte-at-a-time hash that survives SMHasher? I've been using Jenkins OAAT and haven't bumped into any problems but I've seen conflicting reports about its robustness. Any thoughts on Jenkins OAAT or an alternative one-at-a-time hash would be appreciated :)
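
For reference, the Jenkins one-at-a-time hash being asked about, in its canonical few-line form as published by Bob Jenkins:

  #include <stddef.h>
  #include <stdint.h>

  /* Jenkins one-at-a-time: mixes one byte per iteration,
     with a final avalanche step at the end. */
  uint32_t oaat(const unsigned char *key, size_t len)
  {
      uint32_t h = 0;
      for (size_t i = 0; i < len; i++) {
          h += key[i];
          h += h << 10;
          h ^= h >> 6;
      }
      h += h << 3;
      h ^= h >> 11;
      h += h << 15;
      return h;
  }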


If you can only hash data a byte at a time, you have bigger problems. ;)

I have no real suggestions, but it would be trivial to feed Murmur one byte per loop iteration and then ditch the 'tail' section.
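
A rough sketch of that suggestion, using the standard MurmurHash3 x86_32 mixing constants; note the output differs from real Murmur3, which consumes four bytes per iteration:

  #include <stddef.h>
  #include <stdint.h>

  static inline uint32_t rotl32(uint32_t x, int r)
  {
      return (x << r) | (x >> (32 - r));
  }

  /* Murmur3 x86_32 body fed one byte per iteration, 'tail' section
     ditched. Not output-compatible with standard Murmur3. */
  uint32_t murmur3_oaat(const unsigned char *key, size_t len, uint32_t seed)
  {
      uint32_t h = seed;
      for (size_t i = 0; i < len; i++) {
          uint32_t k = key[i];
          k *= 0xcc9e2d51; k = rotl32(k, 15); k *= 0x1b873593;
          h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64;
      }
      /* Standard Murmur3 finalizer (fmix32). */
      h ^= (uint32_t)len;
      h ^= h >> 16; h *= 0x85ebca6b;
      h ^= h >> 13; h *= 0xc2b2ae35;
      h ^= h >> 16;
      return h;
  }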


Yeah, I should revisit this and test more hash functions. I just ran out of the time I wanted to spend on this (hey, I've got some actual work to do too :))


The "ZOMGBadHash" isn't unique to that codebase - it's the hash specified by the ELF ABI (see http://www.sco.com/developers/gabi/latest/ch5.dynamic.html#h... ).

It's not just a random mush of xors and shifts, it's:

  - Add the next character value to the current value;
  - Left-shift the current value by 4 (because most of the entropy in plain text ASCII is in the lower 4 bits);
  - XOR the top 4 bits back down into the low bits of the current value (they land in bits 4-7), because they're about to get shifted out; then clear the top four bits.
It's still not a great hash function (obviously). Part of the issue is that with a 32-bit unsigned int it's effectively only a 28-bit hash, since the top four bits are always cleared.
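
In code, the version from the gABI spec, near-verbatim:

  /* ELF hash as specified in the System V gABI. */
  unsigned long elf_hash(const unsigned char *name)
  {
      unsigned long h = 0, g;
      while (*name) {
          h = (h << 4) + *name++;
          if ((g = h & 0xf0000000) != 0)
              h ^= g >> 24;
          h &= ~g;
      }
      return h;
  }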



What's the point of linking a Reddit submission of the same article?


The comments, probably.


From the title, I was expecting either a discussion of the CPython implementation or an associative memory system.

It would be amusing to have a system with hardware associative memory, where everything is a key-value tuple. Such things have been built (most of an MMU is such an associative memory) but never caught on.


If anyone isn't familiar with him, Aras is one of the key architects behind the Unity game engine. An excellent hacker. Whenever I see his blog come up, I know it's going to be some awesome technical stuff.


Great article.

1. This seems to disagree with Rust's tests for FNV vs xxHash.[1] This article has xxHash winning a bit after 10 bytes. The Rust one has FNV winning until after 32 bytes. Rust recently added FNV. Is someone wrong, or am I missing something?

2. Why is 64-bit slower on asm.js? If the host is 64-bit, shouldn't it get 64-bit? I know JS doesn't have proper ints, but I thought asm.js-aware browsers were supposed to handle this well?

Curious: Is there no uniform, fast way to switch the algorithm for small inputs? Require the caller to hint that the key might be small (<8 or 16 bytes); otherwise, in streaming mode, just restart if it turns out you only got <16 bytes. Or will that extra branching and code kill the gain?

1: http://cglab.ca/~abeinges/blah/hash-rs/


ad 1) SMHasher is not the right performance test for a hash function used in a hash table. In a hash table FNV will always win over xxHash and all those file digests, which are optimized for larger aligned blocks. SMHasher is a good test for network throughput or hashing files, but not for hash tables.

The reason is that a short, easily inlinable hash function like FNV will always win over a big non-inlinable digest like Murmur, xxHash, Farm, Metro ... Furthermore, even if small, worse hash functions produce more collisions, their small icache footprint will still outperform bigger functions with fewer collisions. More work, but fewer cache misses. For small hash tables it is even faster to do a linear search because of cache advantages, and a linked list of buckets is better written as an array. ("Cache-Conscious Collision Resolution in String Hash Tables" by Nikolas Askitis and Justin Zobel, Melbourne 2005)
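
A hypothetical sketch of that last point (names mine, not from the paper): each bucket keeps its collision "chain" inline in a small array, so a probe scans contiguous memory instead of chasing cache-missing pointers.

  #include <stdint.h>

  struct entry {
      uint32_t hash;        /* cached full hash, checked before the key */
      const char *key;
      void *value;
  };

  /* Cache-conscious bucket: entries stored inline rather than in a
     linked list; a lookup walks one or two cache lines at most. */
  struct bucket {
      uint32_t count;       /* occupied slots; spill or rehash on overflow */
      struct entry slots[4];
  };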

ad branching) Google successfully experimented with hash variants based on alignment and size. Newer CPUs explore those branches beforehand, so the overhead was not that big and it was worth it. See their sparse hashtable.


asm.js is essentially a "32-bit" compilation target; to my knowledge it does not support 64-bit arithmetic efficiently right now.


This really needs to take into account hardware instruction support. The hardware instructions are quite fast.


I'm surprised how fast the iPhone CPU is compared to the Xbox One CPU (for this application).


PS4/XB1 use a fairly low-end AMD CPU (Jaguar core, ~1.6GHz). Apparently it also does not quite appreciate heavy use of 64-bit multiplies (as used in xxHash64, for example).

Modern phones (e.g. iPhone 6) in terms of CPU are actually very impressive indeed!


How about SipHash?


Wait, don't you want minimum acceptable throughput on security-related hashing, so that it becomes less likely that somebody can just brute-force it?


Not everything hashed ends up being salted passwords. Hashing is great for plain hashtables and partitioning.


Ah, that makes sense. Thanks.


This is evaluating hash functions for use in something like a hash table, where the concern is first and foremost speed with good-enough distribution, not collision resistance.

The security (DoS, rather) issue with hash tables is more that 1) attackers often control the input, and H(a) == H(a) obviously, and 2) the hash function output is reduced to a few bits to decide which bucket to put something into. In those cases, making your hash function slower just makes you more susceptible.


Yeah, I forgot to mention that even if most of the time you don't care about "security concerns" with regular hashtables, there are cases where a weakness in your hashtable can be used to bring some of your services down. That has happened!

However, much of the time that is not a concern; as always, it's a case-by-case basis.


For hashing, you want the maximum possible throughput. If you want to hash a password, you can always iterate the hash function to add difficulty, or better yet, use a mechanism specifically designed for hashing passwords, such as scrypt, bcrypt, or PBKDF2.


Or the more recent Argon2, but I've learned to give new crypto hashes a good 5 years before using them in production.


How did they get to a place where there are multiple copies of the same external codebase? Shared libraries FTW.


Something like FNVHash is so small (a handful of lines) that it's extremely easy to just copy-paste the function into some place without realizing it's already been copy-pasted somewhere else.
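
For illustration, here is a 32-bit FNV-1a in full:

  #include <stddef.h>
  #include <stdint.h>

  /* FNV-1a, 32-bit: XOR in each byte, then multiply by the FNV prime. */
  uint32_t fnv1a(const void *data, size_t len)
  {
      const unsigned char *p = data;
      uint32_t h = 2166136261u;      /* FNV offset basis */
      for (size_t i = 0; i < len; i++) {
          h ^= p[i];
          h *= 16777619u;            /* FNV prime */
      }
      return h;
  }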


It would be interesting to check against Java's hash implementation. I remember that Java's hash was "bad".


Was it? The only issue I remember was with String before Java 1.2, where only a few characters were used in the hash (instead of all of them), leading to abysmal performance in many applications.
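
For reference, the modern java.lang.String.hashCode() is the well-known h = 31*h + c loop over all UTF-16 code units, sketched here in C:

  #include <stddef.h>
  #include <stdint.h>

  /* java.lang.String.hashCode(): h = 31*h + c over every UTF-16 code
     unit, with Java's wrapping 32-bit arithmetic. */
  int32_t java_string_hash(const uint16_t *chars, size_t len)
  {
      uint32_t h = 0;                /* unsigned for defined wraparound */
      for (size_t i = 0; i < len; i++)
          h = 31 * h + chars[i];
      return (int32_t)h;
  }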



