Hashing (samwho.dev)
318 points by signa11 on June 20, 2023 | 64 comments



> The Avalanche Effect

> Another way hash functions get evaluated is on something called the "avalanche effect." This refers to how many bits in the output value change when just a single bit of the input changes. To say that a hash function has a good avalanche effect, a single bit flip in the input should result in an average of 50% of the output bits flipping.

I think it's important to note that this isn't a property that is necessary for a hash function. There are hash functions that deliberately try to minimize the avalanche effect, as in locality-sensitive hashing (LSH), another form of hash function with different use cases that are also very cool.

You can use LSHs, for example, to remove near-duplicate search results in a search engine without actually having to compare the texts, even tolerating single-word differences; or to help with nearest neighbor searches.
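
For a rough illustration of the near-duplicate case, here's a minimal SimHash-style sketch in JS. The FNV-style word hash and the 32-bit signature width are simplifying assumptions, not any particular production scheme:

    // Toy per-word hash (FNV-1a, 32-bit) for the words to vote with.
    function wordHash(word) {
      let h = 2166136261;
      for (const ch of word) {
        h ^= ch.charCodeAt(0);
        h = Math.imul(h, 16777619) >>> 0;
      }
      return h;
    }

    function simhash(text) {
      const counts = new Array(32).fill(0);
      for (const word of text.toLowerCase().split(/\s+/)) {
        const h = wordHash(word);
        // Each word votes +1 or -1 on every bit position.
        for (let i = 0; i < 32; i++) {
          counts[i] += ((h >>> i) & 1) ? 1 : -1;
        }
      }
      // Majority vote per bit produces the signature.
      let sig = 0;
      for (let i = 0; i < 32; i++) {
        if (counts[i] > 0) sig |= 1 << i;
      }
      return sig >>> 0;
    }

Two texts differing by a single word agree in most per-bit votes, so their signatures end up a small hamming distance apart - the opposite of avalanche.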

A lot of what's written about hash functions tends to assume you want to use them for whatever the author has in mind. It's not a fault unique to this author -- many textbooks have the same problem, piling on properties that are accidental to what a hash function actually is: really just a mapping to a fixed-width representation with an even distribution over the output range.


> You can use LSHs […] to help with nearest neighbor searches.

Yes! A good example of this in graphics is spatial nearest neighbor search. Space-filling curves like the Morton and Hilbert curves are often used as hash functions that can preserve some amount of locality for 2-D / 3-D / N-D points. A locality-sensitive hash function can basically be used to provide a sort key for data that doesn't otherwise have a key, while increasing the odds that any two keys that are similar point to data that are also similar. This is useful on GPUs in order to improve coherence/performance of threads operating on spatial points. (A small Morton-code sketch follows the links below.)

https://en.wikipedia.org/wiki/Z-order_curve

https://en.wikipedia.org/wiki/Hilbert_curve
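
Here's a minimal sketch of a 2-D Morton code: interleave the bits of x and y so that points close in space tend to get numerically close codes (plain JS, 16 bits per axis for simplicity):

    // Spread the lower 16 bits of n so there's a zero bit between each.
    function part1by1(n) {
      n &= 0x0000ffff;
      n = (n | (n << 8)) & 0x00ff00ff;
      n = (n | (n << 4)) & 0x0f0f0f0f;
      n = (n | (n << 2)) & 0x33333333;
      n = (n | (n << 1)) & 0x55555555;
      return n >>> 0;
    }

    // 32-bit Morton code: even bits from x, odd bits from y. Usable
    // directly as a locality-preserving sort key for 2-D points.
    function morton2d(x, y) {
      return (part1by1(x) | (part1by1(y) << 1)) >>> 0;
    }

    morton2d(3, 5); // 0b100111 = 39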


Yeah there's a ton of really cool things you can do with LSHs. They're really overlooked outside of a few niches.


The OP mentioned murmur3 due to its speed and high avalanche effect.

So what's the best LSH function for this use case?


Hi I’m not sure I understand the question. Avalanche and local sensitivity are opposing goals, so typically the more avalanche, the less locally sensitive, and vice versa. You could even state the avalanche goal as being locally insensitive - the whole idea of avalanche is one bit change in the input produces a maximally different and uniformly distributed output.


I'm basically asking for the most efficient and unavalanched hash map library recommended for the purposes of this use case.


I don't think you're gonna get one-size-fits-all answers the same way you will with avalanching hash functions. They can be one-size-fits-all because the requirement is that there is no (discernible) correlation between the input and output.

Locality-sensitive hashes depend on what sort of thing you are hashing. Is the data ordered? Is it a sequence? Is it points in space? The requirement is that similarity in the input under one metric yields similarity in the output under another metric.

There are no universal locality-sensitive hash functions, because the right function depends on the sense in which you're trying to preserve locality.


Ah, got it. I don’t even know of one, TBH, let alone have a best one in mind. Every time I’ve used an LSH, the hashing code has been ad-hoc and tailored to pretty specific data and optimization goals. (An added layer of difficulty is that “locality” isn’t universally defined.)

I’m not sure but it might be simplest to just look for a decent Map library in your favorite language, and keep the hashing part separate. It’s generally very easy to insert your own hash transform wherever you need, whether it’s on the fly, or saved as a sort key in memory. This way all you would need is to find a Morton order function, for example, if you want to hash with a z-order curve.


I feel like there's a more general concept of "bijective index" that subsumes LSHs and cryptographic hashes. When I think of hash I usually think of modulos, randomness, etc. when in general they're just ways of creating a set that's 1:1 to your inputs.


You can't really say much in general about hash functions - they're just general functions. The more you try to infer additional rules about hash functions, the more use cases you start to exclude. For example, _mostly_ hash functions have a fixed bit size for values in their range, but there's a small handful with variable-length output (HAS-V and HAVAL, e.g.) that would otherwise be considered examples of cryptographic hash functions.

At the risk of making absolute statements in a field with vague and imprecise definitions, only perfect hash functions are injective. Imperfect hash functions (typically) map a large domain to a smaller range. The presence of collisions indicates a non-injective function.

Surjection is not required for all hash functions. Cryptographic hashes should be surjective, but indexing hashes that are not surjective may have other desirable properties.

Since in general hash functions are neither injective nor surjective, they're definitely not bijective. You can have bijective hash functions, but their practical uses would be extremely narrow.


You’d better hope your hashmap is bijective, unless you’re ok with not being able to retrieve some data. Though granted the differentiating aspect of a hashmap AFAIK is the hash function.


The hash key is only an index, and the same input always gives the same hash value, so you can retrieve your item just fine even if you can't recover the original key from the hashed value. A hash _map_ is not a hash _function_.


> the same input always gives the same hash value, so you can retrieve your item just fine

You’ll retrieve a bunch of items that hash to the same value, so not exactly. You can retrieve the set of items that have the same hash as your item just fine - which is not exactly the same.

> A hash _map_ is not a hash _function_.

This is more or less what I was getting at. I know hash functions aren’t bijective, but hashmaps are often used as if they are. I guess I’m not sure if LSHs refer to a family of functions or hashmaps, and I guess the word hashes often implies functions - my bad.


I didn't know that, but it makes perfect sense after reading your explanation of it. Thank you, TIL!


If you're interested, I built a stupid simple deduplication algorithm[1]. It ironically uses the avalanche effect of another hash function to implement random projection onto a locality preserving bit vector. It sounds like real computer science bingo but it's actually surprisingly simple and it works much better than it has any business doing.

[1] https://github.com/MarginaliaSearch/MarginaliaSearch/blob/ma...


Is there a world in which hashes could be used as an alternative to word2vec style vector embeddings? Where you basically convert text inputs to an ideal fixed width representation and then calculate the "similarity" between them to derive some kind of meaning?


Adjacent algorithms are used all over that space.

In general, random projection, which is used for dimensionality reduction in vector spaces, can also be used to construct LSHs, for example.
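
A hedged sketch of the random-hyperplane flavor of this (the dimensions, bit count, and plain Math.random hyperplanes are all illustrative choices): each random hyperplane contributes one output bit, the sign of the dot product, so vectors with high cosine similarity tend to agree on most bits.

    // Generate numBits random hyperplanes in dim dimensions.
    function randomHyperplanes(numBits, dim) {
      return Array.from({ length: numBits }, () =>
        Array.from({ length: dim }, () => Math.random() * 2 - 1));
    }

    // One signature bit per hyperplane: which side the vector falls on.
    function lshSignature(vec, planes) {
      let sig = 0;
      planes.forEach((plane, i) => {
        const dot = plane.reduce((acc, p, j) => acc + p * vec[j], 0);
        if (dot > 0) sig |= 1 << i;
      });
      return sig >>> 0;
    }

    const planes = randomHyperplanes(16, 3);
    lshSignature([0.9, 0.1, 0.2], planes);
    lshSignature([0.8, 0.15, 0.25], planes); // likely differs in few bits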


I can easily meet those literal requirements with a function that is total garbage.

1. Take all the input bits and XOR them together. Now we have a value which flips between 0 and 1 every time we change a single bit of the input.

2. Use this value as an index into the sequence { 0xFFFFFFFF, 0xF0F0F0F0 }

Now we get 50% of the bits flipping when we change a single bit of the input.
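
In code, roughly (assuming byte-array input):

    // XOR every input bit into one parity bit, then use it to pick one
    // of two constants that differ in exactly 16 of their 32 bits.
    function parityHash(bytes) {
      let parity = 0;
      for (const b of bytes) {
        for (let i = 0; i < 8; i++) {
          parity ^= (b >>> i) & 1;
        }
      }
      return [0xffffffff, 0xf0f0f0f0][parity];
    }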


Hold your plucked chicken, Diogenes. Note the "evenly distributed" requirement. This is not.

But in general, sure, there's a lot of hash functions that aren't great. f(x)=x is one; it's actually even seen in the wild. Java's Integer class does that for its hashCode() function, and it even does a passable job of indexing a hash table.


That doesn’t meet the Avalanche Effect requirements. The requirement is that 50% of the bits change independently (on average) whenever an input bit changes. In other words, the output bit values need to be both random and uniformly distributed relative to every input bit.

https://en.wikipedia.org/wiki/Avalanche_effect#Bit_independe...
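
A hedged sketch of what such a test measures, assuming a hash from 32-bit ints to 32-bit ints: a good avalanching hash scores close to 16 (half of the 32 output bits) for every flipped input bit, while the parity trick above flips the same fixed 16 bits every time and so fails the independence requirement.

    // Count set bits in a 32-bit word.
    function popcount32(x) {
      x = x - ((x >>> 1) & 0x55555555);
      x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
      return Math.imul((x + (x >>> 4)) & 0x0f0f0f0f, 0x01010101) >>> 24;
    }

    // Flip each input bit in turn; measure how many output bits change.
    function avalancheScore(hash, input) {
      const base = hash(input);
      let totalFlips = 0;
      for (let bit = 0; bit < 32; bit++) {
        totalFlips += popcount32((base ^ hash((input ^ (1 << bit)) >>> 0)) >>> 0);
      }
      return totalFlips / 32; // average output bits flipped per input bit
    }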


So technically truncating and then padding is a hash function


Division by size is also very popular. 1:1 mapping ditto


Technically the identity function on 64-bit integers is a hash function. There's just not much you can say about hash functions in general.


Great intro to hashing, but if the concept of mmhash3's seed is going to be brought up I think it's only natural to mention its design limitations and why you still need something actually DoS-resistant like SipHash (even if we don't get into the details of the latter).

Also, it’s important to distinguish between hashing, hashing, and hashing. That is, hashing to map input to buckets, hashing to obscure the original string, and hashing to find similarities in the input data. They’re all called hashing but they have different (and conflicting!) requirements. There’s a reason you want mmhash3 to be fast but scrypt to be slow, and a reason why you want mmhash3 to avalanche but certainly don’t want your perceptual hashing algo of choice to do the same.


I did have the distinction between cryptographic and non-cryptographic in there originally but found when junior folks read it they’d get confused about which use-case was being discussed. So I decided to focus on just one. With more time, I would have liked to cover at least scrypt.

I’ll be honest, I don’t know what the design limitations of the seeding in murmur3 are. What I wanted to show is the concept of seeding and what it’s there to prevent. I’m hoping that comes across, even without any deeper exploration of seeding.


Seeding is kind of a hack. It doesn’t guarantee you’ll avoid a (or even the same) collision, and quite a few “popular” hashes only change the output (but don’t change whether or not there is a collision) when you change the seed.

Thanks for the article, though!


Ahh I see, the limitation you’re talking about is that there isn’t any inherent guarantee that a different seed will mean two values no longer collide? It’s just likely (in the case of murmur3 at least).


Yes, but it's not just that there's only "a chance" that the collision is avoided - that would be enough if it was actually just a random probability (after all, everything about hashing and collisions - in the best textbook case - is just chance). The problem is that there are methods for obtaining what we call "seed-independent collisions" where by analyzing the hashing algorithm itself you can actually determine a priori values that will collide regardless of the seed.

If you have half an hour to spare, I really recommend you take the time to read this: http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash...


Oh damn, I didn’t know that. Have bookmarked that post, thank you so much for your comments. I’ve been very fortunate in my writing that the comment sections are always full of people with awesome insight that I missed during my own reading on the topic.


If I ever write Murmur4 it's going to be the smallest function that I can prove (via sat solver) to have no seed-independent collisions :D


I approve of the use of relative metrics, Austin! Just be prepared for mmhash4 to be a few megabytes in size ;)


One thing I wish the original article would explain is the use case of buckets. I'm sort of imagining weird use cases where the buckets represent different servers or storage systems for the purpose of spreading out data efficiently...but I'm totally guessing here. Because I would assume a proper load balancer would be based on the dynamics of real-time server performance, not because of some simple hash function. What are common hash to bucket use cases?


The idea of "buckets" here is purely to speed up search time, at a very low level. Don't think about it in the context of a web server or database storage; it's way lower-level than that. It's literally just arrays in memory.

The idea is: if you don't have a Hash or Map already implemented in your language, how would you build a fast one? You can't write my_object['my_key'] - that doesn't exist, you don't have Key-Value storage. You need instead to somehow store those pieces of information and find them later.

Obviously, you could just stick every value inside one big array. Then when you call MyHash.get('key'), you simply do an array search. But that would be slow.

Instead, you can hash the 'key', stick it into a smaller bucket based on the hash, and then more quickly search for it later. In the future, you know the hash of 'key', so you know which bucket to look in.

The author does make it confusing, since in their example each bucket contains Entry (entry['value']), meaning they are already using a JS HashMap implementation in their rebuilding of a HashMap, but you could rewrite the example to do it without any objects. The code would be harder to read though.
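
For the curious, a hedged sketch of the bucket idea using 2-element arrays as entries instead of objects - the string-sum toy hash and fixed bucket count are simplifications, not the article's actual code:

    // Toy hash: sums character codes. Weak, but enough to pick buckets.
    function toyHash(key) {
      let sum = 0;
      for (const ch of key) sum += ch.charCodeAt(0);
      return sum;
    }

    class BucketMap {
      constructor(numBuckets = 16) {
        // Each bucket is an array of [key, value] pairs.
        this.buckets = Array.from({ length: numBuckets }, () => []);
      }
      bucketFor(key) {
        return this.buckets[toyHash(key) % this.buckets.length];
      }
      set(key, value) {
        const bucket = this.bucketFor(key);
        for (const entry of bucket) {
          if (entry[0] === key) { entry[1] = value; return; }
        }
        bucket.push([key, value]);
      }
      get(key) {
        // Only one small bucket gets searched, never the whole map.
        for (const [k, v] of this.bucketFor(key)) {
          if (k === key) return v;
        }
        return undefined;
      }
    }

    const map = new BucketMap();
    map.set('key', 42);
    map.get('key'); // 42, found by hashing straight to its bucket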


This is my eternal struggle: be correct, or be simple.

I wrote a version that used 2-element arrays but the code became more dense and I worried about losing people. I was hoping that how easy it would be to translate it to not use objects would give me a pass here, but apparently not :D


The buckets, in the context of hash maps, are typically lists. Either arrays or linked lists. They're used to store the key-value pairs.

Some load balancers do use hashing in much the same way hash maps do. Usually they'll take a combination of: source IP, source port, destination IP, destination port, and hash it. They'll then use that hash to pick a server. The practical impact of this is that each user always gets mapped to the same server. This is typically called "session sticky load balancing" because it means session information about that user can live on the server, safe in the knowledge that the user will always end up on that server and not get routed to any others.
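
A hedged sketch of that selection step (the tuple format and toy string hash are illustrative, not what any particular load balancer does):

    // Toy string hash; real load balancers use stronger functions.
    function hashString(s) {
      let h = 0;
      for (const ch of s) h = (Math.imul(h, 31) + ch.charCodeAt(0)) >>> 0;
      return h;
    }

    // The same connection 4-tuple always maps to the same backend,
    // which is what makes the session "sticky".
    function pickServer(srcIp, srcPort, dstIp, dstPort, servers) {
      const key = `${srcIp}:${srcPort}->${dstIp}:${dstPort}`;
      return servers[hashString(key) % servers.length];
    }

    pickServer("10.0.0.7", 51234, "192.168.1.1", 443, ["a", "b", "c"]);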


Off-topic, but it just occurred to me that the "dialog" in this article between the author and the "Haskie" dog is a very traditional way of writing pedagogical works.

Some historical examples of books written in this way are Galileo's "Dialogue Concerning the Two Chief World Systems" (the one that landed him in hot water), and "The Study of Counterpoint" by Johann Joseph Fux (from the early 18th century).

A small excerpt from "The Study of Counterpoint":

>Joseph.— Why did you leave out B between A and C?

> Aloys.— Because it has no perfect fifth and therefore cannot be the final of a mode -- which we shall discuss more fully in its proper place.

Both of these books are written as an entire discussion or argument, as was commonplace in teaching books. I honestly often find myself disliking it, but I think it is a great way to learn for many people.


I didn't know this. I created Haskie during the writing of my Memory Allocation post because I found myself trying to pre-empt reader questions a lot and it was weird to read. Creating a proxy for the reader in Haskie made it flow a lot better.

I've noticed in this post that a 2nd character that's a proxy for an "expert" in a topic would also be handy. Taking suggestions for a good dog breed to represent this character.


Fantastic article! Love the visualizations and your explanations are concise and very clear.


Thank you! I'm really glad you liked it.


Murmur3 author here, very nice writeup.


Thank you! That puts to bed some of my anxieties around whether I’ve misrepresented the qualities of murmur3


Very clear explanation! There's also a famous answer on SO that goes into collisions and has similar visual maps for viewing how random the hashing distributions are for a variety of different non-cryptographic hashing algorithms (top answer):

https://softwareengineering.stackexchange.com/questions/4955...


I saw this during the writing of the post! Extremely cool, I mention it a little in my behind-the-scenes post.


Amazingly well done article, thanks for sharing and making it!


Thank you! I absolutely love making these, and have plans for several more.


Between Sam and Bartosz, I think it's time to bite the bullet and get an RSS reader set up. Anyone got any good app recommendations for Android?


To be mentioned in the same breath as Bartosz Ciechanowski is an incredible compliment, thank you <3

I will say up front: my posts don't read well in an RSS reader. Sorry about that. But at least you'll get a notification when new ones come out.


I’m surprised xxhash wasn’t mentioned. Isn’t it faster and more hash collision resistant than murmur?

I think I read a very good website on another hash proposal that can supersede xxhash (roughly similar performance but more methodical in construction, with other nice properties) but I can't recall what it's called (it's not in the SMHasher tests yet, if I recall correctly).


This was purely due to familiarity. I knew murmur3 was considered a good, current hashing algorithm. I was able to find a good implementation of it in JS, so I didn't really look any further.

Thanks for exposing me to xxhash! I'll store that away as an alternative to murmur3 if I ever need one in future. :)


That's a good, visual primer! Congrats.


Thank you <3


Thanks for the delightful read!

I would've also liked to know how hash maps are typically implemented in programming languages. For instance, how the number of buckets is chosen, and whether buckets are added at runtime as the number of items in the map changes. But I'll do some further research myself!


You're welcome!

I deliberately didn't go into that for a few reasons.

1. It would have made this article very, very long.
2. It's a bit out of scope for an article on hashing.
3. I think I might give hash maps their own article in future.

Hash maps are fantastically deep. So many different ways to do it. You'll find a lot of material online, but I'd recommend Raymond Hettinger's talk on how Python's hash map data structure has evolved over time: https://www.youtube.com/watch?v=p33CVV29OG8.


This guide, just like your others, gave me the gift of several ‘ah-ha’ moments that not only helped hash maps click, but also several other linked concepts.

Thank you so much for making the time and effort to create such quality content.

Please keep producing more!


It's kind of confusing that the author uses JS objects (aka hashes) as entries inside the HashMap he's building. I think building the concept of a hash from the ground up would be more illuminating.


I agonised over this. I did originally have a version that used 2-element arrays instead, but the resulting code felt more dense and hard to follow.

I figured that people who know enough to point this out already know what's up. People who don't will benefit from the code being easier to read.


Excellent work, again! Your intros somehow elicit the best out of HN too: the comments are always full of interesting angles and variations.

Wanna do quaternions next? :) Many have tried..


There’s a name I haven’t heard for a long time.

Thank you, sir! As soon as I can reliably spell “quaternions” I will think about trying it.

Funnily, one of the ideas I have cooking in my head is going to require a foray into 3D.


Cool, I enjoyed reading the article and playing with the visualizations. I liked it a lot.

How can I learn to design a hash function? It is possible to understand that stringSum is bad compared to murmur3 by evaluating it against test cases, but what properties make it bad? Is it summation compared to xoring in murmur3? I intuit that summation is kinda lossy, but of course there is much more rigorous work put into it. It would be really cool to learn more about this. Thank you!


I'm actually not sure, to be honest. The post explicitly doesn't touch on the implementation of murmur3, and part of the reason is that while I can see what it's doing, I have no idea why it's doing it.

The initial announcement of murmur (https://tanjent.livejournal.com/756623.html) makes it seem like it's trial and error.


Murmur author here, it is a lot of trial and error but in general you're trying to pack as much "nonlinearity" as possible into as few instructions as possible. Multiplication is commutative, xor is commutative, but if you mix the two together you get functions that are strongly non-linear (in the algebra sense) because the mathematical "spaces" of integers and bits aren't connected together the same way. Ensuring that the nonlinearity applies to all the bits roughly equally is hard and requires trial and error.
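
You can see exactly that pattern in murmur3's 32-bit finalizer, transcribed to JS here (the real thing is C++):

    // murmur3 fmix32: xor-shift, multiply, xor-shift, multiply, xor-shift.
    function fmix32(h) {
      h ^= h >>> 16;
      h = Math.imul(h, 0x85ebca6b);
      h ^= h >>> 13;
      h = Math.imul(h, 0xc2b2ae35);
      h ^= h >>> 16;
      return h >>> 0;
    }

The multiplies smear each bit's influence upward, the xor-shifts fold the high bits back down, and alternating the two is what spreads every input bit across every output bit.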


The famous Bob Jenkins has lots of info on hash function design:

http://www.burtleburtle.net/bob/index.html

http://www.burtleburtle.net/bob/hash/index.html


Perhaps look at some of what Paul Khuong has written about umash; it is dense, but good, and includes good references.


Good stuff.

I’d perhaps add a paragraph on bcrypt and why it must be slow.




