Hacker News new | past | comments | ask | show | jobs | submit login

Some major blunders in the first paragraphs. Iterating over all 32 bits is trivial, over 64 bit not possible in normal time. She mixed up the numbers there.

Thus I don't see the real need to extend to 128 bits for the birthday paradox. At Google scale there might be, but not for pedestrians.




No blunder here. They're saying that with a 64-bit hash you will likely see hash collisions with 2^32 items. It's easy to generate 2^32 items, so it follows that it is easy to generate a collision.

The high likelihood of collisions is explained by the birthday paradox.

With a 128-bit hash you don't have this problem anymore.


But if you are then mapping the elements to (most likely vastly fewer than 2^32) buckets... why do the hash collisions matter? Just use the top 64 bits of the hash to compute a bucket, using ((hash >> 64) * buckets as u128) >> 64.

The reason this works is that hash64 / 2^64 is almost uniformly distributed on [0, 1), so floor(hash64 * buckets / 2^64) is almost uniformly distributed on [0, buckets). This compiles to a single mul instruction on x86-64, a single mulhi instruction on ARM.


Agreed on multiplying the top 64 bits rather than dividing: if the hash is calculated via prime multiplications, the top bits are "more random" than the bottom bits.

But we don't really know the hash table type. It could be doing something like hash table -> linked list -> items, in which case they might still be utilising the full 128-bit hash (I'd usually just store the lower 64 bits TBH).


I'm not saying to throw away the rest of the hash, just to ignore it for the bucket calculation. You can utilize the full 128-bit hash for other parts.


Ah, misunderstood you. Thanks for clarifying.


Calculate the probability then. With 64 bits 2^32 tries have a possibility of ~0.39347 collisions. Which is unlikely, not likely. You need at least 10x more to get a good probability of %97. Which needs ~40m with 8 cores on my CPU. Which is not a practical flyby attack anymore. With a big GPU it would though.

https://kevingal.com/apps/collision.html

I do calculate 32bit collisions for all hash functions regularly. But not for 64bit functions, because they are not practical. Google is the only prominent one who cares for 128bit, but then they should at least get their numbers right. Because most of their arguments are pure security theater.


2^32 is 2 months of 1000 random requests per second, so every 2 months you'd have a random chance that your service breaks for someone if you're relying on there not being any collisions.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: