I mean, if you have arbitrary pre compute time, did they consider generating a p...

nielsole · on Dec 17, 2022

I have very little knowledge on information theory, but wouldn't a perfect hashing algorithm for a set of data have to embed all the variance of that data? At query time this could likely still be made faster than binary search, but would not be O(1) as the complexity of the hashing algorithm depends on the data it is built for.

Edit: Wikipedia says perfect hashes have O(1) runtime for a given dataset. This framing is a bit weird, as I would think the size of "1" still depends on "n". Also Wikipedia gives ~1.44 bits/key as complexity so this should be much better than the 9 bytes the article came up with.

thealig · on Dec 19, 2022

Bit late but ill try - perfect hashing is a variant where the set of keys is known( say N) and the hash table size is set large enough (N^2) such that the no of collision is 0 in a nutshell (expected no of collisions to be precise which is less than 1)

So if i understand correctly the variance of the data would not really matter since we will choose a hash table of size of the total space of passwords thus absolutely no collisions will occur no matter which distribution of keys we pick and hash

Thats where the O(1) comes from since searching for a key in a hash slot would take constant time as there is just one key in each slot (due to no collisions)

As the other commentor said the SHA1 collision has such a low probability that theres no need for perfect hashing

nielsole · on Dec 19, 2022

Ah, ok that makes a lot of sense. But in the use case of the article I assumed we were under the constraint that output space would be exactly the size of input space as we have a non-sparse file on disk.

vlovich123 · on Dec 20, 2022

Pretty sure you get O(n) space usage

assbuttbuttass · on Dec 17, 2022

In practice, the difference between a perfect and imperfect hash is minimal since the chance of a SHA collision is vanishingly small

lern_too_spel · on Dec 17, 2022

I assume they meant computing a minimal perfect hash, so they can seek directly to where the password hash would be in the file if it exists instead of binary searching.

vlovich123 · on Dec 20, 2022

Yes that. Wasn’t even thinking about collisions since the probability is so low.