This problem came up before on Reddit. After a bit of code-golfing style optimis...

vlovich123 · on Dec 17, 2022

I mean, if you have arbitrary pre compute time, did they consider generating a perfect hash? The challenge will be that computing the hash for lookup may be slower than SHA1 due to the lack of HW optimization, despite the O(1) lookup.

nielsole · on Dec 17, 2022

I have very little knowledge on information theory, but wouldn't a perfect hashing algorithm for a set of data have to embed all the variance of that data? At query time this could likely still be made faster than binary search, but would not be O(1) as the complexity of the hashing algorithm depends on the data it is built for.

Edit: Wikipedia says perfect hashes have O(1) runtime for a given dataset. This framing is a bit weird, as I would think the size of "1" still depends on "n". Also Wikipedia gives ~1.44 bits/key as complexity so this should be much better than the 9 bytes the article came up with.

thealig · on Dec 19, 2022

Bit late but ill try - perfect hashing is a variant where the set of keys is known( say N) and the hash table size is set large enough (N^2) such that the no of collision is 0 in a nutshell (expected no of collisions to be precise which is less than 1)

So if i understand correctly the variance of the data would not really matter since we will choose a hash table of size of the total space of passwords thus absolutely no collisions will occur no matter which distribution of keys we pick and hash

Thats where the O(1) comes from since searching for a key in a hash slot would take constant time as there is just one key in each slot (due to no collisions)

As the other commentor said the SHA1 collision has such a low probability that theres no need for perfect hashing

nielsole · on Dec 19, 2022

Ah, ok that makes a lot of sense. But in the use case of the article I assumed we were under the constraint that output space would be exactly the size of input space as we have a non-sparse file on disk.

vlovich123 · on Dec 20, 2022

Pretty sure you get O(n) space usage

assbuttbuttass · on Dec 17, 2022

In practice, the difference between a perfect and imperfect hash is minimal since the chance of a SHA collision is vanishingly small

lern_too_spel · on Dec 17, 2022

I assume they meant computing a minimal perfect hash, so they can seek directly to where the password hash would be in the file if it exists instead of binary searching.

vlovich123 · on Dec 20, 2022

Yes that. Wasn’t even thinking about collisions since the probability is so low.

jojohohanon · on Dec 17, 2022

If we could support a larger file, we can just zero pad the counts (stored binary) to max needed length. Then we get accurate seek to entry # x.

If that is not possible, precompute the average count representation length to get an accurate AVG offset per entry. Accurate in aggregate.

But I wonder if I wouldn’t use some prefix representation, like a trie, but that’s cheating I think

Retric · on Dec 16, 2022

Computing a SHA1 (Ed: for every password in the file) is hardly free, which suggests rather than minimize total time you want to minimize latency for a service.

In which case the obvious improvement is to load as much of the password file into RAM as possible while waiting for the request and then only search the disk if it’s non in RAM. As to whether you want to load prefixes so you can quickly reject passwords not in the file or full passwords to acknowledge the password as soon as possible probably depends on how likely the matches are.

nine_k · on Dec 17, 2022

Modern CPUs have hardware acceleration for computing SHA1 and SHA2. They can be effectively free if they can be computed by CPU cores faster than the data from disk (likely NVMe now) are committed to RAM.

But a crypto hash like SHA may be a wrong hash here. Something like murmur / city / metro hash may be a better solution, since they compute 3-5 bytes of hash per clock.

Retric · on Dec 17, 2022

You are focusing on the wrong step. “Compute the SHA1 of each password. Sort. Store in a binary file”

Preprocessing the file and adding hashes isn’t free.

le-mark · on Dec 17, 2022

You are being down voted because although your point is valid, it has zero relevance because the goal is fast lookup. Preprocessing cost is not a factor.

Retric · on Dec 17, 2022

Sure, my point was if prepossessing is allowed then arguably so should having a hot cache.

It’s an arbitrary problem, so you could argue that pre computation is fine but you need to assume a cold cache or whatever.

coldtea · on Dec 17, 2022

>Sure, my point was if prepossessing is allowed then arguably so should having a hot cache.

Preprocessing is always an option: just needs some ahead of time effort.

A hot cache for a 37GB file is not always an option.

Retric · on Dec 17, 2022

Preprocessing isn’t an option if you’re given the file at the same time you get the list to check.

Aka are you building a service or grep.

chii · on Dec 17, 2022

it is a factor - you calculate the amortized cost of calculating the hashes (using the total access count as the amortization count).

HyperSane · on Dec 17, 2022

For any indexing operation that is always assumed to be asymptotically zero.

nine_k · on Dec 17, 2022

I'm saying that adding hashes is basically free, given the right hash function.

Sorting is very much not free though. I don't see why would sorting be needed if you have an efficient hash table, or a search tree. Essentially TFA describes just that, an index file + data file approach. They could have used SQLite which readily provides indexed access, instead of pure Python.

Of course if you never do the same search again, you're back to mmap-ing the file and doing a full scan smartly, like GNU grep does (https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...).

lern_too_spel · on Dec 17, 2022

You are being downvoted because the file is given as sorted sha1 hashes, and the benchmark in the article is about speeding up multiple lookups.

bArray · on Dec 17, 2022

If you're going to pre-compute, pre-compute also a look-up table for your ordered data. I wrote something like this for a simple flat-binary key-value store [1] (that table is computed into RAM on start-up).

That way you jump from look-up table to look-up table, and then search the remaining space with something like a binary search.

[1] https://gitlab.com/danbarry16/u-database

colejohnson66 · on Dec 16, 2022

> Compute the SHA1 of the candidate password, and use the prefix to guess at the location in the file. Read a block of +/- a few kilobytes around that guessed location. This will have a "hit" about 99% of the time, and at this guess will have to be refined at most 2-3 times.

Basically, a binary search?

ryangittins · on Dec 17, 2022

It's similar, but not quite a binary search. Imagine you have a book and you're trying to determine if a certain page number exists or if it's been torn out. You know the order, as pages are numbered in sequence, but you don't know whether the target page is still in the book or not. (For very large, cryptographically secure books you can assume the pages have been torn out uniformly.)

With the method described, you would measure the thickness of the book (the size of the lookup file) and open to a proportional page. e.g. If you're looking for page 120 and the book is 360 pages long, that's 1/3 of the thickness. So, find the point that's a third of the thickness of the book and open the book there. You'll be pretty close.

With a binary search, you wouldn't measure the thickness of the book. You'd always just blindly start in the middle, assess which half the target page is, and repeat the operation on that half and you'll get there pretty quickly.

Both exploit the knowledge of how the data is organized and both are most well-suited to uniform data (like pages in a book or random hashes), but are just different strategies requiring different means of accessing the data.

Admittedly not a perfect analogy, but you get the idea.

sally_glance · on Dec 17, 2022

Wiki lists a variation of binary search called "interpolation search" [1] which

> Instead of calculating the midpoint, interpolation search estimates the position of the target value, taking into account the lowest and highest elements in the array as well as length of the array.

Sound pretty close to what you're describing :-)

[1] https://en.m.wikipedia.org/wiki/Binary_search_algorithm#Inte...

dellamonica · on Dec 17, 2022

I think the basic idea is that the hash has a fairly uniform probability distribution, so knowing the prefix means you can estimate its location in a sorted list.

For instance if we were talking about n random sequences of digits then if you want to look for a number starting with 42 then you can start looking at the the 0.42n element and it is likely already very close to a match.

piyh · on Dec 17, 2022

Is a uniform hash the same thing as a consistent hash?

User23 · on Dec 17, 2022

No. Consistent hashing and its easier to understand cousin rendezvous hashing are about maintaining properties after a rehash.

contravariant · on Dec 16, 2022

More like the secant method I suppose.

0-_-0 · on Dec 17, 2022

You could optimize this further if you're combining multiple lookups