This problem came up before on Reddit. After a bit of code-golf-style optimisation, the general consensus was that for searching for passwords in a file, the optimal exact solution (which excludes Bloom filters) was:
Compute the SHA1 of each password. Sort. Store in a binary file, so for 'n' passwords the file is '20n' bytes in length exactly.
Compute the SHA1 of the candidate password, and use the prefix to guess at the location in the file. Read a block of +/- a few kilobytes around that guessed location. This will score a "hit" about 99% of the time; otherwise the guess has to be refined, at most 2-3 times.
It's possible to precompute the maximum +/- error bounds, and also to produce a tiny (~100KB) second level index that can guarantee a hit in one I/O operation for a lookup. For a file that's bigger than memory, this is absolutely optimal and cannot be improved upon (without sacrificing the exact answers).
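For illustration, a minimal sketch of that lookup, assuming the sorted 20-byte-per-record binary file has already been built; the window size and function name are arbitrary, and the rare case where the window misses (the "refine 2-3 times" step, or the second-level index) is left out:

import bisect
import hashlib
import os

RECORD = 20            # one raw SHA-1 digest per record
WINDOW_RECORDS = 256   # records read on each side of the estimated position

def probably_pwned(path, password):
    digest = hashlib.sha1(password.encode()).digest()
    size = os.path.getsize(path)
    n = size // RECORD
    # Use the first 8 bytes of the digest to estimate the record index.
    guess = int.from_bytes(digest[:8], "big") * n // (1 << 64)
    first = max(0, guess - WINDOW_RECORDS)
    with open(path, "rb") as f:
        f.seek(first * RECORD)
        block = f.read(2 * WINDOW_RECORDS * RECORD)
    # The window is sorted, so bisect it for an exact match.
    records = [block[i:i + RECORD] for i in range(0, len(block), RECORD)]
    i = bisect.bisect_left(records, digest)
    return i < len(records) and records[i] == digest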
I mean, if you have arbitrary precompute time, did they consider generating a perfect hash? The challenge is that computing the hash at lookup time may be slower than SHA1 due to the lack of HW optimization, despite the O(1) lookup.
I have very little knowledge on information theory, but wouldn't a perfect hashing algorithm for a set of data have to embed all the variance of that data? At query time this could likely still be made faster than binary search, but would not be O(1) as the complexity of the hashing algorithm depends on the data it is built for.
Edit: Wikipedia says perfect hashes have O(1) runtime for a given dataset. This framing is a bit weird, as I would think the size of "1" still depends on "n". Also Wikipedia gives ~1.44 bits/key as complexity so this should be much better than the 9 bytes the article came up with.
Bit late, but I'll try: perfect hashing is a variant where the set of keys is known (say N) and the hash table is made large enough (N^2 slots) that the number of collisions is 0, in a nutshell (the expected number of collisions, to be precise, which is less than 1).
So if I understand correctly, the variance of the data would not really matter, since we will choose a hash table sized to the total space of passwords, and thus absolutely no collisions will occur no matter which distribution of keys we pick and hash.
That's where the O(1) comes from: searching for a key in a hash slot takes constant time, as there is just one key in each slot (due to no collisions).
As the other commenter said, a SHA1 collision has such a low probability that there's no need for perfect hashing.
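As a quick back-of-the-envelope check of the N^2 claim above (not tied to any particular perfect-hashing construction):

# Hashing n keys uniformly into m slots collides, in expectation, on
# C(n, 2) / m pairs. With m = n**2 that is just under 1/2, so a randomly
# drawn hash function is collision-free more than half the time and can
# simply be re-drawn until it is.
n = 1_000
m = n ** 2
print(n * (n - 1) / (2 * m))   # 0.4995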
Ah, ok, that makes a lot of sense. But in the use case of the article, I assumed we were under the constraint that the output space would be exactly the size of the input space, as we have a non-sparse file on disk.
I assume they meant computing a minimal perfect hash, so they can seek directly to where the password hash would be in the file if it exists instead of binary searching.
Computing a SHA1 (Ed: for every password in the file) is hardly free, which suggests that rather than minimizing total time you want to minimize latency for a service.
In which case the obvious improvement is to load as much of the password file into RAM as possible while waiting for the request, and then only search the disk if it's not in RAM. Whether you want to load prefixes, so you can quickly reject passwords not in the file, or full passwords, to acknowledge a match as soon as possible, probably depends on how likely the matches are.
Modern CPUs have hardware acceleration for computing SHA1 and SHA2. They can be effectively free if they can be computed by CPU cores faster than the data from disk (likely NVMe now) are committed to RAM.
But a crypto hash like SHA may be the wrong hash here. Something like murmur / city / metro hash may be a better solution, since they compute 3-5 bytes of hash per clock.
You are being downvoted because, although your point is valid, it has zero relevance: the goal is fast lookup, and preprocessing cost is not a factor.
I'm saying that adding hashes is basically free, given the right hash function.
Sorting is very much not free though. I don't see why sorting would be needed if you have an efficient hash table or a search tree. Essentially TFA describes just that, an index file + data file approach. They could have used SQLite, which readily provides indexed access, instead of pure Python.
If you're going to pre-compute, pre-compute also a look-up table for your ordered data. I wrote something like this for a simple flat-binary key-value store [1] (that table is computed into RAM on start-up).
That way you jump from look-up table to look-up table, and then search the remaining space with something like a binary search.
> Compute the SHA1 of the candidate password, and use the prefix to guess at the location in the file. Read a block of +/- a few kilobytes around that guessed location. This will have a "hit" about 99% of the time, and at this guess will have to be refined at most 2-3 times.
It's similar, but not quite a binary search. Imagine you have a book and you're trying to determine if a certain page number exists or if it's been torn out. You know the order, as pages are numbered in sequence, but you don't know whether the target page is still in the book or not. (For very large, cryptographically secure books you can assume the pages have been torn out uniformly.)
With the method described, you would measure the thickness of the book (the size of the lookup file) and open to a proportional page. e.g. If you're looking for page 120 and the book is 360 pages long, that's 1/3 of the thickness. So, find the point that's a third of the thickness of the book and open the book there. You'll be pretty close.
With a binary search, you wouldn't measure the thickness of the book. You'd always just blindly start in the middle, assess which half the target page is in, and repeat the operation on that half, and you'll get there pretty quickly.
Both exploit the knowledge of how the data is organized and both are most well-suited to uniform data (like pages in a book or random hashes), but are just different strategies requiring different means of accessing the data.
Admittedly not a perfect analogy, but you get the idea.
Wiki lists a variation of binary search called "interpolation search" [1] which
> Instead of calculating the midpoint, interpolation search estimates the position of the target value, taking into account the lowest and highest elements in the array as well as length of the array.
I think the basic idea is that the hash has a fairly uniform probability distribution, so knowing the prefix means you can estimate its location in a sorted list.
For instance, if we were talking about n random sequences of digits, then if you want to look for a number starting with 42, you can start looking at the 0.42n-th element and it is likely already very close to a match.
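A small sketch of that idea for a plain sorted list of roughly uniform integers, essentially the algorithm the Wikipedia article describes (names are illustrative):

def interpolation_search(a, target):
    """Return an index of target in the sorted list a, or -1.
    Assumes the values are roughly uniformly distributed."""
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= target <= a[hi]:
        if a[hi] == a[lo]:
            mid = lo
        else:
            # Estimate the position from the value instead of halving.
            mid = lo + (target - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[mid] == target:
            return mid
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1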
CREATE INDEX currently has the restriction that the index must fit in memory [1]. As the data is already sorted, creating an index is not necessary anyway. The min/max indexes created automatically by the system are sufficient to complete the query in a few milliseconds.
D CREATE TABLE passwords (hash TEXT, count INT);
D COPY passwords FROM '~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':');
D .timer on
D SELECT * FROM passwords WHERE hash=upper('5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8');
┌──────────────────────────────────────────┬─────────┐
│ hash │ count │
│ varchar │ int32 │
├──────────────────────────────────────────┼─────────┤
│ 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 │ 9545824 │
└──────────────────────────────────────────┴─────────┘
Run Time (s): real 0.005 user 0.007455 sys 0.000584
I cannot even ssh into the server after trying to use DuckDB. It is completely dead (with all the ducks, what a misery).
The reason is probably that it's using a full index, in contrast with the sparse index in ClickHouse, and maybe it's trying to build it in memory, going to swap (the server has 32 GB memory).
Because DuckDB is ACID-compliant [1], data is loaded in an all-or-nothing manner. As the load was interrupted by the system running out of memory, the table is expected to be empty.
If you load the data properly (creating the index after insertion, which is definitely preferable in this case), it will load extremely quickly (milliseconds).
You should also disclose your relationship with a competing project. For the record, I use DuckDB in personal projects and love it. You seem to be misusing it. :)
“These days”? A 32 bit intel running mysql back in 2000 would probably have laughed at this query.
Granted, I’d expect one or two disk seeks, at ~ 10 ms each. I imagine on modern hardware, it would be in the 100’s of usec range. (Assuming you limited it to 2-4 GB of ram).
I'm always amazed about how well naive Unix solutions measure up compared to contemporary solutions. For instance, the non-indexed python solution here takes a little more time than the Unix look command (Pi 4):
time pwned.sh password
5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:9545824
0.053u 0.035s 0:00.40 20.0% 83+22k 35+0io 21pf+0w
time python3.10 02-binary-search.py pwned-passwords-sha1-ordered-by-hash-v8.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.090021 seconds
0.105u 0.013s 0:00.44 25.0% 0+3k 32+0io 4pf+0w
Obviously, if you're doing millions of lookups you'll want indexing, but otherwise using built-in tools that have been optimized and refined for decades does very well.
If we are really going all out for performance optimization...
In theory, hashes are going to be very evenly distributed. With a file of that size you could probably land within 1% of the correct location just by seeking to byte offset percentage given by the first bytes of the hash...
> We can do this once, and then binary search a safety interval around that position. Alas, this only gets rid of the fast jumps at the beginning of the binary search, and for some reason, it ends up being slightly slower than binary search alone.
I assume this doesn't improve perf much because you pretty quickly zoom in on a slice of hashes that are not evenly distributed. 37GB feels like a lot of hashes but it's pretty small compared to the space of all hashes, I suspect this approach would be more effective if we were searching over a much bigger chunk of data.
look - display lines beginning with a given string
DESCRIPTION
The look utility displays any lines in file which contain string. As look performs a binary search, the lines in file must be sorted (where sort(1) was given the same options -d and/or -f that look is invoked with).
example:
justin@box:~/data$ time look $(echo -n secret123 | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned-passwords-sha1-ordered-by-hash-v6.txt
F2B14F68EB995FACB3A1C35287B778D5BD785511:17384
real 0m0.212s
user 0m0.005s
sys 0m0.001s
justin@box:~/data$ time look $(echo -n secret123 | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned-passwords-sha1-ordered-by-hash-v6.txt
F2B14F68EB995FACB3A1C35287B778D5BD785511:17384
real 0m0.002s
user 0m0.003s
sys 0m0.001s
You can make Python binary search super fast if you use mmap. Here's a version of that I had lying around; it's probably correct.
import os
import mmap

def do_mmap(f):
    # Map the whole file read-only; return the map, its size, and the fd.
    fd = os.open(f, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)
    os.lseek(fd, 0, os.SEEK_SET)
    m = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    return m, size, fd

class Searcher:
    def __init__(self, file):
        self.file = file
        self.map, self.size, self.fd = do_mmap(file)

    def close(self):
        self.map.close()
        os.close(self.fd)

    def find_newline(self):
        # Skip to the start of the next line and return that offset.
        self.map.readline()
        return self.map.tell()

    def binary_search(self, q):
        # Narrow the byte range [start, end) with a binary search over
        # offsets, then scan the remaining window for lines starting with q.
        # (The start < end - 2 guard avoids getting stuck on start = x,
        # end = x + 1, e.g. from the \r\n line endings.)
        start = 0
        end = self.size
        while start < end - 2:
            mid = start + (end - start) // 2
            self.map.seek(mid)
            pos = self.find_newline()
            if pos > end:
                break
            line = self.map.readline()
            if q < line:
                end = mid
            elif q > line:
                start = mid
        # Re-seek to the start of the window and skip the partial line
        # there (unless we are at the very beginning of the file).
        self.map.seek(start)
        if start:
            self.find_newline()
        while True:
            line = self.map.readline()
            if not line:
                break
            if line.startswith(q):
                yield line
            elif line.split(b":", 1)[0] > q:
                break

if __name__ == "__main__":
    import sys
    import time

    q = sys.argv[1]
    s = Searcher("pwned-passwords-sha1-ordered-by-hash-v6.txt")
    ss = time.perf_counter()
    for x in s.binary_search(q.upper().encode()):
        print(x)
    ee = time.perf_counter()
    print(ee - ss)
If you have 10 computers, put every 10th line in its own file. If each file is 1000 lines, put line 500 at the start, then line 250, then line 750, then lines 125, 375, etc.
There are a bit over 1B passwords in there (based on the size of the file and the length of a line). You would need a binary file around 3GB in size that you would either have to load into memory or do about 17 accesses into, reading specific bytes (no searching), to figure out if the password is in the filter.
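For what it's worth, the standard Bloom filter sizing formulas land roughly on those numbers; the false-positive target below is an assumption on my part:

import math

n = 1_000_000_000   # ~1B password hashes
p = 1e-5            # assumed false-positive rate
bits = -n * math.log(p) / math.log(2) ** 2   # optimal filter size in bits
k = bits / n * math.log(2)                   # optimal number of hash probes
print(f"{bits / 8 / 1e9:.1f} GB, {math.ceil(k)} probes")  # ~3.0 GB, 17 probes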
> Can't you just use something like a bloom filter?
If I had the ability to download a massive file I’d try it out on a hextree I toy around with occasionally.
If you're making an index file, you may as well just throw it into a tree structure where a lookup is anywhere from 1 to 20 pointer dereferences (assuming the checksum is 20 bytes), as it optimizes storage so that tree depth is variable. Plus it can retain the counts as well.
Now I really want to try this out, the last article I read along these lines I used it as a comparison and it was equally as efficient as their conclusion.
As I said, "not worrying about it in practice" isn't "the same problem." If that is hard to wrap your mind around, you can suppose a fine of $1 billion for every wrong answer.
I'm pretty skeptical of the performance of a billion files.
I'm sure it will go at okay speed once you actually construct it, since it's basically a tree, but with so many entries I'd expect it to be much less pleasant than a database in many ways. The "preprocessing" is going to be especially awful.
There's probably a large overhead in storage used. It would probably result in much larger overall storage than 37GB. And I agree preprocessing would be painful.
But I'm curious on how lookup speeds would compare to the author's 1ms.
I'm also curious on how addition of a new hash would compare against adding a new hash to the single sorted file used by the author.
Leveraging any database is probably better in any case :)
If we're running on a SATA SSD, we can probably chain 5 or fewer accesses to the drive before we go over a millisecond. And each directory of depth likely requires 2 accesses.
> I'm also curious on how addition of a new hash would compare against adding a new hash to the single sorted file used by the author.
In a fair fight with that requirement, the sorted file would be allowed to add a few percent of extra blank entries and then it could insert in a millisecond too.
Git chooses to pack them into far fewer files when there are lots of loose files.
Actually, these days many filesystems perform surprisingly well with lots of little files. What doesn't work with huge directories is basic utilities like ls, or anything that likes to collect the file list in memory and sort it. I have some directories that essentially hang ls, where find is still happy listing the files (because it just streams the output).
You could compute the expected position of the hash in the file as hash / 2^160 * filesize and then just look around in the vicinity of that. It would be interesting to know what the maximum deviation of any hash from its expected position is; then you could maybe just load a single block of that size and search it in memory, if the maximum deviation is sufficiently small.
Not very well unfortunately. The author's confusion about how binary search works is preventing progress toward this solution. The idea is to do an "unbalanced" binary search, with the pivot chosen as offset (range size * big-endian value of hash / (maximum possible big-endian value of hash + 1)). Correctness is exactly the same as correctness of binary search, because the correctness of binary search does not depend on the offset of the pivot (so long as it's not zero or past the end). At some heuristic range length, fall back to linear search of the range. (Possibly, at some larger length, fall back to balanced binary search... But with efficient linear search of no more than a couple disk pages, I doubt this would help.)
This exactly. The author is not describing their approach very well, so I stopped reading because it was frustrating to try to figure out if they tried this solution.
The simple statement of the approach is:
Use binary search, but instead of using the middle as the pivot, use the estimated position heuristic as the pivot for each iteration.
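A sketch of that idea against the sorted HASH:COUNT text file. Unlike the exact pivot formula quoted above, it re-estimates the pivot from the hash values already seen at the range boundaries; the LINEAR_SCAN cut-off, the line-boundary handling, and the names are assumptions of mine, not the author's code:

import os

LINEAR_SCAN = 4096  # below this range size, just scan the window linearly

def interpolated_lookup(path, hex_hash):
    """Binary search over byte offsets in the sorted HASH:COUNT file,
    placing each pivot at the hash's estimated position in the current
    range instead of the midpoint. Expects the bare 40-hex-digit SHA-1."""
    target = hex_hash.upper().encode()
    val = int(hex_hash, 16)
    lo_val, hi_val = 0, 1 << 160            # hash values bounding the range
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while hi - lo > LINEAR_SCAN:
            # Estimate where the target sits between the known boundary values.
            pivot = lo + (hi - lo) * (val - lo_val) // (hi_val - lo_val)
            f.seek(pivot)
            f.readline()                    # advance to the next line start
            pos = f.tell()
            line = f.readline()
            if pos >= hi or not line:
                break                       # sampled outside the range
            key, _, count = line.rstrip(b"\r\n").partition(b":")
            if key < target:
                lo, lo_val = pos, int(key, 16)
            elif key > target:
                hi, hi_val = pivot, int(key, 16)
            else:
                return int(count)           # exact hit during the search
        # Linear scan of whatever is left.
        f.seek(lo)
        if lo:
            f.readline()                    # skip the partial first line
        for line in f:
            key, _, count = line.rstrip(b"\r\n").partition(b":")
            if key == target:
                return int(count)
            if key > target:
                break
    return None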
> You could compute the expected position of the hash in the file as hash / 2^160 * filesize and then just look around in the vicinity of that.
> The idea is to do an "unbalanced" binary search, with the pivot chosen as offset (range size * big-endian value of hash / (maximum possible big-endian value of hash + 1)).
Thanks. I thought something along these lines but didn't know how to express or formalize it. Now I must study unbalanced binary search algorithms.
I tried it out, and the maximum deviation of the actual position in the file from the estimated one is a bit under five megabytes for the version 8 file. That is not good enough to just fetch a single page, by quite a bit. So I guess building an index is the only real option to get the number of reads down, but even then you will have at least one read. With an SSD that will take on the order of a few hundred microseconds, so it can be done in under a millisecond without relying on a cache. Binary search will require about 20 to 30 reads, so without a cache it would be about that much slower; however, the first couple of steps will quickly get cached, reducing the number of reads that actually hit the hardware quite a bit.
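Roughly how such a measurement could be scripted (a single streaming pass, so it reads the whole 37 GB file; the function name is illustrative):

import os

def max_deviation(path):
    """Report the largest gap between each line's actual byte offset and
    its interpolated estimate (hash / 2^160 * file size)."""
    size = os.path.getsize(path)
    worst = 0
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            value = int(line.split(b":", 1)[0], 16)
            estimate = value * size // (1 << 160)
            worst = max(worst, abs(offset - estimate))
            offset += len(line)
    return worst

# e.g. print(max_deviation("pwned-passwords-sha1-ordered-by-hash-v8.txt"))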
> Binary search will require about 20 to 30 reads,
This does not pass a sanity check. If one switches to linear scan at 4KiB = 2^12 bytes, then 30 (balanced, but approximately correct for unbalanced) binary search steps would search a 4TiB file. I don't know how big the hash file is but it isn't this big. Not even 4GiB big.
The maximum number of steps is lg(file size)-12, maybe plus 2 or so to account for the unbalanced search. (NB the unbalanced search approximates the balanced only because the hashes are approximately uniform; the relation doesn't hold in general.)
The version 8 file contains 847,223,402 hashes, which gives 29.7 bits; that is where I got the 30 reads from. I assumed the most naïve approach, always just seeking and reading a single hash, although admittedly this would almost certainly not read the final page over and over again. And then I just subtracted 10, taking into account that you would read the final page only once, with 10 being a rough approximation of 12. How many bytes you actually read per read could be affected by all kinds of things - sector size, cluster size, page size, read ahead, ... - so 10 bits was good enough for me. But you are right, the number of reads would almost certainly be in the vicinity of 20, not 30.
I actually did try this, but I got impatient and stopped it after 30 minutes or so, and a ~20G database file.
For some reason, after that, `ls` in that directory (or even shell completion) would freeze the shell entirely, even after reboot. I eventually managed to delete the file with `rm db.sqlite` (no completion allowed), but even that took like 2 minutes.
I might try again with WAL enabled (based on my shell history, I also deleted a -journal file).
Importing was sooo slow (something I likely messed up while trying to make it go faster; it was only chugging along at ~100MB/s).
SQLite version 3.37.0 2021-12-09 01:34:53
here's the results tho
sqlite> create table hashes (hash text, count text);
sqlite> pragma cache_size=1000;
sqlite> pragma synchronous=off;
sqlite> .mode csv
sqlite> .separator :
sqlite> .timer on
sqlite> .import pwned-passwords-sha1-ordered-by-hash-v8.txt hashes
sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
^CRun Time: real 14.841 user 9.370057 sys 2.363629
Error: stepping, interrupted (9)
sqlite> create unique index hash_uniq on hashes(hash);
Run Time: real 530.029 user 262.952507 sys 213.305437
sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
Run Time: real 0.004 user 0.000303 sys 0.000936
sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
Run Time: real 0.001 user 0.000310 sys 0.000306
sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
Run Time: real 0.000 user 0.000295 sys 0.000305
sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
Run Time: real 0.000 user 0.000320 sys 0.000279
sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
Run Time: real 0.001 user 0.000322 sys 0.000317
sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
Run Time: real 0.001 user 0.000128 sys 0.000155
sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
Run Time: real 0.001 user 0.000145 sys 0.000158
I wonder if they could improve this by queueing multiple positions to read at once. Presumably the disk will be able to do some of them more efficiently.
(author here) Hey, that's actually a great idea. I remember reading about a fast grep or wc implementation that did parallel reads.
Off the top of my head, I don't really know how to do that (aside from the naive way of using threads). I'm guessing readv [1] might be useful? Will definitely look into it.
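Not the author's code, just a sketch of the "naive" thread route (names are illustrative): each worker issues an independent positioned read with os.pread, and the GIL is released during the I/O, so an SSD can service several reads at once.

import os
from concurrent.futures import ThreadPoolExecutor

def read_at_offsets(path, offsets, length=4096, workers=8):
    """Read `length` bytes at each offset concurrently and return the
    blocks in the same order as `offsets`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda off: os.pread(fd, length, off), offsets))
    finally:
        os.close(fd)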
This could be a game changing blog post for self taught devs like me who are just getting into profiling code for improving it and such. I’m still reading it but I’ve already bookmarked it for further study.
Glad to hear that :D One of my main goals was to show the entire progression of how I approached the problem, instead of just saying "here's the final result".
I wish that community would put a warning sign or something. The article is a collection of things that should never be done. People who want to learn software performance work need to start from a solid foundation. Mentioning that algorithms exist at the end is not good enough.
(replying with more details, because I think it's pretty cool, and I'll definitely try this out)
The linked item is to https://blog.mro.name/2022/08/pwned-diy/, where the author converts the passwords list to a CDB (Constant DataBase) file, a simple, fast lookup database format created by D. J. Bernstein.
There is a more modern rewrite that does away with the 4 gigabyte limit of the original[1]. I am not sure if the performance characteristics are the same.
You're a bit hasty, young padawan. Are you insinuating this has never been tried, without even taking a glimpse at it to see how it's done? What use is it linking the cdb docs? The article has it all in context, if you bother reading.
The dataset is split into 16 cdbs, to be precise, in case you wonder how 37GB fits within the 4GB cdb size limit.
Having recently been digging into the CDB format[1], this was exactly my first thought as well. `tinycdb` is extremely fast at creating an index, and most successful lookups take only two disk seeks (failed lookups need only one), very close to the optimal possible lookup performance on this problem.
Let me get this straight. OP searched for an entry in a sorted list using dichotomic search. Okay... Any CS undergrad can do that. Is there something that I'm missing?
CS solved this problem, but not in such a way yet that we don't have to think about it anymore. If Python's dict implemented all of this under the hood then it could be called a solved problem.