I almost failed to search a 37 GB text file in under 1 millisecond (andgravity.com)
384 points by xrayarx on Dec 16, 2022 | 142 comments



This problem came up before on Reddit. After a bit of code-golfing-style optimisation, the general consensus was that, for searching for passwords in a file, the optimal exact solution (which excludes Bloom filters) was:

Compute the SHA1 of each password. Sort. Store in a binary file, so for 'n' passwords the file is '20n' bytes in length exactly.

Compute the SHA1 of the candidate password, and use the prefix to guess at the location in the file. Read a block of +/- a few kilobytes around that guessed location. This will have a "hit" about 99% of the time, and the guess will have to be refined at most 2-3 times.

It's possible to precompute the maximum +/- error bounds, and also to produce a tiny (~100KB) second level index that can guarantee a hit in one I/O operation for a lookup. For a file that's bigger than memory, this is absolutely optimal and cannot be improved upon (without sacrificing the exact answers).
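A rough sketch of that lookup, assuming the precomputed file is just the sorted 20-byte digests packed back to back; the 64 KiB window and the function name are mine, and in a real version the window would come from the precomputed error bound:

  import os
  from bisect import bisect_left

  RECORD = 20                      # packed, sorted SHA-1 digests, 20 bytes each

  def lookup(path, digest, window=64 * 1024):
      size = os.path.getsize(path)
      count = size // RECORD
      # hashes are uniform, so the leading bytes predict the record index
      est = int.from_bytes(digest[:8], "big") * count // 2**64
      lo = max(0, est * RECORD - window)
      hi = min(size, est * RECORD + window)
      lo -= lo % RECORD                          # keep both ends record-aligned
      hi -= hi % RECORD
      with open(path, "rb") as f:
          f.seek(lo)
          block = f.read(hi - lo)                # one read around the estimate
      recs = [block[i:i + RECORD] for i in range(0, len(block), RECORD)]
      if (lo == 0 or recs[0] <= digest) and (hi == size or digest <= recs[-1]):
          i = bisect_left(recs, digest)
          return i < len(recs) and recs[i] == digest
      raise LookupError("estimate was off by more than the window: widen and retry")

Note this packs only the digests; the counts would need a parallel structure or wider records.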


I mean, if you have arbitrary precompute time, did they consider generating a perfect hash? The challenge will be that computing the hash for lookup may be slower than SHA1 due to the lack of HW optimization, despite the O(1) lookup.


I have very little knowledge of information theory, but wouldn't a perfect hashing algorithm for a set of data have to embed all the variance of that data? At query time this could likely still be made faster than binary search, but it would not be O(1), as the complexity of the hashing algorithm depends on the data it is built for.

Edit: Wikipedia says perfect hashes have O(1) runtime for a given dataset. This framing is a bit weird, as I would think the size of "1" still depends on "n". Also Wikipedia gives ~1.44 bits/key as complexity so this should be much better than the 9 bytes the article came up with.


Bit late but I'll try - perfect hashing is a variant where the set of keys is known (say N) and the hash table size is set large enough (N^2) such that the number of collisions is 0, in a nutshell (the expected number of collisions, to be precise, which is less than 1).

So if I understand correctly, the variance of the data would not really matter, since we will choose a hash table the size of the total space of passwords; thus absolutely no collisions will occur no matter which distribution of keys we pick and hash.

That's where the O(1) comes from, since searching for a key in a hash slot takes constant time, as there is just one key in each slot (due to no collisions).

As the other commenter said, the SHA1 collision probability is so low that there's no need for perfect hashing.
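For the curious, a toy sketch of the construction described above (quadratic-size table, random universal hash, retry until collision-free); the prime, key width, and names are arbitrary choices for the sketch:

  import random

  def build_perfect_table(keys):
      # m = n^2 slots => expected number of collisions < 1/2 for a universal
      # hash, so a handful of random retries finds a collision-free table
      n, m = len(keys), len(keys) ** 2
      p = (1 << 61) - 1                        # prime larger than any key
      while True:
          a, b = random.randrange(1, p), random.randrange(p)
          h = lambda k: ((a * k + b) % p) % m
          table = {}
          if all(table.setdefault(h(k), k) == k for k in keys):
              return table, h

  keys = [random.getrandbits(32) for _ in range(1000)]
  table, h = build_perfect_table(keys)
  assert all(table[h(k)] == k for k in keys)   # O(1) membership checks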


Ah, OK, that makes a lot of sense. But in the use case of the article, I assumed we were under the constraint that the output space would be exactly the size of the input space, as we have a non-sparse file on disk.


Pretty sure you get O(n) space usage


In practice, the difference between a perfect and imperfect hash is minimal since the chance of a SHA collision is vanishingly small


I assume they meant computing a minimal perfect hash, so they can seek directly to where the password hash would be in the file if it exists instead of binary searching.


Yes that. Wasn’t even thinking about collisions since the probability is so low.


If we could accept a larger file, we could just zero-pad the counts (stored in binary) to the maximum needed length. Then we get an accurate seek to entry #x.

If that is not possible, precompute the average count representation length to get an accurate average offset per entry. Accurate in aggregate.

But I wonder if I wouldn't use some prefix representation, like a trie; that's cheating, I think.
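One way to read the fixed-width idea, keeping the text format and zero-padding the counts to a fixed number of digits (the 8-digit pad and the \r\n line ending are assumptions):

  PAD = 8                          # zero-pad counts to 8 digits (assumption)
  LINE = 40 + 1 + PAD + 2          # "<40-char hash>:<count>\r\n" = 51 bytes

  def read_entry(f, i):
      # with fixed-width lines, entry #i starts at an exactly known offset
      f.seek(i * LINE)
      hash_, count = f.read(LINE).decode().rstrip().split(":")
      return hash_, int(count)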


Computing a SHA1 (Ed: for every password in the file) is hardly free, which suggests that rather than minimizing total time, you want to minimize latency for a service.

In which case the obvious improvement is to load as much of the password file into RAM as possible while waiting for the request, and then only search the disk if it's not in RAM. Whether you want to load prefixes (so you can quickly reject passwords not in the file) or full passwords (to acknowledge the password as soon as possible) probably depends on how likely the matches are.


Modern CPUs have hardware acceleration for computing SHA1 and SHA2. They can be effectively free if they can be computed by CPU cores faster than the data from disk (likely NVMe now) are committed to RAM.

But a crypto hash like SHA may be the wrong hash here. Something like murmur / city / metro hash may be a better solution, since they hash 3-5 bytes of input per clock cycle.


You are focusing on the wrong step. “Compute the SHA1 of each password. Sort. Store in a binary file”

Preprocessing the file and adding hashes isn’t free.


You are being downvoted because although your point is valid, it has zero relevance, because the goal is fast lookup. Preprocessing cost is not a factor.


Sure, my point was that if preprocessing is allowed, then arguably so should a hot cache be.

It's an arbitrary problem, so you could argue that precomputation is fine but you need to assume a cold cache, or whatever.


>Sure, my point was that if preprocessing is allowed, then arguably so should a hot cache be.

Preprocessing is always an option: it just needs some ahead-of-time effort.

A hot cache for a 37GB file is not always an option.


Preprocessing isn’t an option if you’re given the file at the same time you get the list to check.

Aka are you building a service or grep.


it is a factor - you calculate the amortized cost of calculating the hashes (using the total access count as the amortization count).


For any indexing operation that is always assumed to be asymptotically zero.


I'm saying that adding hashes is basically free, given the right hash function.

Sorting is very much not free though. I don't see why sorting would be needed if you have an efficient hash table, or a search tree. Essentially TFA describes just that, an index file + data file approach. They could have used SQLite, which readily provides indexed access, instead of pure Python.

Of course if you never do the same search again, you're back to mmap-ing the file and doing a full scan smartly, like GNU grep does (https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...).


You are being downvoted because the file is given as sorted sha1 hashes, and the benchmark in the article is about speeding up multiple lookups.


If you're going to pre-compute, also pre-compute a look-up table for your ordered data. I wrote something like this for a simple flat-binary key-value store [1] (that table is computed into RAM on start-up).

That way you jump from look-up table to look-up table, and then search the remaining space with something like a binary search.

[1] https://gitlab.com/danbarry16/u-database
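For this file, such a table could be as small as one offset per two-byte hash prefix. A rough sketch, assuming a packed file of sorted 20-byte digests (the prefix width and names are mine); the build is a single sequential pass, slow as written in pure Python, but it only has to happen once:

  import os
  from bisect import bisect_left

  RECORD = 20          # packed, sorted 20-byte SHA-1 digests (assumption)

  def build_prefix_index(path):
      # offsets[p] = index of the first record whose first two bytes, read as
      # a big-endian int, are >= p; offsets[65536] = total record count
      count = os.path.getsize(path) // RECORD
      offsets = [count] * 65537
      with open(path, "rb") as f:
          p = 0
          for i in range(count):
              prefix = int.from_bytes(f.read(RECORD)[:2], "big")
              while p <= prefix:
                  offsets[p] = i
                  p += 1
      return offsets

  def lookup(path, offsets, digest):
      p = int.from_bytes(digest[:2], "big")
      lo, hi = offsets[p], offsets[p + 1]
      with open(path, "rb") as f:
          f.seek(lo * RECORD)
          block = f.read((hi - lo) * RECORD)      # one read of the ~250 KB bucket
      recs = [block[i:i + RECORD] for i in range(0, len(block), RECORD)]
      i = bisect_left(recs, digest)
      return i < len(recs) and recs[i] == digest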


> Compute the SHA1 of the candidate password, and use the prefix to guess at the location in the file. Read a block of +/- a few kilobytes around that guessed location. This will have a "hit" about 99% of the time, and the guess will have to be refined at most 2-3 times.

Basically, a binary search?


It's similar, but not quite a binary search. Imagine you have a book and you're trying to determine if a certain page number exists or if it's been torn out. You know the order, as pages are numbered in sequence, but you don't know whether the target page is still in the book or not. (For very large, cryptographically secure books you can assume the pages have been torn out uniformly.)

With the method described, you would measure the thickness of the book (the size of the lookup file) and open to a proportional page. e.g. If you're looking for page 120 and the book is 360 pages long, that's 1/3 of the thickness. So, find the point that's a third of the thickness of the book and open the book there. You'll be pretty close.

With a binary search, you wouldn't measure the thickness of the book. You'd always just blindly start in the middle, assess which half the target page is in, and repeat the operation on that half, and you'll get there pretty quickly.

Both exploit the knowledge of how the data is organized and both are most well-suited to uniform data (like pages in a book or random hashes), but are just different strategies requiring different means of accessing the data.

Admittedly not a perfect analogy, but you get the idea.


Wiki lists a variation of binary search called "interpolation search" [1] which

> Instead of calculating the midpoint, interpolation search estimates the position of the target value, taking into account the lowest and highest elements in the array as well as length of the array.

Sounds pretty close to what you're describing :-)

[1] https://en.m.wikipedia.org/wiki/Binary_search_algorithm#Inte...
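For reference, a minimal sketch of interpolation search over an in-memory sorted list of integers (e.g. hashes interpreted as numbers):

  def interpolation_search(a, x):
      lo, hi = 0, len(a) - 1
      while lo <= hi and a[lo] <= x <= a[hi]:
          if a[hi] == a[lo]:                    # degenerate range
              break
          # estimate the index from the value instead of using the midpoint
          mid = lo + (x - a[lo]) * (hi - lo) // (a[hi] - a[lo])
          if a[mid] == x:
              return mid
          if a[mid] < x:
              lo = mid + 1
          else:
              hi = mid - 1
      return lo if lo <= hi and a[lo] == x else -1

On uniformly distributed keys it needs O(log log n) probes on average, though it degrades to O(n) on adversarial distributions.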


I think the basic idea is that the hash has a fairly uniform probability distribution, so knowing the prefix means you can estimate its location in a sorted list.

For instance, if we were talking about n random sequences of digits, then if you want to look for a number starting with 42, you can start looking at the 0.42n-th element and it is likely already very close to a match.


Is a uniform hash the same thing as a consistent hash?


No. Consistent hashing and its easier to understand cousin rendezvous hashing are about maintaining properties after a rehash.


More like the secant method I suppose.


You could optimize this further if you're combining multiple lookups


These days you can simply load it into a database (ClickHouse) - it will be compressed (about the same as the 7z file), and the search will take 3 ms:

https://github.com/ClickHouse/ClickHouse/issues/42363


But the point was to make it take less than 1ms. How would 3ms help?


DuckDB is also great for stuff like this. You can replace a MapReduce cluster with a single SQL query.


I checked DuckDB and your statement appears to be untrue.

    >>> con.execute("CREATE TABLE passwords (hash TEXT, count INT)")
    <duckdb.DuckDBPyConnection object at 0x7fc7bceb55f0>
    >>> con.execute("CREATE INDEX ix_hash ON passwords (hash)")
    <duckdb.DuckDBPyConnection object at 0x7fc7bceb55f0>
    >>> con.execute("COPY passwords FROM 'pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':')")
    100%  
    100% 
It froze in an attempt to load the data. Nothing happens after it displays 100%.


CREATE INDEX currently has the restriction that the index must fit in memory [1]. As the data is already sorted, creating an index is not necessary anyway. The min/max indexes created automatically by the system are sufficient to complete the query in a few milliseconds.

  D CREATE TABLE passwords (hash TEXT, count INT);
  D COPY passwords FROM '~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':');
  D .timer on
  D SELECT * FROM passwords WHERE hash=upper('5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8');
  ┌──────────────────────────────────────────┬─────────┐
  │                   hash                   │  count  │
  │                 varchar                  │  int32  │
  ├──────────────────────────────────────────┼─────────┤
  │ 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 │ 9545824 │
  └──────────────────────────────────────────┴─────────┘
  Run Time (s): real 0.005 user 0.007455 sys 0.000584
[1] https://duckdb.org/docs/sql/indexes


based on the headline, it must be under 1 ms. love the table


I cannot even ssh into the server after trying to use DuckDB. It is completely dead (with all the ducks, what a misery).

The reason is probably that it's using a full index, in contrast with the sparse index in ClickHouse, and maybe it's trying to build it in memory, going to swap (the server has 32 GB memory).


Interesting - the database file looks ok, but the data is lost (the table is empty):

  ubuntu@ip-172-31-3-138:~$ ls -l
  total 69561648
  -rw-rw-r-- 1 ubuntu ubuntu 17631031296 Dec 16 23:57 my-db.duckdb
  -rw-rw-r-- 1 ubuntu ubuntu         326 Dec 16 23:53 my-db.duckdb.wal
  -rw-rw-r-- 1 ubuntu ubuntu 16257755606 Jan 21  2022 pwned-passwords-sha1-ordered-by-hash-v8.7z
  -rw-rw-r-- 1 ubuntu ubuntu 37342268646 Dec  2  2021 pwned-passwords-sha1-ordered-by-hash-v8.txt
  ubuntu@ip-172-31-3-138:~$ python3
  Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import duckdb
  >>> con = duckdb.connect(database='my-db.duckdb')
  >>> con.execute("SELECT count(*) FROM passwords").fetchall()
  [(0,)]


Because DuckDB uses ACID [1], data is loaded in an all-or-nothing manner. As the load was interrupted due to the system running out of memory, the table is expected to be empty.

[1] https://en.wikipedia.org/wiki/ACID


Curious: Are you affiliated with ClickHouse or any other Columnar DB project in any way? If so, you may want to add that as a disclosure.


Yes, I'm working on ClickHouse, here is my GitHub profile: https://github.com/alexey-milovidov

I'm also trying to follow every existing technology in the data engineering space :)


If you load the data properly (creating the index after insertion, which is definitely preferable in this case), it will load extremely quickly (milliseconds).

You should also disclose your relationship with a competing project. For the record, I use DuckDB in personal projects and love it. You seem to be misusing it. :)


have tested duckdb v0.6.0 2213f9c946

  4e17b76fc101c9db7222e0cd8d6f5eee  pwned-passwords-sha1-ordered-by-hash-v8.txt

  select count(*) from read_csv('pwned-passwords-sha1-ordered-by-hash-v8.txt', delim=':', header=False, columns={'Hash': 'VARCHAR', 'Count': 'INT'});
60.32s, 847223402 rows

  create table hashes as select * from ...
OOM :( set PRAGMA temp_directory

  create table ...
144.92s (83.19s on BATCH CREATE, 61.53s on READ CSV)

  select * from hashes where Hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511'; -- secret123

  0.0269s -- 1st
  0.0043s -- 2nd
  0.0026s -- 3rd
  0.0062s -- 4th
  0.0047s -- 5th
edits: attempt to fix formatting


“These days”? A 32 bit intel running mysql back in 2000 would probably have laughed at this query.

Granted, I’d expect one or two disk seeks, at ~ 10 ms each. I imagine on modern hardware, it would be in the 100’s of usec range. (Assuming you limited it to 2-4 GB of ram).


Lots of great solutions in the in-memory columnar data space, what a time to be alive.


Clickhouse is not in memory.


ClickHouse has a table engine for keeping data in memory. Can be accessed like this

create table tableName (.....) Engine = Memory();


But it's not necessary to keep data in Memory, because MergeTree works perfectly.

Here is a slide deck showing why MergeTree is faster than in-memory tables: https://presentations.clickhouse.com/meetup53/optimizations/


The comparison is done with the in-memory table doing full scan while MergeTree using an index. Kind of meaningless comparison.


Soooo, did they just implement the in-memory store worse?


It's a database


I'm always amazed about how well naive Unix solutions measure up compared to contemporary solutions. For instance, the non-indexed python solution here takes a little more time than the Unix look command (Pi 4):

    time pwned.sh password
    5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:9545824
    0.053u 0.035s 0:00.40 20.0% 83+22k 35+0io 21pf+0w

    time python3.10 02-binary-search.py pwned-passwords-sha1-ordered-by-hash-v8.txt password
    looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
    pwned! seen 9,545,824 times before
    in 0.090021 seconds
    0.105u 0.013s 0:00.44 25.0% 0+3k 32+0io 4pf+0w
Obviously, if you're doing millions of lookups you'll want indexing, but otherwise using built-in tools that have been optimized and refined for decades does very well.


If we are really going all out for performance optimization...

In theory, hashes are going to be very evenly distributed. With a file of that size you could probably land within 1% of the correct location just by seeking to byte offset percentage given by the first bytes of the hash...

(It's a fun article)


They did mention they tried that, and got similar results to the binary search they were already doing: https://death.andgravity.com/pwned#position-heuristic


> We can do this once, and then binary search a safety interval around that position. Alas, this only gets rid of the fast jumps at the beginning of the binary search, and for some reason, it ends up being slightly slower than binary search alone.

I don't get why you can't do this at each step.


You can, see the next paragraph, "also narrow down around the estimated position iteratively..."

I gave up because I was too lazy to think how to do the "precompute the maximum +/- error bounds" jiggawatts mentions in https://news.ycombinator.com/item?id=34021826


BTW, loved the article, not sure how I missed that you tried this.


I assume this doesn't improve perf much because you pretty quickly zoom in on a slice of hashes that are not evenly distributed. 37GB feels like a lot of hashes but it's pretty small compared to the space of all hashes, I suspect this approach would be more effective if we were searching over a much bigger chunk of data.


could one insert empty bytes to make it evenly distributed?


Indeed that's discussed in the article, under "Position heuristic" - https://death.andgravity.com/pwned#position-heuristic


With some random sampling it looks like the worst case is 5 megabytes of error on the initial seek.


OP attempted this using Python.

What would be the fastest way using *nix commands? A naive solution would be something like:

  echo -n password | sha1sum | cut -d ' ' -f 1 | xargs -I hash grep hash pwned.txt


Use look:

  look $(echo -n password | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned.txt

from man page:

NAME

look - display lines beginning with a given string

DESCRIPTION

The look utility displays any lines in file which contain string. As look performs a binary search, the lines in file must be sorted (where sort(1) was given the same options -d and/or -f that look is invoked with).

example:

  justin@box:~/data$ time look $(echo -n secret123 | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned-passwords-sha1-ordered-by-hash-v6.txt 
  F2B14F68EB995FACB3A1C35287B778D5BD785511:17384

  real 0m0.212s
  user 0m0.005s
  sys 0m0.001s

  justin@box:~/data$ time look $(echo -n secret123 | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned-passwords-sha1-ordered-by-hash-v6.txt 
  F2B14F68EB995FACB3A1C35287B778D5BD785511:17384

  real 0m0.002s
  user 0m0.003s
  sys 0m0.001s


Hey, I didn't know about this command, neat!

On my laptop, look `time`s at ~10 ms (for comparison, the Python "binary search" script `time`s at ~50 ms).


You can make Python binary search super fast if you use mmap. Here's a version of that I had lying around; it's probably correct.

  import os
  import mmap
  import sys
  import time

  def do_mmap(f):
      # map the whole file read-only
      fd = os.open(f, os.O_RDONLY)
      size = os.lseek(fd, 0, os.SEEK_END)
      os.lseek(fd, 0, os.SEEK_SET)
      m = mmap.mmap(fd, size, prot=mmap.PROT_READ)
      return m, size, fd

  class Searcher:
      def __init__(self, file):
          self.file = file
          self.map, self.size, self.fd = do_mmap(file)

      def close(self):
          self.map.close()
          os.close(self.fd)

      def find_newline(self):
          # skip the (possibly partial) line at the current position and
          # return the offset of the first complete line after it
          self.map.readline()
          return self.map.tell()

      def binary_search(self, q):
          # bisect on byte offsets: find the smallest offset lo such that the
          # first complete line strictly after lo compares >= q
          lo, hi = 0, self.size
          while lo < hi:
              mid = (lo + hi) // 2
              self.map.seek(mid)
              self.find_newline()
              line = self.map.readline()
              if line and line < q:
                  lo = mid + 1      # everything up to and including this line is < q
              else:
                  hi = mid          # the candidate line starts at or before here

          # lo is now the start of the line just before the candidate
          # (or 0, in which case the candidate may be the very first line)
          self.map.seek(lo)
          if lo > 0:
              self.find_newline()
          while True:
              line = self.map.readline()
              if line.startswith(q):
                  yield line
              elif not line or line > q:
                  break
              # any remaining line < q is simply skipped

  if __name__ == "__main__":
      q = sys.argv[1]
      s = Searcher("pwned-passwords-sha1-ordered-by-hash-v6.txt")
      ss = time.perf_counter()
      for x in s.binary_search(q.upper().encode()):
          print(x.decode().strip())
      ee = time.perf_counter()
      print(ee - ss)


I did try mmap, both with the plaintext binary search, and with the binary file (you can find a note about it in the HTML source :)

I ended up not mentioning it because for some reason, it was ~twice as slow on my mac... I'm now curious to try it on a decent Linux machine.


Make sure you put a space at the beginning of your command, so you don't leave your password sitting plaintext in your bash history.


If you're using bash, you'll need to use the HISTIGNORE or HISTCONTROL environment variables to do this.


If you're using bash, you can just leave a space before the command, like the other commenter said.


That's true if HISTCONTROL is set to `ignorespace` or `ignoreboth`.

https://www.gnu.org/software/bash/manual/html_node/Bash-Vari...


read -s -r MY_PASSWORD

Then, after typing your password, you can safely use the $MY_PASSWORD variable.


Oh I just learned something, thank you.


Perhaps there's a way to insert GNU Parallel in there to do parallel search of different chunks?

Or just use ripgrep, which integrates multi-core.


That is already doable with xargs itself

xargs -P maxprocs

Parallel mode: run at most maxprocs invocations of utility at once. If maxprocs is set to 0, xargs will run as many processes as possible.


GNU parallel gives you some extra features like a runtime log and resumable operation.


If you have 10 computers, put every 10th line in its own file; if each file is 1000 lines, put line 500 at the start, then line 250, then line 750, then lines 125, 375, etc.


Assuming you have enough RAM, I wonder how much putting this file into a ramdisk would help speed things up.


Perhaps pushing the definition of a *nix command slightly, but I’d be interested in the performance of https://sgrep.sourceforge.net/


Can't you just use something like a bloom filter?

There are a bit over 1B passwords in there (based on the size of the file and the length of each line). You would need a binary file around 3GB in size, which you would have to either load into memory or do about 17 accesses into, reading specific bytes (no searching), to figure out whether the password is in the filter.
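The standard Bloom filter sizing formulas land in the same ballpark; a quick check, assuming a 1-in-100,000 false positive rate:

  import math

  n = 847_223_402                             # hashes in the v8 file
  p = 1e-5                                    # target false-positive rate (assumption)

  m = -n * math.log(p) / math.log(2) ** 2     # optimal number of bits
  k = (m / n) * math.log(2)                   # optimal number of hash probes

  print(round(m / 8 / 2**30, 1), "GiB,", round(k), "probes")   # ~2.4 GiB, 17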


You can, Scott Helme wrote some articles about this (I mention them at the end of my article):

* https://scotthelme.co.uk/when-pwned-passwords-bloom/

* https://scotthelme.co.uk/sketchy-pwned-passwords/ – here he uses a count-min sketch to also store the frequency


> Can't you just use something like a bloom filter?

If I had the ability to download a massive file I’d try it out on a hextree I toy around with occasionally.

If you're making an index file, you may as well just throw it into a tree structure where a lookup is anywhere from 1 to 20 pointer dereferences (assuming the checksum is 20 hex digits), as it optimizes storage so the tree depth is variable. Plus it can retain the counts as well.

Now I really want to try this out; for the last article I read along these lines, I used it as a comparison and it was just as efficient as their conclusion.


See the bonus chapter in the post (CTRL+F + "bloom" to find it)


It doesn't solve the same problem. They want to know if the password is in the file, not if the password could be in the file.


You can set up bloom filters for arbitrary false positive probability.


Not 0.


Arbitrary rate means you can set it up so that you don't have to worry about it in practice.

If I served a site where people could check if their passwords leaked I would not worry if one in a million viewers got a false positive.


As I said, "not worrying about it in practice" isn't "the same problem." If that is hard to wrap your mind around, you can suppose a fine of $1 billion for every wrong answer.


Engineering is all about tradeoffs. In real life things never work 100% reliably. Not being able to make tradeoffs disqualifies a person as an engineer.

plonk


Algorithms are about correctness. Being incorrect disqualifies an algorithm as a solution to a problem.


A strategy I haven't seen mentioned is to use the filesystem itself as an index. Directory structure can be used to implement a trie or prefix tree.

For example, if you want to look up the existence of a sample hash: `2aae6c35c94fcfb415dbe95f408b9ce91ee846ed`

Then simply check for existence of directory <data-root>/2a/ae/6c/35/etc...

I was looking at the directory structure of Gitea's Docker Container Registry and this is how they stored container images.
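A sketch of what such a layout could look like (the root directory, the nesting depth, and the lowercase digests are assumptions):

  import os

  def hash_to_path(root, sha1_hex, depth=4):
      # 2aae6c35... -> <root>/2a/ae/6c/35/2aae6c35...
      parts = [sha1_hex[i:i + 2] for i in range(0, depth * 2, 2)]
      return os.path.join(root, *parts, sha1_hex)

  def is_pwned(root, sha1_hex):
      # a plain existence check; the file contents could hold the breach count
      return os.path.exists(hash_to_path(root, sha1_hex.lower()))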


I'm pretty skeptical of the performance of a billion files.

I'm sure it will go at okay speed once you actually construct it, since it's basically a tree, but with so many entries I'd expect it to be much less pleasant than a database in many ways. The "preprocessing" is going to be especially awful.


There's probably a large overhead in storage used. It would probably result in much larger overall storage than 37GB. And I agree preprocessing would be painful.

But I'm curious how lookup speeds would compare to the author's 1ms.

I'm also curious how adding a new hash would compare against adding a new hash to the single sorted file used by the author.

Leveraging any database is probably better in any case :)


If we're running on a SATA SSD, we can probably chain 5 or fewer accesses to the drive before we go over a millisecond. And each directory of depth likely requires 2 accesses.

> I'm also curious on how addition of a new hash would compare against adding a new hash to the single sorted file used by the author.

In a fair fight with that requirement, the sorted file would be allowed to add a few percent of extra blank entries and then it could insert in a millisecond too.


Same strategy as git uses for objects, .git/objects/d6/70460b4b4aetc

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects


Git chooses to pack them into much fewer files when there are lots of loose files.

Actually, these days many filesystems perform surprisingly well with lots of little files. What doesn't work with huge directories is basic utilities like ls, or anything that likes to collect the file list in memory and sort it. I have some directories that essentially hang ls, where find is still happy listing the files (because it just streams the output).


to not qualify as clickbait the title needs to include the word 'sorted'


As well as "with a pre-computed index"


jesus, did he really do that

what a scummy headline


You could compute the expected position of the hash in the file as hash / 2^160 * filesize and then just look around in the vicinity of that. Would be interesting to know what the maximum deviation for any hash from its expected position is, then you could maybe just load a single block of that size and search it in memory if the maximum deviation is sufficiently small.



Not very well unfortunately. The author's confusion about how binary search works is preventing progress toward this solution. The idea is to do an "unbalanced" binary search, with the pivot chosen as offset (range size * big-endian value of hash / (maximum possible big-endian value of hash + 1)). Correctness is exactly the same as correctness of binary search, because the correctness of binary search does not depend on the offset of the pivot (so long as it's not zero or past the end). At some heuristic range length, fall back to linear search of the range. (Possibly, at some larger length, fall back to balanced binary search... But with efficient linear search of no more than a couple disk pages, I doubt this would help.)


This exactly. The author is not describing their approach very well, so I stopped reading because it was frustrating to try to figure out if they tried this solution.

The simple statement of the approach is: Use binary search, but instead of using the middle as the pivot, use the estimated position heuristic as the pivot for each iteration.
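A minimal in-memory sketch of that pivot rule, over a list of packed 20-byte digests (on disk the index would just be scaled to a byte offset); clamping the pivot into the current range keeps the usual binary search correctness and termination arguments:

  def unbalanced_bisect(recs, digest):
      d = int.from_bytes(digest, "big")
      lo, hi = 0, len(recs)                    # search within recs[lo:hi]
      lo_v, hi_v = 0, 1 << 160                 # known value bounds for that range
      while lo < hi:
          # pivot at the digest's estimated position instead of the midpoint
          pivot = lo + (d - lo_v) * (hi - lo) // (hi_v - lo_v)
          pivot = min(max(pivot, lo), hi - 1)  # clamp strictly inside the range
          v = int.from_bytes(recs[pivot], "big")
          if v < d:
              lo, lo_v = pivot + 1, v
          elif v > d:
              hi, hi_v = pivot, v
          else:
              return pivot
      return -1                                # not present

On uniform hashes the estimate is usually within a few records, so very few iterations are needed.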


> You could compute the expected position of the hash in the file as hash / 2^160 * filesize and then just look around in the vicinity of that.

> The idea is to do an "unbalanced" binary search, with the pivot chosen as offset (range size * big-endian value of hash / (maximum possible big-endian value of hash + 1)).

Thanks. I thought something along these lines but didn't know how to express or formalize it. Now I must study unbalanced binary search algorithms.


I tried it out and the maximum deviation of the actual position in the file from the estimated one is a bit under five megabyte for the version 8 file. That is not good enough to just fetch a single page by quite a bit. So I guess building an index is the only real option to get the number of reads down, but even then you will have at least one read. With an SSD that will take on the order of a few hundred microseconds, so it can be done under a millisecond without relying on a cache. Binary search will require about 20 to 30 reads, so without a cache it would be about that much slower, however the first couple of steps will quickly get cached, reducing the number of reads that actually hit the hardware quite a bit.


> Binary search will require about 20 to 30 reads,

This does not pass a sanity check. If one switches to linear scan at 4KiB = 2^12 bytes, then 30 (balanced, but approximately correct for unbalanced) binary search steps would search a 4TiB file. I don't know how big the hash file is but it isn't this big. Not even 4GiB big.

The maximum number of steps is lg(file size)-12, maybe plus 2 or so to account for the unbalanced search. (NB the unbalanced search approximates the balanced only because the hashes are approximately uniform; the relation doesn't hold in general.)


The version 8 file contains 847,223,402 hashes which gives 29.7 bits, that is where I got the 30 reads from. And I assumed the most naïve approach, always just seeking and reading a single hash, but admittedly this would almost certainly not read the final page over and over again. And then I just subtracted 10 taking into account that you would read the final page only once, with 10 being a rough approximation of 12. How many bytes you actually read per read could be affected by all kind of things - sector size, cluster size, page size, read ahead, ... - so 10 bit was good enough for me. But you are right, the number of reads would almost certainly be in the vicinity of 20, not 30.
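For what it's worth, the arithmetic (using the file size from the listing elsewhere in the thread):

  import math

  n = 847_223_402                       # hashes in the v8 file
  size = 37_342_268_646                 # bytes in the text file

  print(math.log2(n))                   # ~29.7 probes, one record per read
  print(math.log2(size / 4096))         # ~23 probes if each read covers a 4 KiB page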


I am curious how fast you could query this data if it were inserted into a sqlite table with an index on hash.


I actually did try this, but I got impatient and stopped it after 30 minutes or so, and a ~20G database file.

For some reason, after that, `ls` in that directory (or even shell completion) would freeze the shell entirely, even after reboot. I eventually managed to delete the file with `rm db.sqlite` (no completion allowed), but even that took like 2 minutes.

I might try again with WAL enabled (based on my shell history, I also deleted a -journal file).


Have you tried DuckDB? It’s the columnar version of SQLite.


Not yet, I might give it a try.


importing was sooo slow (something I likely messed up trying to make it go faster, was only chugging along at ~100MB/s)

SQLite version 3.37.0 2021-12-09 01:34:53

here's the results tho

  sqlite> create table hashes (hash text, count text);
  sqlite> pragma cache_size=1000;
  sqlite> pragma synchronous=off;
  sqlite> .mode csv
  sqlite> .separator :
  sqlite> .timer on
  sqlite> .import pwned-passwords-sha1-ordered-by-hash-v8.txt hashes
  sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
  ^CRun Time: real 14.841 user 9.370057 sys 2.363629
  Error: stepping, interrupted (9)
  sqlite> create unique index hash_uniq on hashes(hash);
  Run Time: real 530.029 user 262.952507 sys 213.305437
  sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
  F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
  Run Time: real 0.004 user 0.000303 sys 0.000936
  sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
  F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
  Run Time: real 0.001 user 0.000310 sys 0.000306
  sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
  F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
  Run Time: real 0.000 user 0.000295 sys 0.000305
  sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
  F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
  Run Time: real 0.000 user 0.000320 sys 0.000279
  sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
  F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
  Run Time: real 0.001 user 0.000322 sys 0.000317
  sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
  F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
  Run Time: real 0.001 user 0.000128 sys 0.000155
  sqlite> select * from hashes where hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511';
  F2B14F68EB995FACB3A1C35287B778D5BD785511:26437
  Run Time: real 0.001 user 0.000145 sys 0.000158


I wonder if they could improve this by queueing multiple positions to read at once. Presumably the disk will be able to do some of them more efficiently.


(author here) Hey, that's actually a great idea. I remember reading about a fast grep or wc implementation that did parallel reads.

Off the top of my head, I don't really know how to do that (aside from the naive way of using threads), I'm guessing readv[1] might be be useful? Will definitely look into it.

[1]: https://linux.die.net/man/2/readv
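Note that readv(2) fills several buffers from a single file position, so for several independent positions, a handful of threads issuing positional os.pread() calls (or io_uring) is probably closer to what's wanted. A tiny sketch with made-up offsets:

  import os
  from concurrent.futures import ThreadPoolExecutor

  fd = os.open("pwned-passwords-sha1-ordered-by-hash-v8.txt", os.O_RDONLY)
  offsets = [10 * 2**30, 20 * 2**30, 30 * 2**30]    # hypothetical candidate positions

  def read_at(offset, length=4096):
      # positional read: safe to issue concurrently on the same fd
      return os.pread(fd, length, offset)

  with ThreadPoolExecutor(max_workers=len(offsets)) as pool:
      blocks = list(pool.map(read_at, offsets))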


it's generally a good idea to do

    echo 3 | sudo tee /proc/sys/vm/drop_caches
for benchmarks, to avoid the disk cache speeding things up; I'm surprised it wasn't mentioned


I wonder how long `rg "^000085013a02852372159cb94101b99ccaec59e1" --max-count=1` would take on this?


This could be a game-changing blog post for self-taught devs like me who are just getting into profiling code to improve it and such. I'm still reading it, but I've already bookmarked it for further study.


Glad to hear that :D One of my main goals was to show the entire progression of how I approached the problem, instead of just saying "here's the final result".

I have another article that covers profiling in more detail, you might find it useful: https://death.andgravity.com/fast-conway-cubes#intro-to-prof...

Note that besides the cProfile text output, there are better, graphical tools for profiling; I've personally used pyflame+flamegraph.pl, and tuna.


I wish that community would put a warning sign or something. The article is a collection of things that should never be done. People who want to learn software performance work need to start from a solid foundation. Mentioning that algorithms exist at the end is not good enough.



(replying with more details, because I think it's pretty cool, and I'll definitely try this out)

The linked item is to https://blog.mro.name/2022/08/pwned-diy/, where the author converts the passwords list to a CDB (Constant DataBase) file, a simple, fast lookup database format created by D. J. Bernstein.

https://cr.yp.to/cdb.html

https://manpages.ubuntu.com/manpages/bionic/man1/cdb.1.html


> Positions, lengths, and hash values are 32-bit quantities, stored in little-endian form in 4 bytes. Thus a cdb must fit into 4 gigabytes.

Source: https://cr.yp.to/cdb/cdb.txt


There is a more modern rewrite that does away with the 4 gigabyte limit of the original[1]. I am not sure if the performance characteristics are the same.

[1]https://github.com/gstrauss/mcdb/


You're a bit hasty, young padawan. Do you insinuate this has never been tried while you don't even take a glimpse at it and see how it's done? What use is it linking the cdb docs? The article has it all in context if you bother reading.

The dataset is turned into 16 cdbs, to be precise, in case you wonder how 37GB fits into the 4GB cdb size limit.


Having recently been digging into the CDB format[1], this was exactly my first thought as well. `tinycdb` is extremely fast at creating an index, and most successful lookups take only two disk seeks (failed lookups need only one), very close to the optimal possible lookup performance on this problem.

[1]: https://search.feep.dev/blog/post/2022-12-03-cdb-file-format


In Unix, just split the input text into smaller chunks and fork a few processes to speed up the processing.

If the input file is even larger and you have a lot of RAM, use your favorite programming language and call the mmap() syscall.


> Notice we land on the same 7F... prefix twice; this makes sense – we skip ahead by half the file size, then back, then ahead two times by a quarter.

How would one fix that?


I wonder what the performance would be if the whole thing was thrown into a postgres instance with a full text index...


> we skip ahead by half the file size, then back, then ahead two times by a quarter.

This is not how one does binary search.


Generating a trie might be a useful idea here, in the approach that allows preprocessing.


Almost failing in under 1 millisecond sure is taking fail fast to the extreme...


How fast is grep or fastgrep?


Let me get this straight. OP searched for an entry in a sorted list using dichotomic search. Okay... Any CS undergrad can do that. Is there something that I'm missing?


Some important site guidelines, including:

"Don't be snarky."

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html


It is famously difficult to implement binary search correctly.


How so, integer overflow? Not so famous that I'm aware.



it took about ten years from the time of the first purported publication of a binary search algorithm for a correct algorithm to be published

it's an algorithm where it's easy to have off-by-one errors that lead to nontermination or incorrect results for certain inputs

integer overflow is also a problem on some platforms, especially java


I doubt someone spent 10 years trying to implement a binary search.

I suspect it is as usual: someone made a theoretical invention, then someone else looked at it years later and wrote an implementation in an evening or two.


More likely there were a lot of implementations that ended up correct, but they got debugged after the initial publication, and then nobody published an erratum.


CS solved this problem, but not in such a way yet that we don't have to think about it anymore. If Python's dict implemented all of this under the hood then it could be called a solved problem.



Does reading/seeking in a file that large make it harder?



