
DuckDB is also great for stuff like this. You can replace a MapReduce cluster with a single SQL query.
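
For example, a word-count style job that would normally be a MapReduce program becomes one query over a directory of files (the logs/*.csv layout and the url column are made-up examples, not real data):

  import duckdb

  con = duckdb.connect()  # in-memory database
  # A classic MapReduce-style aggregation expressed as a single query over many CSV files.
  top = con.execute("""
      SELECT url, count(*) AS hits
      FROM read_csv_auto('logs/*.csv')
      GROUP BY url
      ORDER BY hits DESC
      LIMIT 10
  """).fetchall()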



I checked DuckDB and your statement appears to be untrue.

    >>> con.execute("CREATE TABLE passwords (hash TEXT, count INT)")
    <duckdb.DuckDBPyConnection object at 0x7fc7bceb55f0>
    >>> con.execute("CREATE INDEX ix_hash ON passwords (hash)")
    <duckdb.DuckDBPyConnection object at 0x7fc7bceb55f0>
    >>> con.execute("COPY passwords FROM 'pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':')")
    100%  
    100% 
It froze while attempting to load the data. Nothing happens after it displays 100%.


CREATE INDEX currently has the restriction that the index must fit in memory [1]. As the data is already sorted, creating an index is not necessary anyway. The min/max indexes created automatically by the system are sufficient to complete the query in a few milliseconds.

  D CREATE TABLE passwords (hash TEXT, count INT);
  D COPY passwords FROM '~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':');
  D .timer on
  D SELECT * FROM passwords WHERE hash=upper('5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8');
  ┌──────────────────────────────────────────┬─────────┐
  │                   hash                   │  count  │
  │                 varchar                  │  int32  │
  ├──────────────────────────────────────────┼─────────┤
  │ 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 │ 9545824 │
  └──────────────────────────────────────────┴─────────┘
  Run Time (s): real 0.005 user 0.007455 sys 0.000584
[1] https://duckdb.org/docs/sql/indexes


based on the headline, it must be under 1 ms. love the table


I cannot even ssh into the server after trying to use DuckDB. It is completely dead (with all the ducks, what a misery).

The reason is probably that it builds a full index, in contrast with the sparse index in ClickHouse, and it may be trying to build it in memory and going into swap (the server has 32 GB of memory).


Interesting - the database file looks ok, but the data is lost (the table is empty):

  ubuntu@ip-172-31-3-138:~$ ls -l
  total 69561648
  -rw-rw-r-- 1 ubuntu ubuntu 17631031296 Dec 16 23:57 my-db.duckdb
  -rw-rw-r-- 1 ubuntu ubuntu         326 Dec 16 23:53 my-db.duckdb.wal
  -rw-rw-r-- 1 ubuntu ubuntu 16257755606 Jan 21  2022 pwned-passwords-sha1-ordered-by-hash-v8.7z
  -rw-rw-r-- 1 ubuntu ubuntu 37342268646 Dec  2  2021 pwned-passwords-sha1-ordered-by-hash-v8.txt
  ubuntu@ip-172-31-3-138:~$ python3
  Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import duckdb
  >>> con = duckdb.connect(database='my-db.duckdb')
  >>> con.execute("SELECT count(*) FROM passwords").fetchall()
  [(0,)]


Because DuckDB provides ACID guarantees [1], data is loaded in an all-or-nothing manner. As the load was interrupted by the system running out of memory, the table is expected to be empty.

[1] https://en.wikipedia.org/wiki/ACID
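
A minimal sketch of what all-or-nothing means here, with an explicit rollback standing in for the interrupted COPY (in-memory database, toy data):

  import duckdb

  con = duckdb.connect()  # in-memory database
  con.execute("CREATE TABLE passwords (hash TEXT, count INT)")
  con.execute("BEGIN TRANSACTION")
  con.execute("INSERT INTO passwords VALUES ('ABC', 1)")
  con.execute("ROLLBACK")  # uncommitted changes vanish, just like an interrupted COPY
  print(con.execute("SELECT count(*) FROM passwords").fetchall())  # [(0,)]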


Curious: Are you affiliated with ClickHouse or any other Columnar DB project in any way? If so, you may want to add that as a disclosure.


Yes, I'm working on ClickHouse, here is my GitHub profile: https://github.com/alexey-milovidov

I'm also trying to follow every existing technology in the data engineering space :)


If you load the data properly (creating the index after insertion, which is definitely preferable in this case), loading succeeds and queries run extremely quickly (milliseconds).
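
Roughly what that looks like as a sketch (schema and file name copied from above; whether you even need the CREATE INDEX at the end is debatable, since the file is already sorted by hash):

  import duckdb

  con = duckdb.connect('my-db.duckdb')
  con.execute("CREATE TABLE passwords (hash TEXT, count INT)")
  # Bulk load first, with no secondary index to maintain during the load.
  con.execute("COPY passwords FROM 'pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':')")
  # Build the index afterwards, or skip it and rely on the automatic min/max indexes.
  con.execute("CREATE INDEX ix_hash ON passwords (hash)")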

You should also disclose your relationship with a competing project. For the record, I use DuckDB in personal projects and love it. You seem to be misusing it. :)


I tested DuckDB v0.6.0 (2213f9c946):

  4e17b76fc101c9db7222e0cd8d6f5eee  pwned-passwords-sha1-ordered-by-hash-v8.txt

  select count(*) from read_csv('pwned-passwords-sha1-ordered-by-hash-v8.txt', delim=':', header=False, columns={'Hash': 'VARCHAR', 'Count': 'INT'});
60.32s, 847223402 rows

  create table hashes as select * from ...
OOM :( fixed by setting PRAGMA temp_directory (sketch below)

  create table ...
144.92s (83.19s on BATCH CREATE, 61.53s on READ CSV)

  select * from hashes where Hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511'; -- secret123

  0.0269s -- 1st
  0.0043s -- 2nd
  0.0026s -- 3rd
  0.0062s -- 4th
  0.0047s -- 5th
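
For completeness, a sketch of the retry that worked, with the pragma set before the CREATE TABLE (the temp path and memory limit are my own choices, not defaults):

  import duckdb

  con = duckdb.connect('hashes.duckdb')
  con.execute("PRAGMA temp_directory='/tmp/duckdb_tmp'")  # allow large operations to spill to disk
  con.execute("PRAGMA memory_limit='24GB'")               # assumed cap, below physical RAM
  con.execute("""
      CREATE TABLE hashes AS
      SELECT * FROM read_csv('pwned-passwords-sha1-ordered-by-hash-v8.txt',
                             delim=':', header=False,
                             columns={'Hash': 'VARCHAR', 'Count': 'INT'})
  """)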



