Show HN: Easily Convert WARC (Web Archive) into Parquet, Then Query with DuckDB (github.com/maxcountryman)
116 points by llambda on June 24, 2022 | 15 comments



How does this compare with SQLite approaches shared recently?


It's a great question: fundamentally, the Parquet format is column-oriented. With datasets like these, there's research[0] indicating that a columnar layout is a preferable way of storing and querying WARC.

DuckDB, like SQLite, is serverless. DuckDB has a leg up on SQLite when it comes to Parquet, though: Parquet is supported directly, which makes dealing with these datasets a breeze.
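
For instance, a quick sketch of querying a Parquet file in place via the Python API (the file name is just a placeholder):

    import duckdb

    con = duckdb.connect()  # in-memory database

    # Query the Parquet file directly; no import/load step needed.
    # 'archive.parquet' stands in for a converted WARC file.
    print(con.execute("SELECT * FROM 'archive.parquet' LIMIT 10").fetchall())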

[0] https://www.researchgate.net/figure/Comparing-WARC-CDX-Parqu...


Good question! As a disclaimer, I work for DuckDB Labs.

There are two big benefits to working with Parquet files in DuckDB, and both relate to speed!

DuckDB can query Parquet right where it sits, so there is no need to insert it into the DB first. This is typically much faster. Also, DuckDB's engine is columnar (SQLite's is row-based), so it can run faster analytical queries against that format. I have seen 20-100x speed improvements over SQLite in analytical workloads.
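
As a rough sketch of what that looks like (file and column names here are made up):

    import duckdb

    con = duckdb.connect()

    # A columnar engine only reads the columns the query touches,
    # which is why aggregations like this are fast on Parquet.
    print(con.execute("""
        SELECT content_type, COUNT(*) AS n
        FROM 'archive.parquet'
        GROUP BY content_type
        ORDER BY n DESC
    """).fetchall())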

Happy to answer any questions!


Another benefit that is often overlooked is that Parquet supports various compression methods, right? SQLite's official compression support is a commercial product, so it's basically useless for general usage in free, open-source software. There are a few third-party alternatives, but they're cumbersome.
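
For example, writing Parquet with an explicit codec from DuckDB looks roughly like this (option name as in recent releases; older builds spelled it CODEC):

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE t AS SELECT range AS i FROM range(1000)")

    # Write the table out as ZSTD-compressed Parquet.
    con.execute("COPY t TO 'out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")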

Considering DuckDB's and Parquet's features, if DuckDB manages to support the emerging GeoParquet standard someday, I believe it could be a nice alternative to the typical SQLite + SpatiaLite combo, and to GeoPackage as a whole.


Do you see DuckDB as a possible replacement for AWS Athena? Where would Athena still be better than DuckDB + Parquet + Lambda?


I've gone almost completely off Athena in favor of DuckDB. DuckDB running with 12 vCPUs on mid-size data (1 TB) stored on SSD is routinely faster than Athena with the data on S3. If there are a lot of partitions, the difference is huge.
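
Untested sketch of the setup I mean (bucket and layout are hypothetical; needs the httpfs extension and S3 credentials):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region='us-east-1'")  # plus access keys as needed

    # Scan a Hive-partitioned dataset directly on S3; partition
    # pruning is what makes the many-partition case win big.
    print(con.execute("""
        SELECT count(*)
        FROM read_parquet('s3://my-bucket/events/*/*.parquet',
                          hive_partitioning=true)
    """).fetchall())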


DuckDB user here. As far as I can tell, DuckDB doesn't support distributed computation, so you have to set that up yourself, whereas Athena is essentially Presto and handles that detail for you. It also doesn't support Avro or ORC yet.

DuckDB excels at single-machine compute where everything fits in memory or is streamable (data can be local or on S3); it's lightweight and vectorized. I use it in Jupyter notebooks and in Python code.
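
Typical notebook usage, roughly (the file name is a placeholder):

    import duckdb

    con = duckdb.connect()

    # Pull a query result straight into a pandas DataFrame.
    df = con.execute("SELECT * FROM 'archive.parquet' LIMIT 100").df()
    print(df.head())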

But it may not be the right tool if you need distributed compute over a very large dataset.


> But it may not be the right tool if you need distributed compute over a very large dataset

I’m really interested in what the limits of DuckDB and Parquet are.

Can you give me an idea of what size you mean by “a very large dataset”?


In the distributed computing world, the rule is that you start to scale horizontally when your compute workload is too large to fit in the memory of a single machine. So it depends on your workload and your hardware. (There's no fixed number for what counts as a large dataset.)

DuckDB itself doesn't have any baked-in limits. If the data fits in memory, single-machine compute is usually faster than distributed compute; DuckDB is faster than Pandas, and definitely faster than local Spark.
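
One nice trick on that front: DuckDB can scan a pandas DataFrame in place, with no copy into a database. A minimal sketch:

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"x": range(1_000_000)})

    # DuckDB's replacement scan picks up the local DataFrame by name
    # and runs the aggregate with its vectorized engine.
    print(duckdb.query("SELECT avg(x), max(x) FROM df").fetchall())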


DuckDB has SQLite semantics but is natively built around columnar formats (Parquet, in-memory Arrow) and strong types (including dates). It also supports very complex SQL.

SQLite is a row store built around row-based transactional workloads. DuckDB is built around analytical workloads (lots of filtering, aggregation, and transformation), and for these workloads DuckDB is just way, way faster. Source: personal experience.
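
A sketch of the kind of query I mean (made-up table; a grouped window aggregate over filtered data):

    import duckdb

    con = duckdb.connect()
    con.execute("""
        CREATE TABLE hits AS
        SELECT range % 5 AS site, range AS ts, random() AS latency
        FROM range(1000)
    """)

    # Filter, partition, and compute a running average per site --
    # the shape of workload where a columnar engine shines.
    print(con.execute("""
        SELECT site, ts,
               avg(latency) OVER (PARTITION BY site ORDER BY ts) AS running_avg
        FROM hits
        WHERE latency > 0.1
        LIMIT 10
    """).fetchall())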


Well, there's a virtual table extension for reading Parquet files in SQLite. I've not tried it myself. https://github.com/cldellow/sqlite-parquet-vtable
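
Going by the README there, usage from Python would be roughly this (untested; the extension must be compiled first, and the path and file names are assumptions):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.enable_load_extension(True)
    con.load_extension("./libparquet")  # path to the built extension

    # Expose a Parquet file as a virtual table, per the project README.
    con.execute("CREATE VIRTUAL TABLE demo USING parquet('archive.parquet')")
    print(con.execute("SELECT count(*) FROM demo").fetchall())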


Could this work with Datasette (which is a flexible interface to SQLite with a web-based query editor)?



SQLite3 is 1.6 MB

DuckDB is 41 MB

(q/k, another columnar SQL database, is less than a MB)


Nice! I've been considering using DuckDB for our product (to speed up joins and aggregates of in-memory data); it's an incredible technology.



