It's a great question: fundamentally, Parquet is a columnar format. For datasets like these, there is some research[0] indicating that this is a preferable way of storing and querying WARC data.
DuckDB, like SQLite, is serverless. DuckDB has a leg up on SQLite when it comes to Parquet, though: Parquet is supported directly in DuckDB, which makes dealing with these datasets a breeze.
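To make that concrete, here's a minimal sketch of querying a Parquet file in place from Python; the file name and columns are made up for illustration:

```python
import duckdb  # pip install duckdb

# Query the Parquet file directly -- no import or INSERT step first.
# 'crawl.parquet' and its columns (url, status) are hypothetical.
top_hits = duckdb.sql("""
    SELECT url, status, COUNT(*) AS n
    FROM read_parquet('crawl.parquet')
    GROUP BY url, status
    ORDER BY n DESC
    LIMIT 10
""")
print(top_hits)
```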
Good question! As a disclaimer, I work for DuckDB Labs.
There are 2 big benefits to working with Parquet files in DuckDB, and both relate to speed!
DuckDB can query Parquet right where it sits, so there is no need to insert it into the database first. This is typically much faster. Also, DuckDB's engine is columnar (SQLite is row-based), so it can run analytical queries on that format much faster. I have seen 20-100x speed improvements over SQLite in analytical workloads.
Another benefit that is often overlooked is that Parquet supports various compression methods. SQLite's official support for compression is a commercial product, so it's basically unusable in free, open-source software. There are a few third-party alternatives, but they are cumbersome.
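As a rough sketch of what that looks like in practice (file names are placeholders), DuckDB's Parquet writer lets you pick the codec when copying data out:

```python
import duckdb

con = duckdb.connect()

# Re-encode an existing Parquet file with Zstandard compression.
# 'raw.parquet' and 'compressed.parquet' are placeholder names; snappy,
# gzip and zstd are among the codecs the Parquet writer accepts.
con.sql("""
    COPY (SELECT * FROM read_parquet('raw.parquet'))
    TO 'compressed.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
```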
Considering DuckDB's and Parquet's features, if DuckDB manages to support the emerging GeoParquet standard someday, I believe it could be a nice alternative to the typical SQLite + Spatialite combo, and GeoPackage as a whole.
I've gone almost completely off Athena in favor of DuckDB.
DuckDB running with 12 vCPUs on mid-size data (1 TB) that's on SSD is routinely faster than Athena with the data on S3.
If there are a lot of partitions, then the difference is huge.
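For what it's worth, here's roughly how I point DuckDB at a partitioned Parquet layout from Python (the path and partition columns are made up); the Hive-style directory names become queryable columns and get pruned by the WHERE clause:

```python
import duckdb

# Scan a Hive-partitioned layout such as data/year=2023/month=01/part-0.parquet.
# The path and the year/month partitioning are placeholders.
duckdb.sql("""
    SELECT year, month, COUNT(*) AS n
    FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
    WHERE year = 2023
    GROUP BY year, month
    ORDER BY month
""").show()
```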
DuckDB user here. As far as I can tell, DuckDB doesn't support distributed computation, so you have to set that up yourself, whereas Athena is essentially Presto and handles that detail for you. It also doesn't support Avro or ORC yet.
DuckDB excels at single machine compute where everything fits in memory or is streamable (data can be local or on S3) — it’s lightweight and vectorized. I use it in Jupyter notebooks and in Python code.
But it may not be the right tool if you need distributed compute over a very large dataset.
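To give an idea of the notebook workflow, here's a minimal sketch of querying Parquet on S3 from Python; the bucket path is made up, and I'm assuming the httpfs extension with credentials picked up from the environment:

```python
import duckdb

con = duckdb.connect()

# httpfs adds s3:// support; the region is a placeholder and credentials
# are assumed to come from the environment or DuckDB's S3 settings.
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("SET s3_region = 'us-east-1';")

# Streams the remote files; only the needed columns and row groups are read.
df = con.sql("""
    SELECT user_id, SUM(amount) AS total
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY user_id
""").df()  # .df() requires pandas
print(df.head())
```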
In the distributed computing world, the rule of thumb is that you start to scale horizontally when your compute workload no longer fits in the memory of a single machine. So it depends on your compute workload and your hardware. (There's no fixed number for what counts as a large dataset.)
DuckDB itself doesn't have any baked-in limits. If the data fits in memory, single-machine compute is usually faster than distributed compute; DuckDB is faster than Pandas, and definitely faster than local Spark.
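As a small illustration of the Pandas comparison (the DataFrame and columns are made up), DuckDB can run SQL directly over an in-memory pandas DataFrame by referencing the local variable name:

```python
import duckdb
import pandas as pd

# A toy DataFrame; on larger data this kind of group-by is where DuckDB's
# vectorized engine tends to pull ahead of plain pandas.
sales = pd.DataFrame({
    "region": ["eu", "us", "eu", "us"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})

# DuckDB resolves 'sales' to the local DataFrame (a replacement scan).
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").df()
print(result)
```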
DuckDB has SQLite semantics but is natively built around columnar formats (Parquet, in-memory Arrow) and strong types (including dates). It also supports very complex SQL.
SQLite is a row store built around row-based transactional workloads. DuckDB is built around analytics workloads (lots of filtering, aggregations, and transformations), and for these workloads DuckDB is just way, way faster. Source: personal experience.
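A minimal sketch of what I mean by an analytics workload with real date types, assuming pyarrow is installed (the table name and columns are invented):

```python
from datetime import date

import duckdb
import pyarrow as pa

# An Arrow table with a proper DATE column; DuckDB scans it in place and
# keeps the type instead of falling back to strings.
orders = pa.table({
    "order_date": pa.array([date(2024, 1, 5), date(2024, 1, 6), date(2024, 2, 1)]),
    "amount": pa.array([12.5, 8.0, 30.0]),
})

# Filter + aggregate by month, reading 'orders' directly from memory.
monthly = duckdb.sql("""
    SELECT date_trunc('month', order_date) AS month,
           SUM(amount) AS revenue
    FROM orders
    WHERE amount > 5
    GROUP BY month
    ORDER BY month
""").arrow()
print(monthly)
```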