> The SQL above results in a plan similar to the DuckDB optimized plan, but it is wordier and more error-prone to write, which can potentially lead to bugs.
FWIW, aside from manual filter pushdown, I consider the JOIN variant the canonical / "default" way to merge multiple tables; it keeps all the join-related logic in one place, while mixing both joining conditions and filtering conditions in WHERE always felt more error-prone to me.
Same here; I've always intuited that this would limit the generated tuples. I'm too lazy to do it now, but I wonder if other DB engines also perform this optimization that effectively makes filtering in JOIN conditions equivalent to filtering in WHERE clauses. I'd also be interested in some example queries that were hand-optimized to the point of obvious obscurity--my guess is it's harder to do this in SQL than in something like C.
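For reference, the two placements in question look like this (a sketch against the taxi_data_2019 / zone_lookups tables used in the query further down; running EXPLAIN on each should show whether the engine arrives at the same plan):

-- 1. Filter folded into the join condition:
SELECT count(*)
FROM taxi_data_2019 data
INNER JOIN zone_lookups pickup
    ON pickup.LocationID = data.pickup_location_id
   AND pickup.Borough = 'Manhattan';

-- 2. The same filter expressed in WHERE:
SELECT count(*)
FROM taxi_data_2019 data
INNER JOIN zone_lookups pickup
    ON pickup.LocationID = data.pickup_location_id
WHERE pickup.Borough = 'Manhattan';

-- For an inner join, an optimizer that does pushdown should apply the
-- Borough filter at the scan of zone_lookups in both cases.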
Any non-toy SQL engine is going to do condition pullup and pushdown. (Well, I know at least one that only does explicit pullup, but does pushdown more implicitly because it doesn't use a tree. The effect is largely the same, modulo some shenanigans around outer joins.)
Pushdown is actually a more subtle problem than people give it credit for; it inhabits this weird space where it's too trivial to be covered in papers but too hard to cover properly in textbooks (e.g., no coverage of multiple equalities or outer joins).
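To make the outer-join part concrete, here is a sketch on the same taxi_data_2019 / zone_lookups tables (column names taken from the query below) where the two placements stop being interchangeable:

-- Predicate in ON: non-Manhattan lookups simply fail to match, so
-- every trip row survives, with NULL zone columns where it didn't:
SELECT data.pickup_location_id, pickup.zone
FROM taxi_data_2019 data
LEFT JOIN zone_lookups pickup
    ON pickup.LocationID = data.pickup_location_id
   AND pickup.Borough = 'Manhattan';

-- Predicate in WHERE: rows whose lookup is missing or not Manhattan
-- are filtered after the join, quietly turning it back into an inner
-- join. Same predicate text, different placement, different result.
SELECT data.pickup_location_id, pickup.zone
FROM taxi_data_2019 data
LEFT JOIN zone_lookups pickup
    ON pickup.LocationID = data.pickup_location_id
WHERE pickup.Borough = 'Manhattan';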
Note: the dig at dataframe libs is worth some care, in case you take it to mean that duckdb can optimize and they cannot.
Dask, Polars, and others pick a lazy default in order to make distribution and other optimizations easier. When you stay within their pure ('vectorized') fragments, the same scheduler-rewriting opportunity is there.
This is a subtle but important distinction when looking at these frameworks. We are making our new graph query language 'gfql' dataframe-native so it can run naturally & natively as a step in pipelines people are already building, and also so it automatically runs as optimized CPU/GPU columnar ops. At the same time, to leave room for query plan optimization, we are staying declarative / lazy, even though the generated & interpreted code uses an eager DF runtime. I'm optimistic that lazy-DF output targets will do the query-planner work for us long-term, but for the eager framework targets, the query planning has to be on our side.
> This means your optimizations need to be applied by hand, which is sustainable if your data starts changing.
Seems like a missing "un" here
Compelling article! I've already found DuckDB to be the most ergonomic tool for quick and dirty wrangling; it's good to know it can handle massive jobs too.
I regularly use duckdb on datasets of 1B+ rows, with nasty string columns that may be over 10MB per value in the outliers. Mostly it just works, and fast too! When it doesn't, I'll usually just dump to parquet and hit it with sparksql, but that is the exception rather than the rule.
SELECT
    pickup.zone AS pickup_zone,
    dropoff.zone AS dropoff_zone,
    cnt AS num_trips
FROM
    (SELECT pickup_location_id, dropoff_location_id, COUNT(*) AS cnt
     FROM taxi_data_2019
     GROUP BY 1, 2) data
INNER JOIN
    (SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') pickup
    ON pickup.LocationID = data.pickup_location_id
INNER JOIN
    (SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') dropoff
    ON dropoff.LocationID = data.dropoff_location_id
ORDER BY num_trips DESC
LIMIT 5;
┌───────────────────────┬───────────────────────┬───────────┐
│ pickup_zone │ dropoff_zone │ num_trips │
│ varchar │ varchar │ int64 │
├───────────────────────┼───────────────────────┼───────────┤
│ Upper East Side South │ Upper East Side North │ 536621 │
│ Upper East Side North │ Upper East Side South │ 455954 │
│ Upper East Side North │ Upper East Side North │ 451805 │
│ Upper East Side South │ Upper East Side South │ 435054 │
│ Upper West Side South │ Upper West Side North │ 236737 │
└───────────────────────┴───────────────────────┴───────────┘
Run Time (s): real 0.304 user 1.791931 sys 0.132745
(unedited query is similar to theirs, about .9s)
And, doing the same to the optimized query drops it to about .265s.
AFAIK there are rather few query optimizers that can do this; it's not an easy problem, in part because your search space explodes. There's a 2014 paper that addresses this (https://madoc.bib.uni-mannheim.de/37228/1/main.pdf) and some other papers that combine grouping and join into a “groupjoin” operation; the main problem is that it only really seems to scale to planning small queries (less than eight joins or so, IIRC).
I've never run across it, but I would describe the job of an analytics engineer as doing this over and over for analysts -- it's probably semantically clearer to do the join first and then aggregate, so I end up pushing it down for them.
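Concretely, the rewrite is something like this (a single-join sketch on the article's tables for brevity; the two forms should agree as long as LocationID is unique in zone_lookups):

-- What the analyst writes: join first, then aggregate.
SELECT pickup.zone, count(*) AS num_trips
FROM taxi_data_2019 data
INNER JOIN zone_lookups pickup
    ON pickup.LocationID = data.pickup_location_id
GROUP BY pickup.zone;

-- The hand-pushed-down version: pre-aggregate the big fact table by
-- the join key, then join the much smaller result to the lookup.
SELECT pickup.zone, sum(cnt) AS num_trips
FROM (
    SELECT pickup_location_id, count(*) AS cnt
    FROM taxi_data_2019
    GROUP BY 1
) data
INNER JOIN zone_lookups pickup
    ON pickup.LocationID = data.pickup_location_id
GROUP BY pickup.zone;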
Thanks for the reference; this question has been on my mind for some time.