Nobody actually answered your question helpfully, so here goes:
When you write `SELECT DISTINCT a, b` you are asking for every unique pair (a, b) in the result. In particular, the same value of `a` can appear multiple times in the result, paired with different values of `b`.
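A tiny illustration with a toy table (names invented, just to make the semantics concrete):

```sql
-- The same value of a appears in several rows.
CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 10), (1, 20), (1, 10);

SELECT DISTINCT a, b FROM t;
-- Returns (1, 10) and (1, 20): the exact duplicate pair is removed,
-- but a = 1 still appears twice because the pairs differ.
```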
The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
As humans, we look at this and think "Oh, but `a` is the primary key (in this case CustomerID) and `b` is a column in the same table (CustomerName), so we only need to be distinct over `a`, which means the query optimizer should be able to 'push down' the DISTINCT clause into the scan of the joined tables". The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
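Concretely, the query being discussed presumably has a shape like this (the Customers/Orders schema is extrapolated from the CustomerID/CustomerName example; the join is an assumption):

```sql
SELECT DISTINCT c.CustomerID, c.CustomerName
FROM Customers AS c
JOIN Orders AS o ON o.CustomerID = c.CustomerID;
-- CustomerID is the primary key of Customers, so CustomerName is
-- functionally dependent on it. Deduplicating on CustomerID alone gives
-- the same result, and an optimizer that tracked this dependency could
-- skip duplicates during the scan instead of materializing the whole
-- join first.
```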
At this level you really start getting into the nitty-gritty of which optimizations a particular database has implemented, though: the solution presented in the article may be the best way to write this for SQL Server, but this "correlated subquery" form could be catastrophically slow in other database engines...
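For reference, the "correlated subquery" form is presumably along these lines (a sketch, since the article's exact query isn't reproduced here):

```sql
-- No DISTINCT needed: each customer is emitted at most once, and the
-- correlated EXISTS acts as a semijoin against Orders.
SELECT c.CustomerID, c.CustomerName
FROM Customers AS c
WHERE EXISTS (SELECT 1 FROM Orders AS o
              WHERE o.CustomerID = c.CustomerID);
```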
> The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
This isn't quite right; of the two general ways (hashing and sorting), the second doesn't have to materialize that big intermediate result. Sorting becomes quite relevant once you have a reasonable framework for interesting orders (e.g., scanning along an index, or doing a sort-ahead on a smaller table with a foreign key that guarantees uniqueness).
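A sketch of the index case (hypothetical Orders table; whether a given engine actually picks this plan depends on its optimizer):

```sql
-- With this index, CustomerID values can be read in sorted order, so
-- duplicates arrive adjacent to each other and can be skipped on the
-- fly: no hash table, no separate sort, no big intermediate result.
CREATE INDEX ix_orders_customerid ON Orders (CustomerID);

SELECT DISTINCT CustomerID FROM Orders;
```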
> The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
There are absolutely query optimizers that know this (try such a query in MySQL, for instance).
Hmm... If you know `a` is a key of table `X` and `b` is a key of table `Y`, you can replace a `select distinct` with a `join ... on exists ()` with complete certainty.
And this solves almost all of the problematic cases in practice. If the thing you are querying isn't a key, you will naturally think a bit further and not write the DISTINCT.
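A sketch of that rewrite, with hypothetical tables `X`, `Y`, and a duplicate-introducing detail table `D` (all names invented for illustration; `y.xa` is assumed to be a foreign key referencing `x.a`):

```sql
-- Original: the join to D fans out rows, so DISTINCT is needed to
-- collapse them back to unique (a, b) pairs.
SELECT DISTINCT x.a, y.b
FROM X AS x
JOIN Y AS y ON y.xa = x.a
JOIN D AS d ON d.yb = y.b;

-- Rewrite: a and b are keys of X and Y, so the key-to-key join already
-- yields unique pairs; the fan-out join becomes an EXISTS semijoin and
-- the DISTINCT disappears.
SELECT x.a, y.b
FROM X AS x
JOIN Y AS y ON y.xa = x.a
WHERE EXISTS (SELECT 1 FROM D AS d WHERE d.yb = y.b);
```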
I guess that kind of optimization just goes counter to the philosophy of SQL optimizers. I know that I would be deeply surprised if I found a database doing it, and not in a good way.
You're overthinking it: the optimization is absolutely possible as you've described. The reason databases haven't done it is not some question of design philosophy; it's simply that nobody has bothered to implement it.
> I know that I would be deeply surprised if I found a database doing it, and not in a good way.
SQL engines do way more surprising optimizations than this. SQL is a "4th gen" language after all, meaning it's absolutely within the spirit that the database engine does whatever it wants to get the result.