Nobody actually answered your question helpfully, so here goes:
When you write `SELECT DISTINCT a, b` you are asking for every unique pair (a, b) in the result. In particular, the same value of `a` can appear multiple times in the result, paired with different values of `b`.
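A tiny illustration with a toy table (names invented, just to make the semantics concrete):

```sql
-- The same value of a appears in several rows.
CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 10), (1, 20), (1, 10);

SELECT DISTINCT a, b FROM t;
-- Returns (1, 10) and (1, 20): the exact duplicate pair is removed,
-- but a = 1 still appears twice because the pairs differ.
```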
The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
As humans, we look at this and think "Oh, but `a` is the primary key (in this case CustomerID) and `b` is a column in the same table (CustomerName), so we only need to be distinct over `a`, which means the query optimizer should be able to 'push down' the DISTINCT clause into the scan of the joined tables". The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
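Concretely, the query being discussed presumably has a shape like this (the Customers/Orders schema is extrapolated from the CustomerID/CustomerName example; the join is an assumption):

```sql
SELECT DISTINCT c.CustomerID, c.CustomerName
FROM Customers AS c
JOIN Orders AS o ON o.CustomerID = c.CustomerID;
-- CustomerID is the primary key of Customers, so CustomerName is
-- functionally dependent on it. Deduplicating on CustomerID alone gives
-- the same result, and an optimizer that tracked this dependency could
-- skip duplicates during the scan instead of materializing the whole
-- join first.
```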
At this level you really start getting into the nitty-gritty of which optimizations a particular database has implemented, though: the solution presented in the article may be the best way to write this for SQL Server, but this "correlated subquery" form could be catastrophically slow in other database engines...
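For reference, the "correlated subquery" form is presumably along these lines (a sketch, since the article's exact query isn't reproduced here):

```sql
-- No DISTINCT needed: each customer is emitted at most once, and the
-- correlated EXISTS acts as a semijoin against Orders.
SELECT c.CustomerID, c.CustomerName
FROM Customers AS c
WHERE EXISTS (SELECT 1 FROM Orders AS o
              WHERE o.CustomerID = c.CustomerID);
```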
> The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
This isn't quite right; of the two general ways (hashing and sorting), the second doesn't have to materialize that big intermediate result. Sorting becomes quite relevant once you have a reasonable framework for interesting orders (e.g., scanning along an index, or doing a sort-ahead on a smaller table with a foreign key that guarantees uniqueness).
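A sketch of the index case (hypothetical Orders table; whether a given engine actually picks this plan depends on its optimizer):

```sql
-- With this index, CustomerID values can be read in sorted order, so
-- duplicates arrive adjacent to each other and can be skipped on the
-- fly: no hash table, no separate sort, no big intermediate result.
CREATE INDEX ix_orders_customerid ON Orders (CustomerID);

SELECT DISTINCT CustomerID FROM Orders;
```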
> The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
There are absolutely query optimizers that know this (try such a query in MySQL, for instance).
Hmm... If you know `a` is a key of table `X` and `b` is a key of table `Y`, you can replace a `select distinct` with a `join ... on exists ()` with complete certainty.
And this solves almost all of the problematic cases in practice. If the thing you are querying isn't a key, you will naturally think a bit further and not write the DISTINCT.
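A sketch of that rewrite, with hypothetical tables `X`, `Y`, and a duplicate-introducing detail table `D` (all names invented for illustration; `y.xa` is assumed to be a foreign key referencing `x.a`):

```sql
-- Original: the join to D fans out rows, so DISTINCT is needed to
-- collapse them back to unique (a, b) pairs.
SELECT DISTINCT x.a, y.b
FROM X AS x
JOIN Y AS y ON y.xa = x.a
JOIN D AS d ON d.yb = y.b;

-- Rewrite: a and b are keys of X and Y, so the key-to-key join already
-- yields unique pairs; the fan-out join becomes an EXISTS semijoin and
-- the DISTINCT disappears.
SELECT x.a, y.b
FROM X AS x
JOIN Y AS y ON y.xa = x.a
WHERE EXISTS (SELECT 1 FROM D AS d WHERE d.yb = y.b);
```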
I guess that kind of optimization just goes counter to the philosophy of SQL optimizers. I know that I would be deeply surprised if I found a database doing it, and not in a good way.
You're overthinking it: the optimization is absolutely possible as you've described. The reason databases haven't done it is not some question of design philosophy; it's simply that nobody has bothered to implement it.
> I know that I would be deeply surprised if I found a database doing it, and not in a good way.
SQL engines do way more surprising optimizations than this. SQL is a "4th gen" language after all, meaning it's absolutely within the spirit that the database engine does whatever it wants to get the result.