> The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
This isn't quite right; there are two general ways. One is a hash table, as you say; the other is sorting. The second becomes quite relevant once you have a reasonable framework for interesting orders (e.g., scanning along an index, or doing sortahead on a smaller table with a foreign key that guarantees uniqueness).
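As a concrete sketch of the sort-based flavour (the `orders`/`customer_id` schema here is my own example, not anything from the article):

```sql
-- Hypothetical table: orders(customer_id, ...) with an index on customer_id.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- A sort-based DISTINCT can ride on that index: the scan delivers rows
-- already ordered by customer_id, so duplicates arrive adjacent to each
-- other and can be skipped on the fly, with no large intermediate
-- result and no hash table of "already seen" values.
SELECT DISTINCT customer_id
FROM orders;
```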
> The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
There are absolutely query optimizers that know this (try such a query in MySQL, for instance).
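For the `a`/`b` case, here is a minimal sketch of the kind of query where a key/foreign-key relationship lets an optimizer skip deduplication entirely (the schema is my assumption, not the article's):

```sql
-- Assumed schema: b.id is the primary key and a.b_id references it.
CREATE TABLE b (id INT PRIMARY KEY);
CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL REFERENCES b (id));

-- Each row of a matches exactly one row of b, so the join cannot
-- duplicate rows of a. An optimizer that tracks this key/foreign-key
-- relationship can treat the DISTINCT as a no-op instead of hashing
-- or sorting the whole result-set.
SELECT DISTINCT a.id, a.b_id
FROM a
JOIN b ON b.id = a.b_id;
```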