On “order by” optimization

acqq · on July 31, 2015

In short, if I understood, the author used

SELECT c,d FROM t WHERE b=X ORDER BY a DESC LIMIT 1

on MySQL and was surprised to find that MySQL was using index over a and in his opinion fully ignored the existence of the index over (b,c) to discover where b is equal to X? I admit I didn't understand his explanation why, especially not why it would maybe do the right thing with the LIMIT 2. I can't imagine that any serious SQL engine would ignore the possibility to use the index in such a case.

midom · on Aug 2, 2015

LIMIT 2 is also exposed to this issue, at different data ratios (in my tests up to LIMIT 142, I went through the math).

acqq · on Aug 2, 2015

Can you please explain: Does MySQL use index over (b,c) or not where evaluating "where b=X" or not? As I've said, I'd expect any engine which has an index to use it.

goldenkey · on July 31, 2015

Its usually better to be explicit and use aggregates like MAX or MIN instead of a LIMIT of 1.

degenerate · on Aug 1, 2015

Can you elaborate on that with an example? I can't imagine when running MAX() or MIN() could be better than a static digit, so an example would help me understand your comment!

crishoj · on Aug 1, 2015

As in SELECT MAX(a), c, d FROM t WHERE b=x?

MySQL will return a random c and d, not necessarily the c and d from the row with maximum a...

acqq · on Aug 1, 2015

Goldenkey actually posted

SELECT c,d FROM t WHERE a=(SELECT MAX(a) WHERE b=X from t)

But that post got flagged because he also gave his opinion on the level of understanding the article writer has. The query you posted is actually the one by the article writer.

crishoj · on Aug 1, 2015

I see. Thanks.

Animats · on Aug 1, 2015

Right. MySQL's optimizer knows how many records each table has, but not much else. It doesn't do statistical tests on the tables. Depending on table contents, "WHERE b=X" can select any number of records, from zero to the whole database. Whether the optimal strategy is to do the "WHERE b=X" first, possibly getting a large number of hits, or simply making a sequential pass, is not decidable from the database size only.

An alternative would be for the optimizer to do an indexed SELECT on b for "WHERE b=X", and if the number of hits exceeded some threshold, abandon that effort as inefficient and start over using a sequential pass. But MySQL doesn't abandon strategies once selected, or keep such statistics as a hint for future queries. As someone pointed out, Oracle's flagship database product does do that. Anyone know about Postgres?

th3iedkid · on Aug 1, 2015

>>Unfortunately, optimizer doesn’t understand, that there’re humans who are building these systems, and humans have their own rationale and thinking. One of the ideas that a human would have is that if you have 100GB table, you better understand things like data locality and other sorts of efficiencies.

I have not used MySql and cannot comment for that but with my experience with Oracle, i believe its histogram generation process is actually trying to solve the same problem.It has a binning strategy for histogram generation that tries to help with data patterns.

http://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm

zapt02 · on Aug 1, 2015

Great read!

goldenkey · on July 31, 2015

[flagged]

dang · on Aug 1, 2015

It sounds like you have a lot to teach us about query writing, but gratuitously putting someone else down, even when they don't know as much as you do, is not ok here. It violates both important rules of HN comments: civility (it truly is gratuitously negative) and substance, because it impedes the absorption of real information in your comment.

In the future, please edit that stuff out and stick to the substance when commenting here.

https://news.ycombinator.com/newsguidelines.html

https://news.ycombinator.com/newswelcome.html

schrodinger · on Aug 1, 2015

Yea, the guy that goes into the source code of the optimizer to figure out the cause for a poorly optimized query plan is the amateur!

midom · on Aug 2, 2015

So sad I can't see or reply to that comment anymore - I'm the author. Sorry, HN, my personal blog rarely gets that wide circulation, I should've done better job at introducing the context.

This was just one of many systems that thousands of facebook engineers build and they did not think of some edge cases. We do know how to write queries or to operate our databases, in general.