I liked the article, and I know this is nitpicking, but I disagree that creating an index is the best solution for eliminating the sequential scan. Hopefully the next article mentions it.
Just drop the sort clause. Returning the user with the lowest id seems like an unusual use case.
Imagine searching for a name in a phone book and only finding the oldest person with that name, or the person with the lowest phone number.
More common cases are checking whether a user exists, which doesn't require a sort, or finding all the users with a particular name, which I guess could be paged using limit and offset.
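To make that concrete, a sketch (assuming the article's users table and name column):

```sql
-- Existence check: no sort needed, and the scan can stop at the first match.
SELECT EXISTS (SELECT 1 FROM users WHERE name = 'Captain Nemo');

-- All matches, paged. Note that without an ORDER BY the page
-- boundaries aren't guaranteed to be stable between executions.
SELECT * FROM users WHERE name = 'Captain Nemo' LIMIT 20 OFFSET 0;
```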
Oh absolutely. I thought about saying that, and just ran out of energy before posting the article this morning.
The reason I used this particular SQL statement as an example is that ActiveRecord (a Ruby ORM) generates it from a fairly simple, common Ruby expression. I suppose ActiveRecord could be improved to drop the sort when you know there's only one possible match.
Unless "name" is unique, there could be multiple matches. Asking for the "first" demands an ordering.
Perhaps (a) the user doesn't care what ordering is used, and/or (b) (as in your example in the preceding post) hasn't specified an order (and apparently the ordering defaults to the primary key - a model option?). It would be error-prone for the ORM to take (b) as implying (a).
Yes, good point. Rails has a lot of conventions like this - that "first" implies ordering by primary key (by default). I just imagine that most Rails developers don't think about sorting and its implications when they ask for the first record.
Even if you don't care about the id ordering, unless "Captain Nemo" occurs so frequently as to be likely found in the first few pages of the table, the index is a big help.
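Concretely, something like this (the index name is illustrative):

```sql
-- A plain B-tree index on name lets Postgres jump straight to the
-- matching rows instead of scanning the whole table.
CREATE INDEX index_users_on_name ON users (name);
```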
That's one possibility. Another would be for a client to transfer an encoding of a query that can be cheaply decoded, instead of requiring parsing. The translation of a query from an application AST to that encoding could skip the SQL representation altogether!
Of course this sort of approach would give up the (debatable) interoperability capabilities of SQL and I'm not sure parsing is enough of a bottleneck for it to really be worthwhile. A "binary SQL" spec would also be interesting (and maybe exists already?).
I was purely talking about "pre-compiling" the queries themselves, so it would only affect parsing, not planning. I have no trouble at all believing that this would be a totally negligible improvement.
I'm not a postgres hacker but as I understand it, query plans generated at "exec time" can be better than those generated at "prepare time" because they have more information to hand about the expected size of nodes within the plan.
There are other reasons to use prepared queries but if performance is your only one it may not be worth it unless your query is either very complex, or runs in a tight loop.
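For reference, the server-side flavor of this looks roughly like the following (the statement name and query are my own illustration):

```sql
-- Parse (and potentially plan) once...
PREPARE find_user (text) AS
  SELECT * FROM users WHERE name = $1 ORDER BY id LIMIT 1;

-- ...then execute repeatedly, paying only bind/execute cost each time.
EXECUTE find_user('Captain Nemo');
```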
You are absolutely right. That's because when a parametric query arrives, the parameters are unbound and the planner cannot take advantage of fit-to-purpose selectivity estimation; it must instead estimate generically. Newish versions of Postgres (9.2+, I believe) try to paper over this surprising effect by re-planning queries a few times to check for cost stability before really saving the plan. It has proved very practical.
Notes (from the Postgres documentation for PREPARE):

> If a prepared statement is executed enough times, the server may eventually decide to save and re-use a generic plan rather than re-planning each time. This will occur immediately if the prepared statement has no parameters; otherwise it occurs only if the generic plan appears to be not much more expensive than a plan that depends on specific parameter values. Typically, a generic plan will be selected only if the query's performance is estimated to be fairly insensitive to the specific parameter values supplied.
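You can watch this from the outside with EXPLAIN on a prepared statement (a sketch, reusing the find_user statement from above):

```sql
-- Run this several times. Early executions show a custom plan with the
-- literal substituted; once the server settles on a generic plan, the
-- filter appears with the $1 placeholder instead.
EXPLAIN EXECUTE find_user('Captain Nemo');
```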
The way you describe it, could prepared queries still delay the creation of their plans until exec time? Some things, like sort order, cause plans to change, but usually, if queries are just parameterized, there's no need to re-plan.
I've toyed with the idea in the past of creating a custom parser for an alternative SQL syntax, although I doubt I have enough knowledge to do so. Not necessarily revolutionary - mostly just a shuffling of normal SQL syntax to make it more understandable, like moving the selected columns to the end. One day, maybe.
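For instance, a reshuffled syntax might read something like this (entirely hypothetical, not valid SQL anywhere that I know of):

```sql
-- The query reads in roughly the order it is evaluated:
-- source first, then the filter, then the projection.
FROM users
WHERE name = 'Captain Nemo'
SELECT id, name;
```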
Also agree. SQL is not easily composable. I wish for relational algebra programs that can have an optimizer run over them and determine the optimal join order, etc.
The argument that SQL allows you to say what the result should be without any reference to the order in which the steps should be done is fine, except that it is hard to say some things in SQL. If C compilers can transform sequential, imperative code to an equivalent optimized form, I don't see why relational algebra compilers / optimizers cannot.
> I wish for relational algebra programs that can have an optimizer run over them and determine the optimal join order, etc.
Optimal join order usually is a function of both the query and the data, and a query optimizer inside your database doesn't necessarily find the optimal join order. It uses heuristics (which often include particulars about the data currently stored in the database) to find a join order that's hopefully better than a naive query plan, in much the same way that optimization passes in a C compiler use heuristics to generate machine code that's hopefully faster than a naive translation to machine code.
> The argument that SQL allows you to say what the result should be without any reference to the order in which the steps should be done is fine, except that it is hard to say some things in SQL.
The GP is talking about a surface syntax change, not a change in the underlying computation model. Using a relational algebra notation, the query optimizer would have just as much freedom as an SQL query optimizer. Relational algebra isn't any more inherently imperative than SQL is.
> If C compilers can transform sequential, imperative code to an equivalent optimized form, I don't see why relational algebra compilers / optimizers cannot.
It sounds like you want a semantic change that gives the query optimizer less freedom than an SQL query optimizer. The GP is suggesting only a syntactic change. The thing it sounds like you want does sort-of exist... many SQL databases will let you inspect the query plan that their optimizer has generated.
However, it sounds like you want some statically defined query plan. The problem with this is that the optimal plan depends on the data that's in the database at the time the query is run. For instance, a query optimizer can look at a complex query with multiple constant WHERE clauses on indexed columns, and use the indexes to quickly determine the size of intermediate tables when deciding the order in which to perform joins. A query language that statically defines a query plan cannot take advantage of this information, unless you want to "re-compile"/"re-optimize" once a day or something. However, if you trust the database to automatically re-optimize on some schedule, then you've lost your static query plan and it seems you might as well let the query optimizer regenerate the plan based on heuristics created by the database developers rather than sticking to some static schedule of recompilation/reoptimization points.
What I want is to be able to transform relations in a certain order for ease of reasoning (and introspection of intermediate values), but then have the optimizer transform it into whatever equivalent plan it can determine that gives the same results. Of course, there may be places where the steps taken will over-constrain the optimizer, but that's probably an acceptable risk (and as long as we're wishing for things here, that should preferably be detected).
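Postgres CTEs get part of the way there today. A sketch (and note that since Postgres 12 the planner can inline CTEs like these rather than materializing them, so it keeps its freedom to rearrange):

```sql
-- Each step is a named relation you can reason about and inspect by
-- selecting from it directly.
WITH nemos AS (
  SELECT * FROM users WHERE name = 'Captain Nemo'
),
first_nemo AS (
  SELECT * FROM nemos ORDER BY id LIMIT 1
)
SELECT * FROM first_nemo;
```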
Yeah, I'm excited to read about indexes in PostgreSQL. Indexes also add overhead to the database system as a whole, so they should be used sensibly. I'm looking forward to seeing your approach to finding good index candidates.
If you enjoyed this post, dig into Pat's archive. He's written a handful of posts that are in a similar style.
Indeed, I'm starting to feel like a shill given how often I praise it, but if you have any interest in language implementations and don't mind Ruby as a vehicle for some exploration, check out Pat's book Ruby Under a Microscope.
Postgres did an awful job on that query. Small values of LIMIT should not require a full sort, even when there's no index. I think MySQL has a special case for small values of LIMIT.
Looking forward to the next installment. That query is so simple that you're not seeing what a real database can do. Let's see something with a JOIN and lots of indices, so the optimizer can do some work.
> Postgres did an awful job on that query. Small values of LIMIT should not require a full sort, even when there's no index.
Postgres will not necessarily do a full external merge sort for the query plan shown in the article: when there is a LIMIT k clause, the Postgres sorting code has optimizations to only keep the top k values in memory and do an in-memory sort (obviously, a full seqscan of the input is still needed without an index). Search for TSS_BOUNDED and tuplesort_set_bound() here:
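You can see which path the sorter took with EXPLAIN ANALYZE (using the article's query; the plan will of course depend on your data):

```sql
-- With a small LIMIT, the Sort node typically reports
-- "Sort Method: top-N heapsort" rather than an external merge sort.
EXPLAIN ANALYZE
SELECT * FROM users WHERE name = 'Captain Nemo' ORDER BY id LIMIT 1;
```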
By the way, I used a simple query on purpose to keep things easy to follow for readers (and for me, the author!). By using such a simple SQL statement I was able to avoid complex optimizations and algorithms I wouldn't have been able to follow - and that wouldn't have helped readers much anyway. I wanted to get the basic idea across to people who have no prior knowledge of DB internals.
Plus I wasn't interested in comparing one DB vs. another, as much as I was interested in understanding how any DB works.
Depends on the contents of the database. The query optimizer doesn't know, at optimization time, how many hits such a statement will produce. Given a database with only a few entries with the same name, the sort cost is tiny. If there are a lot of hits, it's expensive.