Prepared statements and their surprising performance implications (soykaf.com)
138 points by panic on Feb 23, 2020 | 36 comments



Prepared Statements are a place where 'correct' and 'fast' intersect in a big way. If all databases detected when you were done with them and closed them automatically, they'd be just about perfect. But you can't have everything, I suppose. A few programming languages have designed or re-designed their APIs to try to detect this at the library level, and I appreciate the efforts to do so.

I was stunned to learn that the most popular database mapping tool in NodeJS did not use Prepared Statements. Due to an accident of their compatibility matrix it was going to be tricky to achieve, and so they had kicked that can down the road several times.

But in the meantime they had done just heaps of work on input validation to prevent SQL injection attacks. They were confident they had no attack surface. Bunches of test cases to prove it.

It was sort of impressive but also really sad, because creating Prepared Statements with bound variables gets all of that for free, and they're much faster on some DBs. The main exception is for freeform text search, but we seem to have pushed that problem to external tools anyway (I'm not sure the two facts are unrelated...)
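For anyone who hasn't used them from Node, this is roughly the shape with node-postgres (the `pg` package) - a minimal sketch, not the ORM being discussed:

    import { Pool } from "pg";

    const pool = new Pool();

    // The untrusted value travels separately from the SQL text, so there is
    // nothing to inject into; the driver sends it as a bind parameter.
    async function findUserByEmail(email: string) {
      const { rows } = await pool.query(
        "SELECT id, email FROM users WHERE email = $1",
        [email]
      );
      return rows[0];
    }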


In MSSQL, prepared statements are not really used because they're unnecessary. MSSQL hashes incoming SQL statements, looks for them in the plan cache, and reuses plans automatically. It also looks at connection settings and does parameter sniffing to decide whether to use existing plans.

You can still do prepared statements with MSSQL, but they're mainly there for backward compatibility.

I was actually surprised to find out recently that PG doesn't do this - it only has prepared statements and can't share plans across connections.


I'm a bit confused to read that, because whilst statements are hashed for reuse (doesn't that require 'optimize for ad hoc workloads'?), unless the text of the query is an identical match it will need to compile another separate query plan and won't reuse the existing one. This means you can end up in a situation where many ad hoc requests cause significant memory consumption. Would you be able to speak to why prepared statements aren't necessary in MSSQL? I'm happy to link to some blog articles arguing the exact opposite.
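To make the ad hoc point concrete: the cache is keyed on the exact statement text, so (illustrative sketch, no particular driver) the first form below creates one near-duplicate plan per distinct value, while the second compiles once and gets reused:

    // Ad hoc: every distinct customer id produces a different statement text,
    // so the plan cache fills with single-use, near-duplicate plans.
    const adHoc = (customerId: number) =>
      `SELECT * FROM orders WHERE customer_id = ${customerId}`;

    // Parameterized: one statement text, one cached plan; the value is bound
    // separately (T-SQL style parameter marker shown here).
    const parameterized = "SELECT * FROM orders WHERE customer_id = @customerId";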


Can you share which database mapping tool for NodeJS this is?


Sequelize has bind parameters on their 6.0 roadmap:

https://github.com/sequelize/sequelize/issues/11586

Good for them, but... Version 6?

It appears that the first person to propose an implementation did so in 2018, gave up and moved to other tools.

I'm just getting back to such things. I had looked at knex and bookshelf as well; I believe I picked the latter.

The side-loading logic for Ember isn’t really supported by any of these and I got bogged down trying to write those by hand. I’m about to pick that project up again and I think I will be using Svelte, and possibly Phoenix. I like JavaScript, but it turns out I just don’t like using it all day.


Long ago, we brought in a consultant to tell us why we couldn't max out our very fancy Oracle 9i database hardware. Nothing we did could get us much above 50% saturation on the disk array, CPU, or networking. We were stumped.

I don't know where Oracle is on this issue these days, but in that version there was a query cache for solved queries. And for a query to execute, it had to be in that cache. That dictated how many queries could be in-flight at once. Or rather, how many unique queries. If you used Prepared Statements, it would be happy to run as many configurations in parallel as you wanted. We were using Prepared Statements, so that shouldn't have been a problem. Right? ... Right?

And then he showed me the log of queries.

Less than 2 years earlier (oh, you can already tell this is going somewhere, can't you), a newer team mate was trying to scrub all of the query string arithmetic from our code. Someone had settled on some sort of Builder pattern, but this guy was getting stuck. The guy who had instigated this had to tag out for reasons that I cannot recall, so he asked me to sub in. I got a little briefing, then we went through the problems and I sorted him out, and then I very carefully explained to him what we expected of this code.

If you're gonna bind variables, bind all of them. Solving part of a combinatorics problem isn't solving it, solving part of a SQL injection problem isn't solving it. Bind. Every. Variable. Got it? Got it. You good? I'm good.

And then, dear friends, I made a fateful mistake: I went back to my other tasks and fires, and the other engineer oversaw the rest of the work and the code review.

18+ months later, I'm staring at a bunch of nearly identical queries in the Oracle query cache that have half a dozen bind variables and then a couple of distinct AND clauses near the end. Plot twist: he was now my boss. The answer to 'why is our fancy hardware slow?' was, "because of you, Mike, because of you." Fuck a duck.

I guess maybe the reason he was so awful at giving orders is that he never learned how to follow them.


I'm not sure I follow the punchline. If the queries need different WHERE clauses then they aren't interchangeable. Or are you saying they were semantically the same and only syntactically different?


My read of GP's story is that the prepared queries varied by supplying different hard-coded parameters, in addition to some bound variables. For example, the WHERE clause might be "WHERE first=? AND last='Smith'". The 'first' argument is a bound variable, while the 'last' argument is part of the prepared query string.


dantillberg has the right of it.

We were doing some complicated fitness tests on our data to return appropriate rows to the user.

Actual:

SELECT * FROM things WHERE ... AND columnA>? AND columnB<5;

Expected:

SELECT * FROM things WHERE ... AND columnA>? AND columnB<?;

These sorts of mistakes tend to sneak back into projects that previously had this handled, so you do have to watch for regressions. (But that's not what happened in this case, if anyone is wondering.)

I don't know if there are any good ways to audit for these sorts of problems these days. At one point I joked that I was going to write my own database library that only accepted Prepared Statements and nothing else, but barring a full SQL query parser, I'm not entirely sure how I thought I was going to prevent someone from doing that.


You really want this to be an SQL mode that the server enforces. Give me a query where anything that could be a parameter wasn't and I'll fail the query, write a nasty log, and maybe disconnect your client.

I think as an opinionated library, you could build a minimal parser, and fail queries that had values or didn't meet your parsing. I'm not super familiar with SQL these days, but I'd guess you could make a reasonable subset and have confidence that you would have false positives, but no false negatives.
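A rough sketch of that opinionated check, assuming you accept false positives: strip the placeholders, then refuse anything that still looks like an inline literal.

    // Deliberately paranoid sketch, not a real SQL parser: it will flag some
    // legitimate queries (e.g. LIMIT 1), which is the "false positives but no
    // false negatives" trade-off described above.
    function rejectInlineLiterals(sql: string): void {
      const stripped = sql.replace(/\$\d+|\?/g, ""); // remove $1 / ? placeholders
      if (/'[^']*'/.test(stripped)) {
        throw new Error("String literal in SQL text; bind it as a parameter.");
      }
      if (/[=<>]\s*\d/.test(stripped)) {
        throw new Error("Numeric literal in a comparison; bind it as a parameter.");
      }
    }

    // rejectInlineLiterals("SELECT * FROM things WHERE a > ? AND b < 5"); // throws
    // rejectInlineLiterals("SELECT * FROM things WHERE a > ? AND b < ?"); // passes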


Oracle, at least, has/had the problem that it can't use the values of parameters to affect the execution plan. So sometimes it's good to write in the values in the SQL statement.

If you write "select .. where account_id=?" then the query plan will probably be the same for any account_id, e.g. use the index, so that's fine, working as intended.

However, if you write "... where active=?" and active can only be true or false and there are a few thousand active entries but tens of millions of inactive ones, the query plan might be quite different for active=true and active=false. (Active=true would use an index, but active=false would make more sense to just sequentially read the whole table, as most rows will be used anyway.)

So if you have very few distinct values, the selection of the value influences the execution plan, and the value doesn't come from an untrusted source, it can be better to not use query parameters.

That is to say, 99% of the time you'll want to use query parameters, but rejecting queries which don't use them isn't a solution either, as you do need to avoid them sometimes, unfortunately.
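In practice that tends to look like binding everything untrusted and only ever inlining values drawn from a fixed, trusted set - a sketch of the shape rather than any particular library's API:

    // The flag comes from a closed set we control, so inlining it is safe and
    // lets the planner choose different plans for active vs. inactive rows.
    // Anything user-supplied still goes through a bind parameter.
    function accountsQuery(active: boolean, minBalance: number) {
      const flag = active ? "true" : "false"; // trusted, only two possible values
      return {
        text: `SELECT * FROM accounts WHERE active = ${flag} AND balance > $1`,
        values: [minBalance],
      };
    }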


> 99% of the time you'll want to use query parameters, but rejecting queries which don't use them isn't a solution either, as you do need to avoid them sometimes, unfortunately

This is one of the problems that I have with most SQL databases: they try to do the magic for you and make it hard to see what's going on under the hood. Databases aren't magic, they are just indexes, and there are only so many ways of organizing them. Providing a bit more transparency into how the data is stored and retrieved allows you to write efficient queries without the guesswork.


That's not entirely correct. A hard parse causes the optimizer to do bind variable peeking, i.e. it chooses the execution plan based on the bind variable value.

https://docs.oracle.com/cd/E11882_01/server.112/e41573/optim...


From the same source:

When choosing a plan, the optimizer only peeks at the bind value during the hard parse. This plan may not be optimal for all possible values.


In that case, would you not use a partial index, and kill two birds with one stone?

On that same project we got into putting hints into the queries. I’m sure if push came to shove you could do that for situations like this one and solve the problem. Less portable that way, of course.


The other way to fight things like this is to make the way you want your users to behave the most straightforward way to use the code. If your library provides shortcuts to bad behavior, people will take them.

I was just rereading your comments and started searching for SQL linters, thinking how hard that would be in practice due to all the ways queries get muddled into imperative code, when a thought came to mind:

One of the ways we reduce people’s reliance on string concatenation is by providing interpolation. And I think that at least in some languages you can substitute your own logic for interpolation. You might be able to encourage bind variables that way.

    SELECT * FROM users 
        WHERE last_name=${surname} AND age>${legalDrinkingAge(state)} 
        ORDER BY birthdate ASC
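For example, a tagged template could split the literal into SQL text with placeholders plus an array of bind values - a minimal sketch against Postgres-style $n markers:

    // The string parts become the SQL text with $1, $2, ... placeholders and the
    // interpolated expressions become bind values, so the convenient syntax is
    // also the safe one.
    function sql(strings: TemplateStringsArray, ...values: unknown[]) {
      const text = strings.reduce(
        (acc, part, i) => (i === 0 ? part : `${acc}$${i}${part}`),
        ""
      );
      return { text, values };
    }

    // sql`SELECT * FROM users WHERE last_name=${surname} AND age>${legalDrinkingAge(state)}`
    //   -> { text: "SELECT * FROM users WHERE last_name=$1 AND age>$2",
    //        values: [surname, legalDrinkingAge(state)] }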


This is mostly how Entity Framework behaves. Last I checked they had some sort of nonsense for checking whether a variable was in _any_ closure before deciding whether to bind it or substitute it, but give or take the rough edges it will bind variables provided in parameterized strings.


> Nothing we did could get us much above 50% saturation on the disk array, CPU, or networking.

What you wrote basically makes no sense unless you are doing continuous table scans (ask any DBA and see how puzzled they look.)

Production MySQL and Postgres master databases that I administer typically have uptime loads of 0.1 to 0.2 on Linux, processing 5,000 qps.

Regarding placeholders and bind parameters (those are different things), how they work depends mostly on your client library. For example, some libraries do rewrite placeholders with string pasting, so verify that. YMMV.

Source: I am a DBA.


> Production MySQL and Postgres master databases that I administer typically have uptime loads of 0.1 to 0.2 on Linux, processing 5,000 qps.

I have worked with plenty of production Postgres instances that have much, much higher loads. As in > 100 (on larger machines, obviously). It's pretty easy to have workloads where the hot data set fits mostly into RAM - and if that's the case, the read portion of that workload is usually bound by context switches and CPU processing time of the queries.

Far from every workload has enough queries to utilize a machine with plenty of cores, but there are also a lot where that's trivial.


If you are getting a lower qps out of a machine than you expect and none of the hardware subsystems are saturated then something is wrong, and it’s not table scans. A table scan would be maxing iops for disk or memory or both, right?

When was 9i released? Are you qualified to compare 15 year old hardware running a 15 year old database on a 15 year old OS to what you have going on now?

(Hell, I’m not even sure it was Intel hardware. Odds are reasonably good that was on an Ultrasparc, since that’s what our customers were running)


Oh my goodness, you're my saviour. I spent a whole day last week debugging what looked like a caching issue with Ecto + Postgres. When the application ran a query, it'd eventually slow down to 600ms, yet when I ran it in psql it was always 10ms. My solution was to convert the query to SQL and run it manually, which bypassed the plan caching Ecto does.

THANK YOU. I will absolutely be testing this tomorrow.


Looks like Ecto went too far by using named prepared statements by default? It should work just fine with unnamed prepared statements: https://github.com/elixir-ecto/postgrex/blob/master/README.m...

Postgres planner is very fast, so there will be no noticeable performance difference by having to plan the same query again and again. This is unlike, say MSSQL, where a semi-complex query may take more than 0.5s to parse/plan, thus requiring the rather lousy query plan cache.
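To see the named/unnamed distinction in a client you can poke at (node-postgres here, a different stack from Ecto/Postgrex, purely for illustration): passing a name is what opts you into a server-side prepared statement whose plan is cached per connection.

    import { Pool } from "pg";

    const pool = new Pool();

    async function postsByAuthor(authorId: number) {
      // Unnamed: the value is still bound safely via the extended protocol, but
      // the statement is parsed and planned again on every execution.
      await pool.query("SELECT * FROM posts WHERE author_id = $1", [authorId]);

      // Named: prepared once per connection and reused; Postgres may switch to a
      // generic cached plan after a few executions, which is where surprising
      // plan regressions can come from.
      return pool.query({
        name: "posts_by_author",
        text: "SELECT * FROM posts WHERE author_id = $1",
        values: [authorId],
      });
    }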


> Postgres planner is very fast, so there will be no noticeable performance difference by having to plan the same query again and again. This is unlike, say MSSQL, where a semi-complex query may take more than 0.5s to parse/plan, thus requiring the rather lousy query plan cache.

That's not true. For a simple read-only OLTP workload (Postgres' pgbench -S), the difference between prepared and non-prepared is significant:

    pgbench -M simple -P1 -n -S -T 10 -j72 -c72
    ...
    tps = 585582.911949 (excluding connections establishing)

vs

    pgbench -M prepared -P1 -n -S -T 10 -j72 -c72
    ...
    tps = 1010152.078646 (excluding connections establishing)

For more complicated queries the difference can be considerably bigger.


Thanks for the numbers. Can you run it for -M extended as well? That would be the case when using unnamed prepared statement.

Anyway, as there is no way for a query plan cache to be one-size-fits-all, as shown by both TFA and some comments in this HN discussion, that slight overhead is more than acceptable IMO, especially when compared to MSSQL.


> Thanks for the numbers. Can you run it for -M extended as well? That would be the case when using unnamed prepared statement.

Those are worse, due to the increased number of protocol messages (parse/bind/execute vs exec).

    tps = 491218.039129 (excluding connections establishing)

> that slight overhead is more than acceptable IMO, especially when compared to MSSQL.

I wouldn't call a 2x performance difference slight... That's also with a lot of context switching overhead - if you have a client using pipelined execution, the difference also gets bigger.


Good point about the "slight" wording, though I don't necessarily agree that it can get much worse than 2x with an actual workload that:

- could be more I/O bound than when just doing PK lookups with pgbench

- could be getting a wrong plan from the query plan cache if a named prepared statement is used


We experienced a similar order-of-magnitude performance improvement at Hasura by using GraphQL plan caching and Postgres prepared statements at the same time.

This thread is quite timely as we recently published a post about this here: https://hasura.io/blog/fast-graphql-execution-with-query-cac...


Making some query operation expensive (i.e. parsing the query in GraphQL, parsing a SQL statement, computing a resource with an HTTP call) in order to make it flexible, but then caching the results because even with a flexible API any (sanely designed) application only issues a finite set of actual queries, is an oft-recurring pattern in software engineering.
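The shape of the pattern is just a map keyed by whatever is expensive to produce - a toy sketch:

    // Toy version: pay the expensive parse/plan/compile cost once per distinct
    // query text, then serve repeats from the cache. Real caches add eviction
    // and invalidation, but the shape is this simple.
    type CompiledPlan = { execute: (params: unknown[]) => unknown };

    const planCache = new Map<string, CompiledPlan>();

    function getPlan(queryText: string, compile: (q: string) => CompiledPlan) {
      let plan = planCache.get(queryText);
      if (!plan) {
        plan = compile(queryText); // the expensive step
        planCache.set(queryText, plan);
      }
      return plan;
    }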

I wonder when this kind of pattern starts being taught more extensively than a stupid singleton which is nothing more than a fancy name for a global variable.


TNDR (Too Narrow - Didn't Read): I have a really wide monitor and this site's code is horrible to read as it's compressed to a very narrow strip (less than 600px) of text down the middle of the browser window. I tried ... I really did and it seems like the content might be worth reading.


I have also been struggling lately with prepared statements and obscure query planning decisions. The developers of the mssql JDBC driver seem to be determined to force the use of prepexec in their driver, which causes high execution times for typical ORM-generated queries:

https://github.com/microsoft/mssql-jdbc/issues/1196


I appreciate the use of the re-worked Pride and Prejudice quote, especially as modern social networking seems just as airheaded and deadly serious as early 19th-century polite society. We in tech are the majordomos of the stately mansion, the scullery maids and chimney sweeps, while all look up at Dukes Bezos, Gates, Bloomberg and Zuck.


Y'all are humourless philistines.


> Y'all are humourless philistines.

For what do we live, but to make sport for our neighbours, and laugh at them in our turn?

But seriously, I think the intersection of software nerds and readers of Pride and Prejudice is probably fairly small. I appreciated it too, but can understand that 99% of people here probably have no idea what you're talking about. For the curious, the quote in question is "It is a truth universally acknowledged..."


Great insight! And this situation illustrates why it's useful to understand a bit deeper than just the (not)ORM - and especially to know how to test and explore at that level.


I'm getting a feeling of dread now. I'm using an ORM at the moment (Facebook's 'ent' in Go) but I don't know what the underlying queries and performance are like.

I guess I'll just build my application and review the queries / performance cost further down the line. He said hopefully.


The point is that at some point, it does become necessary to learn the complexities beneath the interface. However, you don't have to learn it until you _have_ to learn it. Just be prepared to spend some quality time when you reach that wall.



