Hacker News
Tricking PostgreSQL into using an insane, but faster, query plan (spacelift.io)
259 points by cube2222 on Jan 18, 2022 | 132 comments



The curious part is why postgres can't estimate that the size of the active runs<>stacks join will be very small.

A core part of query planning is estimating the best join order using a dynamic programming algorithm. So roughly, A(1M) <> B(1M) <> C(10) should use the join order A-(B-C), not (A-B)-C.

I bet it's something like: Postgres doesn't know the correlation between runs_other.worker_id and runs_other.stack_id. It seems like it's seeing a low number of runs_other.worker_id, then estimating the stacks are split evenly among that small number of runs in the millions.

(Why it wouldn't know stacks.id correlates to one stack each, idk. Is stacks.id not the primary key? Is there a foreign key from runs? Very curious.)

What happens if you follow this guide to hint Postgres as to which columns are statistically related? [0]

Hinting PG's statistics estimates may be a more persistent solution. Another core part of query planning is query rewriting - if they add a rule to simplify COUNT(SELECT ...) > 0 into normal joins (which are equivalent, I think?) then your trick may stop working.

[0]: https://www.postgresql.org/docs/12/multivariate-statistics-e...
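
For reference, the extended statistics from that guide take just a couple of statements. A minimal sketch, assuming the correlated columns are runs.stack_id and runs.worker_id (the statistics object name is made up):

    CREATE STATISTICS runs_stack_worker_stats (dependencies, ndistinct)
        ON stack_id, worker_id
        FROM runs;

    ANALYZE runs;  -- populates the new statistics object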


To be fair, we're using Aurora Serverless, so I'll have to dig deeper if we can use these kinds of hints. But thanks for the tip, this could be really useful!


Given Aurora talks about itself as a replacement for pg/mysql that provides compatible skins, I'm honestly wondering whether you're dealing with the pg optimiser here in any meaningful way.

(exactly how far along the scale from 'tweaked version' to 'basically just a skin on a completely new database' aurora is is ... opaque to me ... there's probably something out there that answers that but I haven't yet managed to find it)

Edited to add: zorgmonkey downthread points out they have support for some extensions like pg_hint_plan - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide... - which suggests at least a decent amount of pg code but I'm still curious exactly where the changes start.


I obviously don't know its exact internal architecture, but based on my digging and trying out different PostgreSQL-native features, it looks like it's PostgreSQL instances as query executors with a custom (high-availability) storage layer, fronted by PgBouncer.


That sounds like a pretty reasonable educated guess to me (that's meant positively, I'm just ops enough that I consider lots of things I hope I know to be in that category).

But it -does- still leave the question of how well the storage layer reports table statistics and how that interacts with the planner.

If I sound suspicious, it's because I ran into a situation recently where a SELECT inside a transaction on mysql Aurora somehow (as far as I can tell, I was helping out at a couple removes so if you're reading this and it's relevant to you please don't believe me without testing yourself) took a whole table lock in a situation where I am pretty damn sure that for all its 'fun' actual InnoDB would've locked a page at most.

So ops goes.


PostgreSQL really needs hints. Sometimes I, as the developer, know better than the optimizer, and having to trick the optimizer into doing what I want is a ridiculous situation.

Also, having the optimizer decide to do something bad off-hours is not a good situation; again, hints or plan locking would help here.


Although it's absolute sacrilege to many, I would honestly support a really explicit syntax to compose particular tables and indices. Not even on top of a regular SQL query; I mean its own first-class language. E.g. "Take this index, restrict to this subrange; for each row, pick a row using this other index on the foreign key" (but it wouldn't have to be imperative like I've phrased it).

There are times when I already know how I want the query to execute, and I have to iteratively fiddle with the query to indirectly trick the planner into doing that. There's just no excuse for that arrangement. Even if I've made a mistake and my plan couldn't work efficiently – at least let me execute it inefficiently so I can discover and debug that.


Or, to put that another way: Postgres's query planner should be emitting an explicit bytecode that is then interpreted by a query-execution abstract machine. That same layer should also be exposed directly to clients over the network, allowing pre-planned query bytecode to be submitted directly in place of a textual query.

I've heard the rebuttal to this idea before: that even minor DB schema/configuration changes that should be "invisible" to the application (e.g. an index being created — or even refreshed — on the table) would result in the emitted bytecode being different; and so you'd never be able to ship such bytecode as a static artifact in your client.

But: who said anything about a static artifact? I'm more imagining a workflow like this:

1. user submits a query with leading command "PLAN" (e.g. "PLAN SELECT ...")

2. Postgres returns planned bytecode (as a regular single-column bytea rowset)

3. if desired, the user modifies the received bytecode arbitrarily (leaving alone anything the client doesn't understand — much like how optimization passes leave alone generated intrinsics in LLVM IR)

4. user later submits a low-level query-execution command over the binary wire protocol, supplying the (potentially modified) bytecode

5. if the bytecode's assumptions are still valid, the query executes. Otherwise, the query gives a "must re-plan" error. The client is expected to hold onto the original SQL and re-supply it in this case.

In other words, do it similarly to how Redis's EVAL + EVALSHA works; but instead of opaque handles (SHA hashes), the client receives a white-box internal representation.

Or: do it similarly to how compiling a shader on a game-console with unikernel games + unified RAM/VRAM works. The driver's shader compiler spits you back a buffer full of some GPU object code — which, running in kernel-mode, you're free to modify before running.


> Or, to put that another way: Postgres's query planner should be emitting an explicit bytecode that is then interpreted by a query-execution abstract machine. That same layer should also be exposed directly to clients over the network, allowing pre-planned query bytecode to be submitted directly in place of a textual query.

And now you've invented Firebird, or even 1980's Interbase. ;)


I suspect the extra compatibility/dependency surface would make this a non-starter in practice, but it's an absolutely fascinating thought experiment and I'm glad you wrote it up.


PostgreSQL doesn't even cache its own bytecode; it replans every time!


Depending on the placeholder values the optimal plan may be very different.

You -can- use a server side prepared statement to force it to -ahem- plan ahead but that's -usually- not actually worth it.
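
Roughly like this, as a sketch (the statement name and query are made up; plan_cache_mode is PostgreSQL 12+):

    SET plan_cache_mode = force_generic_plan;  -- plan once, reuse the generic plan

    PREPARE active_runs (bigint) AS
        SELECT count(*) FROM runs WHERE worker_id IS NULL AND stack_id = $1;

    EXECUTE active_runs(42);  -- later executions on this connection reuse the cached plan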


That's why other databases have parameter sniffing.

And using prepared statements only works on the same connection, so they're of limited use.

Luckily PostgreSQL's optimizer is very primitive, so planning doesn't take too much time; as it gets more advanced, the lack of plan reuse will become more of an issue. It's already an issue with the LLVM JIT compilation time.


You can't ship SQL queries and then do arbitrary schema changes either, right?


Not arbitrary changes, but nearly so - if you add new columns (and your existing queries name the columns they use explicitly), then nothing changes as far as the client is concerned. If you remove or rename one, only those clients who refer to it need to change, and then only if you don't do an interposing view/schema/etc.

The server is supposed to take the logical intent specified by the queries and do the work of mapping that into concrete retrievals in a stable and performant way, so a query explicitly telling the server to do something in a specific way breaks the model. The more specific, the more broken.

DDL changes for instance ideally would not have to tell the server how to structure things on disk. The server should know how to do it efficiently.

That said, real life intrudes and sometimes (but not usually) it’s important to do this for stability or performance reasons. The more magic involved, the more unpredictable a system can be, and the closer a system runs to redline, the more chaos/downtime that can cause.


There's no technical reason why one couldn't just select or join from an index as if it was a table or a view is there?

We also write queries knowing it will use a specific index, or we will create an index because of a specific query. And then we have to have a scheduled task to periodically recalculate statistics just so the DB server doesn't get silly ideas.

Of course it could be abused, but I'm in favor of having ways of letting programmers tell computers exactly what to do. Sometimes we really do know best.


Yes, I would too. Over time I have come to appreciate at least being able to control the order of operations; just forcing join order is often enough of a hint to keep things working properly.

Having used LINQ a lot, I would actually prefer more of that kind of chained-statement approach that is more hands-on, without having to explicitly loop.


Something like SQL Server's "USE PLAN" - it really comes in handy when you need it (https://forrestmcdaniel.com/2018/07/19/how-to-use-the-use-pl...)


Clickhouse doesn't go that way all the way, but its approach to SQL support is very similar: it does exactly what you have written. It doesn't rewrite your query in any way: if you want to filter table A before you join it with table B, you write a subquery.


Not that I disagree with your point, but the opposite of

> Also, having the optimizer decide to do something bad off-hours is not a good situation; again, hints or plan locking would help here.

Can also happen when you force a particular query plan, only for it to turn to treacle when some assumption you made about the data suddenly breaks.

As with most things there's no free lunch.


In practice what happens in this case is that the query slows down over time, often days or weeks after deploying the thing that makes the query worse. It takes time to fill up the data that slows things down. This gives you plenty of time to monitor and take action. If it does happen more quickly than that (e.g. if a table starts getting filled for the first time, statistics can be off) then rolling back is easy.

But if the query planner decides to screw up your query plan it happens instantaneously and there's no possibility of rollback - only emergency deploying code fixes to try to tweak the query. In most OLTP use cases, you almost always know exactly how a query is supposed to execute, so always using index hints is totally reasonable to prevent bad query plans.

Basically, you breaking your own query with bad hints usually breaks things a lot less and at better times than the query planner doing it, and is usually easier to fix too.


Slow degradation is a common case, but a day with bad skew due to spam or other unusual activity can cause a previously optimal query plan to never complete.


When doing disaster planning, at least something blowing up because of a data change is preventable or mitigable. Staging environments, rollbacks, etc.

The system randomly deciding to drive itself off a cliff for no reason, with no known way to stop it next time is quite concerning.


And if you do find your query has turned to treacle, looking for whether you used a wrong query hint is already in your checklist, because it's right there in the query that you're already debugging. You can't forget to double-check it.


It's basically shameful how the maintainers refuse to even entertain the thought of allowing query hints. These kinds of articles and anecdotes come up again and again, yet seemingly no movement whatsoever on the blanket ban on query hints in Postgres. (See https://wiki.postgresql.org/wiki/OptimizerHintsDiscussion)

Postgres generally is great -- so many great features and optimizations and more added all the time. But its query optimizer still messes up, often. It's absolutely not the engineering marvel some would have you believe.


> These kinds of articles and anecdotes come up again and again

Don't articles and anecdotes also come up again and again of developers feeding bad hints to the database and cratering their performance?

And that wiki page... Very explicitly isn't a blanket ban? They straight up say that they're willing to consider the idea if somebody wants to say how to do it without the pitfalls of other systems. The only thing they say they're going to ignore outright is people thinking they should have a feature because other databases have it (which seems fair).


> people thinking they should have a feature because other databases have it (which seems fair)

Literally NO ONE wants query hinting in Postgres to check some kind of feature box because other databases have it.

We know we want it because…other databases have it, and it's INCREDIBLY USEFUL.

> They straight up say that they're willing to consider the idea if somebody wants to say how to do it without the pitfalls of other systems.

Pitfalls my ass. We want exactly the functionality that is already present in other systems, pitfalls and all. That's just an excuse to do nothing, Apple-style, "because we know better than our own users" while trying to appear reasonable.

It's akin to not adopting SQL until you can do so "while avoiding the pitfalls of SQL." Just utter bullshit.


> We want exactly the functionality that is already present in other systems, pitfalls and all. That's just an excuse to do nothing

Are you offering to help with the maintenance or development associated with it? Or are you just demanding features while calling the people that do help with that stuff liars?

Maybe they do know better? Or maybe they know about other hassles that will come with it that you can't fathom. They've been developing and maintaining one of the best open source projects on the planet for nearly 3 decades. Maybe with all that experience, it's not their opinion that is BS?


Given there's already a pg_hint_plan extension I really don't see why people are so angry about the developers not wanting to bless something they consider a footgun as a core feature.


"risk", because pg_hint is an external module, written by somebody as a hobby, therefore if it stops existing or working after some updates or with newer versions of Postgres then it becomes a huge problem for PROD systems that need it for any reason.


My point is that "core developers should add an unproven feature they think is a footgun and take on the support overhead of that without convincing evidence it would actually be a net win" is not a convincing argument.

Also given Amazon AWS support it for Aurora, I feel like it's not -that- hobby-ish.

If people use it and can provide evidence that it helps more than it hurts, that might be convincing.

Insulting the core developers for not yet being convinced seems rather less likely to help.


> Also given Amazon AWS support it for Aurora, I feel like it's not -that- hobby-ish.

Oh, didn't know that :)

About the rest: sure, I can agree as well about the indirect adverse effect of having them available (e.g. easy to misuse them as I saw in some apps using Oracle DBs), and it's for sure wrong to insult somebody because of this.

Still, personally, I think that the pros would outweigh the cons of having that embedded in the app.

On one hand I remember some nights spent in the past trying to make some SQL work, hints were always at least a good temporary workaround.

On the other hand, there will always be some SQL which confuses the optimizer (or more special cases like a lot of data changing the distribution of values, etc.) and hints would be the only way to cover these cases.

Maybe an interesting question is on which level should hints act? I know mainly only Oracle & MariaDB, therefore I know hints of the type "use that index"/"query tables in this order"/"join these tables with this join type"/etc..., which are probably low-level hints. Maybe already just higher-level hints of the type "most important selectivity criteria comes from inline-view X"/"I want just the first row of the result"/"take into account only plans which select data by using indexes"/etc... would be as well interesting, not sure, just dumping here my thoughts.


It is written by a telco company.


Sure, maybe in not doing what every other database does, that really helps users out when they're suffering unpredictable performance, they're demonstrating their far superior knowledge. And maybe all the users that are suffering are somehow incompetent.

Or maybe it's just typical developer hubris of the kind that hits everyone at some point.


We're software engineers not marketing folk who just want to check a box.

There's a difference between "have a feature just because other databases have it" and "this is a very useful feature that would have its own PostgreSQL-isms and also we got this idea because other databases have it". The query planner isn't infallible, so being able to hint queries to not accidentally use a temporary table that just can't fit in ram isn't just copying a feature "just because everyone else has it".


> they're willing to consider the idea if somebody wants to say how to do it without the pitfalls of other systems

That is basically a blanket ban. Saying you won't implement a widely-implemented feature unless someone comes up with a whole new theory about how to do it better is saying you simply won't implement it. Other databases do well enough with hints, and they do help with some problems.


Definitely my biggest issue with Postgres.

I have one shameful query where, unable to convince it to execute a subquery that is essentially a constant for most of the rows outside of the hot loop of the main index scan, I pulled it out into a temp table and had the main query select from the temp table instead. Even creating a temp table and indexing it was faster than the plan the Postgres query planner absolutely insisted on. Things like CTEs etc made no difference, it would still come up with the same dumb plan every way I expressed the query.

The worst thing is not even being able to debug or understand what is going on because you can't influence the query plan to try alternative hypotheses out easily.


Personally, I think they would entertain it, but the implementation has to be fairly good and they have to be up for maintaining whatever that implementation is. There was a lot of hemming and hawing about hot standby & replication for years, but once someone showed up with a credible implementation and track record of maintaining and fixing it, the objections seemed to get a lot quieter.

Hints are a somewhat invasive feature that are hard to tweak once they are integrated into applications. I don't think the half dozen people that are in the best position to consider its evolution have found it the use of their time they wish to expend.


Shameful? A bit much for what is essentially a gift to the community? They are building a free product and are entitled to their opinion on how to go about building it, and decisions on taking up scope is very much an integral part of it. In engineering, every decision is a tradeoff that might benefit one set of users while impacting another. Free software is all about choice so that different users can find that thing that is most suitable to them.


I've never used it, but there is a Postgres extension called pg_hint_plan [0] for query hints, and I am guessing it is pretty decent because it is even mentioned in the AWS Aurora documentation.

[0]: https://pghintplan.osdn.jp/pg_hint_plan.html


I can understand that people want to get absolute control over their queries sometimes. I never used hints, so I shouldn't even write this note, but IMHO that page has a quite balanced view of the pros and cons of existing hint systems.


It does. But when you're watching your database, and therefore your business, crash and burn because it decided to change the query plan for an important query, all these "problems with existing Hint systems" sound very irrelevant.

I'd love to have a way to lock a plan in a temporary "emergency measure" fashion. But of course it's hard/impossible to design this without letting people abuse it and pretend it's just a hint system.


>I'd love to have a way to lock a plan in a temporary "emergency measure" fashion.

Could this be achieved by extending prepared statements? [1] The dirty option would be to introduce a new keyword like PREPAREFIXED.

The first time such a statement is executed, the execution plan could be stored and then retrieved on subsequent queries. There would be no hints and the changes in code should be minimal.

Once a query runs successfully, the users can be sure that the execution plan won't change.

>But of course it's hard/impossible to design this without letting people abuse it

Is this more important than having predictable execution times?

[1] https://www.postgresql.org/docs/current/sql-prepare.html


> Is this more important than having predictable execution times?

To me: no. To the Postgres team: apparently :)


Exactly... DB deciding to switch query plans in production is sometimes terrible.


> that page has a quite balanced view of pros and cons of existing hint systems

I found the cons section somewhat disappointing given how much respect I have for Postgresql maintainers in general.

Most of the "problems" are essentially manifestations of dysfunction within users or PostgreSQL development itself. People won't report optimizer bugs if they can fix them themselves, etc. (far more likely they won't report the bugs if they have no way to prove their alternative query plan actually performs better, eg: by adding a hint).

Many are just assertions which I highly doubt would hold up if validated ("most of the time the optimizer is actually right", "hints in queries require massive refactoring" ...).


> (far more likely they won't report the bugs if they have no way to prove their alternative query plan actually performs better, eg: by adding a hint).

They don't need to run an alternative plan, they just need to report that the main plan did far more work than needed.

In the real world, when users have a problem in their mission-critical system, they ask for a solution, not just give up on the mission.

> Many are just assertions which I highly doubt would hold up if validated ("most of the time the optimizer is actually right", "hints in queries require massive refactoring" ...).

Their 20 years of experience beats your idle speculation.


> they just need to report that the main plan did far more work than needed.

And how do they know that the chosen plan is not already optimal? How many reports do you think you'd get if every time someone encountered a "slow" query they reported it? And what fraction of those would be finally found to be caused by a bad optimizer as opposed to the user's fault like a missing index, outdated statistics, ... 1%? 0.1%? See stack overflow for a sample of that ratio. Be careful what you wish for.

If hints existed performance reports could be "default plan of x is slow, I know because when I use hints w,z then it's fast".

What should they do in the meantime while they're waiting for the pg project to acknowledge the problem, someone to propose a solution, and someone to implement a fix, just sit on their hands and twiddle their thumbs while their database server spews smoke? Hints are an escape hatch that empowers users to take over a when the optimizer veers off course. Not having any way to control the plan is like a tesla autopilot that doesn't allow the driver to take control over the car in an emergency.

Also, distrusting users to report performance issues is a weird attitude and gives a bad taste. Maybe it's true, or maybe you need to make it easier.


Long-time developers of Postgres have spent a long time developing Postgres -- not necessarily running a deployment of it nor writing and running queries on growing datasets.

It's almost self-evident that this is the case, because if they had, they'd have implemented some more fine-grained way of overriding the query optimizer when it inevitably fails at times.


> Their 20 years of experience beats your idle speculation.

Perhaps.

But it comes across to me on that page as arrogance and dismissiveness of user needs. Hence why I find it disappointing, even if it is mainly in how it is expressed.


I recall using https://pghintplan.osdn.jp/pg_hint_plan.html for this feature, it did what it claimed to be able to do. It's a hack, but it really helps in the worst case scenarios.


There's now a github repo for pg_hint_plan: https://github.com/ossc-db/pg_hint_plan


I really appreciate that in SQL Server we can just use query hints. I don't think I will ever see eye-to-eye with the PostgreSQL team on this. No query planner is perfect. SQL Server's planner certainly isn't. We have had the same sort of "query plan suddenly changes in production and everything breaks" incidents, but they are easily fixed with query hints and/or forced plans, and you can be sure that future statistics updates won't break them again.
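
For illustration, a hedged sketch of both options in T-SQL (the table names are borrowed from the article; the query_id/plan_id values are placeholders):

    -- per-statement hint: force the join strategy and join order
    SELECT r.id
    FROM runs AS r
    JOIN stacks AS s ON s.id = r.stack_id
    OPTION (HASH JOIN, FORCE ORDER);

    -- or pin a known-good plan via the Query Store
    EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;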


You can use pg_hint_plan for hinting query plans to postgres: https://github.com/ossc-db/pg_hint_plan
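
With pg_hint_plan the hints go in a special comment at the head of the statement, e.g. (a sketch reusing the article's table names; hint names per the extension's docs):

    /*+ Leading(stacks runs) IndexScan(runs) */
    SELECT count(*)
    FROM runs
    JOIN stacks ON stacks.id = runs.stack_id
    WHERE runs.worker_id IS NOT NULL;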


I still have nightmares of SQL Server's optimizer taking down production systems at random times due to picking a bad plan; judicious use of hints has made my life much easier and allowed me as a developer to be in control and not have to ask the DB nicely and pray it listens.


I worked for a payments firm that submitted transfers in bulk every day, but had a monthly surge day, and a late cut-off period to submit or cancel payments. It doesn't matter if you can do everything else in 15 minutes, if the query planner decides to take a 30-second query and turn it into a 4-hour nightmare because different customers are making different numbers of payments today, and it trips over some Postgres heuristic.

The Postgres query planner was, quite frankly, the enemy. By the time I left we were at the point that we considered it a business risk and were looking at alternatives. If you need queries to run in a predictable amount of time — forget fast, simply predictable — then Postgres is quite simply not fit for purpose.


These query issues are most of the time not a SQL Server problem; they are rather a symptom of poor data model architecture and table design.

A lot of the time I see hot data in an operational DB getting mixed with warm and cold data, and your hot-path query with a ton of JOINs/subqueries will get rekt.

A proper redesign and rearchitecture helps provide a large buffer against these problems.

Also, not everything should be inside SQL Server; if you run a large query every 5 seconds or something, probably consider using an in-memory cache or a denormalized model.


Excuse me. You are replying with advice that is fundamentally inappropriate for a batch workload (describing a "hot path query" and suggesting in-memory caches), and it shows that you have not read and comprehended the post I wrote.

Besides which, if you need to re-architect and denormalize by moving records in and out of different tables just so you can run what is essentially a monthly report in a predictable (not fast! just predictable) amount of time -- well then, why bother using Postgres in the first place? You might as well bite the bullet and go with something NoSQL at that point, because at least when you read from a given Cassandra partition you know all the results will be right next to each other, 100%. Why leave it to chance? You've been burned by Postgres before.


As a developer I never had to blame SQL Server; it was always a developer mistake: poor architecture, a poor query, some data quality issues. Don't blame the platform for developers' mistakes, that's what I was trying to say.


The wise man builds his house upon the rock. The foolish man builds his house upon the sand. I will 100% blame the platform when what it provides is deficient, for selecting a deficient platform is one of the gravest mistakes a developer can make.

And Postgres, again, is materially deficient in its query performance predictability, and its developers are insistent that they have no desire to allow the approaches that would mitigate it. If this matters to your application, then the prudent developer will drop Postgres and use a real database.


You must not have been working with it very long. Off the top of my head: having to disable the "new" optimizer in SQL Server 2014 and go back to the 2012 version due to bad plans https://sqlservergeeks.com/sql-server-2014-trace-flags-9481/

I have been working with it since way back in 6.5, and it has made many obviously wrong decisions in the optimizer over the years. It has gotten better but is by no means perfect; the good thing is it has hints, unlike PG!


So what's the issue with a marginal difference in cardinality estimators? Both queries seem perfectly fine speed-wise.

Just splitting hairs. Again, most of the time developers fail to notice significant data distribution drift and don't rearchitect their data models, then blame the engine for their own omissions and invalid assumptions.



Blaming the table architecture doesn't get rid of the fact that the PostgreSQL planner can make a decision that will cause a query to take an unknown amount of time.


I guess that is why cloud databases like spanner charge big bucks.


even plan locking itself would be a welcome feature. we've had multiple instances of plans going haywire (never traced it down, but 99% sure it was after autovacuum analyze) impacting prod and getting back to normal after a manual re-ANALYZE.


I haven't had much use for plan locking even after MS SQL Server got it, even though it's a nice feature, with being able to save/load plans from a file etc.

I just like a good set of hints since they are part of the source code, and I can put comments in and see why they are there in source history later.


if nothing else I would love to have that just so we don't end up debugging heisenbugs that occur and then disappear when you try to find them (or we can't rule it out when we test the same query and it magically starts performing well).


Since the PostgreSQL folks are clearly against hints, maybe another random feature suggestion: If a (read-only) query is taking "too long", maybe you could launch a second copy with a different plan. Maybe even a third after twice as long and a fourth after twice as long again. Kill them all once any one returns. Trades some extra load now against waiting forever for a 'bad' query plan.

The idea obviously needs work and has probably already been suggested and dismissed, somewhere, but I thought I might throw it out there, especially with modern computers having so many cores.


On one hand this could theoretically work if the optimizer tracks historical SQLs + the query plans used at that time + their runtime and decides to try to use a different query plan, then aborts the SQL being executed when its runtime exceeds the historical runtime. But in practice it's likely to mess things up because data volume/distribution/machine load (CPU, disks, etc) might differ compared to the past etc... .

On the other hand, in general there are often too many potential combinations of query plans to try out (hundreds even for a relatively simple SQL) and trying them all out would need hours/days/etc. The "good plan" might be something that a machine might categorize as "very unlikely to work" so it might end up being the one tested automatically at the very end.

Normal hints would still be a lot easier to handle in the code and for the user.


I'm not a DB expert, but aren't most queries bottlenecked by disk bandwidth? If so, additional queries might hurt more than help.


Modern databases on modern server hardware are not limited by disk bandwidth typically, though Postgres is not a modern design and struggles to use the disk bandwidth available. That said, since these queries are accessing roughly the same data, concurrent queries will mostly be hitting cache. Memory bandwidth is often a larger limitation for databases these days, so it would still be wasteful.


Having been in a situation where Postgres suddenly decides to use a much slower query plan on a hot query in production, I'd agree.


Regarding this:

> Only a minuscule part of the Runs in the database are active at any given time. Most Stacks stay dormant most of the time. Whenever a user makes a commit, only the affected Stacks will execute Runs. There will obviously be Stacks that are constantly in use, but most won’t. Moreover, intensive users will usually use private Worker Pools. The public shared Worker Pool is predominantly utilized by smaller users, so that’s another reason for not seeing many Runs here.

> Right now, we’re iterating over all existing Stacks attached to the public worker pool and getting relevant Runs for each.

Is there an opportunity to make this query more robust with a bit of denormalization, if necessary, to indicate which runs and which stacks are active and a filtered index (index with a WHERE clause) on each to allow efficiently retrieving the list of active entries?

Best practices for database optimization tell you "don't index on low cardinality columns" (columns with few unique values), but the real advice should be "ensure each valid key of an index has a low cardinality relative to the size of the data" (it's OK to have an index that selects 1% of your data!).

That is, conventional wisdom says that indexing on a boolean is bad, because in the best case scenario 50% of your entries are behind each key, and in the worst case almost all entries are behind either "true" or "false". There are performance impacts from that and there's the risk the query optimizer tries to use the "bad" side of the index, which would be worse than a full table scan.

But with the improved advice and insight that 0.01% of rows have value "true", you can create an index with a `WHERE active = true` clause, et voila: you have an index that finds you exactly what you want, without the cost and performance issues of maintaining the "false" side of the index.
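
Concretely, a minimal sketch (assuming a hypothetical boolean column named active on runs):

    -- indexes only the tiny fraction of rows where active = true;
    -- the "false" side is never stored or maintained
    CREATE INDEX runs_active_idx ON runs (stack_id) WHERE active = true;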


Yep, you could indeed use a partial index with the two relevant conditions to solve this query well. We're using those all over the place as well.


It’s impressive how many developer misconceptions I see cropping up in these articles. The point of the query planner is to replace an army of programmers, not evade design faults. Using a server RDBMS it’s trivially easy to decouple schema and query evolution from the application. Yet somehow developers always seem to back themselves into situations requiring extreme coupling. Using plan guides in the MSSQL environment helps somewhat in patchwork solutions, though typically the deficiencies are related to obsolete statistics (and that’s not even necessary). In Postgres you should ensure design separation so you can rewrite the query easily.


> Using a server RDBMS it’s trivially easy to decouple schema and query evolution from application.

Curious, how do you decouple query evolution from the DB layer of your application?


> Curious, how do you decouple query evolution from the DB layer of your application?

Use application-specific views, and the evolution driven by DB optimization, etc., can happen in view definitions without the app DB layer being involved. It’s a traditional best practice anyway, going back decades to the time when it was expected an RDBMS would serve multiple apps that you needed to keep isolated from each others changes, but it also works to isolate concerns at different levels for a single-app stack.

(Because of practical limits of views in some DBs, especially in the 1990s, and developer preferences for procedural code, a common enterprise alternative substitutes stored procs for views, to similar effect.)
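
A minimal sketch of what that looks like (the view name and columns here are hypothetical):

    -- the app's DB layer only ever queries app_active_runs; the underlying
    -- tables and the view definition can be reworked without touching the app
    CREATE VIEW app_active_runs AS
    SELECT r.id, r.created_at, s.account_id
    FROM runs r
    JOIN stacks s ON s.id = r.stack_id
    WHERE r.worker_id IS NULL;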


hah, exactly. “application-specific views...Going back decades”, or “substitute Stored Procs for views”.

It’s not really much extra overhead to use a schema+views. I ask developers to avoid creating views with “Select *”, always specify the columns desired.

Further decoupling is possibly beneficial, e.g. CQRS is wonderful perspective for designing data models that may be more easily distributed or cacheable (by explicitly choosing to separate query schema from modification schema).


Oh yes, views. They are not effort-free and become numerous and specific. But they are extremely missed with a document store I'm working with right now ;)


This article is everything that's wrong with SQL.

How about just providing us access to the query plan and allowing users to directly modify it? It's like needing assembly language to optimize a really tight section of code but having that access blocked off.

Why do we have to rely on leakage to a higher level interface in order to manipulate the lower level details? Just give us manual access when we need it.


I think SQL is generally great but I do agree with you. Access to lower level primitives from which one could construct their own query plan would be awesome. Query hints (not available in Postgres) would also probably get you a lot of the way there.


Hey, author here.

The article describes a fun debugging and optimization session I’ve had recently. Happy to answer any questions!


Hey, I think you'll probably get better performance if you swap your subquery around a little from:

  SELECT COUNT(*)
  FROM runs runs_other
  WHERE (SELECT COUNT(*)
        FROM stacks
        WHERE stacks.id = runs_other.stack_id
          AND stacks.account_id = accounts.id
          AND stacks.worker_pool_id = worker_pools.id) > 0
   AND runs_other.worker_id IS NOT NULL
To:

  SELECT COUNT(*)
  FROM runs runs_other
  WHERE EXISTS (SELECT *
        FROM stacks
        WHERE stacks.id = runs_other.stack_id
          AND stacks.account_id = accounts.id
          AND stacks.worker_pool_id = worker_pools.id)
   AND runs_other.worker_id IS NOT NULL
Your previous option needs to hit every row in the stacks subquery to generate the count. The modified one uses EXISTS so it only needs to check there's at least one row, and then can short-circuit and exit that subquery. (a smart enough query planner would be able to do the same thing, but PG generally doesn't).


Thanks for the tip! I didn't know EXISTS can provide such benefits.

However, I've tried it out and it resulted in a 10x performance loss. Looking at the query plan, it actually got PostgreSQL to be too smart and undo most of the optimizations I've done in this blog post... It's still an order of magnitude faster than the original query though!


I stand corrected, and always glad to see the main point about perf optimisation proven. (always measure! :) )


Interesting write up! Did you happen to try this with a CTE instead of a subquery? ie something like:

  WITH
    tmp AS (SELECT accounts_other.id, COUNT(*) n
              FROM accounts accounts_other
              JOIN stacks stacks_other ON accounts_other.id = stacks_other.account_id
              JOIN runs runs_other ON stacks_other.id = runs_other.stack_id
             WHERE (stacks_other.worker_pool_id IS NULL OR
                    stacks_other.worker_pool_id = worker_pools.id)
               AND runs_other.worker_id IS NOT NULL
             GROUP BY 1
            )
  SELECT COUNT(*)                                                                  as "count",
         COALESCE(MAX(EXTRACT(EPOCH FROM age(now(), runs.created_at)))::bigint, 0) AS "max_age"
  FROM runs
    JOIN stacks ON runs.stack_id = stacks.id
    JOIN worker_pools ON worker_pools.id = stacks.worker_pool_id
    JOIN accounts ON stacks.account_id = accounts.id
    LEFT JOIN tmp ON tmp.id = accounts.id
  WHERE worker_pools.is_public = true
    AND runs.type IN (1, 4)
    AND runs.state = 1
    AND runs.worker_id IS NULL
    AND accounts.max_public_parallelism / 2 > COALESCE(tmp.n, 0)

I don't know this for sure, but in my experience it has seemed like Postgres is bad at optimizing subqueries, and will execute once per row.


Haven't run it, but based on the query plan in the first picture, PostgreSQL already does that part kind of right. It precalculates it for all accounts and then does a three-way hash join.


For Postgres 12+ you'll have to use the following to force the plan:

> WITH tmp AS MATERIALIZED


Aren't CTEs still optimization barriers in Postgres, so wouldn't that make any subquery optimization problem worse?


The fun thing is, they have corrected a bunch of that in newer versions of postgresql.

I'm surprised that the problem stemmed from an index where the 'AND NOT NULL' took forever to scan, and they didn't include a partial index, or an index in a different order since they are sorted.



Depends if materialization helps.

As a silly example, if the CTE materializes 1/100th of a table via table scan, scanning 1/100th of a table N times may be faster than scanning the entire table N times.


That could even be a feature if used to compute just the list of active jobs. Later PG versions might need to be explicit to reinsert the optimization barrier, as the implicit one for all CTEs was removed.


Mostly no, since PostgreSQL 12


Have you tried unnesting the subquery? From the images posted in your blog it's not clear to me how often the subquery is run. Maybe unnesting it somehow like this should work well:

    SELECT COUNT(*)        as "count",
         COALESCE(MAX(EXTRACT(EPOCH FROM age(now(), runs.created_at)))::bigint, 0) AS "max_age"
    FROM runs
      JOIN stacks ON runs.stack_id = stacks.id
      JOIN worker_pools ON worker_pools.id = stacks.worker_pool_id
      JOIN accounts ON stacks.account_id = accounts.id
      /*unnesting*/
      join (SELECT accounts_other.id,  COUNT(*) as cnt
       FROM accounts accounts_other
         JOIN stacks stacks_other ON accounts_other.id = stacks_other.account_id
         JOIN runs runs_other ON stacks_other.id = runs_other.stack_id
       WHERE (stacks_other.worker_pool_id IS NULL OR
          stacks_other.worker_pool_id = worker_pools.id)
         AND runs_other.worker_id IS NOT NULL
        group by accounts_other.id) as acc_cnt 
      on acc_cnt.id = accounts.id and  accounts.max_public_parallelism / 2 > acc_cnt.cnt
     /*end*/
    WHERE worker_pools.is_public = true
      AND runs.type IN (1, 4)
      AND runs.state = 1
      AND runs.worker_id IS NULL
      /* maybe also copy these filter predicates inside the unnested query above */
Edit: format and typo


Hi Jacob, I would like to try rewriting your query using different JOIN FOREIGN [1] syntax alternatives, to study how readability is affected.

It would help if you could share the table definitions and foreign keys for the tables involved in the query. I could probably deduce most of it from the query thanks to proper naming of tables/columns, but just want to be sure I get it right. Many thanks!

    [1]: https://gist.github.com/joelonsql/15b50b65ec343dce94db6249cfea8aaa


what are the downsides?


The downside is that a PostgreSQL version update can change internal optimizer rules, which may result in the plan being changed to a less performant one again.

In that case it's good to have monitoring in place to catch something like that. We're using Datadog Database monitoring which gives you a good breakdown of what the most pricey queries are - it uses information from the pg_stat_statements table, which contains a lot of useful stats about query execution performance.

You should regularly check such statistics to see if there aren't any queries which are unexpectedly consuming a big chunk of your database resources, as that might mean you're missing an index, or a non-optimal query plan has been chosen somewhere.
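
For example, something along these lines surfaces the most expensive statements (column names as of PostgreSQL 13+; older versions use total_time/mean_time):

    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;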


> The downside is that a PostgreSQL version update can change internal optimizer rules

This has actually happened relatively recently.

PostgreSQL Common Table Expressions (WITH x as (some query)) used to be optimization fences by default, meaning that you can influence the optimizer using a CTE. This is a well known technique among PostgreSQL users that resort to it for lack of optimizer hints.

In Pg12 they enhanced the optimizer to remove the optimization fence by default, so the query plans of existing queries automatically changed and sometimes became much slower as a result. If you want the old CTE behavior you have to modify the WITH clause via AS MATERIALIZED.
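
For example, this restores the pre-12 fence for a single query (a sketch using the article's runs table):

    WITH active AS MATERIALIZED (
        SELECT stack_id FROM runs WHERE worker_id IS NOT NULL
    )
    SELECT count(*) FROM active;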


Deduced schema based on the naming of tables/columns:

    CREATE TABLE accounts (
        id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
        max_public_parallelism integer NOT NULL,
        PRIMARY KEY (id)
    );
    
    CREATE TABLE worker_pools (
        id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
        is_public boolean NOT NULL,
        PRIMARY KEY (id)
    );
    
    CREATE TABLE stacks (
        id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
        worker_pool_id bigint NOT NULL,
        account_id bigint NOT NULL,
        PRIMARY KEY (id),
        FOREIGN KEY (worker_pool_id) REFERENCES worker_pools,
        FOREIGN KEY (account_id) REFERENCES accounts
    );
    
    CREATE TABLE runs (
        id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
        stack_id bigint,
        worker_id bigint,
        created_at timestamptz NOT NULL DEFAULT now(),
        type integer NOT NULL,
        state integer NOT NULL,
        PRIMARY KEY (id),
        FOREIGN KEY (stack_id) REFERENCES stacks
    );


cube2222, does the schema above look more or less correct to you?

Below are the query plans I got when running the query against empty tables in PostgreSQL 15dev (not Aurora Serverless), for comparison.

It would be nice to populate the tables and add indexes, to make a real comparison. It would help a lot to have some rough estimates of the number of rows in each table and the count(distinct ...) of each column, etc.

Unoptimized version:

                                                                        QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=82.14..82.16 rows=1 width=16) (actual time=0.009..0.012 rows=1 loops=1)
       ->  Hash Join  (cost=46.05..82.13 rows=1 width=8) (actual time=0.005..0.007 rows=0 loops=1)
             Hash Cond: (stacks.worker_pool_id = worker_pools.id)
             Join Filter: ((accounts.max_public_parallelism / 2) > (SubPlan 1))
             ->  Nested Loop  (cost=0.30..36.38 rows=1 width=28) (actual time=0.005..0.005 rows=0 loops=1)
                   ->  Nested Loop  (cost=0.15..36.18 rows=1 width=24) (actual time=0.004..0.005 rows=0 loops=1)
                         ->  Seq Scan on runs  (cost=0.00..28.00 rows=1 width=16) (actual time=0.004..0.004 rows=0 loops=1)
                               Filter: ((worker_id IS NULL) AND (type = ANY ('{1,4}'::integer[])) AND (state = 1))
                         ->  Index Scan using stacks_pkey on stacks  (cost=0.15..8.17 rows=1 width=24) (never executed)
                               Index Cond: (id = runs.stack_id)
                   ->  Index Scan using accounts_pkey on accounts  (cost=0.15..0.20 rows=1 width=12) (never executed)
                         Index Cond: (id = stacks.account_id)
             ->  Hash  (cost=32.00..32.00 rows=1100 width=8) (never executed)
                   ->  Seq Scan on worker_pools  (cost=0.00..32.00 rows=1100 width=8) (never executed)
                         Filter: is_public
             SubPlan 1
               ->  Aggregate  (cost=66.89..66.90 rows=1 width=8) (never executed)
                     ->  Nested Loop  (cost=33.72..66.89 rows=1 width=0) (never executed)
                           ->  Hash Join  (cost=33.56..58.71 rows=1 width=8) (never executed)
                                 Hash Cond: (runs_other.stack_id = stacks_other.id)
                                 ->  Seq Scan on runs runs_other  (cost=0.00..22.00 rows=1194 width=8) (never executed)
                                       Filter: (worker_id IS NOT NULL)
                                 ->  Hash  (cost=33.55..33.55 rows=1 width=16) (never executed)
                                       ->  Seq Scan on stacks stacks_other  (cost=0.00..33.55 rows=1 width=16) (never executed)
                                             Filter: (((worker_pool_id IS NULL) OR (worker_pool_id = worker_pools.id)) AND (account_id = accounts.id))
                           ->  Index Only Scan using accounts_pkey on accounts accounts_other  (cost=0.15..8.17 rows=1 width=8) (never executed)
                                 Index Cond: (id = accounts.id)
                                 Heap Fetches: 0
Optimized version:

                                                                  QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=82.14..82.16 rows=1 width=16) (actual time=0.010..0.012 rows=1 loops=1)
       ->  Hash Join  (cost=46.05..82.13 rows=1 width=8) (actual time=0.006..0.007 rows=0 loops=1)
             Hash Cond: (stacks.worker_pool_id = worker_pools.id)
             Join Filter: ((accounts.max_public_parallelism / 2) > (SubPlan 2))
             ->  Nested Loop  (cost=0.30..36.38 rows=1 width=28) (actual time=0.005..0.006 rows=0 loops=1)
                   ->  Nested Loop  (cost=0.15..36.18 rows=1 width=24) (actual time=0.005..0.005 rows=0 loops=1)
                         ->  Seq Scan on runs  (cost=0.00..28.00 rows=1 width=16) (actual time=0.004..0.005 rows=0 loops=1)
                               Filter: ((worker_id IS NULL) AND (type = ANY ('{1,4}'::integer[])) AND (state = 1))
                         ->  Index Scan using stacks_pkey on stacks  (cost=0.15..8.17 rows=1 width=24) (never executed)
                               Index Cond: (id = runs.stack_id)
                   ->  Index Scan using accounts_pkey on accounts  (cost=0.15..0.20 rows=1 width=12) (never executed)
                         Index Cond: (id = stacks.account_id)
             ->  Hash  (cost=32.00..32.00 rows=1100 width=8) (never executed)
                   ->  Seq Scan on worker_pools  (cost=0.00..32.00 rows=1100 width=8) (never executed)
                         Filter: is_public
             SubPlan 2
               ->  Aggregate  (cost=9851.00..9851.01 rows=1 width=8) (never executed)
                     ->  Seq Scan on runs runs_other  (cost=0.00..9850.00 rows=398 width=0) (never executed)
                           Filter: ((worker_id IS NOT NULL) AND ((SubPlan 1) > 0))
                           SubPlan 1
                             ->  Aggregate  (cost=8.18..8.19 rows=1 width=8) (never executed)
                                   ->  Index Scan using stacks_pkey on stacks stacks_1  (cost=0.15..8.18 rows=1 width=0) (never executed)
                                         Index Cond: (id = runs_other.stack_id)
                                         Filter: ((account_id = accounts.id) AND (worker_pool_id = worker_pools.id))


Are you happy with Aurora Serverless so far? Is it really the "painless scaling" it's made out to be? I'm really hesitant to use anything other than self-hosted Postgres with full control, because I'm always afraid of running into situations like these and not being able do what I need to do, like have an offline analytical read replica. Also I imagine the cost is an order of magnitude more than a self-hosted instance.


Not really. But it's mostly because the support for other features is basically non-existent. It also seems like all development is happening on Aurora Serverless v2 (which I think is in preview state?). The scaling also isn't seamless; it breaks your connections if you're using prepared statements - we have it pinned to a specific number of capacity units.

We'll probably move to plain Aurora at some point, as that supports all bells and whistles, but for now Aurora Serverless is good enough.


In this case denormalizing the data right into the runs table would likely give you better, but more importantly more predictable performance.


Similar article that instead solved it by recording more statistics: https://build.affinity.co/how-we-used-postgres-extended-stat...


From the article

> Aurora Serverless also doesn’t support read replicas to offload analytical queries to them, so no luck there either.

Amazon Aurora Serverless v2 does support read replicas. Although it’s still in preview.


aurora serverless v2 doesn't support postgres at all...


Heh, I had a similar style problem; I think it was more 1000x than 200x, but it's because there were certain query structures the optimizer couldn't figure out. I had to try like 4 different rewrites of the SQL till the optimizer could understand it.

I guess at this point I just see it as just part of my job to do this insane stuff, good to know that others see this as hard :)


There is no need to trick the query optimizer, just common sense and an understanding of the RDBMS engine.

The query seems pretty standard, and this is an often-occurring problem with nested queries. Every JOIN and nested query grows the computation due to cartesian combination, so having your predicates as narrow and specific as possible is a must.

Also, Nested Loop is a kiss of death for your query performance.


Instead of doing this madness with complicated queries, I always advise to split complicated queries into many smaller simple queries and then merge the results in the code. This way you don't need to debug the queries as all of them are very simple SELECTs and the code that merges the results is one or two for loops.


In practice that means more data being sent to the application code, which then has to process it. 99% of the time, you'll get more performance figuring out the right way to express the data structure you want in SQL. The entire computation happens in the database, so all the data is immediately available to process on disk, and you only send what you need for bandwidth consumption. If the query is slow, you need to think about how you set up your joins and be aggressive about your use of predicates in subqueries to reduce the size of temporary tables.


Not in every case. Some DBs allow you to reference the results and use the result reference in subsequent queries. As a result, data stays in the DB until you're actually ready to pull it.

Or of course, you can just use temp tables.


That's a nice feature. Does Postgres have it? In that case, it's even MORE performant to push your data processing into the database via a query.


That's pretty much equivalent to just using materialized WITH clauses, no? Those act as an optimization barrier and are materialized in order before the query is run.


> materialized WITH clauses

Are you talking about CTEs? And if so, are you suggesting that each "temp table" in the CTE is calculated/materialized a step at a time? If so, I see the query optimizer tear these barriers down and fusing all of the SQL together on a daily basis.


Not if you use `MATERIALIZED`:

https://www.postgresql.org/docs/14/queries-with.html#id-1.5....

> However, if a WITH query is non-recursive and side-effect-free (that is, it is a SELECT containing no volatile functions) then it can be folded into the parent query, allowing joint optimization of the two query levels. By default, this happens if the parent query references the WITH query just once, but not if it references the WITH query more than once. You can override that decision by specifying MATERIALIZED to force separate calculation of the WITH query, or by specifying NOT MATERIALIZED to force it to be merged into the parent query. The latter choice risks duplicate computation of the WITH query, but it can still give a net savings if each usage of the WITH query needs only a small part of the WITH query's full output.

Before PG12 this used to be the only way CTEs worked. Now most CTEs don't need to be materialized, but you can force it and reestablish the optimization barrier.


I'm not too sure about Postgres honestly. I was specifically talking about Snowflake, although there might be others that do it too.

https://docs.snowflake.com/en/sql-reference/functions/result...

Sometimes you can follow this pattern and never actually pull data into your application, which is especially true when you're doing ETLs (table --> table).


That's more custom logic and additional code to maintain though.

In this case we're using the Datadog Agent with its custom_queries directive to automatically create metrics based on periodically evaluated SQL queries.


This. Plus, you can use temporary tables, which limits the amount of application code you have to write. Put your intermediate results in temp tables, and build indexes on them if you have to. It's a cheap way of forcing the optimizer to do things your way.
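
A minimal sketch of that pattern (table and column names borrowed from the article):

    -- materialize the small intermediate result, index it, and let the
    -- planner see fresh statistics for it
    CREATE TEMP TABLE active_runs AS
        SELECT id, stack_id FROM runs WHERE worker_id IS NOT NULL;
    CREATE INDEX ON active_runs (stack_id);
    ANALYZE active_runs;

    SELECT count(*)
    FROM active_runs ar
    JOIN stacks s ON s.id = ar.stack_id;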


That can sometimes be a solution... Just keep in mind, you are manually writing a join algorithm every time you need to do that, and sometimes it will be the right one for the task, other times not so much. The query planner will adapt your join type for you instead, if you can trust it.

As the shape of your data changes over time, your hard coded algorithm choice is not guaranteed to be the best one.


Sounds like just using CTEs. This is how I write some complicated queries, as a few CTEs with specific logic to get an overall query written in a more readable way.


That won't always work. The main reason for bigger queries is to get exactly the data you need, so you don't have to send unnecessary data over the wire, which is slow.


Whenever I see "select count(star_sign)" I shake my head. The first optimization you learn as a newbie dabbling in SQL is to always use "select count(some_fucking_id)" and not "star_sign". Whoever wrote this query should be fired, because it has two instances of this problem, one in a subquery!


That's harsh, could you explain why, please?

As far as I understand (and I'm the author) `COUNT(*)` means "count the number of rows, regardless of the values". `COUNT(id)` means "count all non-NULL id's" which ends up being the same for non-nullable columns, but requires additional processing for nullable ones. Thus the first option should be able to optimize out more processing time in some situations and at worst be the same.

This is confirmed by some quick googling, it's also confirmed by me running the query with a COUNT(*) and COUNT(id) - both run in the same amount of time, and it's also how it works in a SQL engine[0] I've implemented - COUNT(*) is able to optimize out the most underlying data loading.

[0]:https://github.com/cube2222/octosql


> [0]:https://github.com/cube2222/octosql

nice tool, thanks for making this available.


If I had to guess, COUNT(*) may load everything per row data-wise vs using an index. But I'd have to run tests to confirm =)


I do think the parent thinks that. But it's simply not true based on my research. COUNT(*) doesn't need any actual data from the rows, so I'm wondering if there's something else.

First result in Google confirms that at least for Postgres: https://www.citusdata.com/blog/2016/10/12/count-performance/

Another one from stackoverflow: https://stackoverflow.com/questions/2710621/count-vs-count1-... with a citation:

> Don't be fooled by those advising that when using * in COUNT, it fetches entire row from your table, saying that * is slow. The * on SELECT COUNT(*) and SELECT * has no bearing to each other, they are entirely different thing, they just share a common token, i.e. *.


Yeah, I didn't really believe it, just didn't feel like putting in the effort of researching it because in reality it's never come up in practice for me. Which hinted at it not being a real problem anymore.


Yeah, if you have the IDs as null. But why would you do such a thing in the first place? Have the ID be "serial primary key" and you'll never have null values in there ever!

Btw, that's the second optimization you learn. So you're telling me that in your case, you have null values in your IDs? You should be fired thrice then.

Since you're hell-bent on doing tests, do one with a dozen million rows, just an ID and a varchar as columns, and in one case have the ID have nulls, and in the other case be like I said. Then see the difference when you run "select count(ID) where the_varchar_column = 'Steve'" vs "select count(star_sign) bla bla" and come back to me. An image would be appreciated. Then go rewrite your DB structure and the query when you realize the performance difference.


I'm not sure what value your harsh tone adds to your comment. Sharing experiences like this publicly benefits everyone. It obviously exposes blind spots in one's knowledge, and it is expected that no one knows everything. That is the point of the comments functionality in HN, so that we can all share things that we know and absorb things we don't.

I personally enjoyed this article and learnt a few things from it. I learnt a few things from your comment too but it would have been better without all the harsh words.


I've never said I have null values, I've just used it to explain what performance optimization opportunities COUNT(*) has vs COUNT(id).

It would be great if you'd explain why what I've written is not the case or provided some citation for that.



