My favorite tool for this kind of analysis was pgAdmin3, which had a very nice diagramming output for EXPLAIN ANALYZE; you can see it here [0]. Hovering over the various images in the diagram would display their time, rows, and which statement produced them.
There is also this excellent post which shows how to break down an EXPLAIN and reason about the performance [1].
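For reference, the text these tools consume is just the output of EXPLAIN itself; something along these lines (the table and column names here are only placeholders):

    -- plan only, based on the planner's estimates; does not run the query
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

    -- actually runs the query and reports real times and row counts per node
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;

You paste the resulting plan text into the visualizers mentioned above.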
Thanks for the tip. Looks interesting, but still on the "analyze and do something" side vs monitoring then making suggestions.
In other words, this could be another input for the kind of tool I'm thinking of.
Kind of surprised this doesn't exist, as it seems doable. Given a combination of query optimizers, slow query monitoring, and common tuning techniques, even a rudimentary, heuristics-based recommendation tool could provide significantly more value than manually analyzing explain and log file output.
Looks like there is a service similar to what you asked for, but it only applies to MySQL, PerconaDB, and MariaDB right now [0]. Maybe as PostgreSQL grows in popularity more services will target it. Or go build your own, it sounds like a viable business.
Agreed. Also I like that it's so quick (at least for a regular EXPLAIN) - I would very often hit F7 to see the query plan visualization, even for just a split second, before pressing F5 to run the query. If I see the table-scan icon appear somewhere when I'm not expecting it to, it's a big hint I've done something dumb.
pgAdmin4 has become 100x better since the first preview was released. Unfortunately it's still a web client and struggles with "large" data sets (I wouldn't call a million rows large; pgAdmin3 would return it in a snap and you could copy it to your clipboard, etc., while pgAdmin4 wheezes under the weight of its webshit infrastructure).
It may be worth noting that while pgAdmin3 is officially deprecated, the binaries can still be downloaded and used. Latest version is 1.22.2 from November of 2016 [0].
I use Depesz's Explain tool regularly for discussing queries with distributed teams in Slack. It's an indispensable tool.
By the way, if you think this is cool, Hubert (creator of depesz) is a DBA at Instructure, making software for schools (Canvas) and employee training (Bridge). We open-source a ton of stuff (code.instructure.com).
We're hiring leads + senior engineers in Chicago, Salt Lake City, Seattle, and Budapest! If you're interested, feel free to reach out to neil+hn@instructure.com or check out what's available at https://jobs.lever.co/instructure?lever-via=NiHimSaI8r&team=...
I am a bit frustrated with one question regarding explain cost: namely, how to map the cost to wall time. Based on my research so far, the answer is basically "you can't". But that is difficult for me to understand. I do understand that it does not map with high (or even relatively low) accuracy. But the cases where I have needed the information are more like this: after 15 minutes of runtime I start wondering whether the query is just heavy and takes more time than expected (up to even hours), or whether I messed something up and the query in its current form is going to run approximately until the heat death of the universe, so I'd better rewrite it. It just feels weird if I can't get even that level of accuracy out of the cost number.
So here is a little bit of explanation (which it sounds like you may already know). The general reason is that the cost reflects a lot of factors, and, if the number of rows + available resources (CPU, memory, IO) + locks are constant, then for the same logical result set the cost does map to wall time. For example, given those constants I mentioned, if the execution time for a cost of 100 is 10 minutes, and you tweak the query (getting the same logical results) so that the new cost is 50, the new execution time would be ~5 minutes (I think).
But most of those things I mentioned aren't constant. Available resources and system load change, the number of rows in tables changes, and the number of locks and how long they're held on rows/tables vary. So what you're left with is a cost that reflects the relative performance of a query compared to the same result set obtained by a different query.
For reference, in Oracle, this SO link [1] explains how cost is calculated.
Note: this is all my understanding, and someone more knowledgeable might know better.
The costs are there to choose a plan. Relative ordering (individually and in aggregate) is all that's really needed for this, rather than some function of time.
Actual run time is heavily dependent on distribution of data. There may be characteristics of your data not modelled in statistics used to calculate costs.
If I were running a query that might take 15 minutes, I'd break it down, and put limit clauses in pinch points to figure out which bits of the query are sensitive (e.g. how much run time changes with increments to limits).
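A rough sketch of what I mean (the table names are made up): cap the expensive inner part with a limit and watch how the run time scales as you raise it.

    -- if doubling the inner limit roughly doubles the run time, this branch
    -- is the sensitive part of the plan
    SELECT c.name, sum(o.total)
    FROM (
        SELECT customer_id, total
        FROM orders
        WHERE created_at > now() - interval '90 days'
        LIMIT 10000  -- step this up: 1k, 10k, 100k...
    ) o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name;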
But often just looking at the plan will tell me if the database is doing joins in the order I think would be more or less efficient, given the extra context I have on the data distribution.
> Namely how to map the cost to wall time. Based on my research so far, the answer is basically "you can't"
That is pretty much it. For other database engines' query planners too.
Everything that goes into the cost estimate is at best an educated guess, so the result is never going to map neatly to wall-clock time or any other measure.
Remember: the query planner's job is not to find the best plan possible for a given statement. That would be an impossible task. The planner's job is to try to find a plan that is good enough, and to do so as fast as possible.
To cut a short post long:
The cost is an educated guess, based on index statistics where available. Assuming it is similar to the cost calculation SQL Server uses, it is a mix of expected CPU use, I/O bandwidth use, memory required for the operation to touch disk as little as possible, how large a load of locks the process might need to take (depending on the current isolation level settings), and so forth. Because you might have faster/slower CPU cores, more/fewer of them, faster/slower RAM, more/less cache per core, a lower/higher latency network, a set of drives that are faster/slower for random/bulk IO, etc., than the reference systems this mix was based upon, the cost is not going to exactly map to anything on your machine even if it does on some reference machine(s) somewhere in the development/testing chain.
Those reference machines might not even really exist. In SQL Server's case they are probably machines that sat under developers' desks in the late 90s, so the balance might be significantly off for most modern use cases. This is part of why MS are playing around with cardinality estimators and other related bits in recent versions: the guesses they used to make don't make as much sense on modern systems, so they are having to be updated to make them more relevant. In some cases the estimates (those applied to table variables for instance) are still a stab in the dark, but they are now a stab that is more likely to be close to correct on modern kit than the stabs it used to take.
This is why index hints are needed. Because of the many variables involved, all the query planner can hope to do to assess each plan against the other possibilities is make a best guess; a more exhaustive analysis might, for many queries, take longer than running the query with the worst plan ever would. If you are asking it to do something complex, sometimes you have to guide it towards the better plan.
Postgres' cost model uses a bunch of constants that evaluate the relative cost of queries with regard to CPU and I/O and random vs. sequential reads. You can actually tune the cost model yourself by changing the values of the constants. Cost does not really map to wall-time in any way besides the default constants being chosen relative to the cost of a sequential read as a basis (and Postgres admits these relative values are not scientific or necessarily accurate).
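For the curious, these are ordinary settings you can inspect and change (the defaults below are Postgres's documented values, all expressed relative to the cost of fetching one page sequentially):

    SHOW seq_page_cost;        -- 1.0, the baseline unit
    SHOW random_page_cost;     -- 4.0
    SHOW cpu_tuple_cost;       -- 0.01
    SHOW cpu_index_tuple_cost; -- 0.005
    SHOW cpu_operator_cost;    -- 0.0025

    -- a plain sequential scan, for example, is estimated roughly as
    --   (pages read) * seq_page_cost + (rows scanned) * cpu_tuple_cost

    -- a common tweak on SSD-backed systems is to pull random reads closer
    -- to sequential ones for the current session:
    SET random_page_cost = 1.1;

So the unit is "one sequential page fetch", not seconds, which is why the numbers are only comparable to each other.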
> Cost does not really map to wall-time in any way
This is a claim that is extremely difficult for me to believe, because it says that you really can't say that a query in one system with a cost of 10 is any more likely to run faster than a query with a cost of 10^300 (these are made-up numbers) in another environment. If there were a way to get a mapping with even accuracy within two orders of magnitude (e.g. a query with cost x is likely to run somewhere between 10 and 1000 minutes) on my current system, I would be perfectly happy. Even three orders of magnitude of accuracy would likely be helpful.
You seem to be ignoring the second half of that sentence:
> ...besides the default constants being chosen relative to the cost of a sequential read as a basis (and Postgres admits these relative values are not scientific or necessarily accurate)
The costs will always be relative to each other as defined by your constants. The whole point of the cost model is to compare plans so that the fastest one can be chosen. If you hyper-tune your constants according to your system (which I feel is probably pretty difficult to do accurately), you could maybe get something within the orders of magnitude you want.
Which reminds me to wonder - does anybody know why the cost model uses constants from a config file? Compared to all the amazing things that Postgres does, it seems like "measure the value of a sequential read and a random read at runtime, then use those values" should be pretty easy...
There are newer/commercial databases that use that sort of microbenchmarking approach. My guess is that either no one has implemented it for Postgres, or there's a fork that does it, or it's somehow pluggable. I'm not sure if there's been any discussion of it in the developer communications, but it's worth a look.
At least in my experience, things being slow is almost always due to a table scan rather than using an index.
Just the estimates (i.e. not actually running the query) should give clues as to which table it scans, and the predicates give hints as to which index would have helped.
It's not always an easy fix though; sometimes what you're asking for just isn't possible to index in a nice way. Just the other day a colleague and I were trying to optimize a query, and we couldn't get it to use the index simply because the selectivity was too low. The most selective part of the where clause was a non-trivial and highly dynamic "exists" sub-query... We ended up using a few materialized views with manual refresh, but the query still took 2-3 seconds (down from 10-15).
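For anyone curious what that looks like in practice, here's a rough sketch of the materialized-view-with-manual-refresh approach (all names invented, not our actual schema):

    -- precompute the expensive, highly dynamic "exists" part
    CREATE MATERIALIZED VIEW recently_active_customers AS
    SELECT DISTINCT customer_id
    FROM orders
    WHERE created_at > now() - interval '30 days';

    -- a concurrent refresh requires a unique index on the matview
    CREATE UNIQUE INDEX ON recently_active_customers (customer_id);

    -- the "manual refresh": run this on whatever schedule is tolerable
    REFRESH MATERIALIZED VIEW CONCURRENTLY recently_active_customers;

The query then joins against the materialized view instead of evaluating the sub-query for every row.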
In the "possibly unknown Postgres tools" category, I'm also a huge fan of PG Hero [0]. Gives you a nice sortable UI to look at running queries, see what the longest ones are, the most frequently called, etc. This is all just reading from pg_stats and such but the UI is very nice.
I've used this in the past. What I find that it's missing is the _why_, not the _what_. It's not that my query is performing a full table scan that I care about. It's why it's not using one of the six indexes that I want to know.
I would pay good money for an extension that could answer how the query planner makes decisions about a given query.
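The closest workaround I know of is to temporarily discourage the plan type you got and compare the cost estimates, which at least shows what the planner thinks the index plan would cost (the query here is made up):

    -- per-session only; this doesn't forbid seq scans, it just makes the
    -- planner price them absurdly high so an index plan wins if one exists
    SET enable_seqscan = off;
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
    RESET enable_seqscan;

If the forced index plan comes out with a much higher estimated cost, the next things to look at are the row estimates (stale statistics?) and the random_page_cost setting.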
Someone made a flame graph analyzer for Oracle queries, which seems like an even better way of visualizing explains, and seems like it would be fairly straightforward to translate to postgres: https://externaltable.blogspot.com/2014/05/flame-graphs-for-...
Indeed, a great resource! I really liked how he explained and benchmarked an implementation of foreign keys to partitioned data (sadly not available natively yet) in this multi-part series from last year: https://www.depesz.com/2018/10/02/foreign-key-to-partitioned...
This is one of my go-to sites for optimizing Postgres queries. While all the information is available in the text output there's something nice about the site highlighting the rows scanned for the bad parts of the queries in a dark red ;)
My only nits: it's kind of annoying to have to click a checkbox to make sure my analysis is not public, and to have to keep the window open so I can delete it after I'm done.
This is indeed an awesome tool. We've used it quite a bit for local profiling of known slow queries.
FYI because I wish someone had pointed this out to me when I stumbled across tools like this: For profiling really busy production DBs, I highly recommend https://pganalyze.com/
Not affiliated - just an ecstatic customer.
About the only thing I miss from the old Ingres database on OpenVMS was the query plan view. This page (with one of the most horrible backgrounds for viewing) http://ariel.its.unimelb.edu.au/~yuan/Ingres/us_38697.html has some examples of what I remember. Seeing the letters FSM still fills me with dread.
Yep... the one very good thing that I got out of pgAdmin was that it finally convinced me to get psql under my fingers so I wouldn't have to use pgAdmin.