The prescription of setting seq_page_cost equal to random_page_cost is certainly reasonable for SSDs, but I wonder if the underlying issues aren't somewhat deeper and more interesting. One difference between a sequential scan and an index scan is the amount of data being scanned. PostgreSQL stores information horizontally as rows, so a sequential scan has to read in all column values of all rows, while an index scan reads through the values of a _single_ column. The 50x performance difference _might_ be just that the whole row is 50x wider than the indexed join column.
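A quick way to sanity-check that theory is to compare the on-disk size of the table with the size of the index on the join column, and the average row width with the width of that one column. The table name prop_keys comes from the article; the column and index names below are hypothetical stand-ins:

    -- total heap size vs. size of the index on the join column
    SELECT pg_size_pretty(pg_relation_size('prop_keys')) AS table_size,
           pg_size_pretty(pg_relation_size('prop_keys_app_id_idx')) AS index_size;

    -- average width of a whole row vs. the join column alone
    SELECT avg(pg_column_size(prop_keys.*)) AS avg_row_bytes,
           avg(pg_column_size(app_id)) AS avg_col_bytes
    FROM prop_keys;

If either ratio is anywhere near 50x, that would support the row-width explanation.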
An interesting second factor relates to the nature of SSD storage. With SSDs, a read request pulls back a full 4K page even if the request was for less. So it's not quite right to say that a sequential read and a random read cost the same on an SSD, particularly if the same 4K page must be read multiple times. I suspect that the index technique used by PostgreSQL tends to organize data such that successive indexed values reside in the same 4K SSD page. IOW, it's not so much that the cost of random SSD access is the same as sequential SSD access (though that's true), as it is that the PostgreSQL index mechanism doesn't require multiple reads of the same 4K page.
If a hash-based index had been used instead of a B-tree index, and if the table were narrower, the sequential scan might have outperformed the index scan.
A sequential scan could also be faster if the join selectivity is poor. As an extreme example, suppose every id in event_type appeared in prop_keys for a given app; scanning the index repeatedly would then be a waste of time, since you would have to scan the whole table anyway.
PostgreSQL uses 8K pages, so it won't read less than that. IIRC, it also has its own cache of recently read pages, so it won't have to read the same page over and over.
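Both of those are easy to confirm from psql; block_size and shared_buffers are standard settings, and the values in the comments are just the stock defaults:

    SHOW block_size;      -- 8192 on a stock build
    SHOW shared_buffers;  -- size of PostgreSQL's own page cache, 128MB by default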
You're saying that during sequential scans, the time it takes per row is O(n) where n is the number of columns? I find that hard to believe. Can anyone confirm / deny this?
The number of columns in PostgreSQL is limited to 250-1600 [1] since a tuple (a row) can't span more than one page of memory. Since O-notation talks about asymptotic behavior, it doesn't really apply here.
But yes, tables with more columns normally take more time to scan sequentially. The complete tuple is always loaded (excluding the data of TOAST [2] attributes), there is no way to only load one column. This is one of the reasons that column-oriented databases can be faster than row-oriented databases [3].
The time complexity of a sequential scan is O(r), because every row must be read once. Of course rows are made up of columns, so you could also write it as O(rc), where c is the average number of columns per row.
There's a link to postgresqlco.nf in the description. Do you run the service? It doesn't accept a file being dropped. In fact, it doesn't seem to do anything (I looked at the HTML).
Thanks for the tip. This probably affects far more people than realize it. It would be really interesting if PG would use machine learning to discover this sort of tuning on its own.
Your suggestion sounds very much like that of a Silicon Valley start-up guy ("Machine Learning to the rescue!"), but I believe that here a deterministic detection of the underlying drive, possibly tied to a warning on start-up, would be easier to code, less bug-prone, and faster.
OTOH, if you mean the general problem of optimizing configuration, then I distinctly recall having read something about automatic server configuration, IIRC on AWS, but quite possibly only as a feature request.
PostgreSQL does have built-in support for a genetic algorithm based query optimizer. I haven't tried it yet, so can't comment on how well it works. Docs are here: https://www.postgresql.org/docs/9.6/static/geqo.html.
The idea is that up until some number of relations (geqo_threshold, 12 by default) the join tree is searched exhaustively, and beyond that it switches to the genetic algorithm. So it's kinda automatic.
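The knobs for it are ordinary settings, so you can inspect and adjust them per session; the defaults below are the documented ones:

    SHOW geqo;               -- on by default
    SHOW geqo_threshold;     -- 12 by default: queries with this many FROM items or more use GEQO
    SET geqo_threshold = 8;  -- make the genetic optimizer kick in earlier for this session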
SQL Server 2017 introduced two automatic database tuning features: automatic plan correction and automatic index management (the latter for Azure SQL only).
I always increase the seq scan cost, but another thing that helped me more was updating the table statistics. That way you can make the planner better aware of your indexes.
In my case I increased the STATISTICS target to 5000 and the planner immediately started using the index instead of a full table scan.
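For anyone who wants to try it: the statistics target is set per column, and it only takes effect after the next ANALYZE. The table and column names below are placeholders:

    -- raise the sample size used for this column's statistics
    ALTER TABLE prop_keys ALTER COLUMN app_id SET STATISTICS 5000;
    -- recollect statistics so the planner sees the new histograms
    ANALYZE prop_keys;

The default target is 100 (default_statistics_target), so 5000 means a much bigger sample and more detailed histograms, at the cost of slower ANALYZE and slightly slower planning.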
Unfortunately the author doesn't mention some pretty basic things: which PostgreSQL version, how much data, how much of it fits into RAM, what storage (and hardware in general) ...
If I understand it correctly, PostgreSQL was using the default configuration, which is rather inefficient and is more about "must start everywhere".
Decreasing random_page_cost makes sense if you have storage that can handle random I/O well (although I wouldn't go to 1 even if it's an SSD). But who knows if the data was read from storage at all? Maybe it'd fit into RAM (and just increasing effective_cache_size would be enough for the planner to realize that).
Setting random_page_cost = 1 is pretty common advice for SSDs and works well in my experience. The typical advice for HDD RAID or SCSI is random_page_cost = 2, and SSDs are faster than those.
I don't know who recommends random_page_cost=1, but IMNSHO it's a bit silly. Even SSDs handle sequential I/O better than random I/O. Values between 1.5 and 2.0 are more appropriate; I wouldn't really recommend 1.0 except when you know the data fits into RAM. There are other options that affect the estimated cost of random I/O, e.g. effective_cache_size.
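If you want to experiment, these are plain settings you can change without a restart; the numbers below are only the values being discussed in this thread, not recommendations:

    -- per session, to see how a specific plan reacts
    SET random_page_cost = 1.5;
    SET effective_cache_size = '48GB';   -- roughly the RAM available for caching

    -- or persistently (takes effect after a config reload)
    ALTER SYSTEM SET random_page_cost = 1.5;
    SELECT pg_reload_conf();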
You also get good results tweaking this particular knob if you have large amounts of RAM. If your blocks are almost always in OS cache, you are almost never going to make random seeks even if the PostgreSQL planner thinks you are.
That feels like something the database could do for itself - checking whether a range of mmap()ed blocks is actually in memory or not is a single syscall.
I guess for large indexes the overhead of walking the page tables is going to be large though, so it’s not necessarily going to be a net win.
But you don't know which blocks you'll need at planning time, so you can't really check that.
You could of course check whether the total database size fits within RAM, but it's much more common to have a database far larger than RAM (say 1TB on a machine with 128GB of RAM) while the actual working set (the recent data touched by queries) is much smaller.
I guess one could implement a job which samples the pages of tables and indexes to check what percentage is typically in RAM and use that for an estimate. But to know which actual pages will be hit by a query you need to start executing it (you need to do an index lookup to see which index and table pages will be accessed), which would defeat the purpose of query planning.
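The sampling half of that already exists in crude form: the pg_buffercache extension shows how much of a relation currently sits in PostgreSQL's shared buffers (it says nothing about the OS page cache, and the index name below is a placeholder):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- how many of this index's 8K pages are currently in shared buffers
    SELECT count(*) AS buffers_cached,
           pg_relation_size('prop_keys_app_id_idx') / current_setting('block_size')::int AS total_pages
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    WHERE c.relname = 'prop_keys_app_id_idx'
      AND b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database());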
Granted, but (AFAICT) in this case the database was doing a full table scan because it didn’t think the index was in memory. Checking to see whether the index itself is already loaded seems like something the query planner in principle ought to be able to do efficiently. (Obviously the existing codebase might make it difficult.)
Sadly that is not (at least easily) possible, because random_page_cost vs. seq_page_cost is not just about the I/O system but about all the factors that affect the cost of reading pages randomly vs. sequentially, e.g. how often PostgreSQL's pages are already in the filesystem cache. So how much RAM your machine has available for PostgreSQL, and your access patterns, matter too.
Also I imagine some expensive SAN solutions would be pretty tricky to measure, given how smart they try to be with caching and with moving data between different kinds of disks.
It sounds simple until you actually try doing that. The thing is, reducing the costing to these two parameters is a significantly simplified model of what happens in practice. So you can't just run some I/O benchmark to measure random vs. sequential requests. For example the defaults that worked fine for a long time (seq_page_cost=1 and random_page_cost=4) certainly do not reflect the difference between random and sequential I/O on rotational devices (where the device can easily do 100MB/s in sequential access, but less than 1MB/s in random).
> For example the defaults that worked fine for a long time (seq_page_cost=1 and random_page_cost=4) certainly do not reflect the difference between random and sequential I/O on rotational devices (where the device can easily do 100MB/s in sequential access, but less than 1MB/s in random)
The PostgreSQL documentation explains why. They assume HDD random access is 40x slower than sequential access but that you'll have a 90% cache hit rate, so random_page_cost=4 reflects 10% of that 40x slowdown (0.1 × 40 = 4).
Not entirely. The documentation says you can interpret it that way, not that it's how the numbers were determined.
AFAIK it's much more "We're using those numbers as defaults because they seem to be working well," rather than "We did extensive benchmarking and these are the right values!"
You can measure how much slower random I/O is fairly easily. But the question is how to derive PostgreSQL cost parameters from that. Should you use the same assumption about 90% cache hit ratio (why?) or should you use some different value?