The prescription of setting seq_page_cost equal to random_page_cost is certainly reasonable for SSDs, but I wonder if the underlying issues aren't somewhat deeper and more interesting. One difference between a sequential scan and an index scan is the amount of data being scanned. PostgreSQL stores information horizontally as rows, so a sequential scan has to read in all column values of all rows, while an index scan reads through the values of a _single_ column. The 50x performance difference _might_ be just that the whole row is 50x wider than the indexed join column.
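A quick way to sanity-check that theory is to compare the on-disk size of the table with the size of the index on the join column, and the average row width with the width of that one column. The table name prop_keys comes from the article; the column and index names below are hypothetical stand-ins:

    -- total heap size vs. size of the index on the join column
    SELECT pg_size_pretty(pg_relation_size('prop_keys')) AS table_size,
           pg_size_pretty(pg_relation_size('prop_keys_app_id_idx')) AS index_size;

    -- average width of a whole row vs. the join column alone
    SELECT avg(pg_column_size(prop_keys.*)) AS avg_row_bytes,
           avg(pg_column_size(app_id)) AS avg_col_bytes
    FROM prop_keys;

If either ratio is anywhere near 50x, that would support the row-width explanation.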
An interesting second factor relates to the nature of SSD storage. With SSDs, a read request pulls back a full 4K page even if the request was for less. So it's not quite right to say that a sequential read and a random read cost the same on an SSD, particularly if the same 4K page must be read multiple times. I suspect that the index technique used by PostgreSQL tends to organize data such that successive indexed values reside in the same 4K SSD page. IOW, it's not so much that the cost of random SSD access is the same as sequential SSD access (though that's true), as it is that the PostgreSQL index mechanism doesn't require multiple reads of the same 4K page.
If a hash-based index had been used instead of a B-tree index, and if the table were narrower, the sequential scan might have outperformed the index scan.
A sequential scan could also be faster if the join selectivity is poor. As an extreme example, suppose every id in event_type appeared in prop_keys for a given app; scanning the index repeatedly would then be a waste of time, since you would have to scan the whole table anyway.
PostgreSQL uses 8K pages, so it won't read less than that. IIRC, it also has its own cache of recently read pages, so it won't have to read the same page over and over.
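Both of those are easy to confirm from psql; block_size and shared_buffers are standard settings, and the values in the comments are just the stock defaults:

    SHOW block_size;      -- 8192 on a stock build
    SHOW shared_buffers;  -- size of PostgreSQL's own page cache, 128MB by default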
You're saying that during sequential scans, the time it takes per row is O(n) where n is the number of columns? I find that hard to believe. Can anyone confirm / deny this?
The number of columns in PostgreSQL is limited to 250-1600 [1] since a tuple (a row) can't span more than one page of memory. Since O-notation talks about asymptotic behavior, it doesn't really apply here.
But yes, tables with more columns normally take more time to scan sequentially. The complete tuple is always loaded (excluding the data of TOAST [2] attributes), there is no way to only load one column. This is one of the reasons that column-oriented databases can be faster than row-oriented databases [3].
The time complexity of a sequential scan is O(r), because every row must be read once. Of course rows are made up of columns, so you could also write it as O(rc), where c is the average number of columns per row.
There's a link to postgresqlco.nf in the description. Do you run the service? It doesn't accept a file being dropped. In fact, it doesn't seem to do anything (I looked at the HTML).
Thanks for the tip. This probably affects far more people than realize it. It would be really interesting if PG would use machine learning to discover this sort of tuning on its own.
Your suggestion sounds very much like that of a Silicon Valley start-up guy ("Machine Learning to the rescue!"), but I believe that here a deterministic detection of the underlying drive, possibly tied to a warning on start-up, would be easier to code, less bug-prone, and faster.
OTOH, if you mean the general problem of optimizing configuration, then I distinctly recall having read something about automatic server configuration, IIRC on AWS, but quite possibly only as a feature request.
PostgreSQL does have built-in support for a genetic algorithm based query optimizer. I haven't tried it yet, so can't comment on how well it works. Docs are here: https://www.postgresql.org/docs/9.6/static/geqo.html.
The idea is that up until some number of relations (geqo_threshold, 12 by default) the join tree is searched exhaustively, and beyond that it switches to the genetic algorithm. So it's kinda automatic.
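The knobs for it are ordinary settings, so you can inspect and adjust them per session; the defaults below are the documented ones:

    SHOW geqo;               -- on by default
    SHOW geqo_threshold;     -- 12 by default: queries with this many FROM items or more use GEQO
    SET geqo_threshold = 8;  -- make the genetic optimizer kick in earlier for this session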
SQL Server 2017 introduced two automatic database tuning features: automatic plan correction and automatic index management (the latter for Azure SQL only).
I always increase the seq scan cost, but another thing that helped me more was updating the table statistics. That way you can make the planner better aware of your indexes.
In my case I increased the STATISTICS target to 5000 and the planner immediately started using the index instead of a full table scan.
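For anyone who wants to try it: the statistics target is set per column, and it only takes effect after the next ANALYZE. The table and column names below are placeholders:

    -- raise the sample size used for this column's statistics
    ALTER TABLE prop_keys ALTER COLUMN app_id SET STATISTICS 5000;
    -- recollect statistics so the planner sees the new histograms
    ANALYZE prop_keys;

The default target is 100 (default_statistics_target), so 5000 means a much bigger sample and more detailed histograms, at the cost of slower ANALYZE and slightly slower planning.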
Unfortunately the author doesn't mention some pretty basic things: which PostgreSQL version, how much data, how much of it fits into RAM, what storage (and hardware in general) ...
If I understand it correctly, PostgreSQL was using the default configuration, which is rather inefficient and is more about "must start everywhere".
Decreasing random_page_cost makes sense if you have storage that can handle random I/O well (although I wouldn't go to 1 even if it's an SSD). But who knows if the data was read from storage at all? Maybe it'd fit into RAM (and just increasing effective_cache_size would be enough for the planner to realize that).
Setting random_page_cost = 1 is pretty common advice for SSDs and works well in my experience. The typical advice for HDD RAID or SCSI is random_page_cost = 2, and SSDs are faster than those.
I don't know who recommends random_page_cost=1, but IMNSHO it's a bit silly. Even SSDs handle sequential I/O better than random I/O. Values between 1.5 and 2.0 are more appropriate; I wouldn't really recommend 1.0 except when you know the data fits into RAM. There are other options that affect the estimated cost of random I/O, e.g. effective_cache_size.
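If you want to experiment, these are plain settings you can change without a restart; the numbers below are only the values being discussed in this thread, not recommendations:

    -- per session, to see how a specific plan reacts
    SET random_page_cost = 1.5;
    SET effective_cache_size = '48GB';   -- roughly the RAM available for caching

    -- or persistently (takes effect after a config reload)
    ALTER SYSTEM SET random_page_cost = 1.5;
    SELECT pg_reload_conf();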
You also get good results tweaking this particular knob if you have large amounts of RAM. If your blocks are almost always in OS cache, you are almost never going to make random seeks even if the PostgreSQL planner thinks you are.
That feels like something the database could do for itself - checking whether a range of mmap()ed blocks is actually in memory or not is a single syscall.
I guess for large indexes the overhead of walking the page tables is going to be large though, so it’s not necessarily going to be a net win.
But you don't know which blocks you'll need at planning time, so you can't really check that.
You could of course check whether the total database size fits within RAM, but it's much more common to have a database far larger than RAM (say 1TB on a machine with 128GB of RAM) while the actual working set (the recent data touched by queries) is much smaller.
I guess one could implement a job which samples the pages of tables and indexes to check what percentage is typically in RAM and use that for an estimate. But to know which actual pages will be hit by a query you need to start executing it (you need to do an index lookup to see which index and table pages will be accessed), which would defeat the purpose of query planning.
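The sampling half of that already exists in crude form: the pg_buffercache extension shows how much of a relation currently sits in PostgreSQL's shared buffers (it says nothing about the OS page cache, and the index name below is a placeholder):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- how many of this index's 8K pages are currently in shared buffers
    SELECT count(*) AS buffers_cached,
           pg_relation_size('prop_keys_app_id_idx') / current_setting('block_size')::int AS total_pages
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    WHERE c.relname = 'prop_keys_app_id_idx'
      AND b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database());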
Granted, but (AFAICT) in this case the database was doing a full table scan because it didn’t think the index was in memory. Checking to see whether the index itself is already loaded seems like something the query planner in principle ought to be able to do efficiently. (Obviously the existing codebase might make it difficult.)
Sadly that is not (at least easily) possible, because random_page_cost vs. seq_page_cost is not just about the I/O system but about all the factors that affect the cost of reading pages randomly vs. sequentially, e.g. how often PostgreSQL's pages are already in the filesystem cache. So how much RAM your machine has available for PostgreSQL, and your access patterns, matter too.
Also I imagine some expensive SAN solutions would be pretty tricky to measure, given how smart they try to be with caching and with moving data between different kinds of disks.
It sounds simple until you actually try doing that. The thing is, reducing the costing to these two parameters is a significantly simplified model of what happens in practice. So you can't just run some I/O benchmark to measure random vs. sequential requests. For example the defaults that worked fine for a long time (seq_page_cost=1 and random_page_cost=4) certainly do not reflect the difference between random and sequential I/O on rotational devices (where the device can easily do 100MB/s in sequential access, but less than 1MB/s in random).
> For example the defaults that worked fine for a long time (seq_page_cost=1 and random_page_cost=4) certainly do not reflect the difference between random and sequential I/O on rotational devices (where the device can easily do 100MB/s in sequential access, but less than 1MB/s in random)
The PostgreSQL documentation explains why. They assume HDD random access is 40x slower than sequential access but that you'll have a 90% cache hit rate, so random_page_cost=4 reflects 10% of that 40x slowdown (0.1 × 40 = 4).
Not entirely. The documentation says you can interpret it that way, not that it's how the numbers were determined.
AFAIK it's much more "We're using those numbers as defaults because they seem to be working well," rather than "We did extensive benchmarking and these are the right values!"
You can measure how much slower random I/O is fairly easily. But the question is how to derive PostgreSQL cost parameters from that. Should you use the same assumption about 90% cache hit ratio (why?) or should you use some different value?