You know, Postgres is a great database system, but its core storage structure is prone to awful performance degradation when you have a lot of dead (out-of-date) rows, and sometimes autovacuum can't keep up.
There is something to be said for simpler B-Tree style databases like MySQL that are harder to screw up!
Also, perhaps there could be an optional mode for Postgres that did something to the effect of immediate vacuuming for all modification statements: collecting garbage as soon as it's made, at the cost of slower statements, but with no long-term degradation and more predictable performance.
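For what it's worth, Postgres has no such mode today; the closest approximation I know of is making autovacuum far more aggressive on the tables that churn, roughly like this (table name made up, numbers purely illustrative):

    ALTER TABLE hot_table SET (
        autovacuum_vacuum_scale_factor = 0.0,   -- ignore table size entirely...
        autovacuum_vacuum_threshold    = 1000,  -- ...vacuum after ~1000 dead tuples
        autovacuum_vacuum_cost_delay   = 0      -- don't throttle the vacuum worker
    );

That still isn't per-statement cleanup, but it keeps bloat bounded at the cost of near-constant background vacuuming.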
I'm not sure why this comment is downvoted. The behavior of PostgreSQL on workloads that require vacuuming of large tables -- and especially the default behavior in that case -- is hugely problematic. There's a major project ongoing to build a new storage engine to address this:
http://rhaas.blogspot.com/2018/01/do-or-undo-there-is-no-vac...
Like you, I've long wished I could pay a performance penalty on queries to collect garbage immediately rather than batching up this debt to pay all at once some time later.
[edited: I'd pay that cost on all queries, not just writes, if the performance were more consistent over time.]
It's not that simple... The whole point of keeping dead rows around is so the MVCC snapshot can show a consistent view of the database for any user within a transaction. If you instantly cleaned up those dead rows, other users would suddenly lose that consistent view, or have data disappear out from under them entirely.
The other way this is handled is to move dead rows to an undo log, so the main heap contains only the current version of any given row, plus a pointer to where the old versions can be found in the undo log.
The above approach keeps your main heap from bloating when updates and deletes happen, but it causes transactions which reference the old version of the rows to be slower, as well as slowing down rollbacks.
This is exactly what the zHeap project is doing along with the pluggable storage API.
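To make the snapshot point above concrete, here's roughly the sequence that forces dead rows to stick around (psql sketch; table and values are made up):

    -- session A
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT val FROM t WHERE id = 1;          -- sees the old version

    -- session B, meanwhile
    UPDATE t SET val = 'new' WHERE id = 1;   -- old version becomes a dead tuple
    COMMIT;

    -- session A again
    SELECT val FROM t WHERE id = 1;          -- still sees the old version
    COMMIT;                                  -- only now can vacuum reclaim it

Whether the old version lives in the heap (current Postgres) or in an undo log (zheap-style) is the storage trade-off being discussed here.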
> The above approach keeps your main heap from bloating when updates and deletes happen, but it causes transactions which reference the old version of the rows to be slower, as well as slowing down rollbacks.
I think that's consistent with the parent's goal (and mine) of allowing ongoing requests to be slower in exchange for not having to pay a major cost later to vacuum.
Yup, I meant for it to be consistent with the OP's concerns. I just wanted to point out that "just paying for it up front" also has its own trade-offs. For a lot of use cases, the way PG went ends up better for performance than using an undo log (after all the performance optimizations PG has put in place, like freezing pages, etc.). And for some use cases an undo log is by far better, and more consistent.
I can see that. I've really only worked with a few workloads where this behavior was quite bad. (In particular, the out-of-the-box policy to vacuum when the dead tuple fraction reaches 20% means that for an INSERT-mostly workload with a bit of UPDATE/DELETE, the vacuums get further and further apart and take longer and longer.) In the worst one I've worked on, the effects take several days or weeks to show up, but they degrade system throughput by 50-80% (!) for days or weeks on end.
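For anyone following along, that "20% policy" is the standard autovacuum trigger formula, and you can eyeball how far each table is from it (thresholds shown are the defaults):

    -- Autovacuum kicks in when, roughly:
    --   n_dead_tup > autovacuum_vacuum_threshold
    --               + autovacuum_vacuum_scale_factor * n_live_tup
    -- With the defaults (50 + 0.2 * n_live_tup), a 500M-row table can
    -- accumulate ~100M dead tuples before a vacuum even starts.
    SELECT relname, n_live_tup, n_dead_tup,
           50 + 0.2 * n_live_tup AS approx_trigger_point
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;

That scale-with-table-size behavior is exactly why INSERT-mostly tables see vacuums get rarer and bigger as they grow.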
The experience makes me wonder if the database is designed primarily around short-term, modest-scale workloads (which are much easier to test) rather than around behavior that only shows up after several weeks of continuous, heavy load. It obviously works for a lot of users.
There's nothing to "screw up" as it's just a different storage system implementation to handle MVCC. There is a project called zheap underway which is designed to do just what you say by doing in-place updates.
It's true though that Postgres still needs work on performance features and optimizations.
I reject the idea that all tools can be made simple enough for people to use them without understanding what they're doing.
Yes, of course we all want our tools to be the best they can be, but that involves making trade-offs in features because the maintainers don't have an infinite pool of time.
Postgres is a sufficiently advanced tool that if you're using it for any serious work you need to read the documentation for it. Ideally cover to cover, but at least skim the whole thing and pick out the important parts. The Postgres "Routine Database Maintenance Tasks" documentation [1] describes in great detail everything you need to do to keep your database working well.
I suppose the attitude of the PG community. I can think of at least one other time when the PG community was very dismissive of criticism from users, even when it was valid...
Not to defend the community at large to a fault... but there is an amazing number of bug reports that end up being nothing. It's easy to dismiss something as a non-issue without understanding the underlying cause, which is why the NFS / fsync issue persisted for so long. To be fair, it was also an issue in other DBs.
Not that it forgives the fact that the bug was reported 9 years ago and only recently got fixed. I know I'd personally have messed it up too.
If you use read committed, row-based replication, and explicit locks where necessary, rather than the default of repeatable read, avoiding deadlocks isn't too hard.
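A minimal sketch of that combination in MySQL terms (table and column names made up; row-based replication is the separate binlog_format = ROW server setting):

    -- switch the session off the repeatable-read default
    SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

    START TRANSACTION;
    -- take an explicit row lock rather than relying on InnoDB's gap locking
    SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    COMMIT;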
It's usually the gap locks that bite you, I've found.
Interesting how this is tied to the PostgreSQL business model. Consultancy is how the ecosystem funds development for PG, so there's no real incentive to automatically spot performance issues the way, for instance, MongoDB Atlas does. Curious side effect of the business model. (I much prefer PG over MongoDB, but the automatic index suggestion thingie is pretty slick.)
Is not having Perl such a worry? I'd imagine most systems that use this also use Git, and that requires Perl, and I'm not aware of any GNU/Linux distro that doesn't come with Perl (5, of course) pre-installed. While we're speaking of Perl, I just want to say how amazing Perl 6 is, especially compared to Perl 5, in every regard - from syntax to arrays/lists to the quality of the (admittedly small range of) libraries.
If for some reason you cannot use Docker and don't want to pollute your system globally with CPAN packages, you can use a local::lib install instead. CentOS example:
Microsoft SQL Server has had this built in since “forever”, but instead of looking at your config, it does its analysis based on a workload trace, analyzing how that workload performs and which particular queries have sub-par performance.
Thus the “tuning” matches your actual use.
If guaranteed improvements to your DB's performance can be made based on static analysis of the config alone, why on earth is that not the standard config? That seems like a no-brainer.
The SQL Server tuning advisor has historically been terrible for any database of any complexity at all. It imposes an enormous load, and the results are usually contradictory and often very sub-optimal if not detrimental. Works great for a hello world style database, though.
There was a post earlier (https://news.ycombinator.com/item?id=19422554) that I assume inspired this post and it was someone who benefited by correcting the DBMS' very wrong assumptions about the cost of random IO, and this was always the Achilles heel of SQL Server as well with any sort of fast storage -- it absolutely refused to accept that anything could be faster than about 150 IOPS, and would try to brute force through with terrible approaches. In this case the guy could just change the configuration and it would properly calculate query plan costs and execute accordingly, where SQL Server, to my knowledge, still doesn't let you change this costing, nor does it dynamically determine it.
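On the Postgres side, the knob from that earlier post is the planner's random-I/O cost estimate, which you can change without a restart (the exact value depends on your storage; this is just a sketch):

    -- default is 4.0, which assumes spinning disks; on SSD/NVMe many
    -- people bring it close to seq_page_cost. Value is illustrative only.
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();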
In the MySQL world, mysqltuner, which spits out suggested configuration changes, has been kicking around for more than a decade now. I haven't had a need to use it in several years, though: https://github.com/major/MySQLTuner-perl
There are just so many dials and switches that can modify the performance of the database that it has really moved far beyond anything any one individual could possibly grok.
Standard config and actual use are very different things. The first implies adequate performance for most use cases, while tuning for actual use means adequate performance for your current use case.
This tool seems to be analyzing current use case.
pgtune is pretty much the standard; not sure about this new tool. In case you're not familiar with PG tuning, here's the TL;DR: don't even think about going into production without running pgtune. The default Postgres settings are not optimized; assuming you have any decent load on the system, this will be a problem.
The real odd thing here is that `work_mem` (and to a lesser degree `maintenance_work_mem`) is still low enough out of the box that a script got written to highlight that it's too low. Increasing it is the first step one probably ought to look into when installing Postgres.
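If anyone wants to experiment, both can be raised without a restart; something like this, with the caveat that the right numbers are entirely workload-dependent:

    -- work_mem is allocated per sort/hash operation, per connection, so
    -- the "right" value depends on connection count and query shape.
    -- Numbers are illustrative only, not a recommendation.
    ALTER SYSTEM SET work_mem = '64MB';
    ALTER SYSTEM SET maintenance_work_mem = '512MB';
    SELECT pg_reload_conf();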
Huge numbers of postgres installations are on small machines and/or machines shared with other software. And whether you want a large work_mem hugely depends on whether you want to run an OLTP workload with a few hundred connections, or an analytics workload with like 2. It ain't as simple as you paint it.
Love this idea, but is this better done as part of, say, an existing linting framework? (update: oops, I thought this was about reading from your pg configuration file, not the live DB)
You could make use of things like IDE syntax highlighting, and in some cases, auto-fixing.
Glad to see they have a way to run it with Docker, because I have too many bad memories of missing Perl modules when trying to run Perl scripts in the early 2000s (but that also tells you how out of date I am with the Perl world).