I think the real takeaway here is that any part of your underlying technology that you don't thoroughly understand puts you at risk, and that risk goes up significantly in a large-volume environment. It's frequently not possible to have experts on hand for everything, but you need to make sure you have them for your critical stuff. They don't necessarily have to be on your payroll either; really good consulting companies can do a lot for you as well. (Disclosure: I'm a partner at a consulting company.)
When it comes to Postgres and data, I'll just say what I've been saying for ~20 years:
"All databases suck, each one just sucks differently."
Also, note that long-running transactions can prevent cleanup of tuples. Look for old xact_start values of non-idle queries in pg_stat_activity (particularly "idle in transaction" connections) and old entries in pg_prepared_xacts.
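For example, queries along these lines (a rough sketch, not from the original comment) will surface both kinds of offenders:

    -- sessions with an open transaction, oldest first; "idle in transaction"
    -- sessions show up here because their state is not plain 'idle'
    SELECT pid, state, xact_start, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND state <> 'idle'
    ORDER BY xact_start;

    -- forgotten two-phase-commit transactions also pin old tuples
    SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts
    ORDER BY prepared;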
After all of that discussion of very Postgres-specific limitations, they blame SQL? SQL is just a notation for relational algebra. It is not the problem here. There are better, distributed, scalable, write-optimized implementations of real SQL out there. Disclosure: I work for a company that makes one.
We did not mean to blame SQL so much as PostgreSQL specifically. In our case, however, we want to build our infrastructure on Open Source components, since we are an Open Source project and provide on-premise installations in addition to our cloud-hosted environment, so we do not want to depend on commercial databases.
Wow, this was a great read. Fellow Djangonaut / Pythonista here, and I really enjoy reading David's blog posts. I've learned from and been inspired by him since what seems like early 2011 or so. I also didn't know Armin Ronacher worked for Sentry (re: his comment herein). What a cool company.
Also, great call on truncating that table. I've witnessed situations where nobody has been willing to accept data loss, and the result is a conference call that lasts until 4am and a lot of angry customers.
FWIW, I got the impression that they didn't truncate the table, but just forced an update to its relfrozenxid to advance the freeze age and allow the database to restart. That means there's a risk that rows that were visible at the time would suddenly vanish as the XID advanced and their xmin became "the future". Eventually the record of those commits would vanish and presumably the rows would then get removed (though I'm not sure on that without checking the code).
I hit this issue all the time and it is never easy or fast to fix once postgres stops accepting writes. If you have a write heavy postgres database, the autovacuum settings are not sufficient. The entire database needs to be vacuumed aggressively.
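If it helps anyone, here's a quick way to see how close each database is to trouble (the query and thresholds are mine, not the parent's):

    -- age() of datfrozenxid counts transactions since the last database-wide freeze;
    -- autovacuum forces anti-wraparound vacuums at autovacuum_freeze_max_age
    -- (200M by default) and the hard shutdown happens around the 2 billion mark
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           round(100.0 * age(datfrozenxid) / 2000000000, 1) AS pct_towards_wraparound
    FROM pg_database
    ORDER BY xid_age DESC;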
PostgreSQL's MVCC model (copy the complete row on any write, clean up old rows some time in the future) might work decently for read-centric workloads, but it's unsuitable for write-heavy stuff.
We're running a setup that employs postgres for a write-heavy (but far from write-only) application, and are experiencing similarly troubling vacuum jobs. We're considering moving back to good old MySQL because of it. Anyone who's been here before and has some insight?
> Anyone who's been here before and has some insight?
Armin from Sentry here. We talked a bit about where we could scale with Postgres, and part of our problem is that we're quite write- and also delete-heavy, which means our vacuums take very long. A potential option would be to partition some of our time-based tables and keep a rolling window over them; then we could throw entire tables away after the hold period, and our vacuums should at least be faster (rough sketch below). We could also move some tables to another database, which might help, but then we'd unfortunately need to deal with higher connection counts.
Downside for us is that we also want to keep the setup straightforward for our on-premise customers, so we need to be careful not to introduce unnecessary complexity.
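For anyone curious what that would look like, a rough sketch using inheritance-based partitioning (table and column names are made up, and this is not the actual Sentry schema):

    -- parent table plus one child per month; inserts get routed to the right
    -- child by a trigger or by the application
    CREATE TABLE events (
        id        bigserial PRIMARY KEY,
        received  timestamptz NOT NULL,
        payload   text
    );

    CREATE TABLE events_2015_07 (
        CHECK (received >= '2015-07-01' AND received < '2015-08-01')
    ) INHERITS (events);

    -- expiring a month after the hold period is a cheap DROP instead of a huge
    -- DELETE that vacuum then has to clean up
    DROP TABLE events_2015_07;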
> We talked a bit about where we could scale with Postgres, and part of our problem is that we're quite write- and also delete-heavy, which means our vacuums take very long.
Is this still a problem after you tuned vacuum to be actually aggressive? It sure isn't perfect, but you can go a long way with just adjusting a couple of settings.
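For reference, the couple of settings I mean are roughly these; the values below are only illustrative, not a recommendation for Sentry's workload:

    -- make autovacuum wake up more often, throttle itself less, and run more
    -- workers in parallel (autovacuum_max_workers needs a restart, the rest
    -- take effect on reload)
    ALTER SYSTEM SET autovacuum_naptime = '15s';
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    ALTER SYSTEM SET autovacuum_max_workers = 6;
    SELECT pg_reload_conf();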
My tables are partitioned daily and I still have had issues. The problem is they all still need to be vacuumed, but at least the individual table vacuums are shorter and it makes it easier to drop older data.
This is really a matter of architecting your database correctly. In all the years I used PostgreSQL for write-heavy and delete-heavy loads, I never had a problem with vacuum but I also understood how the database worked internally and how to design my data models to best fit that when performance mattered. Same thing for other database engines. If you can make it fast on MySQL, it just means you are doing it wrong on PostgreSQL.
There is not a database in existence that allows you to be oblivious to the underlying organization while still giving good write/delete performance. PostgreSQL is no different in that regard.
> If you can make it fast on MySQL, it just means you are doing it wrong on PostgreSQL.
This is quite the statement and inconsistent with your later point. Postgres never updates data in place, and certain workloads can never be as fast on Postgres as on a different storage engine such as TokuDB on MySQL.
They run too slowly because they're background processes. Table partitioning and maintenance_work_mem tuning, along with table-specific vacuum and autovacuum settings, will get you a long way.
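A sketch of what that tuning looks like in practice (the table name and numbers are hypothetical):

    -- per-table overrides: trigger autovacuum at ~1% dead rows instead of the
    -- default 20%, and let vacuum on this table run unthrottled
    ALTER TABLE big_write_heavy_table SET (
        autovacuum_vacuum_scale_factor = 0.01,
        autovacuum_vacuum_cost_delay   = 0
    );

    -- a bigger maintenance_work_mem lets vacuum remember more dead tuples before
    -- it has to make another pass over the indexes
    ALTER SYSTEM SET maintenance_work_mem = '1GB';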
Not really. There are good technical reasons for why most MVCC database engines intentionally avoid using transaction ids with enough bits such that they will never plausibly wraparound. More on that in a moment.
Empirically, you need at least a 64-bit transaction id to practically mitigate the possibility of wraparound; even MVCC systems with 48-bit transaction ids have wrapped around in real systems. Consequently, most databases like PostgreSQL are designed under the assumption that wraparound will happen and provide mechanisms for smoothly managing that in the hands of someone that knows what they are doing. No matter how much automatic safety you build into the database, there are performance optimization knobs that enable an incompetent or unskilled DBA to effectively disable or render ineffective the transaction wraparound management mechanisms. This is clearly explained in database documentation.
So back to why transaction id wraparound exists as a thing:
In an MVCC database, each logical record you store usually has multiple transaction ids associated with it that are stored with the record. Because most relational database records are typically quite small, the transaction id overhead in the heap row headers can be a significant percentage of the total storage cost. If you store transactional information in your indexes as well (some do, some don't) the relative space consumption is even worse.
This has an enormous performance impact. A typical row header in an OLTP database is 20-30 bytes. In cases where implementors have shaved a couple bytes off the row header size with heroic efforts, it has generated a 5-10% performance boost across the board due to the more compact representation. The difference between a 32-bit transaction id and a 64-bit transaction id is at least 8 bytes! Yes, you could use 64-bit transaction ids but at the cost of a database that is much more bloated -- and therefore slower -- than the 32-bit version. Since there are good technical solutions to transaction id wraparound, it would be silly to make everyone suffer from poor performance to keep a few poor DBAs from shooting themselves in the foot.
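To put rough numbers on it (my arithmetic, not the parent's): widening xmin and xmax from 32 to 64 bits costs 8 extra bytes on every heap tuple, which adds up fast on tables with billions of small rows.

    -- back-of-the-envelope: 2 transaction ids * 4 extra bytes each, times 1e9 rows
    SELECT 2 * (8 - 4) AS extra_bytes_per_row,
           pg_size_pretty((2 * (8 - 4))::bigint * 1000000000) AS extra_heap_for_1e9_rows;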
That is not to say you should never use a 64-bit transaction id in a database engine design -- I have, in fact -- but that you should understand the technical tradeoffs and implications.
Is there a reason that you couldn't store the transaction IDs in the row headers in a more compact form - for example using the full 64 bits for txmin, but storing txmax as a 24 bit delta from txmin?
"Additionally we had been running with far too much delay in vacuuming which meant lower load on the system, but more idle time in maintenance."
So vacuum wasn't given the resources to keep up with their load, and it's not clear whether they were supplementing with manual vacuums during quiet times. Nor is it clear when they started reacting, since PostgreSQL (as outlined in the documentation they link) will start squawking about wraparound well before it shuts down; what the monitoring they had in place showed about whether autovacuum/vacuum was keeping up with the workload; or how many locks their application was taking (locks can block autovacuum).
Adjusting the autovacuum settings per table also gives you finer control than the global postgresql.conf parameters, so you can better match the workload of specific tables.
Partitioning would have also helped make the actively written part of the table smaller, and older data could be vacuumed with VACUUM FREEZE, or deleted later. There are extensions to help make that easier.
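As a hypothetical example (the partition name is made up; pg_partman is one extension that automates this kind of thing):

    -- freeze a partition that is no longer being written to, so it stops
    -- contributing to wraparound pressure
    VACUUM (FREEZE, VERBOSE) events_2015_06;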
Regarding monitoring, check_postgres.pl would give them an idea that their vacuuming settings needed adjustment sooner, or their application locking needed to be adjusted to not block autovacuum.
> PostgreSQL ... will start squawking about wraparound
> ...
> the monitoring they had in place
If only there was a tool or service they could use that could monitor log files for signs of problems. (I couldn't resist being a little snarky, the situation seems so perfect.)
All kidding aside, kudos to Sentry for being so candid publicly about their problems. We've seen so many companies avoid providing any technical information. About the best we'll hear in some of these is that the problem was "not terrorism related".
We actually knew about the problem with the delay and had been working to improve it. We were a couple of days away from failing over to the new hardware (safely), and unfortunately we didn't have any early warnings in the logs. I haven't yet looked at why.
We had it at 50ms on the previous setup, though I wish I knew why that value was used. Likely it was a default in the Chef cookbook we forked off of, or we read something that convinced us at the time that it was a good idea.
Shame you're being downvoted. 2 billion is not that big a number, and I found it surprising since other databases use GUIDs, UUIDs, and similar to deal with that problem. There was an unfortunate lack of explanation for that in the article.
Disclosure: I'm the architect for a distributed database. We started building our database because we had similar problems with not being able to accept writes when our Master went down. This is never fun, but you guys seemed pretty well prepared, staying true to your company's motto - thank you for writing this article.
First off, if your data is primarily logs you should not have to rely upon a primary Master. PostgreSQL isn't designed to be masterless, which presents an obvious barrier. If you set up a Master-Master database then any/all of your servers could accept writes, even if another one is down. And since they are logs, you get idempotent behavior for free anyway: everything is a create/insert, so you don't have to worry about concurrent updates to the same record - which is what Master-based systems are designed for.
I actually DO NOT recommend our database (http://gunDB.io/) because we are not stable or production ready in any way, shape, or form. But we're in our seed-stage funding and trying to spend the time exploring problem spaces and designing solutions for them. If you ever have time to chat, please drop me a line at mark@gunDB.io so I can learn from you.