I think the real takeaway here is that any part of your underlying technology that you don't thoroughly understand puts you at risk, and that risk goes up significantly in a large-volume environment. It's frequently not possible to have experts on hand for everything, but you need to make sure you have them for your critical stuff. They don't necessarily have to be on your payroll either; really good consulting companies can do a lot for you as well. (Disclosure: I'm a partner at a consulting company.)
When it comes to Postgres and data, I'll just say what I've been saying for ~20 years:
"All databases suck, each one just sucks differently."
Also, note that long-running transactions can prevent cleanup of tuples. Look for old xact_start values of non-idle queries in pg_stat_activity (particularly "idle in transaction" connections) and old entries in pg_prepared_xacts.
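For example, queries along these lines (a rough sketch, not from the original comment) will surface both kinds of offenders:

    -- sessions with an open transaction, oldest first; "idle in transaction"
    -- sessions show up here because their state is not plain 'idle'
    SELECT pid, state, xact_start, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND state <> 'idle'
    ORDER BY xact_start;

    -- forgotten two-phase-commit transactions also pin old tuples
    SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts
    ORDER BY prepared;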
After all of that discussion of very Postgres-specific limitations, they blame SQL? SQL is just a notation for relational algebra. It is not the problem here. There are better, distributed, scalable, write-optimized implementations of real SQL out there. Disclosure: I work for a company that makes one.
We did not mean to blame SQL so much as PostgreSQL specifically. In our case, however, we want to build our infrastructure on Open Source components, since we are an Open Source project and provide on-premise installations in addition to our cloud-hosted environment, so we do not want to depend on commercial databases.
Wow, this was a great read. Fellow Djangonaut / Pythonista here, and I really enjoy reading David's blog posts. I've learned from and been inspired by him since what seems like early 2011 or so. I also didn't know Armin Ronacher worked for Sentry (re: his comment herein). What a cool company.
Also, great call on truncating that table. I've witnessed situations where nobody has been willing to accept data loss, and the result is a conference call that lasts until 4am and a lot of angry customers.
FWIW, I got the impression that they didn't truncate the table, but just forced an update to its relfrozenxid to advance the freeze age and allow the database to restart. That means there's a risk that rows that were visible at the time would suddenly vanish as the XID advanced and their xmin became "the future". Eventually the record of those commits would vanish and presumably the rows would then get removed (though I'm not sure on that without checking the code).
I hit this issue all the time and it is never easy or fast to fix once postgres stops accepting writes. If you have a write heavy postgres database, the autovacuum settings are not sufficient. The entire database needs to be vacuumed aggressively.
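If it helps anyone, here's a quick way to see how close each database is to trouble (the query and thresholds are mine, not the parent's):

    -- age() of datfrozenxid counts transactions since the last database-wide freeze;
    -- autovacuum forces anti-wraparound vacuums at autovacuum_freeze_max_age
    -- (200M by default) and the hard shutdown happens around the 2 billion mark
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           round(100.0 * age(datfrozenxid) / 2000000000, 1) AS pct_towards_wraparound
    FROM pg_database
    ORDER BY xid_age DESC;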
PostgreSQL's MVCC model (copy the complete row on any write, clean up old rows some time in the future) might work decently for read-centric workloads, but it's unsuitable for write-heavy stuff.
We're running a setup that employs postgres for a write-heavy (but far from write-only) application, and are experiencing similarly troubling vacuum jobs. We're considering moving back to good old MySQL because of it. Anyone who's been here before and has some insight?
> Anyone who's been here before and has some insight?
Armin from Sentry here. We talked a bit about where we could scale with Postgres, and part of our problem is that we're quite write- and also delete-heavy, which means our vacuums take very long. A potential option would be to partition some of our time-based tables and keep a rolling window over them; then we could throw entire tables away after the hold period, and our vacuums should at least be faster (rough sketch below). We could also move some tables to another database, which might help, but then we'd unfortunately need to deal with higher connection counts.
Downside for us is that we also want to keep the setup straightforward for our on-premise customers, so we need to be careful not to introduce unnecessary complexity.
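For anyone curious what that would look like, a rough sketch using inheritance-based partitioning (table and column names are made up, and this is not the actual Sentry schema):

    -- parent table plus one child per month; inserts get routed to the right
    -- child by a trigger or by the application
    CREATE TABLE events (
        id        bigserial PRIMARY KEY,
        received  timestamptz NOT NULL,
        payload   text
    );

    CREATE TABLE events_2015_07 (
        CHECK (received >= '2015-07-01' AND received < '2015-08-01')
    ) INHERITS (events);

    -- expiring a month after the hold period is a cheap DROP instead of a huge
    -- DELETE that vacuum then has to clean up
    DROP TABLE events_2015_07;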
> We talked a bit about where we could scale with Postgres, and part of our problem is that we're quite write- and also delete-heavy, which means our vacuums take very long.
Is this still a problem after you tuned vacuum to be actually aggressive? It sure isn't perfect, but you can go a long way with just adjusting a couple of settings.
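For reference, the couple of settings I mean are roughly these; the values below are only illustrative, not a recommendation for Sentry's workload:

    -- make autovacuum wake up more often, throttle itself less, and run more
    -- workers in parallel (autovacuum_max_workers needs a restart, the rest
    -- take effect on reload)
    ALTER SYSTEM SET autovacuum_naptime = '15s';
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    ALTER SYSTEM SET autovacuum_max_workers = 6;
    SELECT pg_reload_conf();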
My tables are partitioned daily and I still have had issues. The problem is they all still need to be vacuumed, but at least the individual table vacuums are shorter and it makes it easier to drop older data.
This is really a matter of architecting your database correctly. In all the years I used PostgreSQL for write-heavy and delete-heavy loads, I never had a problem with vacuum but I also understood how the database worked internally and how to design my data models to best fit that when performance mattered. Same thing for other database engines. If you can make it fast on MySQL, it just means you are doing it wrong on PostgreSQL.
There is not a database in existence that allows you to be oblivious to the underlying organization while still giving good write/delete performance. PostgreSQL is no different in that regard.
> If you can make it fast on MySQL, it just means you are doing it wrong on PostgreSQL.
This is quite the statement and inconsistent with your later point. Postgres never updates data in place, and certain workloads can never be as fast on Postgres as on a different storage engine such as TokuDB on MySQL.
They run too slowly because they're background processes. Table partitioning and maintenance_work_mem tuning, along with table-specific vacuum and autovacuum settings, will get you a long way.
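A sketch of what that tuning looks like in practice (the table name and numbers are hypothetical):

    -- per-table overrides: trigger autovacuum at ~1% dead rows instead of the
    -- default 20%, and let vacuum on this table run unthrottled
    ALTER TABLE big_write_heavy_table SET (
        autovacuum_vacuum_scale_factor = 0.01,
        autovacuum_vacuum_cost_delay   = 0
    );

    -- a bigger maintenance_work_mem lets vacuum remember more dead tuples before
    -- it has to make another pass over the indexes
    ALTER SYSTEM SET maintenance_work_mem = '1GB';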
Not really. There are good technical reasons for why most MVCC database engines intentionally avoid using transaction ids with enough bits such that they will never plausibly wraparound. More on that in a moment.
Empirically, you need at least a 64-bit transaction id to practically mitigate the possibility of wraparound; even MVCC systems with 48-bit transaction ids have wrapped around in real systems. Consequently, most databases like PostgreSQL are designed under the assumption that wraparound will happen and provide mechanisms for smoothly managing that in the hands of someone that knows what they are doing. No matter how much automatic safety you build into the database, there are performance optimization knobs that enable an incompetent or unskilled DBA to effectively disable or render ineffective the transaction wraparound management mechanisms. This is clearly explained in database documentation.
So back to why transaction id wraparound exists as a thing:
In an MVCC database, each logical record you store usually has multiple transaction ids associated with it that are stored with the record. Because most relational database records are typically quite small, the transaction id overhead in the heap row headers can be a significant percentage of the total storage cost. If you store transactional information in your indexes as well (some do, some don't) the relative space consumption is even worse.
This has an enormous performance impact. A typical row header in an OLTP database is 20-30 bytes. In cases where implementors have shaved a couple bytes off the row header size with heroic efforts, it has generated a 5-10% performance boost across the board due to the more compact representation. The difference between a 32-bit transaction id and a 64-bit transaction id is at least 8 bytes! Yes, you could use 64-bit transaction ids but at the cost of a database that is much more bloated -- and therefore slower -- than the 32-bit version. Since there are good technical solutions to transaction id wraparound, it would be silly to make everyone suffer from poor performance to keep a few poor DBAs from shooting themselves in the foot.
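To put rough numbers on it (my arithmetic, not the parent's): widening xmin and xmax from 32 to 64 bits costs 8 extra bytes on every heap tuple, which adds up fast on tables with billions of small rows.

    -- back-of-the-envelope: 2 transaction ids * 4 extra bytes each, times 1e9 rows
    SELECT 2 * (8 - 4) AS extra_bytes_per_row,
           pg_size_pretty((2 * (8 - 4))::bigint * 1000000000) AS extra_heap_for_1e9_rows;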
That is not to say you should never use a 64-bit transaction id in a database engine design -- I have, in fact -- but that you should understand the technical tradeoffs and implications.
Is there a reason that you couldn't store the transaction IDs in the row headers in a more compact form - for example using the full 64 bits for txmin, but storing txmax as a 24 bit delta from txmin?
"Additionally we had been running with far too much delay in vacuuming which meant lower load on the system, but more idle time in maintenance."
So vacuum wasn't given the resources to keep up with their load, and it's not clear whether they were supplementing with manual vacuums during quiet times. Nor is it clear when they started reacting, since PostgreSQL (as outlined in the documentation they link) will start squawking about wraparound well before it shuts down; what the monitoring they had in place showed about whether autovacuum/vacuum was keeping up with the workload; or how many locks their application was taking (locks can block autovacuum).
Adjusting the autovacuum settings per table also gives you finer control than the global postgresql.conf parameters, so you can better match the workload of specific tables.
Partitioning would have also helped make the actively written part of the table smaller, and older data could be vacuumed with VACUUM FREEZE, or deleted later. There are extensions to help make that easier.
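As a hypothetical example (the partition name is made up; pg_partman is one extension that automates this kind of thing):

    -- freeze a partition that is no longer being written to, so it stops
    -- contributing to wraparound pressure
    VACUUM (FREEZE, VERBOSE) events_2015_06;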
Regarding monitoring, check_postgres.pl would give them an idea that their vacuuming settings needed adjustment sooner, or their application locking needed to be adjusted to not block autovacuum.
> PostgreSQL ... will start squawking about wraparound
> ...
> the monitoring they had in place
If only there was a tool or service they could use that could monitor log files for signs of problems. (I couldn't resist being a little snarky, the situation seems so perfect.)
All kidding aside, kudos to Sentry for being so candid publicly about their problems. We've seen so many companies avoid providing any technical information. About the best we'll hear in some of these is that the problem was "not terrorism related".
We actually knew about the problem with the delay and had been working to improve it. We were a couple of days away from failing over to the new hardware (safely), and unfortunately we didn't have any early warnings in the logs. I haven't yet looked at why.
We had it at 50ms on the previous setup, though I wish I knew why that value was used. Likely it was a default in the Chef cookbook we forked off of, or we read something that convinced us at the time that it was a good idea.
Shame you're being downvoted. 2 billion is not that big a number, and I found it surprising since other databases use GUIDs, UUIDs, and similar to deal with that problem. There was an unfortunate lack of explanation for that in the article.
Disclosure: I'm the architect for a distributed database. We started building our database because we had similar problems with not being able to accept writes when our Master went down. This is never fun, but you guys seemed pretty well prepared, staying true to your company's motto - thank you for writing this article.
First off, if your data is primarily logs you should not have to rely upon a primary Master. PostgreSQL isn't designed to be masterless, which presents an obvious barrier. If you set up a Master-Master database then any/all of your servers could accept writes, even if another one is down. And since they are logs, you get idempotent behavior for free anyway: everything is a create/insert, so you don't have to worry about concurrent updates to the same record - which is what Master-based systems are designed for.
I actually DO NOT recommend our database (http://gunDB.io/) because we are not stable or production ready in any way, shape, or form. But we're in our seed-stage funding and trying to spend the time exploring problem spaces and designing solutions for them. If you ever have time to chat, please drop me a line at mark@gunDB.io so I can learn from you.