How Postgres Makes Transactions Atomic

brandur · on Aug 16, 2017

Author here with just a quick note that writing this was a pretty great learning experience. Postgres has always been the ultimate black box — it does some amazing things, but I generally have no idea how.

I read a lot of Postgres code to get this finished, and I'm happy to say that for a codebase with so many authors, the quality of Postgres' is very high. Like any complex program it can be a little hard to trace through, but most of the naming is pretty self-explanatory, and it comes with some really amazing comments that walk you through the high-level design. I'd definitely recommend taking a look if anyone is curious.

ralusek · on Aug 16, 2017

Thanks for the writeup, it's at the necessary level of depth to convey the important concepts. I didn't have time to read through all of it yet, so maybe this is addressed, but there has always been something about transactions that I don't understand. How do they accomplish queries that rely on indices within a transaction? If they're not keeping a copy of the index for that transaction, but you wanted to do a sorted query that may or may not include some of your newly inserted rows prior to committing, do they simply do the query without the use of indexes? Do they do the query with indexes to create a temporary table, then run it again non-indexed on the temporary table including your new records? It would seem like that would have many edge cases to address.

felixge · on Aug 16, 2017

Not OP, but I think the answer is that PostgreSQL updates the the index right away, which points to the heap tuple (and it's visibility information). This index value can already be accessed by concurrent transactions, but they decide if its visible or not by looking at the xmin/xmax of the heap tuple it belongs to. If the transaction is rolled back, it's up to VACUUM to remove both the dead heap tuple, as well as its index tuples.

AFAIK there are also optimizations that allow queries to avoid having to look up the individual heap tuples for visibility information by marking the entire page on the index as visible (useful for index-only scans). Yet other optimizations exist to not require updating indexes when updating existing rows where none of the indexed columns change (HOT).

Maybe somebody with a more in-depth understanding can comment, but hopefully the above is somewhat correct :).

petergeoghegan · on Aug 17, 2017

That is correct.

There is one small inaccuracy in your summary, which is that a structure called the visibility map summarizes if a heap page is all visible; that's how index only scans work. At no point do we mark an index page as all-visible, because a HOT update would invalidate that.

felixge · on Aug 17, 2017

Thanks for the clarification about the visibility map.

That being said, I don't understand why HOT would interfere with a hypothetical mechanism for marking index pages as all-visible. If a tuple gets a HOT update, I think the index values should remaing unchanged? And if they do, shouldn't their visibility on an all-visible index page remain unchanged as well?

Unrelated: Thank you so much for your work on "UPSERT" :)

petergeoghegan · on Aug 17, 2017

I guess that I said HOT UPDATE because you talked about HOT. What you describe goes against how visibility information is maintained. There is an interlock between indexes and tables for VACUUM that would have to be considered. It would be hard to set an all-visible bit on a leaf page in a race-free manner.

A much simpler way of putting it would be that deletes would have to touch indexes if summarized visibility info was stored in indexes, even in cases where overall there is hardly any benefit from putting per-page visibility information in index pages. It's not impossible to make something like this work, but it's not obviously compatible with the current design.

I'm glad that you found UPSERT useful.

felixge · on Aug 17, 2017

I wasn't really proposing to change how visibility information is maintained, my grasp on the internals is far to weak for that :). I just tried to make sure I understood your comment, which I think I do now. So thanks again for another insightful reply :)

skyde · on Aug 16, 2017

I think what your looking for is called range lock. When a query sort or filter rows using a range predicate. If using repeatable read isolation. A range lock is created to detect conflict with another transaction inserting new matching row.

petergeoghegan · on Aug 17, 2017

That's not how postgres works. You're talking about 2PL. Serializable isolation level in postgres does have something called predicate locks, but those are actually quite a different thing.

skyde · on Aug 17, 2017

No I was talking exactly about Predicate Locking. They detect writes which conflict with earlier reads I dont think this could be called 2PL because it's only used to detect conflict.

Correctness requires that any insert into an index generate a rw-conflict with a concurrent serializable transaction if, after that insert, re-execution of any index scan of the other transaction would access the heap for a row not accessed during the previous execution.

felixge · on Aug 16, 2017

Thanks for writing this up. As somebody who also likes to look into the internals of the PostgreSQL black box from time to time, posts like this are much appreciated :).

munro · on Aug 16, 2017

> https://brandur.org/assets/postgres-atomicity/heap-tuple-vis...

These diagrams are really beautiful, I love the subtle separation in the lines showing that they're made of hypens. Is there some software that makes these easy to generate?

jdelaney · on Aug 16, 2017

I don't know if this is what the author used but I use Monodraw[0] to make diagrams like this.

[0] https://monodraw.helftone.com/

willglynn · on Aug 16, 2017

I had the same question and dug around in the repo to find out. It is indeed Monodraw:

https://github.com/brandur/sorg/tree/master/content/raws/pos...

I'm excited to use this tool.

pritambaral · on Aug 17, 2017

I wish I could use it on Linux.

deathanatos · on Aug 16, 2017

While pretty, whatever generated it appears to have removed the semantic information that it's text from it. (All of the "characters" are actual SVG paths, not actual text.)

colanderman · on Aug 17, 2017

Yes I'm disappointed that the image isn't made from monospace Unicode box drawing [1] and block element [2] characters.

[1] http://www.unicode.org/charts/PDF/U2500.pdf

[2] http://www.unicode.org/charts/PDF/U2580.pdf

sorentwo · on Aug 17, 2017

This is precisely the kind of article I hope for when I browse the front page.

Thank you for all the effort you put into writing this. It is clearly the product of a lot of effort and craft.

aneutron · on Aug 16, 2017

As a beginner in the software engineering realm, I cannot thank people who give these kind of write-ups enough. It helps me explore patterns and beautiful pieces of engineering. On a side note I wonder if the functions are really named like that or did the author rename some for clarity?

aidos · on Aug 16, 2017

Postgres has a super readable codebase and the functions in the article seem to come directly from the code. When you understand the general model of how postgres works it's fairly easy to browse the code. Having said that I haven't look at the deeper structures at all before and have mostly just perused the query planner / executor code.

We were looking through it earlier to see why our nested loop nodes where performing badly — which ultimately made us realise we were just misreading the EXPLAIN output :-)

cheez · on Aug 16, 2017

May I recommend the following readings:

- http://www.aosabook.org/en/index.html - https://www.amazon.com/Beautiful-Architecture-Leading-Thinke...

combatentropy · on Aug 17, 2017

Thank you for the article, but I do have one quibble:

  > We could ensure that any process takes out an exclusive lock
  > on a file before reading or writing it, or we could push all
  > operations through a single flow control point so that they
  > only run one at a time. Not only are these workarounds slow,
  > but they won’t scale up to allow us to make our database fully
  > ACID-compliant

As much as I like Postgres, I also like SQLite, which uses a simple file lock and yet is utterly acidic (https://www.sqlite.org/transactional.html). From SQLite's FAQ:

  > SQLite allows multiple processes to have the database file
  > open at once, and for multiple processes to read the database
  > at once. When any process wants to write, it must lock the
  > entire database file for the duration of its update. But that
  > normally only takes a few milliseconds. . . .
  >
  > If your application has a need for a lot of concurrency, then
  > you should consider using a client/server database. But
  > experience suggests that most applications need much less
  > concurrency than their designers imagine.

--- https://www.sqlite.org/faq.html#q5

felixge · on Aug 17, 2017

AFAIK SQLite doesn't achieve atomicity by virtue of its global write lock. It does so by using a rollback journal (default) or WAL.

So it's not really all that different from PostgreSQL in that regard and the authors point remains valid :).

You might also enjoy this talk by the main author of SQLite: https://www.youtube.com/watch?v=ZvmMzI0X7fE (SQLite: Protégé of PostgreSQL)

combatentropy · on Aug 17, 2017

The paragraph I quoted is talking about two things, one right after the other: concurrency and acidity.

First, it talks about concurrency. It says that a naive solution to concurrency is file locking. It says this is slow. So I first trot out SQLite as an example that is not slow.

It says, "Not only are these workarounds slow, but they won't scale up to allow us to make our database fully ACID-compliant." To me it sounded like he said choosing file-locking for concurrency will mop you into a corner, that you won't ever become ACID, regardless of what you tack on, like a journal.

One other clarification about concurrency: SQLite says it's fast enough for most websites. Then it turns around and says that if you need a lot of concurrency, you should go with another database. The contradiction evaporated when it dawned on me that when database makers say "concurrency," they're talking about many users at a time connecting to a database directly: picture a bunch of SQL gurus connecting directly to an SQLite file and issuing raw SQL statements. Or replace the gurus with several "application" servers that all write to a database on yet another server. For that, SQLite would be bad, we all agree, including SQLite.

But that's not what most of us, the 99% of app developers, are thinking of when we hear "concurrency." We're thinking of several people hitting our website. In that case, SQLite would still hold up (https://sqlite.org/whentouse.html, see Server-side database).

Anyway, again, I'm not against Postgres. It's what I use. I'm not talking about SQLite because I'm invested in it but because I admire it. It's just an example of a simple solution that meets many people's needs, without resorting to more complexity.

kuschku · on Aug 17, 2017

> But that's not what most of us, the 99% of app developers, are thinking of when we hear "concurrency." We're thinking of several people hitting our website. In that case, SQLite would still hold up (https://sqlite.org/whentouse.html, see Server-side database).

It doesn’t. At all.

The quassel project supports PostgreSQL and SQLite as backends.

And, as expected when you use a database as log storage for an IRC bouncer, you end up with many concurrent threads writing and reading to the same table.

The result is that if a user is reading a table right now, no one can write to it - and that, in turn, means that the thread receiving messages (and responding to PINGs, which is required to keep the connection to the IRC server alive!) gets blocked, and has to wait on the threads currently requesting messages.

So if you connect and set your client to, on connection to the bouncer, request the last 500 messages for each channel, you might actually time out. Yes. When you have enough data, then connecting to the bouncer can kill the bouncers other connections.

If you have more than one thread doing writes or reads – in a usual web application you’ll have dozens or hundreds of worker threads reading and writing concurrently – then SQLite just isn’t useful.

Disclaimer: Not speaking for the quassel project here. This comment is just my opinion.

combatentropy · on Aug 17, 2017

  > if a user is reading a table right now,
  > no one can write to it

In 2010 SQLite came out with a mode that keeps the writer from blocking readers: https://www.sqlite.org/wal.html

  > > 99% [...] SQLite would still hold up
  > an IRC bouncer

I would consider a chat client one of the 1%. Although there are many chat clients, and other examples that we could think of, there are thousands of websites and apps of the "small to medium" size that SQLite might be a good fit for.

https://www.sqlite.org/features.html

kuschku · on Aug 18, 2017

> I would consider a chat client one of the 1%. Although there are many chat clients, and other examples that we could think of, there are thousands of websites and apps of the "small to medium" size that SQLite might be a good fit for.

The issue with those is still that if you have a too high writing load, SQLite will become a bad fit. It’s perfect for any low writing, high read situations, as many websites are, but for more dynamic applications it seems to be less usable.

Also, the FAQ of SQLite still lists

> SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writer queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.

felixge · on Aug 17, 2017

I admire SQLite as well. It's a fantastic piece of software with many use cases.

You're correct that there are some mostly-ready-only situations where a file-lock isn't a problem (perhaps a small CMS, blog or similar).

But using coarse grained locking often does mop you into a corner when it comes to concurrency. Just look at MongoDB (older versions had a global write lock), Python (GIL), etc.. The sibling post of this comment gives a real-world SQLite example. Anyway, you're right that coarse grained locking has nothing to do with atomicity itself. I may have misread your first comment, in retrospect you may have meant to say exactly that.

combatentropy · on Aug 17, 2017

  > in retrospect you may have meant to say exactly that

Yes, but in retrospect I can see that maybe I should have more clearly separated the two questions. Thank you for the video. I watched it. I didn't know that SQLite looked so often at Postgres. But other parts of it were vaguely familiar. Maybe I saw it a long time ago and forgot about it.

  > there are some mostly-ready-only situations where a file-lock
  > isn't a problem (perhaps a small CMS, blog or similar)

I haven't used SQLite much, but it promises greater size and speed than most people would guess. As you can see from its website and the video you shared, D. Richard Hipp keeps tabs on where it's being used. "Developers report that SQLite is often faster than a client/server SQL database engine [when used as the database back-end on an application server]" (https://www.sqlite.org/whentouse.html) --- which is exactly what most people use a database for.

  > The sibling post of this comment gives a real-world SQLite example

--- of an application that I wouldn't use SQLite for. His mention of the writer blocking readers suggests that he wasn't using it in WAL mode. I had thought that the rollback journal and WAL were two names for the same thing. But SQLite didn't come out with the WAL until 2010. As it explains on its website, the rollback journal and WAL are alike, just backwards versions of each other. In rollback-journal mode (still the default) SQLite writes changes directly to its original database file but first backs up a snapshot of the data to the journal. With WAL mode turned on, the database first writes to the WAL, which eventually writes back to the original database (https://www.sqlite.org/wal.html).

sgt · on Aug 16, 2017

Postgres is entirely designed around the idea of durability, which dictates that even in extreme events like a crash or power loss, committed transactions should stay committed.

Although if I had a $100 for every time I've had DB corruption in Postgres over the years...

That being said, since 9.4 (or maybe 9.5) these incidents have mostly stopped happening and it's been remarkably stable.

orthecreedence · on Aug 16, 2017

Hi, very curious under what situations you had data corruption. I use postgres a lot and haven't experienced (or noticed) this. I'm wondering if I'm using it differently than you or if I've been lucky =]

sgt · on Aug 16, 2017

Most of our cases have been standby servers having corrupted WAL files, to be discovered only when the master has had a physical disk crash or similar.

After a few tough lessons like that we've put monitoring in place for ensuring that the standby and/or PITR servers have testing in place e.g. PITR servers must have uninterrupted WAL sequences.

Most Postgres corruption (on a master) actually happens due to dodgy hardware, but we don't really use bad hardware. In cases where we used VM's we've suspected something fishy in VMWare itself. In one case we suspected the SAN that the VMWare host was using.

So mostly not Postgres' fault. There were however some known replication bugs in the 9.x versions that were quite nasty. I think those bugs have now all or mostly been fixed, so 9.6 and 10 (beta) onwards are extremely stable.

But the risk of HW issues still remain as the biggest reason for postgres corruption.

I cannot emphasize enough how important backups are (both standby servers and PITR) and also to verify that these are indeed valid for that day when you're going to need to use them.

orthecreedence · on Aug 16, 2017

Great response, thank you. Seems mostly replication-related, which admittedly I haven't delved into much. I've used postgres to hold a lot of data, but haven't gotten to the point where it needed to scale out yet. This is great to know for when that starts to be a concern. I'll definitely heed your advice on backups (already do them, but could put more automation into verifying they work).

Manozco · on Aug 16, 2017

You would probably be less rich than if you had 1$ each time MongoDB loses data :P /s

gnaritas · on Aug 16, 2017

> Although if I had a $100 for every time I've had DB corruption in Postgres over the years...

You'd be broke. :)

random_comment · on Aug 16, 2017

That probably depends on a bunch of stuff.

- PG only added checksums in 9.3

- his disks or RAID controller could be shit or he could be using some 'very cool' filesystem

- Some people turn off PG settings for performance that really shouldn't be turned off

- Some people think ECC ram isn't worth the money

- or overclock their CPU on anything other than a gaming machine

Usually it's shit hardware

https://wiki.postgresql.org/wiki/Corruption

christophilus · on Aug 17, 2017

"Every transaction in the database is applied in serial order, with a global lock ensuring that only one is being confirmed committed or aborted at a time." Is this true? I assumed you could transact in parallel, if there was no conflict, e.g. different tables were involved or maybe even different rows.

dullgiulio · on Aug 17, 2017

The confirmation for non-conflicting transaction is a very fast operation, it's not a bottleneck to protect it with a global lock. The whole process of writing the transaction is in parallel, you just have written data that might be discarded if the final commit fails. I don't think you can tell that two transactions do not conflict without holding a lock on the global state.

andrewshatnyy · on Aug 16, 2017

Those diagrams are badass! Anyone knows what was used to make em?

arnarbi · on Aug 17, 2017

Not OP, but I like https://monodraw.helftone.com/

There are a few online variants, e.g. http://asciiflow.com/ and https://textik.com/

For extra credit use these fonts: https://int10h.org/oldschool-pc-fonts/fontlist/

bullen · on Aug 17, 2017

This is a non problem IMO. Why would you ever need to write the same data from two different sources simultaneously? What is the user case?

exabrial · on Aug 16, 2017

Wait, they're -not- atomic? This surprises me because of the number of comparisons between MySql and Postgres... Comparing apples to oranges

lightbyte · on Aug 16, 2017

You misread the title and didn't see the "How" (I did as well).

exabrial · on Aug 16, 2017

Yep! I did, ooops

exabrial · on Aug 16, 2017

Ok Everyone, I realized my mistake, no need to keep downvoting, sheesh

kuschku · on Aug 17, 2017

This is why sometimes the ability to edit a comment later on (and if it would only be additive editing) would be useful, to correct mistakes made.

pjc50 · on Aug 16, 2017

Yes, they're atomic?

aneutron · on Aug 16, 2017

Did you read the article ?