The part of Postgres we hate the most: Multi-version concurrency control (ottertune.com)
257 points by andrenotgiant on April 26, 2023 | 143 comments



I must admit as a web practitioner since 1994 I have a bit of an issue with this:

> In the 2000s, the conventional wisdom selected MySQL because rising tech stars like Google and Facebook were using it. Then in the 2010s, it was MongoDB because non-durable writes made it “webscale“. In the last five years, PostgreSQL has become the Internet’s darling DBMS. And for good reasons!

Different DBs, different strengths, and it's not a zero-sum game as implied. MySQL was popular before Google was born - we used it heavily at eToys in the 90s for massive transaction volume, and replacing it with Oracle was one of the reasons for the catastrophic failure of eToys circa 2001. MongoDB gained traction not because it's an alternative to MySQL or PostgreSQL. And PostgreSQL's market share today is on a par with Mongo's, and both are dwarfed by MySQL, which IMO is the true darling of web DBs given its global popularity.


A non-trivial component of MySQL's popularity was that easy installation (not necessarily administration) and comparatively low resource usage, with good performance at default settings (even today you need to run some basic calculations for Postgres in production, IMO), meant that the cheapest possible dynamic hosting - Linux, Apache, PHP3, and MySQL 3 - was simply the only available option for many. This codified the LAMP stack: people learned from tutorials/courses/word of mouth how to write web apps with PHP and MySQL, used cheap LAMP hosting, optionally installed LAMP servers themselves, etc.

This also led to popularity of bigger reselling setups (I don't miss installing cpanel...) and services like Dreamhost.

MySQL in this way gained a virtuous cycle completely unrelated to Google. Hell, most people I know who dealt with the LAMP space for years never knew Google had anything to do with MySQL (most of the people who knew about it were... Lispers, because of who built the first version of Google Ads).

Even Mac OS X Server shipped with MySQL and PHP because of that, in 2001.


Another factor, besides performance vs earlier versions of Postgres (they're now more at parity), was that Postgres didn't come with replication included. I think that was a big hindrance for adoption during the LAMP stack's heyday.


Honestly, at the time when LAMP was gaining its userbase, a considerable portion of that userbase did not care about replication, because they only had one server.

Replication was something you did when you got successful enough to need it, or were an MSP providing it at a premium to others.


I remember it differently - we needed replication for "hot" backups. At that time, scalability was a major issue - so anyone (including businesspeople) wanted to have a scalable architecture. MySQL spoke to the practical (default install on cPanel hosts, easy replication) and the aspirational (you're going to blow up and need to scale).

Digg.com also had a really influential technical team - hearing about how they did things set a lot of baseline defaults for a lot of people.


maybe you were on the more funded side of history in this. As for me, Digg is way after LAMP got solidly plonked into "what I need for a dynamic website on cheap".

Essentially, start at 2000-2001 and more and more people going into running websites for all kinds of reasons (forums, blogs, webshops, etc. often hosted on low end offerings)


I didn't enter the workforce until 2004, so yeah missed some of the early early days of PHP/MySQL. I used it for government work, was definitely not well funded haha! But I suspect digg started with MySQL b/c of similar reasons as anyone else, then helped amplify the cycle.


"Cheap" is the key word here, and that usually meant shared hosting, which was like 99% MySQL.


> MongoDB gained traction not because it's an alternative to MySQL or PostgreSQL.

Honestly I think it only gained traction because many Node devs refused to learn SQL and the document model is familiar because it's closer to JSON data.

These days Mongo is good but that wasn't the case back 10+ years ago.


Mongo was so comically bad. I remember trying to sort through a slow query and thought: ah ha! I'll just add an index. Unfortunately on that version of Mongo, creating an index would occasionally just crash the server process.

I think Mongo became popular because it's ad tech and those guys knew how to be buzzword compliant. JSON-esque documents are one thing, but Mongo is JavaScript to the core. All of a sudden your JS devs don't have to learn SQL; they can just shit out some queries in JavaScript. Of course that came with some pretty severe drawbacks.


My favourite story about MongoDB is that it was so bad and popular at the same time that when a competitor developed a wire-compatible database that was miles better they simply bought it and released it as the next version of MongoDB.


which db was that?


I think I meant WiredTiger.


This is true. And now MongoDB is miles better.


As I remember, MongoDB got popular before node.js, so there weren't really a lot of backend JavaScript developers out there to make a difference.

We first used Mongo ~11 years ago with Java. For us the benefit was that we could dump unstructured data into it quickly, but still run queries / aggregations on it later.


> A non-trivial component to MySQL popularity was that easy installation

...Along with replication and being joined at the hip to PHP. As to installation, there was a point in time in the early 2000s when you could sudo to root, type 'mysql' and be talking to a live MySQL on most Linux distros that I used. No wonder a lot of people defaulted to it.


Replication came later - but the fact that you could do

  sudo apt-get install mysql-server mysql-client
  sudo -i mysql
and be logged in as admin to the mysql database was indeed a huge reason for defaulting to it.

EDIT: Of course, at that time, there was no Ubuntu teaching everyone to sudo all the time, so drop all instances of sudo and add a su - at start ;)


> Of course, at that time, there was no Ubuntu teaching everyone to sudo all the time

Maybe that's why I am used to logging in as root rather than a user. I started in 1999 and have been surprised how few users now do


Most prod configs I've seen nowadays disable SSHing in as root and password auth, so it just becomes:

  $ ssh user@server
  <rsa key>
  $ sudo -i
  <user pass>
  #


Mysql also came with pretty much any webhost.


That's exactly my point. A lot of people started with dynamic websites by using cheap webhosting that you FTP'ed your PHP files to, using PhpMyAdmin to manage your smallish database, and from 2000-2009 they still formed a strong portion of the market for starting out (I chose 2009 because that's when EC2 became more accessible for this due to RDS).


MySQL has had replication since May 2000.


Replication being easier was a driver for developers defaulting to MySQL.


Yes, replication. MySQL made it dead easy to have DB clusters in minutes.

I wish PostgreSQL made replication and failover as simple as MySQL does. It's always a pain when switching masters back and forth.


> in the 2010s, it was MongoDB because non-durable writes made it “webscale“

I think this is the best video on that topic: https://www.youtube.com/watch?v=b2F-DItXtZs


In the early days of the web, MySQL was much faster than anything else because it was an ISAM file with no support for transactions. That was OK for many people who self-hosted the DB on the not very powerful CPUs of the time.

I remember people stating that transactions are useless, and maybe they are for some workloads - see the success of MongoDB years later.

The transactional engine InnoDB was added in version 3.23 [1] in 2001 [2].

[1] https://en.wikipedia.org/wiki/Comparison_of_MySQL_database_e...

[2] https://en.wikipedia.org/wiki/MySQL


> MongoDB gained traction not because it's an alternative to MySQL or PostgreSQL.

Disagree. It gained traction because it was an alternative to MySQL in the ways that mattered - fast, easy to administer, widely known, good enough. Yes, there are significant differences in the details of what they do - but in terms of someone looking for a backing datastore for their webapp, they're actually competing in a very similar space.


PostgreSQL became the internet's darling DBMS long before that. Oracle's acquisition of MySQL (via Sun, completed in 2010) made people finally take notice of PostgreSQL. Before that, most developers barely knew it existed.


Yeah, Postgresql was the FreeBSD of DBMSes. Solid, conceptually integral, well documented.*

I recall doing an evaluation of open source databases in 2001. MySQL didn't even have row-level locking, let alone any concept of transactions. I summarised it as "easy to use; but only for data you don't care about".

* Not that Postgres (as it was then) was without warts in 2001. A huge one was its "object orientation": table inheritance. What it needed then, and would still be nice to have, is object orientation at data type (column) level, an extension of the SQL domain.


You can create types in PostgreSQL and use them as columns, so you can have your "object" style encapsulation at the column level - for example, a "currency" type that holds both the amount and the currency code, as sketched below.
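A minimal sketch (the type, table, and values here are made up):

  -- A composite "currency" type used as a single column.
  CREATE TYPE currency AS (
      amount numeric(12,2),
      code   char(3)
  );

  CREATE TABLE invoices (
      id    bigserial PRIMARY KEY,
      total currency
  );

  INSERT INTO invoices (total) VALUES (ROW(19.99, 'EUR')::currency);

  -- Access the fields with (column).field syntax.
  SELECT (total).amount, (total).code FROM invoices;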


yeah, i was surprised at the cluelessness of that remark. lamp was definitely not a 'rising tech stars' thing. hopefully the author is more careful about accuracy when it comes to database architecture than when it comes to www history

did google even use mysql? certainly if they did they never talked about it publicly in the early 02000s, and of course facebook didn't even exist then

lj, though, they used the fuck out of mysql

/. originally didn't use a database; i (an ordinary user) accidentally posted an article by trying to post a comment on an article that didn't exist yet; i guess they got appended to the same file. but when it did switch to a database (i don't know, about the time google was founded?) it was of course mysql


MySQL (then Vitess) ran Youtube, but nowadays I do believe most product teams are using Spanner.


Yeah I think (heard anecdotally) both google/YouTube and Facebook (and many others) started with MySQL. Spanner for distributed writes has inspired most implementations although Google is the only one I know about that implements TrueTime (atomic clocks). The same year that the Spanner paper came out (after Percolator) an alternate approach (Calvin) was also published, and some of us are using that (our DB's design is inspired by it, but we've done a lot of enhancements since then).


youtube did, but google didn't buy youtube until three years before the end of 'the 2000s'


>> MySQL which IMO is the true darling of web DB's

A "darling" is something you want to use, not something that you are using. Many do not want to use MySQL due to Oracle control. Postgres is definitely the darling of the past few years.


> Oracle was one of the reasons for the catastrophic failure of eToys circa 2001

Would love to hear a from-the-trenches summary of that.


Part of the popularity of early MySQL was marketing. I hope I'm not wrong here, but there was something written about MySQL people posting misinformation in forums. Another factor was the ease of having it up and running. Another was that, I think, there was some IP address component to setting up users, which made things look complicated.


> there was something written about MySQL people posting misinformation in forums.

They were absolute liars of the first water back in the day, absolutely. In the 3.x era there were claims that transactions were only for people who didn't know how to program! You'd struggle to find most of the absolute nonsense that was being pushed, because it's mostly gone down various memory holes, but it was absolutely breathtaking.


One of the weird things about Postgres MVCC is that it is "optimized for rollback," as one person memorably quipped to me. This is not to imply a design principle; it's more a description of how things ended up, and the general argument behind the quip is that Postgres lacks "UNDO" segments.

On the one hand, this does make the model Postgres uses admirably simple: the WAL is all "REDO," and the heap is all you need to accomplish any kind of read. The trade-off is that data which would normally be copied off to a sequential UNDO log, and then vaporized once the transaction commits and all possible readers have exited, instead remains commingled with everything else in the main database heap, where it has to be fished out again by VACUUM for purging and for figuring out how to reclaim numerical space for more transactions.

There may be other solutions to this, but it's one unusual quality Postgres has relative to other MVCC databases, many of which sport an UNDO log.

There are downsides to UNDO, however: if a read needs an old copy of the tuple, it needs to fish around in UNDO, all the indices and synchronization need to account for this, and if there's a rollback or crash recovery event (i.e. mass-rollback of all transactions open at the time), everything has to be shuffled back into the main database storage. Hence the memorable initial comment: "Postgres is optimized for rollback."


Coming to Postgres, UNDO logs and no vacuum

https://github.com/orioledb/


what is an UNDO log and how does it solve the problem?


In most cases[1], when you update a tuple in Postgres, a new tuple is put somewhere else in the same heap, with different visibility information, "xmin", "xmax". The old tuple remains where it is. Index pointers to it likewise remain unchanged, but a new entry is added for the new tuple. The old version gains an updated "xmax" field indicating that version was deleted at a certain logical point.

Later on, VACUUM has to plow through everything and check the oldest running transaction to see whether the tuple can be "frozen" (old enough to be seen by every transaction, and not yet deleted) or the space reclaimed as usable (deleted and visible to nothing). Index tuples likewise must be pruned at this time.
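To make that concrete, here's a minimal psql sketch (the table is made up) showing the new tuple version appearing at a new ctid while the old one lingers for VACUUM:

  -- Throwaway table with one row.
  CREATE TABLE t (id int PRIMARY KEY, v text);
  INSERT INTO t VALUES (1, 'a');

  -- xmax = 0 means no deleting/updating transaction yet.
  SELECT ctid, xmin, xmax, v FROM t;

  UPDATE t SET v = 'b' WHERE id = 1;

  -- The visible row is now a different tuple at a different ctid; the old
  -- version stays in the heap, invisible, until VACUUM reclaims it.
  SELECT ctid, xmin, xmax, v FROM t;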

In systems with an UNDO log, the tuple is mutated in place and the contents of the old version placed into a sequential structure. In the case where the transaction commits, and no existing concurrent repeatable read level transactions exist, the old version in the sequential structure can be freed, rather than forcing the system to fish around doing garbage collection at some later time to obsolete the data. This could be considered "optimized for commit" instead of the memorable "optimized for rollback."

On the read side, however, you need special code to fish around in UNDO (since the copy in the heap is uncommitted data at least momentarily) and ROLLBACK needs to apply the UNDO material back to the heap. Postgres gets to avoid all that, at the cost of VACUUM.

[1] The exception is "HOT" (heap only tuple) chains, which if you squint look a tiny bit UNDO-y. https://www.cybertec-postgresql.com/en/hot-updates-in-postgr...


Yup. A lot of heavy users of Postgres eventually hit the same barrier. Here's another take from Uber: https://www.uber.com/blog/postgres-to-mysql-migration/

I had a similar personal experience. In my previous job we used Postgres to implement a task queuing system, and it created a major bottleneck, resulting in tons of concurrency failures and bloat.

And most dangerously, the system failed catastrophically under load. As the load increased, most transactions ended up in concurrency failures, so very little actual work got committed. This increased the number of outstanding tasks, resulting in an even higher rate of concurrency failures.

And this can happen suddenly: one moment the system behaves well, with tasks being processed at a good rate, and the next moment the queue blows up and nothing works.

I re-implemented this system using pessimistic locking, and it turned out to work much better. Even under very high load, the system could still make forward progress.

The downside was having to make sure that no deadlocks can happen.


I remember when Uber got roasted by the postgresql mailing list over this: ultimately, a post mortem was done on all of Uber's claims, and it was basically proven that they were incompetent, did not read any available "best practices" guides, did not seek any external help, and treated it like it was some sort of mysql-esque database and used it as wrong as humanly possible.

Uber's workload at the time, ironically, was not enough to make a PostgreSQL server running on moderately decent hardware fall over, if you actually read the manual.

Uber's engineering team will never be able to live this down.


That's still the PostgreSQL problem: it has insane defaults.

https://www.postgresql.org/docs/current/runtime-config-resou... tells you what all the parameters do, but not why and how to change them.

"If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system." Why not set it to 25% of the memory in my system by default, then?

"Sets the base maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files. If this value is specified without units, it is taken as kilobytes. The default value is four megabytes (4MB)." Yes, and? Should I set it higher? When?

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv... hasn't been updated for two years and explains only a handful of parameters.

"If you do a lot of complex sorts, and have a lot of memory, then increasing the work_mem parameter allows PostgreSQL to do larger in-memory sorts which, unsurprisingly, will be faster than disk-based equivalents." How much is a lot? Do I need to care if I'm running mostly OLTP queries?

"This is a setting where data warehouse systems, where users are submitting very large queries, can readily make use of many gigabytes of memory." Okay, so I need to set it higher if I'm running OLAP queries. But how high is too high?

https://wiki.postgresql.org/wiki/Performance_Optimization is just a collection of blog posts written by random (probably smart) people that may or may not be outdated.

So when someone complains their Postgres instance runs like ass and smug Postgres weenies tell them to git gud at tuning, they should be less smug, because if your RDBMS requires extensive configuration to support nontrivial loads, you either make this configuration the default one or, if it's significantly different for different load profiles, put a whole section in the manual that covers day 1 and day 2 operations.
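For what it's worth, the kind of starting point I'd like the docs to spell out looks roughly like this (the numbers are assumptions for a dedicated 16 GB OLTP box, not official recommendations):

  -- ~25% of RAM for shared_buffers, per the docs' own rule of thumb;
  -- this one needs a server restart to take effect.
  ALTER SYSTEM SET shared_buffers = '4GB';
  -- Per sort/hash memory budget; remember it's multiplied by the number
  -- of concurrent operations, so don't go wild.
  ALTER SYSTEM SET work_mem = '32MB';
  SELECT pg_reload_conf();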


Defaults are hard to change because it makes upgrading even scarier


If you're running pg at scale, it pays to have at least one person familiar with the config file.

That the defaults don't handle top users is hardly an issue.


That's not how I remember it.

> The Uber guy is right that InnoDB handles this better as long as you don't touch the primary key (primary key updates in InnoDB are really bad).

> This is a common problem case we don't have an answer for yet.

It's still not how I remember it.

Quote from https://www.postgresql.org/message-id/flat/579795DF.10502%40...

I still prefer Postgres by a long way as a developer experience, for the sophistication of the SQL you can write and the smarts in the optimizer. And I'd still pick MySQL for an app which expects to grow to huge quantities of data, because of the path to Vitess.


Migrating the entire workload is way more fun and exciting than reading the manual. How else are you going to demonstrate your impact!


Does anyone have a link to that mailing list thread to share?


From what I can Google it seems to be the opposite of that, where they acknowledged Postgres's shortcoming in the mailing list:

https://www.reddit.com/r/programming/comments/4vms8x/why_we_...

https://www.postgresql.org/message-id/5797D5A1.5030009%40agl...


I didn't look at the reddit link but the full mailing list thread is more nuanced than that: https://www.postgresql.org/message-id/flat/579795DF.10502%40...


I’ve never heard of this, it sounds fun but I won’t take it at face value without a source


Do you have a link to the mailing list discussion?


>> In my previous job we used Postgres to implement a task queuing system, and it created a major bottleneck, resulting in tons of concurrency failures and bloat

Yet, every month or two an article about doing exactly this is upvoted to near the top of HN. It can of course work, but it might be hard to replace years later once "barnacles" have grown on it. Every situation is different of course.


I've built this queue system probably 5 times; the first 2-3 were failures, as consumer concurrency was 1 without us noticing for hours. The bloat comes from updating the work rows instead of deleting them, I assume - it did for me. There are definitely many ways to not do it right, but it kinda works.


Yeah, it is strange that hokey, home grown solutions built on top of Postgres are suddenly in vogue.


skip locks are the secret for queues in pg. did you use skip locks?
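For anyone following along, the usual FOR UPDATE SKIP LOCKED pattern looks roughly like this (the table and column names are made up):

  BEGIN;
  -- Claim one pending job without blocking on rows other workers have locked.
  SELECT id, payload
    FROM jobs
   WHERE status = 'pending'
   ORDER BY id
   LIMIT 1
     FOR UPDATE SKIP LOCKED;
  -- ... do the work in application code, then remove the row ...
  DELETE FROM jobs WHERE id = 42;  -- the id claimed above; deleting keeps bloat down
  COMMIT;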


> Another problem with the autovacuum in PostgreSQL is that it may get blocked by long-running transactions, which can result in the accumulation of more dead tuples and stale statistics. Failing to clean expired versions in a timely manner leads to numerous performance problems, causing more long-running transactions that block the autovacuum process. It becomes a vicious cycle, requiring humans to intervene manually by killing long-running transactions.

Oh man, a previous company I worked at had an issue with a hot table (frequent reads + writes) interfering with autovacuum. Many fires over a six month period arose from all of that. I was (luckily) only on an adjacent team, so I don't know the details, other than vacuums taking over 24 hours! I'm sure it could have been prevented, but it seemed horrible to debug
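In case it's useful, a query along these lines against pg_stat_activity is one way to spot the transactions holding back the vacuum horizon (the 5-minute threshold is arbitrary):

  SELECT pid, usename, state,
         now() - xact_start AS xact_age,
         left(query, 60)    AS current_query
    FROM pg_stat_activity
   WHERE xact_start IS NOT NULL
     AND now() - xact_start > interval '5 minutes'
   ORDER BY xact_start;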


yeah, this is called "cancellation." Autovacuum is very polite and tries to let go of a lock when there's a conflict. So it lets go, over and over, until it triggers a heuristic deciding "no, not succeeding in this session could be dangerous!" and then people begin to notice it.

Last I checked (a few years ago, so things may have changed), the theory behind the autovacuum heuristics hadn't changed much since the turn of the millennium; they're probably about due.


That's interesting, MVCC was the thing that drew me to Postgres to begin with!

Way back I was working on an in-house inventory app written in Visual Basic against SQL Server 2000, I think. That one just put locks on tables. It had the "charming" characteristic that if you weren't very, very careful with Enterprise Manager, loading a table in the GUI put a lock on it and just kept on holding it until that window was closed.

Then the running app would eventually snag on that lock, maybe keep holding some other lock that something else would snag on, and 5 minutes later I'd hear one of the operators screaming "Nothing is working! I can't take any orders!" from the room next to me.


MVCC and optimistic concurrency control are very pleasant to work with for anyone who spent a decade with manually locking SQL databases. It takes the human error and developer mistakes away from the process, or protects against them. You can still slow down your queries with deadlocks, but at least you cannot corrupt your data by accident.

Though other optimistic concurrency control schemes can be better, PostgreSQL was at the right place at the right time when people started to leave MySQL.


There are many alternatives to table locking, including more conventional row locks.

MVCC is great, but this article does identify some of the puzzling design choices of the Postgres implementation. The index problems are particularly bad, and seemingly avoidable.


READ COMMITTED SNAPSHOT


Clever clickbait - of course at the end of the article they offer a solution to the problem they have overhyped: their product (and of course it's AI-enhanced).


I asked them to take that bit out at the end and it looks like they did.

General remark for startups wanting attention on HN: it's not good to end an interesting article with a call-to-action that makes your article feel like an ad. Readers who read to the end experience that as a bait-and-switch and end up feeling betrayed.

What works much better is to disclose right up front what your startup is and how it's related to the article content. Once you've gotten that out of the way, the reader can then dive into the (hopefully) interesting content and end the article on a satisfying note.

Btw, I have a set of notes on how to write for HN that I'm working (slowly) on turning into an essay. If anyone wants a copy, email me at hn@ycombinator.com and I'll be happy to send it. It includes the above point and a bunch more.


There's another interesting article on the front page about OAuth that's actually almost exactly the same: a very long article about OAuth implementation. And sadly I knew the ad was coming the whole time, and there it was as the last paragraph. It seems that, unfortunately or fortunately, some of the best really informative intermediate-depth blog posts (read: not Medium surface-level stuff) tend to be adverts by companies offering a very technical product.


I think it's disturbing you asked someone to change their content, and even more disturbing that they complied. You are experienced at moderating Hacker News but have no business being a global censor for content out in the world. This sucks.

As a reader I’d have appreciated the original. And I’d appreciate a nice HN alternative.


Perhaps I should explain. What I actually did was suggest that it would be in their interest to take out the bit that some readers were complaining about, because it felt like an ad at the end. Of course they were free not to follow my suggestion.

I admit that's not precisely how I described it in the GP comment but it never crossed my mind that anyone would care. Commenter objections never fail to surprise!

Edit: I think I was right that it was in their interest as well as all of ours, because earlier the thread was dominated by complaints like this:

https://news.ycombinator.com/item?id=35718321

https://news.ycombinator.com/item?id=35718172

... and after the change, it has been filling up with much more interesting on-topic comments. From my perspective that's a win-win-win, but YMMV.


Even if they decided not to change it, which is fully within their rights, and somehow that got the article moderated off the front page of Hacker News, isn't that still moderating Hacker News? No one is getting censored, just like nobody is entitled to have their article on Hacker News.


> Of course at the end of the article they offer a solution - their product (and of course it’s AI enhanced)

We have been working on automatic database optimization using AI/ML for a decade at Carnegie Mellon University [1][2]. This is not a gimmick. Furthermore, as you can see from the many comments here, the problem is not overhyped.

[1] https://db.cs.cmu.edu/projects/ottertune/

[2] https://db.cs.cmu.edu/projects/noisepage/


Okay, so, Noisepage appears to be open source https://github.com/cmu-db/noisepage/

But I can't find the Ottertune Github page

Is any part of Ottertune open source?


Is there hope of ever seeing Ottertune for MSSQL ?


I've personally run into the problems mentioned in the article many times; unsure it's "overhyped".


It's a problem, but not an AI problem. It has a clear cause and obvious mitigation strategies.


If there was a clear cause and obvious mitigation strategies then they would have been built into Postgres already.


and rhetoric:

So how does one work around PostgreSQL’s quirks? Well, you can spend an enormous amount of time and effort tuning it yourself. Good luck with that.


> and of course it’s AI enhanced

did they mention LLM/ChatGPT?..


In a previous version of the article they concluded by pitching their "AI-powered cloud database tuning" product.


I don't think so


This post has a valid point. But the last line makes it clear why they care so much about it.

Yeah, table bloat and transaction ID wraparounds are terrible, but easily avoidable if you follow a few simple guidelines. Typically, in my experience, the best way to avoid these issues is to set sensible vacuum settings and track long-running queries.

I do hate that some of the defaults in the Postgres configuration are too conservative for most workloads.


> "But making sure that PostgreSQL’s autovacuum is running as best as possible is difficult due to its complexity."

The problem, as the article states it, is that a "sensible" vacuum setting for one table is a terrible setting for another, depending on how large these tables are. On a 100 million tuple table you'd be waiting 'til there were 20 million garbage tuples before taking action.


You can set autovacuum reloptions on a per-table basis, if they differ that much for your use case.
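For example (the table name and thresholds are illustrative, not recommendations):

  -- Vacuum this hot table after roughly 1% dead tuples instead of the
  -- global 20% default scale factor.
  ALTER TABLE hot_table SET (
      autovacuum_vacuum_scale_factor  = 0.01,
      autovacuum_vacuum_threshold     = 1000,
      autovacuum_analyze_scale_factor = 0.02
  );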


> I do hate the some of the defaults in the Postgres configuration are too conservative for most workloads.

it is also black magic to tune them.


Paying an overhead cost of 53 bytes per row is also too expensive for MVCC in my opinion.


What last line? The literal last line is "We’ll cover more about what we can do in our next article."

Do you mean this one?

> At OtterTune, we see this problem often in our customers’ databases. One PostgreSQL RDS instance had a long-running query caused by stale statistics after bulk insertions. This query blocked the autovacuum from updating the statistics, resulting in more long-running queries. OtterTune’s automated health checks identified the problem, but the administrator still had to kill the query manually and run ANALYZE after bulk insertions. The good news is that the long query’s execution time went from 52 minutes to just 34 seconds.


It previously had this closing line, with links to their products (https://web.archive.org/web/20230426171217/https://ottertune...):

> A better approach is to use an AI-powered service automatically determine the best way to optimize PostgreSQL. This is what OtterTune does. We’ll cover more about what we can do in our next article. Or you can sign-up for a free trial and try it yourself.

That was removed after the article was posted to HN, at dang's suggestion - he posted about it elsewhere in these comments.


MVCC for Amazon Redshift;

(pdf) https://www.redshiftresearchproject.org/white_papers/downloa...

(html) https://www.redshiftresearchproject.org/white_papers/downloa...

I've been told, very kindly, by a couple of people that it's the best explanation they've ever seen. I'd like to get more eyes on it, to pick up any mistakes, and it might be useful in and of itself to readers anyway, as MVCC on Redshift is, I believe, the same as MVCC was on Postgres before snapshot isolation.


This paper, at least by my skimming, seems to describe Redshift's historic SERIALIZABLE ISOLATION level, but does not mention Redshift's newer SNAPSHOT ISOLATION capability.

https://aws.amazon.com/about-aws/whats-new/2022/05/amazon-re...

For concurrency scalability, AWS now configures SNAPSHOT ISOLATION by default if you use Redshift Serverless but non-serverless still defaults to SERIALIZABLE ISOLATION.


Yes. I intended to write exactly this at the end of my post, but I managed to word it completely wrongly. The document describes MVCC as it has been in Redshift until about a year ago, when snapshot isolation was introduced.


Not bad but I like this one too

http://www.interdb.jp/pg/pgsql05.html


My main takeaway from this article: as popular as Postgres and MySQL are, and understanding the legacy systems built for them, it will always require deep expertise and "black magic" to achieve enough performance and scale for hyper-scale use cases. It justifies the (current) trend to have DBs built for distributed tx/writes/reads that you don't have to become a surgeon to scale. There are other DBs and DBaaS that, although not OSS, have solved this problem in a more cost-efficient way than having a team of surgeons.


I would argue you handle the hyper-scale use case when you are actually at hyper-scale. Trying to prematurely optimize this is almost always a waste of time, and chances are you will screw it up anyway. Almost nobody gets to that scale anyway. If you do get to that scale, you have the money and resources to fix the problem(s) at that time.


i mean, sort of? There is some subtlety lost in this oft-repeated advice. i've worked at 3 companies now that were initially based on a single RDBMS but have outgrown the scale of what is reasonable to serve off that architecture. They are consumer scale (10s of mill) users, but not hyperscale (IMHO 100m+). The amount of engineering cost to migrate a complicated growing company/product off a mono-db architecture is astounding. Conservatively i'm talking 10+ dev years of effort, at each company. Easily 10s of millions of $$$, maybe 100m+. None of them are "finished". It's really really time consuming and hard, once you have 100s of tables, 100s of thousands of lines of code, dozens of teams, etc.

I'm all about avoiding premature optimization, and it's fine to start with a classic postgres. But please don't cling to that - if you see MVP success and you actually have a reasonable chance of getting to >1mill users (ie, a successful B2C product) please please don't wait to refactor your datastore to a more scalable solution. You will pay dearly if you wait too long. Absolutist advice serves no one well here - it really does depend on what your goals are as a company.


Of course subtlety matters, but as you start scaling and noticing pain points, that is when you start working towards fixing them. First you just throw hardware at the problem and that tends to scale really really well for a really long time. It's pretty rare, even at very large scale that you MUST move off of PG, there are plenty of well tested scaling solutions, if you have the $$$'s to spend.

10+ years of dev work for a few hundred tables worries me a lot. My last conversion was about 20 years of data across a few hundred tables and we did two-way data synchronization across DB products with about 1 month of work, with 2 devs. We kept the sync running for over a year in production because we didn't want to force users over to the new system in a big hurry. We only stopped because the license on the old DB product finally expired and nobody wanted to pay for it anymore.


Yes again the common refrains - just throw hardware at it. I/we of course know this and all the systems I’m referring to did that first until they couldn’t. But you’re kind of missing my point - im saying by the time you are noticing scale pain points it’s often too late. Too late insofar as your system has likely grown so much in breadth (complexity, features, subsystems, lines of code, services, etc) that all depend on this one db. All this vast amount of stuff all written assuming all tables are accessible to everyone. It becomes a tangled web of data access patterns / tables that is very hard to break apart.

Nevermind the other aspect the pat advice doesn't mention - managing a massive single RDBMS is a goddamn nightmare. At a very large scale they are fragile, temperamental beasts. Backups, restores, upgrades all become hard. Migrations become a dark art, often taking down the db despite your best understanding. Errant queries stalling the whole server, tiny subtleties in index semantics doing the same. Yes it's all solvable with a lot of skill, but it ain't a free lunch, that's for sure. And it tends to become a HUGE drag on innovation, as any change to the db becomes risky.

To your other point yes, replicating data “like for like” into another RDBMS can be cheap. But in my experience this domain data extraction is often taken as an opportunity to move it onto a non RDBMS data store that gives you specific advantages that match that domain, so you don’t have scaling problems again. That takes significantly longer. But yes I am perhaps unfairly including all the domain separation and “datastore flavor change” work in those numbers


I have managed and written tooling for RDBMS from dinky GB-sized up to the multi-thousand-shard PB-scale. What you're saying is absolutely true. What a small team with vision can do when they see the ramp coming pays off 100-fold just a year or two in the future.

I think this kind of anticipation was part of Pinterest's early success, for example. They got ahead of their database scaling early and were able to focus on the product and UX.


I think we are basically in agreement about everything, but coming from different perspectives. There is no "right" answer, but pre-mature optimization is almost always the wrong answer.


And what's a more scalable solution? (in your mind)


unfortunately i have no direct experience with anything that i would consider a direct replacement for the generic utility of postgres (or any RDBMS). Mostly I have been involved in moving specific domains to storage technology that has opinions that work well with the problem at hand.

eg if it looks key-value ish, or key + timestamp (eg user transaction table), dynamodb is incredible. Scales forever, never have to think about operations. But not generally queryable like pg.

if it looks event-ish or log-ish, offload to a redshift/snowflake/bigtable. But append only & eventually consistent.

if you really need distributed global mutations, and are willing to pay with latency, spanner is great.

if you can cleanly tenent or shard your data and theres little-to-no cross-shard querying then vitess or some other RDBMS shard automation layer can work.

There are a few "postgres but distributed" dbs maturing now, like cockroach - i havent personally used them at a scale that i could tell you if it actually works or not though. AFAIU these systems still have tradeoffs around table layout and access patterns that you have to think about.


I guess the question is, which MVCC strategy would be the "right" one to pick for a modern relational database? The paper linked focuses on main memory databases, and being main memory allows you to do things you can't do when disk based.


I have the same question. I skimmed through the linked paper for a conclusion; they highlight techniques which can be used to improve things, but don't say which MVCC design to use for a modern database. Maybe I need to do a more careful reading.


I took a look at https://github.com/orioledb/orioledb which is a project attempting to remedy some of Postgres' shortcomings, including MVCC. It looks like they're doing something similar to MySQL with an undo log, as well as some other optimizations. So maybe this is the answer.


IMHO part of the issue is that Postgres was built on the assumption that snapshot isolation would be widely used. I don't think this has proven to be the case.

Snapshot isolation isn't as robust and straightforward as strict serializability, but it also isn't as performant as READ COMMITTED. It seems like the worst of both worlds.


IMHO the problem is that

1) too many people don't understand why they probably should use snapshot isolation (or stricter) and don't understand what guarantees they don't get when read committed is used

2) it's not the default, defaults matter, a lot

3) it makes transactions spuriously fallible when the db can't make sure that committing two parallel write transactions won't break the consistency guarantees; a lot of frameworks don't have the right tools to handle this, people don't expect it, it can make interactions with other systems that are interleaved in the transaction harder, etc.

(as a side note, I assumed you meant REPEATABLE READ when you said snapshot isolation, as it's the least strict isolation level which uses snapshot isolation)


IMO the best thing about snapshot isolation is that it's conceptually easy to understand and reason about.


As an aside, Andy Pavlo (one of the authors here) has his CMU database course videos up on YouTube and they are tremendous. I've spent 2 decades developing web applications but am not exaggerating when I say that I'm 10x more knowledgeable on databases having watched his courses during Covid.


Could you share the links?


https://youtu.be/oeYBdghaIjc

There are multiple years available for his first DB class but that’s the one I watched. I almost called it his ‘basic’ class but there’s literally only like one or two classes on SQL before he dives into all the various layers of the internals.

There’s also a few of his advanced courses. And then you’ll see guest lectures from industry on about every one of the new DB platforms you can think of.

They’re all under “CMU Database Group” on Youtube.

Highly recommend.


You can do the projects (the autograder is public) and join a discord community of non-CMU people that are following along too! e.g., [0] for Fall 2022.

[0] https://15445.courses.cs.cmu.edu/fall2022/faq.html#q8


For all the crap on YT, there’s some real gold if you dig and clue the algo in. The SICP lectures I’d place in this category as well.



This was a fun read. But now I have a couple of questions

1. Since MySQL keeps deltas to save storage costs, wouldn't reads and writes be slower, because now I have to build the full version from the delta?

2. On secondary indexes, they highlight the reads will be slower and also say:

> Now this may make secondary index reads slower since the DBMS has to resolve a logical identifier, but these DBMS have other advantages in their MVCC implementation to reduce overhead.

What are the other advantages they have to make reads faster?

Compared to MySQL, I remember reading that Postgres MVCC lets you alter a table without locking. Now I found out that MySQL also does not require locks. So how do they do it?

Are there any similar posts which explain MySQL MVCC architecture?


1. It's a backward delta. So it's only needed when a transaction touches a row that has been modified by a concurrently running transaction. (Details differing depending on isolation level etc.). Writes can be faster because only the modified attributes need to be written to the delta undo log, not the whole row. I guess reads can be faster too in some cases, e.g. less fragmentation and wasted space (better cache usage,) over time due to in-place updates, and less pointer chasing to find the correct version if the vacuuming isn't keeping up with pruning old versions.

As for MySQL and locks, the original MyISAM table format used locks, but InnoDB tables are MVCC like pgsql.


> Writes can be faster because only the modified attributes need to be written to the delta undo log, not the whole row.

what is delta undo log?


It's an undo log containing the deltas from the latest version (instead of the entire row), so that a previous version can be reconstructed in case of a rollback or if a concurrently running transaction needs the previous version.


Once you've figured out all the points in this article, it's a matter of fine-tuning; it can take some time, but eventually it works. The only thing I still struggle with is table bloat.

On managed Postgres (e.g. GCP, AWS) you pay for the disk, but when you can't run a VACUUM FULL because it locks the table, you end up with a lot of allocated storage for nothing, and you can't shrink the disk size (at least on GCP). Storage is cheap, but it still feels like a waste.
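A rough way to see where dead tuples are piling up, before deciding whether a repack or VACUUM FULL is worth it (the statistics are approximate):

  SELECT relname,
         n_live_tup,
         n_dead_tup,
         pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
         last_autovacuum
    FROM pg_stat_user_tables
   ORDER BY n_dead_tup DESC
   LIMIT 10;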


https://reorg.github.io/pg_repack/

Dead easy to run and no long-held locks


Absolutely critical once you get above a certain table size.


So why did Postgres choose the worst MVCC design compared to MySQL and Oracle? Is it because of legacy reasons or other factors?


Legacy reasons. The idea was that you wouldn't need a WAL because the table itself is the log. And then you could support time-travel queries if you never cleaned up the expired tuples.


_And_ it wasn't even originally designed to be used for concurrency control at all...


Is MVCC actually superior by some other considerations? Less lock contention, transactional DML.


The problem is not mvcc but postgres’ implementation details of it.


In what way? I didn't see anything obviously improper when I learned how serialization isolation worked.


TFA covers its issues? It specifically compares Postgres' implementation to other implementations' (Oracle and MySQL).


Also the paper linked from TFA goes into the various implementation options in more detail: https://db.cs.cmu.edu/papers/2017/p781-wu.pdf


A dedicated full-vacuum process that needs full locking, how little isolation there is between tables, etc.

Lots of the pain points have been mitigated in the last ten years. It is now as simple as other comparably complex DBs can get (i.e. not simple, but you can't find a better product).


Can the MVCC implementation be swapped via Postgres extensions?


I think there is a new project from the Postgres community. They try to replace the storage engine to solve the inefficiency caused by MVCC

https://github.com/orioledb/orioledb


No. It would be major surgery on the internals. See the article for my comment on the attempt to do this with the Zheap project:

https://wiki.postgresql.org/wiki/Zheap


Shame that zheap seems to have fizzled out. Do you think there's any prospect of it being resurrected and eventually mainlined?


Depending on the use case, I'd consider a foreign data wrapper.


Am I correct in thinking that PG's MVCC implementation results in a worse story around offloading some mild OLAP workloads to a replica without affecting the primary? Anecdotally, it seems that MySQL handles this better but I don't understand the internals of both enough to explain why that is.

https://aws.amazon.com/blogs/database/manage-long-running-re...
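For what it's worth, my understanding is the tension shows up in the standby-side knobs: you choose between cancelling replica queries, delaying replay, or feeding bloat back to the primary. A sketch of the two usual settings (the values are arbitrary):

  -- On the standby: let long replica queries finish by delaying WAL replay.
  ALTER SYSTEM SET max_standby_streaming_delay = '5min';
  -- Or: tell the primary not to vacuum away rows the standby still needs,
  -- which moves the bloat problem back onto the primary.
  ALTER SYSTEM SET hot_standby_feedback = on;
  SELECT pg_reload_conf();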


Here is a rebuttal of many of the points raised by Uber against PostgreSQL: https://www.2ndquadrant.com/en/blog/thoughts-on-ubers-list-o...


> Oracle and MySQL do not have this problem in their MVCC implementation because their secondary indexes do not store the physical addresses of new versions. Instead, they store a logical identifier (e.g., tuple id, primary key) that the DBMS then uses to look up the current version’s physical address.

This doesn’t have anything to do with MVCC. I’m sure PostgreSQL could implement an index format that piggybacks on another index rather than pointing at the physical page directly, without overhauling MVCC.


And this is wrong. Oracle stores the physical address in the index. The ROWID is the relative file number (in the tablespace) + the block offset in the file + an index into the block's row directory. The difference is that Oracle rows usually don't move and are updated in place, because the old version's diff goes to the undo segments.


I work for a DB vendor, and ACID compliance (also implemented with MVCC) is a big selling point. Yet most use cases I later see don't require such rigid controls on updates. This means customers are paying for this, as transactionally consistent updates are more expensive than eventually consistent ones.


Question: Why would I need more than one extra version of the same row? I would think that with transactional locking everybody else is waiting on the first update to commit before getting their own changes in, unless the db is somehow trying to lock columns-per-row instead of entire rows.


That would require all queries, including read-only queries, to participate in strict two-phase locking. That has very poor performance under even very mild contention, not to mention all that mutual locking and unlocking overhead between read-only queries is largely unneeded.

So what MVCC databases do instead is keep enough versions to cover the oldest running query. Now read-only queries don't need to hold any locks at all; they just read the newest version older than the transaction id the query started at.
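A tiny illustration of why more than one old version can pile up (two psql sessions; assume a made-up one-row table t(id, v) holding 'a'):

  -- Session A
  BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT v FROM t WHERE id = 1;   -- sees 'a'

  -- Session B, meanwhile, commits one or more updates:
  BEGIN; UPDATE t SET v = 'b' WHERE id = 1; COMMIT;

  -- Session A again
  SELECT v FROM t WHERE id = 1;   -- still sees 'a', so that old version (plus
                                  -- any needed by other snapshots) must be kept
  COMMIT;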


If my query started 1000 ms ago, and every 200ms a transaction completed, I'm perfectly fine with getting results of some/most/all of those 5 commits. I usually don't need the database to enforce a 1000-ms-old snapshot for my own sake, which is why I'm using read-committed isolation instead of repeatable read etc. Are we saying that not enforcing this delay would break the database somehow even if I'm fine with it?

Edit: I should clarify that I recognize the need for one extra version, since read-committed txns shouldn't see it until it is committed. Other writes must wait for the commit until they can write, though - it seems like there's some optimistic-writing thing where we let a bunch of writes queue up for one record, knowing that we're going to have a problem when one of them commits and the others find out they should have waited before trying to write or something, because we didn't force them to acquire a write lock before writing.


It gets complex. There are a lot of tradeoffs between short and long running transactions, and different choices in the design space each have pluses and minuses.

Of course if you're running reduced consistency things are easier. Wrong answers are often faster. But when people select a transactional database, it's usually because they at least need snapshot consistency, and often they require full serializability.


We have MySQL/MariaDB in RDS, and ever since we migrated MariaDB to 10.6.12 we get at least 1 table corruption per day. The only workaround available is to restart the database, just like Windows 95.




