Why Uber Engineering Switched from Postgres to MySQL

jedberg · on July 26, 2016

> MySQL supports multiple different replication modes:

> Statement-based replication replicates logical SQL statements (e.g., it would literally replicate literal statements such as: UPDATE users SET birth_year=770 WHERE id = 4)

Postgres has that too (using a 3rd party tool, but it's an officially supported tool). We were using it on reddit 10 years ago. It caused a lot of problems. I wouldn't call that an advantage for Mysql.

Honestly, reading this it seems like the summary is: "We don't follow great engineering practices so we need a database more forgiving". Which is fine if that's how you want to run your business, but isn't really the death knell for Postgres.

A specific example:

> This problem might not be apparent to application developers writing code that obscures where transactions start and end. For instance, say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.

Your developer should understand database transactions. But you should make it easier for them by abstracting it so that they don't have to. And in this particular case, I'd say they shouldn't be using the database to do locking around sending a receipt. It should be put into a queue and that queue should be processed separately, which avoids the transaction problem altogether.

Negitivefrags · on July 26, 2016

The actual summary of the article is "The design of Postgres means that updating existing rows is inefficient compared to MySQL".

Yes, there were some other points that were just extra annoyances for them but clearly that point was the most important to them. It's what the header image and the first 60% of the article was talking about and yet nobody seems to be engaging with that point in this thread.

Is the design choice bad? They never said it was. It's just an engineering trade off. It's very possible that most workloads benefit from this design. But if you workload involves updating lots of existing rows at large scale, then MySQL is going to be a better choice for you.

jedberg · on July 27, 2016

But it's only an issue if you rely on lots of transactions for data consistency and my point was that it sounds like they are relying on transactions too much which is why they need a more "forgiving" database, which is the part I quoted.

Also they didn't mention anything about the auto vacuumer, which mostly solved the issue they are talking about.

Their lack of mention of the vacuumer and not seeming to know that Postgres supports statement level replication makes me wonder if they took a deep dive into the wrong part of the technology.

Negitivefrags · on July 27, 2016

None of what you said addresses the issue that I (or they) are talking about.

On Postgres an update requires a rewrite of every index of the row.

On MySQL it only requires an update of the indexes that were touched by the update.

If you have a table with 10 indexes then this means doing 10 extra writes physically to the disk.

dialogbox · on July 27, 2016

As I know, not every updates rewrite indexes in PG. It has single page clean up and HOT (heap only tuple) update optimization.

Please refer to 64~ page of https://momjian.us/main/writings/pgsql/mvcc.pdf.

continuational · on July 27, 2016

10 indexes on one table seems a bit much. It sounds like a table that hasn't been normalized.

nsxwolf · on July 27, 2016

Though that is quite proper for a data warehouse.

dikei · on July 27, 2016

A data warehouse should use much fewer transactions though.

stubish · on July 27, 2016

Both vacuuming and logical replication are discussed in the article. In particular, vacuuming is easier with InnoDB since the changed records all exist in the redo log whereas PostgreSQL needs to scan the whole table. pglogical is mentioned for people running PG9.4+ as a way of doing minimal downtime cross version upgrades, which wasn't an option back with PG9.2 unless you go with something like slony1 or londiste.

RedCrowbar · on July 27, 2016

> vacuuming is easier with InnoDB ... PostgreSQL needs to scan the whole table

There's been quite a few improvements to VACUUM in 9.6 [1], including avoiding full-table scans.

[1] https://www.postgresql.org/docs/9.6/static/release-9-6.html#...

garyclarke27 · on July 27, 2016

Agreed, they even admit Postgres is faster for querying because MySQL requires 2 index lookups. I like the immutable data design of Postgres, is more robust and enables fast transactional DDL. My design never deletes or alters a tuple, so all of these problems that Uber is obsessing about, go away. Question is - Why do they need so many updates? Is there a problem with their data capture?

spotman · on July 26, 2016

Try enforcing this on teams that use ORMs like hibernate with 500 developers.

Super, duper, common issue, you will find this at every large shop at some point in its life time, usually around the time of hiring people and expanding extremely fast, and taking on some tech debt.

All functions of extreme scale, hyper growth, and yeah, not following the absolute best practices all the time, but tech debt is like any debt, you get something now, and pay later. If they continue going up and to the right they will be able to afford it.

djsumdog · on July 26, 2016

ORMs have mostly been just painful at ever shop I've been at. I've used ActiveRecord, Squirl, Hibernate, django.db .. they're all various level of suck.

The one huge advantage of an ORM is the ability to support multiple databases, but that only really works if you can do everything using the ORM. The moment you have a function too complex that you need to write some SQL, now you need some case statements and multiple integration tests for all the database your product needs to support.

They remove some boiler plate while adding others. In one of my own projects, I just created several files (pgsql.commands, mysql.commands, etc.), a basic set of classes around them and a base set of commands that will work for all the DBs I wanted to support (so the command files had an inheritance model, albeit only one layer).

With all that being said, most ORMs I've used do have explicit transaction support. I know Squirl had a `transaction {}` block you could wrap commands around. Transactions shouldn't be an excuse. They should be off by default and explicit added around blocks of things that need to be atomic.

> Try enforcing this on teams that use ORMs like hibernate with 500 developers.

I realize this is a hyperbole (I hope) because you really shouldn't have 500 developers all on the same monolithic project (unless you're developing like...the Linux kernel). Getting your team to at least try to implement best practices does take some effort, but with things like weekly demos and code reviewed commits, it's do-able.

paisawalla · on July 26, 2016

I used to be a huge proponent of ORMs everywhere, but I've come to realize that if you're writing your app in such a way that a developer needs to be able to do any arbitrary data fetch or transformation whenever they want, that's your real problem. The set of retrievals and transformations you want to support should be well-defined, and abstracted into a layer whose interface allows the level above it to only think in terms of models and not their backing store.

After you have that, then it doesn't even matter on the backend. The models you present to the layer above can have complex and changing relationships to the actual storage -- maybe they contain data which is derived from what's in the database, but transformed after being fetched so that none of their properties actually correspond to a column or a field in a store. In my experience -- having seen the tragedy that is a Rails project gone full ActiveRecord -- this pattern enforces an excellent separation of concerns and constrains a problem which can otherwise grow unboundedly in complexity.

barrkel · on July 26, 2016

I don't really agree; I think you either need to have a very thin layer between your DB facts and your domain, or else use the DB as a kind of persistence layer for a complex graph.

The latter only really works if you've got a primary implementation language and aren't integrating lots of applications / libraries written in different languages communicating with the same database. You need to go down the SOA / distributed RPC / FFI route to integrate different languages, and that has its own complexities.

Personally I prefer treating the DB as a canonical store of facts. Models with a lot of code are pretty suspect. Retrievals can be tuned to just the facts required, ma'am - you don't accidentally drag in the banana + gorilla + whole forest. Doesn't stop you building a higher-level service layer if that's what you need, either. You'll need that when you scale up anyway; chatty models won't work at that level either.

paisawalla · on July 27, 2016

Yeah, I'm saying that as your app grows out of being a simple CRUD app into something more useful and involved, there will be less of a relationship between what you need to store things efficiently, and what you need to present them well. Your model will become more graph-like, probably. For this reason, patterns designed around AR-style models will fail to scale. I disagree that this only works in a mono-lingual environment, although you will need tooling and infrastructure to support it; a model-centric architecture typically doesn't afford the possibility of multi-lingual support.

The code doesn't go in the models, it goes in the service/arbitration layer. DB as a store of facts is obvious -- DB as a 1:1 representation of what yet-unforseen features, UIs and platforms will need is a naive and limiting assumption. You have to build your application in a way that future product needs won't be constrained by storage and modeling decisions, which is a tension that Rails apps frequently encounter.

mercurial · on July 26, 2016

> The one huge advantage of an ORM is the ability to support multiple databases, but that only really works if you can do everything using the ORM.

It has several advantages: support for multiple databases (which is useful, sometimes), the ability to serialize/deserialize an object graph in one go, and sometimes a decent query builder which lets you compose queries as opposed to concatenating strings.

Unfortunately, it's also terribly easy to destroy performance by using lazy collections configured the wrong way for your use case and not notice it, to the point where I strongly advocate using query builders instead.

> I realize this is a hyperbole (I hope) because you really shouldn't have 500 developers all on the same monolithic project (unless you're developing like...the Linux kernel). Getting your team to at least try to implement best practices does take some effort, but with things like weekly demos and code reviewed commits, it's do-able.

The problem is that when you come in later, that the codebase grew way too fast and the deadlines are tight, retrofitting best practices on an existing ball-of-mud can be daunting.

jlg23 · on July 27, 2016

> It has several advantages: [1] support for multiple databases (which is useful, sometimes), [2] the ability to serialize/deserialize an object graph in one go, and [3] sometimes a decent query builder which lets you compose queries as opposed to concatenating strings.

[1] was the point of the comment you replied to and it provided a very important constraint as well

[2] rarely needed & easy to implement with recursive queries in native SQL

[3] building queries is pretty straight forward; what ORMs usually tend to bring to the table is knowledge about the schema and therefore compile-time error reporting - but this can be done in any language where one has the level of reflection, or, in worst case, by a two-stage compilation process where stage 1 generates code from the schema that can then be used by the compiler for verification in stage 2

mercurial · on July 27, 2016

I don't see how recursive queries are going to help you serialize a new Foo with a new field of type Bar with a new field of type FooBar, each going in a different table. That's what I mean by serializing an object graph.

As for building queries, SQL is straightforward (mostly). The problem is that it composes very badly. Any time you need to implement something like an advanced search (ie, lookup the same information, but with a number of different search criteria only known at runtime), the best you can do is concatenating partial queries and hoping you got the parentheses count right. Not to mention that a query builder will help with stuff the SQL syntax is miserable for, like IN clauses.

jlg23 · on July 27, 2016

> I don't see how recursive queries are going to help you

You construct a graph of objects with very few queries, transform it and write it back; with knowledge of the db-schema this will outperform any ORM-based solution and give much greater flexibility.

> As for building queries [...] The problem is that it composes very badly. [...] the best you can do is concatenating partial queries and hoping you got the parentheses count right.

There is an area between full-fledged ORMs and string concatenation. In a purely functional approach queries are composed by composing functions that compile to queries. Postmodern[1][2] is a good, open source example, though in most commercial projects we just built our own wrappers for the tasks and databases at hand. This also allows for much better performing code since for a lot of tasks hooking up the db-reader to the json emitter without going through object instantiation reduces memory and cpu consumption by an order of magnitude (or two), while in the same project, some code benefits from a OOP approach (for which you just use a reader that constructs the objects on the fly).

[1] http://marijnhaverbeke.nl/postmodern/

[2] http://marijnhaverbeke.nl/postmodern/s-sql.html (yes, it does string concatenation at run time, but it does it for you, you don't worry about getting the parenthesis count right)

CuriouslyC · on July 26, 2016

This is why I like SQL Alchemy. It provides an ORM that acts as a veneer over a very powerful SQL expression library, and you can move back and forth between the two seamlessly, even mixing them in individual statements.

nawitus · on July 26, 2016

That's what I've often wondered about ORMs. First they try to abstract away all the database internals so that the developer hardly even knows what's happening. However, ultimately the abstraction always ends up having some limitations. In the end the internals need to be exposed, but since your code base is still mostly based around the ORM it can end up being a mess.

treve · on July 26, 2016

If you're team is 500 strong you definitely should be at a point where you can do things right. If not, the 'debt' is not mainly technical, it's clusterfuck of bad decision making.

spotman · on July 27, 2016

Yea sure some people in the sea of 500 will be doing elegant clean things, working in the core part of the business ( usually at or near the cash register program ) but if you have reason to have 500 developers you don't simply get the luxury of having every project be written perfectly.

Landing somewhere in-between by getting things done not perfect but shippable, and doing it fast as possible is par for the course ime. The best code is always written the 2nd or 3rd time, never the first.

Especially on the case of hyper growth worrying your competitors like lyft or postmates or amazon might get something pivotal out first.

People have to learn somewhere, usually things like this they learn at scale, on the job.

mamcx · on July 26, 2016

Something that hit me researching how build a relational language and thinking how hard could be to remove SQL and put instead my own flavor, (super-oversimplification):

- SQL "bad" - ORM "good" - NoSql apis "good"

So, the thing is that when facing with SQL "everyone" try to "abstract" it more.

Or instead use NoSql, because is "easier".

Fine.

Then if exist a market demand for a better API for the databases, why the databases guys not DO IT?

Yep, I know SQL is supposely the way for it, but bear with me: I live in the FoxPro era so I know what is code "to the metal" in database without SQL (and it was fine and easy).

If the SQL layer could be optional and more bare layer is provided (you can copy the dbase ideas!) then the problem of ORM mappers could be solved far easier (I imagine!).

How this could be?

With a AST api layer, for example. So I can send:

TABLE "Customer" SELECT "*"

So, imagine a kind of LLVM but for databases...

liuliu · on July 26, 2016

LINQ: https://msdn.microsoft.com/en-us/library/bb397926.aspx

Most of the time, the problem is to "embed" a foreign language (SQL) into an existing one (Python, Ruby, etc.), but you can do it in a proper way such as LINQ, then you have type-checking etc.

fogetti · on Aug 4, 2016

I wish your thinking would be more widespread. I never understood how come that restful HTTP became very popular without having any universal query language around and merely providing an HTTP client with each service built, while in the DB world everybody is obsessed with SQL? Why not just provide the developer with an API/library for the underlying algorithms/datastructures?

For example I hate so much when I have to read mySQL documentation jus to find out what went wrong when I used ORDER BY and the optimization didn't kick in? I mean, if I have to keep in my head the whole optimization mechanism of the DB engine and all the steps of the filesort algorithm going on behind the scenes just to write a single wretched SQL statement, then what's the point of SQL in the first place?

And this was just one example. The truth is that the SQL abstraction is leaking. Most of the commands don't encapsulate anything at all.

avereveard · on July 26, 2016

it's very hard to abstract away from SQL. to be honest, many nosql have integrated concepts similar to grouping, having, joins in their API as well as they matured, because that's what data elaboration needs.

mamcx · on July 26, 2016

But still everyone do it. That must tell something, right?

And I'm not talking about NoSql borrowing some relational concepts, but the opposite, and more directly, the API.

I know that most folks (including the ones that downvote!) have no clue what I'm talking about, because almost nobody (recently) have experience in talking against a database without SQL. Is like if the only way to talk to a NoSql was using Json + REST. That constrain badly your mind.

The inner relational model is far more rich, and simpler, SQL was not designed to be used by developers and it show (specially when you have contrived syntax as with CTE)

And before you ask what is the problem with CTE, is exactly because the way to have that is create a contrived syntax that obscure what is goin on. SQL is too restricted, yet too broad for interfacing.

barrkel · on July 26, 2016

Well, data transformations are much easier to read and write in SQL than e.g. Java, what with temporary collections built up in memory, random maps, lists, etc.

CTEs aren't required very often - you generally only need them for recursive CTEs, and that's iterative retrieval analogous to pointer-chasing. It's typically a sign of a data model that's poorly suited to relational storage, e.g. trees and graphs.

I have issues with the irregularity of SQL syntax - it deeply annoys me that we have both 'where' and 'having' for doing exactly the same thing, one pre-fold and one post-fold - and I don't like my lack of access to the potential the index has for related fetches (window functions are not pleasant to work with) - but mostly my problems come not from lack of access to the relational nature, but lack of power over poor implementation / query planner / etc. details.

mamcx · on July 26, 2016

Agree, but that are problems at the end, when most developers have issues at the start (basics). Baffle me when i read that relational databases are "hard" and you need to teach some basics even for experienced developers. Plus, the thing here is that yeah, everyone know SQL have issues.

So they make a ORM

Then it create worse issues. Is like the Database-side and the app-side are at war (without intention!) with each other and the database-side only concern themselves for the kind of issues that only happened for specialized tasks.

And the app-side still try to interface to the database, but poorly.

----

Articulating this better, I hope: Is like the movement now to improve the syntax and semantics JS. Or like ASM.js.

So, why not have SQL 2.0 with the fixes everyone since 20 or more years ago already know? And the libraries/API made to acknowledge that databases are not primarily used by "end-users" but app developers?

But that is probably like ask why not clean C++... ;)

I know this is just ask for the moon (obviously not exist one "SQL", only several more or less alike implementations), but the trouble with ORM and databases is more than a decade old and the answer is mostly "yep, we have that problem. Move along"

ufmace · on July 27, 2016

I started using CTEs much more often for complex queries actually. IMHO, they're more clear and easier to read than sub-queries, especially when those sub-queries get nested 3 or 4 layers deep.

I agree with the ideas of irregularities. The WHERE vs HAVING doesn't bother me much, and I can't think of a better syntax off hand. My SQL pet peeves are that the SELECT list is at the beginning - I don't know what columns I want exactly until after I type out the JOINs, so I usually type SELECT * and then fill in later. I'd rather put it between WHERE and ORDER BY. I'd also like UPDATE to put the WHERE before the SET, so you don't risk blowing up a ton of data if you forget or miss the WHERE.

Probably a lost cause to get that in, but I would think it wouldn't be too hard to at least support those syntax changes in addition to the current standard.

_lwad · on July 27, 2016

Complex queries using CTEs may be easier for you to reason about when you are writing them, but are a nightmare to understand if you are not the one who wrote them. We have heaps of CTE-intensive technical debt and usually it takes longer to understand what they are doing than it would take to write them from scratch from a specification.

If you are using them to ease your understanding, chances are that you are doing the job in a very memory intensive way. It also tends to be slower because the generated intermediate results do not have indexes nor good statistics that could help the query planner to be efficient. Essentially, you are taking upon yourself the query planner job and assuming you'll do it better yourself.

jrochkind1 · on July 26, 2016

> SQL was not designed to be used by developers

Sure it was. Who else would it have been designed to be used by?

dragonwriter · on July 26, 2016

> > > SQL was not designed to be used by developers

> Sure it was. Who else would it have been designed to be used by?

More accurately than the grandparent: SQL wasn't designed exclusively to be used by developers. It was designed to be accessible to analysts (domain-focused analysts, not systems analysts.)

forgotpwtomain · on July 26, 2016

Yes, but there is no point in blaming the database for that. It's seriously not a database issue. Particularly Postgres has well exposed internals for monitoring and killing of long running transactions (if you are not setting statement timeout).

spotman · on July 26, 2016

Sure, but the show must go on.

mml · on July 26, 2016

Experience with Mysql replication (even simple master/slave) leads me to believe Uber is going to have some rather nasty surprises at some point.

darkr · on July 26, 2016

MySQL has had solid and flexible replication options for a long time. Postgres has only just started to catch up in the last couple of years.

Don't get me wrong, I would generally choose Postgres over MySQL for an RDBMS with replication requirements these days, but I'm not sure I would have made that same descion a few years ago.

There are valid reasons that long established companies such as Google, Twitter, Facebook and countless others chose MySQL as their primary data store.

dijit · on July 26, 2016

it's not that "replication options just started to catch on" with postgresql, it was that when the postgresql team take something into core it has to be fully fleshed out and fully supported with no surprises.

And with that in mind, I would take the slow to deliver replication options postgresql supplies in base over the previous third-party options which where comparable to the built-in options in mysql... and I'd take it over mysql's built-in options.

and I've used both postgresql and mysql in highly transactional/reliable production environments for over 5 years.

kriro · on July 27, 2016

I guess I'll be "that guy" who brings up non-technical concerns. I'd think very long and hard about using MySQL these days. Oracle is pretty much bizzaro treating it and widening the feature gap between the open source and enterprise version. Wikimedia is using MariaDB so I'd guess it works well for certain high scale use cases.

[I know there's different opinions but I'd rather have a fully open database if I have a choice. For the record I default to PostgreSQL]

dunk010 · on July 27, 2016

> There are valid reasons that long established companies such as Google, Twitter, Facebook and countless others chose MySQL as their primary data store.

I think you're misrepresenting things here. While all do use MySQL for some specific tasks, it's clear that all also have many other central datastores which are not MySQL. In Google's case this is a gross overstatement.

niccaluim · on July 27, 2016

Twitter's underlying data stores are and always have been MySQL.

spotman · on July 27, 2016

Not really, YouTube runs on MySQL. Pretty sure it's still the largest video site in the world.

mdani · on July 27, 2016

YouTube runs several MySQL instances behind Vitess (source: http://blog.vitess.io/2015/03/scaling-mysql-in-cloud-with-vi... )

secure · on July 27, 2016

At the same time, YouTube is only one of Google’s services, and there is no evidence that the remaining services run on MySQL as well.

bogomipz · on July 26, 2016

Could you elaborate?

harel · on July 26, 2016

You can read it in their next blog post: "How we migrated back to PostgreSql from mySql after migrating from PostgreSql to mySql". My experience with mySql replication was OK, but I did nothing fancy - just master/slave standard stuff. However the database itself was allowing you to do stuff by default (at the time at least) that is really bad. Not to mention scaling while I was using it was pretty much "you're on your own buddy" kind of deal. Today I'm scared to touch mySql because I don't trust Oracle. I use MariaDB on an old project which used mySql but anything new is postgres.

oneweekwonder · on July 26, 2016

Personally I prefer Postgres. But at the shop we use MySQL. One of the leads personally believe in Percona Server[0] and mentions some of the old mysql folks work there. I like their docs and the idea of a more tuned mysql config.

[0]: https://www.percona.com/doc/percona-server/5.7/index.html

rpcope1 · on July 26, 2016

I can second using Percona; I wasn't much of a believer in MySQL until I inherited a medium sized application using Percona 5.7. Once you get past a few of the initial warts (like making sure the database doesn't silently truncate your data--yuck), it's been really performant, stable and impressive (and we're running a setup with a fairly complex schema with master operating at ~3000 QPS average, a slave actively replicating for read load balancing and backup, and Sphinx for FTS).

Knowing what to be careful with on MySQL, I would certainly consider using the Percona flavor again in a new application.

noir_lord · on July 26, 2016

I hit an in production bug with the silent truncation field thing the other day, old system (2009ish) that was been moved.

Table had a field that was varchar(128) and wasn't used (original filename but system generated a hashed filename and served it with that name (for sharding, lots of files)) so for 5 years MySQL had been silently killing everything > 128 characters and it didn't matter.

Then along comes our hero (me in this story) who does all the upgrades, tests everything thoroughly (he thought) and it's all fine, deploy, test everything fine.

Next morning, 9 bug reports "Can't upload files at all FIXITNOW!!!".

Turns out company was auto generating PDF's from some report software that dumped all the data into the filename and none of them where less than 128 chars.

MySQL 5.7 started not silently dumping data into the bitbucket in the sky and instead threw an error (when quite reasonably you tried to shove 300 characters into a varchar(128)).

TLDR: MySQL doing the right thing broke the broken.

smcleod · on July 26, 2016

Heh, yes came here to say something rather similar...

avereveard · on July 26, 2016

I've tried to setup a replicated postgres with autofailover and it honestly is a pita.

the only sources of failover are rando scripts over the internet, that you have to download hammer in to your version dialect and hope you don't trigger one of the many uncovered failover modes.

sure log shipping works, but that's far, FAR from a working solution. the gap requires ton of development hour, testing etc.

can't really blame people for using things with mature tooling.

(but I didn't switch to MySQL that'd be madness! there are plenty of good, replicated redundant stores out there, both nosql and sql)

fanf2 · on July 27, 2016

Does repmgr handle most of this?

avereveard · on July 27, 2016

most of it. but in the master-slave configuration the clients can only work if connecting to the master, and when the master switch, scripts needs to go to each clients and update their config. not immensely bad if you also use pgbouncer so you can do it on the fly without restarting the whole client, but exceptionally vulnerable to split brains and the like.

snuxoll · on July 29, 2016

I've handled this with a DNS CNAME record that points to the active master, though pgBouncer is a valid solution as well.

_qc3o · on July 26, 2016

Uber engineering in general is extremely disappointing. They consistently oversell/overhype mundane processes and technologies both software and otherwise. I suspect this is mostly because they are driven more by marketing and business needs than actual engineering bottlenecks. That is the only way I can explain all the technological missteps they keep bragging about.

nine_k · on July 26, 2016

Sending an email to a local MTA should be pretty fast.

That is, if you cannot and will not do async stuff in your program, there are ready-made tools that will do that particular thing asynchronously for you, and have been doing so for years (or even decades).

jedberg · on July 26, 2016

True, but really your local MTA is just acting like a specialized queue, since the first thing it will do is send the message to your relay.

nine_k · on July 26, 2016

Yes, the whole point is to avoid building your own queue where a ready-made, purpose-made solution already exists.

jedberg · on July 26, 2016

Well there are pros and cons. If you have already have a queue structure set up in your environment, it may not be a good idea to have to maintain a second queue structure in the form of a bunch of local MTAs. You'll need monitoring (how many messages are queued in each local MTA?) and a way to keep those MTAs up to date.

If you just put an item in to a queue, and then have a specific cluster of machines that does nothing but grab items from that queue and push them through an email infrastructure, then it'll be a lot easier to maintain.

yeukhon · on July 26, 2016

I was wondering how could that happen? It sounds like someone is trying to do two things in parallel.

I would expect

    def my_view(request):
        try:
            receipt = generate_receipt(...)
            receipt.send_email(...)
        except SomeKindOfEmailError as e:
            # okay do something else

and this should be synchronous and thus blocking. So to not block, they either wrote co-routines (asynchronous) or execute things in parallel. Have I interpreted their problem incorrectly?

jedberg · on July 26, 2016

Open db connection

get data for receipt

generate receipt

send email

write success to database

close connection

To a junior programmer this would probably look reasonable, and to be fair, it takes some experience and getting burned, or good training, to know it is not.

mdpopescu · on July 26, 2016

I was about to argue but then I realized: I've been working in the MS world for so long, I forgot that not everybody has connection pools.

The MS recommendation is the opposite: get the connection early, finish with it late, let us worry about the "real" connection. I've been working on multi-TB databases like that, with hundreds of concurrent requests, and never had problems.

gnaritas · on July 27, 2016

This has nothing to do with pooling, you can't use a connection pool in this scenario because you're in a transaction. Even in the MS world, that requires a single connection to be used throughout without releasing it back to the pool. If one isn't in a transaction then each access to the db uses a fresh connection from the pool and this problem never comes up, but the OP here is assuming the time between open and close is a transaction.

tormeh · on July 26, 2016

So the correct version is, I guess:

Open db connection 1

get data for receipt

Close db connection 1

generate receipt

send email

open db connection 2

write success to database

close db connection 2

I guess you could speed this up a lot by doing it in bulk instead of opening and closing db connections twice for every email. Anyway, the version you wrote sounds reasonable for everything but really big operations to me, but then again I'm fairly junior.

jedberg · on July 26, 2016

You can still do it in one transaction, sort of. Replace "send email" with "queue email" and then either eliminate the step that writes the success to the database (pushing that off to the queue processor, who will open or already have open a database connection) or instead write to the DB that you queued the item.

The advantage you gain is that the queue is a nice buffer if something gets held up with email sending, and also, the queue processor can work in bulk, say sending 100 emails, and then opening a connection to the DB and writing them all in one statement.

1024core · on July 27, 2016

So if the "send email" step fails (temporarily), the next worker to come along will grab the same receipt and send it again?

I think a better solution would be to use a centralized queue that actually does the mail sending, and retries in case of failure.

srd · on July 27, 2016

That depends on the implementation of the queue. Most queues have a "failed/retry" concept, where they mark failed jobs to be retried in some set future point in time. So one failed job does not hold up your entire queue processing.

ChoHag · on July 27, 2016

> a centralized queue that actually does the mail sending, and retries in case of failure.

A bit like SMTP?

lazyant · on July 27, 2016

not necessarily, in the first write we can update receipt status as "queued" so queue system can update to "sent" or retry

viraptor · on July 26, 2016

Depending on the number of email sending workers, it could be closer to:

    transaction 1:
        mark an unsent receipt as being processed
        get the marked receipt
    send email
    transaction 2:
        mark the receipt as sent

nawitus · on July 26, 2016

You don't need to close the database connection to not have an open transaction, right?

jedberg · on July 26, 2016

No you don't. But depending on the ORM, some of them batch writes unless you explicitly tell them not to, but will flush when you close the connection. So some programmers will just close the connection because they don't understand the difference or because they've learned that closing the connection guarantees that the data is written.

mschuster91 · on July 26, 2016

> So some programmers will just close the connection because they don't understand the difference or because they've learned that closing the connection guarantees that the data is written.

To be fair, this problem isn't limited to databases. Filesystems, even HDDs/SSDs, have been known to readily ignore flush() calls in order to achieve better benchmark results.

nine_k · on July 26, 2016

This still can be a pretty reasonable and correct course of action. Everything depends on the amount of data you are locking, and amount of transactions open simultaneously.

When there's a noticeable contention due to the number of parallel transactions, one should consider ways to loosen their data guarantees (the status can be unset for some time) and go for an explicit asynchronous approaches.

yeukhon · on July 26, 2016

Thanks for the illustration. But if Uber team was using ORM, then expect ORM to take care of the pool of connections.

get data for receipt would be a SELECT and only write success to database would do INSERT or UPDATE. I expect junior programmer to complete the above in at least two SQL calls. I have a feeling they were trying something smart.

EDIT: hmm reading the other commenter above, probably they are trying to lock on the data for full data integrity. Okay. That makes sense then. I was looking at the problem from the wrong angle.

riking · on Aug 1, 2016

Grapevine says that they were holding a transaction open for the entire duration of the Uber ride.

Thaxll · on July 27, 2016

The fact that Postgres didn't have official replication feature prior to a recent version is just unacceptable.

ledjon · on July 26, 2016

I would argue that most of these Postgres "flaws" are actually advantages over MySQL when you look at them holistically rather than the very specific Uber use-case.

Postgres's MVCC is superior (can rollback DDL, can add indexes online, can have open read transactions for a VERY long time without impacting other parts of the system)

Postgres supports many types of indexes, not just b-tree. One thing it doesn't have is clustered b-tree indexes... which is really what MySQL does that makes it somewhat "better." I wonder how Uber adds an index to a table that already has 1B+ rows in it with mysql?

Postgres have WAL level replication is a better guarantee of actually replicating the data correctly. I cannot tell you how many times I've had to tell my boss that the "mysql replicas might be slightly out of sync with the master" because of various replication issues. The way it handles triggers and scheduled events alone is garbage and can very easily break replication and/or silently cause inconsistency.

As for data corruption, if there is a bug that causes corruption, then there is a bug. I don't think that is a fundamental design flaw as implied in this article. You shouldn't rely on 1/2 assed replication design to accidentally save you from the data corruption bug. There are many downsides to the design MySQL has that are simply not listed here.

I have been both a professional MySQL administrator as well as Postgresql (as well as SQL Server and many NoSQL engines). Many of these Postgres issues are only issues at crazy huge scale, and I would say at that point you probably want to move away from relational anyway. MySQL has its own very large set of problems at scale as well.

It sounds like Uber is using MySQL as just a data bucket with primary keys ("Schemaless") which is good -- because you can't alter tables to save your life with MySQL.

At the end of the data each developer/business needs to use what works for them, but I would really shy away from pointing to this article as a linchpin in the "MySQL vs. Postgres" war (if there even is such a thing.)

busterarm · on July 26, 2016

Can't agree with this post enough.

I find their whole writeup to be terribly myopic. When they started their service, Postgres was almost certainly the right choice for what they were building and their MySQL setup was not. Now Postgres is less effective for them.

These kind of tech switches are _inevitable_ if you're making the right choices for your organization.

This strikes me as very similar to the article where Twitter ditched Rails. The focus should be inward... how they chose a tool that didn't support their use case and how they solved the problem, but instead they're about the flaws (that aren't really flaws) of the tool.

It's always the craftsman.

stubish · on July 27, 2016

While the article is discussing PostgreSQL vs MySQL, Uber's actual win seems to be from switching their data model to their sharded Schemaless key=value store. I expect they would have got this win no matter what backend they chose for Schemaless.

_Codemonkeyism · on July 26, 2016

"1B+ rows in it with mysql?"

Been there, really no fun.

midom · on July 26, 2016

As long as you can shard (across multiple instances, or even within same instance, to avoid B-Tree latching), 1B+ rows within MySQL is piece of cake.

Also, MySQL* is getting LSM-Tree support lately, which makes high performance data ingestion combined with OLTP workload quite feasible.

* https://github.com/facebook/mysql-5.6/tree/webscalesql-5.6.2...

_Codemonkeyism · on July 27, 2016

"I wonder how Uber adds an index to a table that already has 1B+ rows in it with mysql?"

"As long as you can shard "

Not sure how sharding helps with 1B+ tables when adding indices, care to share?

Swannie · on July 28, 2016

This is: "Split the index in your application code."

And then do your joins, in your application code.

fasteo · on July 26, 2016

Percona to the rescue [1]. Works flawlessly, at least in our case.

[1] https://www.percona.com/doc/percona-toolkit/2.1/pt-online-sc...

barrkel · on July 26, 2016

Am there right now, it's not really a problem - you just have to plan your schema changes and use Percona OST.

maximegarcia · on July 26, 2016

https://github.com/soundcloud/lhm

Saved my life a few time :)

_Codemonkeyism · on July 27, 2016

Thanks!

Groxx · on Aug 6, 2016

Also been there. Percona Toolkit works fine (though it's not load-aware enough, so we changed the backfill logic).

Plus, in 5.7, there are a fair number of online-DDL changes, and adding indexes is (usually) one of them.

jobu · on July 26, 2016

> It sounds like Uber is using MySQL as just a data bucket with primary keys

They have a couple posts about "Schemaless", but I still don't understand why they used MySQL as the data store instead of something like Cassandra. ( https://eng.uber.com/schemaless-part-one/ ) From that post it looks like they basically built a no-sql database on top of a relational database.

The only reason given was operational trust ( "If we get paged at 3 am when the datastore is not answering queries and takes down the business, would we have the operational knowledge to quickly fix it?" ). The project took nearly a year to roll out, and in that time the operation knowledge could surely be trained, hired, or contracted.

smoodles · on July 26, 2016

Operating Cassandra at the scale that Uber is going to require is going to be painful and as operationally draining as MySQL if not more.

There are really not a large number of options here anymore with the departure of FoundationDB from the market. CockroachDB might be an option in a few years, though I'm still confused why they are moving towards a SQL-ish vs key-value interface...

nickpsecurity · on July 26, 2016

"departure of FoundationDB from the market"

Pissed me off so much. Only thing close to Google's F0 RDBMS on the market, at a reasonable rate, and the beginning of a good offer to enterprises. Then, "poof!" It's a good example of why I tell companies to not put anything critical into something from a startup. If they do, better have a synchronized, backup option tested and ready to go.

"why they are moving towards a SQL-ish vs key-value interface..."

That's easy: most databases and buyers use SQL. Key-value is preferred by startups & non-critical, side projects in big companies you see here a lot but aren't representative of most of the market. Need first-rate, SQL support. I think EnterpriseDB shows that it's also a good idea to clone a market leader's features onto alternative database.

ngrilly · on July 29, 2016

> Only thing close to Google's F0 RDBMS

Did you mean F1 (instead of F0)?

nickpsecurity · on July 29, 2016

Yeah, yeah. I keep getting the 0 and 1 mixed up. Thank you.

ddorian43 · on July 26, 2016

What dbs would you suggest in their scale ? that are easier operationally than cassandra ?

MikeKusold · on July 26, 2016

I was at MesosCon and ended up talking to some Uber people. They are currently using Cassandra in prod. I can't speak as to why they use MySQL the way they do though.

verma7 · on July 27, 2016

I gave a talk at MesosCon about how we (are starting to) run Cassandra across multiple datacenters at Uber (https://www.youtube.com/watch?v=U2jFLx8NNro, https://schd.ws/hosted_files/mesosconna2016/60/mesoscon-uber...).

mandeepj · on July 26, 2016

Thanks for sharing this. I was wondering - the limitations they have listed could be easily overcome with cassandra

dastbe · on July 26, 2016

So arguably, they are using mysql as a storage engine rather than as a database.

They don't explicitly answer the question "Why didn't you use InnoDB/WiredTiger/etc. for your dataplane?", but you get the idea that they were very happy with the specific characteristics of MySQL for their use case and so they built on top of it. It also sounds like they had some deadlines (specifically, the death of their datastore) that they had to meet :).

carapace · on July 26, 2016

I had that same thought, that the time spent rolling their own system could be better spent just learning some existing good-enough thing.

A great way to get familiar with something is to be the folks who write it. It's also much more fun to design and implement something new than to just learn some other fella's software. I'm guilty of this myself.

But I've started to remind myself that "somebody else has had this problem" and there's probably a good enough solution out there already.

Put another way, is what you are trying to do really so novel? In the case of Uber's infrastructure, you would have to talk for awhile to convince me that they really really need something not-off-the-shelf.

mdani · on July 27, 2016

IMHO some possible reasons of not using Cassandra could be the following.

You can't use Cassandra if you need atomic increments (yes, they're included but painfully slow due to several trips required to satisfy PAXOS).

Also there are no transaction rollbacks (atomic batches always go one way - forward).

You may hit GC pauses if the JVM is not tuned properly.

If the use case involves its of deletes then tombstone related issues need to be considered.

stubish · on July 27, 2016

I wouldn't have trusted Cassandra back then either. 0.9, 1.0 or maybe 1.2 was reaching sufficient maturity to actually be recommended. Modern Cassandra has come leaps and bounds, with the 2.x series finally becoming stable this year and just recently 3.0.x finally getting blessed by the community as stable enough for production. And ScyllaDB hot on their heels.

fjordan · on July 26, 2016

Tools such as the percona toolkit (https://www.percona.com/software/mysql-tools/percona-toolkit) provide the ability to perform safe and performant online alters for MySQL. Behind the scenes it actually creates a new table with the new schema and utilizes triggers to apply writes to the new table as the original data is being copied. Once it completes the table is renamed and the triggers are removed.

Percona also recommends using this tool to safely perform alters for any Galera-based replication product (Percona XtraDB Cluster, MariaDB Galera Cluster, etc.)

markoot · on July 26, 2016

I have used this tool many times in production, and in accordance with advice about avoiding foreign keys.

It only failed catastrophically, causing a critical production incident, twice.

Each time took dozens of engineer hours to vet in advance, thus costing thousands of dollars.

Online DDL changes and indexing with pg cost us... basically nothing, and never caused downtime.

My and Pg each have their place... but if you want to modify large tables, MySQL is almost certainly the wrong tool for the job.

verelo · on July 26, 2016

While this is true, i don't personally think Percona tools alone makes these changes 'safe'. You can create scenarios where performance is so degraded that you introduce replication delay or service interruptions (and yes the tools have ways to work around that, but it's a significant challenge to tune this correctly on production), data can be lost during the table switch over, and the complication of managing a product / database with a table in an extended migration is very difficult.

This space is far from done, until we can change data structures on the fly...we're going to find there to be a constant struggle between schema and schema-less and neither side will be right or wrong.

Solution: Quantum states; We suspect every possible scenario is happening already, so we should just migrate our database to the appropriate quantum state, where the data is in the structure we desire. Sounds insane, it probably is.

spotman · on July 26, 2016

Your absolutely correct that it is not risk free, however, if you do not use foreign keys (which no huge scale mysql install like schemaless does) and you know what your doing (read: calculating the expected load, adding throttles and sanity checks, backups, etc) online alters are extremely feasible with mysql and folks have been doing them for years.

I know first hand shops like FB use a similar method in production.

Also, big note, the whole point of "schemaless" (And other things, like fb's friend feed architecture) is that you don't make schema changes.

chucky_z · on July 26, 2016

online schema change is absolutely riddled with crippling bugs if you use foreign keys. Just a heads up, we've had to basically take everything it does and make an in-house version. The idea and execution are great, when it actually works.

barrkel · on July 26, 2016

Foreign keys are a major problem in MySQL if you do lots of writes, especially bulk writes. As is auto-increment (table-lock for load data infile, yay!).

We allocate our own IDs, and disable foreign keys for bulk loading.

madhusudhan000 · on July 26, 2016

Well, for one thing, if they say schemaless, then you are using the wrong tool for the job.

Sure postgres has JSONB (comparing to mongo db), Key value store such as HStore, but they do well if they fit on one machine. The moment you hit that scale you have to realize that there are tools specifically built for this.

There is the phoenix project https://phoenix.apache.org/ that salesforce is using for scaling. Definitely worth a try.

But again, the title of the article sounded nothing more than a rant to me.

peatmoss · on July 26, 2016

The term schemaless within Uber is more of a product name. It's kind of a joke with many of the engineers, that it's anything but devoid of schema.

grogers · on July 26, 2016

Support for online schema changes has been getting a lot better in recent releases of MySQL. No idea if it is totally bulletproof right now though, we are still in the "stone ages" of doing failover based schema changes, not even pt-online-schema-change.

It seems that the replication arguments in the post are pretty weak (migrating to 9.4+ to allow heterogeneous versions would still have less downtime than migrating to MySQL), but clustered indexes seem like a huge win for their use case of updating only a subset of each row. Whether that is reason enough to totally ditch postgres, I don't know.

carterehsmith · on July 26, 2016

I just want to add that newer versions of MySQL can add indexes online. In fact, it looks like all DDL can be done online, even though some if it (like adding a column) may require the table to be rebuilt, which can take time.

Note that the master is still available for read and write during this. Replicas will lag though.

njs12345 · on July 26, 2016

From the MySQL docs: http://dev.mysql.com/doc/refman/5.7/en/alter-table.html

> Updates and writes to the table that begin after the ALTER TABLE operation begins are stalled until the new table is ready, then are automatically redirected to the new table without any failed updates.

While the master may be technically up for writes during this period, it's not much a goer if your table is large and has any write traffic at all, as anything writing will stall for what may be an extended period.

carterehsmith · on July 26, 2016

It should never be an extended period. Everyone everywhere will advise to keep your transactions as short as possible.

Sure this is sometimes boring additional work - eg you don't delete 1M records with one statement, you break it into 1,000 statements each deleting 1,000 records.

Sucks, but keeps your db and your users happy.

BTW this is true for PostgreSQL and MySQL and Oracle and every db that allows concurrent DML.

njs12345 · on July 27, 2016

No, the migration itself will cause transactions to stall.

For example, if you have a table with 1M user records, and you run a migration to add a column in MySQL, then any updates to the table will be stalled while the table is rewritten to add the extra column (which may take a while). This is independent of how many records it touches - even if the transaction only touched 1 record and would take 10ms to execute, if the migration takes 10 minutes it may be stalled for up to 10 minutes.

In Postgres you can add a nullable column, and the table will only be locked for a very short amount of time, independent of the size of the table.

ComodoHacker · on July 28, 2016

>can have open read transactions for a VERY long time without impacting other parts of the system

Unless you're on a replica.

return0 · on July 26, 2016

> the very specific Uber use-case

Other than the "schemaless" layer, the issues mentioned are fairly common.

sam_pointer · on July 26, 2016

We did something very similar at EA Playfish, at least one alumni of which is part of the Uber engineering team.

We used a 2 column InnoDB-backed table for all of our data storage, massively sharded, and run in a 3-host master-slave-slave configuration.

At that time EC2 would routinely kill hosts without the courtesy of a poke via ACPI and as such we became very good at quickly recovering shards. In a nutshell this mechanism was to have the new host contact a backup slave, perform an lvm snap, pipe the compressed snap over a TCP connection, unroll it and carry on, letting replication take up the delta.

That enabled us to not only manage the 10 million or so daily active users of that title, but was also the platform under the 12 or so additional titles that studio had.

We had lots and lots of very simple things and failures were contained.

I think at the time we were the 3rd-largest consumer of EC2 after Netflix and "another" outfit I never learned the name of. EA being what it was, however, we were never permitted to open source a lot of the cool stuff Netflix and ourselves seemed to develop in parallel.

sandGorgon · on July 26, 2016

this is frikking awesome! do you have any of the lvm and pipe scripts publicly available ? i'm kinda struggling with building and setting up lvm on ec2 automatically and was wondering if there is any tidbits you can pass along.

sam_pointer · on July 27, 2016

Unfortunately not. That company and codebase is very much dead, and I don't own any of it. Even if I did it would be a bunch of context-less shell (in the first iteration) and then a bunch of context-less Chef (in the second platform iteration).

Things I do remember:

- We used ephemeral volumes for all data stores. This was pre-provisioned IOPs and EBS was flaky as heck back then. I can't remember the disk layout, although we did experiment a great deal.

- We took great pains to ensure there was enough space to make the snapshot (IIRC is was a telemetry/monitoring item)

- The pipe scripts were essentially "netcat".

The best I can offer is this talk: http://vimeo.com/57861199

fusiongyro · on July 26, 2016

The article could be summed up as "Postgres is not a distributed database." MySQL isn't either, although it certainly has more friendly replication technology. I think it's a lot more likely that what's really happening here is that they've designed their "schemaless" schema or its supporting software to handle the kind of soft errors that MySQL is permitting and Postgres was not.

We have MySQL replication across the country where I work and I certainly wouldn't characterize it as robust; it fails every 3-6 months. MySQL replication is certainly a lot older and easier to use than Postgres's, but SQL databases are fundamentally CP systems. When you say "This design means that replicas can routinely lag seconds behind master, and therefore it is easy to write code that results in killed transactions" it sounds like you're blaming the way replication was implemented for a physical problem. There is no way to design a replication system such that two highly-consistent databases can achieve perfect availability in the face of real-world networks. A worse protocol can exacerbate the problem, but a better one can't make it go away.

I have never seen corruption with Postgres (unlike MySQL), but I have never tried cross-datacenter replication with it. Apart from that, Postgres generally seems to do much better with consistency than MySQL does, where DDL statements are not transactional, etc. So I am not surprised to hear that their system trips harder on Postgres's more aggressive consistency.

In short, I suspect a more robust solution to their problem is a NoSQL database. On the other hand, it sounds like they want a combination of availability and consistency that will be difficult to get off-the-shelf. I'm glad they found a way to make it work. I wouldn't generally choose Postgres for a scalable system with an aggressive availability constraint--but then again, I wouldn't choose MySQL either, and I generally avoid problems that demand highly scalable, highly available solutions.

drob · on July 26, 2016

We've hit a lot of the same fundamental limits scaling PostgreSQL at Heap. Ultimately, I think a lot of the cases cited here in which PostgreSQL is "slower" are actually cases in which it does the Right Thing to protect your data and MySQL takes a shortcut.

Our solution has been to build a distribution layer that makes our product performant at scale, rather than sacrificing data quality. We use CitusDB for the reads and an in-house system for the writes and distributed systems operations. We have never had a problem with data corruption in PostgreSQL, aside from one or two cases early on in which we made operational mistakes.

With proper tuning and some amount of durability-via-replication, we've been able to get great results, and that's supporting ad hoc analytical reads. (For example, you can blunt a lot of the WAL headaches listed here with asynchronous commit.)

pella · on July 26, 2016

Roadmaps ( PostgreSQL ) 2016-2017-...

* Postgres Professional roadmap ( Pluggable storages, Multimaster cluster with sharding, Effective partitioning, Adaptive query planning, Page-level data compression, Connection pooling, Native querying for jsonb with indexing support, ....) https://wiki.postgresql.org/wiki/Postgres_Professional_roadm...

* EnterpriseDB database server roadmap ( Parallelism, Replication, Vertical Scalability, Performance ) https://wiki.postgresql.org/wiki/EnterpriseDB_database_serve...

====

And "Scalable PostgreSQL for real-time workloads https://www.citusdata.com " --> https://github.com/citusdata/citus

leblancfg · on July 26, 2016

My big question here is why they decided to move over to MySQL instead of using the Citus (also open source) Postgres extension. They don't mention it, so we don't know whether it was considered, and if so, why it was not selected. Postgres' rapid feature growth in the past years is interesting in itself.

pella · on July 26, 2016

Citus is Open Sourced since "Mar 2016", so it is not so old.

https://github.com/citusdata/citus/blob/release-5.1/CHANGELO...

"citus v5.0.0 (March 24, 2016)

- Public release under AGPLv3

- PostgreSQL extension compatible with PostgreSQL 9.5 and 9.4"

bonesmoses · on July 26, 2016

Citus currently has major problems joining non-distributed tables with distributed tables, and it doesn't fully support schemas yet. It's almost there, but not quite.

leblancfg · on July 28, 2016

Thank you!

ksec · on July 27, 2016

Why is there a Postgre Pro and Enterprise DB? Whats the differnce? ( Never heard of PostgrePro till today )

pritambaral · on July 27, 2016

Postgres Professional and Enterprise DB are two companies that provide Prostgres products and services. The linked roadmaps are their respective plans on what they want to add to the open source Postgres database.

thinkingkong · on July 26, 2016

Posts like this are important.

We too often rely on a buzz-word heuristic and that's how you end up with dozens of random technologies that are harder to maintain and don't necessarily solve any of your problems. This method is good, because it shows that when you understand the problem the right way, you can find the right solution, even if by popularity it looks like a "step backwards"

Massive Kudos.

juskrey · on July 26, 2016

Making a lot of "buzz-word" errors like that, I now use the simple heuristic: choose the oldest available technology that solves your problem. Also known as Lindy effect https://en.wikipedia.org/wiki/Lindy_effect

pbreit · on July 26, 2016

The problem is that people might get the wrong impression that this is relevant to them when for 99% of people it is not.

millamox · on July 26, 2016

I strongly disagree. it would have been useful f they'd stuck to problems without well-known solutions.

Sadly, they also mixed in issues which are easily solved, or in a particularly egregious case, where they just complain about a bug. As though MySQL never had a bug. That was silly.

My read of it was: Postgres annoyed us a few times, and we got fed up with its, so now something different will annoy us. Please look forward to our blog post in 4 years about how we're using X instead of MySQL/Schemaless because those were also imperfect.

tracker1 · on July 26, 2016

I got a similar impression... though with Uber's scale, funding and resources, they probably could have worked with and through their issues with Postgres. I'm actually surprised they didn't take a multi-pronged approach to their issues. Since they're using Schemaless, I'm curious why they didn't go for one of the many non-sql databases that may well be a much closer match to their use case.

It seems to me that Cassandra (C*) may have been a better match to their needs... yes it means more administrative and application tuning, but would definitely scale to meet their needs. RethinkDB would also likely be a better match to what they are wanting to do.

That said, I've been holding out for some time on PostgreSQL's in the box replication story to take root and mature. There are definitely more mature features and solutions to sharding and replication around MySQL, I just tend to find MySQL to be brittle and every time I've ever worked with it, I have at least a half dozen WTF moments... from binary data handling/indexing, foreign key syntax, ANSI out of spec, and others. PostgreSQL has some great features, and once the replication issues settle in, it will become the default choice for a lot of projects in a lot of organizations. Though, mySQL/maria and even MS-SQL are currently better options for many SQL use cases.

busterarm · on July 26, 2016

I'll preface what I'm about to say with "I have never worked for Uber and I don't know terribly much about their internal structure", but from my interviews with Uber and a few of their hires I know, it seems that they tend towards hiring totally independent teams from the existing staff when tackling big projects...including hiring an outsider manager to hire a whole team. I won't speculate as to the reasons for this publicly but I've drawn some interesting conclusions from this.

A multi-pronged approach that might involve multiple stakeholders just doesn't seem like their way of doing things.

chucky_z · on July 26, 2016

What replication issues are you referring too? I never once had a problem with pgsql's replication across 8.1->9.5.

It would randomly die, but that was always either my fault or the applications fault, never pgsql itself.

The lack of master-master seems to be the big thing everyone mentions, but PostgresXL is currently in a usable-in-production state.

tracker1 · on July 27, 2016

But, what is the current replication setup that comes with postgres that is well documented with the PostgreSQL (current-version) documentation... the past, when I've looked there's mention of 2-3 solutions (none in-the-box) and others that require at least a 5-figure support contract.

Compare to MongoDB, RethinkDB, MS-SQL and others where the tooling for replication comes in the box. Yes, to of the examples are "no-sql" but even the mysql replication is in the box and supported as such.

chucky_z · on July 28, 2016

Did you read this?

https://www.postgresql.org/docs/current/static/high-availabi...

I'm not sure what more you can ask from documentation.

Does MySQL document Vitess, Galera, MaxScale, etc...?

serge2k · on July 26, 2016

> though with Uber's scale, funding and resources, they probably could have worked with and through their issues with Postgres

Then they wouldn't get to build cool new stuff and write blog posts about how they had to build cool new stuff because OMG UBER SCALE.

thinkingkong · on July 26, 2016

The issue is all solutions are imperfect. You just make tradeoffs and shuffle the problems around.

Doing so deliberately is whats commendable.

swasheck · on July 26, 2016

thank you for more accurately stating what i was trying to say earlier.

stephengillie · on July 26, 2016

Almost all problems are already solved. "Solving" a problem today is mostly ego-stroking.

GuiA · on July 26, 2016

Sweet, I have a few PhD thesis and NSF grant proposals I'd like your help on.

stephengillie · on July 26, 2016

I'd love to help stroke your ego, but I see you're busy doing it yourself.

NhanH · on July 26, 2016

Well, this is heresy. Does that mean we are now officially boycotting Uber?

Joke asides, one thing I've been trying to figure out for awhile is the limitation at which certain components/ systems broke down. Basically, something along the line of "given X records, this operations would take Y time, or would cause Z A B C problems". I've actually got developers friends asking me how fast a simple "SELECT * FROM X WHERE index=?" would take on a million row table, since they were surprised that some NoSQL DB could do a query on hundred million rows in a few seconds.

I guess that's part of why you only learned how to scale after having done it once.

Illniyar · on July 26, 2016

"surprised that some NoSQL DB could do a query on hundred million rows in a few seconds"

The tone seems to suggest this is fast, a simple index query in an RDBMS of even a hundred million rows will take milliseconds on even a weak computer.

philliphaydon · on July 27, 2016

I was testing some JSONB indexing and did a date-range query on a json doc for a table with 100m records to test with, it returned top 10 records that met the criteria in ~4s, figured out how to index the json doc for dates and this dropped to ~25ms.

This was running on an old Core i5 / 16gb of ram with a lot of junk running on Windows.

It would probably have been faster on a production environment.

xaprb · on July 26, 2016

Justify that with actual math, please? I don't think you know how computers (processors, bus, memory access, etc) work. How exactly do you think an RDBMS can query an index at a hundred billion rows per second?

losvedir · on July 26, 2016

The parent said a query of a hundred million rows, i.e.: `select * from users where id = ?` when there are a hundred million users. That's obviously not scanning a hundred million rows, since the index will be a b-tree which by the power of logarithms is not so many levels deep.

Like Illniyar, that's also what I interpreted GP to mean when they said:

> "surprised that some NoSQL DB could do a query on hundred million rows in a few seconds"

which is not fast at all. However, if it actually did something on a hundred million rows that would be impressive. Not sure exactly sure which they meant, though.

CoffeeOnWrite · on July 26, 2016

Illniyar is right, a b-tree with a branching factor of 100 can index 100M rows in 4 levels. Even with the index on a spinning disk, should have no trouble coming in well under 100ms.

falcolas · on July 26, 2016

Indexes like this are also typically in memory to begin with, making it even faster, since you only have to find the actual record on disk.

Even better, if you only need one field from the record, and it's part of a compound index, you can frequently return the data from just the indexes; no disk seeks required. Small tip with InnoDB on MySQL - any non-primary key index is automatically a compound index with the primary key.

forgotpwtomain · on July 26, 2016

Great write up. A couple points -

I'm not sure this post is illustrative of any generally applicable considerations (re: the title) in the choice of Postgresql vs MySQL, since Uber seems to no longer be using a relational model for most of their data and is using MySQL effectively as a key-value store.

> say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.

I have to very seriously disagree here, ORMs make a lot of things easy - and you can get away with building stuff for a while without understanding the underlying databases or SQL but only to a certain scale (I'd say more like medium-scale, definitely not large or Uber level). If you have engineers writing code that interacts with a database without understanding transactional semantics, the engineer in question not the database is the problem.

> We started out with Postgres 9.1 and successfully completed the upgrade process to move to Postgres 9.2. However, the process took so many hours that we couldn’t afford to do the process again.

There seem to be ways [0][1] to do online upgrades with Postgres (before logical decoding in 9.4), although I haven't personally used them. Not sure if they explored these options at Uber or not?

[0] https://github.com/markokr/skytools [1] http://slony.info/

spotman · on July 26, 2016

> I have to very seriously disagree here, ORMs

In spirit I agree with you, its the engineer's fault for not reading the documentation of their ORM or equivalent.

But in these big ships with hundreds of programmers, leaving transactions open in hibernate is a daily occurrence somewhere.

Usually caught before production, but happens oh-so-frequently, that anything that exacerbates the pain from this would be seen in a negative light for sure.

BadassFractal · on July 26, 2016

I've heard from technical leaders at multiple now well established unicorns how they'd never use postgres or switched from postgres simply because MySQL has a lot more tooling built it and many more people are exposed to its shortcomings at "web scale" so that it's very well known where and when things will break.

Disclaimer, I'm a hardcore Postgres user myself, but I also keep tabs on the other tools.

segmondy · on July 26, 2016

I've been a hardcore and exclusive Postgres user since 2004. Always rolled my eyes when folks say they use MySQL or moved to it from Postgres. Reading about that bug sent shiver's down my spine. I know no system is bug free, but Postgres is one of those systems that have held up very well for me, and now I'm super paranoid. :-( I guess I understand tho, I moved from BSD to Linux for more tooling.

davidw · on July 26, 2016

OTOH, it's important to consider that most places are not and will never be "unicorns", so analyzing things at that scale may not make sense. Just because Facebook managed to make something work doesn't mean that you should follow the same path.

quotemstr · on July 27, 2016

> Each of these system calls incurs a context switch

System calls are not context switches. I wish people would distinguish between them. A system call is just a change of privilege level and is efficient --- there's no cache invalidation required on almost any system.

A context switch, on the other hand, involves a call to the scheduler, saving and restoring of much more CPU register state, and various kinds of cache invalidation. (It's even more expensive if you're switching between different processes instead of different threads in the same process.)

The kernel may perform a context switch while executing a system call --- this context switch is what makes blocking calls blocking. But even IO system calls do not necessarily cause context switches, especially in the case where an operation can be satisfied by accessing only the page cache.

tl;dr A system call is not necessarily a context switch

Illniyar · on July 26, 2016

So the major issue detailed here is that postgres basically uses immutables rows which creates performance issues with writes.

Just read about their new schemaless db in their blog an the first paragraph contains this:

"The basic entity of data is called a cell. It is immutable, and once written, it cannot be overwritten. (In special cases, we can delete old records.) A cell is referenced by a row key, column name, and ref key. A cell’s contents are updated by writing a new version with a higher ref key but same row key and column name."

So, mmm..., not saying that postgres didn't pose a problem for them but I think postgres' db model fits better to their new db then mysql. They probably had to work really hard to get mysql to work like postgres.

Without this issue, it looks like two things needed.to be done with postgres that would have solved their problems have indexes that point to primary id and do logical replication (which they say a plugin solved in 9.4).

Is this a case of "I got burned by something so I won't use it again"

Illniyar · on July 26, 2016

So more from here: https://eng.uber.com/schemaless-part-two/

"Each Schemaless shard is a separate MySQL database, and each MySQL database server contains a set of MySQL databases. Each database contains a MySQL table for the cells (called the entity table) and a MySQL table for each secondary index, along with a set of auxiliary tables."

So... 1 table, with 1 index and manually created and updated secondary index tables.

With this scheme I can only assume postgres will work just as well or better.

0xmohit · on July 26, 2016

Facebook maintains it's own fork [0] of MySQL. A couple of interesting talks are also available: MySQL at Facebook, Current and Future [1] and Massively Distributed Backup at Facebook Scale [2].

[0] https://github.com/facebook/mysql-5.6

[1] https://www.youtube.com/watch?v=jqwegP9xwVE

[2] https://www.youtube.com/watch?v=UBHcmP2TSvk

raverbashing · on July 26, 2016

"its own fork"

And yes, MySQL has a lot of problems, but it seems to be better for "seat of the pants" usage (that is, when you don't need too many SQL capabilities and wants to use it mostly as a data store)

lazyant · on July 26, 2016

from what I understand FB uses Mysql as permanent storage, not as relational database

yeukhon · on July 26, 2016

I believe they use it for almost everything. They built all the graphs on top of MySQL data. I don't know what permanent storage is, but if you are referring to storing images and videos, I doubt. Highly doubt that. They may be storing pointers, but as a file system, I doubt. But I wouldn't know for sure, I don't work there. Oh, they do have Cassandra (well they built Cassandra)...

jhartmann · on July 26, 2016

They use a system called haystack for binary large object storage. They very much don't use mysql for those.

yeukhon · on July 26, 2016

That's my feeling they wouldn't use MySQL as a blob store.

midom · on July 26, 2016

technically, it is a patch set, it gets frequently rebased against the upstream

pluma · on July 26, 2016

Technically they use it as a dumb key-value storage.

philliphaydon · on July 27, 2016

There was a great talk from a FB engineer who talked about using MySQL as a key-value store, then another engineer posted a blog about how they use MySQL and queries and joins etc, I asked about the video and he got super defensive and said it was wrong and the guy had no idea what hes talking about.

It really put me off ever wanting to even consider working at Facebook.

Digging around I can't find either the blog or video :(

pluma · on July 27, 2016

I've talked to several Facebook devs who confirmed to me that FB primarily uses MySQL as a key-value store (e.g. this is why GraphQL makes so much sense for them). They were quick to point out however that of course they don't know everything FB does.

The problem with making absolute statements about what FB does is that it seems to be structured as a large number of mostly independent teams that are free to choose and develop whatever technology they need to solve their problems. There's actually a talk about scaling by a FB dev (don't have the link right now) that uses this as an example for what it means when they talk about "Facebook scale" (another example is that their monorepo outgrew git so they created their own extension to mercurial instead).

It's entirely possible that the main use of MySQL at FB is as a key-value store while at the same time there are small parts of applications using it with plain old queries and joins.

philliphaydon · on July 28, 2016

Ahh yeah good point. FB's scale is very interesting, it's like we solved scalability then FB came along and was like, ah crap we need to resolve scalability. And different teams are solving their own problems different ways.

otterley · on July 26, 2016

And so does Yahoo, and Google, and...

pritambaral · on July 26, 2016

I wonder what the design decisions are behind (or what it would take) to make Postgres store secondary indexes on disk like InnoDB does. Sure, the extra index lookup through the primary index is a cost, but it seems like write-amplification can sure be a greater concern too. Ultimately, it would be nice if Postgres gave the DBA a choice of — if not move outright to — secondary-index indirection through the primary index like InnoDB does.

saurik · on July 26, 2016

I would definitely not want PostgreSQL to "move to" the mechanism used by InnoDB as that mechanism is slower for reads (which this article even admits). This disk layout is one of the reasons I continue to use PostgreSQL. If you care more about write performance than reads you might even want to look into something entirely different than a traditional database.

FWIW, PostgreSQL's mitigation to the write amplification problem mentioned here are "heap only tuples". It is highly likely that Uber could have prevented a lot of this pain by learning more about this feature--which is notably not mentioned even once in this entire article, which to me completely undermines the feel they are trying to achieve of "we really really really know what we are doing"--and tuning their table density parameters to take maximal advantage.

Instead of "why we moved off of X" it would be much better to see "we are considering moving off of X: anyone know what we are doing wrong?". Sure, maybe Uber knows about HOT, and did extensive analysis to determine it wasn't a solution to their problem; but it frankly does not seem at all to be the case. Anyone who knows a lot about PostgreSQL would have explained HOT to them, so they probably didn't even consult with PostgreSQL experts behind the scenes.

Sadly, "we are looking for help with a complex technical challenge" is something the market punishes under the premise that everyone has to be entirely self-sufficient gods of their technology stack :(. The only time I remember ever having seen a company reach out to the community for help was reddit (with respect to something involving PostgreSQL or Cassandra... I don't remembee the specific issue).

pritambaral · on July 26, 2016

Thank you for introducing me to "heap only tuples", I did not know about it. From what I read about it from Postgres's documentation[0], it would have helped in the updates where no indexed column was updated.

0: https://wiki.postgresql.org/wiki/Index-only_scans#Interactio...

dialogbox · on July 27, 2016

You can find more detailed explanation from Bruce Momjian's site. (From page 64: https://momjian.us/main/writings/pgsql/mvcc.pdf)

denishpatel · on July 27, 2016

They migrated from MySQL to Postgres (https://www.yumpu.com/en/document/view/53683323/migrating-ub...) with almost same reasons and now they are counter argumenting for their inability to use software and lack of skills to upgrade!

scotty79 · on July 26, 2016

Funny how the article is just:

- we used X in a fashion that suited us best

- it caused us problems Y because of some technicalities of X

- so we switched to Z and we could avoid Y thanks to how Z handles the technicalities differently than Y

and the top rated HN comments are:

- you used the X wrong

- all the technicalities of X that caused you problems Y are actually superior features of X

zihotki · on July 27, 2016

But think how funny that'd be if we assume that "all the technicalities of X that caused you problems Y are actually superior features of X" is actually true?

_bz2r · on July 26, 2016

This was a great overview and write-up.

Anyone know why they are using MySQL over MariaDB[1]?

1. https://mariadb.org/

mwpmaybe · on July 26, 2016

I'm not saying this is the case at Uber, but it's quite common for MariaDB users to refer to it as MySQL, or to use the two names interchangeably. It's difficult to get out of the habit as all of the CLI tools are still mysql*, clients (i.e. the protocol) are all named "mysql", CM modules (like Ansible's) are all named "mysql", etc. And sometimes it's just simpler (if lazier) to say you're using MySQL rather than explain what MariaDB is and why it's preferable.

zokier · on July 26, 2016

I still occasionally call LibreOffice OpenOffice. Besides mysql has less syllables than mariadb.

0xmohit · on July 26, 2016

It's hard to say, but one can often find people switching from Postgres to MySQL (and vice-versa). For example, see this [0] and the related HN discussion [1].

[0] http://insights.dice.com/2015/03/19/why-i-choose-postgresql-...

[1] https://news.ycombinator.com/item?id=9231751

Illniyar · on July 26, 2016

Well there new database is basically a key value store built on MySQL, so I think there isn't much to be gained by using MariaDB and it will add needless additional complexity.

jswny · on July 27, 2016

It really is sad that we all just can't appreciate a company like Uber giving us insight into their internal engineering choices. So many people on here think Postgres is so perfect that if you don't like it you must be using it wrong. Postgres is a tool like anything else. It has good use cases, and bad use cases.

pella · on July 26, 2016

anno 2013 "MIGRATING UBER FROM MYSQL TO POSTGRESQL"

https://www.yumpu.com/en/document/view/53683323/migrating-ub...

markpapadakis · on July 26, 2016

Great write-up. A few observations:

1. The encoding and translation schemes of Postgres and mySQL/InnoDB are well described in the blog post, and I would also agree that InnoDB’s design is, all things considered, better for all the reasons outlined in the post.

2. I don’t understand why anyone still uses lseek() followed by read()/write() and not pread()/pwrite() syscalls. It’s trivial to replace the pair of calls with one. Aerospike is another datastore that resorts to pairs of seek/red-write instead of pread/pwrite calls.

3. Process/connection model makes no real sense nowadays - although to be fair, there is, today, practically almost no difference in terms of footprint between OS threads and OS processes (other than memory and FDs sharing semantics, they are practically the same). It’s still more appropriate to use threads (although I ‘d argue maintaining a pool of threads for processing requests and one/few threads for multiplexing network I/O is the better choice).

4. ALTER TABLE is obviously a pain point with mySQL, although I am not really sure many users with large datasets care; they probably figured out long ago it’s going to be an issue and they designed and expanded accordingly. It’s also a relatively rare operation. That said, other than using mySQL (or any other RDBMS) to build the data plane for an elaborate, distributed KV store, one should consider Salesforce’s approach too. Their tables have some 50 or so columns, and the column names are generic (e.g column_0, column_1, … ). They have a registry where they assign column indices (e.g column_0) to a specific high-level entity type (e.g customer title, or price), and whenever they need to query, they just translate from the high level entity to the actual column names and it works. They also, IIRC, use other tables to index those columns (e.g such an index table can have just 3 columns, table id, column index, value) and they consult that index when needed (FriendFeed did something similar).

5. Cassandra should have no problem supporting the operations and semantics of Shemaless ass described in their blog posts. However, given they already operate it in production, they probably considered it and decided against it.

fdr · on July 26, 2016

w.r.t. 2:

https://www.postgresql.org/message-id/6248.1046130083%40sss....

     Manfred Spraul <manfred(at)colorfullife(dot)com> writes:
     > Tom Lane wrote:
     >> It seems unlikely to me that eliminating lseek on some platforms would
     >> be worth the hassle of maintaining two code paths.  lseek is mighty
     >> cheap as system calls go.
     >> 
     > It was considered expensive enough to write a syscall avoidance layer 
     > that caches the file pointer and skips lseek if fpos==offset.
     
     You're missing the point: that layer is mostly there to ensure that we
     don't foul up the kernel's readahead recognition for sequential fetches.
     It's nice that Linux doesn't care, but Linux is not the only platform
     we worry about.
     
     			regards, tom lane

markpapadakis · on July 27, 2016

This ML thread's only real argument is that some OS/Kernels may not support pread/pwrite. The readahead argument makes little to no sense IMO. Unless there are too many uses of random access IO in the codebase, they should use pread and friends if available there. Especially considering most people run it on Linux not some exotic OS nowadays.

0xmohit · on July 26, 2016

Worth quoting from the article:

  Accordingly, using pgbouncer to do connection pooling with
  Postgres has been generally successful for us. However, we have
  had occasional application bugs in our backend services that
  caused them to open more active connections (usually “idle in
  transaction” connections) than the services ought to be using,
  and these bugs have caused extended downtimes for us.

fdr · on July 26, 2016

What I don't understand: no system gets to maintain open transactions for free. MySQL keeps UNDO logs, so the effect there, much like Oracle, is possibly running out of UNDO space, as well as slowing down reads that have to apply UNDO to get the last-committed row version. So what gives? Did Uber Engineering fix the dangling transaction issues while migrating off, or are they relying on some other property of MySQL vs. Postgres?

0xmohit · on July 27, 2016

They seem to be relying on the database take care of those :)

vanviegen · on July 26, 2016

One major advantage of MySQL's clustered indexes the article doesn't mention is that, although secondary key reads may be a little slower, primary key reads will be faster. The row data lives in the primary key index, so there is no need for referencing an additional database page (possibly causing random I/O).

This is especially relevant when doing range queries over the primary key. Imagine a table containing billions of chat messages, from which you want to retrieve a single conversation history. With a clustered primary key on (conversation id, message id), MySQL would need to process just a couple of database pages. Postgres, on the other hand, would need to reference a semi-random page for each of the messages.

Now imagine a 10k message chat conversation, a table too large to fit into RAM, and storage by means of spinning rust (yeah, yeah, I know what year it is :-)). The difference would be somewhere between 2 and 3 orders of magnitude.

pritambaral · on July 27, 2016

Uhh ... Postgres supports Index-only scans; as long as the data you're asking for is in the index, that is.

So if you have an index on (conversation_id, message_id), and you try to retrieve message ids of a specific conversation, only the index will be touched.

vanviegen · on July 27, 2016

Doh, you're right of course! We were having this problem in the Postgres 9.1 era, before index-only scans were a thing.

Still, it's quite inefficient maintaining an extra copy of the data that is never actually used. Though no longer multiple orders of magnitude less efficient.

However, I'd guess that programmers don't often think to add these seemingly useless fields to an index, as it feels inefficient and just wrong. But at least this offers an out in pathetic cases.

jasode · on July 26, 2016

Fyi... a related (not duplicate) discussion of a previous Uber story: https://news.ycombinator.com/item?id=10923848

trequartista · on July 26, 2016

Wow, this is such a detailed analysis. Having used Postgres and suffered issues with data replication as well as database crashes, this post was really helpful.

ismdubey · on July 26, 2016

The fact that Uber scaled to so many users with Postgress gives me such a relief. For now, I am good !!

cdelsolar · on July 26, 2016

So basically, if you don't intend to use it as a relational database, and you have enough scale to run cross-data-center (and across-the-world) master-master replication, then you should maybe switch from PostgreSQL to MySQL?