Online migrations at scale (stripe.com)
268 points by hepha1979 on Feb 2, 2017 | 56 comments



We're doing nearly the exact same approach in my office out of necessity.

We "allowed" small amounts of downtime previously (say 10-15 minutes at a time for migrations), but over the last year as our customer base has expanded, the window has shrunk smaller and smaller as to avoid service disruptions.

We're now at a stage where downtime is effectively not allowed anymore, so we've taken to this approach (make a new table, dual write new records, then write all the old records, remove references to the old table) to mitigate our downtime.

It's nice to know that other companies have taken this approach as well; my team honestly didn't know if we were doing it "properly" or not (if there is such a thing as properly).


> It's nice to know that other companies have taken this approach as well; my team honestly didn't know if we were doing it "properly" or not (if there is such a thing as properly).

Incidentally, I was thinking about doing a blog post about that too.

That's good. You're confirming there's an audience for my writing. Thank you :D


Avoiding large data migrations is one of the reasons to consider whether Command Query Responsibility Segregation (CQRS) and Event Sourcing (ES) are right for the application domain. It seems in this example the domain events haven't changed (i.e. subscription created, subscription canceled) but the logic enforcing subscriptions has (i.e. only one subscription, then allowing multiple). If the events were stored in an append-only journal, then only the subscription business logic would need to change and no data would need to be migrated. Additionally, since the catalyst for this change seemed to be query-side performance, the query service can be optimized for reading the records separately from the command service that writes to the event journal.
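
As a rough illustration of the idea (the event names and record shapes are made up, not Stripe's actual model): subscription state is derived by replaying an append-only journal, so relaxing the "one subscription per customer" rule only changes the projection, never the stored events.

    from dataclasses import dataclass

    # Hypothetical domain events, kept in an append-only journal.
    @dataclass
    class SubscriptionCreated:
        customer_id: str
        plan: str

    @dataclass
    class SubscriptionCanceled:
        customer_id: str
        plan: str

    def current_subscriptions(journal):
        """Rebuild read-side state by folding over the event journal.

        Allowing multiple subscriptions per customer is just a change to
        this projection; the stored events never need to be migrated.
        """
        state = {}
        for event in journal:
            subs = state.setdefault(event.customer_id, set())
            if isinstance(event, SubscriptionCreated):
                subs.add(event.plan)
            elif isinstance(event, SubscriptionCanceled):
                subs.discard(event.plan)
        return state

    journal = [
        SubscriptionCreated("cus_1", "tv_service"),
        SubscriptionCreated("cus_1", "music_service"),  # multiple subscriptions now allowed
        SubscriptionCanceled("cus_1", "tv_service"),
    ]
    print(current_subscriptions(journal))  # {'cus_1': {'music_service'}}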

Event sourcing isn't without its difficulties, but if the domain is pretty well understood then it can be a very powerful pattern.

Stripe has an awesome product and seemingly an equally great engineering culture. I wonder if they have debated using CQRS/ES.


Does this require a separate data source (e.g. a message queue) to ingest and log the events before reacting to them? Or could you retroactively use the mongodb oplog? I imagine there is way too much data for the oplog to be feasible in this case. And if you didn't design the system from day 1 around an event model, wouldn't it be difficult to retroactively introduce this new architecture?


It's nice to see this kind of thing covered.

A company I worked for some years ago, LiveOps, also followed this methodology, because we had to.

LiveOps was and is a 'telephony in the cloud' provider, and the databases are in-line with some of the important call flows.

Last I looked, in the ten years previous, LiveOps had handled at least one call from something like 15% of all of the possible phone numbers in the United States, and all of that data was available in a huge replicated mysql cluster spanning multiple datacenters. We had mysql tables with tens of billions of rows, replicated across many servers.

And just to be crystal clear: there was absolutely no downtime allowed, ever, because downtime of any kind meant that phone calls would not go through.

We used extensive feature flags, along with the techniques described in the stripe.com article, to achieve exceptional uptime, relatively ok operational overhead and pretty good development velocity.

The point I'd like to drive home is that this is all possible even for medium sized organizations and moderately sized staffs. The important part is that everyone needs to keep these priorities, methods and techniques in mind, all the time, to make it work.


The article refers interchangeably to tables and collections. Reading this[1], it seems they are using MongoDB as their transactional database. If they do, it would be interesting to know at what point they decided to split the subscriptions data into its own collection. The article just states "As we added new features, this data model became problematic."

Mongo-modelling being a black art of sorts (to me, at least), I'd be more interested in what the tipping point was for them (data size and shape, usage patterns) than in the relatively straightforward (conceptually at least; operationally these things are never to be underestimated) table-doubling approach to changing a data model.

[1] https://stripe.com/blog/announcing-mosql


We plan to share more posts on our data infrastructure: like you pointed out, there are lots of useful patterns to share. In the meantime, I’ve updated the post to be more consistent (thanks for mentioning it.)


This article seems to be about really fundamental changes to your data modeling. If you're using MySQL and interested in only changing one table, i.e. adding/removing a column or an index, check out LHM:

https://github.com/soundcloud/lhm

I've used it a lot in the past. You have to make sure to throttle the migration, or else it'll max out your CPU and memory.
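
The throttling concern looks roughly like this: copy in small batches and pause between chunks. This is a generic sketch, not LHM's actual implementation; table names, columns, and the plain sleep are all illustrative.

    import time

    BATCH_SIZE = 1_000     # rows copied per chunk
    PAUSE_SECONDS = 0.1    # breathing room between chunks

    def backfill(src_conn, dst_conn, last_id=0):
        """Copy rows from the old table to the new one in throttled batches.

        Real tools (LHM, pt-online-schema-change, gh-ost) also watch replica
        lag and back off when it grows; this sketch only sleeps.
        """
        src, dst = src_conn.cursor(), dst_conn.cursor()
        while True:
            src.execute(
                "SELECT id, data FROM old_table WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, BATCH_SIZE),
            )
            rows = src.fetchall()
            if not rows:
                break
            dst.executemany(
                "INSERT IGNORE INTO new_table (id, data) VALUES (%s, %s)", rows
            )
            dst_conn.commit()
            last_id = rows[-1][0]
            time.sleep(PAUSE_SECONDS)  # the actual throttle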


Yeah, in simpler circumstances you can follow the approach of

1) Write code that works with current and new schema

2) Deploy that code

3) Run that version for long enough to know you will not roll it back

4) Run the migration, moving the db from current to new schema

5) Clean up code to only work with new schema

6) Deploy that code

Where step 4 uses an appropriate tool for online schema changes for your database.

Here are some writeups of that approach I found when trying to explain this to a friend who's a Rails developer:

* http://blog.honeybadger.io/zero-downtime-migrations-of-large...

* https://pedro.herokuapp.com/past/2011/7/13/rails_migrations_...

* https://www.rainforestqa.com/blog/2014-06-27-zero-downtime-d...

* https://blog.engineyard.com/2011/zero-downtime-deploys-with-...


Another tool in this category is gh-ost by github: https://github.com/github/gh-ost

Its main advantage seems to be that it doesn't require triggers on the original table (it works by subscribing to the replication log), thus not making write locks worse during the migration.

I haven't tried either yet, so if anybody has some experience with them, do please share a comparison!


Is it better than pt-online-schema-change from percona-toolkit? That thing creates a new table which is the same shape as the old one, runs an ALTER on the new table, sets up triggers on the old table to update the new table, and then copies the old rows in batches while monitoring replication slave lag. Then it renames the tables and drops the old one. You do need storage for twice the space of the table, at least temporarily.


Square's wrapper around pt-online-schema change is really slick: https://github.com/square/shift

It has a nice webapp to monitor and run schema migrations, including peer review. I use it all the time at Square, and it's pretty great.


In addition, in replicated setups, the more important reason for throttling the migration is avoiding slave lag.

Disclaimer: I worked on LHM a few years ago, still one of my favorite projects :)


An alternate pattern goes like this:

1. Create new data model and adjust reads to read from the new model first and fall back to the old model. Update ALL your readers first.

2. Change writes to use the new model.

3. Now you have different options. You can leave things as they stand. You can actively migrate data from old to new. Or you can do a lazy migration, for example migrating old data only when you read it.

There are some gotchas depending on your system specifics, esp. races between writes from the migration process and writes coming from the application. How you address those depends on the specifics of your situation...
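
A minimal sketch of steps 1 and 3, with plain dicts standing in for the two stores (names and record shapes are made up); the write-if-absent in the lazy path is one way to avoid the stale-copy race mentioned above.

    def convert_to_new_model(legacy):
        # Hypothetical translation from the old shape to the new one.
        return {"subscriptions": list(legacy.get("plans", []))}

    def read_subscriptions(customer_id, new_store, old_store, lazy_migrate=True):
        """Prefer the new model, fall back to the old one, and optionally
        migrate a record the first time it is read."""
        record = new_store.get(customer_id)
        if record is not None:
            return record

        legacy = old_store.get(customer_id)
        if legacy is None:
            return None

        record = convert_to_new_model(legacy)
        if lazy_migrate:
            # setdefault == write-if-absent, so a fresher record written by
            # the live writers is never clobbered by this stale copy.
            new_store.setdefault(customer_id, record)
        return record

    # usage with plain dicts standing in for the two data stores
    old_store = {"cus_1": {"plans": ["tv_service"]}}
    new_store = {}
    print(read_subscriptions("cus_1", new_store, old_store))  # migrated on first read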


I'm curious about how this edge case was handled:

1. Lisa has old tv_service subscription, saved to snapshot, but not live-copied to Subscription table.

2. Lisa removes tv_service subscription.

3. Before snapshot is updated, batch processing adds lisa_tv_service to the Subscription table.

4. Subscription now has lisa_tv_service, but Lisa.subscriptions does not include tv_service.

My only idea: wait a day, then run a batch-process job looking for data that exists in the Subscription table but not in Customer.subscriptions.


In step two both collections/tables are updated. The backfilling routine is hopefully "smart" enough to leave stuff already in the new table alone.


Right, but there's nothing to delete in the Subscription table at step 2. `lisa_tv_service` isn't added until the backfilling step.


Ah, I see, you're right.

Well, one solution would involve a log table used to log/look up the migration status: any time the real-time system touches data (e.g. deletes the subscription in the source table and has nothing to do in the new table), the event is logged. The backfilling routine uses that data to decide whether its data or the current data is "fresher".

If the old subscription data is never deleted (or not for a certain period) but merely marked inactive (e.g. with "timestamp_end" and "timestamp_modified" fields updated), then that information can be used during the backfilling to decide which data is more recent. [edit: I guess this is what haldean is referring to above by "tombstone records."]

That makes sense, I think. Does it? Elsewhere in the thread I referred to the procedure used as "conceptually relatively simple". Most things are conceptually simple, but always with pitfalls and dark things lurking around the edges.


You could use a tombstone record instead of deleting outright, and the batch update could skip copies that have corresponding tombstones.
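
A sketch of how the batch copy could consult tombstones (record shapes, timestamps, and field names are made up, not Stripe's actual job): the backfill checks for a deletion or a fresher live write before copying, so a cancellation that happened after the snapshot isn't resurrected.

    def backfill_row(snapshot_row, new_table, tombstones):
        """Copy one snapshot row into the new table unless live traffic has
        already deleted or superseded it.

        `tombstones` maps record id -> deletion timestamp written by the live
        code path instead of deleting outright.
        """
        rid = snapshot_row["id"]

        deleted_at = tombstones.get(rid)
        if deleted_at is not None and deleted_at >= snapshot_row["modified_at"]:
            return  # Lisa canceled after the snapshot: skip, don't resurrect

        existing = new_table.get(rid)
        if existing is not None and existing["modified_at"] >= snapshot_row["modified_at"]:
            return  # dual writes already produced something fresher

        new_table[rid] = snapshot_row

    # usage: the snapshot still contains lisa_tv_service, but the live system
    # wrote a tombstone for it after the snapshot was taken
    snapshot_row = {"id": "lisa_tv_service", "modified_at": 100}
    tombstones = {"lisa_tv_service": 150}
    new_table = {}
    backfill_row(snapshot_row, new_table, tombstones)
    print(new_table)  # {} -- the canceled subscription was not re-added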


There are other articles for those interested:

- https://blog.codeship.com/rails-migrations-zero-downtime/

- https://blog.engineyard.com/2011/zero-downtime-deploys-with-...

In the Ruby on Rails world, it's definitely a thing (made easier with attribute reader / writer, and db/migrations/)


Considering doing something similar while we add a graph database for relationships between users in addition to Postgres. The consistency checks between the two databases can get messy. How long did you test to make sure no niche case fell through the cracks?


Nice article. The use of Scientist before switching over reads was especially good to know.


The big question that wasn't really covered is how they are selectively updating the writers and readers on a live environment, especially migrating on a per-object basis to have them start dual writing and reading.


Feature flags, most likely. Deploy the dual-write code behind a feature flag, enable the feature flag for a small fraction of requests, start notching up the percentage until you're at 100%, then remove the feature flag and the old single-write code. Rinse and repeat for the remaining three steps.
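
For illustration, percentage rollout is often implemented by hashing a stable identifier into a bucket; this is a generic sketch, not Stripe's actual flag system, and the flag and id names are invented.

    import hashlib

    def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
        """Deterministically place each user in a 0-99 bucket so the same users
        stay enabled as the rollout percentage is notched up."""
        digest = hashlib.sha1(f"{flag_name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < rollout_percent

    # notching the dual-write flag up from a small percentage toward 100
    if flag_enabled("dual_write_subscriptions", "cus_123", rollout_percent=5):
        pass  # write to both the old and the new data model
    else:
        pass  # old single-write path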


The actual deployment was what I was curious about :)


Ah, my bad, I misunderstood.

It's pretty normal for tech companies to do dozens of deploys a day, thanks to CI/CD. There are lots of ways to achieve this, but they all pretty much boil down to the same idea: boot up the new version before shutting down the old version.


Writing Code:

   if (migration_complete) {
      write_new_data_model();
   } else if (migration_in_progress) {
      write_old_data_model();
      write_new_data_model();
   } else {
      write_old_data_model();
   }
Reading Code:

   if (migration_complete) {
     read_new_data_model();
   } else {
     read_old_data_model();
   }
The `migration_in_progress` feature flag is flipped first, then the migration jobs run to backfill the new data model from the old data model, then the `migration_complete` flag is flipped. If needed, and as many times as necessary, `migration_in_progress` can be flipped back off, the new data model blown away, and new writing code deployed. If there's no confidence in the code for reading the new data model, then a certain percentage of requests/users can be directed to read from the new data model while double-writing continues.


I get the code bit; I was more curious as to how they in particular get that slipstreamed in there. Load-balanced deployments, as others mentioned below, seem the most feasible.


From my understanding, they migrate the backed-up old data to the new db after writers have already started writing any new updates to the new db. So the MapReduce job on that backed-up data should just have to look at timestamps on create/update to check whether or not to load that row into the new db.


Right, but how is that new code rolled out to live systems with zero downtime? They said they try not to update more than a few hundred lines of code, and only on individual objects. What process are they using to roll that out to start the writing?


Here's one simple model: launch new containers into the load balancer pool (having the new code), and shutdown old containers. This can be done in an incremental way so you get a staged rollout with no downtime.


Containers make sense, thanks.


Doesn't even need to be containers; the key part is having a load balancer that can shift load to the newly deployed web applications without downtime.


Or use workers with a manager like uwsgi. Containers are overkill if you just want zero-downtime upgrades.


> Once the migration is complete, run the Scalding job once again to make sure there are no existing subscriptions missing from the Subscriptions collection.

That sounds scary! How many times do you need the "once again"?


This is a safety check. Because of the dual-writing pattern, there's no reason it should be necessary, but with migrations it makes sense to be extra-cautious and to double-check your work.
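
Conceptually that check is just a set difference over primary keys. A toy sketch (the real thing is a Scalding/Hadoop job over snapshots, not an in-memory comparison, and the ids are made up):

    def find_missing(old_subscription_ids, new_subscription_ids):
        """Return ids present in the old model but absent from the new
        Subscriptions collection; an empty result means the backfill converged."""
        return set(old_subscription_ids) - set(new_subscription_ids)

    missing = find_missing({"sub_1", "sub_2", "sub_3"}, {"sub_1", "sub_2"})
    print(missing)  # {'sub_3'} -> rerun the backfill for just these records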


In Part 2 where they start reading from the new table, what happens when old subscription data is not present in the new table? Do they then copy the data from the old table to the new table when the need arises?


You have two options. The first is to not move to phase two until EVERY record is duplicated. Looks like that's what they did.

The other option is the on-demand option you mentioned. The biggest issue I see is the latency to fetch that information is suddenly a lot more random. If it's not in DB B then you have to fetch from A and write into B before returning the information.


Sure, but what do you do if your migration requires merging rows of existing data?

You can't rewrite the pointers to the merged rows while the application is online.


How were the animated graphics for this article created?


Looks like custom tweening code they wrote for their blog:

https://stripe.com/assets/blog/animate-svg-fe67ad3b4fe397689...


But how are you going to do this if you are constantly changing models? Like every other day?


So to sum it up: create table, copy data, rename table. Isn't this already tackled by tools created by Percona and many others on the market? Other than them adding Hadoop to speed things up, what else is new here?


"What else is new here" is you can't have any downtime, and you have to have a strategy for having your code work against the data in all of those transitional states during that process.

A lot of migrations would be pretty trivial if you could have even a couple hours of downtime, but in an expectation of 24/7 availability that is no longer acceptable in most situations.


You are thinking of pt-online-schema-change or Soundcloud's LHM. Those are tools to modify the schema.

Here they were modifying the data. But yes, it works the same way, and it's not really novel in any way.


This is some very impressive stuff!


Yikes. Use evolvable schemas and never do mass migrations.


How would that work? If you never do an explicit migration, you'll always have records on very old schema versions. Doesn't that force you to support every version forever in the code?


What happens when you need to split one physical database into two physical databases?


This sounds a little overcomplicated, all in an effort to decrease the amount of code changed at any given time. They took great pains to keep data in sync across A and B datastores and I'm not so sure that extra cost was worth the perceived stability of this approach.


> They took great pains to keep data in sync across A and B datastores and I'm not so sure that extra cost was worth the perceived stability of this approach.

Such great pains come with huge systems. What's the alternative?

Taking the platform offline for a few hours? Management will say no. Or maybe Management will say yes once every three years, severely limiting your ability to refactor.

Doing a quick copy and hoping nobody complains about inconsistencies? Their reputation would suffer severely.


They maintained a replication process across both tables as they updated the read processes before updating the write process. Say for whatever reason their offline replication process broke for 2 hours. For those 2 hours that replication is broken, the system is reading from a table that is not in sync with the table that is receiving writes. At that point you are displaying incorrect subscription data to your customers.


> They maintained a replication process across both tables as they updated the read processes before updating the write process. Say for whatever reason their offline replication process broke for 2 hours.

From the article I got the impression that both tables were being written to in the same database transaction, so this is not a possible failure scenario at all.


Stability matters when you're dealing with people's bank balances and billions of dollars a year of credit card transactions.


There's only one datastore. I'm not sure what you mean by "perceived stability"; it's about data integrity and preventing data loss during a model reshuffle.

I'd like to hear about alternatives if you've had experience.


They provided data integrity with a background sync process. What would have happened had that sync process failed?

In the past I have either flagged records to say where the system should read the data from, or just built logic into the readers to say if there is no data in A, read from B.

Then you update the writers to migrate the data from A to B on every new update and remove the data from A. It is an expensive one-time write to move the data, but then you don't have to worry about keeping data in sync across two storage locations.

What you end up with is that all records that are actively being worked on move first. At a later date you start migrating all those stale records from A to B with a background process. Once A is empty, remove the logic to read from A, remove the migration writes, and remove the A datasource.
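
A sketch of that migrate-on-write idea, with dicts standing in for the A and B datastores (names and record shapes are made up):

    def update_subscription(customer_id, changes, store_a, store_b):
        """Apply an update, migrating the record from A to B if it still lives in A."""
        record = store_b.get(customer_id)
        if record is None:
            record = store_a.pop(customer_id, {})  # the one-time expensive move
        record.update(changes)
        store_b[customer_id] = record
        return record

    store_a = {"cus_1": {"plan": "tv_service"}}
    store_b = {}
    update_subscription("cus_1", {"plan": "music_service"}, store_a, store_b)
    print(store_a, store_b)  # {} {'cus_1': {'plan': 'music_service'}}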



