Move Fast and Migrate Things: How We Automated Migrations in Postgres (benchling.engineering)
193 points by sajithw on Feb 6, 2019 | 54 comments



I am convinced that data migration is one of the hardest problems in data management and systems engineering.

There are basically no solutions today that satisfy fundamental requirements such as minimizing downtime and guaranteeing correctness.

It is _such_ a huge problem that most inexperienced developers see kicking the problem down the line with NoSQL document storage as a viable alternative (it isn't; you'll be either dealing with migrating all data forever and special-casing every old version of your documents, or writing even more convoluted migration logic).

It's also clear that even the most modern ORMs and query builders were not designed with the issues that arise in migrating data in mind.

It would be refreshing to see more research devoted to this problem. Unfortunately, migrations end up being so different from each other, with such heterogeneous requirements, that we'll probably be working on this for a really long time.


I recently discovered anchor modeling [1], which forms the relational underpinnings of sqlalchemy-continuum [2], which I haven't used but am quite curious about. I've done ad-hoc, manual versions of anchor modeling for most data migrations in my career, and I agree with you on how hard it is. It's high risk, difficult to verify, and repairing the damage after a botched migration is also expensive. With diligence and enough resources it is doable, but it could be a lot easier.

I think that more sophisticated static analysis and migration generation tools would really help in making this a reality, especially if you combine them: something like rope[3] for generating migrations, plus hypothesis[4] for property-based testing of the generated cases, would make things nice as well. Definitely a hard problem, and definitely a worthwhile one to solve. If our team ever gets some free time, we'd enjoy building a toolkit that puts all of this together!

[1] https://en.wikipedia.org/wiki/Anchor_modeling

[2] https://github.com/kvesteri/sqlalchemy-continuum

[3] https://github.com/python-rope/rope

[4] https://hypothesis.readthedocs.io/en/latest/


I think DBs could definitely do more to expose the cost of various operations. It would be great if you could "explain" a migration before you run it, like you can with a query, and have it calculate a rough cost: how many rows need to be touched, what resources need to be locked, even how likely the required locks are to contend with other frequently-taken locks based on system statistics, etc.

But I was thinking about this recently and I feel like there's some low-hanging fruit in the migration frameworks themselves, which, at least as far as I'm aware, all completely punt on this problem. Rails, Alembic, and every other framework I've used will happily let you write a migration that adds a new non-nullable column or renames an existing column, things that can be really slow on a big or frequently written table and/or will cause problems during the rollout if an old version of the code is still running. It doesn't seem like it would be that hard to add a safe mode where the framework blocks at least some of the most common variations of these unsafe migrations. Maybe it's harder than I realize, or maybe it's just a matter of someone opening up some PRs and actually implementing it.
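
To make that concrete, here's a rough sketch of the kind of check I'm imagining, hung off Alembic's process_revision_directives hook. It only looks at autogenerated operations, the attribute names are my reading of Alembic's op objects, and nothing like this ships as a built-in "safe mode" today:

    from alembic.operations import ops

    def _walk(container):
        # Autogenerated directives nest column-level ops inside ModifyTableOps.
        for op in container.ops:
            if hasattr(op, 'ops'):
                yield from _walk(op)
            else:
                yield op

    def forbid_unsafe_ops(context, revision, directives):
        """Wire up via context.configure(process_revision_directives=forbid_unsafe_ops)."""
        problems = []
        for op in _walk(directives[0].upgrade_ops):
            # A NOT NULL column with no server default breaks inserts from old
            # application code and (before Postgres 11) forces a full table rewrite.
            if isinstance(op, ops.AddColumnOp):
                col = op.column
                if not col.nullable and col.server_default is None:
                    problems.append("NOT NULL column %r added without a default" % col.name)
            # Renames break any old code still reading the old column name.
            elif isinstance(op, ops.AlterColumnOp) and getattr(op, 'modify_name', None):
                problems.append("column %r renamed to %r" % (op.column_name, op.modify_name))
        if problems:
            raise RuntimeError("unsafe migration operations: " + "; ".join(problems))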


> do more to expose the cost of various operations

Both Postgres and SQL Server can show a plan for DML without running the statement (Postgres via EXPLAIN, SQL Server via estimated execution plans), with varying levels of depth.

    CREATE TEMPORARY TABLE tmp_blah (num INT);
    EXPLAIN INSERT INTO tmp_blah SELECT * FROM generate_series(1, 100000) ORDER BY random();

    QUERY PLAN
    Insert on tmp_blah  (cost=62.33..74.83 rows=1000 width=4)
        ->  Subquery Scan on "*SELECT*"  (cost=62.33..74.83 rows=1000 width=4)
            ->  Sort  (cost=62.33..64.83 rows=1000 width=12)
                Sort Key: (random())
                ->  Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=12)


One challenge with that is that the query plan often depends on the current table composition statistics. If a value is relatively common (or uncommon), or the size of the table has grown or shrunk, you may end up with a materially different (and conceivably substantially worse) plan.


I can't tell you how many times over 20 years I've heard a DBA tell me "the statistics weren't updated" after an incident.


I've written cron jobs to update mysql statistics to prevent it from choosing bad query plans. It's as terrible as it sounds.


How come it's not automatic and built-in? Why is a script terrible? (Does it have to enumerate all tables and piecewise run the stat update?)


Yes, the stats update is per table. We only updated it for a few tables we really cared about.


This is my reason for preferring NOSQL where possible.


I guess the issue with some of that stuff is that the solution is to do it in stages. Eg, adding a non-nullable column is made safe by doing part of the work and then updating rows in batches. Alembic runs all migrations in a transaction, so there’s not much point batching. It really needs to be handled outside alembic to make sense in the model. I guess that’s all a moot point now in Postgres 11 anyway.


What do you mean by your last sentence? How does PG 11 help?


Postgres 11 lets you add a column with a default (NOT NULL included) without a full table rewrite.
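
In Alembic terms, that means something like this becomes a cheap catalog-only change on PG 11+ (table and column names here are made up):

    import sqlalchemy as sa
    from alembic import op

    def upgrade():
        # Non-volatile default, so Postgres 11+ does not rewrite the table.
        op.add_column(
            'users',
            sa.Column('is_archived', sa.Boolean(), nullable=False,
                      server_default=sa.text('false')),
        )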


> I think DBs could definitely do more to expose the cost of various operations. It would be great if you could "explain" a migration before you run it, like you can with a query, and have it calculate a rough cost: how many rows need to be touched, what resources need to be locked, even how likely the required locks are to contend with other frequently-taken locks based on system statistics, etc.

As part of my development process for a project I am currently working on, I have spent quite a bit of time writing a Python 3 program to generate sample data.

It's nothing groundbreaking, but I haven't seen anyone talk about this, so I think it might be of interest to others.

I'll explain how that is relevant to the comment I am replying to in a moment but first I would like to talk a little about said program.

So first of all, there are existing tools for generating sample data. For example, DBeaver (https://dbeaver.io/) has functionality for generating mock data. However, while DBeaver as a whole is a nice tool that I am happy to have learned about, its mock data generation is, from what I have seen, severely limited.

In particular, what I wanted was to generate data that adheres to arbitrary statistical distributions of my choosing.

In the first version of my program it took about 50 minutes (unacceptable!) to generate and insert 10,000,000 records on my laptop, and it consumed so much RAM that my laptop started swapping, which hurt performance as well (to be fair, my laptop only has 8GB of RAM, but still). That, however, was just the starting point: it only did a statistical distribution for one property (age); a few other properties were assigned random values, and all the rest were assigned a single value common to all records.

After a bit of thinking I decided that the next step would be to construct a tree structure with frequencies for the different values. This tree is cheap for memory and fast to build.

First I specify a set of "distribution templates" that define the frequencies for values, or groups of values, of each property that should follow a desired statistical distribution.

For example, I create a demographics model (simplified here for brevity):

    distribution_templates = \
    {
      'age':
      {
        '18-24': 150,
        '25-34': 206,
        '35-44': 185,
        '45-54': 177,
        '55-64': 175,
        '65+':   106,
      },
      'gender':
      {
        'female':         80,
        'male':           15,
        'other':           3,
        'rather not say':  2,
      },
    }
From this I create at runtime what I refer to as a "combinatorial tree" with calculated target frequencies, and then I distribute a population count over it that is specified at runtime as well. Let's say that we want to generate 10,000,000 sample users. The resulting tree looks like this:

    -- root -- 'all' (10000000) -- age -- '18-24' (1501501) -- gender -- 'female' (1201201)
                                       |                              |- 'male' (225225)
                                       |                              |- 'other' (45045)
                                       |                              `- 'rather not say' (30030)
                                       |- '25-34' (2062062) -- gender -- 'female' (1649650)
                                       |                              |- 'male' (309309)
                                       |                              |- 'other' (61862)
                                       |                              `- 'rather not say' (41241)
                                       |- '35-44' (1851852) -- gender -- 'female' (1481482)
                                       |                              |- 'male' (277778)
                                       |                              |- 'other' (55555)
                                       |                              `- 'rather not say' (37037)
                                       |- '45-54' (1771772) -- gender -- 'female' (1417418)
                                       |                              |- 'male' (265766)
                                       |                              |- 'other' (53153)
                                       |                              `- 'rather not say' (35435)
                                       |- '55-64' (1751752) -- gender -- 'female' (1401402)
                                       |                              |- 'male' (262763)
                                       |                              |- 'other' (52552)
                                       |                              `- 'rather not say' (35035)
                                       `- '65+' (1061061) ---- gender -- 'female' (848849)
                                                                      |- 'male' (159159)
                                                                      |- 'other' (31832)
                                                                      `- 'rather not say' (21221)
In addition to that I have created functions for things like algorithmically generating unique usernames.

Then I have a tree walker that yields all of the combinations represented by the tree. From the tree above we would yield { 'age': '18-24', 'gender': 'female' } a total of 1,201,201 times, { 'age': '18-24', 'gender': 'male' } a total of 225,225 times and so on. (Again, the real tree is deeper than this -- there's more than just age and gender in the actual model. Also, age is specified at a more fine-grained level but the concept is the same.)
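
(Simplified, the walker boils down to something like this. It's a sketch of the idea rather than the real implementation: it assumes the properties are independent, and its floor-based rounding differs slightly from the rounded counts in the tree above.)

    from itertools import product

    def walk_combinations(distribution_templates, population):
        """Yield each combination dict the number of times its target frequency dictates."""
        props = list(distribution_templates)
        totals = {p: sum(distribution_templates[p].values()) for p in props}
        for chosen in product(*(distribution_templates[p].items() for p in props)):
            count = population
            combo = {}
            for prop, (value, weight) in zip(props, chosen):
                count = count * weight // totals[prop]
                combo[prop] = value
            for _ in range(count):
                yield dict(combo)

    # e.g. walk_combinations(distribution_templates, 10_000_000) yields
    # {'age': '18-24', 'gender': 'female'} roughly 1.2 million times, and so on.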

I retrieve these values from the generator function in batches and generate sample user profiles based on that.

So within a span like '18-24' I linearly distribute dates of birth: I calculate (upper date bound - lower date bound) / count and use that as the delta to step the dates of birth by within the span, starting from the lower bound. (And of course the more fine-grained your spans are, the smaller the impact of linearly distributing values within each span.)
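
(In code that is roughly the following; the bounds in the example call are invented.)

    from datetime import date, timedelta

    def spread_dobs(lower, upper, count):
        """Step dates of birth evenly across [lower, upper); day granularity is plenty here."""
        step = (upper - lower) / count        # timedelta / int works in Python 3
        for i in range(count):
            yield lower + i * step            # date + timedelta keeps whole days

    # e.g. for a '25-34' span: spread_dobs(date(1985, 2, 7), date(1994, 2, 6), 2_062_062)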

And I generate additional properties that are pseudo-random like usernames.

The script execution time is presently down to about six and a half minutes for generating and copying the data. (Using COPY instead of INSERT was one of the optimizations I made to the script, along with a few others from https://www.postgresql.org/docs/current/populate.html, after meticulously measuring that they had a significant positive impact on the time it took to load the data into the db.) Memory usage is very reasonable, and with the most recent commit, which guarantees that generated usernames are unique (for up to a set number of users above the 10,000,000 I am generating), the stage is set for running the script on as many cores as your computer has to offer, further cutting down the execution time.
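
(For reference, the COPY path with psycopg2 looks roughly like this; the table and column names are invented, and it assumes the generated values contain no tabs or backslashes that would need escaping.)

    import io

    def copy_batch(conn, rows):
        """Bulk-load one batch of generated users via COPY rather than INSERT."""
        buf = io.StringIO()
        for username, age, gender in rows:
            buf.write("%s\t%s\t%s\n" % (username, age, gender))
        buf.seek(0)
        with conn.cursor() as cur:
            cur.copy_expert(
                "COPY sample_users (username, age, gender) FROM STDIN", buf)
        conn.commit()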

Anyway, now on to how any of this is relevant to your comment.

I wrote this tool as part of my development process because a lot of the value that the project will offer to its users is tied directly to segmenting the users by various facets, and so I need sample data that allows me to explore the user experience while developing the project.

But I also think that this sort of tool could be useful in the situation that you guys are talking about here.

So when you are doing a migration you could generate a smaller test set of data that realistically reflects the real data of your users and you could then run the migration on that test set to get a very good idea about the cost of the various operations like you wanted. Agree?

Also, if anyone else knows of similar tools to mine I am always interested in knowing about them. Discovering what exists already can be hard and I have gotten to learn about many interesting and useful tools through discussions with others.


Migrations in a strict relational database always guarantee correctness. It's the need to roll back that causes issues.

That is solved by not doing destructive changes (removing a column) until the software is stable and a few iterations have passed.

The issue of downtime is semi-valid but can likewise be worked around by migrating data in manageable batches instead of all at once.


I don't know, at my current job I've been introduced to DACPACs[1] and I really like it. The ability to do schema/data compares between arbitrary environments and generate the migrations in real time is awesome.

[1] - https://docs.microsoft.com/en-us/sql/relational-databases/da...


Doesn't Django already do this?


Skimming through the page, it looks like there's a lot more functionality there than Django has.


Category theory to the rescue -- as always!

http://math.mit.edu/~dspivak/informatics/FunctorialDataMigra...


Amen.

Tools that let you work with version A and B of a db schema would be great.

Then you could update the DB independently of a code release. One of the things I fear the most (and that prevents me from loving RDBMSs) is the thought of doing a big DB upgrade and then having to roll back because the software isn't performing as expected.


> NoSQL document storage as a viable alternative (it isn't; you'll be either dealing with migrating all data forever and special-casing every old version of your documents, or writing even more convoluted migration logic).

In practice (with Mongo at least) you end up with migrations being from arbitrary JSON to different arbitrary JSON, and come to rely on the Javascript runtime for anything even a bit complex. It's definitely convoluted (albeit extremely powerful), but I think the biggest issue is that generating backward migrations even if you did non-destructive operations is completely impossible.


You are not barred from using a schema in your upper layer. And it is very much advised to do so to maintain sanity.

It just means that the migration can be done on the fly as data is touched / read / written, without downtime, but with support for both new and old versions in your backend. (Or if you want the SQL-style migration, then yes, you need to write a script and perform a flag-day-style backend update, which is not advised.)

In reality, the best migration strategies - SQL or not - usually use a two-phase process. First the expansive changes run, which remain backward compatible with the old schema/structure, and the backend is updated; then later, once the new version of the backend is verified to work well with the new schema/structure, a contraction can be run which removes the old structures/columns/tables/etc. This is especially helpful for blue-green automated deployment (which is great for continuous deployment).
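
A sketch of what the two phases look like in Alembic terms (table and column names invented):

    import sqlalchemy as sa
    from alembic import op

    # Revision 1, "expand": purely additive, so the old backend keeps working.
    def upgrade_expand():
        op.add_column('accounts',
                      sa.Column('email_normalized', sa.Text(), nullable=True))
        # Backfill in small batches outside the migration transaction,
        # while the application writes to both columns.

    # Revision 2, "contract": run only after the new backend is verified.
    def upgrade_contract():
        op.drop_column('accounts', 'email_raw')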

What is usually missing is a tool to verify that your expansive changes are truly backward compatible, and that the contractive changes don't contract too much (i.e. don't delete any of the new structures). And this is easier if you simply manage it as part of "the backend" (or extract it into a service that wraps the backing/data store).


Strongly agree with this.

So much so that I wrote a schema comparison tool that allows you to autogenerate migration scripts and explicitly test for correctness.

Schemas are good, but the traditional tooling around changing them (rails/django migrations) is really bad. People inevitably cut corners because it's too hard to do things the right way.


Have you open-sourced your tool?



How is it different/better than apgdiff?


You say NoSQL document stores, but have you worked with graph databases like neo4j that allow for flexible schemas? I think they mostly solve this problem but don't have the uptake of RDBMSs because they're less mature.


The problem is that without a table lock of some sort you cannot guarantee that all the documents in a store satisfy the schema definition.

The NoSQL solution is basically to say, "assume that nothing could ever possibly respect any semblance of a schema and build special cases around everything". It is a very unproductive way of thinking, taking defensive programming to the extreme.

Also, joins will be a necessity for all sorts of data querying. At the end of the day, schemas are a necessity for 98% of all work, whether defined through types or ad-hoc.


This is an amazing writeup. I'm currently solving the "migrations" problem for a side project of my own, and have basically resigned myself, in the short term, to accepting short downtime for the sake of keeping migrations somewhat trivial.

And honestly? I hate this answer. As a solo dev it's pragmatic, but the solutions described in this article are _SO NICE_ that I'd love to leverage them.

If there's any way that those deprecate_column and rename functionalities could make their way into OSS/upstream support, I'd have a field day. (Those who know more about PG than I do may be able to suggest another workaround; feel free, I'm very much learning this space as I go.)

If nothing else, thanks to the Benchling team for taking the time to write such a clear yet technical exposé. This really hit the sweet spot of "explanations without unnecessary verbosity, technical without being impenetrable, and giving sufficient explanation of motivations and tradeoffs/pitfalls", and it will give me a north star for where I aim my own DB work.


Original author here. Thanks for the kind words!

We would love to open source some of the work we did - there are a few edge cases still to work out with deprecated_column and renamed_to before I'd be comfortable doing that, but I definitely agree it may be generally useful.


If it's just edge cases, you can just open source it and document the edge cases. The community will likely fork and help with those edge cases.


I would love to see the code open sourced! (it's ok if you have edge cases pending as long as they have been documented or at least mentioned in a README). Managing and running migrations smoothly without downtime is a very common challenge that a lot of startups face and I think a lot of us have been reinventing the wheel over and over to handle this due to the lack of current offerings.


In our startup we moved away from Alembic to plain SQL files for migrations, which (in our experience) is more robust and allows more control over the actual migration process. We wrote a simple migration manager class that loads a YAML config and a series of SQL files from a directory. Each migration is defined as two files of the form "[number]_[up/down]_[description].sql" and tied to a DB version; the YAML config specifies the name of a version table that contains the current version in the database. The manager then reads the current version from the table, compares it to the requested one and executes the necessary SQL files.
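
(A minimal sketch of the idea, with invented config fields and no error handling.)

    import pathlib
    import re
    import yaml

    class SqlMigrator:
        """Apply numbered "<n>_up_<desc>.sql" / "<n>_down_<desc>.sql" files."""

        def __init__(self, config_path):
            with open(config_path) as f:
                cfg = yaml.safe_load(f)
            self.version_table = cfg['version_table']
            self.dir = pathlib.Path(cfg['migrations_dir'])

        def _files(self, direction):
            pattern = re.compile(r'^(\d+)_%s_.*\.sql$' % direction)
            return {int(m.group(1)): p
                    for p in self.dir.iterdir()
                    if (m := pattern.match(p.name))}

        def migrate(self, conn, target):
            with conn.cursor() as cur:
                cur.execute(f"SELECT version FROM {self.version_table}")
                current = cur.fetchone()[0]
                if target > current:
                    steps = [(v, self._files('up')[v]) for v in range(current + 1, target + 1)]
                else:
                    steps = [(v - 1, self._files('down')[v]) for v in range(current, target, -1)]
                for version, path in steps:
                    cur.execute(path.read_text())
                    cur.execute(f"UPDATE {self.version_table} SET version = %s", (version,))
            conn.commit()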

Alembic is great for many simple use cases, but we found that for a production system it often isn't easy to maintain compatibility between two different DB systems like Postgres and SQLite anyway, as that would mean either adding a lot of feature switches and custom logic to our code or not using most of the interesting native Postgres features. Therefore Alembic offered very little benefit over a plain SQL file in terms of functionality, and in addition it made it harder to generate correct migrations in some cases, as the auto-generation process does not work very reliably in our experience and some things are buggy/inconsistent, e.g. the creation and deletion of enum types. In addition, we found that it's much easier to write complex migration logic (e.g. create a new column and populate it with data from a complex SELECT statement) directly in SQL. The last point is that we can of course execute these migrations using any programming language / tool we like (for example we also wrote a small Go library to handle the migrations), which is a nice bonus.

That said we also heavily use SQLAlchemy in our Python backend code and like it a lot.


This covers a lot of ground that I've recently had to learn the hard way.

One item I've been considering: under "Downtime", one reason given for flaky migrations is "long running transactions".

I've seen this too, and wonder if the correct fix is actually to forbid long-running transactions. Typically if the naive long-running transaction does something like:

    with transaction.atomic():
        for user in User.objects.all():
            user.do_expensive_thing_to_related_objects()

You can often recast that migration to something more like

    for user in User.objects.all():
        with transaction.atomic():
            user = User.objects.select_for_update().get(id=user.id)  # Re-read and lock the row (SELECT FOR UPDATE)
            user.do_expensive_thing_to_related_objects()
This example is somewhat trivial, but in most cases I've seen you can fetch your objects outside of the transaction, compute your expensive thing, and then lock your row for the individual item you're working on (with a sanity-check that your calculation inputs haven't changed, e.g. check the last_modified timestamp is the same, or better that the values you're using are the same).
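
In Django terms, the per-item transaction ends up looking roughly like this (model and field names are placeholders):

    from django.db import transaction
    from myapp.models import User   # placeholder app/model

    def apply_expensive_change(user_id, computed_value, seen_last_modified):
        with transaction.atomic():
            # Re-read and lock just this one row.
            user = User.objects.select_for_update().get(id=user_id)
            if user.last_modified != seen_last_modified:
                return False   # inputs changed since we computed; skip or recompute
            user.expensive_field = computed_value
            user.save(update_fields=['expensive_field'])
            return True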

I've considered simply configuring the DB connection with a very short statement timeout (something like 5 seconds) to prevent anyone from writing a query that performs badly enough to interfere with other tables' locks.
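
In Django that's just a couple of connection options (the values here are arbitrary and the options string is Postgres-specific):

    # settings.py sketch
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql',
            'NAME': 'mydb',   # placeholder
            'OPTIONS': {
                # libpq options: cap any statement at 5s and any lock wait at 2s
                'options': '-c statement_timeout=5000 -c lock_timeout=2000',
            },
        },
    }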

Anyone tried and failed/succeeded in making this approach work?

The other subject that's woefully underdeveloped is writing tests for migrations; ideally I want to (in staging) migrate the DB forwards, run all the e2es and smoke tests with the pre-migration application code, migrate back (to test the down-migration), run the e2es again, and then really migrate forwards again. That would cover the "subtly broken deleted field" migration problem.
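
A rough sketch of what that staging job could look like with Django migrations (revision numbers and the e2e runner are placeholders):

    from django.core.management import call_command

    def migration_round_trip():
        call_command('migrate', 'myapp', '0042')   # forwards to the new revision
        run_e2e_suite()                            # pre-migration app code, new schema
        call_command('migrate', 'myapp', '0041')   # exercise the down-migration
        run_e2e_suite()
        call_command('migrate', 'myapp', '0042')   # and forwards again for real

    def run_e2e_suite():
        # placeholder: however you drive your e2e / smoke tests in staging
        ...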

But how do we test that our migrations behave correctly in the face of long-running transactions? I.e. what's the failing test case for that bug?


> But how do we test that our migrations behave correctly in the face of long-running transactions? I.e. what's the failing test case for that bug?

Isn't it enough to simply make sure the migration transaction successfully finished? Even if there is a long running transaction, if the migration finished, that long tx will get aborted and rolled back.

Or if the migration stalls because of the long running tx, then you'll presumably get a timeout error.

Is there something I'm missing?


No,

  with transaction.atomic():
    for user in db.users.all():
      user.do()
is not the same as

  for user in db.users.all():
    with transaction.atomic():
      user.do()
If the first fails, everything rolls back; if the second fails, half your db might be in an inconsistent state.


Yes, of course, they are not the same. I did not claim they were. I said:

> You can often recast that migration to something more like

By which I mean, in my experience you can usually write your migrations so that your code can work with the old AND the new version of the data, in which case you don't need to have a transaction around the whole operation.

This takes more work but is safer:

https://martinfowler.com/bliki/ParallelChange.html

https://www.amazon.com/Refactoring-Databases-Evolutionary-pa...


Great post. I agree that we don't need automated post-deploy migrations. We just need automated pre-deploy migrations. Post-deploy migrations, for example to delete an unused column, can be implemented as pre-deploy migrations in a subsequent commit.


You can also automatically generate some migrations with a proof search in linear logic, the way the beam[1] project does.

[1] https://tathougies.github.io/beam/schema-guide/migrations/#a...


In PostgreSQL, if you are using prepared statements and are doing a 'select star' and then drop or add a column, the prepared statement will start failing. This is kind of bad when you are doing transactions because the bad statement will taint your transaction and you will need to restart from the beginning. Select star is incompatible with prepared statements in PostgreSQL, which might also explain why SQLAlchemy explicitly names columns in SELECT statements. Rails will do 'select star', so prepared statements are the first thing I turn off in Rails/PostgreSQL projects. [Maybe they have a fix now?]


> This is kind of bad when you are doing transactions because the bad statement will taint your transaction and you will need to restart from the beginning.

Why is this bad? Can you handle this on the application side somehow? Even if it just means restarting Rails when the migration has finished.

Or the problem is that you want 0 downtime and 0 UX impact migration?


It is bad that you have to write a bunch of extra code in your application to retry on this failure. I guess you could potentially modify the Rails framework to automatically retry transactions in this situation and aggressively purge the prepared statement cache, but that is really a Rails change and not an application change. I don't think it is possible to hook the transaction logic in Rails to retry transactions safely from application code.

As you guessed, restarting Rails does fix the problem, but then you've lost zero-downtime deploys for migrations that should be trivially safe, i.e. adding a column.


How did it come to be that some portion of the industry uses the term "migration" to describe changes/updates to a database?

As far as I can tell, it's a really poor fit. It generates the expectation of moving an existing schema (and maybe data) from one host to another or one environment to another. What's usually happening instead is essentially a schema diff / mutation.


Because when you change the representation of data at rest, you need to "migrate" that data to the new representation.

I agree that schema changes are not migration, but I think the author correctly uses the word "migration" to mean migrating their data in the database to some new schema representation.


This is true, but it's a bit of an implementation detail. All I'm doing is a schema change; a migration is what the database does under the hood to facilitate it.


The database doesn't know how to perform a data migration for you, however. Sometimes it's possible to do it in SQL as part of the schema upgrade but that isn't always possible, in which case you need to implement it in the application. But regardless of how you implement it, it's still "data migration" because you're migrating data from one representation to another and your database won't always know how to do that "under the hood" for you.


I guess it is because you migrate data between application versions. And sometimes this means simply applying an SQL script file, sometimes you also combine this with upgrading the DB server itself, then you might also need to do the whole export-import dance (async replication + export-upgrade-import on a secondary instance, then promote that instance and either drop the old and create a new secondary, or do the same exp-imp on the old-primary).

Good question though, it could be "schema upgrade", but sometimes there's a downgrade, so somehow people started calling it migration.

And it's especially important to note that users don't care about the schema, operators care a bit, developers slightly more, and if there is a dedicated DBA in the loop, that person might finally really care about it, but they rarely write the migration scripts. And developers care about the data. (At least that's my impression.)


The word "replication" is used when replicating data from one location to another. Maybe it depends on your industry, but everywhere I've worked in software, migration has meant changing the schema and data, and replication has meant moving it to another location.

I do agree that the terminology could be better, but it seems to be fairly standardised now.


These are the types of challenges I've been thinking about for the last year -- nice to be able to read the approach taken by others.


I'm totally in love with Django's way of migrating databases.

It's not for every project, certainly, and you sometimes need to work around limitations of the ORM. And of course some people don't like ORMs in the first place.


Anyone else hold back on releasing side projects because having to do data migrations with stored user data prevents you from aggressively refactoring your code? Is there a good compromise for this?


If the side project is "small", then the migration can be done in a few hours tops, so user impact is probably negligible. Have you encountered a more severe problem maybe?


I just meant I notice I feel a lot less freedom after initial release. When I haven't released something yet, I can wipe and recreate the database at any time, move, rename, etc. as I want to try things out. Once I've released, though, changes become significantly more painful and risky. Perhaps a migration is a few hours tops, but that's compared to a few minutes when you don't have active users yet, and when you're a solo developer each hour is a significant amount of your productivity.


> I just meant I notice I feel a lot less freedom after initial release.

Well, there's no arguing with that. And it usually doesn't matter if you have 2 or 200000 users, the migration has to be perfect - and automated - anyway :)



