> We also use Postgres which doesn't create the obstacles that some other database systems have, for example with locking tables on schema changes.
Actually, schema changes still require an exclusive table lock; it's just held very briefly, since Postgres doesn't need to rewrite each tuple. (With some exceptions, like adding a NOT NULL column or reducing the size of a VARCHAR.)
I don't mean to be pedantic; it matters because the ALTER's exclusive lock request can itself be blocked by other long-running queries. Postgres will then queue that exclusive lock ahead of other conflicting lock requests, so normal reads and writes are locked out until the long query completes or one of the queries is killed.
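A common mitigation (not from the article, just a sketch with made-up table and column names) is to set a lock_timeout before the ALTER so it gives up quickly instead of sitting in the lock queue and blocking everything behind it:

    class AddColorToWidgets < ActiveRecord::Migration
      def up
        # Fail fast if the exclusive lock can't be acquired, rather than queueing
        # behind a long-running query and blocking normal reads/writes behind us.
        # lock_timeout is a Postgres setting (9.3+), scoped to this connection.
        execute "SET lock_timeout = '2s'"
        add_column :widgets, :color, :string
      end

      def down
        remove_column :widgets, :color
      end
    end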
"Downtime" is a under-defined in this context. The author seems to use the definition "won't cause errors or disconnections" while you are also considering the response time.
Everyone cares about the former, while the latter is of more interest to heavily-loaded sites.
The case you mention of a long-running query is important, so everyone needs to have some idea of their workload at the time.
pt-online-schema-change, part of the Percona Toolkit, is my go-to tool for making production changes in MySQL.
Improvely and W3Counter both have tons of multi-gigabyte tables. A regular ALTER TABLE statement could take hours to run, and would lock the table the entire time, essentially taking down the service.
Percona's tool makes a copy of the table, runs the ALTER TABLE statements on the copy, then sets up triggers on the original to temporarily mirror data-altering queries to the copy. It then intelligently batches the data migration from the old table to the new one to avoid overloading the server, printing progress and an ETA on the console. When the copying's done, the new table is swapped into place with an atomic rename and the old one is dropped.
It takes just one command and has always worked flawlessly.
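For anyone curious about the mechanics, here's a rough sketch of the pattern the tool automates, issued as raw SQL from Ruby. Table and column names are made up, and the real tool does far more: it sizes and paces the chunks, watches replica lag, handles updates and deletes, and cleans up after failures.

    require "active_record"
    # Assumes ActiveRecord::Base.establish_connection has already been called
    # against the MySQL database in question.
    conn = ActiveRecord::Base.connection

    # 1. Make an empty copy of the table and run the schema change on the copy.
    conn.execute "CREATE TABLE visits_new LIKE visits"
    conn.execute "ALTER TABLE visits_new ADD COLUMN referrer VARCHAR(255)"

    # 2. Mirror new writes on the original table into the copy.
    conn.execute <<-SQL
      CREATE TRIGGER visits_mirror_ins AFTER INSERT ON visits FOR EACH ROW
        REPLACE INTO visits_new (id, url, created_at)
        VALUES (NEW.id, NEW.url, NEW.created_at)
    SQL

    # 3. Backfill existing rows in chunks so the copy doesn't hammer the server.
    conn.execute <<-SQL
      REPLACE INTO visits_new (id, url, created_at)
      SELECT id, url, created_at FROM visits WHERE id BETWEEN 1 AND 10000
    SQL
    # ...repeat for the remaining id ranges, pausing between chunks...

    # 4. Swap the tables atomically, then clean up.
    conn.execute "RENAME TABLE visits TO visits_old, visits_new TO visits"
    conn.execute "DROP TRIGGER visits_mirror_ins"
    conn.execute "DROP TABLE visits_old"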
For the many people discussing how this has been effective for them, does MySQL 5.6's online DDL [1] not solve this well enough for you? I know pt-online-schema-change does some extra stuff to e.g. prevent overloading the server, but so far regular migrations in 5.6 have handled this fine for us.
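For reference, 5.6's online DDL can also be requested explicitly, so the statement fails up front if that particular change can't actually run in place without locking; a one-liner sketch against a made-up table:

    # ALGORITHM and LOCK are MySQL 5.6+ online-DDL clauses; if the requested
    # combination isn't supported for this particular change, the ALTER fails
    # immediately instead of silently falling back to a blocking table copy.
    ActiveRecord::Base.connection.execute(
      "ALTER TABLE visits ADD COLUMN referrer VARCHAR(255), ALGORITHM=INPLACE, LOCK=NONE"
    )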
For ActiveRecord (or other Ruby-based) migrations with MySQL, I've had good luck with Large Hadron Migrator[1]. It's worked flawlessly for some of our own tables with several hundred million rows (and no downtime).
This looks interesting. Do you know how it actually works? The readme simply says, "The basic idea is to perform the migration online while the system is live, without locking the table. In contrast to OAK and the facebook tool, we only use a copy table and triggers." What's the technique used to achieve that?
From what I understand, it creates a new table with the same structure as the existing table, runs your migrations on it, adds triggers to bring live changes in from the original table while copying (in batches) the records in the old table. Once everything is copied and updated with triggers, it swaps out the old table for the new one in an atomic operation.
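Usage is roughly like this (a sketch based on the gem's README; table and column names are made up):

    require "lhm"

    # Creates a copy of the table, applies the change there, keeps the copy in
    # sync with triggers while backfilling in chunks, then swaps it in.
    Lhm.change_table :visits do |m|
      m.add_column :referrer, "VARCHAR(255)"
      m.add_index  [:referrer]
    end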
Coincidentally, I was dreading having to add an index on a multigigabyte table this weekend, because I didn't know whether or not it would be done in time for Monday morning business. You just helped put my mind at ease, so I can have a stress-free weekend. Thanks for the tip!
Indeed. I used it successfully with large tables (100GB+ of data) under heavy transactional load. It takes time and requires extra disk space, since it creates a second copy of the table, so you need to plan accordingly. It also works great with Galera as long as you are not touching the primary key.
There will be a point in time where the migration has been done and the trigger hasn't been applied.
Even sneakier: if the trigger is applied in code and not in the database, there will be a period of time where some of the code is running the trigger and some isn't.
The trigger needs to be in production and executing prior to the backfill running.
I'm also loath to rely on "triggers" in application model logic. I've seen too many platforms where the operations engineers interact directly with the database, resulting in required logic not executing.
Good idea with the fake-delete column. I've always seen non-rollback-able column-removal migrations as something we just have to live with; this totally solves the problem.
As for when is the good time to actually remove the column, I think this can merge pretty well with my current thinking about migrations.
A new developer joined my company (we use Rails as well). When setting things up, he tried to run `rake db:migrate` (my fault: I mentioned it in my doc instead of `rake db:schema:load`; I wrote the doc as I was setting up the very first lines of the project). There were several years of migrations, some of them no longer working since they used code that no longer exists (like `#add_hstore_index` from activerecord-postgres-hstore, which has been replaced by native Postgres support).
This made me think that there is no need to keep migrations in the repository if we can't use them anymore, and thus, from time to time, migrations should be flushed. That flush is the perfect time to actually delete columns (well, it should be done just before). That way, we ensure our migrations can always be rolled back without breaking anything.
Edit: refining that, we probably don't want to flush all migrations, only the older ones. So how do we insert the migration that actually removes the data? We can't just put it between the old migrations and the ones we'll keep, since it would be considered already migrated. We can't remove it directly in a SQL console either, or we'd have to do it on every machine, and that would cause desync among developers' machines.
I think the best way would be to append a new migration that isn't really reversible (but doesn't break on reverse either), using `table_exists?` and `column_exists?` to ensure it only runs on databases that still contain the deprecated columns/tables. Something like:
    class RemoveDeprecated < ActiveRecord::Migration
      def up
        remove_column :foos, :bar_deprecated if column_exists? :foos, :bar_deprecated
      end

      def down
      end
    end
This is great. I've been wondering how to do this, and this is exactly what I had come up with. I'm glad to have confirmation that it's the normal way people do zero-downtime migrations.
I might add one change. You say add a column migration, ship it, then add the feature that uses the column and ship again. I was planning on shipping the code at the same time (or even before) the migration, but having the feature disabled with a feature-flipper until the migration is done.
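Roughly like this, with a stand-in flag helper (flipper, rollout, or a plain config value would all do); the code ships disabled, the migration runs, and then the flag gets turned on:

    # FeatureFlag is hypothetical, standing in for whatever flag system you use.
    # The favorite_color column may not exist yet when this code first ships;
    # nothing touches it until the flag is enabled after the migration has run.
    if FeatureFlag.enabled?(:favorite_color)
      user.update_attributes(favorite_color: params[:favorite_color])
    end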
More complicated scenarios occur when migrating millions/billions of records to a completely new data structure.
1. Create new structure
2. Add triggers on the old tables that mirror inserts/updates into the new tables.
3. Create a batch job that migrates the records not already handled by the triggers in step 2 (see the sketch after this list).
4. Ship.
Once the data migration is complete:
1. Deploy new application code to all servers.
2. Remove old tables & triggers only after code is deployed and verified.
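A minimal sketch of the batch job from step 3, with hypothetical models for the old and new tables (the column mapping is made up too); rows the triggers already copied are simply skipped:

    # LegacyOrder and NewOrder are hypothetical models for the old and new tables.
    last_id = 0 # resume point if the job is restarted
    LegacyOrder.where("id > ?", last_id).find_in_batches(batch_size: 1_000) do |batch|
      batch.each do |o|
        begin
          NewOrder.create!(id: o.id, total_cents: (o.total * 100).round,
                           placed_at: o.created_at)
        rescue ActiveRecord::RecordNotUnique
          # Row was already copied by the triggers from step 2; skip it.
        end
      end
      sleep 0.1 # throttle so the backfill doesn't starve live traffic
    end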
Make sure that you don't have any code that does "select *" elsewhere in your system. Otherwise, the result sets' shape changes when you start adding columns. (And you can get into naming collisions where suddenly a select is ambiguous where it wasn't before, when both a and b have a column named ID, for example.)
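A tiny illustration of that second point, using made-up tables: the query below is fine while only one side of the join has a status column, and starts raising an "ambiguous column" error the moment the other table gains one.

    # Hypothetical tables: today only orders has a "status" column. After a
    # migration adds "status" to invoices as well, the unqualified reference
    # in the WHERE clause becomes ambiguous and the query errors out.
    ActiveRecord::Base.connection.select_all(<<-SQL)
      SELECT * FROM orders
      JOIN invoices ON invoices.order_id = orders.id
      WHERE status = 'paid'
    SQL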
I'd suggest adding a step in there for making backups with CREATE TABLE ... AS SELECT. I've used it a lot, and it's another one of those steps that makes rollbacks, etc., a lot easier.
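For example, something like this right before the risky step (table names made up); note it copies the data only, so indexes, constraints, and triggers don't come along:

    # Snapshot the table before a destructive change; drop the copy once the
    # migration has been verified.
    ActiveRecord::Base.connection.execute(
      "CREATE TABLE visits_backup_20150101 AS SELECT * FROM visits"
    )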
If I'm understanding correctly, you take all frontends out of HAProxy's config, so that HAProxy basically queues requests in-memory until you put a frontend back into circulation? That seems like a much simpler solution for when you have migrations that do not take very long (and enough memory for HAProxy to spool those requests).
Would you be willing to add a minimal example HAProxy config to your blog post that demonstrates your setup? I've been meaning to try out HAProxy but the documentation is so dense I've had trouble getting started. Your use case sounds like a really good reason to give it another go.
Issue an ALTER TABLE ADD COLUMN on the production database. Try not to make it NOT NULL.
> Case #2: I want to change a column type
Add another column (see Case #1) of the same name, with a suffix like "_2" or "_NUM" so you can tell them apart.
> Case #3: I want to rename a column
Don't risk it. Keep using the original column name. Very few of the column names reflect the reality of what they contain anyway. It's little bits of insider knowledge like this ("Product name is actually stored in ITEMNAME_LONG; ITEMNAME is only a CHAR(15), so we keep the UPC in there") that contribute to our fast-paced, agile work environment.