I've always wondered whether there is some way of doing database deployment that doesn't suck. All I see is that every professional team ends up with a bunch of 074-add-column-to-table.sql files. I mean, code deployment can be organized much better: you can do graceful restarts, transactional updates, etc. Almost nobody backs up the old code version in case deployment of the new one fails, but database upgrades are so fragile that making backups is a must, not only because the upgrade process may be interrupted and leave the database in an inconsistent state, but because even a properly executed upgrade may result in a corrupted state.
There is a fundamental difference between code deployment and DB upgrades: DB upgrades must preserve (and upgrade) the existing state. Code deployment only needs to ensure that the new code is in place. It can do so by deploying to a new location and then redirecting the entry points, or by laying the new code over the old. Either way, the ways in which it can go wrong are few (at least compared to DB upgrades). DB upgrades, on the other hand, must take existing state (which can measure GBs and TBs) and transform it into a new state. If you're really unlucky, this may involve a size-of-data transformation (e.g. updating every row in a big table). I've seen 'upgrades' that lasted weeks. Having witnessed DB upgrades at MS Azure scale (see [0]), and having coded several SQL DB upgrade steps myself (internal system upgrades, the kind done when you install a new DB engine, not user upgrades), I can say with confidence that DB upgrades are hard.
What we all end up with are either DB migration steps (à la Rails migrations), of which I approve (see [1]), or schema comparison tools (à la vsdbcmd.exe), of which I'm very skeptical after being burned several times on large DBs (the results are only as good as the tool, and some tools are better than others; I still find explicit migrations much better).
As a side note, my startup DBHistory.com records all DB schema changes, like a DB flight recorder of sorts, and one of my future goals is to add the capability to generate compensating actions for any DB schema change and thus be able to revert the DB schema to any previous state (by simply rolling back every change, in reverse order). But I must admit I'm quite far from having such a thing working, and that's without even considering migrations that modify data rather than schema.
Occasionally I wonder if we are making a fundamental mistake when we replace the old code with the new. Perhaps data versions should be first-class concepts in our domains.
Perhaps I can add some context on top of the post. All updates are in transactions unless explicitly opted out. Also remember our case is a little special: we have hundreds of databases with the same schema, so what if #328 fails? The failure/rollback scenario is a bit more complicated once more than a single database is involved.
As for backups: absolutely. We handle this independently, though. We do full backups every night as well as T-logs every 15 minutes. If I had to restore every database we have to a very specific point in time, or to just before a migration command was run, we have T-logs going back 4 days at all times to do exactly that.
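To make that concrete, a point-in-time restore from a full backup plus T-logs looks roughly like this in T-SQL (the database name, file paths, and timestamp here are invented for illustration, not our actual tooling):

  -- Restore the most recent full backup, but leave the database ready for log restores
  RESTORE DATABASE SiteDB_328
    FROM DISK = N'D:\Backups\SiteDB_328_full.bak'
    WITH NORECOVERY, REPLACE;

  -- Replay transaction log backups up to just before the migration command ran
  RESTORE LOG SiteDB_328
    FROM DISK = N'D:\Backups\SiteDB_328_log_0915.trn'
    WITH STOPAT = N'2015-06-12T09:14:00', RECOVERY;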
I'm sure there are good solutions for single database applications way more fully featured than our approach, they just do little to solve any problems we actually run into.
If you used different tables instead, that would be even more complicated, because every query would have to be dynamic. It also makes backups, etc. far more complicated. Multiple databases is simply the simplest solution for multiple things that need a database with the same schema :)
There definitely is some tooling available to help - Redgate, many ORMs - but I agree that it's somewhat lacking.
I think the bigger problem is cultural - many programmers either don't really understand databases/data modelling or they don't care about it. After all, you don't really have to worry about it when you're just starting out - almost any schema will work. That is, right up until you have to modify it. By the time it becomes an issue, the culture has crystallised and changing the database is too risky.
For some reason, a lot of companies are largely unwilling to spend money on good database management/migration tools - even if they're paying a stack of cash for SQL Server.
> For some reason, a lot of companies are largely unwilling to spend money on good database management/migration tools - even if they're paying a stack of cash for SQL Server.
If you use SQL Server, you can do database migrations using DACPAC files. DACPAC migrations are idempotent, so you can deploy to an existing database; it will add missing columns without deleting data etc.
Personally I like manual migration scripts better. For my last projects I have had success embedding these migration scripts in the source code of the application itself: it integrates with source control, is very nice to use in test and development, and avoids many of the possible mistakes that come with separate code and database deployments.
There are many open source tools for this in .NET: I've used FluentMigrator, SqlFu and Insight Database Schema, and they all worked well.
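To sketch the idea (the table and column names are made up, and the schema_versions bookkeeping table is an assumption; each of these tools has its own way of tracking applied migrations): a migration is just a small, versioned SQL script that ships with the application and is applied exactly once, in order.

  -- 007_add_last_login_to_users.sql: shipped alongside the application code
  ALTER TABLE users ADD last_login datetime NULL;

  -- Record that this migration ran, so the runner skips it next time
  INSERT INTO schema_versions (version, applied_at) VALUES (7, GETDATE());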
I use sqitch, which is kind of like a specialized git for database schemas, and it has been pretty awesome so far. It includes dependency management, selective deployment of updates to the schema, testing before deployment with rollbacks on failure, etc. Work your way through the tutorial and I think you'll be pleasantly surprised at how many problems it solves.
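For anyone who hasn't looked at it: a sqitch change is essentially a set of plain SQL scripts (deploy, revert, verify), roughly along these lines, with invented object names:

  -- deploy/add_widgets.sql
  BEGIN;
  CREATE TABLE widgets (
      id   serial PRIMARY KEY,
      name text NOT NULL
  );
  COMMIT;

  -- revert/add_widgets.sql
  BEGIN;
  DROP TABLE widgets;
  COMMIT;

Sqitch tracks which changes are deployed and in what order, so getting back to an earlier state is just running the revert scripts back down the chain.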
A lot of our apps use Flyway and it works pretty well, though it tries to do everything in a transaction. Normally that's desirable -- we don't want a migration to partially succeed -- but we would often use CREATE INDEX CONCURRENTLY in Postgres databases, which can't run in a transaction. In those cases we would need to manually run the migrations and update Flyway's schema_version table, which is annoyingly complex (11 columns, some with no obvious purpose).
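For anyone who hasn't hit this, the offending statement is simply something like the following (index and table names invented); Postgres rejects it inside a transaction block, which is why a transactional migration tool can't run it as-is:

  -- Builds the index without taking a long write lock, but cannot run
  -- inside BEGIN/COMMIT, hence the manual workaround described above.
  CREATE INDEX CONCURRENTLY idx_posts_created_at ON posts (created_at);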
The principle is brilliant. The SSDT project is a file tree of the complete schema of your system, including security objects. There is also a simple macro language built in for supporting conditional compilation of your SQL. And because it's part of Visual Studio, you get source control and msbuild integration.
This thing does static analysis of your SQL, diffs your SQL schema against running databases, and creates the delta scripts to publish. Works very nicely for continuous deployment to testing servers.
Any change that risks data-loss will be blocked, so for those you have to execute a change manually and then publish your SSDT package. You can generate the script directly against the target database, or generate a "DACPAC" package and use a command-line tool to publish the compiled DACPAC against a target database.
The problem is that the SQL server side of Microsoft is the polar opposite of the Satya OSS side of Microsoft developers. Lots of tedious designers and slow GUIs.
Also, it has no story for deploying configuration data or initialization data. At all. We've rolled our own tooling for that.
Also, performance-wise it's a goddamned dumpster fire, and it's buggy as hell.
So yes, MS got the concept right... but the implementation leaves a lot to be desired.
I concur. When I saw SSDT I thought it looked great; I can't recall the specifics now, but it seemed not ready for primetime.
> the SQL server side of Microsoft is the polar opposite of the Satya OSS side of Microsoft developers. Lots of tedious designers and slow GUIs.
Someone needs to come in and send some people packing and tell the rest to get with the program. For how central MSSQL seems to be to MS strategy going forward, there sure are a lot of things that still just suck.
I know. The command-line workflow for publishing SSRS reports is awful, the SSIS packages use that crazy graphical programming language and are hyper-brittle, and Management Studio still calls all its tabs sqlquery## until you save them and uses a moronic non-standard regexp for find. So many obvious bone-headed things that an open development process would have taken around the back and shot. It mars an otherwise great and featureful platform.
I wonder about this too. It seems like some ideas from category theory and pure functional programming could really help here to provide an abstraction over the top of the (pseudo)relational model.
While doing research on an unrelated topic, I stumbled on some potentially related work [1, 2, 3] by researchers at MIT that could be relevant to database deployment/migration. I haven't had a chance to dig into these references to see whether there is real relevance or promise there, though it looks like there is some kind of commercialization effort [4].
1. Add new database structure (new columns, new tables, whatever) but leave all the old structure in place
2. Update all servers with code that writes in the new format but understands how to read both the new and old structures
3. Migrate the data that only exists in the old structure
4. Get rid of the old stuff from the database
5. Get rid of the code that is responsible for reading the old format
Conceptually it's straightforward, but it can take a long time in calendar days depending on your deployment schedule, it can be tough to keep track of what's been migrated, and the data migration will cause performance issues if you don't plan it properly (e.g. trying to do a migration that locks an important table). You just have to do it in a way where each individual change is backward compatible, and you don't move on to the next change until the previous one is rolled out everywhere.
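As a minimal sketch of steps 1, 3 and 4 (Postgres-flavored SQL, with an invented example of splitting users.fullname into first_name/last_name):

  -- Step 1: add the new structure alongside the old
  ALTER TABLE users ADD COLUMN first_name text;
  ALTER TABLE users ADD COLUMN last_name  text;

  -- Step 3: backfill rows that only have data in the old column
  -- (in practice you'd do this in batches to avoid locking a big table)
  UPDATE users
  SET first_name = split_part(fullname, ' ', 1),
      last_name  = split_part(fullname, ' ', 2)
  WHERE first_name IS NULL;

  -- Step 4: only after every server reads and writes the new columns
  ALTER TABLE users DROP COLUMN fullname;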
I've put a bunch of effort into database change management over the years and work on a tool called AliaSQL, a simple command-line database migrator for SQL Server.
Using this tool has led to a dramatic increase in productivity on our team since we really don't have to worry about database changes anymore. I won't waste space here with the details but these links will fill you in if you have any interest.
After using SQL for most of my career, I'm now working on a product using MongoDB. Removing the need for schema migrations has been such a boon for productivity. You essentially push the work of migrating to the app code, where it can be done incrementally, and with a far better programming language than SQL. It's been well worth the trade-offs, in my opinion.
It's called NoSQL, which removes the need for schema migrations for things like adding or deleting columns.
This could be solved for relational databases if you implemented application-level abstractions that allowed you to store all your data using JSON storage, but create non-JSON views in order to query it in your application using traditional ORMs, etc.
So, store all data using these tables, which never have to be changed:
- data_type
- data (int type_id, int id, json data)
- foreign_key_type (...)
- foreign_keys (int type_id, int subject_id, int object_id)
(we'll ignore many-to-many for the moment)
And then at deploy time, gather the list of developer-facing tables and their columns from the developer-defined ORM subclasses, make a request to the application-level schema/view management abstraction to update the views to the latest version of the "schema", along the lines of https://github.com/mwhite/JSONAlchemy.
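A rough sketch of what one generated view could look like (Postgres-flavored, with invented names; the actual JSONAlchemy proof of concept generates these from the ORM models rather than by hand):

  -- The generic storage table from above: data (type_id int, id int, data json).
  -- A developer-facing "table" is just a view over the rows of one type.
  CREATE VIEW posts AS
  SELECT
      id,
      data.data->>'title'          AS title,
      (data.data->>'score')::int   AS score
  FROM data
  WHERE type_id = 1;   -- the data_type row for 'posts'

Adding or removing a "column" then means regenerating the view at deploy time instead of running an ALTER TABLE.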
With the foreign key table, performance would suffer, but probably not enough to matter for most use cases.
For non-trivial migrations where you have to actually move data around, I can't see why these should ever be done at deploy time. You should write your application to be able to work with both the old and new versions of the schema, and have the application do the migration on demand as each piece of data is accessed. If you need to run the migration sooner, then run it all at once using a management tool that's not tied to deploys, with the migration for each row in its own transaction, eliminating downtime for migrating large tables.
I don't have that much experience with serious production database usage, so tell me if there's something I'm missing, but I honestly think this could be really useful.
> With the foreign key table, performance would suffer, but probably not enough to matter for most use cases.
Citation needed :) That's going to really depend.
I'm not for or against NoSQL (or any platform). Use what's best for you and your app!
In our case, NoSQL makes for a bad database approach. We do many cross-sectional queries that cover many tables (or documents in that world). For example, a Post document doesn't make a ton of sense, we're looking at questions, answers, comments, users, and other bits across many questions all the time. The same is true of users, showing their activity for things would be very, very complicated. In our case, we're simply very relational, so an RDBMS fits the bill best.
Sorry for being unclear. I'm not proposing NoSQL. I'm saying that many NoSQL users really mainly want NoDDL, which can be implemented on top of Postgres JSON storage while retaining SQL.
- data (string type, int id, json fields)
- fk (string type, int subj_id, int obj_id)
select
  data.id,
  data.fields,
  fk_1.obj_id as foo_id,
  fk_2.obj_id as bar_id
from data
join fk as fk_1 on data.id = fk_1.subj_id
join fk as fk_2 on data.id = fk_2.subj_id
where
  data.type = 'my_table'
  and fk_1.type = 'foo'
  and fk_2.type = 'bar'
What would the performance characteristics of that be, versus storing the "foreign keys" in the same table as the data, assuming fk has optimal indexes?
If your database doesn't enforce the schema you still have a schema, it's just ad-hoc and spread across all your different processes, and no one quite agrees what it is. In the real world as requirements change and your app/service increases in complexity this becomes a constant source of real bugs while simultaneously leading to garbage data. This is not theoretical, we have a lot of direct painful experience with this. Best case scenario your tests and tooling basically replicate a SQL database trying to enforce the schema you used NoSQL to avoid in the first place.
Indexes are fast but they aren't magic. A lot of what a traditional SQL database does is providing a query optimizer and indexes so you can find the data you need really fast. Cramming everything into a few tables means everything has to live in the same index namespace. Yes you can use views and sometimes even indexed views, but then you have a schema so why jump through hoops to use non-optimized storage when the database has actual optimized storage?
Separate database tables can be put on separate storage stacks. A single table can even be partitioned onto separate storage stacks by certain column values. Cramming everything into four tables makes that a lot more complicated. It can also introduce contention (depending on locking strategies) where there wouldn't normally be any.
IMHO most systems would be better served by sharding databases than by using NoSQL and pretending they don't have a schema. If application design prevents sharding then scaling single-master, multiple-read covers a huge number of cases as well. The multiple-master scenario NoSQL systems are supposed to enable is a rare situation and by the time you need that level of scale you'll have thrown out your entire codebase and rewritten it twice anyway.
The key to schema migrations is just to add columns and tables if needed, don't bother actually migrating. Almost all database engines can add columns for "free" because they don't go mutate existing rows. Some can drop columns for "free" too by marking the field as obsolete and only bothering to remove it if the rows are touched.
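For example, in Postgres (and most other engines) adding a nullable column with no default is a catalog-only change; something like this, with an invented table name, doesn't rewrite any existing rows:

  -- Existing rows are untouched; they simply read the new column as NULL
  -- until something writes to it.
  ALTER TABLE orders ADD COLUMN tracking_code text;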
Postgres (and at least one other RDBMS) has partial indexes, which pretty much solves the index namespace problem you mention: http://www.postgresql.org/docs/8.0/static/indexes-partial.ht... Partial indexes are integrated into the proof-of-concept repo I linked.
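Concretely, for the generic storage table sketched upthread, a partial expression index only covers one logical "table", so each view effectively gets its own index (names invented):

  -- Only indexes rows whose type is 'my_table', so other "tables" stored in
  -- the same physical table don't share or bloat this index.
  CREATE INDEX idx_my_table_created_at
      ON data ((fields->>'created_at'))
      WHERE type = 'my_table';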
Storing a data type field in the generic storage table enables the same partitioning ability as a standard schema.
99% of NoSQL database users just don't want to deal with migrations, even if they're "free" (another big issue is keeping application code state and DB migration state in sync across production, testing, and developer machines), so what they really need is NoDDL, YesSQL.
> Almost all database engines can add columns for "free" because they don't go mutate existing rows. Some can drop columns for "free" too by marking the field as obsolete and only bothering to remove it if the rows are touched.
Didn't know that, thanks.
> It can also introduce contention (depending on locking strategies) where there wouldn't normally be any.
Didn't think of that. I'm aiming this at the 99% of NoSQL users for whom doing things you could do with SQL requires much more effort, so letting them do it with SQL can justify a modest performance degradation. But if you have any good links on how this storage design would affect lock contention, please share.