> Continuously deploy. Every commit should be instantly deployed to production.
ah... no. Maybe this is only viable for a single developer, as a substitute for continuous integration. But with multiple developers checking in on a complex system, your site will be down. A lot.
If you're frequently having problems where new deployments are bringing down the site, then doing that more often isn't the answer. As un-fun as it may be, you should be looking at how you could have caught the problem before it went live, whether that entails more testing, more robust design or something else.
You're right. If everyone behaves the way they do with larger deploy cycles even after moving to an extremely short deploy cycle, then you would have lots of downtime.
The point wasn't that it solves downtime, but that it causes you to more readily understand sources of downtime. You then have to use that information to make your development and deploy process resilient to those classes of failures. Rinse. Repeat.
For example, at my company we would spin up a new engineer and within about a month they would take out a database (or worse) with a query that would be extremely slow. What did we do to fix the problem? We capped all queries to 20 seconds, after which we would fail the page request. In the best case, a developer pushes a page with a slow query in it, and the page fails enough to cause the revision to roll back. In the worst case, the feature the developer was working on would be broken but the rest of the website would continue functioning.
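For illustration only, a minimal sketch of that kind of cap might look like the following; the thread-pool wrapper, `run_capped_query`, and the pool size are hypothetical, not our actual implementation:

```python
# Hypothetical per-query time cap: if any query exceeds its budget, the page
# request fails instead of the slow query dragging down the whole site.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

QUERY_BUDGET_SECONDS = 20
_pool = ThreadPoolExecutor(max_workers=32)

class QueryTimeout(Exception):
    """Raised when a query blows its budget; the caller fails the page request."""

def run_capped_query(cursor, sql, params=()):
    future = _pool.submit(cursor.execute, sql, params)
    try:
        future.result(timeout=QUERY_BUDGET_SECONDS)
    except FutureTimeout:
        # Fail the request immediately; a real system would also kill the
        # database connection so the slow query stops running server-side.
        raise QueryTimeout("query exceeded %ds budget" % QUERY_BUDGET_SECONDS)
    return cursor.fetchall()
```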
And yes. yes. yes! You should be catching your problems before they go live. The list of techniques is staggering: unit testing, regression testing, functional testing, fuzz testing, exploratory testing... but those methodologies only go so far. Continuous Deploy goes further.
Agreed. It seems like the author either has no customers or very understanding customers. Pushing every commit to production is just nonsense by any metric. What about commits that muck with the data model, where bugs might destroy production data? What about complex developments that consist of more than one commit?
This almost reads like link-bait. I somehow doubt the author really believes what he's writing there.
Well, it's certainly not "no customers"... Timothy is referring to IMVU. Whether the customers are understanding or not is another issue. However, when we implemented this "cluster immune system", we had outages all of the time. But every single time, we took steps to prevent that class of failure from happening again, and now we deploy code to the cluster twenty times a day.
Well, I'm not against frequent deploys. I just found the article a bit light-hearted in tone - as if no testing at all was going on.
No matter how many classes of failures you have ironed out, in any reasonably complex system you will still regularly have regressions that are not caught by a quick "two eyes" check.
"What about complex developments that consist of more than one commit?"
Obviously since you use git or another DVCS, you can do your mucking about locally in a branch, and when it's ready to go, squash it down into one atomic commit.
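Something along these lines, for example; the branch name is a placeholder and squash-merging is just one of several ways to do it:

```
# Work happens on a throwaway local branch...
git checkout -b my-feature
# ...hack, commit, hack, commit...

# ...then lands on master as one atomic commit that can be deployed as a unit.
git checkout master
git merge --squash my-feature
git commit -m "Add my feature"
```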
Well, he didn't mention that in the article. Such a model generally requires a staging server, which implies integration testing, though. Unless he suggests the release manager (or even the developer du jour?) just merges the stuff on his local machine and pushes it out as he sees fit...
I wonder if you could do some automated A/B system here, and instead of every commit, just do it once a day: Every night, update one set of servers with the latest stable branch code. Route N% of your users to those servers. If faults start showing up, push N to 0 and notify developers. Otherwise, slowly move N towards 100 throughout the day. By the end of the day you're either fully deployed or rolled back.
(I already see faults with this...it's presented here as a thought experiment, not advice)
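In rough pseudocode, the ramp could look like this; `error_rate`, `route_percent_to_canary`, and `notify_developers` are imaginary hooks into your monitoring and load balancer, and every number is made up:

```python
# Toy ramp-up controller: shift a growing slice of traffic to the freshly
# updated servers, rolling back to 0% the moment faults show up.
import time

ERROR_THRESHOLD = 0.01           # abort if more than 1% of canary requests fail
STEP_PERCENT = 5                 # grow N in 5% increments
STEP_INTERVAL_SECONDS = 30 * 60  # wait half an hour between increments

def ramp_nightly_deploy(error_rate, route_percent_to_canary, notify_developers):
    n = STEP_PERCENT
    while n <= 100:
        route_percent_to_canary(n)
        time.sleep(STEP_INTERVAL_SECONDS)
        if error_rate() > ERROR_THRESHOLD:
            route_percent_to_canary(0)      # instant rollback
            notify_developers("canary failed at %d%% of traffic" % n)
            return False
        n += STEP_PERCENT
    return True                             # fully deployed by end of day
```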
What we implement is much closer to this suggestion. We move the code out to a couple of machines from each of our different web frontend servers. After a minute we compare before and after across numerous metrics (load average, CPU, errors, page failures, etc.). If the revision passes, we roll it out to 100% of the cluster and do the same monitoring for 5 minutes.
Originally we intended to put business metrics in these tests, but it turns out we rarely regress on them via code changes, and when we do it takes a human to figure out what went wrong. Instead we test business metrics (and lots of other stuff) via nagios, which gives us 1-5 minute sampling frequency, good enough for most of our issues.
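The first stage of the check is roughly the following; `fetch_metrics` and `deploy_to` stand in for our real tooling, and the 25% tolerance is illustrative, not our actual threshold:

```python
# Rough shape of the "compare before and after" gate on a few canary machines.
import time

CANARY_METRICS = ("load_average", "cpu_percent", "error_rate", "page_failure_rate")
MAX_REGRESSION = 1.25   # illustrative tolerance: fail if any metric worsens by >25%

def revision_passes_canary(hosts, fetch_metrics, deploy_to):
    before = {h: fetch_metrics(h) for h in hosts}
    deploy_to(hosts)                      # push the new revision to the canary machines
    time.sleep(60)                        # let them serve real traffic for a minute
    after = {h: fetch_metrics(h) for h in hosts}
    for host in hosts:
        for metric in CANARY_METRICS:
            baseline = before[host][metric] or 1e-9   # avoid divide-by-zero
            if after[host][metric] / baseline > MAX_REGRESSION:
                return False              # regression: roll this revision back
    return True                           # safe to roll out to 100% of the cluster
```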
I did not cover how you would iterate towards the ideal (instant deploy) and what concessions you might have to make. Our 6 minute deploy is actually quite inefficient, but it's not the bottleneck to our deploy system.
(If you're wondering what our bottleneck is, our automated tests take 9 to 12 minutes despite being spread across 40 machines... Selenium in Internet Explorer is slow.)
I ask because I swear A/B and multivariate tests have been on my mind a lot lately, and when I finished reading the article, the first thing I thought was: why not just deploy to 1% of the users and see if it works?
Then I thought about how hard it would be to manage multiple versions of the same software, especially the data, among different users. Certain features presented to the 1% might be incompatible with the other 99%. But that's a technical problem. Very hard to solve, but manageable. Then I imagined some kind of framework that would make communication between the different versions of the data floating around easier, with "how to transform" rules for moving data to and from versions 1.1 and 1.2.
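To make that last idea concrete, I'm imagining something as simple as a table of upgrade/downgrade functions between adjacent versions; the field names, version numbers, and single-hop `convert` below are entirely made up:

```python
# Toy version-translation layer: register transforms between adjacent data
# versions so code on 1.1 and code on 1.2 can read each other's records.
TRANSFORMS = {
    ("1.1", "1.2"): lambda rec: dict(rec, display_name=rec.get("name", "")),
    ("1.2", "1.1"): lambda rec: {k: v for k, v in rec.items() if k != "display_name"},
}

def convert(record, from_version, to_version):
    if from_version == to_version:
        return record
    # A real framework would chain hops (1.0 -> 1.1 -> 1.2); this toy handles one.
    return TRANSFORMS[(from_version, to_version)](record)

# e.g. a 1.2 server reading a record written by 1.1 code:
record = convert({"name": "alice"}, "1.1", "1.2")   # {'name': 'alice', 'display_name': 'alice'}
```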
Anyway, I am really curious, because it sounds like a good solution :)
As far as data transformation goes, you have two options:
1. Something akin to ActiveMigration in the RubyOnRails world. This lets you move back and forth between different versions of your data's schema.
2. Use a more open data scheme, such as the one Google App Engine uses, where adding or removing properties on an object is not nearly as disruptive as it is in SQL-based solutions (see the sketch below).
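To illustrate option 2 (from memory, so treat the details as approximate): with the classic App Engine Python datastore, adding a property is just a code change, because stored entities don't carry a fixed schema.

```python
# Classic google.appengine.ext.db style model. Deploying a new property needs
# no ALTER; entities written before this deploy simply have no value for
# `country` and return None until they are next saved.
from google.appengine.ext import db

class Player(db.Model):
    name = db.StringProperty()
    country = db.StringProperty()   # added in this deploy
```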
It turns out the system works fine with SQL-based alters. We do have to do real work to deploy expensive alters (apply them to standbys, fail over, repeat, or worse), but in general it's cheap to change schemas.
Unfortunately, it's very labor-intensive to roll back schema changes, so it's one of the few places where we put old-school process in place (a DBA reviews all schema changes prior to deployment).
ActiveMigration really solves a different problem. Our problem is that adding indexes or altering popular tables is impossible to do on a live, in-production database. To get those changes out we have to go through quite a bit of extra work. It's really a MySQL limitation, not a process problem.
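The "apply to standbys, fail over, repeat" dance looks roughly like this; `run_sql`, `promote_to_primary`, and the host arguments are placeholders for our actual tooling, not real code:

```python
# Rough outline of rolling an expensive ALTER through replicas so the live
# primary never runs it while serving traffic.
def rolling_alter(primary, standbys, alter_sql, run_sql, promote_to_primary):
    for standby in standbys:
        run_sql(standby, alter_sql)     # expensive ALTER runs off the live path
    promote_to_primary(standbys[0])     # fail over to an already-altered standby
    run_sql(primary, alter_sql)         # the old primary gets altered at leisure
    # repeat for any remaining un-altered hosts before returning them to rotation
```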
This is an excellent idea, please continue to explore it.
As my token contribution: Google App Engine allows storing and accessing several versions of the app (accessed through different subdomains). Perhaps one could use DNS to trick different users into seeing different app versions? Not quite what you wanted, but down the right path.