Deployinator: Etsy's deploy tool open sourced (slideshare.net)
128 points by cloudman on July 31, 2011 | 26 comments



The slides without the talk are mostly useless; you're better off reading their Deployinator blog post:

http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment


Does anyone know if this talk was recorded? It's the sort of deck that leaves you wondering what the presenter was saying.

Edit— Found this from NYLUG in April: http://www.archive.org/details/NYLUG-2011-04-20-Etsy-Deployi... (Talk starts at around 4:20, runs to 1:21:06.)

Haven't watched it through yet, but it looks like mostly the same slides.


Continuous deployment sounds right to me in principle, but I have yet to see a system that handles rollback in the face of database changes (one slide mentioned this briefly). When you hit a bug (say you delete the wrong data), how do you roll that back?

The only solution I can see is one where databases are handled differently (they generally are anyway at the deployment stage), and IMO, this is the challenging issue. Continuously deploying app servers is not trivial, but I would consider it a mostly solved problem.


Soft-delete (+ action logs) for data recovery. Also, code review and lots and lots of tests.

For schema changes, there are a bunch of approaches. Just a few:

1. Treat (non-purely-additive) schema changes as a special case and take extra care deploying them. (This sounds like a cop-out, but it might be the right effort tradeoff for many startups.)

2. Write forward and reverse migrations for each change, and avoid data-losing schema changes.

3. Implement all (or all non-purely-additive) schema changes by shadowing data operations onto a new version of the schema (asynchronously migrate all previous data onto the new schema in some back-end process). Write tests to verify the migrated/shadowed data is correct. Switch data reads to the new schema. Eventually delete the old schema in a subsequent deploy, after an appropriate stabilization/verification period.

I find #3 the most general, but there's plenty of overhead (computational, storage, and man-hours) that may not be appropriate for your case.
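
To make #3 concrete, here's a rough sketch of just the shadowing step (made-up users/users_v2 tables, sqlite3-style placeholders; the async backfill, verification tests, and read switch are elided):

    import logging

    log = logging.getLogger(__name__)

    def save_user(db, user_id, full_name):
        # Primary write still goes to the old schema (the source of truth).
        db.execute("UPDATE users SET name = ? WHERE id = ?",
                   (full_name, user_id))
        # Shadow the same write onto the new schema. Failures are logged,
        # not raised, so a problem with the new table can't break the
        # live write path.
        try:
            first, _, last = full_name.partition(" ")
            db.execute(
                "REPLACE INTO users_v2 (id, first_name, last_name) "
                "VALUES (?, ?, ?)",
                (user_id, first, last),
            )
        except Exception:
            log.exception("shadow write to users_v2 failed for user %s",
                          user_id)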


A number of techniques help here. These are useful whether or not you're using continuous deployment.

A simple approach is not to delete data, but rather to set a flag which marks the data as deleted. The application storage layer acts as if this data were deleted. A separate process (changed separately) is responsible for actually purging data which has remained deleted for a certain period of time.
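
Illustratively (table and column names are made up), the storage layer and the purge process might look like:

    import time

    RETENTION_SECONDS = 30 * 24 * 3600  # purge rows deleted > ~30 days ago

    def delete_item(db, item_id):
        # "Delete" just stamps the row; nothing is removed here.
        db.execute("UPDATE items SET deleted_at = ? WHERE id = ?",
                   (time.time(), item_id))

    def list_items(db):
        # The storage layer acts as if flagged rows were gone.
        return db.execute(
            "SELECT id, title FROM items WHERE deleted_at IS NULL").fetchall()

    def purge_deleted(db):
        # Run as a separate, separately-changed process (e.g. from cron).
        cutoff = time.time() - RETENTION_SECONDS
        db.execute("DELETE FROM items WHERE deleted_at IS NOT NULL "
                   "AND deleted_at < ?", (cutoff,))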

In any case, do you really wish to delete the data? Keep it around -- it might be useful. The only reason to delete data, outside legal concerns, is to control the cost to retain it. Given that you mostly never really want to delete data, you probably want to orient your schema toward hiding it instead. In that case, if you need to roll back application changes, you can do so by backfilling the data to restore it to a visible status.

Keeping detailed programmatic logs is also helpful for making the backfill easy to execute. For example, in addition to your main textual log file, you keep something like a log consisting of JSON objects, one per line, each of which represents the essential details of some application action.
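
For instance (field names invented), each application action could append one JSON object per line:

    import json
    import time

    def log_action(logfile, action, **details):
        record = {"ts": time.time(), "action": action}
        record.update(details)
        logfile.write(json.dumps(record) + "\n")

    # e.g. log_action(f, "listing.delete", listing_id=123, user_id=42)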


You are correct: at Etsy we deploy schema changes once per week. The code always works against both versions of the schema. We never take downtime for schema changes.

We avoid data loss by doing soft deletes as much as we can. Sometimes we do want to do real deletes, especially for data that is heavily denormalized, but in those cases we keep an audit trail so that the data can be reconstituted in the event of a mistake.
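
As a rough illustration of "works against both versions of the schema" (SQLite-flavored introspection, invented column names; the slides don't show Etsy's actual approach), the code can detect which version it's running against and adapt:

    def get_display_name(db, user_id):
        # Detect which schema version this database is on.
        cols = {row[1] for row in db.execute("PRAGMA table_info(users)")}
        if {"first_name", "last_name"} <= cols:
            # New schema already applied.
            row = db.execute(
                "SELECT first_name || ' ' || last_name FROM users "
                "WHERE id = ?", (user_id,)).fetchone()
        else:
            # Old schema still in place.
            row = db.execute(
                "SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        return row[0] if row else None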


The problem is pretty universally solved:

1. Use some sort of schema migration tool to keep track of and version all changes you've made to the database since your very first commit. (A bare-bones sketch of such a tool follows this list.)

2. Never do a backwards-incompatible schema change. Never. Time and time again these are the ones that will completely bite you in the ass when the day comes that you broke production with a deployment and desperately need the old version back up and running. You're going to end up losing data and shit will hit the fan. It doesn't take much code-wise to support the old schema, and it will ensure that you have zero downtime while rolling out your updates.

3. Never delete data. The only exception is if some user-submitted content violates a law, like child pornography; in which case, it shouldn't be part of your migration/update process to begin with. Just use deleted flags / boolean fields. Note in your code when certain fields are no longer used because they're only there to support legacy versions.
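
For a sense of what #1 involves, a bare-bones migration runner might look like this (SQLite here; real tools like DbDeploy do much more):

    import sqlite3

    MIGRATIONS = [
        (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
        (2, "ALTER TABLE users ADD COLUMN deleted INTEGER NOT NULL DEFAULT 0"),
    ]

    def migrate(db):
        # Track which migrations have already been applied.
        db.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
        current = db.execute(
            "SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
        for version, sql in MIGRATIONS:
            if version > current:
                db.execute(sql)
                db.execute("INSERT INTO schema_version VALUES (?)", (version,))
        db.commit()

    migrate(sqlite3.connect("app.db"))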


I am pretty much a beginner at database handling; what would you suggest for #1?

Concerning #2: how do you handle schema fixes? (Example: at my previous job, the tables were very badly designed, and the application+schema combination was prone to frequent race conditions.) What do you do in that case?

As for #3, I had already come to this conclusion on my own, so I was not completely on the wrong track :)


I've used DbDeploy and Mybatis Migrations with good success; they're written in Java, but they've also been handy in PHP projects.


I totally get all these points; is there any literature I can point our devs towards to convince them that (a) #2 + #3 don't leave any corner cases and (b) they really should take the time to do it (ultimately so I can put an automation solution around the DB)?


I was at the talk, and both of these were addressed. For rollback, the answer was that they don't support it; they only roll forward.

Databases are handled separately from this tool. IIRC, he said Thursdays are database day and all changes are done/initiated then.


While it doesn't handle every case, using a database migration framework with Up/Down style methods as part of your deploy process can work pretty effectively.
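
To illustrate the Up/Down idea (invented names; assumes a database whose ALTER TABLE supports DROP COLUMN, e.g. MySQL or Postgres):

    class AddDeletedFlag:
        """One migration in the framework's ordered list."""

        def up(self, db):
            # Applied on deploy.
            db.execute("ALTER TABLE items "
                       "ADD COLUMN deleted BOOLEAN NOT NULL DEFAULT FALSE")

        def down(self, db):
            # Applied if the deploy has to be rolled back.
            db.execute("ALTER TABLE items DROP COLUMN deleted")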

Another option is to use a VM for your database and do a snapshot before deploying.


Well, yeah, but if you perform financial transactions, you can't just lose 5 minutes of history.

Also, for a DB, an LVM snapshot is probably just as good as a full VM snapshot.


Chances are good that the important tables that handle transactions don't change as often as other things in the system. This is why I support using things like CouchDB to handle the non-transactional stuff.


Based on a very brief conversation I had with a Paypal engineer a few years back, it's SOP in modern financial transaction handling to store/log details of every transaction in many places and various forms, and there's plenty of (probably excessive) paranoia involved.

The context of the discussion was performance, scalability, and engineering resources, so we didn't get into how the logs are used, but my guess is that if the "primary" database had some sort of failure that led to a rollback, affected accounts would be locked out until the transactions could be properly recovered from the other storage/logging mechanisms.


Direct link to the project: http://github.com/etsy/deployinator


I'm curious how HN feels about doing feature management all in trunk with if/then blocks (Flickr, and I assume Etsy), versus a more branch-based workflow. With config flags you can be very particular about which features or bugfixes are enabled/disabled at any time, but it seems like what you save in merge time you pay for in dead code hanging around until someone gets rid of it.

Even in svn, and much better in git, if you keep a sane release and merging strategy, it's not going to introduce much overhead to getting new features or bugfixes out. Our 20+ person team deploys a few times a day using a branch system of (sprint + trunk/QA + release), but we're always looking for different ways to do things.


They're not mutually exclusive: you want basic correctness/design review of course (my favored approach is now pre-integration code review + authoritative master/master-always-deployable), but a proper feature-flagging system will let you beta or A/B test potentially incorrect (in a code-correctness, performance, or UX sense) changes gradually in the real world. Binary (on/off for everybody) flags are more the exception than the rule.

Server affinity + staged rollouts are another approach, but for long timeframes (e.g. A/B UX tests, feature prototypes) I'd think you'd want to go with flagging anyway from a codebase management point of view.
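
One common shape for those non-binary flags (a sketch, not any particular library's API): hash the user into a stable bucket and compare against a rollout percentage:

    import hashlib

    FLAG_ROLLOUT = {"new_search": 10}  # percent of users who get the feature

    def flag_on(name, user_id):
        pct = FLAG_ROLLOUT.get(name, 0)
        # Stable per-user bucket: the same user always lands in the
        # same bucket for a given flag.
        digest = hashlib.sha1(f"{name}:{user_id}".encode()).hexdigest()
        return int(digest[:8], 16) % 100 < pct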

Some things do take more effort to make dynamically flaggable than delaying integration would cost, to be fair. Judgement call.


Config flags and how we commit code are two completely separate issues. Very simply, you need config flags to operate a site that degrades gracefully when something is going wrong. They cannot be avoided. Is there a query in the forums suddenly hitting the database too hard? Turn off the forums while we fix it so the rest of the site isn't affected. Etc.

When a feature is in development, its config flag is off or enabled for admins. We deploy it as we develop it, in pieces that are generally not more than a few dozen lines of code long. This is another critical idea: we don't push out all of the code for an entire sizable feature all at once, ever. Doing that is a recipe for having a problem and having to review thousands of lines of code to try to figure out what it is.
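
As a rough illustration of those flag states (the slides don't show Etsy's actual config system; names here are made up):

    FLAGS = {"forums": "on", "new_checkout": "admin"}  # from deployed config

    def feature_enabled(name, is_admin=False):
        state = FLAGS.get(name, "off")  # "off" doubles as the kill switch
        if state == "on":
            return True
        if state == "admin":
            return is_admin  # visible only to admins while in development
        return False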

For new features, we generally do not remove the config flag once a feature is live because we want to be able to disable it if anything goes wrong.

If we are replacing a feature with another one, we generally want to keep the old one around for a little while (we generally do A/B testing or ramp up new versions whenever we do this). After that, we do delete the old code and reduce the config flag to an on/off switch which was probably there in the first place.


Thanks for the replies. I didn't mean to frame it in a mutually exclusive way; I'm up for ini or yaml configs any day. Am looking at this from a rel-eng perspective.


For those who are interested in continuous deployment, there is a recent talk from Flickr: http://sna-projects.com/blog/2011/06/continuous-deployment-f...


I'm very interested in learning more about deployment; does anyone know where to look? Say, for a small webapp with 1000 daily users.


If you are specifically interested in this kind of continuous deploy process, then "Continuous Delivery" [1] is a decent book on the subject. The IMVU post "Doing The Impossible 50 Times A Day" [2] is another good place to start reading.

[1] http://www.amazon.com/gp/product/0321601912

[2] http://timothyfitz.wordpress.com/2009/02/10/continuous-deplo...


Thanks a lot.


Anyone have these slides in a non-Flash format? I try to advance the slide and it just corrupts the display. Using Flash to swap out static images. Dumbfounding.




