How release canaries can save your bacon (googleblog.com)
248 points by mooreds on April 3, 2017 | 53 comments



A quick rollback lifts the developer's contextual burden and makes it easier to take calculated risks. This works better in some environments than others, of course; you wouldn't want it when testing a medical support system, for example. LinkedIn goes a bit further and has continuous release monitoring: https://engineering.linkedin.com/blog/2015/11/monitoring-the...


I'm less enthused about 'rollbacks' being considered 'normal'. They signify something didn't go quite right with your unit/integration/QA process. IMO there should be at least a 'mini-postmortem' to understand why it was missed, even if it's in an intentional blind spot (i.e. you made an explicit decision that it wasn't worth the engineering resources to get the testing fidelity needed to catch the issue earlier). It's almost always better to catch issues earlier, even if you have super neat tooling that makes it easy to roll back.


You are a CTO sitting atop very expensive hardware and software. Would you start removing deployment and runtime safety guards (such as a consumer-facing staging environment) because you want to "discipline coders and devops"?


A post-mortem should never be about placing blame on individuals, it should be about identifying flaws in a system or a process.

There are places where post-mortems can turn into blame games, but in my experience such things are counter-productive to actually solving problems. Luckily, there are plenty of engineering organizations that do not make this mistake! :)


The easiest way to avoid that is to have a well-structured post-mortem process, and to post-mortem everything: successful and unsuccessful releases.


We need to go from postmortem to postpartum!


On the one hand, rollbacks need to be culturally normal. Finding out that an essential part of your process (essential because it keeps your mean-time-to-repair low) is undependable because the last time you ran it was half a year ago, and right now is precisely when you need it to be dependable, well, that royally sucks.

On the other hand, what you're talking about shouldn't be that hard to implement. Just hook your rollback system into your issue tracker to create a post-mortem issue whenever a rollback is necessary, and assign it to whoever initiated the deployment (or their manager). Easy.
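A minimal sketch of that hook, assuming a plain REST issue tracker (the URL, fields and endpoint below are all made up):

    import requests  # assumes the tracker exposes a simple REST API

    TRACKER_URL = "https://tracker.example.com/api/issues"  # hypothetical endpoint

    def on_rollback(service, version, initiator):
        """Called by the deploy tooling whenever a rollback is executed."""
        issue = {
            "title": f"Post-mortem: rollback of {service} {version}",
            "assignee": initiator,  # whoever initiated the deployment
            "labels": ["post-mortem", "rollback"],
            "description": "Auto-filed by the rollback hook. Fill in the "
                           "timeline, root cause, and why testing missed it.",
        }
        requests.post(TRACKER_URL, json=issue, timeout=10).raise_for_status()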

On the third hand, you might end up finding that a lot of your post-mortems look something like "we don't have reproducible builds and we made a managerial decision not to invest in that now". And now you've just created a ton of recurring paperwork for everyone with little benefit.


It is not hard to imagine that every failed request/action in your application costs the client real money (they cannot complete a sale, must record data locally and re-enter it later, etc.). If that cost can be passed back to the vendor (fines, compensation, lost sales), it is effectively an increase in operating costs for the duration of the outage. Can you really be so sure of your pre-release process that you don't need the ability to 'roll back' to the cheaper (earlier) state? I guess not.

While I agree with your points that it is better to catch errors earlier in the pipeline and that mini-postmortems are necessary, I personally think that rollbacks are inevitable and compare them to backups. Of course it is better not to restore from backups, and it is easy to rationalise good processes over a few metric tons of backup tapes that have never been read, but a single accidental drop of the production database may quickly pay for all the effort put into making sure backups work.


If it was an automated rollback, a system that automatically captured test data or a heatmap of the system at the moment the rollback was triggered would be awesome.

It is amusing, because we are basically saying it would be awesome to have a core dump when the crash happens. Which... used to be standard behavior but was essentially lost in most modern development environments. (Not to mention the tooling lagged in two directions: tools to help you analyze core dumps, and tools to produce usable core dumps in the first place.)


> They signify something didn't go quite right with your unit/integration/qa process

Staged rollouts are part of the QA strategy (whether this is something to aspire to is another question).


Given that it's impossible to simulate the real world in QA, it's probably the best you can do. There's just no way to have every configuration of every client in your test lab.


[Deleted because I'm being a dickhead, and can see that]


This is a straw-man argument. GP didn't argue against rollbacks. S/he argued against considering them to be normal.


Yes, a Canary lets you limit the damage if some bug sneaks past testing. We've done it for over 10 years, with staged rollouts and automated crash statistics and such.

The drawback is that prod needs to be tolerant of multiple versions. Which is usually a fine practice in itself, anyway!


That's not a drawback so much as it's a fact of life. In any large-scale (read: distributed) system trying to provide a high degree of availability, rolling upgrades are the only way code goes out and individual components need to deal with interacting with newer/older dependencies. You can constrain the matrix by only allowing current version plus one back running in production, or forcing deployment orders, and so on, but in modern systems (read: ones where you can't just say "We're taking everything down for 3 hours on Sunday to upgrade.") you can't escape non-atomic upgrades.


Another approach is to start up a full copy of the new system with all the new versions, then change the load balancers to direct all traffic away from the old system and to the new one. Then decommission the old system.

With dedicated hardware, you need twice as much hardware. With cloud, you only pay double for 10 minutes during the rollout, which usually is very cheap.
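In outline it goes something like this (every helper below is a stand-in for whatever your cloud or orchestration tooling actually provides, not any particular API):

    # Sketch of a blue/green rollout.
    def blue_green_deploy(new_version, old_stack,
                          provision, health_ok, switch_traffic, decommission):
        new_stack = provision(new_version)   # full copy on the new version
        if not health_ok(new_stack):         # smoke-test before it sees traffic
            decommission(new_stack)
            raise RuntimeError("new stack unhealthy; old stack left untouched")
        switch_traffic(new_stack)            # flip the load balancer in one step
        decommission(old_stack)              # you pay for both stacks only until here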


That approach only works if your system is stateless, a caveat which excludes virtually all large-scale systems.


If you are interested in canary deployments, check out Spinnaker by Netflix: http://www.spinnaker.io/ There's a good talk about it here with stories from Waze and Google: https://www.youtube.com/watch?v=05EZx3MBHSY


What are the best practices regarding rollbacks when the database is affected? I would think a large amount of overhead would be required.


At my job the DB is backwards compatible by one release. This means we roll back code, not data. It takes a little longer to do something like delete a column, but it's worth it.
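Concretely, dropping a column ends up spread over two releases, roughly like this (the table and column names are made up):

    # Hypothetical two-step column removal, so code can always be rolled
    # back one release without the schema getting ahead of it.
    MIGRATIONS = {
        # Release N: code stops reading/writing legacy_col; schema unchanged,
        # so rolling back to N-1 is still safe.
        "release_N": [],
        # Release N+1: actually drop the column. Rolling back to release N
        # is safe because N never touched it.
        "release_N_plus_1": ["ALTER TABLE orders DROP COLUMN legacy_col"],
    }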

Rainforest QA had a blog post on strategies a while ago.


Simply testing that you maintain backwards compatibility is a royal pain though.

One could imagine a test rig which created a db with test users in it, then ran v1 of the software, then v2, then v1 again, and checked v1 didn't crash on the now mutated datastructures.
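Roughly like this (every helper name below is hypothetical):

    # Sketch of the rig: run the old version against data mutated by the
    # new version and make sure nothing blows up.
    def check_backwards_compat(create_test_db, run_release, smoke_test):
        db = create_test_db(seed_users=100)   # fresh DB with test users
        run_release("v1", db)                 # v1 writes its data shapes
        run_release("v2", db)                 # v2 migrates/mutates structures
        run_release("v1", db)                 # simulate the rollback
        assert smoke_test("v1", db), "v1 broke on data mutated by v2"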

Most datastore mutations might only happen on certain user actions and inputs, so making sure you get full coverage would be rather tricky.


They touched on this in their SRE post last week: https://cloudplatform.googleblog.com/2017/03/reliable-releas...


From that link: At Google, our philosophy is that “rollbacks are normal.” When an error is found or reasonably suspected in a new release, the releasing team rolls back first and investigates the problem second.

I like that -- reminds me of aviation, where a go-around is normal. If your approach to landing isn't stabilized (you're too high, too low, too fast, too slow, etc.), don't try to save it. Go around and try again.


At our place, we rollback every few weeks just to test the system, even if nothing appears abnormal.

Next we plan to automate the process - 1 in 10 rollouts will actually be a rollout, a rollback, and another rollout, checking system health at each step.
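In pseudo-form it would be something like this (the rollout and health-check calls stand in for our real tooling):

    import random

    # Planned drill: one rollout in ten becomes rollout -> rollback -> rollout,
    # with a health gate after every step.
    def deploy_with_drill(version, previous, rollout, health_ok):
        steps = [version]
        if random.random() < 0.1:                 # 1 in 10 deployments is a drill
            steps = [version, previous, version]  # out, back, and out again
        for target in steps:
            rollout(target)
            if not health_ok():
                raise RuntimeError(f"health check failed after rolling to {target}")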


I just wanted to suggest this. Similar to 'chaos monkey' killing my processes once every few days, developers would also look differently at rollback procedures if I guaranteed that one in ten rollouts would be rolled back randomly.


Thanks. Basically they do a DB change in between rollouts, with no feature attached yet. Clever.


Interesting. I've heard of this practice as 'one box' or 'one pod'. And canary used to mean 'tests that run continuously against your production stack.'

I wonder which is more prevalent.


Probably depends on your workplace, but at Google canary has meant a subset of production running at a newer version at least since '07.


I think your definition comes from Amazon/AWS... haven't heard it much from outside Amazon.


I wonder how the botched Google Drive release issue from a few weeks ago worked under this scenario?


Anyone find a good way to do this with AWS Lambda / API Gateway?


Well, you could definitely front two versions of your application lambda function with a traffic-splitter lambda function that sends 99% of traffic to the production alias and 1% to the canary (or whatever split you wanted). See how to call one lambda from another: http://stackoverflow.com/questions/31714788/can-an-aws-lambd... and aliases: http://docs.aws.amazon.com/lambda/latest/dg/versioning-alias...

This post might also be of interest: https://blog.jayway.com/2016/09/07/continuous-deployment-aws...

Note: I have never done this; this is just how I'd approach it.
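For what it's worth, the splitter itself could be as small as this sketch (the alias names and the 1% split are made up, and I haven't run it):

    import json
    import random
    import boto3

    lam = boto3.client("lambda")

    PROD_ALIAS = "my-app:prod"      # hypothetical function:alias names
    CANARY_ALIAS = "my-app:canary"
    CANARY_FRACTION = 0.01          # send 1% of traffic to the canary

    def handler(event, context):
        """Front function: forwards each request to prod or canary."""
        target = CANARY_ALIAS if random.random() < CANARY_FRACTION else PROD_ALIAS
        resp = lam.invoke(
            FunctionName=target,
            InvocationType="RequestResponse",  # synchronous, so we can return the result
            Payload=json.dumps(event),
        )
        return json.loads(resp["Payload"].read())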


> One solution is to version your JavaScript files (first release in a /v1/ directory, second in a /v2/ etc.). Then the rollout simply consists of changing the resource links in your root pages to reference the new (or old) versions.

I wouldn't take this advice, as it's bad for caching. A change to one JavaScript file will then break the cache for everything.


Maybe I misunderstand, but don't you want to invalidate the cache when a new version comes along? Isn't the risk of version skew worse?


If one.js changes but two.js doesn't, then two.js should come from the cache. Only one.js should be fetched from network. Sticking all assets in an /assets/v2 folder invalidates everything.


If one.js and two.js are really separate components, they should each get a version. If they're closely coupled, they should be compiled together into one unit, to take advantage of deduplication, inlining, dead code elimination, fewer requests, better compression, etc etc.


Versioning assets separately by sticking them in new subfolders is just re-inventing ETags in a bad way. Just use ETags.

The point of the folder versioning scheme the article proposes is to make rollbacks easier. You can easily roll back a /assets/v2 folder to /assets/v1 by updating your server template, but if you have a dozen separate version folders (with different latest version numbers) for each resource, then it's no longer easy to roll those all back.
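e.g. something like this in whatever renders your root page (the names are made up):

    ASSET_VERSION = "v2"   # roll back by flipping this to "v1" and redeploying

    def asset_url(filename):
        # yields /assets/v2/one.js, /assets/v2/two.js, ...
        return f"/assets/{ASSET_VERSION}/{filename}"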


That's fair. Most apps I have worked on have very few bundles of JavaScript that we either served inline or in one request, but I can see how this helps in the case of many requests.


> any reliable software release is being able to roll back if something goes wrong; we discussed how we do this at Google ...

How we do this at Google? Okay, tell me how to roll the crap Android 6 upgrade back to 5 on this Samsung tablet I have here.


Clearly everyone should consider their delivery costs and error acceptability when determining their development & release process.

Continuous automated deployments might not be a great fit for satellite control software, but that doesn't mean SaaS apps should switch to the same process as satellite software teams.


I am mostly amazed that this is the first time that I've read in depth about John Scott Haldane, who is the father of noted evolutionary biologist J.B.S. Haldane. Super interesting.


As a high traffic customer of Google, I've been this person far too many times.

   [...] if it breaks, real users get affected, so canarying should be the first step in your deployment process, as opposed to the last step in testing. 
It's a fine pattern and all, but not an excuse to throw stuff at prod and see what happens.


We don't just throw stuff at prod. Google has a culture of testing and code review, so code review, unit testing and integration tests are the first line of defense. Then we do QA testing, where a lot of problems not caught by tests are found, which then informs us where tests may be lacking. We also roll out to internal dogfood first; given Google's number of employees, this is kind of like a pre-canary.

The point of the canary is it gives you one last real world test where you can limit the damage if anything goes wrong. Without it, you'll have to assume your tests are perfect, and I doubt that's ever achieved in practice in a real environment, except maybe in avionics and space control systems.


I think the author of that sentence agrees with you. It's saying you should think of a canary as something to do only after testing is complete, not as a means of testing.


>but not an excuse to throw stuff at prod and see what happens

The problem with production traffic is you will never be able to simulate it perfectly. At the end of the day you are required to flip a switch and test things out in the wild.


What's a high traffic customer of Google?


[disclaimer: I work at Google]

https://cloud.google.com/customers/

Spotify, SnapChat, eBay are the examples I usually give when asked that question.

I'm not sure what numbers I am allowed to provide for any of them, but there's some public information available that gives you a sense of the scale involved:

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-...

https://www.itnews.com.au/news/do-not-fear-the-cloud-ebay-mi...


Host a high-traffic site on Google's infrastructure. Since you can see the version number of the platform, it's obvious when they're rolling out changes. This has caused many partial outages until the change was (I assume) automatically rolled back.

It's a little hard to take this advice from Google after being the victim of so many bad rollouts. Because we use a lot of services, we are far more likely to have problems. We seem to always be the canary.

That's not fun.


Not mentioned in this blog post are rollout-related outages.

It is common for a system to work fine before and after a rollout, but during a rollout clients experience errors.

One might imagine downloading a big file, for example, one that takes an hour. If you are downloading it from http-server-v1, which is being upgraded to http-server-v2, there is a small grace period for clients of v1 to complete their operations; in many datacenters that grace period is around 30 seconds. That means that if your operation is long-running, you will see a failure. The error code is usually HTTP 503, for which your client logic should retry the request with exponential backoff.
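The retry itself is not much code; a minimal version with the requests library might look like this (the delay and attempt numbers are arbitrary):

    import time
    import requests

    def get_with_retry(url, max_attempts=5, base_delay=0.5):
        """Retry on HTTP 503 with exponential backoff."""
        for attempt in range(max_attempts):
            resp = requests.get(url, timeout=60)
            if resp.status_code != 503:
                resp.raise_for_status()              # surface any other error
                return resp
            time.sleep(base_delay * 2 ** attempt)    # 0.5s, 1s, 2s, 4s, ...
        raise RuntimeError(f"still getting 503 from {url} after {max_attempts} attempts")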

If your client doesn't retry/resume the request, you now see the service as down, when in fact the error you are seeing is by design. It will happen for every release, but also when servers come and go for maintenance, or for a bunch of other reasons.

Good libraries will handle retries for you, but some don't handle them properly, and that's a bug.


But without canaries, you would have complete outages instead of partial ones.


Sure, but I'm pointing out that this strategy relies on real customers encountering an error. I caution people not to forget that that is a failure for those of us trying to ensure reliable websites.


Presumably someone who sells or buys lots of ads?


I assume they mean they visit the site frequently



