Why Deployment Freezes Don't Prevent Outages

objectified · on Dec 27, 2014

Whether or not this theory is true can be measured. A few important numbers come to mind.

- do you get more operations tickets right after a production deployment?

- do your call centers get an increased calls/hour rate after a production deployment?

- are there often noticeable anomalies in system resource usage that seem to be directly related to your deployment cycle?

- do your monitoring tools show a higher rate of warnings/criticals right after a production deployment?
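Any of these questions can be turned into a number by comparing the event rate shortly after deployments with the rate during the rest of the period. A minimal sketch, assuming you can export deployment and ticket/alert timestamps; the function name `rate_ratio` and the 24-hour window are my own illustrative choices, not something from a particular tool:

```python
from datetime import datetime, timedelta

def rate_ratio(deploys, events, period_start, period_end,
               window=timedelta(hours=24)):
    """Events/hour inside post-deployment windows divided by events/hour
    outside them. Returns None when the comparison is undefined."""
    # Build post-deployment windows, clipped to the period and merged
    # when they overlap, so no hour is counted twice.
    intervals = []
    for d in sorted(deploys):
        start, end = max(d, period_start), min(d + window, period_end)
        if start >= end:
            continue
        if intervals and start <= intervals[-1][1]:
            intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
        else:
            intervals.append((start, end))

    in_hours = sum((e - s).total_seconds() for s, e in intervals) / 3600
    out_hours = (period_end - period_start).total_seconds() / 3600 - in_hours

    in_period = [t for t in events if period_start <= t < period_end]
    in_events = sum(1 for t in in_period
                    if any(s <= t < e for s, e in intervals))
    out_events = len(in_period) - in_events

    if in_hours == 0 or out_hours == 0 or out_events == 0:
        return None
    return (in_events * out_hours) / (out_events * in_hours)
```

A ratio well above 1 would suggest more tickets or alerts right after deployments; it says nothing by itself about whether a freeze makes later deployments riskier.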

Whether or not production deployments introduce more risk is probably largely subjective. It depends on how well you test, how many changes go into a deployment, how collaborative your operations and development teams are, whether your technical teams are understaffed, and so on.

The advantages of a freeze (defined as "no production deployments during a certain time window") that I see are at least the following.

- from my own experience while working in operations, I remember very well the many sleepless nights that somehow always seemed to be the immediate consequence of new (byte)code running in production

- it gives operations a break; they often work at ungodly hours the whole year, and getting some rest is very much needed

- it gives a moment to stand still and reflect; work on internal tooling, putting some structure into ad hoc things that sneaked in, etcetera.

Furthermore, I don't agree with the philosophy that "everything is always broken". Sure: disks and power supplies break all the time, even load balancers break, security patches need to be applied, and so on. But these things are part of day-to-day operations, and most operations engineers know how to handle them. It's usually controllable, unlike a bug in newly introduced code from one developer that causes a stack overflow every time one particular application flow is hit. That requires a different kind of discipline to solve.

I think it's a little dangerous to generalize these things without having actual numbers to back them up; before you know it, your operations team won't have any excuse to just sit and play Quake for a week. In most cases, that's a joke.

dasil003 · on Dec 27, 2014

I agree the OA is a bit disingenuous about acknowledging the risk of deployments, but the four metrics you came up with also don't tell the whole story. Of course there will be more breakage after a deployment. The real question isn't whether that's true; it's whether subsequent deployments become even riskier when earlier deployments are withheld.

Retric · on Dec 27, 2014

Not all risks are equivalent. There is a reason planned outages are scheduled for ~2AM local time, not ~2PM local time. Plenty of companies have dealt with 2h windows where, if the site is down, they fail. Some have even been down during that time period and ended up laying off everyone.

tezza · on Dec 27, 2014

Change freezes have more uses than the OP has highlighted.

We've just had Christmas and Boxing Day. There may be fewer support staff available on those days, and the devs may be on holiday.

By having a change freeze beforehand, the set of things that have changed is reduced, so any issue that arises will be easier to diagnose.

Fewer changes allow a firm to justify a lower support overhead, though not to eliminate it.

sargun · on Dec 27, 2014

I'd say that's a straight up work freeze, not just a change freeze.

minot · on Dec 27, 2014

Not quite a work freeze. There is a subtle difference in that only the most critical code fixes will go in; we'd just document everything else and come up with solutions for it.

So this isn't exactly a 100% code freeze. There are still critical fixes that might go in with managerial understanding and approval.
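As a rough illustration of that rule (freeze window, with an exception for critical fixes that have managerial approval), here is a minimal sketch. The dates, the function name `may_deploy`, and its parameters are hypothetical and only stand in for whatever your process actually uses:

```python
from datetime import date

# Hypothetical freeze window; the dates are illustrative only.
FREEZE_START = date(2014, 12, 19)
FREEZE_END = date(2015, 1, 2)

def may_deploy(today, critical=False, approved_by=None):
    """Allow deployments outside the freeze; inside it, allow only
    critical fixes that carry a named approver."""
    in_freeze = FREEZE_START <= today <= FREEZE_END
    if not in_freeze:
        return True
    return critical and approved_by is not None
```

For example, `may_deploy(date(2014, 12, 24), critical=True, approved_by="ops-manager")` is allowed, while the same call without an approver is not.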

The downside is that people then start saying things like "my fix doesn't require any code changes, only SQL changes." I am not trying to be pedantic and say SQL is code, which it is. Rather, if the change could be done better in C# but we do a workaround in the database layer, that isn't exactly ideal.

tomohawk · on Dec 27, 2014

The idea is that if you have a deployment process, you should use it. Otherwise, you end up with "deployment by emergency", which is a process that only proceeds when an exceptional condition occurs.

A process should normally flow and make progress. If it can only make progress by exception, it is not a process and is probably greatly increasing risk.

An assembly line is a good example. By default, it moves forward. It is only stopped if something exceptional occurs.

greenleafjacob · on Dec 27, 2014

A corollary might be that one constraint to be optimized for is how much maintenance is required. That is:

* A positive amount of resources should be allocated towards deciding what to do when the disk becomes full, when memory runs out, etc., and it should be automated.

* When deciding between two ways to solve a problem, one factor in that decision should be whether it injects a dependency into some other process / function.
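To make the first point concrete, the "decide in advance, then automate" idea could look something like the sketch below. The thresholds and the action names ("alert", "cleanup") are assumptions for illustration; only `shutil.disk_usage` is a real standard-library call:

```python
import shutil

def disk_action(used_fraction, warn_at=0.80, act_at=0.95):
    """Map disk usage (0.0 to 1.0) to a pre-decided action."""
    if used_fraction >= act_at:
        return "cleanup"   # e.g. rotate logs, purge caches
    if used_fraction >= warn_at:
        return "alert"     # notify a human before it becomes urgent
    return "ok"

def check_disk(path="."):
    """Look up real usage for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return disk_action(usage.used / usage.total)
```

The decision is made once, in code, rather than at 3AM by whoever is on call.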

beejiu · on Dec 27, 2014

Nothing wrong with 'feature freezes' rather than complete code freezes. It's a great opportunity to get work done that you've been putting off all year. If the management aren't expecting new features, it's a perfect time to fix bugs, improve reliability and experiment.

nailer · on Dec 27, 2014

The author hits it on the head: financial institutions don't understand that stasis also has risk. I spent a couple of years at an investment bank that happily paid 5000 USD per server per year to run an out-of-date, critical-bugs-only copy of Solaris 8 as their production OS.

lostcolony · on Dec 27, 2014

The author seems to be mixing his messages.

His second sentence ("What really happens, of course, is that the system in question becomes booby-trapped with extra risk. As a result, problems are more likely, and when there is even a slight issue, it has the potential to escalate into a major crisis.") I -completely- disagree with, as phrased/contextualized, both from a theoretical and an experienced perspective (week-long change freezes prior to important events have led to fairly straightforward operation cycles, whereas deployments sometimes lead to unexpected new features, new UI elements, new bugs, etc., which people on a trade show floor probably don't want to have to deal with when demoing).

However, his conclusion (the big long paragraph at the end that I won't bother to quote) is suitably vague and abstract and filled with good advice (while ignoring subtleties and specifics, such as when he admits 'Frozen systems can run as-is briefly', but then immediately goes on to describe the issues with leaving them running as-is for long periods), so I can't directly disagree with it.

In short, I'm not really sure what to do with this. "Improve your deployment processes!" Well, sure, that's always a good thing. But that's mostly orthogonal to whether we have change/deployment freezes (they don't happen because of fear of the deployment process, but because of fear of the new code and the changes it brings). And the entire argument seems to posit that all change freezes are bad, yet he both throws a bone that they at least are tenable for brief periods in his conclusion, -and- ignores the body of evidence pretty much everyone has that short change freezes -do- grant comparative stability, which may make for sound business decisions (I personally have sat in a demo where key functionality was broken because of a last minute check-in of some library code from someone on another team that they had not properly tested. That stuff -happens-).

Were the tone changed to, sure, include all the risks inherent in a change freeze, but to rephrase the arguments to show that as time goes on an inflection point is reached such that the cons outweigh the benefits, and posit that that point is reached much faster than you might think, I'd be completely on board. As it is...meh.

EDIT: Maybe the author is specifically targeting web applications with frequent pushes to prod, i.e., the code you're freezing hasn't had time to shake out any of its issues, as compared to a versioned release that has been in production, and patched as necessary, for a week or two prior to the freeze, and the code freeze is just applicable to new versions, not patches considered critical.