Cascading errors caused AWS to go down (amazon.com)
151 points by Eliseann on June 18, 2012 | hide | past | favorite | 87 comments



The RSS link was quite amusing. My Chrome instance downloaded the RSS file without displaying it. Then I clicked it to open, and it opened Firefox. Firefox showed its file download box, suggesting I open the RSS with Google Chrome.

Deadlock detected.


On Linux, right? Firefox, or whatever gnome/dbus/opendesktop/gtk fuckery it uses has all sorts of strange notions about file types. When I download a tar.gz file, it saves a copy to /tmp, then launches a new instance of firefox with a file:// url, which opens a save file dialog.


Respectfully, your setup is broken; not linux...


Yup, similar file association nonsense can happen on Windows too (and it is a right PITA to fix)


Firefox on Linux works fine for me for both RSS files and tar.gz files.


Does anyone know a solution for this? When I got upgraded to this version of Chrome (Version 20.0.1132.34 beta), it started downloading RSS feeds instead of displaying them. Much sadness has ensued :(


AFAIK Chrome has never detected RSS correctly for me. I assumed this was special to Mozilla, with their "we should probably provide a basic version of everything - newsreader, FTP client, etc., even if it makes the browser a bit more bloated." Bit like emacs.

Chrome is a bit like vi, if you want more stuff, there's probably some sort of extension.

Sorry for the emacs/vi analogies, I'm not trying to flame :)


Yes there is an extension for it, by Google too.

https://chrome.google.com/webstore/detail/nlbjncdgjeocebhnmk...


Well, no, the extension only allows you to subscribe to an RSS feed, whereas the aforementioned feature in Chrome was that it would open the RSS feed inline.


Opera detected the link as an RSS feed and asked me to subscribe to it with the built-in RSS reader.

I wonder if other browsers would benefit from being able to at least detect RSS feeds and display appropriate information to the user.
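(Feed detection itself is simple; here's a minimal sketch - not any browser's actual implementation - of how a client could decide a URL is serving RSS/Atom: check the Content-Type, then fall back to sniffing the XML root element. The example URL is a placeholder.)

    # Minimal feed-detection sketch (Python): Content-Type first, then sniff
    # the XML root element. Not taken from any browser's real code.
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    FEED_MIME_TYPES = {"application/rss+xml", "application/atom+xml"}

    def looks_like_feed(url):
        with urlopen(url) as resp:
            ctype = resp.headers.get_content_type()
            body = resp.read()
        if ctype in FEED_MIME_TYPES:
            return True
        # Many feeds ship as text/xml or even text/plain, so sniff the root tag.
        try:
            root = ET.fromstring(body)
        except ET.ParseError:
            return False
        tag = root.tag.split("}")[-1]      # drop any XML namespace prefix
        return tag in ("rss", "feed")      # <rss> = RSS 2.0, <feed> = Atom

    print(looks_like_feed("https://example.com/some-feed"))  # placeholder URL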


> I wonder if other browsers would benefit from being able to at least detect RSS feeds and display appropriate information to the user.

Firefox does. It shows the feed right in the browser.


Safari does and pops up the default RSS client.


In my time at larger companies, DC power seems to be one of the weakest links in the reliability chain. Even planned maintenance often goes wrong ("well we started the generator test and the lights went out, that wasn't supposed to happen. Sorry your racks are dead").

Usually the root cause appears simple - a dead fan, a breaker set to the wrong threshold, an alarm that didn't trigger, an incorrect component picked during the design phase, or whatever else gets the blame - things that, to a software guy, look like they could be mitigated by good processes.

Can any electrical engineers elaborate on why power networks fail (in my experience at least) so frequently? I guess failure modes (e.g. lightning strike) are hard to test, but surely an industry this old has techniques. Is it perhaps a cost issue?


It's really incredibly complicated, and difficult to test fully. The bits of Amazon's DC that failed seem like stuff normal testing should catch, but the DC power failures I've dealt with in the past always had some really precise sequence of events that caused some strange failure no one expected.

As an example, Equinix in Chicago failed back in like 2005. Everything went really well, except there was some kind of crossover cable between generators that helped them balance load that failed because of a nick in its insulation. This caused some wonky failure cycle between generators that seemed straight out of Murphy's playbook.

They started doing IR scans of those cables regularly as part of their disaster prep. It's crazy how much power is moving around in these data centers; in a lot of ways they're in thoroughly uncharted territory.


The even crazier thing is big industrial plants where they are using tens or hundreds of MW and have much lower margins than datacenter companies, so they run with dual grid (HV, sometimes like 132kV) feeds and no onsite redundancy. As in, when the grids flicker, they lose $20mm of in-progress work.


I'd guess that's because "tens or hundreds of MW" of on-site backup power would be _ludicrously_ expensive to own/maintain, and the tradeoff against the risk of both ends of their dual grid flickering at once and trashing the current batch is less expensive. (or maybe the power supply glitches are insurable against, or have contract penalty clauses with the power companies?)


I assume you mean "Datacenter (conditioned) Power", not literally Direct Current power.

In my experience (in ~30 datacenters worldwide, and reading about more), the actual -48v Direct Current plant is usually ROCK SOLID, in comparison to the AC plant. It's almost always overprovisioned and underutilized, at least in older facilities, or those with telcos onsite (who, unlike crappy hosting customers, actually understand power).

My pro tip for datacenter reliability is to try to get as much of your core gear on the DC plant as possible -- core routers, and maybe even some of your infrastructure servers like RRs, monitoring, OOB management, etc. Ideally split stuff between DC and AC such that if either goes down, you're still sort of ok, or at least can recover quickly. DC and AC is even better than dual AC buses, since what starts out as dual AC can easily end up with a single point of failure later (like when they start running out of pdu space, power, or whatever), and dual AC is also more likely to have a closer upstream connection.

DC stuff is WAY simpler to make reliable and redundant, just uses larger amounts of copper and other materials.


Not an EE, but I've observed a few things about electrical infrastructure:

- The work is usually done by outside contractors, working off of specifications that may or may not make sense.

- Some aspects of testing have the potential to be dangerous to the people doing them. (i.e. if a network switch fails in testing, no big deal; if some types of electrical switch break during testing, the tester is dead.) High-voltage electricity is not a toy.

- IT and facilities staff usually don't talk much, and often don't understand each other when they do.

- There's no instrumentation. I get an alert when IT systems aren't configured right. Nothing from the other stuff.

- There is a wide variance in quality of electrical infrastructure that isn't obvious to someone who isn't skilled in that area. IT folks don't need to deal with computers built in 1970. Electricians deal with ancient stuff that may be completely borked all of the time.


Rackspace has a pretty detailed report[1] of their 2009 outages, which is surprisingly similar to the Amazon problems.

[1] http://broadcast.rackspace.com/downloads/pdfs/DFWIncidentRep...


Power failures caused by lightning strikes are relatively easy to test with platforms like RTDS [1] (I am not affiliated with RTDS).

I know that you can test your electrical protection systems in real time for almost all the possibilities you can imagine (thousands of them), for example: faults in your high voltage utility distribution system, breaker failures, coordination of the protection systems, loss of your back-up generator power. I don't know their systems or their philosophies, but it would be interesting to know why they don't parallelize groups of generators in the backup system, so that when one generator fails, the load is balanced across the others (using well-known schemes to avoid cascade failures).

[1] http://www.rtds.com/applications/applications.html


"Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications"

"Meaningful disruption" is a bit of a weasel word; Amazon's own EBS API was down for almost two hours[1] despite being designed to use multiple AZs

[1] "the EBS-related EC2 API calls were impaired from 8:57PM PDT until 10:40PM PDT ... The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores"

Guess the moral of the story is, if you require high availability then you must test your system in the face of an availability zone outage.


I love this sentence: Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.

Translation: If you have a redundant (multiple-AZ) installation, then you were ok, if not then your server died.


Data Center Operator:

We've lost our main power. No problem though we have a backup generator so we are good!

... 5 minutes later ...

Uhh boss, our backup generator's fan crapped out. But no worries, we have a secondary generator just for this kind of scenario!

...10 minutes later and lights go out...

"Well damn...looks like we configured the breaker wrong. This is not a good day."


Things could have been worse -- it could have been a nuclear power plant. Oh wait...


In case your browser doesn't speak RSS:

Service is operating normally: Root cause for June 14 Service Event June 16, 2012 3:15 AM

We would like to share some detail about the Amazon Elastic Compute Cloud (EC2) service event last night when power was lost to some EC2 instances and Amazon Elastic Block Store (EBS) volumes in a single Availability Zone in the US East Region.

At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power. Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.

The generator fan was fixed and the generator was restarted at 10:19PM PDT. Once power was restored, affected instances and volumes began to recover, with the majority of instances recovering by 10:50PM PDT. For EBS volumes (including boot volumes) that had inflight writes at the time of the power loss, those volumes had the potential to be in an inconsistent state. Rather than return those volumes in a potentially inconsistent state, EBS brings them back online in an impaired state where all I/O on the volume is paused. Customers can then verify the volume is consistent and resume using it. By 1:05AM PDT, over 99% of affected volumes had been returned to customers with a state 'impaired' and paused I/O to the instance.

Separate from the impact to the instances and volumes, the EBS-related EC2 API calls were impaired from 8:57PM PDT until 10:40PM PDT. Specifically, during this time period, mutable EBS calls (e.g. create, delete) were failing. This also affected the ability for customers to launch new EBS-backed EC2 instances. The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. The EBS datastore is used to store metadata for resources such as volumes and snapshots. One of the primary EBS datastores lost power because of the event. The datastore that lost power did not fail cleanly, leaving the system unable to flip the datastore to its replicas in another Availability Zone. To protect against datastore corruption, the system automatically flipped to read-only mode until power was restored to the affected Availability Zone. Once power was restored, we were able to get back into a consistent state and returned the datastore to read-write mode, which enabled the mutable EBS calls to succeed. We will be implementing changes to our replication to ensure that our datastores are not able to get into the state that prevented rapid failover.

Utility power has since been restored and all instances and volumes are now running with full power redundancy. We have also completed an audit of all our back-up power distribution circuits. We found one additional breaker that needed corrective action. We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes.

We sincerely apologize for the inconvenience to those who were impacted by the event.


"service event"

Added to my list of favourite euphemisms.


Sounds like Amazon is doing something wrong, shouldn't it fail over to Battery then Generator?


The batteries at colocation facilities are only designed to hold power long enough to transfer to the generator. They're also a huge single point of failure. A better design is a flywheel that provides enough ride-through power. But datacenters are often hit with these generator failures (in my experience, once every year or so).

Amazon had a correct setup--but not great testing.

By the way, these are great questions to ask of your datacenter provider: Are there two completely redundant power systems up to and including the PDUs and generators? How often are those tested? How do I set up my servers properly so that if one circuit/PDU/generator fails, I don't lose power?

There is a "right way" to do this--multiple power supplies in every server connected to 2 PDUs connected to 2 different generators--but it's expensive, and many/most low-end hosting providers won't set this up due to the cost.

(I ran a colocation/dedicated server company from 2001-2007.)


"Are there two completely redundant power systems up to and including the PDUs and generators? How often are those tested?"

And "are they tested for long enough to detect a faulty cooling fan that'll let the primary generator run at normal full working load for ~10mins and are the secondary gensets run and loaded up long enough to ensure something that'll trip ~5mins after startup isn't configured wrong?

While they clearly failed, I do have some sympathy for the architects and ops staff at Amazon here. I could very easily imagine a testing regime which regularly kicked both generator sets in, but without running them at working load for long enough to notice either of those failures. My guess is someone was feeling quite smug and complacent 'cause they've got automated testing in place showing months and years' worth of test switching from grid to primary to secondary and back, without ever having thought to burn enough fuel to keep the generators running long enough to expose these two problems.

"There is a "right way" to do this …"

There's a _very_ well known "right way" to do this in AWS - have all your stuff in at least two availability zones. Anybody running mission critical stuff in a single AZ has either chosen to accept the risk, or doesn't know enough about what they're doing… (Hell, I've designed - but never got to implement - projects that spread over multiple cloud providers, to avoid the possible failure mode of "What happens if Amazon goes bust / gets bought / all goes dark at once?")
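(For the multi-AZ part, the mechanics are trivial; a minimal sketch using boto3, which postdates this thread - the AMI ID and instance type are placeholders:)

    # Sketch: launch one instance in each of two Availability Zones so a
    # single-AZ power event can't take everything down at once.
    # boto3 postdates this thread; AMI ID and instance type are placeholders.
    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")

    for az in ("us-east-1a", "us-east-1b"):
        ec2.create_instances(
            ImageId="ami-00000000",      # placeholder AMI
            InstanceType="t2.micro",     # placeholder type
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )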


Or maybe they should just do it backwards

Run the generators and have the grid as backup

And just stop the generators to validate fallback to grid once in a while


Large cogeneration sites (where they use the waste heat from electrical generation for process steam, building/district heating, etc.) actually do run in grid-backup mode. An example is MIT's cogeneration plant (a couple of big natural gas turbines on Vassar street) -- a lot of universities do this since they can use the steam for heating, and a lot of industrial sites do it for process.

It comes down to cost and zoning/permitting. It's much easier to get a permit to run a generator for backup use than to run one 24x7. It's also hard to get a 1-10MW plant which is per-kWh as efficient/inexpensive as the grid (although now that natural gas is about 20% of what it was when I last bought it, gas turbines actually are cheaper than industrial tariff grid power, if you have good gas access...). Being able to actually use the waste heat is what makes the combined cycle efficiency worth it.

There was a crazy plan to run a datacenter on a barge tethered to the SF waterfront, for a variety of reasons, but a primary one being power -- the SF city government wouldn't be able to regulate the engines/generators on a ship running 24x7.


My university had a big cogen plant, but it was never designed to power the entire campus (it was only able to do so at around 3 AM). Aside from providing heating and power, because it was run off of natural gas it qualifies for clean energy credits, which the university makes money off of by selling on the market.


Hmm, wouldn't it be less practical to do that with large CHP plants (vs small ones)? Here in Europe district heating CHP plants are generally run by utilities.


that sounds like the Crash Only Software paper, but with respect to power sources


The rotational UPSes are the cause of the majority of 365 Main's downtime, and in general, horrible and must be destroyed with prejudice.

They're a nice idea in principle (and were the best option back in the mainframe era), but power electronics have improved faster than datacenter companies' ability to maintain rotational systems. They also weren't widely deployed enough to have a great support system, and it was firmware/software that caused most of their outages.

Dual line cord for network devices, and then STSes per floor area, probably make the most sense. Basically no commodity hosting provider uses dual line cord servers on A and B buses. I love having dual line cord for anything "real" (including virtualization servers for core infrastructure), but when you're selling servers for $50/mo, you can't.

(there's the Google approach to put a small UPS in each server, too...)


I had to upvote you just because I'm still pissed at 365 Main dropping power to our entire cage five years ago.


(Person you replied to here.) "The rotational UPSes are the cause of the majority of 365 Main's downtime, and in general, horrible and must be destroyed with prejudice."

No. Incorrect. There is a reason I 100% refused to move my hosting company there. I'm not going to say anything else publicly, but it wasn't the hardware that caused repeated outages there. (I moved my hosting company from San Francisco to San Jose, and lived in the Bay Area for 10 years. Everyone in the hosting industry in the Bay Area knew each other. I also hosted for years in AboveNet SJC3, which had the same flywheel setup.)

Note: I hope at this point they've fixed the issue. I've been out of the industry for a few years. I wish them the best.


Yes, I almost took half a floor of 365 Main back in 2003-2004, and didn't due to their (at the time) tenuous financial situation and thus being underresourced on everything. That and there being ~no carriers in the building at the time. For SF colo, 200 Paul remains a superior choice, although some floors have had problems, and it's a totally conventional design.

But the hitech UPS was a weak link. When they sold all their facilities to someone else (DRT), that fixed most of the other issues.


Totally correct -- our data center, and many of the universities that we work with, have a right-hand (RH) and left-hand (LH) power feed for a dual-PDU server, switch, router, console server, etc. You will typically run a power bar on the right and left hand sides of the rack and wire them accordingly. You will have dedicated breakers, panels, UPS, generator, etc. for the RH/LH side. If you ever need to service a panel, then you can safely cut power knowing that everything will still be powered by its partner. This happens once in a while and gives you breathing room if you need to replace a breaker or a power feed fails. We also test our generators on a monthly cycle to check for failures.

I also wanted to address your point about batteries. We have a device on each battery that monitors its state, so we can find faults before they cause the entire UPS to fail.


"We also test our generators on a monthly cycle to test for failures."

Curious - when "testing" them, how long do you run them for and at what load?

I could see the beancounters being _very_ unhappy with the ops people saying "we want to run both gen sets at full datacenter load for more than 10 minutes at a time, every month", which is what Amazon would have to have done to detect the faulty cooling fan problem. I'm guessing there are _some_ organisations who do that, but I suspect most datacenters don't.


I work for the Fed. We are a remote site and have several power outages each year due to trees on the power lines or snow related issues. It's pretty much required and we've had zero issues justifying it.


The baseline generator maintenance cost exceeds the fuel cost every month. A bigger issue is getting permits from your local/moronic government to run your generators for testing, but for a big datacenter, you're probably in an industrial neighborhood (to get correct power from multiple substations or higher) and this isn't an issue -- it's more an issue with office-datacenters or other backup systems in normal residential/commercial neighborhoods.


It takes a lot of discipline to run a shared power setup like you describe. Most of those servers that have two power supplies operate in a shared power mode, rather than active/failover. This means that if one of your sides (LH/RH) is over 50% and you fail over, you are going to have a cascading failure as the other side goes >100%. It used to surprise me how often I saw things like this, although I'm talking about server rooms in the low 100s of servers, not huge data centers.
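(The arithmetic is easy to sanity-check; a toy sketch with invented numbers:)

    # Toy illustration of the >50% problem with load-sharing dual feeds:
    # each side normally carries half the draw, so if the total exceeds one
    # side's capacity, losing either side overloads the survivor.
    def failover_ok(total_load_kw, per_side_capacity_kw):
        """True only if one side alone can carry the whole load."""
        return total_load_kw <= per_side_capacity_kw

    # 100 kW capacity per side, racks drawing 120 kW total: each side sits
    # at a comfortable-looking 60 kW, but a failover pushes the surviving
    # side to 120% and trips it -- the cascading failure described above.
    print(failover_ok(total_load_kw=120, per_side_capacity_kw=100))  # False
    print(failover_ok(total_load_kw=90, per_side_capacity_kw=100))   # True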


Ideally, you want at least 3+ generators any two of which can power the facility.


Telcordia recommends start-testing generators monthly and doing a full-load test once a year. Not every generator that starts will carry a load, as many folks have discovered.


No, doing it right = everything runs off battery, and you should be testing your gensets every day or so - we certainly did back in the '80s at Telecom Gold.


Battery backup isn't usually considered a "failover" step, just like your desktop battery backup doesn't actually do much other than stop charging when the power goes out.

Datacenters only really have battery backup to let the emergency generators come up.


You're correct. It goes battery then generator. If they didn't use battery first, then when the power initially failed all systems would be off-line, as the generator takes about 15-30 seconds to kick in.


He's correct or incorrect, and that entirely depends on the facility. Many newer datacenters don't use battery banks--they are expensive to maintain and often cause more failures than they prevent.


From what I can remember of a datacenter tour, most generators supply power to the batteries, which then supply power to servers. The batteries can only supply a few minutes (I think) of power, so the generators need to turn on immediately.


On the plus side, the level of transparency that AWS displays and the detail that they provide seems above and beyond the call of duty. I find it refreshing and I hope that other companies follow suit so that customers can understand the details of operational issues, calibrate expectations appropriately, and make informed decisions.


They're less transparent and responsive than most datacenter or network providers -- it's just that most of those providers hide their outage information behind an NDA, so only customer contacts get it, vs. making it public.


Yeah, a good datacenter will have SLAs around the root-cause analysis document for any failures. Like a preliminary report within a day and a final report within 7 days.


I've also had a few cases where providers either outright lied or only gave details if you persisted in requesting them. Having to play that game gets old…


The title is incorrect. It should say something more like "Cascading failures caused part of AWS to go down."


I'm running https://www.rootredirect.com/ and http://www.restbackup.com/ in us-east-1, in multiple availability zones. Both sites remained up with no problems.


Does your rootredirect service actually attract paying customers? I'm genuinely interested to know.


Also, how do you do this without having to handle the customers' DNS lookups as well?


Can someone translate that to control rods and manifolds?


Seems like deploying on two _physical_ regions (or more) is the best and only proven approach.

That could be within the global AWS, or even say, one cluster at AWS and the other at RackSpace/Linode, etc.


Then you just need to worry about your application consistency with the replication lag. No silver bullet I guess.


Did anyone else run into issues with ELB during the outage? We're multi-AZ and could access unaffected instances directly without a problem, but the load balancer kept claiming they were unhealthy.
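(If it helps anyone debugging the same thing, here's a minimal sketch of asking a classic ELB what it thinks of its registered instances - boto3 postdates this thread, and the load balancer name is a placeholder:)

    # Sketch: list each registered instance's health state and the ELB's
    # stated reason. boto3 postdates this thread; the name is a placeholder.
    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    resp = elb.describe_instance_health(LoadBalancerName="my-load-balancer")
    for state in resp["InstanceStates"]:
        print(state["InstanceId"], state["State"], state.get("Description", ""))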


Could it be possible to manage power the same way Erlang manages processes? Instead of 2 or 3 enormous backup power units, hundreds of small ones that come in and out of use "fluently".


TL;DR:

Shit happens. Don't use AWS as your only platform, you will get burned sometime. Guaranteed, you will also get burned if you try to host and run your own stuff. How competent you are determines which way you get burned less.


Actually, starting right now, AWS is probably your best bet.

Old story about Chuck Yeager from the 1950's: one time shortly after take-off, Yeager's aircraft suffered an engine failure, and he had to do an emergency semi-crash landing. When he realized that a mechanic had put the wrong type of fuel in the plane, he went looking for the guy. The mechanic profusely apologized, said he would resign and never work in aviation again. Yeager replied something along the lines of "Nonsense. In fact, I need someone to refuel my plane right now, and I want you to be the one to fuel it. That's because of all the guys here, I know you'll be the one guy who'll be sure to do it right."

Probably apocryphal, but the point has merit.


This is mentioned in 'How to win friends and influence people' where the anecdote is about Bob Hoover and Jet fuel put in a WW2 plane. It is used as an example that it is easy to criticize and complain but that it takes character to be understanding.


Or, if you use AWS as your only platform, accept that shit will happen from time to time. Unless your application is a matter of life and death, or unless billions of dollars are at stake, a little downtime now and then probably isn't that big a deal. (All my sites went down when Heroku did (including railstutorial.org, which pays my bills), but the losses are acceptable given the convenience of not having to run my own servers.)


I think it's reasonable to escalate criticism of Heroku for remaining in a single AZ. They have had plenty of time and resources to fix this, and haven't, despite being quite competent. I don't know if it is that they don't think it's necessary (due to the profile of their current customers) or what, but I wouldn't use Heroku for anything as long as they remain in a single AZ, and would be really reluctant to advise other people to do so. I obviously really like the Heroku team and product and would love to use them otherwise.

It wouldn't even need to be true seamless failover across AZs right away -- just offering a us-west and us-east Heroku would be enough for me, with shared nothing (maybe billing, or not even that), and then figure out redundancy yourself inside your app. Multiple regions is WAY better than multiple AZs within a region, too -- both for reliability and for locality.

Obviously a real seamless multi AZ/multi region solution would be much more technically impressive, useful to users, and Heroku-like, but they shouldn't let the perfect be the enemy of the good here.


While I'd agree with the general premise that platform diversification is a good thing if high availability is a requirement, given that this outage was single-AZ, it should really highlight the point that your application needs to be multi-AZ if it needs to stay up.


More accurately: “don't trust any single data center”. All of the people who complained were directly ignoring Amazon's own advice, not to mention decades of engineering experience.

Going multi-AZ, multi-region or multi-cloud will help, each step up that list being significantly more work for increasingly small returns.


Yes. Also, stop navel-gazing (usually that means stop reading Hacker News). Stop commenting on Hacker News as well. Funny thing about the Singularity/aliens/heaven--it'll come even if you don't spend a lot of time worrying about it.


What?


1 in a (million/billion/trillion) I guess.

That'll make for a great horror story to tell though.


Isn't the moral of the story, "Check your backups"? There was a defective fan in one generator (sounds like it was findable via a test run?) and a misconfigured circuit breaker (sounds like it was findable by a test run).

Redundancy is only helpful if the redundant systems are actually functional.


Having been affected several times by colocation facilities bouncing the power during a test of the failover system, I can tell you that such tests are not without risk. Yes, you should test redundant systems, but how often, at what cost, and what risks are you willing to run while doing so?

It's a fact of life that when dealing with complex, tightly coupled systems with multiple interactions between subsystems that you will routinely see accidents caused by improbable combinations of failures.


I wonder if it's better to create an accidental outage during a scheduled test, or to have an outage completely out of the blue. Obviously mitigation is tricky even during a scheduled test, but perhaps it's plausible?


Or do it with a non-production load

AWS could create some machines at a lower cost and lower availability - something that, if it goes down, doesn't affect you much, or is only used for one-off work.

I'm not sure how migrating machines between nodes happens in S3 or if it's easy to do it (maybe with some downtime)


With a scheduled test, you have the benefit of having the main power actually working if the backup being tested comes crashing down; seems to me that mitigation would be much quicker in a scheduled test than in a real outage.


But would you rather risk a once-in-10-years real power failure testing your backups, or would you pull the plug once per year to test it yourself?


I suspect a monthly test - if communicated to customers as well - would drive better customer behaviour, e.g. multi-AZ usage, automated validation of EBS, adding extra machines in another AZ automatically. Maybe start with one or two AZ's with an opt-in from customers.

It's the same as the advice to routinely restore your live data from backups. It's not a real backup until you've tested that you can recover from it.


It's an interesting and well studied area of statistics for medical tests.

e.g. if you have a test that is 99% accurate and a treatment that harms 1% of the patients, and you screen a million people - how common does the disease have to be before you cure more people than you kill?
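(Worked through with those numbers - assuming "99% accurate" means 99% sensitivity and 99% specificity, and that the treatment cures every true case while harming 1% of everyone treated:)

    # Back-of-envelope version of the screening question above. Assumptions:
    # "99% accurate" = 99% sensitivity and 99% specificity, the treatment
    # cures every true case, and it harms 1% of everyone who gets treated.
    def net_benefit(prevalence, population=1_000_000,
                    sensitivity=0.99, specificity=0.99, harm_rate=0.01):
        sick = population * prevalence
        healthy = population - sick
        true_pos = sick * sensitivity            # correctly flagged and treated
        false_pos = healthy * (1 - specificity)  # treated needlessly
        treated = true_pos + false_pos
        return true_pos - harm_rate * treated    # cured minus harmed

    for p in (0.00001, 0.0001, 0.001, 0.01):
        print(f"prevalence {p:.3%}: net benefit {net_benefit(p):,.0f} people")
    # Break-even lands around 1 in 10,000 with these assumptions.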


I agree, it sounds like this could have been discovered had the two backup power systems been properly tested.

Note how in the case of backup power, "properly tested" doesn't mean 'Does the generator turn on? Are we getting electricity from it? Ok, pass!'. It means running the backup generator in a way that is consistent with what you would expect in an actual power failure - i.e., for more than just a few minutes.

Same thing with storage backup. Checking your backups isn't just 'was a backup file/image created?', it means _actually trying to recover your systems from those backup files_.
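(A minimal sketch of that "actually restore it" drill - the backup path, the restore command, and the sanity query are all hypothetical placeholders:)

    # Restore drill sketch: don't just check the backup file exists; restore
    # it somewhere disposable and verify the result. The backup path, the
    # restore command, and the sanity query are hypothetical placeholders.
    import sqlite3
    import subprocess

    BACKUP_FILE = "/backups/app-latest.sqlite3.gz"   # hypothetical path
    SCRATCH_DB = "/tmp/restore-drill.sqlite3"        # throwaway restore target

    def restore_drill():
        # 1. Restore into a scratch location, never over production.
        subprocess.run(f"gunzip -c {BACKUP_FILE} > {SCRATCH_DB}",
                       shell=True, check=True)
        # 2. Run a sanity check against the restored copy.
        conn = sqlite3.connect(SCRATCH_DB)
        (rows,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
        conn.close()
        assert rows > 0, "restored database is empty -- backup is not usable"
        print(f"restore drill passed: {rows} rows recovered")

    restore_drill()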


But what about parts that are going to fail in the next 35 minutes of running, when you only run it for 30? There are too many variables to account for here, I think.


The datacenter I use has a policy of doing full-load runs for 30 minutes each month.

AWS would have found this, and been able to fix it in a timely fashion, if they had done the same (the genset lasted for 10 minutes under load before failing).


Most of the generators I'm aware of require this for maintenance purposes anyway - without running them occasionally, the lubricants will freeze up and seize the engine.


All sorts of stuff.

Diesel is great food. For bacteria, that is. So it's treated, but you've still got to stir it to keep gunk from settling, you've got to rotate it (so you burn through your stock every so often), you've got to filter it. And stages of all of that can go wrong.

I recall a cruise on a twin V12 turbodiesel powered ship (hey, we've got full redundancy!) in which both engines failed. Cause? Goopy fuel and clogged filters (she spent a lot of time in port). This happened a couple of hours into cruise, fortunately on inland waterways, not open seas, shallow enough water to anchor, and numerous skiffs with which we could head to shore and find replacement parts.

More recently, a colo site I know of was hit by a similar outage: utility power went out, generators kicked in, but a transfer switch failed to activate properly. Colo dumped.

The second time it was the fire detection system, which inadvertently activated. One of its activation modes is to deenergize the datacenter (if you've got a problem with too much energy liberation, not dumping more energy into the system can be a useful feature). Again, full colo dump. And APCs will only buy you enough time for a clean shutdown. If you've got them at all.

But, yes: exercise your backups.


Yes, generators must be run periodically. However, not all data centers actually put the full load (or any load at all) on the generators during non-emergency periodic testing.



