This sounds very similar to something that can happen at small scales too. It happened to me when I naively used a preforking apache and mod_proxy_balancer on a machine with 2GB RAM. We had a surge in traffic and the load balancer passed the "paging threshold" (it started swapping) and at that point the increased latency caused requests to pile up as users would get impatient and hit reload, leaving processes tied up waiting to time out.
In my case I was able to get things working again by disabling all non-essential apache modules and lowering the timeout. I even had to take the load balancer completely down for a few minutes to get the users to back off long enough to ramp up smoothly again. Then I switched to nginx and haven't looked back.
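For anyone wondering why lowering the timeout mattered so much on a 2GB box: with a prefork setup, every stuck request pins a whole worker process for up to the full timeout, so a rough Little's-law estimate shows the memory math. (The numbers below are hypothetical, not my actual traffic; it's just a sketch of the shape of the problem.)

    # Rough Little's law estimate: concurrent workers ~= request rate x time held.
    # During the pile-up every request hangs until the timeout, and each prefork
    # worker is a full process eating RAM.  Numbers below are made up.
    req_per_sec = 20
    worker_rss_mb = 25          # per-process footprint with lots of modules loaded

    for timeout_s in (300, 60, 10):
        workers = req_per_sec * timeout_s        # workers tied up waiting to time out
        print(f"timeout {timeout_s:3d}s -> ~{workers:5d} stuck workers, "
              f"~{workers * worker_rss_mb // 1024} GB RSS")

Cutting the timeout (and slimming down per-worker memory by dropping modules) is what brings that number back under physical RAM, which is why it stopped the swapping.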
Obviously I am not comparing my apache instance to App Engine, just the broad strokes of the self-perpetuating failure mode. But, reading between the lines this post basically admits to oversubscription on App Engine. Talking about load as having an unexpected impact on reliability (especially during a "global restart") is a nice way of saying that they got more traffic than they could handle.
I doubt the "paging threshold" referred to in this post has anything to do with paging to disk. It probably just means the point at which the operations folks' pagers start going off.
The interesting thing with this type of bug is that they did not necessarily need to get more load than they could handle over a medium-length time period (say, one second). If they get more than they can handle even over a short period, throughput drops significantly as the system tries to compensate, and a normal load they are perfectly capable of handling becomes one they cannot.
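To make the dynamic concrete, here's a toy simulation I threw together (all numbers are made up and this has nothing to do with App Engine's actual architecture): a server that can comfortably handle the steady load gets a two-tick burst, clients hit reload after a timeout, and the abandoned requests still occupy the queue. The backlog never recovers even though the steady load is well under capacity.

    # Toy discrete-time sketch (numbers made up): a server that finishes
    # CAPACITY requests per tick, and clients who hit reload after waiting
    # TIMEOUT ticks.  The abandoned request stays queued (the server can't
    # tell the client gave up), so every retry is extra work.
    CAPACITY = 100            # requests finished per tick
    TIMEOUT = 3               # ticks a client waits before retrying
    BASE = 90                 # steady offered load, comfortably under capacity
    BURST = {5: 250, 6: 250}  # a brief two-tick overload

    backlog = []              # ages (in ticks) of queued requests, oldest first
    for tick in range(25):
        retries = sum(1 for age in backlog if age == TIMEOUT)  # impatient reloads
        arrivals = BASE + BURST.get(tick, 0) + retries
        backlog += [0] * arrivals
        backlog = backlog[CAPACITY:]        # serve the CAPACITY oldest requests
        backlog = [age + 1 for age in backlog]
        print(f"tick {tick:2d}: arrivals {arrivals:4d}  backlog {len(backlog):5d}")

Run it and you can watch arrivals climb past capacity on retries alone long after the burst ends, which is exactly the "normal load becomes unhandleable" effect.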
Here is a link to what Amazon posted about the outage a few days ago. I originally wrote this post with only the later examples, not realizing that Amazon had posted another great one about the latest outage; when I went looking to find out how long that outage lasted, I was pleased to find yet another AWS message to read.
This is what Amazon posted during the really large outage last year (the one that still only affected multiple availability zones for at most an hour or so):
Amazon's explanations are, I find, much more detailed (although this App Engine one was pretty good): when something serious goes wrong at AWS, we often not only get an apology (and a service credit), but we learn something about how distributed systems work in the process.
The times we don't see explanations from Amazon are when a subset of the servers within a single availability zone (not even an entire zone) is inaccessible for less than an hour (which occasionally happens); otherwise they honestly "kick ass" at post-mortems, as the above examples show.
However, it is my understanding that Google has also had all kinds of random issues that only affected some customers and were dealt with in private, so that isn't any different with them. The outage this morning, however, was "all of App Engine doesn't work anymore", something that has never even happened to AWS.
(Now, during the issue itself, Amazon really, really sucks, to the point where I'd often rather they say nothing than have their front line keep reassuring people; that said, in the middle of a crisis, most systems/people suck.)
That's because hosting providers have their systems set up to route around contained failures. Single failures are transparently compensated for, so you never notice them. The only system that can take down the entire network is (mostly) the load-balancing/failure-compensating system itself, so it makes sense that the bugs in these systems are almost exclusively cascade failures.
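Here's a rough sketch of that point (entirely hypothetical, nothing to do with Google's actual setup): routing around one dead backend is invisible to users, but the same compensation logic, triggered fleet-wide by something correlated (a global restart, an overload that makes health checks time out), turns the safety net itself into the outage.

    import random

    # Hypothetical balancer that drops backends failing health checks.
    class Balancer:
        def __init__(self, backends):
            self.backends = backends          # list of {"name": str, "up": bool}

        def route(self):
            pool = [b for b in self.backends if b["up"]]
            if not pool:
                raise RuntimeError("no healthy backends left")
            return random.choice(pool)["name"]

    backends = [{"name": f"web{i}", "up": True} for i in range(4)]
    lb = Balancer(backends)

    backends[2]["up"] = False      # one contained failure: nobody notices
    print(lb.route())

    for b in backends:             # correlated failure: the balancer has nothing left
        b["up"] = False
    try:
        print(lb.route())
    except RuntimeError as e:
        print("total outage:", e)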
I know neteng isn't the simplest thing in the world, but I was struck that, for all the "Google Interview...dummies don't even think about it" stories (not to mention the Microsoft-mockery of the last couple of decades), the fix was first to reboot everything, then, when that lumped too much traffic in the wrong places, to reboot everything more slowly four hours later, which fixed everything in 35 minutes.
Think "load balancing service" when you see "traffic router" in this case. This was not necessarily a case of physically rebooting a Juniper or something.
That's one interpretation, but from what little information they offer it seems their initial "reboots" (of whatever form) introduced asymmetric traffic loads that blew out some segments, after which they chased fixes for a couple hours.
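A back-of-envelope way to see why the slower restart would work (totally made-up numbers, just illustrating the shape of the problem): while a router is out, its traffic lands on whoever is left, so the size of the restart batch decides whether the survivors stay under capacity.

    # Back-of-envelope sketch with made-up numbers: load shifted onto the
    # surviving routers while `down` of them are restarting at once.
    ROUTERS = 8
    LOAD_PER_ROUTER = 70        # units each router normally carries
    CAPACITY = 100              # what one router can absorb before it chokes

    def survivor_load(down):
        return ROUTERS * LOAD_PER_ROUTER / (ROUTERS - down)

    print(survivor_load(4))     # restart half at once: 140.0 per survivor -> overload, cascade
    print(survivor_load(1))     # one at a time: 80.0 per survivor -> fine, just slower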
I remember a similar incident at Microsoft that brought Bing down; the solution was to add two more Cisco core routers (which work in pairs) at a cost of 100k (200k?) each.
It's funny watching software people getting owned by the hardware people ;)
i get the impression that routers are hard to configure well, particularly when used in "complex" ways. isn't there some hardware startup looking at fixing this? i can't remember the name, but thought i had read about it here before.
naively, it seems like the configuration is at too low a level - individual routers - and that there should be higher-level coordination with support for simulating different conditions. or does that already exist? what's the state of the art for places like google to manage routers?