This sounds very similar to something that can happen at small scales too. It happened to me when I naively used a preforking apache and mod_proxy_balancer on a machine with 2GB RAM. We had a surge in traffic and the load balancer passed the "paging threshold" (it started swapping) and at that point the increased latency caused requests to pile up as users would get impatient and hit reload, leaving processes tied up waiting to time out.
In my case I was able to get things working again by disabling all non-essential apache modules and lowering the timeout. I even had to take the load balancer completely down for a few minutes to get the users to back off long enough to ramp up smoothly again. Then I switched to nginx and haven't looked back.
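For anyone wondering why lowering the timeout mattered so much on a 2GB box: with a prefork setup, every stuck request pins a whole worker process for up to the full timeout, so a rough Little's-law estimate shows the memory math. (The numbers below are hypothetical, not my actual traffic; it's just a sketch of the shape of the problem.)

    # Rough Little's law estimate: concurrent workers ~= request rate x time held.
    # During the pile-up every request hangs until the timeout, and each prefork
    # worker is a full process eating RAM.  Numbers below are made up.
    req_per_sec = 20
    worker_rss_mb = 25          # per-process footprint with lots of modules loaded

    for timeout_s in (300, 60, 10):
        workers = req_per_sec * timeout_s        # workers tied up waiting to time out
        print(f"timeout {timeout_s:3d}s -> ~{workers:5d} stuck workers, "
              f"~{workers * worker_rss_mb // 1024} GB RSS")

Cutting the timeout (and slimming down per-worker memory by dropping modules) is what brings that number back under physical RAM, which is why it stopped the swapping.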
Obviously I am not comparing my apache instance to App Engine, just the broad strokes of the self-perpetuating failure mode. But, reading between the lines this post basically admits to oversubscription on App Engine. Talking about load as having an unexpected impact on reliability (especially during a "global restart") is a nice way of saying that they got more traffic than they could handle.
I doubt the "paging threshold" referred to in this post has anything to do with paging to disk. It probably just means the point at which the operations folks' pagers start going off.
The interesting thing with this type of bug is that they did not necessarily need to get more load than they could handle over a medium-length time period (say, one second). If they get more than they can handle even over a short period, throughput drops significantly as the system tries to compensate, and a normal load they are perfectly capable of handling becomes one they cannot.
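To make the dynamic concrete, here's a toy simulation I threw together (all numbers are made up and this has nothing to do with App Engine's actual architecture): a server that can comfortably handle the steady load gets a two-tick burst, clients hit reload after a timeout, and the abandoned requests still occupy the queue. The backlog never recovers even though the steady load is well under capacity.

    # Toy discrete-time sketch (numbers made up): a server that finishes
    # CAPACITY requests per tick, and clients who hit reload after waiting
    # TIMEOUT ticks.  The abandoned request stays queued (the server can't
    # tell the client gave up), so every retry is extra work.
    CAPACITY = 100            # requests finished per tick
    TIMEOUT = 3               # ticks a client waits before retrying
    BASE = 90                 # steady offered load, comfortably under capacity
    BURST = {5: 250, 6: 250}  # a brief two-tick overload

    backlog = []              # ages (in ticks) of queued requests, oldest first
    for tick in range(25):
        retries = sum(1 for age in backlog if age == TIMEOUT)  # impatient reloads
        arrivals = BASE + BURST.get(tick, 0) + retries
        backlog += [0] * arrivals
        backlog = backlog[CAPACITY:]        # serve the CAPACITY oldest requests
        backlog = [age + 1 for age in backlog]
        print(f"tick {tick:2d}: arrivals {arrivals:4d}  backlog {len(backlog):5d}")

Run it and you can watch arrivals climb past capacity on retries alone long after the burst ends, which is exactly the "normal load becomes unhandleable" effect.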
Here is a link to what Amazon posted about the outage a few days ago. I originally wrote this post with only the later examples, not realizing that Amazon had posted another great one about the latest outage; when I went looking to find out how long that outage lasted, I was pleased to find yet another AWS message to read.
This is what Amazon posted during the really large outage last year (the one that still only affected multiple availability zones for at most an hour or so):
Amazon's explanations are, I find, much more detailed (although this App Engine one was pretty good): when something serious goes wrong at AWS, we often not only get an apology (and a service credit), but we learn something about how distributed systems work in the process.
The times we don't see explanations from Amazon are when a subset of the servers within a single availability zone (not even an entire zone) is inaccessible for less than an hour (which occasionally happens); otherwise they honestly "kick ass" at post-mortems, as the above examples show.
However, it is my understanding that Google has also had all kinds of random issues that only affected some customers and were dealt with in private, so that isn't any different with them. The outage this morning, however, was "all of App Engine doesn't work anymore", something that has never even happened to AWS.
(Now, during the issue itself, Amazon really, really sucks, to the point where I'd often rather they say nothing than have their front line keep reassuring people; that said, in the middle of a crisis, most systems/people suck.)
That's because hosting providers have their systems set up to route around contained failures. Single failures are transparently compensated for, so you never notice them. The only system that can take down the entire network is (mostly) the load-balancing/failure-compensating system itself, so it makes sense that the bugs in these systems are almost exclusively cascade failures.
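Here's a rough sketch of that point (entirely hypothetical, nothing to do with Google's actual setup): routing around one dead backend is invisible to users, but the same compensation logic, triggered fleet-wide by something correlated (a global restart, an overload that makes health checks time out), turns the safety net itself into the outage.

    import random

    # Hypothetical balancer that drops backends failing health checks.
    class Balancer:
        def __init__(self, backends):
            self.backends = backends          # list of {"name": str, "up": bool}

        def route(self):
            pool = [b for b in self.backends if b["up"]]
            if not pool:
                raise RuntimeError("no healthy backends left")
            return random.choice(pool)["name"]

    backends = [{"name": f"web{i}", "up": True} for i in range(4)]
    lb = Balancer(backends)

    backends[2]["up"] = False      # one contained failure: nobody notices
    print(lb.route())

    for b in backends:             # correlated failure: the balancer has nothing left
        b["up"] = False
    try:
        print(lb.route())
    except RuntimeError as e:
        print("total outage:", e)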
I know neteng isn't the simplest thing in the world, but I was struck that, for all the "Google Interview...dummies don't even think about it" stories (not to mention the Microsoft-mockery of the last couple of decades), the fix was first to reboot everything, then, when that lumped too much traffic in the wrong places, to reboot everything more slowly four hours later, which fixed everything in 35 minutes.
Think "load balancing service" when you see "traffic router" in this case. This was not necessarily a case of physically rebooting a Juniper or something.
That's one interpretation, but from what little information they offer it seems their initial "reboots" (of whatever form) introduced asymmetric traffic loads that blew out some segments, after which they chased fixes for a couple hours.
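A back-of-envelope way to see why the slower restart would work (totally made-up numbers, just illustrating the shape of the problem): while a router is out, its traffic lands on whoever is left, so the size of the restart batch decides whether the survivors stay under capacity.

    # Back-of-envelope sketch with made-up numbers: load shifted onto the
    # surviving routers while `down` of them are restarting at once.
    ROUTERS = 8
    LOAD_PER_ROUTER = 70        # units each router normally carries
    CAPACITY = 100              # what one router can absorb before it chokes

    def survivor_load(down):
        return ROUTERS * LOAD_PER_ROUTER / (ROUTERS - down)

    print(survivor_load(4))     # restart half at once: 140.0 per survivor -> overload, cascade
    print(survivor_load(1))     # one at a time: 80.0 per survivor -> fine, just slower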
I remember a similar incident at Microsoft that brought Bing down; the solution was to add two more Cisco core routers (which work in pairs) at a cost of 100k (200k?) each.
It's funny watching software people getting owned by the hardware people ;)
i get the impression that routers are hard to configure well, particularly when used in "complex" ways. isn't there some hardware startup looking at fixing this? i can't remember the name, but thought i had read about it here before.
naively, it seems like the configuration is at too low a level - individual routers - and that there should be higher-level coordination with support for simulating different conditions. or does that already exist? what's the state of the art for places like google to manage routers?