Hacker News

Cascading failures seem to be a recurring theme among hosting providers.



Cascading failures are a recurring theme in outages among many interconnected systems. The Northeast blackout of 2003 is one such example.


That's because hosting providers have their systems set up to route around contained failures. Single failures are transparently compensated for, so you never notice them. The only system that can take down the entire network is (mostly) the load-balancing/failure-compensation system itself. It makes sense that the bugs in these systems are almost exclusively cascade failures.
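To make that concrete, here's a toy model of my own (nothing from the article): servers share load evenly, and when one fails the balancer redistributes its share to the survivors. If that pushes the survivors past capacity, they fail too, so a failure the balancer could have absorbed turns into a cascade once headroom runs out.

```python
# Toy cascade model: N identical servers share total load evenly.
# When some fail, the load balancer spreads the load over the survivors;
# if that exceeds per-server capacity, the survivors fail as well.

def simulate_cascade(n_servers, load_per_server, capacity, initial_failures=1):
    total_load = n_servers * load_per_server
    alive = n_servers - initial_failures
    failed = initial_failures
    while alive > 0:
        per_server = total_load / alive
        if per_server <= capacity:
            return failed  # redistribution absorbed the failure; cascade stops
        # In this uniform model, overload hits every survivor at once.
        failed = n_servers
        alive = 0
    return failed

# With 20% headroom, losing one of ten servers is absorbed silently...
assert simulate_cascade(10, load_per_server=80, capacity=100) == 1
# ...but losing three at once overloads the rest and takes out everything.
assert simulate_cascade(10, 80, 100, initial_failures=3) == 10
```

The numbers are made up, but they show the dynamic: the compensation mechanism hides small failures completely, right up until it becomes the thing that propagates a big one.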


I know neteng isn't the simplest thing in the world, but I was struck that for all the "Google Interview...dummies don't even think about it" stories (not to mention the Microsoft-mockery of the last couple of decades), the fix was first to reboot everything, then, when that lumped too much traffic in the wrong places, to reboot everything more slowly four hours later, which fixed everything in 35 minutes.
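The "reboot everything more slowly" part is essentially a rolling restart. A minimal sketch of the idea, with a hypothetical `restart` callback and host list (the article doesn't describe their actual tooling): restart in small batches with a pause between them, so reconverging traffic doesn't all land on the same links at once.

```python
import time

def rolling_restart(hosts, restart, batch_size=2, pause_s=30.0):
    """Restart hosts a few at a time instead of all at once.

    `restart` is a placeholder callback, e.g. one that bounces the
    load-balancing service on a host; it is not from the article.
    """
    for i in range(0, len(hosts), batch_size):
        for host in hosts[i:i + batch_size]:
            restart(host)
        if i + batch_size < len(hosts):
            time.sleep(pause_s)  # let traffic and routing state settle
```

Rebooting everything simultaneously is the degenerate case (`batch_size = len(hosts)`, no pause), which is roughly what lumped the traffic the first time.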


Think "load balancing service" when you see "traffic router" in this case. This was not necessarily a case of physically rebooting a Juniper or something.


That's one interpretation, but from what little information they offer, it seems their initial "reboots" (of whatever form) introduced asymmetric traffic loads that blew out some segments, after which they chased fixes for a couple of hours.



