Hacker News

Cascading failures seem to be a recurring theme among hosting providers.



Cascading failures are a recurring theme in outages among many interconnected systems. The Northeast blackout of 2003 is one such example.


That's because hosting providers have their systems set up to route around contained failures. Single failures are transparently compensated for, so you never notice them. The only system that can take down the entire network is (mostly) the load-balancing/failure-compensation system itself. It makes sense that the bugs in these systems are almost exclusively cascade failures.
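To make that concrete, here's a toy model of my own (nothing from the article): servers share load evenly, and when one fails the balancer redistributes its share to the survivors. If that pushes the survivors past capacity, they fail too, so a failure the balancer could have absorbed turns into a cascade once headroom runs out.

```python
# Toy cascade model: N identical servers share total load evenly.
# When some fail, the load balancer spreads the load over the survivors;
# if that exceeds per-server capacity, the survivors fail as well.

def simulate_cascade(n_servers, load_per_server, capacity, initial_failures=1):
    total_load = n_servers * load_per_server
    alive = n_servers - initial_failures
    failed = initial_failures
    while alive > 0:
        per_server = total_load / alive
        if per_server <= capacity:
            return failed  # redistribution absorbed the failure; cascade stops
        # In this uniform model, overload hits every survivor at once.
        failed = n_servers
        alive = 0
    return failed

# With 20% headroom, losing one of ten servers is absorbed silently...
assert simulate_cascade(10, load_per_server=80, capacity=100) == 1
# ...but losing three at once overloads the rest and takes out everything.
assert simulate_cascade(10, 80, 100, initial_failures=3) == 10
```

The numbers are made up, but they show the dynamic: the compensation mechanism hides small failures completely, right up until it becomes the thing that propagates a big one.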


I know neteng isn't the simplest thing in the world, but I was struck that for all the "Google Interview...dummies don't even think about it" stories (not to mention the Microsoft-mockery of the last couple of decades), the fix was first to reboot everything, then, when that lumped too much traffic in the wrong places, to reboot everything more slowly four hours later, which fixed everything in 35 minutes.
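The "reboot everything more slowly" part is essentially a rolling restart. A minimal sketch of the idea, with a hypothetical `restart` callback and host list (the article doesn't describe their actual tooling): restart in small batches with a pause between them, so reconverging traffic doesn't all land on the same links at once.

```python
import time

def rolling_restart(hosts, restart, batch_size=2, pause_s=30.0):
    """Restart hosts a few at a time instead of all at once.

    `restart` is a placeholder callback, e.g. one that bounces the
    load-balancing service on a host; it is not from the article.
    """
    for i in range(0, len(hosts), batch_size):
        for host in hosts[i:i + batch_size]:
            restart(host)
        if i + batch_size < len(hosts):
            time.sleep(pause_s)  # let traffic and routing state settle
```

Rebooting everything simultaneously is the degenerate case (`batch_size = len(hosts)`, no pause), which is roughly what lumped the traffic the first time.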


Think "load balancing service" when you see "traffic router" in this case. This was not necessarily a case of physically rebooting a Juniper or something.


That's one interpretation, but from what little information they offer, it seems their initial "reboots" (of whatever form) introduced asymmetric traffic loads that blew out some segments, after which they chased fixes for a couple of hours.



