
I'm going to read all the conspiracy theories on this post.

https://news.ycombinator.com/item?id=20345060 (Facebook, Instagram, and WhatsApp outages)




You know the old adage that celebrities die in threes? It's actually mathematically supported... or, well, it's supported that they die in 2.718s. Same principle would apply to cloud service outages if all the services and their failures were actually independent. We'd expect them to happen in "clusters" of e:

http://ssp.impulsetrain.com/celebrities.html

I still love me a good conspiracy theory, but clustering of random (Poisson) events is much more likely than you'd expect.


That's calculated with the assumption that you can cluster the events within a 1-month period and that the events happen at a fixed rate.

We are talking about a period of a few days here and about only a handful of services that tout 99.999(9)% uptime. I'm no mathematician but I don't think it's a great comparison.


AWS averages about one major outage per year[0].

Google's own status page[1] lists nearly 100 incidents in the past 365 days, only one of which ended up on the HN front page.

[0] https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...

[1] https://status.cloud.google.com/summary


Both of these Google outages made it to the HN frontpage:

https://news.ycombinator.com/item?id=20077421

https://news.ycombinator.com/item?id=20338263

Both of these AWS outages made it to the HN frontpage (and neither is listed in that AWS timeline):

https://news.ycombinator.com/item?id=20298719

https://news.ycombinator.com/item?id=19749781


Good correction: the outages do happen at a fixed rate. Regarding GCP, I only see a couple events that have lasted more than 8 hours, not sure how many of those affect more than one subsystem.


No, the cluster size isn't fixed, and neither is the event rate. The clusters are chosen purely based on what an "unusually long" time between events would be. That's precisely how the rate ends up canceling itself out. Given this formulation, the expected cluster size is always e, regardless of how you define celebrity. That's what makes it fun.
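
A quick simulation makes the "rate cancels out" point concrete. This is only a sketch, and it rests on one assumption that is not spelled out above: a gap counts as "unusually long" when it exceeds the mean gap, 1/rate. The function name and its parameters are made up for illustration.

```python
import math
import random

def mean_cluster_size(rate, n_events=200_000, seed=0):
    """Average cluster size for a Poisson process with the given rate.

    Assumption (not from the thread): a new cluster starts whenever the
    gap since the previous event is longer than the mean gap, 1 / rate.
    """
    rng = random.Random(seed)
    mean_gap = 1.0 / rate
    clusters = 1  # the first event opens the first cluster
    for _ in range(n_events - 1):
        gap = rng.expovariate(rate)  # exponential inter-arrival time
        if gap > mean_gap:
            clusters += 1
    return n_events / clusters

if __name__ == "__main__":
    for rate in (0.1, 1.0, 50.0):
        print(rate, round(mean_cluster_size(rate), 3))
    print("e =", round(math.e, 3))
```

Under that assumption, the chance that any given gap breaks a cluster is exp(-1) whatever the rate is, so cluster sizes are geometric with mean 1/exp(-1) = e.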

Now the human psychology part that this doesn't cover is that typically when you have two or three A-listers die, well, then you start seeing all the B- C- and D-listers that also died in the same period that you would have otherwise ignored.


Can you calculate the chance of the cluster size being >5 when we can cluster events within a one-week period and the rate of events is one every 2 months (debatable, but feels right to me)?

I think that would just end this discussion. I don't know how to calculate that, but my intuition says the resulting chance is low.


I don't know how to do the math, but my Python simulation says zero.

  import random

  service_providers = 20
  servers_threshold = 5
  # probability that one provider stays up for a whole week,
  # assuming 99.9% uptime on each of 7 independent days
  up_time_threshold = 0.999**7
  years = 20
  occurrences = 0

  # run sim 10,000 times
  for i in range(10000):
      servers_down = False

      # simulate every week in 20 years
      for j in range(52 * years):
          how_many_down_this_week = 0

          # draw one uniform number per provider; it is down this week
          # when the draw exceeds the weekly up-probability
          for s in range(service_providers):
              if random.random() > up_time_threshold:
                  how_many_down_this_week += 1

          # did more than 5 providers go down in the same week?
          if how_many_down_this_week > servers_threshold:
              servers_down = True
              break

      if servers_down:
          occurrences += 1

  print(occurrences)
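
For what it's worth, the same model can be checked without simulating. The sketch below treats the number of providers down in a week as Binomial(20, 1 - 0.999**7), which is what the simulation draws; the helper name `prob_more_than` is made up for illustration.

```python
import math

def prob_more_than(k, n, p):
    """P(X > k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1, n + 1))

p_down_week = 1 - 0.999**7                    # one provider down at some point in a week
p_week = prob_more_than(5, 20, p_down_week)   # more than 5 of 20 down in the same week
weeks = 52 * 20
p_run = 1 - (1 - p_week)**weeks               # at least one such week in a 20-year run
expected_hits = 10_000 * p_run                # expected value of `occurrences`

print(p_week, p_run, expected_hits)
```

The per-week probability comes out on the order of 1e-9, so the expected count over 10,000 runs is well below one, and a printed zero is what this model predicts.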


Given those parameters, you'd expect to see something every 100,000 years or so. (Poisson distribution, lambda=1/8 per week, k=5, the result is about 2e-7 per week.)
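
The Poisson estimate can be reproduced in a few lines. This is a sketch assuming one event per 8 weeks on average (so lambda = 1/8 per one-week window) and 52 windows per year; the function names are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(X == k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_tail(k, lam, terms=60):
    """P(X >= k), summing the upper terms directly to avoid cancellation."""
    return sum(poisson_pmf(i, lam) for i in range(k, k + terms))

lam = 1 / 8                                # assumed events per one-week window
p_exact = poisson_pmf(5, lam)              # exactly 5 events in a week
p_at_least = poisson_tail(5, lam)          # 5 or more events in a week
years_between = 1 / (p_at_least * 52)      # mean wait, in years

print(p_exact, p_at_least, years_between)
```

Both probabilities land around 2e-7 per week, which works out to a wait somewhat under 100,000 years.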


Plus one big outage story then makes other minor outages get more headlines... when normally no one cares.

Same thing with hurricanes, when a bad one happens in the US they are more likely to document other minor hurricanes in the Caribbean and Latin America. But otherwise plenty of hurricanes hit those areas and it doesn't make major headlines.


Facebook, Google, Apple, Cloudflare, Azure, Amazon, WhatsApp, and we have noticed smaller bits of routing weirdness like servers in Los Angeles not being able to hit GitHub for a few hours a few days ago.

This is more like 6-8, not e. It's definitely odd.

One non-conspiracy explanation I can imagine is that maybe all these big providers have a bunch of hidden dependencies on each other.


Apple documents their dependencies on other Cloud providers. Facebook has its own DCs but I think they’re sometimes colocated with cloud providers’.


But the outages are sequential, not simultaneous


They are very dense compared to the usual rate.


I agree, this is a bigger and really dense cluster, but remember that e is just the average (mean) cluster size. There will be clusters bigger than it.
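
To put a rough number on "bigger than e", here is a sketch under the assumption that a cluster ends whenever a gap exceeds the mean gap. Each gap then breaks the cluster with probability exp(-1), so cluster size is geometric. The function name is made up for illustration.

```python
import math

def prob_cluster_at_least(k):
    """P(cluster size >= k), assuming each gap independently ends the
    cluster with probability exp(-1) (gap longer than the mean gap)."""
    p_break = math.exp(-1)
    return (1 - p_break) ** (k - 1)

if __name__ == "__main__":
    for k in (3, 6, 8):
        print(k, round(prob_cluster_at_least(k), 4))
```

Under this model about one cluster in ten has six or more events, so a cluster of 6-8 is uncommon but not extraordinary. It says nothing about how tightly packed in time those events are.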

My point is simply that our intuition of randomness biases us to see meaning in clusters when there is none in the first place. In other words, we need to intentionally shift our prior on how unlikely this event is. Yes, it's still unlikely, but it's not as unlikely as you would think.


“When it rains, it pours” is the conventional wisdom distillation of Poisson variable behavior



Actually there are not many actual conspiracies listed there.

I'll start:

- All the companies in the list are getting rid of their Huawei dependencies, replacing routers with domestic ones.

- China is flexing its muscles in the trade war: look, we are creating some instability in your e-commerce. Wait until we attack your infra (electricity, water).


If you go down the route of conspiracy theory you have to also believe that the companies who have already issued a PR response are lying and that all of the employees privy to the truth are also staying silent.

That would be quite a story.


I agree, Snowden was a pretty big story exposing the tech companies for working with the NSA in the PRISM program after previously denying it, including Apple. https://en.wikipedia.org/wiki/PRISM_(surveillance_program)


Amazing. It's 2019, and you still don't understand what PRISM is. Hint: at no place in the system diagram does it show any NSA system interacting with any of those tech companies' systems.


The iPhone and the atomic bomb were kept secret quite well.


I don’t think there has been a case of an atomic bomb engineer leaving a prototype at a bar, though.


How would we know? Information (esp. sensitive information) was a lot easier to hide back then.

Also, the iPhone that leaked was an iterative model, not the original.


I think we know because an atomic bomb prototype would have been far too large and conspicuous for an individual to take to, let alone leave at, a bar :)


Not to mention the bar would’ve been in Los Alamos, a town run by the military.


Also there were rumors of the iPhone for yeeeeaaaars. I remember people jokingly referring to it as "the jesus phone" while speculating about it.


There were rumors of the iPhone for a couple of years before it was released.

Even funnier, there were quite accurate rumors based on patents of the “true video iPod” since 2003 - a 3.5 inch all touch screen iPod. The rumor sites correctly predicted the iPod Touch down to the resolution years before it was released. They didn’t know that would also be the form factor of the iPhone.


One company keeping a specific product launch secret is different from three or four independent companies and all of their network engineers all conspiring.


Sure, but the Manhattan project was a huge undertaking with multiple extremely large facilities.

What if the NSA identified a critical backdoor and ordered all major companies to decommission it and not talk about it? (NSLs)


Fair enough... But then they all just did a really crappy job of concealing the patch?


Note I said decommission, not patch. And aren't most of the outages of this period routing related?


Why would you assume it to be so malicious, if true?

Maybe they were under attack. Maybe they were just removing all of the Bloomberg rice grain chips! :)

There's an infinite number of "maybe"s though. Far more than there are truths.

Your own bias about the entities shapes your conspiracy theory.


From the public, yes. From other nations, no.


Let’s not forget the existence of ayyyy lmao.


If the US is attacking Iran via cyberwar, Iran would likely want to retaliate. Google's post about fiber cables being cut seems like a good experiment to see the potential impact on our US network. Many companies rely on backbone lines to transfer petabytes of data between data centers around the country; these are our weakest points.


Maybe some submarines are cutting some deep sea cables? In actuality it's probably massive backbone upgrades that are classified; the NSA has to obtain real-time data from somewhere.


Why would the NSA be doing such low-level (literal wiretapping) work anymore? Useful traffic is almost always encrypted.


Knowing who is communicating with whom is pretty useful, even if you can't decrypt the data.


Clearly this means they've broken public-key cryptography.


Flaws could be discovered in modern communication such that they can decrypt it in 30 years' time. How good is 30-year-old crypto today?


They have a #ailored #ccess #perations department. Look it up.


You sure are playing it safe there.


They log who talks to whom.


Is this a joke about the Russian communications sub fire?


Once is happenstance. Twice is coincidence. The third time it’s enemy action.

I don't even know if the fourth or fifth is confirmation.


Dude, I wrote that top comment. And I was like iCloud WTF?



