You know the old adage that celebrities die in threes? It's actually mathematically supported... or, well, it's supported that they die in 2.718s. The same principle would apply to cloud service outages if all the services and their failures were actually independent: we'd expect them to happen in "clusters" of e.
That's calculated with the assumption that you can cluster the events within a 1-month period and that the events happen at a fixed rate.
We are talking about a period of a few days here and about only a handful of services that tout 99.999(9)% uptime. I'm no mathematician but I don't think it's a great comparison.
Good correction: the outages do happen at a fixed rate. Regarding GCP, I only see a couple events that have lasted more than 8 hours, not sure how many of those affect more than one subsystem.
No, the cluster size isn't fixed, nor is the event rate: the clusters are defined purely by what an "unusually long" time between events would be. That's precisely how the rate ends up canceling out. Given this formulation, the expected cluster size is always e, regardless of how you define "celebrity". That's what makes it fun.
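To spell out why the rate cancels: the gaps between Poisson events are exponential with some mean m, and the probability that a gap exceeds m is e^-1 no matter what the rate is. Split the stream at those long gaps and the cluster sizes come out geometric with parameter e^-1, so the mean cluster size is 1/(e^-1) = e. A quick sanity-check simulation (a sketch; it assumes "unusually long" means "longer than the mean gap", which is one reasonable reading):

import random

# Poisson arrivals: gaps are exponential. The rate is arbitrary; it cancels out.
rate = 1.0
mean_gap = 1 / rate
gaps = [random.expovariate(rate) for _ in range(10**6)]

cluster_sizes = []
size = 1
for gap in gaps:
    if gap > mean_gap:          # an "unusually long" gap closes the cluster
        cluster_sizes.append(size)
        size = 1
    else:                       # a short gap pulls the next event into the cluster
        size += 1

print(sum(cluster_sizes) / len(cluster_sizes))   # ~2.718, i.e. e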
Now the human psychology part that this doesn't cover is that typically when you have two or three A-listers die, well, then you start seeing all the B- C- and D-listers that also died in the same period that you would have otherwise ignored.
Can you calculate the chance of the cluster size being >5 when we can cluster events within a one-week period and the rate of events is one every 2 months (debatable, but feels right to me)?
I think that would just end this discussion. I don't know how to calculate that, but my intuition says the resulting chance is low.
I don't know how to do the math, but my python simulation says zero.
import random

service_providers = 20
down_threshold = 5            # more than this many down at once counts as a "cluster"
weekly_up_prob = 0.999 ** 7   # P(a service with 99.9% daily uptime stays up all week)
years = 20
occurrences = 0

# Run the whole 20-year simulation 10,000 times.
for i in range(10000):
    saw_big_cluster = False
    # Simulate each week of the 20 years.
    for week in range(52 * years):
        down_this_week = 0
        # Draw each provider's fate for the week.
        for s in range(service_providers):
            if random.random() > weekly_up_prob:
                down_this_week += 1
        # More than 5 providers down in the same week?
        if down_this_week > down_threshold:
            saw_big_cluster = True
    if saw_big_cluster:
        occurrences += 1

print(occurrences)
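You can also cross-check in closed form (a sketch, assuming outages arrive as a Poisson process at one event per two months, i.e. roughly 0.115 per week): the chance of more than 5 events landing in the same week is just a Poisson tail, and it agrees with the simulation's "zero".

import math

rate_per_week = 1 / ((2 * 365 / 12) / 7)   # one event per 2 months ~= 0.115/week

# P(N <= 5) for N ~ Poisson(rate_per_week), then take the complement.
p_at_most_5 = sum(math.exp(-rate_per_week) * rate_per_week**k / math.factorial(k)
                  for k in range(6))
print(1 - p_at_most_5)                 # ~3e-9 for any single week
print(1 - p_at_most_5 ** (52 * 20))    # ~3e-6 over 20 years of weeks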
Plus, one big outage story makes other minor outages get more headlines... when normally no one cares.
Same thing with hurricanes: when a bad one hits the US, the press is more likely to cover other minor hurricanes in the Caribbean and Latin America. Otherwise, plenty of hurricanes hit those areas and it doesn't make major headlines.
Facebook, Google, Apple, Cloudflare, Azure, Amazon, WhatsApp... and we've noticed smaller bits of routing weirdness too, like servers in Los Angeles not being able to reach GitHub for a few hours a few days ago.
This is more like 6-8, not e. It's definitely odd.
One non-conspiracy explanation I can imagine is that maybe all these big providers have a bunch of hidden dependencies on each other.
I agree, this is a bigger and really dense cluster, but remember that e is just the average (mean) cluster size. There will be clusters bigger than it.
My point is simply that our intuition of randomness biases us to see meaning in clusters when there is none in the first place. In other words, we need to intentionally shift our prior on how unlikely this event is. Yes, it's still unlikely, but it's not as unlikely as you would think.
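To put a rough number on "clusters bigger than e": under the same split-at-the-mean-gap model from upthread, cluster sizes are geometric with parameter e^-1, so a cluster of 6 or more just needs five consecutive short gaps (a sketch under that assumption):

import math

q = math.exp(-1)        # P(a gap is longer than the mean gap, ending the cluster)
print((1 - q) ** 5)     # P(cluster size >= 6) ~= 0.10, about one cluster in ten

So a run of six-plus near-simultaneous outage stories isn't a one-in-a-million fluke under pure randomness; it's more like one cluster in ten.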
Actually, there are not many actual conspiracy theories listed there.
I will start:
- All the companies in the list are getting rid of their Huawei dependencies, replacing routers with domestic ones.
- China is flexing its muscles in the trade war: look, we are creating some instability in your e-commerce. Wait until we attack your infra (electricity, water).
If you go down the conspiracy-theory route, you also have to believe that the companies who have already issued a PR response are lying, and that all of the employees privy to the truth are staying silent.
Amazing. It's 2019, and you still don't understand what PRISM is. Hint: nowhere in the system diagram is any NSA system shown interacting with any of those tech companies' systems.
I think we know because an atomic bomb prototype would have been far too large and conspicuous for an individual to take to, let alone leave at, a bar :)
There were rumors of the iPhone for a couple of years before it was released.
Even funnier, there were quite accurate rumors, based on patents, of the “true video iPod” going back to 2003: a 3.5-inch, all-touch-screen iPod. The rumor sites correctly predicted the iPod Touch down to the resolution years before it was released. They didn't know it would also be the form factor of the iPhone.
One company keeping a specific product launch secret is different from three or four independent companies, and all of their network engineers, conspiring together.
If the US is attacking Iran via cyberwar, it's likely Iran would want to retaliate. Google's post about fiber cables being cut seems like a good experiment to see the potential impact on the US network. Many companies rely on backbone lines to transfer petabytes of data between data centers around the country; these are our weakest points.
Maybe some submarines are cutting deep-sea cables? In actuality it's probably massive backbone upgrades that are classified; the NSA has to obtain its real-time data from somewhere.
https://news.ycombinator.com/item?id=20345060 (Facebook, Instagram, and WhatsApp outages)