You know the old adage that celebrities die in threes? It's actually mathematically supported... or, well, it's supported that they die in 2.718s. The same principle would apply to cloud service outages if all the services and their failures were actually independent: we'd expect them to happen in "clusters" of e.
That's calculated with the assumption that you can cluster the events within a 1-month period and that the events happen at a fixed rate.
We are talking about a period of a few days here and about only a handful of services that tout 99.999(9)% uptime. I'm no mathematician but I don't think it's a great comparison.
Good correction: the outages do happen at a fixed rate. Regarding GCP, I only see a couple events that have lasted more than 8 hours, not sure how many of those affect more than one subsystem.
No, the cluster size isn't fixed, nor is the event rate: the clusters are defined purely by what an "unusually long" time between events would be. That's precisely how the rate ends up canceling out. Given this formulation, the expected cluster size is always e, regardless of how you define "celebrity". That's what makes it fun.
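To spell out why the rate cancels: the gaps between Poisson events are exponential with some mean m, and the probability that a gap exceeds m is e^-1 no matter what the rate is. Split the stream at those long gaps and the cluster sizes come out geometric with parameter e^-1, so the mean cluster size is 1/(e^-1) = e. A quick sanity-check simulation (a sketch; it assumes "unusually long" means "longer than the mean gap", which is one reasonable reading):

import random

# Poisson arrivals: gaps are exponential. The rate is arbitrary; it cancels out.
rate = 1.0
mean_gap = 1 / rate
gaps = [random.expovariate(rate) for _ in range(10**6)]

cluster_sizes = []
size = 1
for gap in gaps:
    if gap > mean_gap:          # an "unusually long" gap closes the cluster
        cluster_sizes.append(size)
        size = 1
    else:                       # a short gap pulls the next event into the cluster
        size += 1

print(sum(cluster_sizes) / len(cluster_sizes))   # ~2.718, i.e. e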
Now the human psychology part that this doesn't cover is that typically when you have two or three A-listers die, well, then you start seeing all the B- C- and D-listers that also died in the same period that you would have otherwise ignored.
Can you calculate the chance of the cluster size being >5 when we can cluster events within a one-week period and the rate of events is one every 2 months (debatable, but feels right to me)?
I think that would just end this discussion. I don't know how to calculate that, but my intuition says the resulting chance is low.
I don't know how to do the math, but my python simulation says zero.
import random

service_providers = 20
down_threshold = 5            # more than this many down at once counts as a "cluster"
weekly_up_prob = 0.999 ** 7   # P(a service with 99.9% daily uptime stays up all week)
years = 20
occurrences = 0

# Run the whole 20-year simulation 10,000 times.
for i in range(10000):
    saw_big_cluster = False
    # Simulate each week of the 20 years.
    for week in range(52 * years):
        down_this_week = 0
        # Draw each provider's fate for the week.
        for s in range(service_providers):
            if random.random() > weekly_up_prob:
                down_this_week += 1
        # More than 5 providers down in the same week?
        if down_this_week > down_threshold:
            saw_big_cluster = True
    if saw_big_cluster:
        occurrences += 1

print(occurrences)
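You can also cross-check in closed form (a sketch, assuming outages arrive as a Poisson process at one event per two months, i.e. roughly 0.115 per week): the chance of more than 5 events landing in the same week is just a Poisson tail, and it agrees with the simulation's "zero".

import math

rate_per_week = 1 / ((2 * 365 / 12) / 7)   # one event per 2 months ~= 0.115/week

# P(N <= 5) for N ~ Poisson(rate_per_week), then take the complement.
p_at_most_5 = sum(math.exp(-rate_per_week) * rate_per_week**k / math.factorial(k)
                  for k in range(6))
print(1 - p_at_most_5)                 # ~3e-9 for any single week
print(1 - p_at_most_5 ** (52 * 20))    # ~3e-6 over 20 years of weeks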
Plus, one big outage story makes other minor outages get more headlines... when normally no one cares.
Same thing with hurricanes: when a bad one hits the US, the press is more likely to cover other minor hurricanes in the Caribbean and Latin America. Otherwise, plenty of hurricanes hit those areas and it doesn't make major headlines.
Facebook, Google, Apple, Cloudflare, Azure, Amazon, WhatsApp... and we've noticed smaller bits of routing weirdness too, like servers in Los Angeles not being able to reach GitHub for a few hours a few days ago.
This is more like 6-8, not e. It's definitely odd.
One non-conspiracy explanation I can imagine is that maybe all these big providers have a bunch of hidden dependencies on each other.
I agree, this is a bigger and really dense cluster, but remember that e is just the average (mean) cluster size. There will be clusters bigger than it.
My point is simply that our intuition of randomness biases us to see meaning in clusters when there is none in the first place. In other words, we need to intentionally shift our prior on how unlikely this event is. Yes, it's still unlikely, but it's not as unlikely as you would think.
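To put a rough number on "clusters bigger than e": under the same split-at-the-mean-gap model from upthread, cluster sizes are geometric with parameter e^-1, so a cluster of 6 or more just needs five consecutive short gaps (a sketch under that assumption):

import math

q = math.exp(-1)        # P(a gap is longer than the mean gap, ending the cluster)
print((1 - q) ** 5)     # P(cluster size >= 6) ~= 0.10, about one cluster in ten

So a run of six-plus near-simultaneous outage stories isn't a one-in-a-million fluke under pure randomness; it's more like one cluster in ten.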
Actually, there are not many actual conspiracy theories listed there.
I will start:
- All the companies in the list are getting rid of their Huawei dependencies, replacing routers with domestic ones.
- China is flexing its muscles in the trade war: look, we are creating some instability in your e-commerce. Wait until we attack your infra (electricity, water).
If you go down the conspiracy-theory route, you also have to believe that the companies who have already issued a PR response are lying, and that all of the employees privy to the truth are staying silent.
Amazing. It's 2019, and you still don't understand what PRISM is. Hint: nowhere in the system diagram is any NSA system shown interacting with any of those tech companies' systems.
I think we know because an atomic bomb prototype would have been far too large and conspicuous for an individual to take to, let alone leave at, a bar :)
There were rumors of the iPhone for a couple of years before it was released.
Even funnier, there were quite accurate rumors, based on patents, of the “true video iPod” going back to 2003: a 3.5-inch, all-touch-screen iPod. The rumor sites correctly predicted the iPod Touch down to the resolution years before it was released. They didn't know it would also be the form factor of the iPhone.
One company keeping a specific product launch secret is different from three or four independent companies, and all of their network engineers, conspiring together.
If the US is attacking Iran via cyberwar, it's likely Iran would want to retaliate. Google's post about fiber cables being cut seems like a good experiment to see the potential impact on the US network. Many companies rely on backbone lines to transfer petabytes of data between data centers around the country; these are our weakest points.
Maybe some submarines are cutting deep-sea cables? In actuality it's probably massive backbone upgrades that are classified; the NSA has to obtain its real-time data from somewhere.
https://news.ycombinator.com/item?id=20345060 (Facebook, Instagram, and WhatsApp outages)