For those out of the loop, this week has seen outages at a bunch of major US internet companies: Cloudflare, Slack, Google Cloud, Azure, Facebook (including Instagram and WhatsApp), and now iCloud.
The one thing we can say with some certainty (except for the first Cloudflare outage about a week ago[1]) is that it's not BGP/routing table fuckery/IP space hijacks/people who don't know MANRS. There are lots of routing table monitoring services, and big ISPs run their own alerting/alarming systems.
AMD has been shipping their new Epyc Rome server CPUs to hyperscalers for a few weeks now [1]. It seems to be a very appealing chip that a lot of them are adopting. They won't be the same as whatever they were using before, so it could be that major updates are needed to include them, meaning that it's just a risky time right now?
Yup, correct. The whole stack is consistently sharded. They should be sharing little to no machinery at all and only a handful of people are able to use the "migration" tool to force your account from one partition to the other.
What struck me as an interesting vector back when I was involved with iCloud was how this ability could be used to move an account from a production partition to the staging partition, since the staging partition has unique debugging and tracing capabilities.
When an employee joins iCloud, their account is often moved from the production partition to the staging partition using this tool. Technically it should be possible to take any iCloud account and move it to this partition and use the slightly modified code in there to gain access to the contents of the account.
Nope, just a typical network hardware failure in a west coast datacenter, which triggered a cascading BGP failure disrupting much of the network connectivity in that DC.
You know the old adage that celebrities die in threes? It's actually mathematically supported... or, well, it's supported that they die in 2.718s. Same principle would apply to cloud service outages if all the services and their failures were actually independent. We'd expect them to happen in "clusters" of e:
That's calculated with the assumption that you can cluster the events within a 1-month period and that the events happen at a fixed rate.
We are talking about a period of a few days here and about only a handful of services that tout 99.999(9)% uptime. I'm no mathematician but I don't think it's a great comparison.
Good correction: the outages do happen at a fixed rate. Regarding GCP, I only see a couple events that have lasted more than 8 hours, not sure how many of those affect more than one subsystem.
No, the cluster size isn't fixed, nor is the event rate; the clusters are chosen purely based upon what an "unusually long" time between events would be. That's precisely how the rate ends up canceling itself out. Given this formulation, the expected cluster size is always e, regardless of how you define celebrity. That's what makes it fun.
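A quick way to see where e can come from, as a minimal sketch under one assumption that is not spelled out above: events arrive as a Poisson process, and a gap counts as "unusually long" (ending the cluster) when it exceeds the mean gap. The chance of any single gap being that long is exp(-1) whatever the rate, so cluster sizes come out geometric with mean e. The function name `mean_cluster_size` and its parameters are made up for illustration.

```python
import math
import random


def mean_cluster_size(rate, n_events=200_000, seed=0):
    """Estimate the average cluster size for a Poisson process.

    Assumption (not from the thread): a gap is "unusually long", and so
    ends the current cluster, when it exceeds the mean gap 1 / rate.
    """
    rng = random.Random(seed)
    mean_gap = 1.0 / rate
    clusters = 1  # the first event always opens a cluster
    for _ in range(n_events - 1):
        gap = rng.expovariate(rate)  # exponential inter-arrival time
        if gap > mean_gap:
            clusters += 1  # long gap: the next event starts a new cluster
    return n_events / clusters


if __name__ == "__main__":
    # The rate cancels out: very different rates give about the same answer.
    for rate in (0.1, 1.0, 50.0):
        print(rate, round(mean_cluster_size(rate), 3), "vs e =", round(math.e, 3))
```

Changing `rate` by orders of magnitude should leave the estimate close to 2.718, which is the "rate cancels itself out" point in code form.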
Now the human psychology part that this doesn't cover is that typically when you have two or three A-listers die, well, then you start seeing all the B- C- and D-listers that also died in the same period that you would have otherwise ignored.
Can you calculate the chance of the cluster size being >5 when we can cluster events within a one-week period and the rate of events is one every 2 months (debatable, but feels right to me)?
I think that would just end this discussion. I don't know how to calculate that, but my intuition says the resulting chance is low.
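One hedged way to put a number on that, assuming the events form a Poisson process and a cluster keeps growing while each successive gap is at most one week: the cluster has more than 5 events exactly when the first 5 gaps are all that short. The two-month rate is the guess from the question; the 365/6-day conversion and the function name are assumptions for illustration.

```python
import math


def prob_cluster_larger_than(k, mean_gap_days, window_days):
    """P(cluster size > k) when a cluster continues while gaps <= window.

    Assumes exponentially distributed gaps (a Poisson process), which is a
    modelling choice and not something established in the thread.
    """
    p_short_gap = 1.0 - math.exp(-window_days / mean_gap_days)
    # Size > k means the first k gaps after the opening event are all short.
    return p_short_gap ** k


if __name__ == "__main__":
    two_months = 365.0 / 6.0  # about 60.8 days between events on average
    print(prob_cluster_larger_than(5, two_months, 7.0))
```

Under these assumptions the result is on the order of 10^-5 per cluster, which agrees with the intuition that the chance is low.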
I don't know how to do the math, but my python simulation says zero.
    import random

    service_providers = 20
    servers_threshold = 5
    up_time_threshold = 0.999**7  # prob a provider stays up all week (99.9% daily uptime)
    years = 20
    occurrences = 0

    # run the simulation 10,000 times
    for i in range(10000):
        servers_down = False
        # simulate every week in 20 years
        for j in range(52 * years):
            how_many_down_this_week = 0
            # draw an outcome for each service provider
            for s in range(service_providers):
                if random.random() > up_time_threshold:
                    how_many_down_this_week += 1
            # did more than the threshold go down in the same week?
            if how_many_down_this_week > servers_threshold:
                servers_down = True
                break  # one such week is enough for this run
        if servers_down:
            occurrences += 1

    print(occurrences)
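The simulation's answer can be cross-checked without random sampling. This is a sketch that reuses the script's own assumptions (20 independent providers, 99.9% daily uptime, more than 5 down in the same week); the helper names are illustrative.

```python
import math


def prob_more_than_k_down(n, k, p_down):
    """Exact binomial tail: P(more than k of n independent providers are down)."""
    return sum(
        math.comb(n, i) * p_down**i * (1.0 - p_down) ** (n - i)
        for i in range(k + 1, n + 1)
    )


def prob_at_least_once(p_week, weeks):
    """P(the event happens in at least one of `weeks` independent weeks)."""
    return 1.0 - (1.0 - p_week) ** weeks


if __name__ == "__main__":
    p_down_week = 1.0 - 0.999**7  # provider has at least one bad day in a week
    p_week = prob_more_than_k_down(20, 5, p_down_week)
    p_20_years = prob_at_least_once(p_week, 52 * 20)
    print(p_week, p_20_years, 10_000 * p_20_years)
```

Under these assumptions the expected count across 10,000 runs is well below one, which is consistent with the simulation printing zero. It says nothing about correlated failures, which is the more interesting case here.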
Plus one big outage story then makes other minor outages get more headlines... when normally no one cares.
Same thing with hurricanes, when a bad one happens in the US they are more likely to document other minor hurricanes in the Caribbean and Latin America. But otherwise plenty of hurricanes hit those areas and it doesn't make major headlines.
Facebook, Google, Apple, Cloudflare, Azure, Amazon, WhatsApp, and we have noticed smaller bits of routing weirdness like servers in Los Angeles not being able to hit GitHub for a few hours a few days ago.
This is more like 6-8, not e. It's definitely odd.
One non-conspiracy explanation I can imagine is that maybe all these big providers have a bunch of hidden dependencies on each other.
I agree, this is a bigger and really dense cluster, but remember that e is just the average (mean) cluster size. There will be clusters bigger than it.
My point is simply that our intuition of randomness biases us to see meaning in clusters when there is none in the first place. In other words, we need to intentionally shift our prior on how unlikely this event is. Yes, it's still unlikely, but it's not as unlikely as you would think.
Actually there are not many actual conspiracies listed there.
I will start:
- All the companies in the list are getting rid of their Huawei dependencies, replacing routers with domestic ones.
- China is flexing its muscles in the trade war: look, we are creating some instability in your e-commerce. Wait until we attack your infra (electricity, water).
If you go down the route of conspiracy theory you have to also believe that the companies who have already issued a PR response are lying and that all of the employees privy to the truth are also staying silent.
Amazing. It's 2019, and you still don't understand what PRISM is. Hint: at no place in the system diagram does it show any NSA system interacting with any of those tech companies' systems.
I think we know because an atomic bomb prototype would have been far too large and conspicuous for an individual to take to, let alone leave at, a bar :)
There were rumors of the iPhone for a couple of years before it was released.
Even funnier, there were quite accurate rumors based on patents of the “true video iPod” since 2003 - a 3.5 inch all touch screen iPod. The rumor sites correctly predicted the iPod Touch down to the resolution years before it was released. They didn’t know that would also be the form factor of the iPhone.
One company keeping a specific product launch secret is different from three or four independent companies and all of their network engineers all conspiring.
If the US is attacking Iran via cyberwar, it's likely they would want to retaliate. Google's post about fiber cables being cut seems like a good experiment to see the potential impact on our US network. Many companies rely on backbone lines to transfer petabytes of data between data centers around the country; these are our weakest points.
Maybe some submarines are cutting some deep sea cables? In actuality it's probably massive backbone upgrades that are classified; the NSA has to obtain real-time data from somewhere.
Just wondering out loud rather than floating any theories: could it be in any way related to the 6.4 magnitude earthquake and/or the geological events leading up to it, i.e. tremors causing some sensitive/critical servers to misbehave? Also, possibly related or not, Google provides Apple with the infrastructure for iCloud, and they also experienced some downtime in the last few days, along with others.
However, I doubt that it is down to these factors, as there will likely be a significant amount of distributed fault tolerance, failover and contingency plans in place.
Apple runs their own bare-metal services - only some of which are hosted on Google Cloud. You could see which ones went down when Google did the other week.
I'd like to think the simplest explanation to this recent spate of outages is that all the engineers at each org spent too long reading the HN comment thread on the previous company's outage, and didn't notice their own servers catching fire ;)
A little off-topic, but lately I often find myself AirDropping files between my iPad and MacBook because iCloud isn't syncing newly added files for some reason (which can't be forced/refreshed on the iPad as far as I know).
Are there any graphs anywhere showing Internet traffic at different ISPs and networks? Such a graph (especially over a world map) would make it obvious if there was any DDoS, SSH bruteforcing, or other monkey business going on.
I wish there was a historical view. Very curious if this is common, or if it's unusual (which would make it especially unusual with the Slack, Facebook, etc. outages recently).
I'm actually relieved to hear there were a bunch of known problems… earlier today I was wondering why none of my notes and photos were syncing and thought iCloud was losing its marbles. It used to happen a lot to me, but in the last year or so it's been pretty stable. I was afraid sync reliability was regressing!
I suspect Apple's making upgrades as part of the push towards iOS 13 and macOS Catalina, and ran into some rollout glitches.
One of the services listed is Screen Time. Maybe I'm mistaken, but isn't that the feature on iPhones where you can see how much time you spend in each app? Why would that require Apple's servers?
It labels what you're doing during that time, i.e., which apps you're using, then displays an icon next to it, which could explain the web requests since there are so many apps.
My guess is that it's a glorified analytics system for Apple which they retooled into a useful service for end users.
Time zones are ever changing. Maybe they’re looking up a database?
Edit: actually screen time is the feature that measures how long you’re spending on each device each week. I’m not sure why that needs to talk to the cloud unless devs also get the inverse metrics.
In India, I've been experiencing issues with online payments for the last few days, mainly with UPI payments.
Not sure if these are related, but it's hard to ignore.
Tinfoil hat on, conspiracy mode on: and a fire on a Russian submarine whose purpose is said to be disrupting communications (it can access and cut underwater cables).
What I find interesting is that before big news networks covered it, some HN users were linking to the story in the previous outage threads, way before anyone knew that the submarine had cable-cutting capabilities.
At the time, everyone downvoted them because of how unrelated it seemed.
I'm not saying that I subscribe to the theory. But it is interesting nonetheless.
I sent the first three that I saw. I have to admit that in my eyes all those sites have the same lack of credibility. Here's a few more mainstream, I think.
Whether or not something is a conspiracy theory has nothing to do with the person or persons advocating it.
And the mainstream press is really bad at careful logical analysis. If anything, their desire for sensationalism can easily outweigh their concerns over looking foolish. For example, back in the '80s there was the Satanic daycare abuse scare, and it was absolutely pushed by mainstream news outlets.
(That's more of a moral panic than a conspiracy theory, but real conspiracy theories tend to be partisan and I'm trying to avoid that.)
Might as well add the Vice President suddenly being required to stay in Washington DC early on July 2nd, to the list.
Even if it involved the Russians attempting to tamper with undersea communications cables (and US service providers being forced to respond), we'll never get confirmation of it from either side.
I don't know why you're being downvoted. The Russians trying to screw with US communications makes perfect sense as an explanation for such a comically implausible whoops one-after-another result like this.
People here are saying that it wouldn't be possible for all the companies to conspire together to cover that up. Sure, then tell me how it was possible for numerous forced-to-join NSA programs to be covered up for years, all of which involved the participation of huge US service providers. Of course it's not only possible but very easy; major US service providers routinely work with US Government agencies.