Apple iCloud Experiencing Issues (apple.com)
210 points by josho on July 4, 2019 | 141 comments



For those out of the loop, this week has seen outages at a bunch of major US internet companies: Cloudflare, Slack, Google Cloud, Azure, Facebook (including Instagram and WhatsApp), and now iCloud.


Interns join at the beginning of summer. \o/


Now might be around the time they’d start pushing to production, too…


So we will have “eternal July” for internet outages?


There have been a few major Office 365 outages that coincided with the Azure ones; we've taken to calling it Office 364 internally.


At this point it’s more like Office 361


There have also been a few "nines" related jokes, as in:

Hey, nine fives!


https://en.wikipedia.org/wiki/High_availability#Percentage_c...

I was a little confused by that phrase, so this is for anybody else who was as confused as me:


"55.5555555% ("nine fives") "

downtime per year: 162 days
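The arithmetic behind that figure is easy to check. A minimal sketch, assuming a 365-day year; the function name is just for illustration:

```python
def downtime_per_year(availability_percent, days_per_year=365):
    """Return the expected downtime, in days per year, for a given availability."""
    return (1 - availability_percent / 100) * days_per_year

nine_fives = downtime_per_year(55.5555555)
five_nines = downtime_per_year(99.999)

print(f"nine fives: {nine_fives:.0f} days of downtime per year")
print(f"five nines: {five_nines * 24 * 60:.2f} minutes of downtime per year")
```

The same function covers the usual "five nines" target, which works out to only a few minutes per year.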


Have any of them issued postmortems yet? Anything at all to indicate an attack of some sort?


“Cloudflare outage caused by bad software deploy“

https://blog.cloudflare.com/cloudflare-outage/


The one thing we can say with some certainty (except for the first Cloudflare outage about a week ago[1]) is that it's not BGP/routing table fuckery/IP space hijacks/people who don't know MANRS. There are lots of routing table monitoring services, and big ISPs run their own alerting/alarming systems.

[1]: https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-...


this is somewhat common


They needed to connect NSA's new servers... /s


Attackers and state actors sometimes appear to behave indiscriminately


Why is Slack on that list? They use AWS. They aren't an infrastructure provider.


Nor is FB. It is about major outages.


Facebook absolutely is an infrastructure provider for a whole host of products including FB, Instagram, WhatsApp and more.


Sure, you are absolutely right.


They also provide login (authentication) services for a number of sites.


It's probably some back-end access for Five Eyes, and it needs a restart of everything. It's too fishy for it to just be a coincidence.


Could it be that people see an outage, migrate to a new vendor, and increase load?


AMD has been shipping their new Epyc Rome server CPUs to hyperscalers for a few weeks now [1]. It seems to be a very appealing chip that a lot of them are adopting. They won't be the same as whatever they were using before, so it could be that major updates are needed to include them, meaning that it's just a risky time right now?

[1] Second question in https://www.anandtech.com/show/14568/an-interview-with-amds-...


After reading the ZIP bomb article also linked on HN, I have copied the largest file to my iCloud drive for fun.

And an hour later, I'm reading that most of iCloud is down. I really hope this is a funny coincidence. And if not, then I'm terribly sorry.


As far as my memory goes, iCloud features at least 99 partitions. The 99th of which is the "staging" partition where the iCloud employees' data lives.

All partitions are more or less separated and shouldn't affect one another. For that reason, it probably wasn't you that broke the whole system.

Even the Cassandra they use for metadata storage is partitioned and separated.


> The 99th of which is the "staging" partition where the iCloud employees' data lives.

As in, if there's a failure in iCloud code, it'll affect employees who worked on that code before everyone else? Good use of dogfooding!


So, it is fair to say this is just the full stack (endpoints, databases, etc) being sharded consistently?


Yup, correct. The whole stack is consistently sharded. They should be sharing little to no machinery at all and only a handful of people are able to use the "migration" tool to force your account from one partition to the other.

What had struck me as an interesting vector back when I was involved with iCloud was how this ability could be used to move an account from a production partition to the staging partition, as the staging partition has unique debugging and tracing capabilities.

When an employee joins iCloud, their account is often moved from the production partition to the staging partition using this tool. Technically it should be possible to take any iCloud account and move it to this partition and use the slightly modified code in there to gain access to the contents of the account.


Nope, just a typical network hardware failure in a west coast datacenter, which triggered a cascading BGP failure disrupting much of the network connectivity in that DC.


Could that be because of the earthquake?


Can you try again to make sure?


Can you try again a few times? Let's see if it's reproducible. I think you are not at fault :-)


Think they tried to scan it for malware?


upload a different sized one next time, with a new filename


Don't give yourself too much credit


Hey, could you please not create accounts to break the site guidelines with? We're trying for a bit better than this here.

https://news.ycombinator.com/newsguidelines.html


I was being tongue in cheek. Of course I didn’t cause this. I imagine I must have been the million-and-first person to try this.

It’s just a funny coincidence and I enjoyed toying with the thought


Hey, at least it's as likely as some of the weird conspiracy theories being floated around.


Maybe somebody copied the zip bomb onto the battery charge controller subsystem of the russian submarine cable tapping submarine.


I'm going to read all the conspiracy theories on this post.

https://news.ycombinator.com/item?id=20345060 (Facebook, Instagram, and WhatsApp outages)


You know the old adage that celebrities die in threes? It's actually mathematically supported... or, well, it's supported that they die in 2.718s. Same principle would apply to cloud service outages if all the services and their failures were actually independent. We'd expect them to happen in "clusters" of e:

http://ssp.impulsetrain.com/celebrities.html

I still love me a good conspiracy theory, but clustering of random (Poisson) events is much more likely than you'd expect.


That's calculated with the assumption that you can cluster the events within a 1-month period and that the events happen at a fixed rate.

We are talking about a period of a few days here and about only a handful of services that tout 99.999(9)% uptime. I'm no mathematician but I don't think it's a great comparison.


AWS averages about one major outage per year[0].

Google's own status page[1] lists nearly 100 incidents in the past 365 days, only one of which ended up on the HN front page.

[0] https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...

[1] https://status.cloud.google.com/summary


Both of these Google outages made it to the HN frontpage:

https://news.ycombinator.com/item?id=20077421

https://news.ycombinator.com/item?id=20338263

Both of these AWS outages made it to the HN frontpage (and neither is listed in that AWS timeline):

https://news.ycombinator.com/item?id=20298719

https://news.ycombinator.com/item?id=19749781


Good correction: the outages do happen at a fixed rate. Regarding GCP, I only see a couple events that have lasted more than 8 hours, not sure how many of those affect more than one subsystem.


No, the cluster size isn't fixed, nor is the event rate; the clusters are chosen purely based upon what an "unusually long" time between events would be. That's precisely how the rate ends up canceling itself out. Given this formulation, the expected cluster size is always e, regardless of how you define celebrity. That's what makes it fun.

Now the human psychology part that this doesn't cover is that typically when you have two or three A-listers die, well, then you start seeing all the B- C- and D-listers that also died in the same period that you would have otherwise ignored.
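For anyone who wants to check the rate-cancelling claim, here is a minimal simulation sketch. It assumes a cluster ends whenever the gap to the next event exceeds the mean gap (1/rate), which is my reading of the "unusually long" criterion and not something spelled out in this thread; the function name and parameters are illustrative.

```python
import math
import random

def mean_cluster_size(rate, n_events=200_000, seed=0):
    """Simulate a Poisson process and return the mean cluster size.

    Assumption: a cluster ends whenever the gap to the next event is
    longer than the mean gap (1 / rate).
    """
    rng = random.Random(seed)
    threshold = 1.0 / rate
    cluster_sizes = []
    current = 1  # the first event opens the first cluster
    for _ in range(n_events - 1):
        gap = rng.expovariate(rate)  # exponential inter-arrival time
        if gap > threshold:
            # unusually long gap: close the current cluster, start a new one
            cluster_sizes.append(current)
            current = 1
        else:
            current += 1
    cluster_sizes.append(current)
    return sum(cluster_sizes) / len(cluster_sizes)

for rate in (0.1, 1.0, 25.0):
    print(rate, round(mean_cluster_size(rate), 3), "vs e =", round(math.e, 3))
```

Under that assumption each gap is "long" with probability 1/e whatever the rate is, so the cluster size is geometric with mean e, and the simulated mean should land close to 2.718 for every rate tried.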


Can you calculate the chance of the cluster size being >5 when we can cluster events within a one-week period and the rate of events is one every 2 months (debatable, but feels right to me)?

I think that would just end this discussion. I don't know how to calculate that, but my intuition says the resulting chance is low.


I don't know how to do the math, but my Python simulation says zero.

  import random

  service_providers = 20
  servers_threshold = 5
  # probability that one provider stays up all week, assuming 99.9% daily uptime
  up_time_threshold = 0.999**7
  years = 20
  occurrences = 0

  # run sim 10,000 times
  for i in range(10000):
      servers_down = False

      # simulate every week in 20 years
      for j in range(52 * years):
          how_many_down_this_week = 0

          # draw this week's outcome for each service provider
          for s in range(service_providers):
              # a draw above the threshold means the provider went down this week
              if random.random() > up_time_threshold:
                  how_many_down_this_week += 1

          # did more than the threshold go down in the same week?
          if how_many_down_this_week > servers_threshold:
              servers_down = True
              break  # one such week is enough for this run

      if servers_down:
          occurrences += 1

  print(occurrences)
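The same model can also be evaluated exactly instead of by Monte Carlo. A rough sketch, keeping the simulation's assumptions (20 independent providers, 99.9% daily uptime, "more than 5 down in the same week"); the helper name is made up for illustration:

```python
from math import comb

providers = 20
daily_uptime = 0.999
p_down_week = 1 - daily_uptime ** 7  # at least one bad day in a given week
weeks = 52 * 20

def prob_more_than(k, n, p):
    """P(X > k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))

p_week = prob_more_than(5, providers, p_down_week)
p_20_years = 1 - (1 - p_week) ** weeks

print(f"P(>5 providers down in one week) = {p_week:.2e}")
print(f"P(at least one such week in 20 years) = {p_20_years:.2e}")
```

With a per-run probability that small, a count of zero across 10,000 simulated runs is what you would expect to see.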


Given those parameters, you'd expect to see something every 100,000 years or so. (Poisson distribution, lambda=1/8, k=5; the result is about 2e-7 per week.)
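Here is a small sketch of that Poisson calculation, for anyone who wants to play with the parameters. It assumes one event every 8 weeks observed over one-week windows (so lambda = 1/8) and treats weeks as independent, non-overlapping windows; the function names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_tail(k_min, lam, terms=50):
    """P(X >= k_min), summed directly to avoid cancellation in 1 - CDF."""
    return sum(poisson_pmf(k, lam) for k in range(k_min, k_min + terms))

lam = 1 / 8  # one event every ~2 months, observed over a one-week window
p_at_least_5 = poisson_tail(5, lam)
p_more_than_5 = poisson_tail(6, lam)
years_between = 1 / (p_at_least_5 * 52)

print(f"P(>=5 events in a week) = {p_at_least_5:.2e}")
print(f"P(>5 events in a week)  = {p_more_than_5:.2e}")
print(f"roughly one such week every {years_between:,.0f} years")
```

Note that the strict ">5" version asked about above is a couple of orders of magnitude rarer than the "at least 5" version.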


Plus one big outage story then makes other minor outages get more headlines... when normally no one cares.

Same thing with hurricanes, when a bad one happens in the US they are more likely to document other minor hurricanes in the Caribbean and Latin America. But otherwise plenty of hurricanes hit those areas and it doesn't make major headlines.


Facebook, Google, Apple, Cloudflare, Azure, Amazon, WhatsApp, and we have noticed smaller bits of routing weirdness like servers in Los Angeles not being able to hit GitHub for a few hours a few days ago.

This is more like 6-8, not e. It's definitely odd.

One non-conspiracy explanation I can imagine is that maybe all these big providers have a bunch of hidden dependencies on each other.


Apple documents their dependencies on other Cloud providers. Facebook has its own DCs but I think they’re sometimes colocated with cloud providers’.


But the outages are sequential, not simultaneous


They are very dense compared to the usual rate.


I agree, this is a bigger and really dense cluster, but remember that e is just the average (mean) cluster size. There will be clusters bigger than it.

My point is simply that our intuition of randomness biases us to see meaning in clusters when there is none in the first place. In other words, we need to intentionally shift our prior on how unlikely this event is. Yes, it's still unlikely, but it's not as unlikely as you would think.
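To put a rough number on "there will be clusters bigger than it": a small sketch, assuming a cluster keeps growing while each successive gap is shorter than the mean gap, so every gap is independently "short" with probability 1 - 1/e. That definition is an assumption on my part, and the function name is made up.

```python
import math

def prob_cluster_at_least(k):
    """P(cluster size >= k) when each gap is independently 'short'
    with probability 1 - 1/e (geometric cluster sizes with mean e)."""
    p_short = 1 - math.exp(-1)
    return p_short ** (k - 1)

for k in (2, 3, 6, 8):
    print(k, round(prob_cluster_at_least(k), 3))
```

Under that assumption, roughly one cluster in ten has six or more events, so a cluster of 6-8 is unusual but far from unheard of.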


“When it rains, it pours” is the conventional wisdom distillation of Poisson variable behavior



Actually there are not many actual conspiracies listed there.

I will start :

- All the companies in the list are getting rid of their Huawei dependencies, replacing routers by domestic ones.

- China is flexing its muscles in the trade war: look, we are creating some instability in your e-commerce. Wait until we attack your infra (electricity, water).


If you go down the route of conspiracy theory you have to also believe that the companies who have already issued a PR response are lying and that all of the employees privy to the truth are also staying silent.

That would be quite a story.


I agree, Snowden was a pretty big story exposing the tech companies for working with the NSA in the PRISM program after previously denying it, including Apple. https://en.wikipedia.org/wiki/PRISM_(surveillance_program)


Amazing. It's 2019, and you still don't understand what PRISM is. Hint: at no place in the system diagram does it show any NSA system interacting with any of those tech companies' systems.


The iPhone and the atomic bomb were kept secret quite well.


I don’t think there has been a case of an atomic bomb engineer leaving a prototype at a bar, though.


How would we know? Information (esp. sensitive information) was a lot easier to hide back then.

Also, the iPhone that leaked was an iterative model, not the original.


I think we know because an atomic bomb prototype would have been far too large and conspicuous for an individual to take to, let alone leave at, a bar :)


Not to mention the bar would’ve been in Los Alamos, a town run by the military.


Also there were rumors of the iPhone for yeeeeaaaars. I remember people jokingly referring to it as "the jesus phone" while speculating about it.


There were rumors of the iPhone for a couple of years before it was released.

Even funnier, there were quite accurate rumors based on patents of the “true video iPod” since 2003 - a 3.5 inch all touch screen iPod. The rumor sites correctly predicted the iPod Touch down to the resolution years before it was released. They didn’t know that would also be the form factor of the iPhone.


One company keeping a specific product launch secret is different from three or four independent companies and all of their network engineers all conspiring.


Sure, but the Manhattan project was a huge undertaking with multiple extremely large facilities.

What if the NSA identified a critical backdoor and ordered all major companies to decommission and not talk about it? (NSLs)


Fair enough... But then they all just did a really crappy job of concealing the patch?


Note I said decommission, not patch. And aren't most of the outages of this period routing related?


Why would you assume it to be so malicious, if true?

Maybe they were under attack. Maybe they were just removing all of the Bloomberg rice grain chips! :)

There's an infinite number of "maybe"s though. Far more than there are truths.

Your own bias about the entities shapes your conspiracy theory.


From the public, yes. From other nations, no.


Let’s not forget the existence of ayyyy lmao.


If the US is attacking Iran via cyberwar, it's likely they would want to retaliate. Google's post about fiber cables being cut seems like a good experiment to see the potential impact on the US network. Many companies rely on backbone lines to transfer petabytes of data between data centers around the country; these are our weakest points.


Maybe some submarines are cutting some deep sea cables? In actuality it's probably massive backbone upgrades that are classified, the NSA has to obtain real-time data from somewhere.


Why would the NSA be doing such low-level (literal wiretapping) work anymore? Useful traffic is almost always encrypted.


Knowing who is communicating with whom is pretty useful, even if you can't decrypt the data.


Clearly this means they've broken public-key cryptography.


Flaws could be discovered in modern communication such that they can decrypt it in 30 years' time. How good is 30-year-old crypto today?


They have a #ailored #ccess #perations department. Look it up.


You sure are playing it safe there.


They log who talks to whom.


Is this a joke about the Russian communications sub fire?


Once is happenstance. Twice is coincidence. The third time it’s enemy action.

I don't even know if the fourth or fifth is confirmation.


Dude, I wrote that top comment. And I was like iCloud WTF?


Just wondering out loud rather than floating any theories: could it be in any way related to the 6.4 magnitude earthquake and/or the geological events leading up to it? Also, related or not: Google provides Apple with infrastructure for iCloud, and they also experienced some downtime in the last few days, along with others, i.e. tremors causing some sensitive/critical servers to misbehave.

However, I doubt that it is down to these factors, as there will likely be significant amount of distributed fault tolerance, failover and contingency plans in place.

https://earthquake.usgs.gov/earthquakes/eventpage/ci38443183...



It's definitely the end times. Locusts coming anytime now.


The outages occurred long before the earthquake but my mind originally wandered to the same place.

What earthquake related causes could have had effects 24h or more ahead of time?


Apple runs their own bare-metal services - only some of which are hosted on Google Cloud. You could see which ones went down when Google did the other week.


Do you have a link? I tried to find out which GCP regions/zones affected were related to Apple, without any luck.

edit: Google Cloud Networking is indicating disruption at present: https://status.cloud.google.com/


So, when is Amazon / AWS outage day? Tomorrow?


How big of an outage do you want? Here's one from 7 days ago: https://news.ycombinator.com/item?id=20298719


If there's any credibility behind the notion of these being deliberate in any way, I'd wager July 15/16.


Why those specific days?


Amazon Prime Day(s)


Amazon prime day


Prime day is coming :/


I'd like to think the simplest explanation to this recent spate of outages is that all the engineers at each org spent too long reading the HN comment thread on the previous company's outage, and didn't notice their own servers catching fire ;)


Imagine all the ops folks gloating earlier in the week (cough CloudFlare) having to eat crow mere days later.


All the ops folks I know wouldn’t gloat at something like that. They’ve been through it themselves.


Imagine being on call this week. My best wishes to anyone doing DevOps or Ops.


Any internet engineers privy to what's going on the last few days, at a global infrastructure level?


Senior people at the orgs are starting to go on summer vacation. ;)


Actually, several members of the iCloud team were at SIGMOD 2019 in Amsterdam this week.


Wait, isn't it when the senior people leave that real work gets done?


No, that's when managers and bosses leave.


That honestly seems like the most likely answer.


And summer interns are there.


Coincidence.


Wait... so you're not part of a massive global conspiracy? :(


Here is how ThousandEyes viewed the impact: https://twitter.com/thousandeyes/status/1146862826566250499


A little off-topic, but lately I often find myself AirDropping files between my iPad and MacBook because iCloud isn't syncing newly added files for some reason (which can't be forced/refreshed on the iPad as far as I know).


Are there any graphs anywhere showing Internet traffic at different ISPs and networks? Such a graph (especially over a world map) would make it obvious if there was any DDoS, SSH bruteforcing, or other monkey business going on.



Chicago IX posts their aggregate data publicly.


I wish there was a historical view. Very curious if this is common, or if it's unusual (which would make it especially unusual with the Slack, Facebook, etc. outages recently).


The page is on archive.org if you want to write a script to cobble together some historical data.
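A rough starting point for such a script, using the Wayback Machine's CDX search endpoint to list captures of a page. The helper names are made up, and the page URL you pass in is up to you (I don't know the exact stats page address, so none is hard-coded here); the actual fetching and scraping of each snapshot is left out.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url, year_from, year_to):
    """Build a CDX query listing captures of page_url, at most one per day."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(year_from),
        "to": str(year_to),
        "collapse": "timestamp:8",  # collapse captures sharing the same YYYYMMDD
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def snapshot_urls(cdx_json_text):
    """Turn a CDX JSON response (header row + data rows) into snapshot URLs."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in data]
```

Fetch the URL returned by `cdx_query_url` (for example with `urllib.request.urlopen`), pass the response body to `snapshot_urls`, and then pull the traffic numbers out of each archived copy.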


I'm actually relieved to hear there were a bunch of known problems…earlier today I was wondering why none of my notes and photos were syncing and thought iCloud was losing its marbles. It used to happen a lot to me, but in the last year or so it's been pretty stable. I was afraid sync reliability was regressing!

I suspect Apple's making upgrades as part of the push towards iOS 13 and macOS Catalina, and ran into some rollout glitches.


One of the services listed is Screen Time. Maybe I'm mistaken, but isn't that the feature on iPhones where you can see how much time you spend in each app? Why would that require Apple's servers?


It labels what you're doing during that time, i.e., which apps you're using, then displays an icon next to it, which could explain the web requests since there are so many apps.

My guess is that it's a glorified analytics system for Apple which they retooled into a useful service for end users.


I believe it uses iCloud/CloudKit to sync data from all devices associated with your account.


Cross device statistics.


Time zones are ever changing. Maybe they’re looking up a database?

Edit: actually screen time is the feature that measures how long you’re spending on each device each week. I’m not sure why that needs to talk to the cloud unless devs also get the inverse metrics.


It syncs across devices, presumably that mechanism is unavailable during this outage?



In India, I've been experiencing issues with online payments for the last few days, mostly with UPI payments. Not sure if these are related, but it's hard to ignore.


I was noticing some issues with mega sync earlier, maybe these orgs are sharing infrastructure at some level... I guess I won't rage quit just yet


And I've been wondering why my kid could play games so long today, in spite of Screen Time limits :-) I guess that explains it.


iOS Walkie Talkie has been down since Wednesday


Holy crap this has been a turbulent couple weeks for internet companies and services, eh?


These things are so much more painful to deal with when it's July 4th.


Tinfoil hat on, conspiracy mode on: and a fire on a Russian submarine whose purpose is said to be disrupting communications (it can access and cut underwater cables).


Is it that much of an internet conspiracy theory when mainstream media reports on it?

https://futurism.com/russian-sub-fire-internet-cables

https://www.theregister.co.uk/2019/07/02/russian_sub_disaste...

https://www.businessinsider.com/russia-submarine-losharik-un...

Etc.

What I find interesting is that before the big news networks covered it, some HN users were linking to the story in the previous outage threads, way before anyone knew that the submarine had cable-cutting capabilities.

At the time, everyone downvoted them because of how unrelated it seemed.

I'm not saying that I subscribe to the theory. But it is interesting nonetheless.


> Is it that much of an internet conspiracy theory when mainstream media reports on it?

> https://futurism.com/...

> https://www.theregister.co.uk/...

> https://www.businessinsider.com/...

Still waiting for some "mainstream media" links.


I sent the first three that I saw. I have to admit that in my eyes all those sites have the same lack of credibility. Here's a few more mainstream, I think.

"Analysts suggested that one of its possible missions could be disrupting communication cables on the seabed." https://time.com/5619197/russia-navy-submarine-fire-deaths/

"Western military experts have suggested it is capable of probing and possibly even severing undersea communications cables." https://www.cbc.ca/news/world/putin-russia-submarine-fire-nu...

"Analysts suggested that one of its possible missions could be disrupting communication cables on the ocean bed." https://www.cnbc.com/2019/07/02/sailors-killed-in-russia-sub...

"Russian ships may be using underwater cables to spy" https://nypost.com/2018/03/30/russian-ships-may-be-using-und...

etc.


Whether or not something is a conspiracy theory has nothing to do with the person or persons advocating it.

And the mainstream press is really bad at careful logical analysis. If anything, their desire for sensationalism can easily outweigh their concerns over looking foolish. For example, back in the '80s there was the Satanic daycare abuse scare, and it was absolutely pushed by mainstream news outlets.

(That's more of a moral panic than a conspiracy theory, but real conspiracy theories tend to be partisan and I'm trying to avoid that.)


Might as well add the Vice President suddenly being required to stay in Washington DC early on July 2nd, to the list.

Even if it involved the Russians attempting to tamper with underseas communications cables (and US service providers being forced to respond), we'll never get confirmation of it from either side.

I don't know why you're being downvoted. The Russians trying to screw with US communications makes perfect sense as an explanation for such a comically implausible whoops one-after-another result like this.

People here are saying that it wouldn't be possible for all the companies to conspire together to cover that up. Sure, tell me how it wasn't possible for numerous forced-to-join NSA programs to be covered up for years, all of which involved the participation of huge US service providers. Of course it's not only possible but very easy; major US service providers routinely work with US Government agencies.


What happened to 5 nines uptime?



