Facebook, Instagram go down around the world in an apparent outage (usatoday.com)
610 points by ajiang on March 13, 2019 | 331 comments



Can you imagine if Twitter and Google went down at the same time?

People would be reactivating their Facebook accounts and having to sift through conspiracy theory posts about Hillary Clinton still just to figure out what was going on.

Edit: The points on this post keep going up and down every time I check these comments. Yes, it was sarcasm, I was joking, but I was trying to point out that most people rely on a small set of services. "Cloud" has centralized things a lot.


Whenever I hear that some service is down, I immediately go to that service to confirm. If it doesn't work, I repeatedly hit reload to see if it comes back up. I guess many people do the same, and that may contribute to the problem...


Gmail's exponential back-off, with its visible countdown to the next retry, is a nice idea. It probably reduces that compounding wave of customer reloads.
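A minimal sketch of the idea in Python (hypothetical client code, not Gmail's actual implementation; fetch() stands in for whatever request keeps failing):

    import random
    import time

    def fetch_with_backoff(fetch, max_attempts=6, base=1.0, cap=60.0):
        # Retry `fetch` with capped exponential backoff plus jitter,
        # showing a visible countdown before each retry.
        for attempt in range(max_attempts):
            try:
                return fetch()
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
                # Delays grow 1s, 2s, 4s, ... up to the cap; random jitter
                # keeps clients from retrying in lockstep and re-hammering
                # the recovering service.
                delay = random.uniform(0, min(cap, base * 2 ** attempt))
                for remaining in range(int(delay), 0, -1):
                    print(f"Retrying in {remaining}s...", end="\r", flush=True)
                    time.sleep(1)

The jitter is the part that matters for the stampede: without it, every client that failed at the same moment retries at the same moment too.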


Years ago I worked at a large online casual gaming company whose name ended in -ynga. Our web tier was split in two: one for serving the static content required to load the HTML, Flash app, assets, etc.; the other for actual communication regarding actions taken in game.

Whenever we had any sort of issue we could generally get a good idea of what was happening by looking at changes in traffic in those two web tiers.

If people couldn't play for most reasons, game action traffic would drop to near zero, but the static asset tier traffic would usually at least triple.

So yeah, there are a lot of F5 buttons being hit out there when pages don't load.


Gmail and other Google products went down last night. Close though. Thankfully not on Twitter or FB.


I don't believe Gmail was ever fully down. For me, I was just having problems with attachments. I also noticed app icons in the play store failing to load.


I experienced issues with Drive last night but it was never fully down. I was trying to work on a 500GB file and the API would drop the link intermittently. I could parse the directory no problem, just couldn't reliably access the files.


Dir layout is likely stored in a different system than large files.


^ this.

It was not fully down, there were issues with their blobstore.


Didn't they just announce shutting down Flickr or something? Plus they could be decommissioning Google+ stuff. Maybe this is related. Just a guess.


Flickr is unrelated to Google and Facebook. Flickr is owned by SmugMug, previously by Yahoo, previously by Ludicorp.


You may be thinking of Picasa, which had its EOL announced in 2016, with the web API due to be discontinued 15 March 2019 (tomorrow).

Google+ APIs were shut down on 9 March 2019.


My company's tech support sent us an email to tell us our email was down. Fun times we live in.


Back in my tech support days I received an email from a customer: "I am unable to send or receive emails." I replied: "I am very sorry for your inconvenience; I have resolved the issue."

Customers in 1999 really couldn't believe no one had replied to their emails within a day or two.


People would be calling emergency hotlines..?

Wait, people are doing this already: https://twitter.com/SA_SES/status/1105969450698694656


I'm more worried about all of it going down at the same time.


When it all goes down at the same time you should be worried. Not because of the lack of Twitter or FB or Gmail, but for what it means if it's all down.


[flagged]


It was a joke based on what people often say about FB. I don't use their platforms.


Totally agree.


It really goes to show how different all of our feeds are based on who we are friends with (and, of those, who we interact with the most). Anecdotally, I have three people who will share any "conservative" attack meme they can get their hands on. (At this point I don't even know if it counts as conservative so much as just outright attacking Democrats; sometimes it reads more like attacking for the sake of attacking than a statement of belief in something different. Part of me wonders if some of these accounts do conservative attacks and some do liberal attacks in an attempt to get shares, with no political interest whatsoever: effectively acting as arms dealers of the meme variety.) Of my friends of a more liberal view, it is mostly policy things they share (pro-choice, anti-rape, etc.). There's one dude who is pretty anti-Trump, but he mostly stands out as an exception. Most of the posts I see referencing Trump directly (outside of periods where he did something that Democrats felt was highly suspect) are more in support of him than anything.

Which is why I have to refrain from taking an "over the top much?" slant when people post the pro-Trump/victim of the left type memes as I don't see a ton of attacks on him directly, but then again, their feed could be totally different from mine so who knows?


I've got a few people in my feed who just dump on the right no matter what. I've fact-checked them a little bit. It runs about 2-to-1 between my right-leaning and left-leaning friends (typically older) when I fact-check them. But I keep all of them on there just to have an ear to the ground.


Yeah, I appreciate the various viewpoints, and items shared in my feed from both conservative and liberal viewpoints have been wrong (almost consistently so).

I think the real bummer is when you present the actual video of what the person says, and the response is essentially “yeah? Well, they still suck” or something in that ballpark. I have zero issue with someone not liking another person’s views, but a lot of it is just outright libel.


> Can you imagine if Twitter and Google went down at the same time?

Google, sure, but who in the real world cares about Twitter?

Twitter could be down for days and only the technocracy would notice.


Twitter is where we complain that Google is down


The US president?

It also seems popular with journalists and media companies (e.g. TV shows asking viewers to "tweet us your questions")


Could this be related to the storm?

I was out shoveling, and came back in to my phone blowing up. Our systems at IronMountain (formerly Fortrust) in Denver all rebooted at once. These are all on redundant power, each system's redundant power supplies connecting to different circuits entering the cabinet, and those two circuits fed from 3 PDUs (two separate, one shared). Each of those is supposed to be fed by a separate UPS and generator. The last status update I had says they are running off generators, but they've been shockingly tight-lipped about it.

Don't get me wrong, it was hi-LAR-ious to call into their NOC and have them pretend that I was the only one having problems. "Can you tell me if there is a major data center outage going on?" "We are trying to gather information, we are making a bunch of client phone calls, we will know after we make those calls." "... Why are you making a bunch of client calls if you aren't having an outage?"


So, a storm in Denver stops me from using Messenger in Estonia? I wonder where the butterfly flapped its wings.


Pretty sure it doesn't apply to Facebook, but Amazon's cheapest AWS tiers are around there. Same with Virginia.


But AWS doesn’t have any regions near Denver.


Wrong, AWS has no regions near Denver.



No, FB datacenters are geographically diverse.

They do run quarterly 'storms' where a datacenter is shut down to test failover and resiliency. I have no idea if today is one of those days, since I left last year.


Theoretically, a real shutdown might go differently than previous tests or simulations. For instance, in a test you might cut the connection completely, while in the real case only some power circuits go down or whatever.

For instance, GitHub's relatively recent outage was due to a failover heartbeat not going as expected.


Test failures are all well and good, but don't always match reality. In this case, the design of the power infrastructure was solid, and their plans include running monthly generator testing and quarterly "disconnect from the grid" testing. But apparently something about this failure of both of the incoming power lines caused failures in multiple UPSes. Still waiting on the after-action review.


Interesting. Out of curiosity, how hard is it to turn the datacenter "back on" in case they discover there's a problem with the failover?


Typically that could be done with one undrain command and take about 30 seconds.


[flagged]


it's only so quick because stuff isn't actually turned off with disks wiped. The machines are still running, with applications loaded, just with no traffic directed towards them.
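A toy sketch of that drain/undrain idea (purely illustrative; the class names and round-robin picker here are made up, and Facebook's real traffic management is of course far more elaborate):

    import itertools

    class Backend:
        def __init__(self, name):
            self.name = name
            self.drained = False  # process keeps running; it just gets no traffic

    class LoadBalancer:
        def __init__(self, backends):
            self.backends = backends
            self._cycle = itertools.cycle(backends)

        def drain(self, name):
            # Stop routing to a backend without shutting it down.
            for b in self.backends:
                if b.name == name:
                    b.drained = True

        def undrain(self, name):
            # Re-admit the backend; this flag flip is why "recovery"
            # can take seconds rather than a full cold start.
            for b in self.backends:
                if b.name == name:
                    b.drained = False

        def pick(self):
            # Round-robin over backends currently taking traffic.
            for _ in range(len(self.backends)):
                b = next(self._cycle)
                if not b.drained:
                    return b
            raise RuntimeError("all backends drained")

Since the machines stay hot with applications loaded, "undrain" is just routing state, which is consistent with the 30-second figure mentioned above.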


It is an internal software problem.


> "We are trying to gather information, we are making a bunch of client phone calls, we will know after we make those calls."

I think that is a yes, and he's getting ahead of it by saying "Yes, and we have no idea why or an ETA, so let us do our job".

Granted, they should have a status page.


Last time I dealt with a cloud provider outage the status page was unresponsive during the outage because the status page had some kind of dependency on the resources that were down...


I remember that; the status page for AWS stored the red icon on S3, which was the service that was down.


Actually, in this case it was Azure-related, but it's pretty funny that the same thing has happened to the two biggest cloud platforms.


Sure, I understand the "so let us do our job". I've been on the other side of that.

On the other hand, I need information to be able to do my job: Is this only our cabinet having problems and I need to start rolling to the datacenter (in the middle of a giant blizzard)? Is this possibly some sort of problem with our own power infrastructure? Is something on fire (an EPO triggered by fire could cause this)? Did the roof cave in under the weight of the snow we are getting? Is the power stabilized or is there some indication that power might be up and down?

In short, I need answers to: Do I need to gracefully take down my site to prevent lost transactions and database corruption? Do I need to switch to our backup site?

For context: All of our servers powering off at once and then back on shouldn't be possible. It should require the failure of at least 3 independent pieces of equipment (except at the breaker panel or in our cabinet where it could be only two failures). It is extremely unusual for this to happen, first time it's happened for me and I've been in that facility since 2004.

So, yes, I respect that you need to do your job. But I also need to do my job.

Plus, I'm pretty sure the guy answering the trouble line, his job WAS talking with the customers. The people working the problem likely didn't include him. This is a huge data center run by a ginormous company. I don't think I was taking him away from twisting a wrench. :-)


I wouldn't be surprised if they think a status page would open up liability for not putting it up soon enough, or for too long, or for some text that turned out to be wrong or unnecessary.


"The storm"? It's sunny in the Bay Area for the first time in I don't know how long. I imagine it's nice in other parts of the world as well, other than where this localized "storm" is.


Denver is getting slammed right now — power surges everywhere.


You missed the bomb cyclone that's happening right now in half the country?


Yes. Believe it or not, it's not really major news for those not impacted. We have the local evening news on most nights in the background, haven't heard a thing about it. I also regularly read NYT and WaPo and follow the Internets.


It's sad that such outlets don't bother to report on stuff that impacts middle America. They could regain some credibility with a lot of said middle Americans, without ever changing their political alignment, just by giving them a little more coverage.


Colorado weather: Bomb cyclone brings wild winds, big impact as blizzard whips state

https://www.denverpost.com/2019/03/13/colorado-weather-bomb-...


My first reaction to "could this be related to the storm," was "oh no, now this QAnon stuff has spread to HN."


Do they do this to get around any 99.9% uptime agreements?


Probably a combination of that and to curtail the "I just spoke to Brad in Customer Service who confirmed _the whole datacenter is offline_" type posts.

But that's my presumption, I don't actually know anything and don't want to imply I do.


It's easy to be cynical but it's optimistic expectation management.

It might resolve itself; it has to get worse before you escalate it further. They might not know the full facts. It might look worse than it really is. How do you know? You can't judge that just because your personal rendering of Facebook failed. You have load balancers and CDNs and A/B testers all getting in the way of delivering data to your machine.

It's too easy to draw a conclusion from the client-side armchair and the provider is absolutely not going to make false promises, for the worse or for the better.

You want to hope that Facebook, in this case, acts on more complete information.


0.1% is still almost an hour per month (0.001 × 30 days × 24 hours ≈ 43 minutes).


That's the trust issue with current agreements that we are solving. If an API is down, the bound agreement is enforced instantly on our platform: no lies, no call, no pain. We are actually onboarding companies to try it out! https://stacktical.com


Why does this need a blockchain?


TL;DR: Because Smart Contracts on the blockchain are the right tool for secure digital agreements.

Paperweight contracts are irrelevant in a world of data.

* A Smart Contract is cheaper to publish than the stack of paper handled by lawyers.

* Code is cheap to iterate on, whereas traditional SLAs are expensive and slow to renegotiate. Over time, SLAs drive behaviors that are focused on delivering a minimum level of service at minimum cost to the provider.

* A Smart Contract is code you can trust, understand, and expect to behave instantly, compared to the traditional SLM.

* A Smart Contract is immutable. More about the benefits of Smart Contracts over paper/digital agreements: https://www.forbes.com/sites/cognitiveworld/2019/03/10/rise-...

Why not a Python script? Because I don't trust the guy that handles that script. Blockchain is very good at bringing trust.

I hope I answered your question.

[edit] formatting


so, are you saying we should replace social media platforms w/ decentralized sharing & aggregation driven by smart contracts? sounds intriguing but daunting


Even with your platform there will be calls and pain, maybe even lies during an outage. BC-Recordkeeping or not


Unscheduled outages are always painful and people will always call, I agree. But instant compensation does a better job at damage control than a status page. Keeping customers satisfied even in bad situations is key in a world of high-availability expectations. And with distributed, non-partisan metric sourcing on the availability of an API, it's no longer possible for a Service Provider to lie.

Feel free to give that whitepaper a look


I don’t understand something: what kind of company is so down to the wire with cash flow that an outage requires income within seconds/minutes instead of weeks? Anyone with a financial runway so short that it can be described as “instantaneous” doesn’t sound like a customer you would want to be in business with.


> what kind of company

The kind that will make a lot of noise as publicly as possible and create ample work for your support/admin people if you don't keep them happy...

> doesn’t sound like a customer you would want to be in business with

I could say that about most of the companies I have had the dubious pleasure of doing business with! Very few are pleasant when something goes awry even for a moment.


Classic response for any kind of service provider:

Deny, deny, deny, obfuscate, deny, then blame someone else (usually, YOU).


A company as large and sophisticated as FB has data centers and cloud services in multiple countries, and in the US, probably colocated data centers. Certainly nothing localized to where you are.


If the outage is at all infrastructure-related, the root cause was something that at some point was local and cascaded. Unless someone git pushed to a repo used by both companies and it's taken all day to get it git revert'd, their redundancy obviously didn't work, did it? There's effectively a category-2 hurricane moving from the Rockies through the mid-west right now.


Awesome, we use Iron Mountain for escrow.


“Outage, what outage?” is a sort of laughable response, but all too common with tons of providers.


Reminds me when I was contacting Deutsche Telekom last year regarding an outage in Monschau area. "We have no problems". In fact, the whole exchange was down, press got a sniff of it when people could not contact emergency services anymore: https://www.aachener-nachrichten.de/lokales/eifel/netzstoeru...


I'm looking at you, AWS...


A nuclear war could take out an AWS region and we'd get an "increased error rates" informational message.


Are you saying that a cold-war-era system like the internet/arpanet meant to survive a nuclear war might be vulnerable to an attack if we take all the code and data and store it in the same place?

:-)


Cloud is not much different from mainframes; it's Mainframe Computing 2.0.

I'm curious, and have been pondering for quite a while, about what/how/when we will see a disruption of this.


Limewire is gonna make a comeback


Once upon a time Multics' developers predicted that someday computing power would be treated like a public utility that homes and businesses would buy like electricity or water or natural gas.

Given the proliferation of minute/hourly billing among service providers, it looks like the Multics folks guessed right. It just happened on top of Unix(-like systems) instead of Multics.

I wonder how long it'll be before we start seeing municipal datacenters?


No, he's saying that AWS sucks at updating their status page.


Remember when S3 failed and the S3 green arrow was being cached on S3? I remember... unfortunately.


Man, that reminds me of that CenturyLink cluster-f.


I'm interviewing for a Production Engineer role at Facebook on Monday, thanks for providing relevant "do you have any questions for us" content.


A good question is why, oh why, switch WhatsApp to Facebook tech when it was running perfectly OK on its own. It never crashed.


So that engineers can be moved between product groups while carrying relevant knowledge and experience with them.


I'm pretty sure a bunch of Erlang programmers being shoehorned into a PHP codebase is the literal definition of Hell for all parties involved.


Facebook is not a “PHP codebase.” I’d guess that fewer than 0.01% of CPU cycles at Facebook are used by PHP.


Isn’t a significant amount of the codebase in Hack, a PHP derivative?


C++ is the most popular language at Facebook, I know that one for sure. They used to run PHP on HipHop VM which was written in C++, but now they transpile PHP to C++.


The transpiler is HipHop. They discontinued that in favor of HHVM, which does JIT compilation instead. More info: https://hhvm.com/

EDIT: apparently, though, HHVM stopped supporting PHP itself last month; now it only supports Hack. I'm not familiar enough with Hack to know how much it actually deviates from / improves upon PHP.


If that’s the goal, why not start incorporating more Erlang into the rest of Facebook instead? It proved its mettle at WhatsApp.


Facebook has to hire thousands of engineers per year. They may incorporate more Erlang into Facebook, but they have to have a core tech stack that can easily onboard engineers from a variety of backgrounds. I don't have the foggiest idea of whether Erlang can be part of that or not, but people talk about it as if it's a special-purpose tool.


Facebook was actually using Erlang way before WhatsApp

https://www.quora.com/Why-was-Erlang-chosen-for-use-in-Faceb...


Speaking as an Erlang developer I approve of this question.


The answer should be clear at this point. WhatsApp provided too much privacy and not enough monetisation opportunities.


Good luck! If you’re interviewing in MPK make sure your recruiter takes you to the barbecue shack for lunch


But be prepared to stand in line for longer than it takes you to actually eat.


#worthit (although I’ll admit my afternoon interviews were a bit... challenging... after 5lbs of brisket)


When I interviewed for SRE at Google, they'd had a non-trivial cross product outage days before. Good conversation starter, but I couldn't get many details out of them.


You mean this week, when they took out most of their cloud and public products for a few hours? /s


You think they might have a couple more vacancies going now? :P


Funnily enough, same for me too! New Grad this Friday. Good luck!


I’ve seen many systems go down over the last few days worldwide. Aside from the possibility of a mega-DDoS attack (which Facebook denies), all of these organizations have fairly diverse tech stacks to my knowledge. Google’s issue (supposedly) had to do with their Blobstore API, we don’t know what happened with Facebook, and many other, smaller services have had issues as well, including three intranet services at my workplace.

This leaves me wondering what software all these places have in common. The application layers are all different, the databases are all different, the containerization and provisioning systems are different, but I imagine that all these systems rely on two things: the global Internet backbone, and maybe the Linux kernel.

Have there been major security vulnerabilities patched lately in the Linux kernel that could have had unintended consequences?


Both companies are massive and have tons of developers. It becomes almost impossible to look at the system as a whole with the amount of changes coming through. And you get scenarios where small failures cascade through the stack, wreaking havoc. Oftentimes it's just one config change.

It's telling that one of the hottest areas of distributed systems research these days is the boring topic of configuration management. Google, Microsoft, etc. are paying researchers top dollar to figure out how to prevent massive outages through novel techniques. It is one of the harder problems to solve and requires massive investment in tooling, refactoring, etc.


You’re undeniably right about not looking at Facebook or Google as one whole system, but there have also been what seems like an unprecedented number of strange little outages (see the ones mentioned by https://news.ycombinator.com/item?id=19382418) at places that aren’t huge companies. My workplace had some of its own today that I haven’t heard an incident report about (it’s a pretty large company and I’m not in IT).


>>Google, Microsoft, etc are paying researchers top dollar to figure out how to prevent massive outages through novel techniques

Curious what makes you think this. Are there specific job postings in either company that are focused on this?


I work for Microsoft, I know of at least CrystalNet [1].

[1] - https://www.microsoft.com/en-us/research/blog/eliminating-ne...


The best explanation is coincidence, I think. I have direct knowledge of two of the incidents in the past few weeks, and they have completely unrelated causes.

Sometimes you just get unlucky!


"Once is happenstance. Twice is coincidence. The third time it’s enemy action."

-- Ian Fleming (in Goldfinger)


If it can happen, it will happen.


That’s certainly possible. We’re probably still too early to tell, but the innate conspiracy theorist slash pattern-matching part of my brain wants to find a probable connection.


Maybe we're in the first week of a rogue AI's hard takeoff. ;)


>all of these organizations have fairly diverse tech stacks to my knowledge

>This leaves me wondering what software all these places have in common.

dunno what systems you're talking about, but seems likely they are mostly x86 systems and maybe even mostly using Intel hardware and microcode

those systems can be (and are) rooted, and more, to my knowledge


It could also just be coincidence.


> This leaves me wondering what software all these places have in common.

Cisco or Arista


FANGs all use white-box hardware with “merchant silicon”, meaning they buy the chipsets directly from Broadcom, Mellanox, etc. and build their own devices. However, they do all have Broadcom and Mellanox in common, and Cisco, Juniper, and Arista do too.


Some FANGs definitely use Arista/Cisco. As far as I know white box hardware is mostly for top of the rack switches (as opposed to backbone infra).


Depends on which backbone infra :-) Take a look at the B4 and Jupiter papers from Google, for example.


Facebook's own status dashboard (https://developers.facebook.com/status/dashboard/) showed no issues or outage just 30 min ago.

I run a messenger bot platform - the webhooks stopped being delivered _hours_ ago... nothing on their status page until it had been down for hours.

Their current issue...

"We are currently experiencing issues that may cause some API requests to take longer or fail unexpectedly. We are investigating the issue and working on a resolution."

What? lmao


I'm pretty sure businesses use status pages to divert attention away from support resources; they never seem to give useful information about outages, and half the time they don't even mention the outage.


that page is down now as well


you get what you pay for


If you're implying that Facebook is "free," I'd strongly disagree. Data is the currency of today, and they're swimming in it.


fb platform is free


Facebook is a paid advertising platform. There are plenty of people not getting what they paid for as a result of an outage.


OK, but you are all talking about the users; the Facebook platform is free to use for developers.

As for the advertisers, I doubt they'll be charged for impressions that didn't happen.


But for someone who runs a business that relies on Instagram for marketing, and pays for advertising on that platform, it's a bit scary when the whole thing is down. Obviously this was only temporary this morning, and sure, no charges while down, but doesn't do me much good...



I can't even reach that status page right now...


when I click I see a blank screen. why


It looks like something much larger is going on. If you look at the front page of https://downdetector.com/ you'll see most major sites/backbones are having issues (Verizon/ATT/Sprint/CenturyLink/TMobile/Comcast/Level3/etc).


That site relies on user reports.

I strongly suspect users are reporting "my Internet is having troubles" because their FB, Messenger, etc. isn't working right.

For example, in the comments of the T-Mobile outage page, there's stuff like "Haven't been able to upload anything to social media all day" and "Cannot send pictures through whatsapp and fb messenger".


That's true, but it is a good indicator. Here's a better map from Akamai: https://www.akamai.com/us/en/resources/visualizing-akamai/re...

Also, check out the "Attacks" tab. That one really lights up. Like seriously lights up. Something is going on... all over. US, China, Russia, EU...


The attacks tab says the current count is "100% Below normal". It does not support what you think it supports.

It "lights up" because there are always attacks happening.


Look at the scale. The blue area is 100% below normal. The red areas are way above normal.


That doesn't make any sense, given that the "traffic" tab's scale says "7% above normal".

The red are the areas with the most attacks, and as you'd expect, they correspond to large population centers. (It's also not very granular, and appears to largely correspond to "where does Akamai have a datacenter".)


It’s a poorly designed graphic that’s hard to get any real info from.



That chart is a marketing tool, nothing more.


Yeah, in Turkey people thought the government was slowing down or blocking the whole internet.


To be fair, the priors lean into that rather than Facebook and Whatsapp having troubles.


Yes, I had that exact problem yesterday. I couldn't upload pictures/videos/gifs on WhatsApp and Instagram.

I thought maybe my ISP had blocked a port these services may be transferring their multimedia on.

/Sweden


That's a fair point, but those folks probably aren't reporting AWS problems, and it's showing a spike too.


AWS is reporting an EC2 outage ("increased API error rates and launch failures in the US-WEST-2 Region") in Oregon currently.


I believe that too but some of the services seem unrelated (like Flickr and Capital One).

Now, another interpretation is that the reports are simply false...


Ironically https://outage.report/ is down too.

Edit: it's back now (8:37 PM UTC)


HN was down for a minute.


So yesterday Google had a major (and out of character) outage across its apps, and today Facebook has a major (and also out of character) outage across its apps.

I can't wait to see the RCA for both of these and if they're related.


Private post mortem: The NSA middleware we are required to run (that took time to deploy to each of our social partners) is breaking something, so let's revert.

Public post mortem:

Entirely believable technical cause.


Alternate Post Mortem: Cyber WWIII's first public skirmishes become visible...

(Ignore Stuxnet, Ignore DUQU)


/me puts on tin foil hat

It's the rice grain implants, man


yeah I wonder


I imagine the NSA uses an optical tap device. These devices create identical copies and require no power or management.


That's why it's called PRISM. It's exactly what you describe: splitting an optical signal into two using, basically, a prism. One signal goes out to the net as normal; the other goes to their own datacenters, which they keep continually building and expanding. The newer ones are being built on military bases, for added security. Check 'em out. Look at the size and cost of them. Some are over a million sq. ft. That's a lot of data. They measure it in terms of yottabytes and zettabytes (in 2013, a lifetime ago in terms of storage space):

https://modernsurvivalblog.com/government-gone-wild/nsa-loca...

http://worldstopdatacenters.com/government-data-centers/


Nah, PRISM referred to the front door for lawful access to customer records under warrant. That's the sort of portal China once hacked Gmail through by gaining access to it; the companies explicitly built those access relays.

The beam splitter stuff (e.g. Room 641A) went by different codenames, TRAFFICTHIEF and TURMOIL iirc. That's the back door.


> The newer ones are being build on military bases, for added security.

IIRC, the NSA is organizationally part of the military, and it's currently headed by a military officer who gives congressional testimony in his uniform (https://www.youtube.com/watch?v=nMi241XLeQ8). It makes sense they'd build on military bases, it'd be kinda weird if they didn't.


No, it was called MUSCULAR. PRISM is different.

Also, the majority of tech and data companies have closed this loophole by encrypting traffic between data centers. Nobody thought it was necessary to do over dark fiber before because, hey, who was listening? (Answer: the NSA was.)


I wonder if they publish their drive failure rates like BackBlaze? 5 Eyes tax payers and all.


One of my coworkers came from a large telecom. He mentioned they had to get technology from an Israeli firm that specializes in quantum cryptography on the fiber optic line to fend off the NSA and GCHQ, who are apparently worse than the NSA. IIRC, the tech encrypts data streams on one side and checks that the hash is the same on the other side; if something's off (evidence of tampering), it instantly changes the cipher.


And they have been using the USS Jimmy Carter sub with the front huge cable splice bay for decades to compromise all undersea cables.

>>>The New York Times reported in 2005 that the USS Jimmy Carter, a highly advanced submarine that was the only one of its class built, had a capability to tap undersea cables. An Institute of Electrical and Electronics Engineers report speculated that a 45-foot extension added to the Jimmy Carter provided this capability by allowing engineers to bring the cable up into a floodable chamber to install a tap. But it is unlikely that the USS Jimmy Carter routinely taps cables since U.S. intelligence agencies can much more easily (and lawfully) obtain cable data through taps at above-ground cable landing stations.

https://www.lawfareblog.com/evaluating-russian-threat-unders...


I wonder how Jimmy Carter feels about his namesake being used to wiretap the world?

https://www.washingtonpost.com/news/post-politics/wp/2014/03...


It was a known FU to Carter...


But the USS Jimmy Carter launched in 2004:

https://en.wikipedia.org/wiki/USS_Jimmy_Carter

I don't think he made any pre-Snowden comments on mass surveillance?


Optical tap for unauthorized access, but surely port mirrors for the national-security-letter stuff. Wouldn't make sense to go through the hassle of a tap install and the ongoing risk of it failing, versus using a capability available on almost all serious switching hardware to give you a guaranteed 1:1.


They would still need the encryption keys since Google encrypts their inter-data center fiber


Interesting that Facebook's CavalryLogger is still being successfully injected despite there being nothing but a blank page. Also interesting that CavalryLogger has a function that lets you bind key-presses to callbacks. Even more interesting is that CavalryLogger seems to come prepackaged with any Facebook Like button! Cheers for the keylogger, Facebook.

https://stackoverflow.com/questions/4188605/what-is-cavalryl...


Alternative post mortem (blind): Massive power outage

Yet another alternative: Third World War has just started, and this was the first battle.


I don't think it's the NSA this time; for once they don't have to do deep packet analysis or install any MITM device, since they get the whole info in bulk. Maybe it's just a 400-pound hacker.


I've never understood the 400 pound hacker thing. Does it refer to body weight or technical prowess?


I spontaneously took it as a loose analogy to dominant male gorillas and the extent to which they need to be taken seriously.


It's the stereotype that "hackers" are obese, chugging mountain dew and living in their parents basement.


The 400-pound hacker is a reference to what Trump said during the 2016 campaign but it seems like not everybody gets it.


Remember when youtube was down for like six hours a few months ago and we still haven't heard why?


Yeah, why was that? I kept wondering when we'd hear. I've got bets to collect on!


Curious if they did it due to a kernel exploit being used by a nation state, bad enough that it was worth YOLO patching.


People are actively downvoting such comments. I guess that's suspicious too.



The heavy traffic is due to sports events: Champions League last-16 matchday live streaming, Bayern Munich vs. Liverpool and FC Barcelona vs. Olympique Lyon. The heatmap matches the clubs' home countries (UK, Germany, France & Spain) quite well.

Don't conflate that with the FB/Insta problems.


Indeed. The games are streamed like crazy! Lots of streams are in HD.


Yep, the actual bandwidth consumption is mindblowing :)


Multicast was designed exactly for this: the same data streamed to many endpoints at the same time. Too bad it's not more widely used; the bandwidth savings would likely be huge.
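For the curious, joining and sending to an IP multicast group is only a few lines of Python; a minimal sketch (the group address and port here are made up for illustration):

    import socket
    import struct

    GROUP, PORT = "239.1.1.1", 5004  # hypothetical multicast group/port

    def send(message: bytes):
        # One datagram; the network fans it out to every subscribed
        # receiver, so sender bandwidth is constant regardless of audience.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
        sock.sendto(message, (GROUP, PORT))

    def receive():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        # Tell the kernel (and upstream routers, via IGMP) to join the group.
        mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        while True:
            data, addr = sock.recvfrom(65536)
            print(f"received {len(data)} bytes from {addr}")

The catch is that this only works where the network cooperates: multicast routing largely stops at ISP boundaries, which is a big part of why internet-scale streaming fell back to unicast CDNs.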


That chart doesn’t support your assertion. Akamai’s traffic and attack charts usually look like that, and the attack chart even says it’s currently low.


That traffic map seems to line up with time zones


Let's see whether we have a spike in the birth rate in 9 months.

(Oh, turns out the Great Blackout Baby Boom was a myth:

https://www.snopes.com/fact-check/from-here-to-maternity/ )


What if there is a fall in birthrate because people can't organise risky hookups without their preferred communication platform?


Or a fall in STDs for the same reason.


"This usually means we're making an improvement to the database your account is stored on. While this process won't affect your account, you temporarily won't be able to access the site." https://www.facebook.com/help/134401680031995

I guess that this is all that I will get. Facebook is never down, it is just making improvements (like restarting the services to make them work again).


We've always been at war with Eastasia


and autocorrect - it's doubleplus bad


*ungood


They could be consolidating all of the DB infrastructure for their platforms. A zero-downtime dial-up would not be possible, as they would need to nearly double their DB infrastructure. Short planned temporary outages of various features probably become long unplanned cross-platform outages. They probably decided not to roll back the migration after the first outage.


What manner of failure would cause such globally deployed and distributed systems to go down like this? I'm very interested to read up on this when they release details of the failure.


Short duration: network, bad software deploy.

Long duration: DB. If you break data, it takes a while to unbreak.

Source: Me. My career has been spent managing db's for internet scale sites.


I work for a smaller but comparably large platform. "If everything is down check the DB" is at the top of one of our internal monitoring websites in red.

Screw ups related to data loss are rare (I've been here years and haven't seen one with the DBs that the stuff I work with uses) but failures at this scale tend to cascade a little ways and it takes time to dig out of the hole. They probably have the problem solved but they have to spend a bunch of time synchronizing things and verifying the fix before they press the big red "go live" button.


Shouldn't the monitoring websites be able to check the DB status for you before you even look at that red text? :)


We have a different dedicated page that gives an overview of what's going on with the DB. The page in question is supposed to be a single stop that lets you visually get an overview of the state of the application servers and whether things are "normal", and if not, allows you to quickly identify what is not normal.


Nothing worse than that sinking feeling of "oh fuck, we have to backfill a lot of data."


Why did my username on this site just change to 'test123'... oh, where clauses.


Nothing worse than the page on Friday night; oh, there goes my weekend.


I have no inside knowledge of this one, but broadly speaking, these sorts of failures can be caused by a change thought innocent at the time to the core software that is then widely deployed using automated systems. If the core's tests didn't catch a real issue in production (and for whatever reason, the rollout happens faster than the regular small-release verification process can catch the error), things can go sour in a way that's expensive to un-sour.

Amazon once pushed a seemingly-innocuous change to their internal DNS that caused all the routers between and within datacenters to drop their IP tables on the floor. They had to re-establish the entire network by hand---datacenter heads calling each other up and reading IP address ranges over the phone to be hand-entered into lookup tables. Cost a fortune in lost sales for the time the whole site was inaccessible.


As someone who works at a large company in the networking space: you would be surprised how minor configuration changes can cause catastrophic failures that are really challenging to come back from.

Network failures are usually really bad when your system is globally deployed and distributed; often you can't even communicate with your machines to deliver fixes :p




If you use their API and haven't seen it yet, their issue is listed here on their status page:

https://developers.facebook.com/status/issues/55989644784543...


Looks like the status page is down for me as well.


Increased Error Rates Created by Gary Fitzpatrick · · Facebook Team — Today at 10:32 AM

Current State: Investigating

Description: We are currently experiencing issues that may cause some API requests to take longer or fail unexpectedly. We are investigating the issue and working on a resolution.

Start Time: 2 hours ago

Last Update: about an hour ago

Updates: There are currently no updates for this issue.


> Increased Error Rates Created by Gary Fitzpatrick

Well someone tell him to stop, for pete's sake!


Pete's in on it too!


I read it that way as well :)


Are you joshing me?


The real storm is realizing that, through Facebook OAuth, you cannot access your affiliated accounts. A caution to move your accounts away from Facebook.

Edit: Or to have other methods than just relying on Facebook authentication.


realized that when trying to play Pokémon Go today, guess I'll break my weekly challenge...


Serious question: Was any value lost? (this may appear sarcastic)

Facebook obviously loses some ad revenue and Facebook customers may lose sales. But do Facebook/Instagram users suffer? How does losing social media for several hours affect the quality of life of users?


I am not a big fan of social media either, but you would be surprised... For example, here in Sudan (East Africa), the country has been under continuous protests for over 2 months now (53 dead, 4k+ detained, 500+ injured), with strong censorship from the regime and silence from the international community. So Facebook, WhatsApp & Twitter are the only media left for the people to fight for freedom. Every Thursday is the main protest of the week, and this Wednesday night the outage might affect that, as thousands around Sudan won't know about the meeting points for tomorrow!!!

Actually, the government did block all social media for over a month, but that was fixable with a VPN. (Follow the hashtag #SudanUprising on Twitter to learn more.)


[flagged]


I doubt Sudanese protestors are watching football and eating big macs on their off-time. Most likely limiting their protests to once a week 1. makes for a single, effective push, 2. keeps the protestors and their families from starving.


Lmao "no disrespect intended" "scheduling it between watching the football and eating big macs." pick one and keep in mind you're talking about Sudan and not the US...


As I said, "main protests on Thursday", so no, Sudanese people don't protest just once a week. And btw, people get shot here just for standing out and peacefully protesting, so it's far away from the picture you have in mind. I've put a hashtag where you can see photos/videos & learn more, and of course share productivity tips.


Interesting point in general, but...

What I asked was: what is the effect of sporadic interruptions of a few hours? I mean, if Facebook had 30% availability, would I lose anything valuable from the experience? Is it that we are just used to it and want it to be there always?

The value of 99.5% availability for __users__ is not clear to me. Instant messaging is the exception to this.


I know parents who keep in touch with their children via Messenger, in part because it works in more places: Messenger works wherever there's internet over wifi, not just cell service. People rely on Facebook for non-trivial reasons, whether or not I (or someone else) think it's a good idea.


It might seem petty, but my wife has a small internet-based business and uses Facebook as a login for one of the sites she sells on. So today, instead of being able to autofill labels directly for shipping, she had to hand-type addresses for all shipping labels for products sold on that site.


I reached out to an old acquaintance that could be a great help to my company. I reached out over Facebook. Now that contact can not respond and may have not even seen my message. I have no other way of contacting this person. This affects my business.

I hate Facebook, but to deny its value is pretty naive.


FB has effectively replaced all other text messaging for several of my social circles. It's nice when you have groups that kinda change over time, otherwise group-texts always end up with numbers you don't necessarily have in your phone, etc.


I can live without FB and Instagram, but Messenger being down is a whole different story.


Do you think of Messenger as separate from Facebook?


I consider my relationship with Messenger separate from FB. Most of my conversations happen there. I've deleted FB from my phone, but I don't think I could ever go without Messenger.


I used to be like that. One day I just sent the same message to all people I still contacted on messenger saying that I was getting rid of it in one week and listing 3 alternate services people could use to contact me. Didn't lose a single contact and never looked back.


Group chats won't follow you though.


Potentially much value was gained (or stopped hemorrhaging)


For hours, not especially - it's annoying but no worse than a power cut. There could even be benefits.

On the other hand, if someone were to sabotage the platform and prove/convincingly argue that they induced the failure, at minimum it would do significant damage to the tech sector and at maximum cause public panic.

This is a hypothetical, not speculation on the cause of this outage.


Obviously this could be argued differently from a shareholder perspective, but I would say otherwise no. Interestingly, this might be one of the only times where a large outage could be claimed to be adding value. Again, not for FB, but for users, sure.


After a few hours of not being able to use an app people might start realizing how addicted they are to it. "I was bored initially but then realized talking to people in real life still works."


I've also seen issues uploading images to WhatsApp in the past half hour. I wonder if it has anything to do with the Google Cloud Storage outage that took down Gmail yesterday?


Facebook doesn't use GCS (at least they didn't in 2015), they have their own infrastructure/data centers.


The only things I can think of that would cause this scale of outage are either a Tier 1 datacenter outage or (conspiracy hat on) a major hack with everyone rush-patching.

Would be interesting to read the post mortem, if there is one, regardless.


If rush patching were going on, we’d likely see some hints in commit messages of open source projects, like the Linux kernel commits that were tipoffs to Meltdown and Spectre.

Edit: Has anyone seen anything of this sort in any of the projects they follow?


Did we ever get a post mortem for the global Youtube outage from late last year?


No, but everyone is 99% sure it was related to killing Google+, which was announced not too long before; everyone who used YouTube way back when knows they had to make a Google+ identity at one point to link 'em. HMMMMM....


Google and FB having successive outages? Is this just a coincidence or is there some shared infrastructure that would explain this?



Both companies have their own private data centers/infrastructure.


^^^ I noticed this too. GCP has been under all 50 shades of outages for the past few days. I feel like I might need to rush back to my house and start digging a bunker.


Seems like that time GMail and O365 experienced major outages on the same day.


Likely a firmware issue with the NSA intercept devices.


Are you basing that on anything or just BS'ing?


Hail Ghidra!


...and their key infrastructure in Venezuela.


Yes, I can't help entertaining the thought of another Stuxnet going wild.


The Bay Area Peninsula has been having strong winds and heavy rains for the past few months. The last 3 days, there have been major power outages across the area. Redwood City had power outages 2 days in a row and Pacifica lost power to a good chunk of the city for like 7 hours last night. It wouldn't surprise me if all these major tech outages that have been happening this week is all related to poor Bay Area infrastructure.


A bunch of FB employees near Pacific Catch were talking about how FB was hacked.


https://www.facebook.com/platform/api-status/ still returns "Facebook Platform is Healthy", but you can't even load https://developers.facebook.com/status/dashboard/. Why have status pages if they are so susceptible to going down themselves?


I remember an S3 outage a number of years ago where AWS discovered that their status page was hosted on S3. Whoops!

I believe this is why Github's status page is now on its own domain; so a github.com DNS outage won't take it down.


IIRC the AWS status page was fine but the indicator of the status of the different services was a red or green .PNG which were hosted on S3.


BTW, many apps are affected by this. I cannot log in to any app that uses Facebook for authentication.


This is one of the many many reasons I don't use fb for auth for any other websites.




Ironically, as I've been checking this post, HN has been experiencing errors loading new comments.


Instagram seems to load the feed fine here (EU), but doesn't allow you to log in from any device or post anything new. FB is totally fine for reading if you are logged in, but you also can't log in if logged out.

On a VPN to the US, Insta can log in, but still not post.

Distributed services are weird, man!


I see they've made some progress putting Instagram on the same infrastructure as Facebook.


:/

Something fun happening in Germany? https://www.akamai.com/us/en/resources/visualizing-akamai/re...

And Level3 traffic going to Argentina? https://twitter.com/bgpstream/status/1105819050968580096

And Great Britain going to Cambodia? https://bgpstream.com/event/197968



> Something fun happening in Germany?

Major sporting events in the EU.

BGP fuckups appear to happen regularly, based on the tweet history of that account.


As a layperson, can you explain what exactly we're looking at here?


As a techperson ...

barely; BGP is that complex, I'm afraid, but it's the wild wild west when it comes to potential nation-state attacks on internet traffic. I wonder if these patterns are typical or worrisome.


Coincidentally, I just watched The Social Network, the plot of which includes this quote from Mark:

> Let me tell you the difference between Facebook and everybody else: we don't crash ever! If the servers are down for even a day, our entire reputation is irreversibly destroyed. <…>

> Even a few people leaving would reverberate through the entire user base. The users are interconnected. That is the whole point. College kids are online because their friends are online, and if one domino goes, the other dominos go.


You’re quoting a movie, which played fast and loose with facts to tell a story (as movies do).

In real life, Facebook had significant issues with uptime in the early years.


Can’t argue with that, however the fact that Facebook used to go down before would not preclude that each time was (and maybe still is) seen as a major incident within the original corporate culture.


Did they move too fast and break too many things?


Doesn’t their world-class team make such a long outage quite unlikely? How hard would it be to devote ample resources to a cover story for the “incident report”? Is the timing relative to the plethora of indictments relevant at all? Is it reasonable to think this may be related to shredding of data and/or code, or even cooperation to turn over data to the government in a secret deal?


Minor update: https://twitter.com/facebook/status/1105907126424109056?s=21

> We're focused on working to resolve the issue as soon as possible, but can confirm that the issue is not related to a DDoS attack.


There's something ironic about a social media company being forced to rely on a competitor to facilitate communication between them and their users.


Why? When Facebook is down, you cannot use Facebook to communicate about the issue. That is also the reason why FB still uses* IRC instead of Messenger for coordinating the resolution of such issues.

* Or at least it did 1 year ago, when I was working there.


Hmm, I wonder if we could leverage that to make irc more popular again. "FB uses X" is often all the justification small startups need for picking a tech.


most startups can use their standard comm channels since most startups are not doing collaboration/communication tools.


Look at how many service providers have increased incidents reported here: https://downdetector.com/

My bet is that people are having problems with FB/Insta and immediately assuming that the whole internet is messed up.


Especially if they can't sign in via OAuth. To an average user who signs into Spotify with their Facebook account, "I can't sign into Spotify" means Spotify is down, not Facebook.


Looks like that site is having a problem logging reports! They all go to zero at around 17:00.


> The team at Jefferies remains reasonably positive, and in the firm's top growth stock calls for the week we found four tech stocks that are offering more aggressive accounts good entry points. Carl Court / Getty Images

What's that weird tagline about ?


In other news, productivity everywhere skyrocketed!


It's hard to believe that people simply cannot live without fresh instagraphies.


What's the cause of the outage, then? Disk, memory, CPU, network bandwidth...?


I got my GitHub two-factor auth SMS two hours late. Fortunately it was just my old laptop. I wonder if it was related. A good reminder to set up an authenticator app on my new phone so I don’t have to rely on SMS!


Which you should do anyway because relying on SMS is only slightly more secure than not having 2FA enabled at all...
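For the curious, those authenticator-app codes are just TOTP (RFC 6238): a shared secret plus the current time, with nothing sent over the phone network. A minimal sketch (real apps get the secret from the site's QR code at setup; the secret below is an illustrative example):

    import base64
    import hashlib
    import hmac
    import struct
    import time

    def totp(secret_b32: str, period: int = 30, digits: int = 6) -> str:
        # RFC 6238: HMAC-SHA1 over the current 30-second time step.
        key = base64.b32decode(secret_b32, casefold=True)
        counter = struct.pack(">Q", int(time.time()) // period)
        digest = hmac.new(key, counter, hashlib.sha1).digest()
        # Dynamic truncation: pick 4 bytes based on the last nibble.
        offset = digest[-1] & 0x0F
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(code % 10 ** digits).zfill(digits)

    print(totp("JBSWY3DPEHPK3PXP"))  # hypothetical example secret

Since the code is derived locally from the shared secret, there's no SMS to intercept or SIM-swap.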


SMS is still an option if you have a working 2FA Authenticator on GitHub. And even if I went through the trouble to disable it, I disagree. There are conceivable ways people could get to my email to initiate a password reset without getting to my phone, such as snatching my laptop from me while I'm working at a coffee shop.


Whatever is happening right now at Facebook is less important than the fact that they will never say what affected them. Of course nobody would say "hey, outage right now due to a 0day / mistake", but...



It's an experiment to see whether productivity improves when people aren't able to access FB and Instagram to slack off.


And yet Facebook's stock is still up on the day (+.74%)?

You'd think being down for hours would be negative news and revenue impacting.


2 hours of downtime in a quarter is about 0.09% downtime (2 out of roughly 2,200 hours) -- probably very little effect on their monetization products.


We're up to around 7 hours of partial outage now. I have spoken to FB employees in the past; every hour they are even partially down, they are losing millions and millions in revenue.


The issue is also present in the EU: "Is Facebook down? Messenger, site, app and Instagram hit by issues" [1]

[1]https://www.manchestereveningnews.co.uk/news/uk-news/faceboo...


First thought when I heard the news was BGP hijacking (ignoring whether accidental or deliberate). Doesn't the symptom fit other known cases like the Telegram incident in Iran last year, just at a larger scale?

Admittedly networking is not my strength, so perfectly happy for someone to shoot down this hypothesis.


Is Facebook actually working for anyone?


I haven’t been able to post anything on Facebook, neither a new post to my wall nor add a comment on a friend’s post, since mid-morning US/Eastern and this is still the case. In addition I can’t login to the site - I am able to access the site only where I’m already logged-in.


Instagram works for me on my iPhone, but the comments are all missing. I kind of like it that way.


You're already logged in, so it's just going to show you old content. I'd be surprised if you're able to post anything, or if any "like" you give during the outage is saved. The same thing is happening with both the Instagram and FB apps on Android for me.


I never do any of those things anyway


This is the first time I've experienced this. Also note that current sessions on messenger.com still work; we can still send/receive messages, but can't upload any images or send stickers. Looking forward to a post mortem analysis of this.


Hacker News might as well just have a "Major Internet Services" status board.


Google then Facebook and Instagram?

My hunch is that it's the end of Q1 and people are trying to release code changes so they can pad their Q1 performance reviews "designed and delivered feature X on time in Q1".


Yeah I'm gonna go with about a 0% chance of this.


"and brought down a service used by billions and losing potentially millions in revenue."


"Move fast break things"


Well, it's certainly impressive.


Can't claim they didn't have impact then. :)


Year-end performance reviews are the most important... those were already done at most companies.


I've seen all quarter ends be targets for releases. And things that got delayed at the end of the year are usually pushed to the end of Q1.


Perhaps relevant that npm has been having issues, although they only recently caught and fixed them. Scoped private npm packages were getting Cloudflare 503 errors.


Since it went down for PC and not mobile, I wondered whether it was some kind of audience testing in the process of moving to an app-only platform.



see also: Facebook, Instagram down: Social media sites not working for many, FB doing 'required maintenance'

[1] https://www.abc15.com/news/national/facebook-down-social-med...


Whatsapp now down across much of Europe it looks like. Cannot send/receive messages.


Argentina: Whatsapp works for text, but any type of media takes very long to send.


FB has been down for me for around an hour, but has just come back up again.


From Google I got an error clicking on Instagram and Quora links today.


At least you are free of Quora now!


I'm surprised there wasn't a national crisis alert sent out


Can confirm. Also, API integrations, such as Buffer, are not working.


...terrifying API developers everywhere...


Seems to be more than just Messenger login; all of Facebook is super flaky this morning.


Unfortunately, the AI software used to censor fake news from Facebook has decided, again, to censor Facebook :D


somebody please wipe off everything from all these big corps


must be a super 0 day.


Affecting more lives than any terrorist act.


  ... and nothing of value was lost.


[flagged]


When EW speaks, the silos on the web tremble.

(Ok, maybe not, but was fun to think of a bumper sticker-type phrase.)


I didn't notice.


Good! Keep them down.


Yesterday Google, and today Facebook. My inner conspiracy theorist says it's the Chinese government showcasing.


Meanwhile, in Russia they are talking about disconnecting their network from the rest of the world. Some test gone south? Maybe... Does someone have a traceroute handy?


All my funny cat videos and memes are loading fine. Did you try rebooting your router 62 times?

All joking aside, is this news? :/


One of the largest internet companies in the world having a massive, global outage?

Yes, that seems like appropriate news for "hacker news", a website where people discuss technology news.


wow, tough crowd today. i didn't think people took facebook and instagram so seriously.

i'll put the sarcasm machine back in the drawer.


This is HN, we take everything seriously.


Do you want to explain why you believe it's not [HN worthy] news?


My fiancée's uncle sent something today saying that because of a school shooting in Brazil, they were blocking all images and video shared to social networks like "WhatsApp, Instagram, Facebook and other social networks". I haven't been able to verify this myself or from any other sources, but I wonder whether people are misinterpreting the FB outage or whether Brazil blocking content is having weird ripple effects.




