Cellular outage in U.S. hits AT&T, T-Mobile and Verizon users (cnbc.com)
489 points by rooooob 4 months ago | 465 comments



This is a serious architectural flaw.

In the entire history of electromechanical switching in the Bell System, no central office was ever out of service for more than 30 minutes for any reason other than a natural disaster, or, on one single occasion, a major fire in NYC.

The AT&T Long Lines system in the 1960s and 1970s had ten regional centers, all independent and heavily interconnected. There was a control center, in Bedminster, NJ, but it just monitored and sent out routing updates every 15 minutes or so. All switches could revert to default routing if needed, which meant that some calls would not get through under heavy load. Most calls would still work.


There was the big AT&T long-distance outage in January 1990. That was caused by someone swapping a break statement for a continue statement in some C code that handled the routing, and there was a cascading effect.

Then again, that only affected long-distance service.
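
For anyone who hasn't chased this class of bug, here's a toy sketch (Python, not the actual switch code) of how far apart those two keywords can land a message-handling loop:

    messages = ["ok", "malformed", "ok", "ok"]

    processed = []
    for msg in messages:
        if msg == "malformed":
            continue      # skip just this one message and keep serving the rest
            # break       # with break instead, one bad message abandons the whole batch
        processed.append(msg)

    print(processed)      # continue -> ['ok', 'ok', 'ok']; with break -> ['ok']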


I've been thinking about how pretty much no USA infrastructure of today is as reliable as it was in the last half of the 20th century. Just my imagination, or true? And what does it mean?


Airliners crash far less, road fatalities kept going down till 2010, so taking the broad view of where infrastructure is, I'd say at least those got better.

Anecdotally, I remember more electricity interruptions and plumbing issues when I was a kid, but that could be location dependent and I couldn’t quickly find good numbers going back that far.

Edit: While the phone network didn’t necessarily go down, I frequently got “all circuits busy” when I was a kid. I don’t remember the last time that happened.


I wish we had metrics for utility companies. In my midwestern experience, things have gotten worse. I don’t remember any outages as a kid in the 90s that were over 24 hours aside from the major blackout in the early 2000s. As an adult I’ve experienced several outages greater than 24 hours in both summer and winter months. It’d be nice to be able to measure this.


I would caution comparing today against one's childhood memories.

Children have few responsibilities and are shielded by their caretakers. They simply do not notice much of the things that happen.


Right? As a kid we might not have questioned that sudden trip to Grandma's for the day.


Not to come off pedantic, but improvements in airplane and car design have nothing to do with infrastructure.

I think the average person views infrastructure improvements as improvements in the roads, airports, or air traffic control.


They are part and parcel.

In the USA, for example, road design AND vehicle design are directly linked and beholden to NHTSA regulations and policies.

Infrastructure (i.e., state route roads) is ever trending towards wider lanes and gentler shoulders. This is precisely due to vehicle industry requirements for more vehicle safety features (and thus width and length), height, and shoulder-level clearance at windows.

All these infrastructure and endpoint changes are driven "organically" by the USA trend towards SUVs, but mainly driven by insurance requirements. Insurance and gov't "make out" on safer roads/vehicles due to (perceived) fewer accidents and less road maintenance.

I can't speak to airplanes, but I imagine the fact that far, far, far more people are able to fly today than even 25 years ago should show that the infrastructure has drastically improved.


And how often one of our five TV stations would be "Experiencing Technical Difficulties... Please Stand By".


And tangentially related, it was much easier for anyone to eavesdrop on your conversations.

When it rained, I could pick up my phone and hear conversations from my neighbor on my landline and talk to them without calling.

Not to mention, if you were in the same house, you could surreptitiously hear conversations by just picking up the phone, or by getting a device from Radio Shack that didn't have a microphone that you could plug into another phone outlet.

With analog cellular, you could also buy a receiver from Radio Shack and hack it to pick up the unencrypted signals from cell phones.


Redundancy was needed because individual nodes/machines were more prone to failure. As machines got more and more reliable, having highly redundant infrastructure was seen as an extra cost.


Yes. Electromechanical switching systems were substantially more reliable than their components. How this was done should be understood by anybody designing high-reliability systems today.

"A History of Science and Engineering in the Bell System - Switching Technology 1925-1975" is a readable reference. The Internet Archive has it.[1]

More hardcore: "No. 5 Crossbar"[2]

The Connections Museum in Seattle still has a #5 Crossbar working.[3] Long distance used toll switches, "#4 Crossbar", and there were 202 of them.

#4 and #5 Crossbar machines are collections of stateless microservices, implemented from electromechanical components. The terminology used in the old books is completely different, but that's what they are. Each service always has at least two servers. The parts that do have state are distributed. The crossbar switches that make actual connections have state, but are dumb - they are told what to do by "markers", which are stateless but can read the state of the crossbars and of other components. Failure of a single crossbar unit can take down at most a hundred lines or so. Other than the crossbars to external lines, everything had alternate routes. Everything has fault detection, with lights and alarm bells.

Error rates were fairly high. In the previous "step by step" system, a good central office misdirected about 1% of calls. With bad maintenance (and those things were high maintenance) that could get much worse. Crossbar was better, maybe 0.1% misdirected calls.

Routing tables in crossbar were mostly static ROMs of one kind or another. Routing consisted of trying a predetermined set of routes, in order. Clunky, but reliable.

Modern systems need a backdown to that mode.
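
A rough sketch of what that kind of backdown could look like in software (invented names, not how the markers actually did it): a static, ordered list of routes per destination, tried in sequence with no live control center in the loop.

    # Hypothetical static routing table: for each destination, an ordered list
    # of trunk groups to try. No central database, no live updates required.
    STATIC_ROUTES = {
        "regional-center-7": ["direct-trunk", "via-center-4", "via-center-9", "final-route"],
    }

    def trunk_is_free(trunk):
        # Stand-in for checking real trunk availability; pretend only the last hop is free.
        return trunk == "final-route"

    def place_call(destination):
        for trunk in STATIC_ROUTES.get(destination, []):
            if trunk_is_free(trunk):
                return trunk                          # take the first route with a free circuit
        raise RuntimeError("all circuits busy")       # under heavy load, some calls just fail

    print(place_call("regional-center-7"))            # -> final-route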

[1] https://archive.org/details/historyofenginee0000unse_q0d8

[2] https://archive.org/details/bellsystem-no-5-crossbar-blr

[3] http://www.telcomhistory.org/connections-museum-seattle-exhi...


Efficiency or reliability: pick one.


Do you work at a telecom or are you just guessing?


Used to work with telecom equipment in situations where high availability was required and equipment failure was expected.


Highway construction standards are much higher than they were 30-50 years ago, but it's a mixed bag. Administrative costs are significantly higher. Survey has dramatically improved with GPS. Highway engineering has not improved since about the 90s. Automated machine guidance has significantly improved the potential accuracy of grading operations in the last decade.


>>>Highway construction standards are much higher than they were 30-50 years ago,

The roads in many US cities aren't built to those standards and are grandfathered from them. New York City highways are horrible.


It depends on how you view this. If you measure by total bits / data / time per customer served divided by total outage time, I would think we are still very reliable in terms of telecoms. It is worth remembering we are likely serving 1 to 100 million times what we did on mobile networks 30 years ago.

Just purely in terms of total downtime itself, we are less reliable, purely because of the complexity involved.


I’m surprised no one noticed the 502 when attempting to enable WiFi calling. Azure was the source of the 502. Cloud architecture problem.


Perhaps this might have also been the reason https://www.webwire.com/ViewPressRel.asp?aId=318230


This might have been the reason: https://www.webwire.com/ViewPressRel.asp?aId=318230


In this context, what's the physical scale of a 'central office', as far as regional dimensions? Thanks!


I suspect that when AT&T built all the COs (1950-70s) they were constrained by both max number of lines and physical distance.

You’ll see numerous COs in a big city, but they are also pretty widely dispersed throughout the suburbs and rural areas.


I worked for Southwestern Bell in the 90s pre-SBC (aka just before remote terminals and DSLAMs became common). COs handled mid-tens of thousands of lines in big cities; smaller, more rural areas generally had a CO covering a single town, or, less often, a few in a county where the full county was under 50k people.

In towns, we generally tried to keep loop lengths under 30k feet, but in rural areas that simply wasn't possible. You'd often find remnants of party line systems in those areas and definitely load coils out the wazoo. It was "fun" unwinding all that crap to install ISDN circuits and later DSL.

I remember the old hats at the time laughing about VDSL, saying "leave it to the nerds to dream up some unrealistic shit where the loop length can be at most 2k feet, where does that exist!?", not realizing that a few years later RTs and DSLAMs would mean a significant portion of city and suburban customers would be on loops that short.


I used to work at BellSouth in outside plant engineering in the early 2000s. That's exactly what it was. Of course, by then any expansion was done via remote terminals and COs were becoming very antiquated.


>COs were becoming very antiquated.

I mean COs are integral in AT&T's architecture, it's where every fiber connection lands on an OLT


Well, one is a skyscraper filled with equipment near the Brooklyn Bridge IIRC.


What about the Chicago area outage in 1988? The entire area was without phone service for several weeks


Hinsdale telco, 1988.

Didn't take out the entire Chicago area but was probably the worst-case scenario for a suburban switch. Hinsdale handled the airports, FAA ATC offices, and the emerging mobile/cellular network. Long distance and 411 were down for whole counties.

This is a pretty good USENET archive/digest of the event, if you can get past the Web0.1 formatting:

http://telecom.csail.mit.edu/TELECOM_Digest_Online/1309.html


Hopefully this gets wider coverage and reporting. One of the top discussion points for 6G was to simplify it, both for technical reasons and for cost reduction. 5G brought us enough capacity headroom that most MNOs today still don't find enough incentive to deploy.

Maybe it doesn't even need to be 6G. With 5.5G and above and OpenRAN there is another opportunity to radically reduce complexity. I can only hope there is enough of a push towards this.


The complexity and scale of modern systems are on another order of magnitude.


Watch "Without Fail" (1967), on how the Bell System did it.[1]

[1] https://www.youtube.com/watch?v=ZAJpionUxJ8


And centralized. Data is cheap (though they won’t admit that) while big iron cellular core stuff is expensive.

Funny when they billed extra for long distance calls even though all calls were routed through one place for a huge geographic area. Calling your neighbour could be a round trip of hundreds of miles over mobile.


> And centralized.

Yes. Too much of routing is centralized. Since phone numbers are no longer locative (the area code and exchange number don't map to physical equipment) all calls require a lookup. It's not that big a table by modern standards. Tens of gigabytes. All switches should have a database slave of each telco's phone number routing list, to allow most local calls if external database connectivity is lost. It may be behind, and some roaming phones won't work. But most would get through.
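
A sketch of that local-fallback idea, with invented names (the point is only the shape: a periodically synced replica on the switch, consulted when the central lookup is unreachable):

    # Invented names and data; not any telco's actual stack.
    LOCAL_REPLICA = {"+12125551234": "carrier-A/exchange-17"}   # synced every few minutes

    def central_lookup(number):
        # Normal path: the authoritative number-portability database.
        raise TimeoutError("central routing database unreachable")

    def route_number(number):
        try:
            return central_lookup(number)
        except TimeoutError:
            # Degraded mode: the replica may be minutes behind, so recently ported
            # or roaming numbers can misroute, but most local calls still complete.
            return LOCAL_REPLICA[number]

    print(route_number("+12125551234"))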


This is no justification. They should have another scale of resilience.


Agreed. Not a justification, but an explanation/excuse as to why systems are less reliable. You're right on the mark -- when reliability doesn't scale with complexity, you get this.


True words :checkmark


Something I'm not seeing discussion on:

What is/were the cascading effects of this, particularly for drivers?

Many people in buildings were unaffected, as they could fall back to wifi. But I imagine this had a pretty broad impact on drivers.

Just a few things I can think of:

- Packages delayed (UPS, FedEx, Amazon, truck drivers, etc.) for drivers that relied on their phone's mapping apps to get them to their deliveries

- Uber/Lyft/taxi/etc. drivers not able to get directions to their pickups/dropoffs

- Traffic worsened because drivers weren't able to optimize their routes, or even get directions to their destination

Maybe larger companies have their own infra for this, or have redundancy in place (e.g. their own GPS devices)?

I'm curious to hear thoughts on whether these (and others) were impacted, or if there are ways they're able to get around this.

Also, unrelated to drivers, I can imagine there is/was a higher risk of not getting treated for emergencies due to not being able to make calls (I'm not sure whether/how emergency calling was impacted).


> Traffic worsened because drivers weren't able to optimize their routes, or even get directions to their destination

During Canada’s Rogers outage in 2022:

> In Toronto there was some dependency on Rogers. One quarter of all traffic signals relied on their cellular network for signal timing changes. The Rogers GSM network was also used to remotely monitor fire alarms and sprinklers in municipal buildings. Public parking payments and public bike services were also unavailable.

https://en.m.wikipedia.org/wiki/2022_Rogers_Communications_o...

As it was summer, I recall some park programming for kids had to be cancelled because the employees were required to have a phone capable of calling 9-1-1 (but sounds like that at least still worked here)


This is putting it mildly. The Interac network went down and no one could use their debit cards nationwide.


I'd put that on Interac single-homing itself without redundancy.

Their ops are critical enough you'd expect better from them.

Not the kind of shortcut Canadian banking takes for core stuff.


The funny thing there is that Interac "did" have redundancy i.e. another network provider to fall back onto.

Unfortunately they failed to notice that this was a reseller for Rogers lines.


that explains a lot


Not all of it. My credit union's interac services still worked.


> Traffic worsened because drivers weren't able to optimize their routes

This might explain a huge random traffic jam I hit in the middle of my town this morning.

I had no idea any kind of an outage was happening because I've intentionally scaled back my dependence on my phone. I always used to automatically pull up Google Maps to navigate no matter how short the trip. At some point I realized I was losing my ability to travel without being completely dependent on some company tracking my location and telling me what to do, so as part of my phone de-Googlification I switched to Organic Maps. And even then I try to navigate on my own without any GPS assistance as often as possible. I feel like navigating is a skill you can actually lose if you don't practice doing it.

After running an errand across town this morning, I decided to try getting back home via the biggest arterial through the city that I know about, and I immediately hit a huge westbound backup stretching at least a mile. It was a total standstill. I peeked ahead trying to see if there was some kind of accident or something and didn't see anything. Everyone was just sitting in this traffic jam, and I couldn't for the life of me figure out why.

I immediately flipped a u-turn and went 3/4 of a mile north to another westbound road I knew about. That one was completely clear of any traffic at all, and I was able to drive the speed limit all the way back.

The most-used navigation apps I know of suggest alternate routes when there's congestion, so why were all those people just sitting there in that jam while a parallel road less than a mile away was clear? Maybe it was this cascading effect of too many people conditioned into being told what to do by their phones while their phones couldn't tell them to take the other route.


> Maybe it was this cascading effect of too many people conditioned into being told what to do by their phones while their phones couldn't tell them to take the other route

There are a lot of people who couldn't navigate to a neighboring street without turn-by-turn directions even if their life depended on it.

Add to that that most people don't have the slightest idea where they are, where the cardinal directions are, or what they need to do to get from point A to point B.


There is also a bit of fear that if you go to an alternate route it may also be congested, and it will all have been for nothing.

Radio is pretty good with traffic news, but how many people would even think of local radio?


> if you go to an alternate route it may also be congested, and it will all have been for nothing

Yeah, I get that there can also be a bit of a sunk cost thing along with regret minimization going on too. I think game theory suggests that you should switch routes the instant you hit significant congestion though, because P(congestion on the current route)=1 as soon as you hit it.
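
Back-of-the-envelope version of that argument, with made-up numbers just to show the asymmetry:

    # Made-up numbers; the point is only that the jam you are sitting in is certain,
    # while the alternate route is merely a maybe.
    p_alt_jammed = 0.3     # guess at the chance the parallel road is also backed up
    jam_delay    = 25.0    # minutes lost if you sit out the jam
    detour_cost  = 4.0     # extra minutes of driving to reach the alternate road

    stay   = 1.0 * jam_delay                           # P(jam on current route) = 1
    switch = detour_cost + p_alt_jammed * jam_delay    # pay the detour, gamble on the rest
    print(f"stay: {stay:.0f} min, switch: {switch:.1f} min")   # stay: 25 min, switch: 11.5 min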


I think many many people would check local radio, not everyone lives on their phone or in a tech bubble.


Many people don't have a way of listening to local radio outside of their car. Loads of people don't even have a way to watch OTA TV.


If you use Google Maps, it will automatically prompt you to download a map of the area if there is known poor coverage. It also has automatic (?) local maps.


One beef of mine with Google’s offline maps is that they’re only driving maps, and not walking/transit/cycling maps. Obviously you can kinda figure out walking paths anyway, but since I’m sometimes travelling without roaming access, it’s unfortunate.


Have you tried Organic Maps for walking or cycling routes?


No one walks in Mountain View or most of the US so product management doesn't understand the use case.


I imagine it would be hard to do transit maps if you weren't connected to get the schedule.


They already get them somehow while "online". Offline with beginning & end times and a rough idea of frequency should be good enough for local use.

Offline road maps are subject to construction/seasonal/holiday route closures/deviations too, and so is transit.


Road closure tends to be much more rare. Transit is much more variable in the US.


Worth noting that GPS does not rely on cell service.


GPS on most cell phones uses data connection to download current satellite data in order to decrease the time from cold start to GPS lock. Lack of cell or WiFi can cause GPS to take 5-15 minutes to "search the sky" and download satellite data via low bitrate channel under poor signal conditions.

https://en.wikipedia.org/wiki/Assisted_GNSS

Edit: You can think of it as a CDN for the GPS almanac.


From cold start. Most starts are not cold. The phone knows where it is, approximately, what time it is (within a second or so, from built-in RTC), and orbital parameters of the satellites overhead (maybe without the latest corrections).

My Garmin watch gets a GPS lock in way less than 5 minutes without any cellular connection.


Actually, your Garmin probably gets A-GPS data uploaded to it via the app.

I think that because Huami/Amazfit/Xiaomi smartwatches already do that. We know this from reverse engineering efforts in Gadgetbridge, but support for Garmin is still new and so there isn't as much info about it; either way it probably works in the same way.


My first GPS Garmin watch used for running back in 2011 didn’t have an app and didn’t have any cellular signal. I put it on my wrist and started running. I don’t remember it taking more than a minute to get a signal.


No, it's not paired to the phone, and there is no Garmin app on the phone either.


That’s just not true for modern phones. I use iPhone on hikes without cellular connection and GPS lock is instantaneous. Organic Map app is great for hiking.


You're talking about something very different called a hot start. The GP is discussing time to fix in a cold start scenario. You'd only see this on a phone that had been powered off for months, or "teleported" hundreds of miles away. In this scenario the receiver has to download the new time, new ephemeris data, and a new almanac (up to 12m30s in the worst case) before it can fix. Depending on the receiver, there may also be a delay of several minutes before it enters cold start mode.

If the receiver has recently (last few days) gotten a fix and hasn't moved too much from that fix, it'll be in at least warm start mode. It still needs to download ephemeris data, but this usually takes 30ish seconds to fix.

If the receiver has seen a fix very recently (last few hours) or a recent network connection, it can fix from hot start like you saw, which only takes a few seconds and may not even be observably slow depending on how the system is implemented. Phones go to great lengths to minimize the apparent latency.
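
Summarizing the three modes described above as a little table (the times are the ballpark figures from this thread, not guarantees for any particular receiver):

    # Ballpark figures only, taken from the discussion above.
    GPS_START_MODES = {
        # mode:  (what still has to come down from the satellites,    typical time to fix)
        "cold": ("time + ephemeris + full almanac",                   "minutes, ~12m30s worst case"),
        "warm": ("ephemeris for the satellites in view",              "around 30 seconds"),
        "hot":  ("little or nothing (recent fix or assistance data)", "a few seconds"),
    }
    for mode, (needs, ttff) in GPS_START_MODES.items():
        print(f"{mode:>4}: needs {needs}; typical fix in {ttff}")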


You reminded me of my first GPS, which connected to a laptop via RS-232:

https://www.bevhoward.com/TripMate.htm (Not me)

Back then, just getting a GPS fix at all was exciting. Then driving around with it propped on the dashboard or rear window.


The sibling response here covers all of the points I would say. Scott Manley has a nice video covering the history of GPS and how it works, well worth a watch https://www.youtube.com/watch?v=qJ7ZAUjsycY

It's not as simple as you think.


The routing does though.

I have Google offline maps downloaded for areas I end up in just in this case. Gotta do traffic rerouting the old fashioned way though.

Or have an old-school GPS map thingy in your glovebox.

(Also have kiwix and a whole archive of Wikipedia on my phone).

I wonder if meshtastic communicators sales took off during this. How’s LoRa traffic these days?


Yes, much that we think of as Google Maps relies on API calls made to the backend. Plus this assumes that you downloaded the offline maps ahead of time, which in my anecdotal experience is not something that most people really consider. GMaps does (or did at one time at least) have a neat feature of auto-downloading your home area map but, the one time I needed it, it didn't work.


> which in my anecdotal experience is not something that most people really consider

Thankfully I’m in Canada where it’s not impossible to end up in the sticks with no service.

Chewing through your handful of gigabytes/month of data wasn’t hard. Only in the past year or so have double digit gigabyte/month data plans become cost-effective.

And our roaming prices are extortionate, so for jaunts over the border (or internationally), I’ll sometimes go “naked”.


The "Here" app or whatever it is called did offline maps and offline routing decently enough. It wasn't perfect, but it worked for "here to there", even if it didn't find the best possible route.


Google Maps does offline routing. It doesn't do traffic routing but updating routing is better than nothing.


I've been using this on Android for a couple years and love it: https://en.wikipedia.org/wiki/Organic_Maps

You click a few buttons to download OSM tiles and then it does routing. The latest OSM even has a decent amount of stores, restaurants, etc., listed.


Carriers have mapping independent of networks. Drivers keep personal GPS too. You would lose traffic and road conditions, I guess, but nothing proper trip planning wouldn't cover.


> Drivers keep personal GPS too.

Do they? I know there are a lot of old units out there but I figure people would have tossed them.

At least I’ve found Waze has been pretty good at starting off with wifi and loading the map of the whole journey after coverage was lost with some resilience for stops/detours.


I am consistently in areas with zero cellular service and I’m reasonably sure Google Maps will route offline. At least, I’ve never switched to another mapping app because I couldn’t route — it’s more usually because Google Maps in more primitive areas is kind of detail-less.

But even if it doesn’t, there are a ton of offline map apps that use OpenStreetMap data.


Apple Maps has offline navigation with historical traffic included


> Or have an old-school GPS map thingy in your glovebox

You can also install Organic Maps on your phone.


Google Maps and now Apple Maps (as of ~6 months ago) have offline maps, but not by default. If you enable and download them for your area of interest you can use a subset of the normal app.

I make sure to have this around my usual area and anytime I travel to an area with poor coverage, plus my Garmin watch has offline maps and GPS everywhere, but this is not typical.

OSMand usage is even less common.


Offline maps are a life saver in areas with bad coverage. One of the first things I setup for a new phone or when I’m headed somewhere new on vacation.


This is one of the most interesting differences I often notice between users who rarely leave the city and those who routinely leave. Offline functionality often seems unnecessary at best and absurd at worst to the former group, while the more rural/remote the person the more they value offline functionality. For the most extreme example, talk to the average person who lives outside of Anchorage or Fairbanks in Alaska, and they only really care what the app can do when it's offline, as that is its assumed status when on the go (disclaimer: I moved out of Alaska a little over 5 years ago so things might have changed somewhat).


Yeah, if I'm going to travel internationally or if I'm somewhere I know I'll have spotty cell service, I'll download maps. I should probably be better about doing it in local areas where I "assume" things will be fine.


I grew up in a rural area and lived in Colorado for a while. Going home or venturing into the mountains often resulted in bad service so it just became second nature. Good observation!


Lots of people dislike the design choices in OSMAnd, so it's worth mentioning that there are lots of apps that use OSM data and provide offline maps and routing.


Agreed. In the Google Maps app, there is a feature called "offline maps" which allows a user to select a rectangle on the map and download all the street info inside it. A whole US state can fit in less than a few hundred megabytes. I have the whole city I live in downloaded so I can go on walks without needing to use my data plan.


Not as useful as back when google maps required a 5GB download IIRC


Downloaded a whole island yesterday. Was 40mb. That's a lot better now, has less resolutions packed in tho.


That's assuming you have it on and updated before you hit the road.

I think it's off by default, and I'm guessing most people haven't thought to turn it on, or are even aware of it.


I'm pretty sure maps caches the data around you if you've used it somewhat recently. It saves Google bandwidth too.


I’m not so sure.

Anecdotally, I’ve made it to a remote destination using Maps, then hopped back in the car an hour later (with no signal), and it couldn’t load anything. This seems to happen quite often.


Maps used to expire after 30 days (no idea why), and the auto-updating while on wifi wasn't great unless you were in the app forcing it update. Nowadays they last 365d.


worth noting that without cell service, GPS can reliably give you time, lat, long and elevation. So if previously you had no actual map downloaded, or an old or out of date map, you'd just get a pretty accurate dot on an inaccurate map, or just raw coordinates.


>> Traffic worsened because drivers weren't able to optimize their routes

I'm not sure that is a thing. The vast majority of drivers are on familiar routes and are not navigating via electronic means.

Better question: How are the autonomous cars doing? Are they parked by the side of the road, unable to navigate without cell coverage?


> The vast majority of drivers are on familiar routes and are not navigating via electronic means.

I've been rerouted due to an accident many times, and I've seen the detours get backed up because of people taking more optimal routes (without traffic being redirected via other means).

I'd be curious to see more data on it, but I would speculate it's less than the "vast majority".

> Better question: How are the autonomous cars doing? Are they parked by the side of the road unable to navigate without cell coverage.

Yeah, that falls under my point about Uber/Lyft/taxis. I would speculate there is broader impact from those vs. autonomous cars (that are probably still relatively uncommon).


UPS is known for optimizing route planning for time and fuel efficiency purposes, particularly biasing towards three right turns instead of a left.


It only takes a few people missing an exit and swerving to create a bunch of traffic. So many people are used to not navigating manually anymore I can’t imagine it doesn’t have a big effect.


I have a 20-mile commute. I used my phone on day one at the new job, then never again since. It just isn't worth the effort for a road I've driven literally hundreds of times before. Do people also use google maps to get them from their front door to their garage? From the grocery store to wherever they parked their cars?


It depends on your daily commute.

If I need to drive 20 minutes with most of it on the expressway, and they’re prone to accidents and there are multiple viable routes, I’m 100% going to load it up on Maps every trip, if it will save me being delayed 10-60 minutes every few weeks.

But if I’m going mostly backroads, probably not worth it, since you can more easily go around accidents, and they’re less common.

But again, I’m guessing more city expressway commuters use navigation daily than you think.


I use it every day. There are two roughly equivalent paths that I could take, so I use it for information on traffic conditions, and then I leave it running on the off chance that it might route me around a slowdown that wasn't present at the start of my commute.


I do. Mostly from curiosity about which way Google will suggest I go; sometimes because the traffic or road-closure awareness is useful. Though it's often the case that I know about the road closures before it does -- but sometimes it surprises me in a pleasant way.


Two jobs ago, Google Maps literally suggested a different route from my home to work every single day during my first week at the job. The traffic on the main arterial was terrible every day, but somehow Google managed to find different detours every day to shave off a few minutes.


Also can't login to sites that require SMS 2FA.


This really shouldn't be the only way to verify it's you if it's going to prompt you every single time.


Tell that to $8 trillion Schwab and $12 trillion Fidelity.


If you are their direct customer you should just keep emailing support as much as possible.


Schwab support accidentally broke my account so that it sends 2FA verifications over email and I couldn't be happier.

(Well, I could be happier if they supported TOTP, but I'm not holding my breath)


My service came back around 1:30PM in Connecticut. Data and calls are working fine. I requested a 2FA code at 2:30 from a service that only offers SMS. An hour and a half later, I still haven't gotten it.


Well ... let's be honest: SMS 2FA shouldn't be a thing.

TOTP or stronger, please.


> TOTP or stronger, please

One of the biggest weaknesses with TOTP apps I've tried using is that you have to remember to transfer them to a new phone before you get rid of your old phone. I once got locked out of a domain registrar because I set up TOTP on an old phone many years back. That was long gone by the time I wanted to do something with my domain.

TOTP is fine, but always give me recovery codes I can print out and keep with my other important documents. Too many services don't do that.


SMS 2FA and TOTP aren't mutually exclusive are they?

TOTP -> Time-based One Time Password
SMS -> Delivery mechanism

You can deliver TOTP over SMS.

Obviously, SMS shouldn't be used, but I was under the impression that the delivery mechanism and the code generation algorithm are completely disparate concepts.


Not sure I'm understanding you, but TOTP isn't delivered. It's generated locally/offline, based on the time and a private key that was set up when you first turned on TOTP 2FA.
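
To make the "generated locally" part concrete, a minimal RFC 6238-style sketch (made-up seed; nothing is transmitted anywhere):

    import base64, hashlib, hmac, struct, time

    def totp(secret_b32, period=30, digits=6):
        key = base64.b32decode(secret_b32, casefold=True)       # shared secret from enrollment
        counter = int(time.time()) // period                    # both sides derive this from the clock
        mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
        offset = mac[-1] & 0x0F                                 # RFC 4226 dynamic truncation
        code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
        return str(code).zfill(digits)

    print(totp("JBSWY3DPEHPK3PXP"))   # the server computes the same thing with its copy of the seed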


Isn't that what passkeys are supposed to be? Better and stronger than passwords with TOTP?


TOTP or SMS, it's just another text password you're entering in that's fully phishable.

TOTP just "feels" more secure.


SMS 2FA is a code that you're entering from a phone number. The "risk" is that your phone number can be ported without your permission, and then someone else can get the code.

TOTP is more secure because it isn't tied to a phone number. You're right that it's still phishable but that's not the point.

In both cases, the primary benefit to the general population is to have a rotating credential that, if one website is hacked, is useless on another website.


No, TOTP is far more secure because it has no dependence on a third-party who can mess up in many ways (Denial of service like in this case by being unavailable, Impersonation by allowing SIM swaps or intercepting messages directly).

You fully control how to store the TOTP seed and how you compute the value, so it is far more secure.

Yes, it can be phished if you fall for that, but it removes several attack vectors.


> Yes, it can be phished if you fall for that, but it removes several attack vectors.

How was the first factor (the password) compromised?

Assuming the user is using site-unique passwords, in 99% of cases where an attacker obtains a functional password they can get at least one TOTP code or the seed in the same manner. (ie, if I can steal your password DB, odds are pretty good for me stealing your TOTP seed DB as well.)

The outcome of a single successful authentication is a longer-lived session cookie. Once an attacker has that they can reset your creds (usually just requiring re-entering the password) and the account is theirs.

IMO, the only 2nd factors that matter are those that mutually authenticate, like PassKeys / FIDO keys.


> You fully control how to store the TOTP seed

Sorta. The seed still needs to be issued to you in some way.


TOTP is more secure in that you can't be simjacked by someone impersonating you in the cell phone store.


That's assuming your attacker already has your password, or the service allows SMS password reset. (thus negating the second factor. Essentially SMS becomes the only factor.)


> What is/were the cascading effects of this, particularly for drivers?

I wish for an economic system in which all causes could be backpropagated to the source and the source be held responsible.

If for example I lost 2 hours of my time today because I had to fight with Comcast, Comcast should be charged for 2 hours' worth of my hourly salary.

If I lost a job offer because of bad interview performance because of heating issues because of bad maintenance on the part of the landlord, the landlord should be charged for the difference in time until I get my next job offer or the difference in salary until the next job offer.

If I had to fight health insurance for 5 hours on the phone due to incorrect bill and that caused me additional stress that caused my condition to worsen, health insurance should be held liable for the delta effects of that stress.

In this case the cellular operators in question would be held liable for the lost incomes of those drivers plus the lost incomes of passengers who lost money because they couldn't get to their destinations on time or missed flights and had to rebook them.

I know this level of backpropagation is hard to implement in the real world but it would be awesome if the entire world were one big PyTorch model and liabilities could be calculated by evaluating gradients.
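
For fun, a tiny sketch of what "liability by gradient" might look like, with completely made-up numbers:

    import torch

    # Toy model: my lost income as a differentiable function of upstream failures.
    comcast_outage_hours = torch.tensor(2.0, requires_grad=True)
    landlord_neglect     = torch.tensor(1.0, requires_grad=True)

    hourly_rate = 80.0
    my_loss = hourly_rate * (comcast_outage_hours + 0.5 * landlord_neglect)

    my_loss.backward()                 # backpropagate the damage to its sources
    print(comcast_outage_hours.grad)   # tensor(80.) -> bill Comcast $80 per outage-hour
    print(landlord_neglect.grad)       # tensor(40.) -> bill the landlord $40 per unit of neglect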


> Maybe larger companies have their own infra for this, or have redundancy in place (e.g. their own GPS devices)?

Modern trucks have cell modems tied to a private APN that are used for updating vehicle firmware & doing telematics. They also typically have a route to the internet that provides a WiFi hotspot in the cab.

Depending on where the fault was in the telco stack, that APN may have still been functional.

Not saying this was a significant resolution, but at least a possibility.


In Australia we recently had a telecom outage with Optus; there was an untold amount of damage:

- card payments at shops/cafes were out
- rural towns were completely cut off (a few in particular are only serviced by Optus)
- emergency services were unavailable; for example, a snake wrangler was unable to receive his call-outs
- hospital infrastructure came to a halt

And I'm only going off of examples I have heard. These outages are very damaging.


cashiers always look at me dumbstruck when I tell them about the mechanical offline credit card machines we had when I was a cashier back in 2004


If we're thinking of the same thing, I have fond memories of those at mall department stores in the 80s and 90s.

I think Sears around '04-06 (?) was the last time I saw one of those used. I think I bought a dehumidifier or air purifier.

When they started rolling out credit and debit cards without the raised numbers I thought fondly of those and how they were definitely done for now.


Oh yeah I didn’t even think about the raised numbers thing! Cash is still king.


Minor in the grand scheme of things, but I thought interesting (or at least unexpected) - a bunch of rides at Disney World went down because they rely on AT&T push to talk for communication between staff, which is required for safety reasons to run some attractions.


What about those mobile card readers like you see in small businesses and food trucks and such? I've never owned one, but assumed they ran over cellular.


There are some models that can batch transactions throughout the day, and then upload them at the end of the day.
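
Roughly the store-and-forward pattern being described, as a sketch with made-up fields (not any vendor's actual terminal firmware):

    import json, time

    offline_batch = []   # transactions captured while the network is down

    def capture_offline(card_token, amount_cents):
        # Accept the sale now and queue it; the merchant wears the risk of a later decline.
        offline_batch.append({"card": card_token, "amount": amount_cents, "ts": time.time()})

    def upload_end_of_day():
        # When connectivity returns, submit the whole batch for settlement in one go.
        payload = json.dumps(offline_batch)
        print(f"submitting {len(offline_batch)} queued transactions ({len(payload)} bytes)")
        offline_batch.clear()

    capture_offline("tok_demo_1", 1250)
    capture_offline("tok_demo_2", 475)
    upload_end_of_day()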


I used to think having a worldwide, or citywide, WiFi network as a backup was a stupid idea. Maybe not so anymore.


We need to maintain paper-based systems of information storage and retrieval. People should be familiar with a physical map. If we are too dependent on the technology, that is a risk.


Just keeping the paper isn't a solution, people need to know how to use the back up, and use it regularly. When I delivered pizzas, we had a big paper map of the city that we used to consult for deliveries, drivers quickly learned where nearly all of the streets in the city were and how to get there. For most deliveries, drivers just knew where to go, for the rare times I didn't, I either remembered the main street near my delivery or wrote down some notes on the box. Someone marked new streets on the map, as well as the names of major apartment complexes.

Just having that map on the wall isn't going to do any good since without regular use, no one's going to be able to use it effectively. And it's doubtful that people can be forced into using it.


To everyone trying to speculate on the root cause, I haven't seen enough information in any of the comments to really draw any conclusions. Having worked on several nationwide cellular issues in Canada when I worked in telecom, we saw nationwide impacts based on any number of causes.

- A new route injected into the network caused the routing engines on a type of cellular-specific equipment to crash nationwide. This took down internet access from cell devices only, nationwide. But most people didn't notice because it happened during the 2AM maintenance window and was fortunately discovered and reversed before business hours, once we figured out why the routing engine was in a crash loop.

- A tech plugged in some new routers, and the existing core routers crashed and rebooted. While the newsworthy impact was just a regional outage for something like 20 minutes, we discovered bugs and side effects from the Pacific to the Atlantic coast over the next 12 hours. So when you say you're impacted at location x, that data point could mean everyone is down in the area, many people are having issues, or only one or two people have issues that spilled over to some other region. This is why seeing that it does or doesn't work in location x is of limited value, as almost every outage I've investigated could result in some people still having service for various reasons. The question is, in a particular area, is it 100% impact, 50% impact, or 0.001% impact?

- A messaging relay ran into its configured rate limit. Retries in the protocol increased the messaging rate, so we effectively had a congestion collapse in that particular protocol. Because this was a congestion issue with passing state around, there were nationwide impacts, but you still had an x% chance of completing your message flows and getting service.
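
A toy model of that retry amplification (not the actual protocol, just the arithmetic): anything the relay rejects comes back later, and if it all comes back immediately, a short burst keeps the relay pinned well past the burst itself.

    def simulate(new_traffic, capacity, backoff):
        # Toy rate-limited relay: attempts over 'capacity' are rejected and retried
        # 'backoff' ticks later. backoff=1 means everything retries immediately.
        pending = {}
        for t, new in enumerate(new_traffic):
            attempts = new + pending.pop(t, 0)
            rejected = max(0, attempts - capacity)
            if rejected:
                pending[t + backoff] = pending.get(t + backoff, 0) + rejected
            print(f"t={t:2d} attempts={attempts:4d} rejected={rejected:3d}")

    burst = [150] * 5 + [90] * 15              # brief overload, then traffic drops below capacity
    simulate(burst, capacity=100, backoff=1)   # immediate retries: attempts climb to ~3.5x capacity
    simulate(burst, capacity=100, backoff=4)   # spread-out retries: much smaller overshoot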

And then there was the famous Rogers outage, where I don't remember them admitting to the full root cause. It's speculated that they did an upgrade or change on their routing network, which also had the side effect of booting all the technicians from the network. Then recovery was difficult because the issue took out the nationwide network and broke the ability for employees to coordinate (you know, because they use the same network as all the customers who also can't get service). All the CRTC filings I reviewed had all the useful information redacted though, so there isn't much we can learn from it.

So it's fun to speculate, but here's hoping that at the end of the day AT&T is more transparent than we are in Canada, so the rest of the industry can learn from the experience.


Rogers, of course, blamed their vendor (Ericsson I believe it was). Rogers can do no wrong!

Of course, it was fun to see yet another huge org have no back-out/failure plan for their potential enterprise-breaking changes. No/limited IT 101 stuff here.

The only positive thing we learned was that the big 3 (really 2) telcos thought it would be a good idea to give each other emergency backup SIMs for the other network to key employees in case their network went down. They did that in 2015, but better late than never.

Fun that Rogers used the same core for wireless and wired connections, so many of us were in total blackout, even if we used a 3rd party internet provider that ran over Rogers. Like, everything including their website was down, corp circuits, everything with non-existent comms from Rogers.

Thankfully my org was multi-homed and switched over its circuits at 6am so on-site mostly continued without issue.

Also fun where the towers remained just powered on enough for phones to stick to them but not be able to do anything, so 9-1-1 calls would just fail, instead of failing-over to other networks. Seems like a deficiency in the GSM spec (or Rogers SIM programming?) that I don’t think was actioned on.

https://en.m.wikipedia.org/wiki/2022_Rogers_Communications_o...


> Also fun where the towers remained just powered on enough for phones to stick to them but not be able to do anything, so 9-1-1 calls would just fail, instead of failing-over to other networks. Seems like a deficiency in the GSM spec (or Rogers SIM programming?) that I don’t think was actioned on.

Actually, I think this is going to change after the Rogers outage, it's just slowly happening behind the scenes so it's not getting much attention these days. The government has mandated a lot of industry response to failover between providers... we'll see where they land after all the lobbying happens. I do think implementations are changing a bit around this, mostly in the phones so that they give up and go into a network scan if the emergency call is failing.

I worked mostly on core network stuff, so I was a layer removed from the towers, but if they hadn't lost management access they would've been able to tell the tower to stop advertising the network and 911 service. I do understand the question, from a vendor implementation perspective, of how automatic this should be though... because automation in this regard does have some of its own risks and could complicate some types of outages or inadvertently trigger and confuse recovery of problems.

I'm with you though there should be an automatic mechanism to fail over to other network operators, I just haven't thought through all the risks with it and I hope the industry is taking their time to think through the implications.


> I do think implementations are changing a bit around this, mostly in the phones so that they give up and go into a network scan if the emergency call is failing

It seems like this is a global problem, since all Rogers-subscribed devices in a Rogers reception area couldn’t make 9-1-1 calls. But could be a SIM coding issue and not afflict other providers elsewhere.

I just always imagined the GSM spec was so resilient that you could always make a 9-1-1 call if a working network was available but this outage proved that wrong. Surprising to learn in 2022.

Of course it’s Canada, so I agree with them that the thought of letting users failover to a partner for everything would thrash the partner’s networks. Even though Canadian subscriber plans are laughably low in monthly data and population density is low (per the telecom’s usual excuse for our high prices) it turns out the telecoms still underbuilt their networks to have less capacity than what other networks internationally built out to support plans available on the international market (e.g. close to truly unlimited data/free long distance calls)


> I just always imagined the GSM spec was so resilient that you could always make a 9-1-1 call if a working network was available but this outage proved that wrong. Surprising to learn in 2022.

The "X is broken but claims it isn't, so failover never kicks in" pattern is strong all over networking. It's not unusual to see it in telco root cause analysis.


> I just always imagined the GSM spec was so resilient that you could always make a 9-1-1 call if a working network was available but this outage proved that wrong.

As I recall it is slightly more nuanced than this and was particular to the failure mode, with a couple of different things aligning to create it.

If your phone is just blank, with no SIM card, then to make an emergency call it has to just start scanning all the supported frequencies. This is very slow: tune the radio, wait for the scheduled information block that describes the network on the radio protocol. See if it has the emergency services bit enabled. If not, tune to the next frequency and try again. I used to remember all the timers, but almost a decade later I can't remember all the network timers for the information blocks.

The SIM card interaction is, say you're at home and you boot up your phone with 100% clean state. You don't want to wait for this scan to complete, so the SIM card gives the phone hints about which frequencies the carrier uses: start on frequency x to find the network. But if you roam internationally, it can take a lot longer to find a partner network, and there are some other techs around for steering to preferred partners, but I don't know that those come into play here. I don't know, but I would be surprised if there is a SIM option to try and pin emergency calls to a network; I think it's more likely the interaction is this hint on where to start the scan.

The way the Rogers network failed, it appears to me it caused the towers to stay in a state where they advertised in their radio block that the network was there, and the 911 bit was enabled so the network could be used for emergency calls. This is where I don't really have the details, since they haven't been public about how much of their network was still available internally. Maybe the cell towers could all see each other, that network layer was OK, and the signalling equipment was all talking to each other as well. That's the part I don't really know and have to speculate about, as well as the tower side, since I was a core person. So because the towers had enough service to never wilt themselves, they kept advertising the network, along with the 911 support. But then when you try to activate an emergency call, somewhere in the signalling path, as you get from tower to signalling system, to the VoIP equipment, to the circuits to the emergency center, the outage knocked something out. Oh, and for all these pieces of 911 equipment, there are two of everything for redundancy... two network paths, two pieces of equipment, etc.

And because they lost admin access to their management network, no one could go in manually and tell the towers to wilt themselves either.

If the towers had just stopped advertising 911 services, the phone would fall back into the network search mode I described for when you have no SIM card. It just starts scanning the frequencies until it sees an information block for a network it can talk to with emergency support advertised, and does an emergency attach to that network, which the carriers will all accept (an unauthenticated attach for the sole purpose of contacting an emergency center).

So my suspicion is that because carriers are so used to having two of everything, and all emergency calls are marked for priority handling at all layers of the equipment (they get high-priority bits on all the network packets and priority CPU scheduling in all the equipment), this particular failure mode, where there was a fault somewhere down the line and they lost control of the towers to tell them to stop advertising 911 services, all sort of played together to create the outage.
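
A rough phone-side sketch of the decision logic described above, with invented names and toy data (the real behavior lives in the baseband and the 3GPP specs):

    # Invented names and toy data; only the decision logic is the point.
    TOWERS = {
        # freq: (advertises 911 support?, call path behind it actually works?)
        850:  (True,  False),   # the broken carrier: still advertising, dead behind the tower
        1900: (True,  True),    # a competitor's tower that would have worked
    }

    def place_emergency_call(sim_hint_freqs, all_freqs):
        # Start on the frequencies the SIM hints at, then fall back to a full (slow) scan.
        scan_order = list(sim_hint_freqs) + [f for f in all_freqs if f not in sim_hint_freqs]
        for freq in scan_order:
            advertises_911, call_path_ok = TOWERS.get(freq, (False, False))
            if not advertises_911:
                continue                      # tower has wilted; keep scanning
            if call_path_ok:
                return f"call placed via {freq}"
            # The outage's failure mode: the tower still advertised 911 support, so the
            # phone stopped scanning here even though everything behind it was broken.
            return f"call failed on {freq}; other networks never tried"
        return "no usable network found"

    print(place_emergency_call(sim_hint_freqs=[850], all_freqs=[850, 1900]))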


Multi-faceted failure mode.

0) At the network terminal level (mobile phone): at least for emergency calls if a given network fails to connect, fail over and try other networks. Even if the preferred networks claim to provide service.

1) At the network level: failure thresholds should be present. If those thresholds are crossed enter a fail-safe state. This should include entering a soft offline / overloaded response state.

2) Where possible, critical data paths should cross-route. Infra command and control and emergency calls in this case. Though if Rogers' issue was expired certs or something, the plans for handling that get complicated.


it’s that “0” level that surprised me the most here.

Days later, Rogers said you might be able to pull out/disable your SIM card to call 9-1-1, but then it depends: if Rogers is the strongest network, you might end up in the same predicament anyway.


Agreed, 'zombie state' is a valid failure mode. Partners / Infra can think they're alive and respond but be non-functional. As can agents spoofing valid infra but always failing operations.


Say there is an outage at the 911 call center. Now you try to call, don't get through, and your phone writes off that tower. Who were you planning to call after 911? Too bad, should have placed that call first.


Your phone would try other towers from other providers. If 911 is experiencing an outage that’s a separate issue that needs to be mitigated at a different layer. Even still, 100% uptime is difficult and expensive.


> Fun that Rogers used the same core for wireless and wired connections, so many of us were in total blackout, even if we used a 3rd party internet provider that ran over Rogers.

If it ran over Rogers circuits then why wouldn't it go down too? Isn't that the case everywhere?


I just know that a part of Rogers’ response was to separate their cores between wireless and wireline so that the risk of both going down simultaneously would be reduced.

The 3rd party providers aren’t white-label resellers, but there’s obviously some overlapping susceptibilities to going down when Rogers breaks something. Depends what they break, and in this case, it took them down too.


The speculation is fascinating. For most people, their guess is a reflection of themselves. Is there a term for that? This is a gross generalization, but I've seen...

- Science people guessing solar flares
- My "right-wing friend" guessed international hackers
- I, myself, guessed it was a botched software release
- Someone in this post commented their military friend says get gas

And yet, like everyone else, I genuinely feel that I'm probably right


My speculation is: “Higher-ups kept demanding that technicians ‘do more with less’ in order to deliver on quarterly metrics and now we’re finally seeing the cumulative result of employees being stretched thin, underpaid, and overworked.”

You are welcome to infer as to why I’m thinking this way!


So how is the job search going?


The ops team can run the whole company (and better) without the C-suite is my impression of modern-day SV. Agile stickers on waterfall gates…


Obviously you’re a self loathing executive.


This is my bet; and mayyybe some external bad actors taking advantage of the situation on top of that.


> Someone in this post commented their military friend says get gas

The Rogers outage in Canada took out the nationwide debit card payment network because that infra depended on Rogers. Credit cards still worked, but depends on your station’s access to make the transaction. And no shortage of shops running their POS “in the cloud” and needing to close if they lose internet access. I actually did have to lend cash to a colleague to buy gas to get home during that Rogers outage.

All it takes is for one pipeline valve to depend on a cellular connection for billing to get the whole line shutdown.

And ugh, we hope for a botched software upgrade too, but a corp cyberattack is so much harder to recover from so can’t be discounted from the realm of possibilities. I know that’s where my mind went with Rogers given how thorough their outage was.

Was kinda unimaginable for a total outage to happen with no org comms ready to go in the pipeline. Your plans are supposed to have those comms ready for a bad update that you’ve been planning for weeks. It’s a cyberattack where you may stay silent. But I know Rogers isn’t going to admit fault until they find someone else to blame.


PoS devices are usually networked. If you don't validate transactions in realtime you would later validate in batch, but that has more risk than validating at the time of transaction.


> If you don't validate transactions in realtime you would later validate in batch

yeah, a lot of orgs just don't enable that (or don't have a process to enable it as required, and have difficulty pushing out a notice to do so if the network is down!).

Also can only do offline credit card transactions. Can't with our Interac (Canadian-only) debit network. Unsure about Visa/Mastercard debit transactions.


> Unsure about Visa/Mastercard debit transactions.

AIUI, the debit card itself enforces online confirmation, even if the transaction goes through the credit card rail.


> And yet, like everyone else, I genuinely feel that I'm probably right

This is the thing with black swan events. The more pedestrian explanations are almost always true, but then there's a tiny fraction of the time where you're much, much better off having taken a bit of an alarmist view.


It looks like you were right:

>A temporary network disruption that affected AT&T customers in the U.S. Thursday was caused by a software update, the company said.

>AT&T told ABC News in a statement that the outage was not a cyberattack but caused by "the application and execution of an incorrect process used as we were expanding our network."

https://abcnews.go.com/US/att-outage-impacting-us-customers-...

https://news.ycombinator.com/item?id=39477187


I literally caught myself thinking about a cyberattack merely because it's sort of exciting (albeit terrible). And then realizing that despite its prominence in my mind, it's probably not the most likely cause (although certainly plausible still). And furthermore, that my mind gravitates to that without any real information suggesting it over other explanations. More about fearing the worst instead of what you want, I think.


> I genuinely feel that I'm probably right

We are wired that way for a reason. Until you personally see conflicting evidence you have to make an assumption or you would spend your life paralyzed or ignorant.

Biology rewards action more than accuracy.


A type of availability bias, maybe?

https://en.wikipedia.org/wiki/Availability_heuristic


> Is there a term for that?

Projecting, biased.


I don't think it has anything to do with routing, judging by the comments on Down Detector. Many people report they are in the same household, and one person out of six (in the household) experiences the problem while all are on AT&T. It sounds more like an upgrade that went through halfway or, considering the time it happened, maybe a rollback that went only halfway through.


> All the CRTC filings I reviewed had all the useful information redacted

Is this common in the industry?


> some people still having service for various reasons.

I assume roaming is one of the top reasons, no?


Verizon and T-Mo both issued statements that they have no outages and the issue is just their customers being unable to call AT&T customers. Looks like most of the AT&T network in the US is down though.


A theory for the reported Verizon/T-Mobile issues is that when AT&T went offline, all of those phones went into SOS mode and tried to register on the remaining available networks (Verizon and T-Mobile) to allow 911 calls to be made. The surge in devices registering at once may have overloaded some parts of those networks.


Uhm, no.

You don't need to register to be allowed to make 911 calls. You 'register' (it's not a regular registration) at the moment you place the actual 911/112 call. At least that's how it worked in 2G/3G networks; I doubt it has changed.

There is always some number of terminals without a SIM, or with a non-working SIM, and there is no need for them to hammer every available network in the vicinity just in case an emergency call might be needed.


Data point on ATT (via MVNO) in Atlanta: was connected until ~11:00 EST, then booted off and haven't reconnected.


My wife has google-fi and her coworker has verizon. Both of them say they can't make calls.


I've been tethering with T-Mobile as my primary internet connection and that's been working just fine. Voice also works for me with both TMo and Google Fi.


Any chance they’re trying to call an AT&T customer?


Anecdotal, but I have Google Fi and was on a ~1 hour call this AM during the height of the outage and had zero issues.


And Google Fi uses T-Mobile, I believe.


[flagged]


That sounds entirely unrelated to this outage which began only a few hours ago.


It’s likely unrelated, but bad timing for AT&T as they have applied to end landline service in some areas of California.

https://www.wired.com/story/att-landline-california-complain...


Outage Over

Status: Restored (FINAL), Service Degradation, AT&T Global Smart Messaging Suite

Event description: FINAL, Service Degradation
Impacted services: MMS MT
Start time: 02-21-2024 22:00 Eastern, 21:00 Central, 19:00 Pacific
End time: 02-22-2024 11:00 Eastern, 10:00 Central, 08:00 Pacific
Downtime: 780 minutes

Dear Customer, We are writing to inform you that Global Smart Messaging Suite is now available. The MMS MT service has been restored and our team is currently monitoring. Thank you.

Kind Regards,
AT&T Business Solutions
The AT&T SMS Service Administrator


I think a communication like this should include that they are investigating the root cause (assuming they aren’t completely sure) and that they will share it, and state where.

Maybe I'm reading too much into it, but it bothers me that that's not in the communication.


They most certainly are investigating the root cause, and probably there's a witch hunt developing, but as far as customers go I would expect AT&T's attitude to be "none of your business." I've worked with many of these types of companies before, and outside of the occasional cool CS rep, their cultures are lots of information hoarders and responsibility dodgers. Taking responsibility for a problem is a good way to ensure you never get promoted.


"A recently departed employee had a core router's power going through a wall switch. This was done to facilitate quick reboots. A cleaning contractor turned off the switch thinking it was a light. It took us several hours to determine the situation and restore power"


I think the telecom issue playbook is significantly different than the SaaS playbook. Not sure if that’s just cultural or if there are other drivers - maybe paying customer telecom interfaces are simpler and more closed than typical SaaS?


IME, telecom as an industry is highly focused on the RCA, ICA, and uptime, and has had that embedded culturally for decades. Sharing the information publicly doesn't have much value, in the balance, unless there are a string of incidents where an acute perception problem needs to be addressed. This would more likely result in a marketing and advertising strategy rather than the sharing of technical RCA details. Additionally, one must consider that not all RCA details are fit for public disclosure. _You_ may be interested in deets, but John Q. Public is not interested beyond "Is it fixed yet?". If you want insider perspective, work in/with the industry. It's fascinating stuff.


Does that include cellular voice calls?


This reminds me of the recent discussion on status pages. https://news.ycombinator.com/item?id=39099980

They need to be accurate. AT&T's status page claims everything is fine.

My wireless service is down. Down detector has tens of thousands of reports, so clearly everything is not fine.


Status pages are basically useless if they’re public facing.

Either they automatically update based on automatic tests (like some of the Internet backbone health tests) or they’re manually updated.

If they’re automatic, they’re almost always internal and not public. If they’re manual, they’re almost always delayed and not updated until after the outage is posted to HN anyway.
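For what it's worth, the automatic-and-public variant doesn't have to be elaborate. Here's a rough sketch (the endpoints and thresholds are made up) of an external probe that could feed a status page hosted off the provider's own infrastructure:

  import time
  import urllib.request

  # Hypothetical endpoints, probed from outside the provider's own network.
  PROBES = {
      "api": "https://api.example.com/health",
      "portal": "https://www.example.com/health",
  }

  def probe(url, timeout=5):
      # Returns (ok, latency_seconds) for a single HTTP check.
      start = time.monotonic()
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              ok = 200 <= resp.status < 300
      except Exception:
          ok = False
      return ok, time.monotonic() - start

  def is_down(url, attempts=3, wait_s=10):
      # Require several consecutive failures so a one-off blip isn't an "outage".
      for _ in range(attempts):
          if probe(url)[0]:
              return False
          time.sleep(wait_s)
      return True

  def current_status():
      return {name: ("down" if is_down(url) else "up") for name, url in PROBES.items()}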


The other problem with status pages is that, depending on what happened, it may not be possible to update the status page anyway. You really need a third party to have a useful status page.


Which is pretty much what down detector has evolved into. And it looks like they have an enterprise offering to alert companies to their own issues.


Which is better? How do you know whether an issue is individual to a customer or a quick blip that will resolve in a few seconds?


I prefer fully automated tests publicly revealed because the main thing I want to know (as a customer) is should I keep trying to fix my end or give up because GitHub exploded again.

It’s most annoying when you have something like recently - known maintenance work on my upstream home fiber connection that was resulting in service degradation (but not complete loss, my fiber line was back to DSL or dialup). The chat lady could see that my area was affected, but the issue lookup system couldn’t.

If the issue lookup had told me there was an issue, I'd've gone on my merry way.

I even checked a few more times until it was resolved; the issue never appeared in the issue lookup system.


> should I keep trying to fix my end or give up because GitHub exploded again

Making this decision easy is a fight I fight for my customers every day. :)


This was much, much, much easier when websites used to explode with tracebacks and other detailed error messages; now you just get a "whoopsie doopsie we did a fuckywucky" and you can't really tell what's going on.


you can't operate at any scale at all without mechanisms in place to know perfectly well whether an issue is impacting a single customer or if your world is on fire


You'd like to think so, but a surprisingly large number of "large scale" things operate on "everything is fine" until too many people complain about the fire.


Caches make problems fun too.

Quite often you see automated tests that check how well your cache/in-memory data path is working, but when some other customer who isn't in the hot path makes a request, it times out. I've seen a lot of automated checking systems fail at exactly this.
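One way to catch that, as a sketch (the callable name is made up): have the synthetic check deliberately miss the cache, e.g. by probing with a random key, so it exercises the same cold path a customer outside the hot set would hit, and treat "slow" as a failure too.

  import time, uuid

  def check_cold_path(fetch_from_backend, timeout_s=2.0):
      # fetch_from_backend: hypothetical callable that takes a key and goes
      # to the backing store directly (a miss for an unknown key is fine).
      probe_key = "healthcheck:" + str(uuid.uuid4())  # never cached: forces the cold path
      start = time.monotonic()
      try:
          fetch_from_backend(probe_key)
          healthy = True
      except Exception:
          healthy = False
      latency = time.monotonic() - start
      # From the customer's point of view, a slow cold path is also a failure.
      return healthy and latency < timeout_s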


The phrase "the two hardest things in computer science are cache invalidation and naming things" comes to mind.


I see 2 things here but you're off by one.


Yes, but those mechanisms take time to determine this.


> They need to be accurate

It would be nice if the FTC mandated this. It is exhausting when the status page is taken over by the marketing department (the infamous green check with the little "i").


> the infamous green check with the little "i"

I'm not familiar, what are some examples?


Currently there is a banner on the AT&T outage page with this message:

>Service Alert: Some of our customers are experiencing wireless service interruptions this morning. We are working urgently to restore service to them. We will provide updates as they are available.

https://www.att.com/outages/


Which status page?


No info in NANOG yet but expect some in this thread in the coming hours: https://mailman.nanog.org/pipermail/nanog/2024-February/2250...


> Word around the campfire is that it’s a Cisco issue.

https://mailman.nanog.org/pipermail/nanog/2024-February/2250...


The myATT app doesn't even show my wireless account anymore. My entire family account doesn't even show up as a service. Seems like a hack or an internal issue that deleted accounts? Can others confirm whether they see their accounts?


Many of their APIs appear to be intermittently returning 502s, leading to strange behavior in their web/mobile apps.


That's just people trying to figure out if their service has been disconnected rather than it being a network outage


From what I read just a bit ago, basically there is a problem with the database of SIM numbers. So, SIMs all just dislodged from the network because they lost their network authorization. That would lead one to believe it was a botched software push. I imagine online accounts get this information somehow which could explain the portal being broken. I have a prepaid hotspot and it works fine, but none of the "family plan" month to month contract phones work. I also wonder if there's a physical SIM vs eSIM situation that could explain "newer" models working.


That would be consistent with the symptoms. Big telco networks are hierarchical with most functions pushed to regional data centers with a very small number of services in a redundant pair or trio of central data centers. Subscriber database (HSS, UDR) would be one such function.

The cause of a failure of the HSS could be manifold, ranging from router failures to software bugs to cyber attack (databases of 100M+ users being a juicy target).

One slightly scary observation from NANOG was that FirstNet, the network that ATT built for first responders, was down. That would be ugly if true and I'd expect the FCC to be very interested in getting to the bottom of it.


In our house this morning, the two phones with physical SIMs worked fine and the two phones with eSIMs were SOS mode only.

I could log into my AT&T account just fine and all phones showed up correctly.

(I’m submitting this from an AT&T 5G connection, no WiFi nearby)


Aligns with my experience too. My wife's newer phone was in SOS mode but my older one was fine. Both AT&T.


Their outage status page is also completely broken. Doesn't show anything.


I can see my account. The first thing I did when I saw my phone wasn't working was log in and pay my bill, thinking maybe I had missed one or something.


The account management / status tools were slow and flaky on the best days. I wouldn’t rule out a little extra traffic knocking them out. Correlation is not necessarily causation.


There's an outage map.[1] But it's useless. That's just a US map of where most people live.

[1] https://www.cbsnews.com/news/outage-map-att-where-cell-phone...



Seeing "SOS" only on iPhone currently. I got worried something had gone wrong with auto-bill pay since I only noticed after I was driving.

It's interesting how naked I feel without access to the internet. I reach for it way more often than I would have ever guessed, something you only notice when it's not there. Last March my area saw large wind storms that knocked out power for almost a week (I'm not in a rural area). I can work around the loss of power but the cell tower(s) that service my area could not handle the load and/or the signal in my house was weak and I was unable to load anything. Not having internet was way worse than not having power and I ended up driving a few hours away to my parent's house instead of staying home.


My earliest computers were amazingly capable and powerful devices, I could do anything I could think of and spend hours and hours on them.

Now my computer is insanely more powerful but without an Internet connection it feels dead and useless.


Optimize some low-level numeric algorithms, CPU or GPU, it brings back that feeling.


It's even sadder: I used to be able to play computer games for hours offline; now I get about five minutes into even the ones WITH an offline mode and I'm grabbing for a wiki or other reference. Ah, some of it is just getting old.


I think some of it is just not having oodles of free time to figure it out on your own. When I was young I would just keep trying things till I figured out the game, my time wasn't worth much or at least I didn't value it highly. Nowadays I don't want to spend 1-3 hours figuring something frustrating that's blocking my progress. The "rush" I get from solving it on my own does not make up for the time lost. Also I feel like games made today almost expect you will need the wiki/guide to figure out certain things. Or at least I often think "How the heck was I supposed to figure that out?" when reading the wiki for some aspect of a game I'm stuck on.


So.. it's an adult strategy for playing video games to extract a candy coated "win." We are all either overgrown children and always will be, or something has gone drastically wrong in the schools.


That latter part is certainly true; offhand, only a few games even really try to work "wiki-free" (Factorio is perhaps the best here, but Minecraft is trying).


I started driving across the US at 3am, didn't notice for the first few minutes until I tried pulling up the address in Apple Maps. Sure was strange following interstate signs for ~10 hours!


Yes I've felt the same way. I feel like we have an instinctual need for social connection that we've filled with internet. Luckily, we do still have meat-space friends and family.


Yeah, that was a big reason I went to stay with my parents/family. I felt super isolated from my friends (local and remote) when I couldn't participate in group chats/communicate. Also I just kept picking up my phone to look up something or check on something only to re-remember I couldn't do anything. I had podcasts and audiobooks on my phone which helped but the isolation was a weird feeling I hadn't felt before. After I thought about it I realized it had probably been a decade or more since I had been completely without internet for more than a few minutes. It was odd...


Portland, is that you?

This happened to us with the recent storms a month or two back, some places didn't have power restored for 2 weeks+


Portland checking in. Those storms were gnarly and there was carnage all around us. Luckily we maintained power and internet. We have 11 month old twins and a three year old so 10 days without childcare or help was its own challenge.


Lexington, KY. Not a massive city but the second largest in KY. I left after 2 days of no power and it didn't come back on for another 3-5 days more after that depending on where you lived.


Pure speculation, but CISA released this [1] a few weeks back and tweeted [2] it out.

[1] https://www.cisa.gov/news-events/cybersecurity-advisories/aa...

[2] https://twitter.com/CISACyber/status/1758495005176447361


My AT&T phone is in SOS mode. However, AT&T's outage status page reports:

  > All clear! No outages to report.
  > We didn’t find any outages in your area. Still having issues?
https://www.att.com/outages/


At least they're in SOS mode. When Rogers in Canada had a total blackout (cellular, home internet, MPLS, corporate circuits, their radio stations, everything), phones showed zero bars, but the towers were still powered on and doing some minimum level of handshake, so phones didn't go into SOS mode.

If you tried to make a 9-1-1 call, it would just fail. It wouldn’t fail over to another network because the towers were still powered up but unable to do anything, and Rogers couldn’t power them down because their internal stuff was all down.

Like a day later they said you could remove your SIM card to do a 9-1-1 call. Thanks guys.

Of course, there was no real info from the provider during the outage. It turns out they did an enterprise-risking upgrade on a Friday morning, and nobody at the org seemed to have a "what if this fails" plan. The CTO was on vacation, and since roaming phones were dark too, he thought it was just an issue on his end.

https://en.m.wikipedia.org/wiki/2022_Rogers_Communications_o...


Some people earlier on this morning said they couldn't make 911 calls. I wonder if it was the same issue and perhaps AT&T cut the towers completely pending a fix. Purely speculation.


Our land-based internet with a different [large] ISP went out about 18 hours before this wireless trouble started, and that outage is still ongoing. We've been getting similarly contradictory messaging the whole time, and they seem confused about what's causing it. We got a message a couple of hours ago saying it had been resolved, and then 30 minutes later another saying it was not.

It could be entirely coincidental and unrelated to the stuff with other networks, but the timing was odd and I have never ever seen anything like this outage from them. I can think of one time it was out for around 2 hours in the last 5 years, and it was with a very specific infrastructural upgrade they knew about.


I'm seeing a variety of outages listed there as of 08:30 Pacific, mostly landline. There are a couple wireless outages shown in Sonoma (and listed as impacting Sonoma and Ventura counties). The initial cause is shown as "maintenance activity".

https://imgur.com/a/oXZpEX9


I’m seeing an outage reported on this page.


Latest AT&T Statement: “Our network teams took immediate action and so far three-quarters of our network has been restored,” the company said. “We are working as quickly as possible to restore service to remaining customers.”

Still down for me though.


3/4 might just mean the internal facing side, which is still progress, but doesn’t mean any improvements for end-users.


Anecdotally, I woke up to no signal / “SOS” mode on my iPhone this morning at around 0600 and had service restored around 0830 in South Carolina. However, a coworker in Memphis confirmed he was still out of service at 1000 so it’s regional restoration.


I always wonder whether, instead of a regional restoration, they would "disable" random segments of SIMs/accounts to avoid a lightning strike (it's not a DDoS…) on their network as they turn things back on. It depends on what the recovery method is, but turning everything back on at once could be problematic.
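Whatever the operator does server-side, the usual client-side complement looks roughly like the sketch below (purely illustrative, not how any carrier actually handles re-attach): retry with capped exponential backoff and full jitter so millions of devices don't hammer the subscriber database in the same second.

  import random, time

  def reattach_with_jitter(register, base_s=5, cap_s=600, max_attempts=10):
      # register: hypothetical callable that tries to re-attach to the
      # network and returns True on success.
      for attempt in range(max_attempts):
          if register():
              return True
          # Full jitter: sleep a random amount up to the capped exponential delay.
          time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
      return False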


I've heard that called the thundering herd problem.


I actually spoke with my wife after my initial comment; she was reconnected later than I was. So it's not regional but some other mechanism.


I've never heard lightning strike, always thundering herd.


My wife's phone was in SOS mode when she woke up at about 6 AM Central and finally became operational around 1:30 PM Central.


Just came back up for me in Oklahoma City, OK.


Also still down for me here around Nashville.


Still down for me too.


down all morning in ATL but back up at 1PM EST


Back up in Cartersville, north of ATL, at 13:12. Oddly enough, my text messages say they went through at 12:43, but my response to someone's message, sent once my phone had everything roll in at once, is timestamped 13:12.


I wonder if this is a cyber incident. Curious if any telecom folks know what the most likely explanation for an event like this would be, and what telltale signs/symptoms might first indicate this was caused by something nefarious.


Due to the gross incompetence these companies operate at, it's too hard to tell the difference.


Unfortunately, unlike cyber security, there are no off the shelf products that are being sold to help companies with general incompetence.


> However, the US Cybersecurity and Infrastructure Security Agency is “working closely with AT&T to understand the cause of the outage and its impacts, and stand[s] ready to offer any assistance needed,” Eric Goldstein, the agency’s executive assistant director for cybersecurity, said in a statement to CNN.[1]

[1] https://www.cnn.com/2024/02/22/tech/att-cell-service-outage/...

This isn't telling of anything, right? Wouldn't CISA be involved with anything that impacts Public Infrastructure at this level?


by itself, not telling of anything per se.

like, you could commit a dumb BGP config and break lots of stuff. have done that in the past, actually...

but any time a national-tier ISP has a national-level outage, that warrants a look from multiple orgs. and given the number of threat actors like China, NK, Iran, and Russia, who are making, and have made, aggressive efforts in this space -- and have strong reasons to do so now -- it's not crazy for the US fed'gov to want to know a little more and offer to help. but again, it's entirely possible it's unrelated.


This is normal for high-profile outages. Even if you are small, you can still engage with CISA if you think there's foul play.


From the same article above, it seems like this is a critical part of it.

> “Everybody’s incentives are aligned,” the former official said. “The FCC is going to want to know what caused it so that lessons can be learned. And if they find malfeasance or bad actions or, just poor quality of oversight of the network, they have the latitude to act.”

If AT&T gets to decide if they are at fault, they will, of course, never be at fault. So a third-party investigation makes a lot of sense.

I would also suspect that the FCC would not be as well versed in determining if there was a hack or even who did it, which is why I feel like CISA would need to get involved in the investigation.

