Level 3 technician's misstep causes largest telephone outage ever reported

grecy · on March 19, 2018

Well, I dunno.

When I worked at the Telco that serves all of Northern Canada - the telco that has the largest operating area of any telco in the world, in fact - we had an outage that took out everything for between 1 and 3 days.

When I say everything, I mean if you picked up your phone you didn't get a dial tone. Or cell phone. Or internet. Even people that still have 2-way radios to use as phones were out of luck.

There is no other provider for every single one of those customers, so not a single person had a single shred of connectivity.

In the larger cities they posted Police and EMT Personnel on street corners in case people needed 911.

That was a rather big outage, caused by a backup generator not starting during a power outage. We were still trying to bring systems back online a week later that had never been shutdown in 20+ years.

avs733 · on March 19, 2018

This type of stuff is terrifying to me. When these incidents occur, people freak out and get angry and demand action...for a while. Then it gets forgotten about.

As I've gotten older I've started to separate process driven organizations from progress driven organizations. Process driven ones tend to be very boring, and it can be abused, but good processes trump progress to me everyday.

abandonliberty · on March 19, 2018

>Then it gets forgotten about.

Then consumers don't want to pay for it and action ceases. You can play those odds for quite a while.

avs733 · on March 19, 2018

unfortunately.

I forgot where I read it, but there was a story/quote/interview about how politicians can never run on maintenance because maintenance isn't sexy, they need to run on big projects.

occams_chainsaw · on March 19, 2018

> Even people that still have 2-way radios to use as phones were out of luck

How would a telecom outage impact two-way radios?

_9vzr · on March 19, 2018

There are systems used by first responders that allow for radio basestations to be connected to landlines to run to other basestations. This allows you to create more coverage area for your network. The radios run a trunked system that is more like how cellphones work than how something like CB radios work.

This is why things like ARES (Amateur Radio Emergency Service) is important for disasters where centralized systems aren't working but long range, simplex (decentralized) nets can still operate.

NateyJay · on March 19, 2018

Two way radios can often connect to the telephone network through a bridge at the radio repeater

grecy · on March 19, 2018

What he said

exikyut · on March 19, 2018

I have an ancient book on Private Mobile Radio. It has a section on using leased lines in it

walrus01 · on March 22, 2018

By population the entire area served by northwestel is smaller than a single rural wa state county, however.

exikyut · on March 19, 2018

Oh wow.

> The technician left empty a field that would normally contain a target telephone number. The network management software interpreted the empty field as a 'wildcard,' ...

Exactly the same type of technical error happened nine years ago at Google!

> We maintain a list of [malicious] sites through both manual and automated methods. We periodically update that list and released one such update to the site this morning. Unfortunately (and here's the human error), the URL of '/' was mistakenly checked in as a value to the file and '/' expands to all URLs.

What it looked like: https://i.imgur.com/W5ICyVq.png, https://i.imgur.com/LrtLceN.png

Source: https://googleblog.blogspot.com.au/2009/01/this-site-may-har...

Mtinie · on March 19, 2018

This is a painful reminder for those of us who routinely design form-based interfaces to ensure ambiguous fields are explicitly understood by the user before entry or called out during verification of the submission.

mdip · on March 19, 2018

An "Are you sure?" prompt goes a long way and sometimes takes a really long time to get added, even when it's historically been a problem[0].

Particularly in these kinds of cases -- this is software that is used by very few people at very few companies and at those companies, it's used very rarely. That nobody at Level 3 knew leaving that field blank would cause that issue doesn't surprise me at all. We had management applications for some switch software that we had to run on Windows 98 using ThinkPad laptops[1].

[0] https://en.wikipedia.org/wiki/Rm_(Unix)#Protection_of_the_fi...

[1] I was a Frontier, then Global Crossing and ultimately Level 3 employee for around 17 years. The story about the ThinkPad Laptops is detailed in another comment in this post.

Already__Taken · on March 19, 2018

We had some remote desktop software that asked "are you sure" about all kinds of things. If you hit logoff/restart/shutdown on a group with nothing selected it'd ask "Are you sure" yes/no and select everything in the group and perform said action. I pushed for ages to get that "default do everything" behaviour removed entirely.

If you think "are you sure" is a good solution, you may have problems much further back in the applications flow, any it may not be helping even if you do add it.

Mtinie · on March 19, 2018

You bring up a fair point.

Excessive prompts are equally dangerous. “Click fatigue” is a real thing and you can quickly shift from a state of “let me know when something differs from the expectations” to “why are you making me agree to everything?”, which means more often than not the user just clicks blindly instead of reading the prompt for context of why this time the prompt is different.

I’m a proponent of appropriately prompting the user when their submissions, if processed, would result in ambiguous / destructive and not-easily-reversible outcomes.

nkrisc · on March 19, 2018

Requiring confirmation for an action should be something a user very rarely, if ever, sees. That ensures when it does occur it 1) is serious and 2) surprises the user, gaining their full attention.

Another approach here would have been implementing sane defaults. Blank field as wildcard is not a sane default. Default to that which has the least impact.

sleepybrett · on March 19, 2018

While "Are you sure?" is often a great prompt, there are two things that can be done to improve it.

1) Just putting AYS? after every prompt is bad, people stop taking the time to think and just mash Y.

2) Discussing the consequences in the AYS? is better.

For instance if you @channel in slack you get an AYS that lists the number of people you are going to piss off with the notification. Great UX... and yet people at my office still seem to think things like girl scout cookies are worth @channel'ing #general with a few thousand people in it is a good idea.

mdip · on March 20, 2018

I think a lot of those problems can be solved with a few techniques:

In the case of the famous rm -rf /, having a flag to ditch confirmation (which, IIRC, the reason the confirmation prompt was not there in the first place is because "f" means force, but hey). I think about "zypper" (openSUSE's package manager), which has the -y flag for installs/updates, one to auto-accept licenses and one to force resolving conflicts aggressively so that there are, literally, no prompts. To me, having a -y and an auto-accept licenses is unnecessary, but I suspect that may exist for legal reasons. Having the extra flag for aggressively accepting conflicts is a really good idea because in interactive mode you're often given 3 choices and often all three of those choices will break something. Conflicts/package resolution issues aren't common and when the creep up it's usually because you've got a unique configuration and need to do some other steps before you're going to get a successful installation.

As for the "Are you sure?" GUI prompts where flags won't cut it, providing a "[ ] Never ask again" is usually suitable. Though in the case of Slack, I'm with you. My answer to that would be after a person has used @channel more than a twice in a day to throw another prompt up that says "You're going to get a reputation for being an obnoxious dick. Knock it off." ...and that's why I'm not a UX designer.

Mtinie · on March 19, 2018

I agree completely. As a designer I fear I’ve missed something like this any time a feature or function I’ve worked on is released.

JustSomeNobody · on March 19, 2018

Or, more concisely, invalid data should never make it past validation.

virtuowl · on March 19, 2018

This sounds more like a failed ui in the managing software than the technicians fault if noone there knew what that empty field would do

mdip · on March 19, 2018

I commented more extensively in the root of the post, but you can't even begin to imagine.

Think about every script you've ever written for "some thing at home" and how you only cared that it worked for the very narrow, specific, circumstances you were looking for. Maybe you left out error handling and just let it crash when you failed to put in the right parameter. Who cares? It's just a script for your one, lonely, workstation/server.

That's about the quality we're talking about. The companies that make these switches sell them to, maybe, five customers[0]. Software upgrades? Sure, if you replace that $30,000 card with the new version. Having trouble with the software? A support contract can be purchased for a similarly high fee[1]. A company producing this equipment doesn't put a lot of money into QA. In security, there was a general fear about these programs. It was so concerning that the management interfaces to the equipment was on "as close to air-gapped as you can be without being air-gapped" networks with the kinds of logging, auditing and the likes that you'd expect for a network holding government classified information[2].

[0] So few customers, in fact, that you can call them up with a serial number and find out who the purchaser is. I know this first hand due to someone propping the door to the switch site open resulting in, I think, 5 of what I was told were $30,000 a-piece cards being stolen. I was told they were effectively worthless to the thief, though, because nobody would buy them second hand in that manner and the moment they were offered for sale, if someone realized what they were, the thief would be caught.

[1] To be fair, I know of one specific circumstance where the company only offered paid support but that was mainly because I didn't work on that team; I'd speculate that all of them functioned this way.

[2] Well, maybe what you'd expect in an ideal world, anyway.

archon · on March 19, 2018

> Think about every script you've ever written for "some thing at home" and how you only cared that it worked for the very narrow, specific, circumstances you were looking for. Maybe you left out error handling and just let it crash when you failed to put in the right parameter. Who cares? It's just a script for your one, lonely, workstation/server.

I find seeing this mentioned oddly comforting. I write my worst software for myself. Zero validation, very little error handling, unchecked assumptions all over the code.

At least I'm not the only one out there with a barely-stable home lab setup because of shoddy programming.

gmueckl · on March 19, 2018

This is the difference between tinkering, experimentation and engineering. All of these have their proper place. There is no shame in having a shoddy piece of code as long as it is within a lab environment. Trouble usually starts when this kind of code changes hands and finds serious users.

mdip · on March 20, 2018

I joke that I have an ever growing private repository of code I'm too embarrassed to publish publicly. Only partly joking; it's actually several repositories. But as others have said, people who are programmers more than just professionally do this all the time. There's very few things that involve using a computer that I run into day-to-day that I don't think, "I could do this faster with a dirty script[0]"

I used to do interviews and would often encounter candidates who had no public repositories, anywhere, GitHub or otherwise. I learned quickly to be very disarming before even asking and settled on something along the lines of "Look, I know when you write things for yourself, you're doing it with very limited time and for an audience who is more interested in it doing 'The Thing' without regard for anything resembling best practices, or even typical practices. I fully expect lousy code and that's perfectly fine, but I really need as many recent samples as you can possibly give me before this evening to be prepared.[1]"

What wasn't probably realized by the candidates was that if I'd made that phone call they were getting the technical interview regardless of the code quality and I was barely going to glance at their code until the interview. And the best thing they could do for me was to give me code that was on the bad side of things[2]. The first question I'd ask is "OK, imagine you have all of the time you'd ever need to make this perfect. The goal is to get there in stages and maximize the improvements as early as possible. What would you change and in what priority." There's no way in their development career that they're not going to encounter something as awful as they've written in production and be asked to fix it, and they're going to have to start with "make it work again" and in a limited number of stages "get it to the best state; hopefully to an ideal state" but the goal is to get to as stable of a state as possible before you're pulled somewhere else.

[0] Or better, stop doing it at all by putting that dirty script in a cron job that won't have any logging, I'll probably be really happy with for the first few weeks, then forget it was there and not realize a few months later when it stops working.

[1] It was kind of a dick move and I always felt a little guilty, but I know that my instinct would be to spend an evening cherry picking the best examples which I'd then go without sleep to clean up as best I could before morning. And it wasn't a situation where I wanted to judge them at their worst "Well, if their worst code looks this good, they must be good."; if I got a code sample that was too good, I figured it was the rare script that was cared about but that the developer didn't want to unleash on the world and feel like they had to support. When this happened, it was always from a candidate who gave me one maybe two things while claiming to be a code addict and I always asked those candidates to rectify their love for software with only having a couple of very small, albeit well-written code samples. I recommended one guy who admitted he had a lot of other code but was too embarrassed by it and then logged into his BitBucket repo.

[2] Which I found worked best when I simply just asked candidates for their worst code and told them, generally, that I'd like to get a feel for how they handle eliminating technical debt. If the code was too good, I'd have to find something that I felt they should be able to quickly understand well enough to offer ideas for improvements; that rarely worked out well -- it's always easier on the candidate when it's code they've written because they're, at lease possibly, likely to be familiar with it.

indigochill · on March 19, 2018

To [0], would there be any value to stealing the cards not to sell them but to reverse engineer them to find vectors for attacks? "Effectively unsupported, never updated, and responsible for a massive amount of telecommunications" sounds like a great time for an attacker. Also seems like it would have value for industrial espionage.

mdip · on March 19, 2018

I can't say that I'm completely sure, but I'm going to guess the answer is "Probably Not". The big telecoms tend to be very conservative about where they spend their money -- understandable with a product this expensive despite the grief one tends to have in working with the product. The market is also moving toward more generic equipment and less specialized hardware for a lot of things and I'd imagine these cards will fall into that category at some point, as well.

blacksmith_tb · on March 19, 2018

What keeps most of from 'shipping' our home-grade, hacked-together code to anyone else is the fear of having to support it, though. "OK, grandma, now shell into the RPi and tell me what crontab -l says..." - you would think that aside from any sort of professional ethics (or even pride) that companies in these situations wouldn't dare ship something that 'mostly worked' (unless they had superhero lawyers...)

mdip · on March 19, 2018

LOL - I'm reading this as I'm checking my home nagios server to see why I just got an e-mail about the Raspberry Pi at my parent's cabin up north.

ashleyn · on March 19, 2018

My thoughts exactly:

"The network management software interpreted the empty field as a 'wildcard,' meaning that the software understood the blank field as an instruction to block all calls, instead of as a null entry. This caused the switch to block calls from every number in Level 3’s non-native telephone number database."

WTF kind of crappy design is that?

cesarb · on March 19, 2018

Probably a filtering interface with several fields, in which you fill the ones you want to filter on. If you don't fill a field, it's not used as a criteria for filtering (so empty fields are a "don't care" for that field's criteria).

wumpus · on March 19, 2018

This is the basic design pattern: "Google mistakes entire web for malware"

https://www.theregister.co.uk/2009/01/31/google_malware_snaf...

tootie · on March 19, 2018

Probably the same team that designed the Hawaiian missile alert system.

woliveirajr · on March 19, 2018

> The FCC report said Level 3 subsequently adopted measures to prevent a recurrence of the problem - measures in accord with best practices.

What is the "Best practices" to prevent someone from leaving a blank field, since this field was interpreted as a "*" and blocked everyone ?

Would a "send an e-mail telling to never leave this field blank again" enought ?

organsnyder · on March 19, 2018

I would hope that they actually added input validation to prevent this from occurring in the future.

But it wouldn't surprise me if someone managed to consider sending a "don't do this" email with the high-priority flag set as a "best practice"...

mdip · on March 19, 2018

I was an employee of Level 3 for a very long time (in IT, various roles -- most of my career was there). A small disclaimer: I left before this incident happened (about a year prior) and have spoken with nobody about it, so I have no insider knowledge specific to this. I was also an employee of Global Crossing that was acquired by Level 3 and this incident appears to have happened on the Level 3 side of the network (though it's not immediately clear; GC operated a SONET network and it's entirely possible it was from there).

The article and at least one other commenter mentioned this being a UI problem, and all I can say is "bingo". They didn't identify the vendor, but the article called out Cisco. I am a little skeptical of that, personally[0]. I lean toward it being something else, mainly because of the statement that followed "no one at Level 3 was aware of the consequences of leaving that field empty". We had a lot of very knowledgeable Cisco folks there. There are a few folks that I knew personally that were probably among the top folks on administering that equipment outside of Cisco. In addition, if a problem like this arose, they're accessible and helpful.

It was almost certainly one of the many, many, outdated software applications that make up the vast array of management interfaces into the equipment. I worked inside one of the phone switches (technically, one of the test switches). About a year prior to my leaving, the room next to my desk was filled floor to ceiling with cards that had been there on the day I started. I was told these cards ran along the lines of $30,000 a piece. If you wanted to upgrade the software, you had to replace the card. So we had machines in our operations center that were on isolated networks running Windows 98 in order to run the executable required to configure the switches[1]. We had devices made by Pirelli[3] that had similarly awful software. I knew of about 5 different devices with 5 different problem management tools but there were several more. And there were the handful of devices that were well known as ones "you don't even breathe on" (and I've never heard those words said about the Cisco products ... not that people spoke terribly highly of them, but never quite that negatively).

Telecom went through a really bad time in the early 2000s. At Global Crossing, every 6 months about 10% of the staff was laid off ... this happened for 10 years. Hardware didn't get upgraded, and therefore software didn't get upgraded. Sofware support contracts and maintenance were allowed to lapse. The quality of the software -- since its audience was a small handful of companies, many of which in the same financial state as Global Crossing -- was awful. After that many layoffs, the one or two guys who knew the myriad of corner cases involved in operating some of the management interfaces were on to other jobs at other companies, or retired. That this sort of thing didn't happen somewhat regularly is still surprising to me to this day.

At the time that I left, Level 3 was doing better from a financial standpoint and money was being invested in modernising a lot of these problems, but it's a huge network with many switch sites[4] in any major city that Level 3 terminated wires. Each of those sites had an array of equipment, some of it common, some of it from companies that went under shortly after the dot-com bust.

[0] It could easily have been a Cisco product, but there are so many -- far worse -- products out there that nobody outside of telecom has ever dealt with. I lean more toward that.

[1] Before someone says "Why didn't you virtualize them?". That was done for some machines with both running inside the isolated network, but it became more of a hassle than it was worth since these machines had to occasionally connect to the devices via serial port. The very buggy software included a very temperamental driver that only worked with a few models of older PCs and IBM (not Lenovo) Thinkpad laptop serial ports ... and only then if one of the wires was cut on the cable.

[2] Earlier than the style recently featured in a post about enthusiasts for the brand upgrading the motherboards.

[3] https://www.pirelli.com/global/en-ww/homepage ... Yes, that Pirelli. There's crossover between the rubber tires and fiber optics, apparently ... that was news to me. And they had some of the worst software I've ever encountered.

[4] We had, I think, 4 in Detroit. Some are not in terribly convenient locations, either. One in the area required passing an expensive OSHA certification, wearing steel toed boots and a hard hat if you wished to access it due to its proximity to a rail yard. I'd been there once and couldn't imagine the need for the restrictions.

exikyut · on March 19, 2018

Three things.

1. Thanks for the awesome TIL

2. Rumors fly about old versions of Windows, OS/2, etc still being actively used. I like to pin down and file away usage/year correlations, where possible. What sort of timeframe (roughly) was Win98 in active use here?

3. Regarding [1], I have an ancient [runs downstars to check] Compaq Prosignia 300 server here and I discovered in the (DR-DOS based) BIOS at one point that the serial port's electrical behavior can be customized between being edge triggered or level triggered. (Mildly interesting machine. Insists it has 83MB of RAM. Has the FOOF bug. Its SCSI disks make nice noises when they spin up.) Maybe this is related.

mdip · on March 19, 2018

1. You're kindly welcome.

2. The last I had heard about that machine, specifically, was around 2011 and I'm fairly certain it was there when they eliminated the NOC in Detroit which was around 2013, if memory serves. The thing is, I would be surprised if it was actually gone.

3. Nice - I was well known was the guy who could fix anything over at Level 3 around the IT side of the house. Around 2014, I was asked to take a look at an ancient desktop PC that had been sitting in the server room at another building. It had failed weeks prior, had a modem plugged in and was discovered to have been used for a billing purpose that was apparently costing well into the 6 figures. The VP asked me if I would spare a moment to see if I could figure it out, despite it having nothing to do with my normal duties. It was dusty as heck and wouldn't boot -- nobody on the server team could get it functioning. I noticed a home-made label stuck to the top and recognized it as drive geometry. The little battery had failed, probably 15 years prior, and someone put a label on it figuring that people would understand exactly what it was. I was the only one who remembered what that was. When I made it boot, people stared at me like I was a magician. :)

exikyut · on March 20, 2018

Interesting... I think you just helped me figure something (admittedly rather simple) out. Quite trivial (not nearly as interesting as your experience), but mildly related.

Many years ago I happened to find an ancient-looking machine buried in a spare room at a church. I think the room was occasionally used as an ad-hoc creche area.

After finally locating an IEC cable for it and finally getting it to boot, I found that it was of the opinion it didn't have any HDDs attached.

So, I went into the BIOS, and - yes! Just old enough to require manual CHS configuration, but juuust new enough to have a manual autodetect routine!

Turned out it was a cute 25MHz-or-so (IIRC) 486 with something like a 200MB HDD. Had some demos on it that I've long forgotten the names of. Was fun to find that machine.

Despite being so trivial that no conveniently-placed homemade rescue labels were needed, the people there also wondered how I'd figured out what was wrong with it as well.

(Honestly, I really want to work somewhere everybody's still using ancient equipment. Partly because it's what I've been exposed to for most of my life and I really like it, and partly because I'm still yet to have the chance to acclimatize to newer stuff and almost all of the small bit of knowledge I've accumulated covers older tech.)

Regarding what you helped me figure out, I've been wondering for years why that BIOS decided it didn't have a HDD. Initially I thought the HDD was on the way out and decided not to show up one day, and the BIOS happily deregistered it [after someone saw an indecipherable POST error and hit F1 or whatever]. Now I wonder if maybe the rechargeable battery went flat after the machine was left off for ages, then someone turned it on, hit F1 to accept the "bad CRC" error (I never saw one) and didn't know to do an autodetect for the HDD. It's possible. I know some BIOSes remember "time wasn't re-set after fail"; I never saw a time POST error, and don't remember if I checked the clock to see if it was wrong.

cesarb · on March 19, 2018

> and I discovered in the (DR-DOS based) BIOS at one point that the serial port's electrical behavior can be customized between being edge triggered or level triggered.

Things like that make John Titor's IBM 5100 story slightly more believable. Old computers often have strange features which are omitted from their successors due to being rarely used (for instance, the original IBM PC could use current loop on its serial port connector). One could imagine someone with access to a time machine taking a short trip to the past just to grab one of these old computers as a replacement for a failed one.

twic · on March 19, 2018

> There's crossover between the rubber tires and fiber optics, apparently

Nokia also started as a rubber company:

https://en.wikipedia.org/wiki/History_of_Nokia

They got into telecommunications by buying a cable works. I don't know if that was just random empire-building, or if there are material connections between cables and rubber. I suppose you need a lot of rubber to insulate the cables.

Dodgeit · on March 19, 2018

Having ongoing and massive issues with a Cisco phone system where I work currently, so I was reading into it as being a Cisco system and not being surprised.

Although agree that Cisco is usually pretty solid.

greenleafjacob · on March 19, 2018

This is just a failure in change management. They detected the issue in 4 minutes but it took over an hour and a half to mitigate?

bonesss · on March 19, 2018

That's the conclusion of the report: no one was previously aware that an "unrelated" field in an "unrelated" activity would have production-wide consequences.

The detected an issue within 4 minutes, but it took an hour and a half to diagnose the issue, find the unwanted and unforeseen change, and revert it. That's not an absurd amount of time for a complex system with millions of users.

I would hazard to guess that this long-running system isn't even taken in as part of the companies 'change management' routines, in so far as it's job was number filtering and the operation was 'routine'. At my work we produce production change reports for production changes, but we don't fill them out every time we run a stable application...

gonzo41 · on March 19, 2018

1 minutes to fix. 30 minutes rationalizing that the 1 minute fix your thinking of is going to work. 45 minutes convincing others. 14 minutes of terrible self doubt and worry. And we're back up. Ops life!

greenleafjacob · on March 19, 2018

Mitigate first. If you are confident in your change management system then rolling back is always safe.

SlowBro · on March 19, 2018

And quite possibly a failure in disaster recovery methods. If it was at all possible to have a replicated site it should have been failed over to it immediately. Then resolution and root cause analysis could have been carried over at their leisure.

Perhaps that would not have been possible with this style of equipment, I don’t know. But at least yes, change management procedures should have been in place and obeyed.

However, I hope they don’t fire the guy. I once read of an employee of some company who created a $600k mistake. They asked the CEO if he would fire the employee. “I just paid $600k to educate one of my employees. Do you think I’m going to give him away to my competitors?”

That has stuck with me. Unless there is clearly malicious intent I want to give second chances.

jlgaddis · on March 19, 2018

Four minutes to realize SHTF, but a while longer to figure out why:

> Level 3 was aware it had a problem within four minutes, the FCC report said. The problem was difficult to diagnose, however, because no one at Level 3 was aware of the consequences of leaving that particular field empty, nor had anyone at the company previously seen the system behave the way it was behaving.

That is, they didn't know that leaving that field blank is what caused the S to HTF.

Dodgeit · on March 19, 2018

I can imagine it once they found out the cause.

"Really? That was it? Are you fucking serious?"

avs733 · on March 19, 2018

sounds like most of the manufacturing issues I have been involved in...

'why did the machine break?'

'well the spec says to clean it using chemical X but the cabinet with X is 75 feet away and the cabinet with chemical Y is next to the machine so they used Y. They use X and Y interchangeably on other machine so the technicians (note: high-school grads, great guys but not chemists) thought it was interchangeable on this machine.'

'well why aren't they interchangeable on this machine'

'Y reacts with the glue used to assemble the machine, which was a change in the newer versions because of EPA regulations, so doing this weekly maintenance task for 8 years was finally enough to degrade the glue'

'so why was chemical Y stored near here?'

'because X has to be kept so many feet from Y. Last year in the efficiency audit we found techs had to walk to far on average to get X so we moved the X cabinet, which resulted in us moving Y to here.'

'has Y been used on any of the other machines when it shouldn't have been?'

'we don't know and we aren't sure how to check'

Fault trees usually make for really interesting reading.

exikyut · on March 20, 2018

Nice.

I'm guessing I can't read the source to this particular fault tree, but I wonder where I might find others. Preferably without digging through e.g. troves of court documents and the like.

avs733 · on March 20, 2018

https://www.ntsb.gov/investigations/AccidentReports/Pages/Ac...

exikyut · on March 20, 2018

Ooh, interesting. Thanks!

And I'd somehow gotten in my head the NTSB were only aviation, probably from old TV shows. TIL about their actual name.

greenleafjacob · on March 19, 2018

Figuring out why can be avoided if you have proper change management. If you are root causing every issue then that will drastically increase TTM. Instead mitigate first and just revert the change.

empath75 · on March 19, 2018

I wouldn’t consider that to be a technicians misstep. That’s poor software design.

bigiain · on March 19, 2018

Yeah, but nobody ever seems to throw the UX designers under the bus, right?

Mtinie · on March 19, 2018

TLDR; Ill-conceived design is only part of the problem.

——

You are assuming the entry form was put together with the assistance of a UX designer.

In my experience, back office or configuration software rarely is seen as important enough by the Powers-That-Be to justify all the “extra research and design effort” required.

Usually work like that is tossed over to whichever mid-level software engineer has an extra cycle or two during the sprint.

This is not to imply that software engineers are always bad at UX. In fact, most engineers I’ve worked closely with care a lot about the end users’ experiences, however, when push comes to shove, their leaders (or the overseeing Product teams they are accountable to) push for rapid delivery of features to reach market parity, rather than spending the extra few days to formally validate the right design decisions were made and if the implementation of those designs were solidly understandable.

Going back to your original comment: We should, as an industry, hold designers accountable if their decisions lead to detrimental consequences, especially ones which could be anticipated, like this one.

However, we should also recognize this is never the failure of solely the designer, but rather indicators of systematic issues of the extended team.

A poorly designed feature/function which makes it way through...

  * pre-dev stakeholder reviews 
  * development and implementation 
  * quality assurance 
  * acceptance testing

...before release has been vetted and signed-off by enough people to ensure everyone is complicit.

It is a failure of culture and leadership if no one along the chain had been comfortable or able to raise a flag if they disagreed or foresaw a problem.

Edit: Fixed formatting and this ended up to be longer than I expected when I stared typing. Also, typos.

bigiain · on March 19, 2018

100% agree with everything you say here.

I strongly recommend:

https://deardesignstudent.com/a-designers-code-of-ethics-f4a...

and:

https://medium.com/@monteiro/designs-lost-generation-ac72895...

from someone who's done _way_ more thinking about these sorts of issues than I have...

davidkuhta · on March 19, 2018

Well there was that one test missile notification system in Hawaii with stellar UX and UI...

madeofpalk · on March 19, 2018

...and it was the technician/operator that was fired.

ashelmire · on March 19, 2018

That’s a funny way to spell project manager.

scotty79 · on March 19, 2018

I always thought that it's a very strange default in SQL: if you don't specify which rows you want, the operation affects everything.

atkbrah · on March 19, 2018

It's like one of the most common user errors with cisco cli. When you want to add an interface to a vlan you would type:

  switchport trunk allowed vlan add $vlan

But if you by accident omit add keyword you would replace all interface vlans with $vlan

  switchport trunk allowed vlan $vlan

irishsultan · on March 19, 2018

At least in that case the important keyword isn't placed at the end (i.e. you would get an error if you merely typed ` switchport trunk allowed vlan add`)

pbhjpbhj · on March 19, 2018

IIRC that's like "groupadd" on Linux distros, if you don't add a switch it replaces all that users groups.

aexaey · on March 19, 2018

...and, if especially unlucky, that would be the trunk you're using to connect to that switch.

kazen44 · on March 19, 2018

Which is why you usually have a seperate OOB network for logging into switches and routers.

If it's very mission critical, even an entire seperate cable infrastructure for OOB is not uncommon.

GrumpyNl · on March 19, 2018

Its is exactly doing what you tell it to do, update without filter, it updates all.

yjftsjthsd-h · on March 19, 2018

One would expect to have to specify that with an asterisk or something and have blank just give a syntax error.

mannykannot · on March 19, 2018

...until one thinks it through and realizes that this would require that special value as an extra condition for every attribute in every relation mentioned in the query, which in turn would eliminate both the theoretical elegance of the relational model and the many very practical benefits that arise from it. A secondary problem is what to do if '*' is a valid value... Ad-hoc solutions are often more difficult than they first seem.

scotty79 · on March 19, 2018

Not really. You could have a keyword ALL and UPDATE should require either WHERE or ALL.

JdeBP · on March 19, 2018

This is telephony. '*' is indeed a valid element of a telephone number field.

aexaey · on March 19, 2018

Percent, rather than asterisk.

  UPDATE t SET foo='bar' WHERE baz LIKE '$INPUT';

This will do no changes when $INPUT is empty, and only with $INPUT set to '%' will it update every record.

JdeBP · on March 20, 2018

Ask yourself how many search forms you have filled in where you are allowed to leave the fields that you are not searching on blank. That is a fairly common design.

Shivetya · on March 19, 2018

that is an impressive flaw. why would you ever assume a wild card across all parts of a phone number? I get a lot of spam calls where there area code + prefix matches my number and I just ignore them.

I have a hard time justifying even wildcard values for the last four digits in whole, partial wild card yes to get a PBX or such

mdip · on March 19, 2018

My wild, totally uninformed, guess that it wasn't an assumption that empty meant wildcard but rather a failure to sanitize input at all. My experience, though limited, with these management applications is that the developers assumed the operator would never, ever, ever enter an illegal or unexpected value and therefor implicitly trusted the input. It's possible that the empty value was considered illegal and whatever module that handles call routing on the phone switch failed to function, blocking traffic as a result of the value.

It sounds crazy, but consider a similar circumstance of popping something illegal into an Apache or nginx configuration file. The service fails to start and anything hosted behind it is down. I'm not saying it's acceptable, just likely[0]. The difference here is that this software has an audience of very few people, is poorly developed to begin with, and usually outputs error messages similar to C++ compilers from the 90s. And the software was probably written in the 90s, too.

[0] While a competent sysadmin expects that a failure to provide valid values in a configuration file will result in a service not functioning, our typical interaction with modern software comes with the expectation that an invalid value provided to a configuration form will result in a rejection of the value. Even in the cases of Apache/nginx, they provide a method to check your configuration before using it -- just to be safe to make sure you didn't leave out a semicolon/closing brace/</Something>

paulie_a · on March 19, 2018

I rarely answer calls from numbers I don't recognize anymore, unless it is to simply fuck with the scammer and waste their time.

nkrisc · on March 19, 2018

I never even answer to mess with them because then they know it's a good number.

cosmie · on March 19, 2018

The sheer fact that it routed to a line that rang is enough for them to know it's a good number. Their auto-dialer could be configured such that answering it increased the cadence of calls or something such as that. But the absence of the call being routed to a not-in-service/no-longer-available response is enough of a signal for them to know it's a good number.

Source: I did data management for a company that performed a high volume of outbound business dials (not consumer lines). At one point we evaluated productizing our non-valid numbers list, so that businesses could do things like flag when their main contact at an account was no longer at the company, triggering an automatic alert to follow up with the remaining contacts at the company and re-establish a relationship. CNAM lookup services like Twilio Lookup[1] don't do so well at this use case, since companies tend to reserve a full block of phone numbers (always showing as active when doing a CNAM lookup), but when an employee leaves their line will temporarily be de-activated internally until it's re-assigned to a new employee.

[1] https://www.twilio.com/lookup

elil17 · on March 19, 2018

Title seems misleading - a poorly chosen default behavior caused the outage