The US airline industry has a serious problem with these outages. By my count, four major airlines have been hit with system-wide computer-related outages in the last year (Southwest, JetBlue, Delta, United).
It seems quite odd to list "Human Error" as a cause. This is only a secondary cause, because if a human is able to accidentally bring down an entire system, it's the system itself that is really the problem.
Are you trying to tell me I shouldn't have to treat every application like the developer is the laziest piece of shit possible and 90% of valid inputs will cause the application to crash or spit meaningless errors?
If the datacenters are really too complex for the people running them to understand, I would expect failures like this to drive the airlines towards AWS, Azure, or some similar service.
Having worked at SABRE, it's not as easy as you think, as there are a crapton of systems and people that depend on all that old stuff to keep running as is. SABRE has spent 30 years trying to modernize their mainframe-based system piece by piece. But the core reservation system is actually still highly available; it's the modern bits that wind up less stable. The problem is that travel is a highly interconnected system of many different companies in which complexity is almost impossible to avoid. There are also many systems involved that aren't visible to the public, such as weight balancing, crew scheduling, etc., where even one of those failing can screw up airline travel worldwide. It's not always reservations or check-in that's broken; even something as simple as an airport system failing can have a domino effect all over the country.
It reminds me a bit of Vernor Vinge's "zones of thought" books (sci fi) in which many of our current technological dreams have failed to materialize and cascading failures of brittle automated systems cause logistic collapses that wipe out advanced civilizations.
> I would expect failures like this to drive the airlines towards AWS, Azure, or some similar service
I don't think that would help much. It's not really the core hardware or operating systems that tend to cause these types of outages.
More typically, it's the dependency chain between locations, applications, and services. And, there's more than one system that can cause a ground halt. The check-in service, the no-fly list functionality (which the govt runs), weight/balance, crew scheduling, dispatch functions, and so on.
Check-in is a good example. You can lose that either through a failure in the complex WAN, failures in the check-in backend service, failures with the no-fly service (run by the govt) or connectivity to it, failures in the CRS/GDS, failures in various services around check-in kiosks, failures in the online checkin, and so forth.
Once they go down, you also face an unusually high spike in request volume when you're trying to get them back up. It creates a wave that can overwhelm different parts of the system.
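That recovery wave is the classic thundering-herd problem. A minimal sketch of the standard client-side mitigation, exponential backoff with "full jitter" (the function name and parameters are mine, purely illustrative, not anything from an airline system):

```python
import random

def retry_delays(max_attempts=5, base=1.0, cap=60.0):
    """Yield sleep times using 'full jitter' exponential backoff, so
    retrying clients spread out instead of hammering the recovering
    service in synchronized waves."""
    for attempt in range(max_attempts):
        # exponential ceiling: base * 2^attempt, capped at `cap` seconds
        ceiling = min(cap, base * (2 ** attempt))
        # full jitter: pick uniformly in [0, ceiling]
        yield random.uniform(0, ceiling)

# Each client computes independent, de-synchronized delays
delays = list(retry_delays())
```

Without the jitter, every kiosk and app retries on the same schedule and the backend sees the wave described above all over again.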
For the more recent failures (across different airlines) listed above, I know one was a routing storm on the IP network, one was the checkin service, and one was the central reservations system...I think a botched version upgrade. Similar effects, different root causes.
Not to say it's okay, or shouldn't be addressed, but just noting that there's not really one smoking gun.
I'm not advocating that it is ok, but I am sure these airline systems are super old and wouldn't be surprised if they use IBM DB2 or similar ancient database technology. Moving to the cloud is not a trivial task for these sorts of mission critical antiquated systems.
It's a legacy story like none other, in fact. The predecessor/origin of SABRE is IBM's Airline Control Program (ACP). When I worked for IBM years ago I heard many stories of how difficult it was to try to modernize to a newer system because of absurd complexity, but just as much because the whole airline industry became so inextricably bound to the legacy.
I'm currently doing consulting work for That Major Australian Airline and, while I'm rather high up the stack, you do get a sense of the amount of legacy that's built into everything and the monumental effort it would be to migrate all this old stuff to newer infra. I mean, AFAICT there's no database/service to _query_ flights - you have to register a web hook to receive flight data and store it yourself.
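That "register a webhook and keep your own copy" pattern looks roughly like the sketch below. The payload fields and table schema are my own invented example, not the airline's actual data model:

```python
import json
import sqlite3

def ingest_flight_event(conn, payload: str):
    """Store a pushed flight update locally, since there's no service
    to query flights on demand -- you keep your own copy of whatever
    the webhook delivers and query that instead."""
    event = json.loads(payload)
    conn.execute(
        "INSERT OR REPLACE INTO flights (flight_no, dep, arr, status) "
        "VALUES (?, ?, ?, ?)",
        (event["flight_no"], event["dep"], event["arr"], event["status"]),
    )
    conn.commit()

# Local store the webhook handler writes into
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights "
    "(flight_no TEXT PRIMARY KEY, dep TEXT, arr TEXT, status TEXT)"
)
ingest_flight_event(
    conn,
    '{"flight_no": "QF1", "dep": "SYD", "arr": "LHR", "status": "on time"}',
)
row = conn.execute(
    "SELECT status FROM flights WHERE flight_no = 'QF1'"
).fetchone()
```

The downside, of course, is that every consumer now maintains its own eventually-consistent mirror of flight state.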
New projects are cloud-first, and more and more stuff gets migrated or replaced with equivalents which run in whatever cloud provider. But I can't even imagine how the replacements for all these old legacy services would go down.
You would be surprised how much has already been migrated away from the old IBM systems (TPF). The big players these days are Lufthansa, Jeppesen, Navitaire, Apollo, Sabre, etc.
Apollo is now Galileo. Galileo, Sabre, HP Shares, and Amadeus (the actual "Big 4" in this space) all still use the TPF operating system on IBM mainframes. They all are offloading functionality piecemeal, to more modern systems. But TPF is still at the core.
You mentioned Navitaire. They are not using TPF...they use COBOL on Windows (not kidding). They do have a large list of airline customers, but none with a big fleet. Reportedly, it doesn't scale up well enough to serve a large airline.
TPF also lives on in the financial world as well, like at Visa, for example.
IBM DB2 is downright futuristic compared to what some of these systems are. SABRE, for example, is probably the granddaddy and horror textbook example of what's commonly referred to as a "legacy codebase" (although the IRS Master Files written in S/370 assembly could give it a run).
Both the individual and business master files are still written in IBM mainframe assembly language, and are circa 56 years old. See the table on page 4 of this PDF for a list of the oldest systems in operation:
Number three on the list is the DoD's Strategic Automated Command and Control System, which runs on "an IBM Series/1 Computer—a 1970s computing system—and uses 8-inch floppy disks". No biggie; it just "coordinates the operational functions of the United States' nuclear forces, such as intercontinental ballistic missiles, nuclear bombers, and tanker support aircrafts."
> Standard CFOL Access Protocol (SCAP) is written in COBOL/Customer Information Control
> System (CICS). SCAP downloads Corporate Files On-Line (CFOL) data from the IBM mainframe at
> the Enterprise Computing Center, Martinsburg. The CFOL data resides in a variety of formats
> (packed decimal, 7074, DB2, etc.)
Never, ever underestimate the government and military's ability to keep old systems operational well past what others consider reasonable.
Anecdotally, back in the late '80s I was in the USAF. Our secure communication center was running the first model of Burroughs machine to not use tubes. It was that old. It could boot from cards, paper tape, or switches. The machine was older than many of the people who would be assigned to it. This was closely repeated in the main data center (personnel records, inventory, and such), which had a decade-old system that had migrated off physical cards by '89 (but still took them as images off 5.25" floppies uploaded by PC).
DB2 isn't "ancient" in any meaningful sense. The first release shipped long ago but IBM has kept it fully up to date since then and it's still competitive with any other relational database for high-volume OLTP.
Can confirm both - I've been writing LOB software for the financial industry in VB6 up until very recently, and stopped because I switched jobs, not because they've stopped writing them :P
To be honest, VB6 is much better than some of the other stuff they have around.
I now switched to a travel agency and am interacting with the Sabre blue screen systems that are similarly old.
> Moving to the cloud is not a trivial task for these sorts of mission critical antiquated systems
The question is whether moving to _anything_ more modern is less costly than keeping the current systems, which appear to have many single points of failure.
> Often I've seen businesses reason that the failures are cheaper than the upgrades.
I would love to know what today's outage cost in terms of overtime, gate fees, fuel, additional crew, &c.
Delta's cost was ~$150MM [1]. That's something on the order of a thousand mid- to senior-level programmers for a year in my area. Even if you allocate a quarter of that cost to computer costs (which I'm betting is a fairly large overestimate), that still leaves a sizable team.
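Back-of-the-envelope on that claim (the ~$150k fully loaded cost per programmer-year is my assumed figure, not the parent's):

```python
outage_cost = 150_000_000            # Delta's reported ~$150MM outage cost
cost_per_programmer_year = 150_000   # assumed fully loaded cost, illustrative

# Whole outage cost expressed in programmer-years
programmer_years = outage_cost / cost_per_programmer_year   # ~1000

# Even allocating only a quarter of it to "computer costs"
quarter_team = (outage_cost / 4) / cost_per_programmer_year  # ~250
```

So even the conservative allocation still funds a couple hundred engineers for a year.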
> DOS interfaces.
TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly, or that work as well keyboard-only.
> TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly, or that work as well keyboard-only.
TUIs have a steeper learning curve, but I agree that once someone masters the hotkeys for that particular interface, they're much quicker. In addition, depending on how old the computer running the TUI is, the staff may not be able to use it to browse anything else. ;)
> That's something on the order of a thousand mid- to senior- level programmers for a year in my area
I shudder to think of a scheduling program as complex as an airline's ticketing system being run on a bit of software written by a thousand programmers in only one year.
Okay, I admit that I have absolutely no idea what software like that costs to build, but surely they could have rebuilt their entire software stack for that, couldn't they?
No. Software at that scale is humongously expensive.
I worked on quoting an insurance policy core (just the core mind you, not the extras), and for a medium-sized insurance company it would reach that amount of money.
I suspect a complete rewrite would go into the billions.
The company I worked for ended up not doing it (and regretting it).
It ends up not being that many people, but extremely highly paid consultants ($200 to $400 an hour), and extremely high licensing costs. Some projects that should take 1 to 2 years can drag on for many more.
It's extremely profitable and very well paid, one such company, Guidewire, is one of the top 10 best paying employers in the U.S.
Yeah a lot of this stuff is on old school mainframes written in assembly. I work in the airline/travel industry and I've seen 28 year old code still running in production.
A lot of the systems the airlines use are typically hosted by large telcos. Some use SaaS, so those are hosted for them. Very few airlines I know of actually have their own data centers anymore, and typically it's a misconfiguration or an upgrade that causes these outages. Although it can also be a comms problem.
A couple weekends ago, RDU was unable to handle flights out of its major terminal due to a hard drive (or three) failure. Took 75 (!) people to diagnose and fix:
Anyone know why they don't have a fail-over mode that just keeps the same routes as the previous week? Then at least you'd have underbooked/overbooked flights, but the network would keep running.
I have no reason to believe that there's any malice involved here.
I also think that incidents like this might become more common as the state of cyber warfare progresses. As engineers, we should take care to build secure software, especially when that software underlies important systems. We should impress upon non-technical management the importance of doing so, even if it may take a little more time or money in the short term.
I love flying and getting to experience some of the weird issues that crop up. I was on a flight in December when there was only one ground crew working at my mid-sized airport. I was really hoping to beat the winter storm coming in, as I live pretty far from the airport. If there was a delay, I'd be stuck driving home in a blizzard.
They were prioritizing getting planes to gates based on if the plane was picking up passengers and heading out again. Mine was the last flight of the day for my plane, so we were at the back of the line.
After 30 minutes, we finally had an opening and were moving... until another flight crew reminded the tower that they were reaching their FAA allowed hours and needed to be parked immediately. They took our spot, and then more planes came in that had to leave. Of course I know all of this because our pilot was pissed and would relay all of the info to us with a very sarcastic voice. He was supposed to be home by now too.
We ended up getting into our gate almost two hours after we landed. I normally live 30 minutes from the airport, and it took me another hour and a half to get home with the weather. The only reason we got in when we did is because the weather forced a shutdown of the airport, so we were literally the last flight to get parked at a gate, and only because there were no new flights coming in.
It's fun (after the fact) to think about these edge cases, how to keep the entire airspace running smoothly when one airport only has a single ground crew working, how to maximize efficiency to make sure the fewest number of people are delayed. Sucks to live through it though.
How would hiring more crews and gates be cheaper than one flight getting abnormally delayed? I mean, maybe it would be cheaper, but there's no guarantee of it.
You misunderstand. It's not normal to only have one ground crew, this was a very unusual situation. My guess is a bunch of people called in due to the impending weather, or the airport expected to be shut down sooner than they were.
Like I said, this was an edge case. I've flown once a month for almost three years now and this was the first time I've experienced this.
That's a good observation. Often, the original issue, if it's not fixed in an hour or two...starts creating cascading problems.
Say just the boarding pass / checkin function is down. If you don't fix it quickly...
- Crews (pilots and/or FAs) become illegal to fly due to various rules around crew work hours, rest periods, etc.
- They can't be replaced by other crews that need to be flown in, because the gates are occupied by the planes that can't leave
- Downstream flight connections no longer match up, so a massive process to change the current tail numbers <--> flight numbers plan has to happen, followed by matching crews and passengers to the new plan. Especially fun if you have a mixed fleet where only certain crews can fly certain models, and the models have different passenger capacities.
Basically, once you go past an hour or so, a giant shitshow starts...and gets worse as time goes on.
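The crew-legality piece alone can be sketched as a simple duty-limit check. The 14-hour figure below is a rough, illustrative limit in the spirit of FAA duty rules, not an exact regulation:

```python
from datetime import datetime, timedelta

# Rough single-day duty limit; real FAA rules vary by report time,
# number of segments, rest history, etc. Illustrative only.
MAX_DUTY = timedelta(hours=14)

def crew_is_legal(duty_start: datetime, projected_release: datetime) -> bool:
    """A delayed departure pushes projected_release later; once it
    crosses the duty limit the crew 'times out' and the flight needs
    a replacement crew -- which may itself be stuck elsewhere."""
    return projected_release - duty_start <= MAX_DUTY

start = datetime(2016, 8, 8, 5, 0)
on_time = crew_is_legal(start, datetime(2016, 8, 8, 18, 0))  # 13h duty
delayed = crew_is_legal(start, datetime(2016, 8, 8, 20, 0))  # 15h duty
```

A two-hour ground stop is enough to flip a crew from legal to illegal, which is how a check-in outage snowballs into the re-planning mess above.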
I got stuck in SLC two Sundays ago. Hotel tonight should have decent options. If you end up in town, the place will be dead. Bourbon House is decent and serves food late.
While in this case they probably are actual mainframes, at my firm we use the term to refer to the VMs running in tiny boxes that replaced what once were also mainframes.
This doesn't surprise me. US airlines seem like breeding grounds for management silos and the technology is going to reflect that, no matter how much you spend.
Teslas are remotely enabled and disabled. They regularly report telemetry back to HQ. They ask for and receive updates over the air.
What happens if that infrastructure crashes (or a natural disaster disables part of it)? What happens if the clocks drift? What happens if a bug (or an intentional virus) suddenly deauthorizes all Teslas on the network? It's the same problem - management of the remote systems is centrally located, and centrally located management systems can fail.
The planes were perfectly capable of flying. I haven't seen what the issue was but it's equally likely that it was a problem with the reservation system, the pricing system, the weight/balance calculators (admittedly I wouldn't want to fly with this being wonky), even the shift scheduling for backup flight crews.
There is no remote management of planes. Nobody at ATC or on the ground can prevent a plane from doing anything the pilot wants to make the plane do.
This article goes into some detail of why it happens: http://money.cnn.com/2016/08/08/technology/delta-airline-com.... It seems like human error and fragile computer systems are the biggest issues.