Software engineering lessons from RCAs of greatest disasters (anoopdixith.com)
185 points by philofsofia on Aug 17, 2023 | 146 comments



One of my favourites that I ever heard about was from a friend of mine who used to work on safety-critical systems in defense applications. He told me fighter jets have a safety system that disables the weapons systems if a (weight) load is detected on the landing gear, so that if the plane is on the ground and the pilot bumps the wrong button they don't accidentally blow up their own airbase[1]. So anyway, when the Eurofighter Typhoon did its first live weapons test, the weapons failed to launch. When they did the RCA they found something like[2]

    bool check_the_landing_gear_before_shootyshoot(double weight_from_sensor, double threshold) {
        //FIXME: Remember to implement this before we go live
        return false;
    }
So when the pilot pressed the button, the function disabled the weapons as if the plane had been on the ground. Because the "correctness" checks were against the Z spec, and this function didn't have a unit test because it was deemed too trivial, the problem wasn't found before launch. So this cost several millions to redeploy the (one-line) fix to actually check the weight from the sensor was less than the threshold.
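
For what it's worth, the described one-line fix would presumably have amounted to something like this (a sketch in the same spirit, not the actual code either):

    bool check_the_landing_gear_before_shootyshoot(double weight_from_sensor, double threshold) {
        // No weight on the gear (reading below the threshold) => airborne => clear to fire.
        return weight_from_sensor < threshold;
    }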

[1] Yes this means that scene from the cheesy action movie (can't remember which one) where Arnold Schwarzenegger finds himself on the ground in the cockpit of a Russian plane and proceeds to blow up all the bad guys while on the ground couldn't happen in real life.

[2] Not the actual code which was in some weird version of ADA apparently.


Torpedoes typically have an inertial switch which disarms them if they turn 180 degrees, so they don't accidentally hit their source. When a torpedo accidentally arms and activates on board a submarine (hot running) the emergency procedure is to immediately turn the sub around 180 degrees to disarm the torpedo.


Lest someone think this is purely hypothetical: https://en.wikipedia.org/wiki/USS_Tang_(SS-306)


A Mark 14 torpedo actually sinking something? What a bad stroke of luck!


The Mark 14 ended up being a really good torpedo by the end of WWII.

It even remained in service until the '80s.

In truth, and going back to this subject, the Mark 14 debacle highlights the need for good and unbiased QA.

This also holds true for software engineering.


My understanding is that BuOrd (or BuShips? I don't remember which) "didn't want to waste money on testing it", so instead we wasted hundreds of them fired at Japanese shipping that either didn't even impact their target or never had a hope of detonating.

Remember these kinds of things the next time someone pushes for "move fast and break things" in the name of efficiency and speed. Slow is fast.


Pre-war, it was more a case of "penny wise and pound foolish", partly due to budget limitations (they did things like testing only with foam warheads so they could recover the test torpedoes).

But after Pearl Harbor, a somewhat biased BuOrd was reluctant to admit the Mark 14's flaws. It took a few "unauthorized" tests and two years to fix the issues.

In fairness, this sure makes for an entertaining story (e.g. the Drachinifel video on YouTube), but I'm not completely sold on the depiction of BuOrd as a bunch of arrogant bureaucrats. Still, bias and pride (plus other issues like low production) certainly played a role in the early Mark 14 debacle.

Going back to software development, I'm always amazed at how bugs immediately pop up whenever I put a piece of software in the hands of users for the first time, regardless of how well I tested it. I try to be as thorough as possible, but being the developer I'm always biased, often tunnel-visioning on one way to use the software I created. That's why, in my opinion, you need some form of external QA/testing (like those "unauthorized" Mark 14 tests).


In NL folklore this is codified as 'the longest road is often the shortest'.


> weird version of ADA

Spark: https://en.wikipedia.org/wiki/SPARK_(programming_language)

And it's Ada, not ADA, which makes me think of the Americans with Disabilities Act.


Aah, thank you on both counts, yes. One interesting thing he told me about is that they wrote a "reverse compiler" that would take SPARK code and turn it into the associated formal (Z) spec, so they could compare that to the actual Z spec to prove they were the same. Kind of nifty.


Sounds like Tranfor, which would convert FORTRAN code to flowcharts, because government contracts required flowcharts.


> Yes this means that scene from the cheesy action movie (can't remember which one) where Arnold Schwarzenegger finds himself on the ground in the cockpit of a Russian plane and proceeds to blow up all the bad guys while on the ground couldn't happen in real life.

I think you meant "Tomorrow Never Dies" and the actor was Pierce Brosnan. Took me forever to find that; is it the right one?


Yeah maybe. I think that sort of rings a bell.



Oh, that's also the sequence with the "peace / war" switch! That did make me laugh. Turns out it's a real thing, though they probably wouldn't have flipped it in this situation.


That is exactly it. For some reason my brain had spliced Schwarzenegger into the cockpit. Funny how memory works.


It really threw me because I also remembered the scene but not the actor so I kept looking for Schwarzenegger movies. Good, checked that one off the list :)


> this cost several millions to redeploy the (one-line) fix to actually check the weight from the sensor was less than the threshold

Well maybe this is the other, compounding problem. Engineering complex machines with such a high cost of bugfix deployment seems like a big issue. It's funny that as an industry we now know how to safely deploy software updates to hundreds of millions of phones, with security checks, signed firmwares, etc, but doing that on applications with a super high unit price tag seems out of reach...

Or maybe, a few millions is like only a thousand hours of flying in jet fuel costs alone, not a big deal...


>It's funny that as an industry we now know how to safely deploy software updates to hundreds of millions of phones, with security checks, signed firmwares, etc, but doing that on applications with a super high unit price tag seems out of reach

A bunch of JavaScript dudebros yeeting code out into the ether is not at all comparable to deploying avionics software to a fighter jet. Give your head a shake.


I don't think they're referring to dudebros' JS; they're referring to systems software and the ability to deliver relatively secure updates over insecure channels. I've even delivered a signed firmware update to a microprocessor in a goddamn washing machine over UART. Why can't we do this for a jet?
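
Concretely, the core of a signed-update path is tiny. A minimal sketch (the key and helper names here are placeholders for illustration, not any particular library's API):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    // Public key baked into the bootloader at manufacture time (placeholder name).
    extern const uint8_t VENDOR_PUBLIC_KEY[32];

    // Hypothetical helpers: some signature check (e.g. Ed25519) and a flash writer.
    bool signature_is_valid(const uint8_t *image, size_t len,
                            const uint8_t sig[64], const uint8_t pubkey[32]);
    bool flash_write(const uint8_t *image, size_t len);

    // Never commit an image to flash unless its signature verifies against the baked-in key.
    bool apply_update(const uint8_t *image, size_t len, const uint8_t sig[64]) {
        if (!signature_is_valid(image, len, sig, VENDOR_PUBLIC_KEY))
            return false;               // reject unsigned or tampered images
        return flash_write(image, len); // only then write the new firmware
    }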


>Why can't we do this for a jet?

Well, because the software load of an aircraft is certified as part of the approved type design, for one. If you update the software it requires an engineering approval, because the risks inherent to operating an aircraft, and the engineering that goes into mitigating those risks and making them acceptably safe, are quite a bit more significant than for a washing machine.

What's more we're talking about stores clearance (i.e. releasing shit from the aircraft in flight).

The attitude behind "Just write that function and flash the firmware" gets people killed.


I'm not saying just write the function and flash the firmware, but it's not like the super rigid certification process doesn't have its nefarious side effects either. My experience is that the more expensive fixes are, the more humans are willing to turn a blind eye to problems or wish them away.


>but it's not like the super rigid certification process doesn't have its nefarious side effects either.

The system isn't rigid so much as thorough. You can omit portions of the review for Minor Changes (term of art), for example. Unfortunately "writing the code to correctly release deadly explosives from the aircraft in flight" is far from a Minor Change, so gee willikers I guess it required some due diligence.

Maybe sometimes doing things correctly takes time and money for a reason, even if the reason isn't obvious. Maybe there's a good reason not to have OTA firmware update capability on a warplane.


I hear you. I respect the process and practice. I would invite you to ponder what would happen if all the iPhones and iPads in one country were to be bricked overnight by an OTA update - billions of dollars worth of instant economic damage just from the device cost, many billions more in consequences including lives lost. Well, this capability probably exists somewhere at Apple. I hope it's well guarded by process, and perhaps this process is costly, not unlike recertification. Does the ability to deliver critical fixes quickly make it a safer system on balance, versus the Nokia and BlackBerry era where your firmware essentially never changed because the cost of delivery was so high? My guess is that it does on balance represent an improvement. But maybe I misunderstood and the millions of dollars in cost of delivering the fix were actually spent on due diligence, as opposed to just mechanically applying the patch.


Many people will hear the usual story of the fixes (for the plane example) being enormously expensive without really diving into what all goes into that figure. The source code change itself may be trivial, so it's easy to compare that to the multi-million dollar figures thrown around and have criticisms.

We can do OTA updates; there's no technical reason it can't be done, other than not allowing it (which I mostly agree with in secure applications). Hell, our spacecraft do this now. We must keep in mind these fixes do not go from the dev environment straight to the field (prod), which would be a terrible idea. These are extremely complex integrated systems and must be tested in multiple phases, because let's face it, if this supposedly trivial issue made it all the way through, what else may not have been discovered yet?

Not only does the 'easy' fix need to be tested (time and money), but related interactions need to be investigated as well (more time and money). The time cost of people doing the work, investigations, and testing adds up. Then there's potentially hardware in the mix, which is never cheap; also, simply getting access to hardware for testing can be a huge hassle.

Keep in mind this comment is only geared towards situations where the end item is a physical system. I would expect fixing a pure software product to have significantly lower costs.


Is it because you risk doing something like this? https://hackaday.com/2022/03/18/welcome-to-the-future-where-...


We don’t really know the context of this anecdote, but if you have to completely re-run your test plan on a real plane with real munitions for newly deployed software, which is a pretty good idea, then I could see it costing millions, even if the fix deployed in a minute.


Because your firmware would put the jet into a spin.


This makes no sense and is difficult to even respond to coherently.

> It's funny that as an industry we now know how to safely deploy software updates to hundreds of millions of phones, with security checks, signed firmwares, etc,

Either you're completely wrong, because we "as an industry" still push bugs and security flaws, or you're comparing two completely different things.

> doing that on applications with a super high unit price tag seems out of reach...

is true because of

> a few millions is like only a thousand hours of flying in jet fuel costs alone

like do you really think they spent millions pushing a line of code? or do you think it's just inherently expensive to fly a jet, and so doing it twice costs more?


I would generally pass this comment by, but it's just so distastefully hostile because you totally missed the point.

GP's comment was expressing sardonic disbelief that a modern jet wouldn't be able to receive remote software updates, considering it's so ubiquitous and reliable in other fields, even those with much, much lower costs. Not that developers don't release faults.


People tend to opine on systems engineering as if we had some sort of information superconductor connecting all minds involved.

Systems are Hard and complex systems are Harder. Thinking of an entire class of failures as 'solved' is kinda like talking about curing cancer. There isn't one thing called cancer, there are hundreds.

There's no way to solve complex systems problems for good. Reality, technologies, tooling, people, language, everything changes all the time. And complex systems failure modes that happen today will happen forever.


Ahh, then I did misread it entirely. Thanks for stopping by to call me out.

It's still probably not a matter of capability... I wouldn't be so cavalier about software updates on my phone if it was holding me thousands of feet above the ground at the time.


I already commented on this elsewhere, but I came across a company that did OTA updates on a control box in vehicles without checking whether the vehicle was in motion or not. And it didn't even really surprise me; it was just one of those things that came up during the risk assessment when prepping for that job. They had never even thought of it.


Imagine remotely bricking a fleet of fighter jets.

https://news.ycombinator.com/item?id=35983866


> Imagine remotely bricking a fleet of fighter jets.

> https://news.ycombinator.com/item?id=35983866

That's about routers; was that the article you meant?


Remote software updates on military vehicles? Hasn't anyone seen the new Battlestar Galactica? :)


> Or maybe, a few millions is like only a thousand hours of flying in jet fuel costs alone, not a big deal...

Pretty much tbh. For example, the development of the Saab JAS 39 Gripen (JAS-projektet) is the most expensive industrial project in modern Swedish history at a cost of 120+ billion SEK (11+ billion USD).

It was also almost cancelled after a very public crash in central Stockholm at the 1993 Stockholm Water Festival [1]. That crash should not have happened, because the flight should not have been approved in the first place: they weren't yet confident that they'd completely solved the pilot-induced oscillation (PIO) issues that had wrecked the first prototype four years prior (with the same test pilot) [2].

It was basically a miracle that no one was killed or seriously hurt in the Stockholm crash; had the plane hit the nearby bridge or any of the other densely crowded areas, it would've been a very different story.

[1] https://youtu.be/mkgShfxTzmo?t=122

[2] https://www.youtube.com/watch?v=k6yVU_yYtEc


A few million dollars works out to a surprisingly small amount of time when you add overhead.

Call the bug fix a development team of 20 people taking 3 months end to end, from bug discovery to fix deployment. You'll probably have twice that much people-time again in project management and communication overhead (a 1:2 ratio of dev time to communication overhead is actually amazing in defense contexts). Assume a total cost per person of $200k per year (after factoring in benefits, overhead, and pork), so 60 people * 3 months * $200k/12 months = 3,000,000 USD.

It takes a lot of people to build an aircraft.


> if a (weight) load is detected on the landing gear

This state "weight on wheels" is used in a lot of other functionality, not just on military aircraft, as the hard stop for things that don't make sense if we're not airborne. So that makes sense (albeit obviously somebody needed to actually write this function)

Most obviously the gear retraction is disabled on planes which have retractable landing gear.


Yup, watch videos of actual missile launches--the missile descends below the plane that fired it. Can't do that on the ground, although you won't blow up your base because the weapon will not have armed by the time it goes splat.


You would think "grep the codebase for FIXME" would be in the checklist before deployment.


Regarding [1], was it Red Heat, True Lies, or Eraser?


True Lies was an airborne Harrier, not a Russian plane on the ground. So, while there are many reasons the scene was unrealistic, “weight on gear disables weapons” isn’t one.


I know, such a great movie, total classic of my childhood! It was the only R-rated film my mother ever allowed and even endorsed us watching, "Because Jamie Lee Curtis is hot." :D

I wasn't sure if there might've been a scene I'd forgotten.


Another train crash that holds a valuable lesson was https://en.wikipedia.org/wiki/Eschede_train_disaster

This demonstrates that sometimes "if you see something, say something" isn't enough - if a large piece of metal penetrates into the passenger compartment of a train from underneath, it's better to take the initiative and pull the emergency brake yourself.


Up to a point, but the "sometimes" makes it difficult to say anything definite. There's no shortage of stories where immediate intervention has made things worse, such as a burning train being stopped in a tunnel.

Furthermore, this sort of counterfactual or "if only" analysis can be used to direct attention away from what matters, as was done in the hounding of the SS Californian's captain during the inquiry into the sinking of the Titanic.

Here, one cannot fault the passenger for first getting himself and his family out of the compartment, and he correctly determined that the train manager's "follow the rules" response was inadequate in the circumstances - in fact, the inquiry might have considered the incongruity of having an emergency brake available for any passenger to use at any time, while restricting its use by train crew.

RCA quite properly focuses on the causes of the event, which would have been of equal significance even if the train had been halted in time, and which would continue to present objective risks unless addressed.


Not entirely true:

> Dittmann could not find an emergency brake in the corridor and had not noticed that there was an emergency brake handle in his own compartment.

The learning from that should maybe instead be to keep non-technical management out of engineering decisions. The Wikipedia article fails to mention there was a specific manager who pushed the new wheel design into production and then went on to have a long successful career.


I dislike the current trend of calling lessons "learnings". I don't understand the shift in meaning. Learning is the act of acquiring knowledge. The bit of knowledge acquired has a long established name: lesson. What's the issue with that?


The English article sounds more like he started looking for an emergency brake after he had notified the conductor (and apparently failed to convince him of the urgency of the situation), not before. The German article is much longer, but only mentions that both the passenger and the conductor could have prevented the accident if they had pulled the emergency brake immediately, but that the conductor was acting "by the book" when he insisted on inspecting the damage himself before pulling the brake.


In the movie 'Kursk' there is exactly such a scene, and the literal quote is "By the book, better start praying. I'm not reli...".


I don't think it should be "instead". Suggesting that emergency brakes are inadequate due to one passenger failing to locate one is kinda cheap.

We could also easily construe your argument as "engineers would never design a flaw", which is demonstrably untrue. We should both work to minimize errors, and to provide a variety of corrective measures in case they happen.


> Suggesting that emergency brakes are inadequate due to one passenger failing to locate one is kinda cheap.

That’s not what I wanted to say at all - the OP talked about the willingness to pull the emergency brake, but my understanding is that he was willing to and, due to human error, failed to find it. I didn’t mean to suggest in any way that emergency brakes are not important.

> We could also easily construe your argument as "engineers would never design a flaw"

Another thing I didn’t say. The whole original link is proof that engineers make mistakes all the time.


I think the software industry itself has accumulated enough bugs over the past few decades. E.g.

F-22 navigation system core dumps when crossing the international date line: https://medium.com/alfonsofuggetta-it/software-bug-halts-f-2...

Loss of Mars probe due to metric-imperial conversion error

I've a few of these myself (e.g. a misplaced decimal that made $12mil into $120mil), but sadly cannot divulge details.


The worst bug I encountered was when physically relocating a multi-rack storage array for a mobile provider. The array had never been powered down (!), so we anticipated that a good number of the spindles would fail to come up on restart. So we added an extra mirror to protect each existing RAID set. Problem is, a bug in the firmware meant the mere existence of this extra mirror caused the entire array's volume layout to become corrupted at reboot time. Fortunately a field engineer managed to reconstruct the layout, but not before a lot of hair had been whitened.


Close call. I know of a similar case where a fire suppression system check ended up with massive data loss in a very large storage array.


My worst bug was a typo in a single line of HTML that removed 3DS protection from many millions of dollars of credit card payments.


Pretty epic.

I was working for a web hosting company, and someone asked me to rush a change just before leaving. Instead of updating 1500 A records, I updated about 50k. Someone senior managed to turn off the cron though, so what I actually lost was the delta of changes between the last backup and my SQL.

I was in the room for this though: https://www.theregister.com/2008/08/28/flexiscale_outage/


I love the title of that article: "Engineer accidentally deletes cloud".

It's like a single individual managed to delete the monolithic cloud where everyone's files are stored.


That is eerily similar to what happened to us in IBM "Cloud", in a previous gig. An engineer was doing "account cleanup" and somehow our account got on the list and all our resources were blown away. The most interesting conversation was convincing the support person that those deletion audit events were in fact not us, but rather (according to the engineer's LinkedIn page) an SRE at IBM.


Although it probably wasn't funny at the time, I can imagine how comical that conversation was.

Thinking about it further, the term "cloud" is a good metaphor for storing files on someone else's computer, because clouds just disappear.


This was ~14 years ago, and both MS & AWS had data loss incidents IIRC.


Bear in mind, this was a small startup in 2008 that claimed to be the 2nd cloud in the world (read: on-demand IaaS provider).

Flexiscale at the time was a single region backed by a NetApp. Each VM essentially had a thin-provisioned LUN (logical volume); basically you copy-on-write the underlying OS image.

So when someone accidentally deletes vol0, they take out a whopping 6TB of data that takes ~20TB to restore, because you're rebuilding filesystems from safe mode (thanks, NetApp support). It's fairly monolithic in that sense.

I guess I was 23 at the time, but I'd written the v2 API, orchestrator & scheduling layer. It was fairly naive, but met the criteria of a cloud, i.e. elastic, on-demand, metered usage, despite using a SAN.


At this point there are basically 3 clouds, and then everyone else.


AWS, Azure and Cloudflare?


And Google Cloud Platform


I wonder if any such pathways remain at AWS, Google, Apple and MS that would still allow a thing like that to happen.


You could call that a feature making payments easier for customers! All 3DS does is protect the banks by inconveniencing consumers since banks are responsible for fraud.


I believe they pass on the risk to merchants now. If you let fraud through, $30 per incident or whatever. So typically things like 3DS are turned on because that cost got too high, and the banks assure you that it will fix everything.


It's mostly the payment processor. It may or may not be the bank itself.


My worst bug was changing how a zip code zone was fetched from the cache in a large ecommerce site with tens of thousands of users using it all day long. Worked great in DEV :D but when the thundering herd hit it, the entire site came down.


Startup, shutdown and migration are all periods of significantly elevated risk. Especially for systems that have been in the air for a long time, there are all kinds of ways in which things can go pear-shaped: drives that die on shutdown (or the subsequent boot-up), RAIDs that fail to rebuild, cascading failures, power supplies that fail, UPSes that fail, generators that don't start (or that run for 30 seconds and then quit because someone made off with the fuel), and so on.


I posted this, in case we want to collect these gems: https://news.ycombinator.com/item?id=37160295


In terms of engineering (QC, processes etc), the modern-day software industry is worse than almost any other industry out there. :-(

And no, plain complexity or a fast-moving environment is a factor but not the issue. It's that steps are skipped which are not skipped in other branches of engineering (e.g. continuous improvement of processes, learning from mistakes & implementing those lessons). In software land: the same mistakes are made again & again & again, poorly designed languages remain in use, the list goes on.

A long way still to go.


If you're making the same mistakes over and over again, I think that says more about your company than it does about the software industry.

My first job was at a major automotive manufacturer. Implementing half the procedures they had would slow down any software company 10X - just look at the state of most car infotainment systems. If something is safety-critical, obviously this makes sense, but the reality is 85% of software isn't.


GP was speaking in the general sense, not about their company.


Is that not coming from experience of working at a software company? As I believe you said elsewhere


It could easily be from looking from the outside in, as it is in my case.


Oh, so, without experience. Understood.


No, with lots of experience. I've been programming since 17, did it professionally for three decades, and have since moved into technical due diligence, 16 years and counting. That gives me a pretty unique perspective on IT: I get to talk to and work with teams from a very large sample of companies (230+ to date), and that in turn gives a fairly large statistical base for that opinion. This includes seed-stage companies, mid-sized ones and very large ones.

So forgive me if I take your attitude as non-productive: you are essentially just trying to discount my input based on some assumptions, because it apparently doesn't please you. I'm fine with that, but then just keep it to yourself instead of pulling down the level of discourse here. If you wanted to make a point, you could have been constructive rather than dismissive.


And reading this thread it doesn't look as if there is much awareness of that.


> In terms of engineering (QC, processes etc), the modern-day software industry is worse than almost any other industry out there. :-(

How do you know that?


Take airplane safety: a plane crashes, the cause of the crash is thoroughly investigated, and the report recommends procedures to avoid that type of cause in future crashes. Sometimes such recommendations become enforced across the industry. Result: air travel has become safer & safer, to the point where sitting in a (flying!) airplane all day is safer than sitting on a bench on the street.

Building regulations: similar.

Foodstuffs (hygiene requirements for manufacturers): similar.

Car parts: see ISO9000 standards & co.

Software: e.g. memory leaks - they've been around forever, but every day new software is released that has them.

C: ancient, not memory safe, should really only be used for niche domains. Yet it still is everywhere.

New AAA game: pay $$ after year(s?) of development, download a many-MB patch on day 1 because the game is buggy. It could have been tested better, but was released anyway 'cause getting it out & making sales weighed heavier than shipping a reliable, working product.

All of this = not improving methods.

I'm not arguing C v. Rust here or whatever. Just pointing out: better tools and better procedures exist, but using them is more the exception than the rule.

Like I said, the list goes on. Other branches of engineering don't (can't) work like that.


Exactly. The driving force is there, but what is also good is that the industry - for the most part at least - realizes that safety is what keeps them in business. So not only is there a structure of oversight and enforcement, there is also a strongly internalized culture of safety, created over decades, to build on. An engineer who proposed something obviously unsafe would not get to finish their proposal, let alone implement it.

In 'regular' software circles you can find the marketing department with full access to the raw data and the front end, if you're unlucky.


Experience?


On HN and Reddit, experience doesn't count. Only reading about others' experiences after they've been paid to write them up as research.


Don't forget to cite your sources! The nerds will rake you over the coals for not doing so.


In any other field of engineering, the engineers are all trained and qualified. In software 'engineering', not so much.


That training and qualification is only as good as the processes and standards being trained for and qualified on. We don't have those processes and standards to train against (and frankly I'm not convinced we should or even can) for generic "software engineers". I have a number of friends who are PEs, and it isn't the training and certification process that differentiates their work from mine, it is that there are very clear standards for how you engineer a safe structure or machine. But I contend that there is not a way to write such standards for "software". It's just too broad a category of thing. Writing control software for physical systems is just very different from writing a UX-driven web application. It would be odd and wasteful to have the same standards for both things.

I do think qualification would make sense for more narrow swathes of the "software engineering" practice. For instance, "Automotive Control Software Engineer", etc.


This is exactly why I support "engineer" being a protected term, like Doctor. It should tell you that a certain level of training and qualification has been met, to the point that the engineer is responsible and accountable for the work they do and sign off on. Especially for things that affect safety.

Many software engineers these days are often flying by the seat of their pants, moving quickly and breaking things. Thankfully this seems to largely be in places that aren't going to affect life or limb, but I'm still rubbed the wrong way seeing people (including myself, mind you) building run-of-the-mill CRUD apps under the title of engineer.

Is it a big deal? Not really. It's probably even technically correct to use the term this way. But I do think it dilutes it a bit. For context, I'm in Canada where engineer is technically a protected term, and there are governing bodies that qualify and designate professional engineers.


I'm curious how you think the word "Doctor" is protected.

Do you mean that History PhDs can't call themselves Doctors?

Or chiropractors can't pass themselves off as doctors?

Or you mean Doctor J was licensed to perform basketball bypass surgeries?

Or perhaps podiatrists can't deliver babies?


> Thankfully this seems to largely be in places that aren't going to affect life or limb

I've seen Agile teams doing medical stuff using the latest hotness. Horrorshow.

I've also seen very, very clean software and firmware development at really small companies.

It's all over the place and you have to look inside to know what is going on. Though job advertisements sometimes can be pretty revealing.


Are all "engineers" trained on human error, safety principles, and the like? The failures described in the article are precisely not software failures.


Yes? Most engineering programs (it might even be an accreditation requirement) involve ethics classes and learning from past failures.

My CS degree program required an ethics class and discussed things like the CFAA and famous cases like Therac-25, but nobody took it seriously because STEM majors think they are god's gift to an irrational world.


The important distinction is that engineers are professionally liable.


Does anyone have a similar compendium specifically for software engineering disasters?

Not of nasty bugs like the F-22 -- those are fun stories, but they don't really illustrate the systemic failures that led to the bug being deployed in the first place. Much more interested in systemic cultural/practice/process factors that led to a disaster.


Yes, the RISKS mailing list.


Find and take a CS ethics class.


My funniest was a wrong param in a template generator which turned off escaping of parameter values provided indirectly by the users. Good that it was discovered during the yearly pen testing analysis, because it led to shell execution in the cloud environment.


IMO this list isn't specific enough with the causes and takeaways, and esp. not specific enough to highlight minor critical omissions.

Most of these feel like "They didn't act fast enough or do a good enough job. They should do a better job".


I don't think this list hits any fundamental truths. The Great Depression doesn't have parallels to software failures beyond the fact that complex systems fail. And many of the lessons are vague and unactionable - "Put an end to information hoarding within orgs/teams", for example, says nothing. The Atlassian copy that section links to also says nothing. A lot of the lessons lack meaty learnings, and good luck to anyone trying to put everything into practice simultaneously.

Makes a fun list of big disasters though, and I respect this guy's eye for website design. The site design was probably stronger than the linked content; there is a lot to like about it.


Complex systems fail, but they don't all fail in the same way, and analyzing how they fail can help in engineering new and hopefully more robust complex systems. I'm a huge fan of Risks Digest and there isn't a disaster small enough that we can't learn from it.

Obviously, the larger the disaster, the more complex the failure and the harder it is to analyze the root cause. But one interesting takeaway for me from this list is that all of them were preventable, and in all but a few of the cases the root cause may have been the trigger, but the setup of the environment is what allowed the fault to escalate in the way that it did. In a resilient system faults happen as well, but they do not propagate.

And that's the big secret to designing reliable systems.


> ...one interesting takeaway for me from this list is that all of them were preventable...

Every disaster is preventable. Everything on the list was happening in human-engineered environments - as do most things that affect humans. The human race has been the master of its own destiny since the 1900s. The questions are how far before the disaster we need to look to find somewhere to act and what needed to be given up to change the flow of events.

But that doesn't have any implications for software engineering. Studying a software failure post mortem will be a lot more useful than studying 9/11.


> Every disaster is preventable.

No, there is such a thing as residual risk, and there are always disasters that you can't prevent, such as natural disasters. But even then you can have risk mitigation and strategies for dealing with the aftermath of an incident to limit the effects.

> Everything on the list was happening in human-engineered environments - as do most things that affect humans.

That is precisely why they were picked and make for good examples.

> The human race has been the master of its own destiny since the 1900s.

That isn't true and it likely will never be true. We are tied 1:1 to the fate of our star and may well go down with it. There is a small but non-zero chance that we can change our destiny but I wouldn't bet on it. And even then in the even longer term it still won't matter. We are passengers, the best we can do is be good stewards of the ship we've inherited.

> The questions are how far before the disaster we need to look to find somewhere to act and what needed to be given up to change the flow of events.

Indeed. So in the case of each of the items listed, the RCA gives a point in time where the accident, given the situation as it existed, was no longer a theoretical possibility but an event in progress. Situation and responses determined how far it got, and in each of the cases outlined you can come up with a whole slew of ways in which the risk could have been reduced, and possibly how the whole thing might have been averted once the root cause had triggered. But that doesn't mean that the root cause doesn't matter; it matters a lot. But the root cause isn't always a major thing. An O-ring, a horseshoe...

> But that doesn't have any implications for software engineering.

If that is your takeaway then for you it indeed probably does not. But I see such things in software engineering every other week or so and I think there are many lessons from these events that apply to software engineering. As do the people that design reliable systems, which is why many of us are arguing for liability for software. Because once producers of software are held liable for their product a large number of the bad practices and avoidable incidents (not just security) would become subject to the Darwinian selection process: bad producers would go out of business.

> Studying a software failure post mortem will be a lot more useful than studying 9/11.

You can learn lots of things from other fields, if you are open to learning in general. Myopically focusing on your own field is useful and can get you places but it will always result in 'deep' approaches, never in 'wide' approaches and for a really important system both of these approaches are valid and complementary.

To make your life easier the author has listed in the right hand column which items from the non-software disasters carry over into the software world, which I think is a valuable service. A middlebrow dismissal of that effort is throwing away an opportunity to learn, for free, from incidents that have all made the history books. And if you don't learn from your own and others' mistakes then you are bound to repeat that history.

Software isn't special in this sense. Not at all. What is special is the arrogance of some software people who believe that their field is so special that they can ignore the lessons from the world around them. And as a corollary: that they can ignore all the lessons already learned in software systems in the past. We are in an eternal cycle of repeating past mistakes with newer and shinier tools and we urgently need to break out of it.


It is 2023. The damage of natural disasters can be mitigated. When the San Andreas fault goes it'll probably get an entry on that list with a "why did we build so much infrastructure on this thing? Why didn't we prepare more for the inevitable?".

And this article is throwing out generic all-weather good sounding platitudes which are tangential to the disasters listed. He drew a comparison between the Challenger disaster and bitrot! Anyone who thinks that is a profound connection should avoid the role of software architect. The link is spurious. Challenger was about catastrophic management and safety practices. Bitrot is neither of those things.

I mean, if we want to learn from Douglas Adams he suggested that we can deduce the nature of all things by studying cupcakes. That is a few steps down the path from this article, but the direction is similar. It is not useful to connect random things in other fields to random things in software. Although I do appreciate the effort the gentleman went to, it is a nice site and the disasters are interesting. Just not relevantly linked to software in a meaningful way.

> We are tied 1:1 to the fate of our star and may well go down with it

I'm just going to claim that is false and live in the smug comfort that when circumstances someday prove you right neither of us will be around to argue about it. And if you can draw lessons from that which apply to practical software development then that is quite impressive.


An Ariane 5 failed because of bitrot, so the headline comparison of rocket failures makes sense. Not testing software with new performance parameters before launch sounds like catastrophic management to me.


> It is 2023. The damage of natural disasters can be mitigated.

That is a comforting belief, but it is probably not true. We have no plan for a near-Earth supernova explosion. Not even in theory.

Then there are asteroid impacts. In theory we could have plowed all of our resources into planetary defences, but in practice in 2023 we can very easily get sucker punched by a bolide and go the way of the dinosaurs.


> It is 2023.

So? Mistakes are still being made, every day. Nothing has changed since the stone age except for our ability - and hopefully willingness - to learn from previous mistakes. If we want to.

> The damage of natural disasters can be mitigated.

You wish.

> When the San Andreas fault goes it'll probably get an entry on that list with a "why did we build so much infrastructure on this thing? Why didn't we prepare more for the inevitable?".

Excellent questions. And in fairness to the people living on the San Andreas fault - and near volcanoes, in hurricane alley and in countries below sea level - we have an uncanny ability to ignore history.

> And this article is throwing out generic all-weather good sounding platitudes which are tangential to the disasters listed.

I see these errors all the time in the software world, I don't care what hook he uses to again bring them to attention but they are probably responsible for a very large fraction of all software problems.

> He drew a comparison between the Challenger disaster and bitrot!

So let's see your article on this subject then that will obviously do a much better job.

> Anyone who thinks that is a profound connection should avoid the role of software architect.

Do you care? It would be better to say that those who are unwilling to learn from the mistakes of others should avoid the role of software architect, because on balance that's where the problems come from. You seem to have a very narrow viewpoint here: that because you don't like the precision or the links that are being made, you can't appreciate the intent and the subject matter. Of course a better article could have been written, and of course you are able to dismiss it entirely because of its perceived shortcomings. But that is exactly the attitude that leads to a lot of software problems: the inability to ingest information when it isn't presented in the recipient's preferred form. This throws out the baby with the bath water: the author's intent is to educate you and others on the ways in which software systems break, using something called a narrative hook as a framework. That these won't match 100% is a given. Spurious connection or not, documentation and actual fact creeping out of spec, aka the normalization of deviation in disguise, is exactly the lesson from the Challenger disaster, and if you don't like the wording I'm looking forward to your improved version.

> Challenger was about catastrophic management and safety practices.

That was a small but critical part of the whole. I highly recommend reading the entire report on the subject; it makes for fascinating reading, and there are a great many lessons to be learned from it.

https://www.govinfo.gov/content/pkg/GPO-CRPT-99hrpt1016/pdf/...

https://en.wikipedia.org/wiki/Rogers_Commission_Report

And many useful and interesting supporting documents.

> I mean, if we want to learn from Douglas Adams he suggested that we can deduce the nature of all things by studying cupcakes.

That's a completely nonsensical statement. Have you considered that your initial response to the article precludes you from getting any value from it?

> It is not useful to connect random things in other fields to random things in software.

But they are not random things. The normalization of deviation, in whatever guise it comes, is the root cause of many, many real-world incidents, both in software and outside of it. You could argue with the wording, but not with the intent or the connection.

> Although I do appreciate the effort the gentleman went to, it is a nice site and the disasters are interesting. Just not relevantly linked to software in a meaningful way.

To you. But they are.

> > We are tied 1:1 to the fate of our star and may well go down with it

> I'm just going to claim that is false and live in the smug comfort that when circumstances someday prove you right neither of us will be around to argue about it.

So, you are effectively saying that you persist in being wrong simply because the timescale works to your advantage?

> And if you can draw lessons from that which apply to practical software development then that is quite impressive.

Well, for starters, I would argue that many software developers indeed create work that holds up just long enough until they've left the company, and that that attitude is an excellent thing to lose and a valuable lesson to draw from this discussion.


So the article had a list of disasters and some useful lessons learned in its left and center columns. It also had lists of truisms about software engineering in the right column. They had nothing fundamental to do with each other.

For instance, it tries to draw an equivalence between "Titanic's Captain Edward Smith had shown an "indifference to danger [that] was one of the direct and contributing causes of this unnecessary tragedy." and "Leading during the time of a software crisis (think production database dropped, security vulnerability found, system-wide failures etc.) requires a leader who can stay calm and composed, yet think quickly and ACT." which are completely unrelated: one is a statement about needing to evaluate risks to avoid incidents, another is talking about the type of leadership needed once an incident has already happened. Similarly, the discussion about Chernobyl is also confused: the primary lessons there are about operational hygiene, but the article draws "conclusions" about software testing which is in a completely different lifecycle phase.

There are certainly lessons to be learned from past incidents both software and not, but the article linked is a poor place to do so.


So let's take those disasters and list the lessons that you would have learned from them. That's the way to constructively approach an article like this; out-of-hand dismissal is just dumb and unproductive.

FWIW I've seen the leaders of software teams, all the way up to the CTO, run around like headless chickens during (often self-inflicted) crises. I think the biggest lesson from the Titanic is that you're never invulnerable, even when you have been designed to be invulnerable.

None of these are exhaustive and all of them are open to interpretation. Good, so let's improve on them.

One general takeaway: managing risk is hard, especially when working with a limited budget (which is almost always the case). Just the exercise of assessing and estimating likelihood and impact is already very valuable, but plenty of organizations have never done any of that. They are simply utterly blind to the risks their org is exposed to.

Case in point: a company that made in-car boxes that could be upgraded OTA. And nobody thought to verify that the vehicle wasn't in motion...


There are two useful lessons from the Titanic that can apply to software:

1) Marketing that you are super duper and special is meaningless if you've actually built something terrible (the Titanic was not even remotely as unsinkable as claimed, with "water tight" compartments that weren't actually watertight)

2) When people below you tell you "hey, we are in danger", listen to them. Don't do things that are obviously dangerous and make zero effort to mitigate the danger. The danger of Atlantic icebergs was well understood, and the Titanic was warned multiple times! Yet the captain still had inadequate monitoring, and did not slow down to give the ship more time to react to any threat.


Good stuff, thank you. This is useful, and it (2) ties into the Challenger disaster as well.


The one hangup with "Listen to people warning you" is that they produce enough false positives as to create a boy who cried wolf effect for some managers.


Yes, that's true. So the hard part is to know who is alarmist and who actually has a point. In the case of NASA the ignoring bit seemed to be pretty wilful. By the time multiple engineers warn you that this is not a good idea and you push on anyway I think you are out of excuses. Single warnings not backed up by data can probably be ignored.


As I heard one engineering leader say, "it's okay to make mistakes — once". Meaning, we're all fallible, mistakes happen, but failure to learn from past mistakes is not optional.

That said, a challenge I have frequently run into, and I feel is not uncommon, is a tension between the desire not to repeat mistakes and ambitions that do generally involve some amount of risk-taking. The former can turn into a fixation on risk and risk mitigation that becomes a paralyzing force; to some leaders, lists like these might just look like "a thousand reasons to do nothing" and be discarded. Yet history is full of clear cases where a poor appreciation of risk destroyed fortunes, with a chorus of "I told you sos" in their wake.

It is a difficult part of leadership to weigh the risk tradeoffs for a particular mission, and presenting things in absolute terms of "lessons learned" rarely makes sense, in my experience. The belt-and-suspenders approaches that make sense for authoring the critical control software for a commercial passenger aircraft or an industrial control system for nuclear plants probably do not make sense for an indie mobile game studio, even if they're all in some way "software engineering".


Mistakes happen, things go wrong, for the vast majority of us a bug doesn't mean someone dies or lights go out or planes don't take off. For most of us here, the absolute worst case scenario is that a bug means a company nobody has ever heard of makes slightly less money for a few minutes or hours until it gets rolled back. Again, worst case. The average case is probably closer to a company nobody has ever heard of makes exactly the same amount of money but some arbitrary feature nobody asked for ships a day or two later because we spent time fixing this other thing instead.

It's really hard to strike a balance between "pushing untested shitcode into prod multiple times a week" and "that ticket to change our CTA button color is done but now needs to go through 4 days of automated and manual testing." I think as an industry most of us are probably too far on the latter side of the spectrum in relation to the stakes of what we're actually doing day to day.


Failure is not optional. Definitely true :)


(Root Cause Analysis)


I was excited that it might be about 1960s RCA computers being software incompatible with IBM s/360

https://en.m.wikipedia.org/wiki/RCA_Spectra_70

Then I finished the headline and opened this Wikipedia article


This whole RCA terminology has to die. The idea that there exists a single "Root Cause" that causes major disasters is fundamentally flawed.


Actually everywhere "defense in depth" is used (not only in computing, but also in e.g. aviation), there can't be one single cause for a disaster - each of the layers has to fail for a disaster to happen.


Almost every accident, even where there is no defense in depth, has more than one cause. Car accident: person A was on the phone, person B didn't spot their deviation in time to do anything about it: accident. A is the root cause. If B had reacted faster there wouldn't have been an accident, but there would still be cause for concern and there would still be a culprit. The number of such near misses and saves by others is similar to defense in depth in effect, even if it wasn't engineered in. But person B isn't liable even though their lack of attention to what was going on is a contributory factor. So root causes matter; that's the first and clearest thing to fix. Other layers may be impacted and may require work, but that isn't always the case.

In software the root cause is often a very simple one: an assumption didn't hold.


I think if the people you're working with insist on narrowing it down to a single Root Cause, they're missing the entire point of the exercise. I work with large drones day to day, and when we do an accident investigation we're always looking for root causes, but there are almost always multiple. I don't think we've ever had a post-accident RCA investigation that resulted in only one corrective action. Several times we have narrowed it down to a single software bug, but to get to the point where a software bug causes a crash, there are always a number of other factors that have to align (e.g. the pilot was unfamiliar with the recovery procedure, multiple cascaded failures, etc.)


Yes, that's true in the general sense. But root causes are interesting because they are the things that can lead to insights that can help the lowest levels of engineering to become more robust. But at a higher level it is all about systems and the way parts of those systems interact, fault tolerance (massively important) and ensuring faults do not propagate beyond the systems they originate in. That's what can turn a small problem into a huge disaster. And without knowing the root cause you won't be able to track those proximate causes and do something about it. So RCA is a process, not a way to identify the single culprit. So this is more about the interpretation of the term RCA than about what RCA really does.


This is a generic term. It does not imply that there has to be a single cause.


If we call it Root Causes Analysis can we keep the acronym?


Instead of attempting to design a perfect system that cannot fail, the idea is to design a system that can tolerate failure of any component. (This is how airliners are designed, and is why they are so incredibly reliable.)

Safe Systems from Unreliable Parts https://www.digitalmars.com/articles/b39.html

Designing Safe Software Systems part 2 https://www.digitalmars.com/articles/b40.html


Yet again another page that doesn't bother to explain what this has to do with the Radio Corporation of America, or what "RCA" might otherwise stand for. RCA's greatest disaster was SelectaVision, but that's not on the list, and Wikipedia's disambiguation page doesn't have anything relevant for what "RCA" might mean in relation to the things on this page. Explain yourselves, folks!


The Challenger disaster is on the list but should be expanded upon:

1) Considerable pressure from NASA to cover up the true sequence of events. They pushed the go button even when the engineers said no. And they failed to even tell Morton-Thiokol about the actual temperatures. NASA dismissed the observed temperatures as defective--never mind that "correcting" them so the temperature at the failed joint was as expected meant that a bunch of other measurements were now above ambient. (The offending joint was being cooled by boiloff from the LOX tank that, under the weather conditions at the time, ended up cooling that part of the booster.)

2) Then they doubled down on the error with Columbia. They had multiple cases of tile damage from the ET insulating foam. They fixed the piece of foam that caused a near-disaster--but didn't fix the rest because it had never damaged the orbiter.

Very much a culture of painting over the rust.


It isn't just software.

The Deepwater Horizon was the result of multiple single points of failure that zippered to catastrophe. Each of those single points of failure could have been snipped at little to no cost.

The same goes for the Fukushima disaster. For example, venting the excess hydrogen into the building, where it could accumulate until a random spark set it off.


The software industry has enough disasters of its own that we don't need parallels from other industries to learn from. That actually makes it look like there are super non-obvious things that we could apply to software, when in fact it's all pretty mundane.


There is or was an old-school website called something like "Byzantine failures" that had case studies of bizarre failures from many engineering fields. It was entertaining but I am unable to find it now. Does anyone know it?


I think you're talking about Risks Digest.

http://catless.ncl.ac.uk/Risks/

It's very much old-school.


Thank you, that is the site. I don't know where I got "Byzantine failures" from.


No idea either, but the 'old school' hint did the trick. It is one of my favorite internet hang-outs. It's either that or HN (or reading books or playing piano). Endless interesting material and very useful for me professionally.


Thanks for sharing this. I am reminded of the talks by Nickolas Means (https://www.youtube.com/watch?v=1xQeXOz0Ncs)


This is why software engineering is a protected profession in some parts of the world (Canada at least), as civil responsibility and safety, along with formal legal liability, are part of licensure.


Technically true, but most software developers in Canada aren't P. Engs, and tons of Canadian software companies use a "Software Engineer" title with no repercussions, so I'm not sure you can point at that as a success.


Care to elaborate? I know professional engineers in Canada get a designation but I’m not aware of anything similar for software engineers.


Engineer is a regulated term and profession in Canada, with professional designations like the P.Eng - they get really mad when people use the term engineer more loosely, as is common in the tech industry.

Because of this, there are "B.Seng" programs at some Canadian universities, as well as the standard "B.Sc" computer science program.

The degree was very new when I attended uni, so I went for comp sci instead as it seemed more "real". The B.Seng kids seemed to focus a lot more on industry things (classes on object-oriented programming), which everyone picked up when doing internships anyways. They also had virtually no room for electives, whereas the CS calendar was stacked with very interesting electives which imo were vastly more useful in my career.

In practice, no one gives a hoot which degree you have, and we tend to just use the term SWeng regardless.

It honestly kinda feels like a bunch of crotchety old civil engineers trying to regulate an industry they're not a part of. I have _never_ seen a job require this degree.


Software engineers are the same as all other engineering professions, and regulated by the same provincial PEG associations. While most employers don't care about it, some software positions where the safety of people is on the line (e.g. aeronautics) or there's a special stake do have requirements to employ professional software engineers.

I think you're actually not even supposed to call yourself an engineer unless you're a professional engineer.


The hindsight bias is a helluva drug.

Hoping the author of that page sees: https://youtu.be/TqaFT-0cY7U


Ah, a topic related to organizational decay and decline. This is an area I've been studying a lot over the last few years, and I encourage the author of this blog post to read this paper on the Challenger disaster.

Organizational disaster and organizational decay: the case of the National Aeronautics and Space Administration http://www.sba.oakland.edu/faculty/schwartz/Org%20Decay%20at...

some highlights:

> There are a number of aspects of organizational decay. In this paper, I shall consider three of them. First is what I call the institutionalization of the fiction, which represents the redirection of its approved beliefs and discourse, from the acknowledgement of reality to the maintenance of an image of itself as the organization ideal. Second is the change in personnel that parallels the institutionalization of the fiction. Third is the narcissistic loss of reality which represents the mental state of management in the decadent organization.

> Discouragement and alienation of competent individuals

> Another result of this sort of selection must be that realistic and competent persons who are committed to their work must lose the belief that the organization's real purpose is productive work and come to the conclusion that its real purpose is self-idealization. They then are likely to see their work as being alien to the purposes of the organization. Some will withdraw from the organization psychologically. Others will buy into the nonsense around them, cynically or through self-deception (Goffman, 1959), and abandon their concern with reality. Still others will conclude that the only way to save their self-esteem is to leave the organization. Arguably, it is these last individuals who, because of their commitment to productive work and their firm grasp of reality, are the most productive members of the organization. Trento cites a number of examples of this happening at NASA. Considerations of space preclude detailed discussion here.

Schwartz, H.S., 1989. Organizational disaster and organizational decay: the case of the National Aeronautics and Space Administration. Industrial Crisis Quarterly, 3: 319-334.


The Therac-25 disaster should be on this list: https://en.wikipedia.org/wiki/Therac-25


[flagged]


Bruh! Does it matter if a sensible thought comes from a high school kid or a parrot, as long as it is sensible?


I think the display of lessons shows a lack of experience in understanding software project management, which isn't surprising if they are a high schooler.


Nice ad hominem argument you've got there -- criticizing the person bringing the argument and not the argument itself.

If your point is that the person is likely to be ignored by his/her management, there's likely a better way to phrase it, or it's worth adding a few words to clarify.




