Just another plug for the comp.risks digest. This digest from the ACM Committee on Computers and Public Policy has been moderated continuously by Peter G. Neumann since its inception in 1985. If you don't frequent Usenet like you used to in the 80s, the web archive is here:
I've always taken that machine as an argument against event-driven programming. Why? Well, John Carmack articulated the problems very nicely when he wrote about inline code, covered on HN here:
The Therac problem was a result of states getting out of sync and into an undesirable configuration. I think reading about the machine and then the Carmack piece above will make the connection clear.
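To make that connection a bit more concrete, here is a toy sketch in Python. It is not the Therac-25's actual code or design: the class, the handler names, and the three-field state are all made up for illustration. The event-driven version lets each handler update its own slice of state, so an edit arriving at the wrong moment leaves a combination nobody intended. The inline version derives everything from one input in a single pass.

```python
from dataclasses import dataclass

# Hypothetical toy model: the only combinations we consider safe.
SAFE = {("xray", "target_in", "high"), ("electron", "target_out", "low")}


@dataclass
class EventDriven:
    """Each handler touches only its own slice of the state."""
    mode: str = "xray"
    turntable: str = "target_in"
    power: str = "high"

    def on_mode_edit(self, mode):
        self.mode = mode

    def on_setup_done(self):
        # Positions the turntable for whatever the mode is right now.
        self.turntable = "target_in" if self.mode == "xray" else "target_out"

    def on_power_set(self, power):
        self.power = power

    def config(self):
        return (self.mode, self.turntable, self.power)


def inline_configure(mode):
    """Derive the whole configuration from one input, top to bottom."""
    if mode == "xray":
        return (mode, "target_in", "high")
    return (mode, "target_out", "low")


machine = EventDriven()
machine.on_power_set("high")      # set up for x-ray
machine.on_mode_edit("electron")  # operator edits the mode afterwards
machine.on_setup_done()           # turntable follows the new mode...
# ...but nothing re-ran the power handler, so the pieces disagree.
print(machine.config(), machine.config() in SAFE)
print(inline_configure("electron"), inline_configure("electron") in SAFE)
```

The inline version isn't magic; it just has no way to produce a configuration where the pieces were computed from different inputs.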
I see it more generally as the perils of unwarranted complexity. One of the bugs was a race condition that - I'm almost willing to bet - would not have existed if they hadn't tried to be "overly clever" and incorporate a crude approximation of a multitasking OS in their software.
It is mentioned almost summarily in the report - "Designs should be kept simple" is a phrase in there - but I think that this excessive complexity was one of the biggest factors.
This Hoare quote is relevant: "There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies."
> "incorporate a crude approximation of a multitasking OS in their software"
Many embedded systems have a 'crude' OS library... and in many cases this makes them far simpler than including an RTOS. Not having seen the code here, I can't comment on this one, but just including a simple scheduler is not necessarily a bad design decision.
With the other aspects of your "Keep It Simple" answer, I fully agree.
The Leveson paper is quite long, and not all parts are equally important:
Skim Sections 1 and 2. You should understand the basics of the Therac-25's design and how it was used. (You may also find this figure a helpful accompaniment to Figure 1 on page 4.)
Skim Sections 3.1-3.3, which detail a few of the Therac-25 incidents.
Read Sections 3.4 and 3.5. These detail a particular incident, the software bug that led to it, and the response to the bug. Pay close attention to 3.5.3, which describes the bug.
Skip Section 3.6. (It describes an additional incident and a different bug; feel free to read it if you are interested, though.)
Read Section 4 closely.
It's ironic that this article mentions the Toyota Production System as an example of a safe and defect-free system. Another article about Toyota was posted on HN today:
«Toyota's Unintended Acceleration and the Big Bowl of "Spaghetti" Code (2013)»
Apparently Toyota's software development doesn't follow the Toyota Production System.
OK, that was a flip comment; it's pretty clear that TPS isn't suited to software development. However, it does seem clear that Toyota's software development practices are deficient.
In my opinion, the blame lies not with the software bug but with bad user interface design. When the error occurred that caused patients to get a direct blast of 10x more radiation than they were supposed to get, the error was caught and displayed. But because there were so many spurious error messages, users were used to bypassing them. I go into much more detail in the book but I thought I'd chime in here. What do you all think?
Not to be self-serving, but I've always been fascinated by Therac-25. I ended up doing a deep dive a few months back and put together a short 5ish minute podcast episode about it:
While this is pretty much the Ur-example of faulty software design causing human injury, the fact is that the entire system failed. Had the Therac-25 not removed the hardware interlocks of the Therac-20, the accidents would've been much less likely to occur.
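As a rough illustration of why the missing interlocks matter, here is a toy Python sketch. It is not how either machine actually worked: the function names and the state dictionary are hypothetical, and a real interlock is a physical circuit, not a function call. The point is only that an independent second check can veto a firing decision when the software's own view of the world is wrong.

```python
def software_says_ok(state):
    # Hypothetical software check; assume it can be wrong.
    return state.get("software_ok", False)


def hardware_interlock_closed(state):
    # Stand-in for an independent physical sensor: a high-power beam
    # is only allowed with the target in place.
    return state["beam_power"] == "low" or state["target_in_place"]


def may_fire(state, interlock_fitted=True):
    """Fire only if every fitted layer of protection agrees."""
    if not software_says_ok(state):
        return False
    if interlock_fitted and not hardware_interlock_closed(state):
        return False
    return True


# The software believes all is well, but the physical setup is unsafe.
bad = {"software_ok": True, "beam_power": "high", "target_in_place": False}
print(may_fire(bad, interlock_fitted=False))  # software-only protection
print(may_fire(bad))                          # interlock vetoes the shot
```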
I also think that we should be careful in trying to draw too much caution in what we do from this accident--the majority of software (EHR systems, apps, etc.) being developed in the medical field today would not be served by the sort of scrutiny that would've prevented this accident.
In fact, one could (and I will) make the argument that simply having faster release cycles and better customer interfacing (instead of, say, custom consulting work *cough* Epic *cough*) would improve quality more than some insanely rigorous pile of paperwork.
A thorough review of a software product is not an insanely rigorous pile of paperwork. I think I'm going to have to disagree with you about the kind of caution that we can draw from this incident; in fact, I think cases like these should be mandatory study material for anybody who makes or moves into making software for critical applications.
I've built some stuff controlling machinery that would amputate your arm in a split second and 'faster release cycles' would have caused accidents, not better quality.
Exhaustive testing, thorough review and extensive documentation of not only the code but also the reasoning behind the code saved my ass more than once from releasing something in production that would have likely caused at a minimum a serious accident.
One of my rules for writing machinery-controlling software is that I determine when a new piece of software can be taken out of my hands to be passed up the chain. The only time someone violated that rule, this happened:
It was around 6 pm when we finished working on the control software of a large lathe, a Reiden machine with a 16' bed and a 2' chuck. I put the disks with the new version on the edge of my desk for 'air' (machine otherwise not powered up), 'wood' and 'aluminum' testing the next day. In simulation it all looked good but it's easy to make mistakes.
When I walked back onto the shop floor the next morning it was deadly quiet. My boss was sitting in his office upstairs and I asked him what was up. Before I arrived, he'd taken those disks to do a 'quick demonstration' of a new feature (thread cutting, iirc) for a prospect. A subtle bug caused the machine to start cutting with a feed of 10mm instead of 1mm, and the stainless steel he used for the demo got cut up into serrated carving knives spinning out of the machine at very high speed. Amazingly, nobody had gotten wounded or killed, mostly due to the power of the Reiden (it never even stalled) and the holding force of the chuck (which had to keep hold of the workpiece during all that violence). The machine had actually completed its cycle and the customer had left 'most impressed' (and probably a few shades paler than they arrived...). They actually bought the machine on the strength of the demo and some showmanship from my boss, cheeky bastard. For all the same money there would have been a couple of ambulances in front of the building that day.
After that, nobody ever tried to use any of the binaries until I had signed off on them as 'safe for production'.
That mistake would have definitely been caught in the 'wood' testing phase and a 'faster release cycle' would have missed it entirely since it looked very good right up to the moment where the cutting bit hit the metal.
Test protocols exist for a reason; skip them and you're playing with fire. Faster release cycles are great for non-critical software.
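For what it's worth, the sign-off rule can be written down as a small gate. This is only a sketch in Python: the stage names come from the story above, while the Release class, its methods, and the version string are hypothetical and not what was actually used in that shop. A build counts as safe only after every stage has passed, in order.

```python
# Stages in the order they have to be passed.
STAGES = ("simulation", "air", "wood", "aluminum")


class Release:
    def __init__(self, version):
        self.version = version
        self.passed = []

    def record_pass(self, stage):
        """Record a passed stage; refuse skipped or out-of-order stages."""
        if len(self.passed) == len(STAGES):
            raise ValueError("all stages already passed")
        expected = STAGES[len(self.passed)]
        if stage != expected:
            raise ValueError(f"expected {expected!r} next, got {stage!r}")
        self.passed.append(stage)

    def safe_for_production(self):
        return tuple(self.passed) == STAGES


build = Release("thread-cutting-demo")  # hypothetical version label
build.record_pass("simulation")
print(build.safe_for_production())  # looked good in simulation only
for stage in ("air", "wood", "aluminum"):
    build.record_pass(stage)
print(build.safe_for_production())
```

A gate like this only helps if the people with the disks actually respect it, which is the whole point of the story.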
That's an excellent story, and something to remember when working on automated systems, especially industrial ones.
For something like, say, an automated surgery robot or da Vinci Surgical System, or the Therac here, or an implantable insulin pump, or whatever, it absolutely makes sense to be super rigorous in testing.
For something that's basically just a big document database, though? Or a glorified calculator? Or graphing and charting app? Or messaging app?
Hardly necessary.
In fact, the sort of testing and software rigor that makes sense for embedded systems (like your lathe or the Therac machine here) is pretty much the worst way possible to release one of the aforementioned systems on time and under budget and useful enough to actually make people productive.
Adding more "rigor" to these applications would only serve as a barrier to entry for folks trying to improve the industry. It wouldn't save lives and it would only increase the power of the monopolies of existing players.
Agreed. In fact, back at MIT, this was required reading in 6.033, one of the required classes for CS. We spent 1-2 weeks discussing it and its issues at great length. One of my favorite courses by far.
In fact, that whole course was brilliant. It was almost like a seminar where we read a ton of seminal CS papers (X Windows was one of my other favorites) and discussed them / studied them. Really one of my most memorable courses.
http://catless.ncl.ac.uk/Risks
The Therac-25 was discussed here many times, starting with Vol. 3 Issue 9:
http://catless.ncl.ac.uk/Risks/3.09.html#subj2