Hands-on Assignment – Therac-25 (2007) (web.mit.edu)
106 points by luu 17 days ago | 57 comments



http://web.mit.edu/6.033/2007/wwwdocs/assignments/handson-li...

>I fully recognize that there are dangers and risks to which I may be exposed by participating in Reproduction of Therac-25 accidents. The following is a description and/or examples of significant dangers and risks associated with this activity: Acute gullibility, Failure to understand April Fool's jokes, Night terrors associated with medical radiation machines.


Leveson's fantastic Therac-25 paper is probably the most important document in the formative years of my young sweng career.

I still re-read it every couple of years, and it's held up a lot better than one of my other early favorites, Feynman's appendix to the Challenger report, in the sense that I still draw new thoughts and realizations from it as I re-read it with additional experience in the engineering and organizational disciplines it touches. Sad as it is, it's got a little bit of everything.

It's definitely got my vote for the I Ching of critical systems engineering.

Spend the time. Chances are you'll remember it.


Do you mind specifying the title of the paper? It appears there are quite a few papers[1][2][3] on the Therac-25 by an author named Leveson.

[1] http://sunnyday.mit.edu/papers/therac.pdf

[2] https://ieeexplore.ieee.org/document/274940

[3] https://ieeexplore.ieee.org/document/8102762


Gladly! It's the second of these, "An investigation of the Therac-25 accidents" (1993) w/ DOI 10.1109/MC.1993.274940.

[1] is a later version that was an appendix to her book Safeware (which I have not read), and [3] is a nice second read that follows up on [2] many years later but isn't quite the relentless engineering detective story that makes the original so poignant.



This is the one :-). If I recall right, the journal scan had some additional diagrams some uploads have omitted, so while a PDF isn't the most ergonomic thing it's probably the safe bet.


I'll take the opportunity to ask whether there's a good way to read PDFs on a phone. It's terrible!


Probably a Nobel to whoever solves this.


How ironic that, at the time, she was in a professorship endowed by Boeing.


Safeware is good. I read it back in the day. Several good failure analyses.


Thank you!



As the kids say, this is correct. It's very good. If you have never read anything like it before, it could be mind-blowing.

Her later work like Engineering a Safer World (available as a free pdf if you poke around on the MIT Press website) is merely good.


Agreed on the Leveson paper!

Is there something specific about the Feynman appendix that you think hasn't aged as well, or is it more that you've squeezed all of the juice out of that fruit already?


More the latter! It's still a very charismatic text that I'm fond of, and it's of course also intensely quotable. But given its brevity it can only deliver so much.


If you want a much more detailed treatment that covers every aspect of the technical issues, as well as the corporate and government malfeasance that led to the disaster, consider “Truth, Lies, and O-Rings” by Allan McDonald, the Thiokol program manager who would not sign off on the launch and was overruled by his executives. He had good things to say about Feynman, and definitely some axes to grind with other individuals.

https://a.co/d/1f6MBUb


Thanks!


That paper hit me like a freight train, and I'm still recommending it whenever related topics come up. I can't say enough good things about it.


I remember reading --- and vehemently disagreeing with --- the report on the incidents, which danced around the matter but didn't point directly at the underlying cause: excessive complexity in the software, which easily created bugs and then hid them. For example, they used multiple threads and a whole OS when a simple loop would've been sufficient, perhaps in a misguided attempt to keep the UI "responsive"; there would not be any race conditions if everything ran in the same thread.

As Tony Hoare says: "There are two ways to develop software: Make it so simple that there are obviously no bugs, or so complex that there are no obvious bugs."
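
For anyone who hasn't seen the failure mode spelled out: here's a minimal C sketch of the class of race in question. It is emphatically not the actual Therac-25 code (that was PDP-11 assembly); the names and timings are invented. The treatment task snapshots the operator's input once, spends seconds settling the magnets, and never rechecks, so an edit that lands in that window splits the machine's state:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    enum mode { MODE_XRAY, MODE_ELECTRON };

    /* Shared, unsynchronized state: the root of the bug. */
    static volatile enum mode requested = MODE_ELECTRON;

    static void *keyboard_task(void *arg)
    {
        (void)arg;
        sleep(1);               /* operator edits the prescription... */
        requested = MODE_XRAY;  /* ...and it lands silently mid-setup */
        return NULL;
    }

    static void *treatment_task(void *arg)
    {
        (void)arg;
        enum mode setup = requested;  /* snapshot read once at setup time */
        sleep(2);                     /* magnets take seconds to settle */
        /* Bug: no recheck of `requested` before firing. */
        printf("magnets set for %s, beam fired as %s\n",
               setup == MODE_XRAY ? "X-ray" : "electron",
               requested == MODE_XRAY ? "X-ray" : "electron");
        return NULL;
    }

    int main(void)
    {
        pthread_t kb, tx;
        pthread_create(&tx, NULL, treatment_task, NULL);
        pthread_create(&kb, NULL, keyboard_task, NULL);
        pthread_join(kb, NULL);
        pthread_join(tx, NULL);
        return 0;
    }

In a single loop, input would be read and applied at exactly one point per cycle, so setup and firing always see the same state --- which is the point above.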


Having written a lot of safety critical code, I don't think you could ever convince me that multithreading was the most important safety issue. It was a safety critical system written by a single anonymous person in assembly without adequate testing. The process was fundamentally broken. Changing one technical decision would have simply exposed another failure somewhere else.


It's been a while since I read the detailed analysis, but wasn't it basically this?

1. A bug in the software that was prevented from doing harm (not necessarily masked but reduced to a minor irritant) by a hardware safety interlock.

2. The hardware safety interlock being removed in a cost-reduction.

I know that "field mileage" is a powerful indicator of quality. Even if the code is awful and full of workarounds - if you have a lot of field experience you'd be loath to mess with it. So by the time the cost reduction came along the software would be fully trusted.


Yes, that's essentially the story, though it leaves out a lot of safeguards that should have existed (from a hindsight-is-20/20 perspective). There's no "fully trusted" in safety critical code, so the software should have gone through a complete review when the safety model changed from hardware based to software based, including a HARA and a single-point-of-failure analysis. The software should have had a specification and testing that included things like counter rollover and variable race conditions. The single developer should have had someone else to review their work and catch mistakes.

This is standard practice these days, in large part specifically to prevent disasters like the Therac-25. While it's a little unfair to judge them by standards that didn't exist yet, it was still a broken process.
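
The counter-rollover item deserves spelling out because of how small it is. Per the Leveson paper, a one-byte flag was incremented on each pass of the setup test instead of being set, so every 256th pass it wrapped to zero, which read as "nothing to check." A hypothetical C sketch of that shape (simplified names; the original was assembly):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    static bool collimator_ok(void) { return false; }  /* assume mispositioned */

    int main(void)
    {
        uint8_t class3 = 0;  /* one byte; nonzero means "run the safety check" */
        for (int pass = 1; pass <= 256; pass++) {
            class3++;        /* the bug: should have been class3 = 1 */
            if (class3 != 0) {
                if (!collimator_ok())
                    continue;  /* fault raised, treatment inhibited */
            } else {
                printf("pass %d: flag wrapped to 0, check skipped\n", pass);
            }
        }
        return 0;
    }

A one-character fix (class3 = 1), and exactly the kind of thing a rollover test or a second pair of eyes catches and a lone developer doesn't.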


IIRC some of the underlying bugs came about because they used a much less dangerous device (X-ray only) as the base and added the more dangerous system on top. The reused software components had different units of measure than the new mode, a mismatch that wasn't exposed until things were in "unexpected" error states.

Then there was the usual arrogant gaslighting of anyone who reported the machine injuring and killing people. So the standard mix of "computers are infallible" and everyone in the chain covering their asses by covering their ears and eyes so that the machine could kill again and again.


I fully believe this software could've been written correctly by a single person; just one who knows the simplicity of the requirements and isn't trying to show off with complexity.

If you've written safety-critical code then you've probably worked in an industry that tries its hardest to dilute and reduce responsibility, and while imposing onerous processes may help, it doesn't fix the underlying problem.


> For example, they used multiple threads and a whole OS, when a simple loop would've been sufficient, perhaps in a misguided attempt at trying to keep the UI "responsive"; there would not be any race conditions if everything ran in the same thread.

Of course, there are also high-profile computer errors caused by UI unresponsiveness (in combination with inept programming).

Here in the UK a bunch of post office employees were convicted of fraud when they'd done nothing wrong, due to a computer system with hundreds of bugs.

"One, named the “Dalmellington Bug”, after the village in Scotland where a post office operator first fell prey to it, would see the screen freeze as the user was attempting to confirm receipt of cash. Each time the user pressed “enter” on the frozen screen, it would silently update the record. In Dalmellington, that bug created a £24,000 discrepancy, which the Post Office tried to hold the post office operator responsible for."

https://www.theguardian.com/uk-news/2024/jan/09/how-the-post...
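
The mechanism described there is a non-idempotent retry: the UI freezes, feedback is swallowed, and each keypress silently posts another record. A hypothetical C sketch of that shape and its usual fix (names and amounts invented; nothing to do with Horizon's actual code):

    #include <stdio.h>
    #include <string.h>

    struct entry { char txn_id[32]; long pence; };
    static struct entry ledger[64];
    static int n = 0;

    /* Buggy shape: every keypress appends another record. */
    static void post_naive(const char *txn_id, long pence)
    {
        snprintf(ledger[n].txn_id, sizeof ledger[n].txn_id, "%s", txn_id);
        ledger[n].pence = pence;
        n++;
    }

    /* Fixed shape: a stable transaction id makes retries harmless. */
    static void post_idempotent(const char *txn_id, long pence)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(ledger[i].txn_id, txn_id) == 0)
                return;  /* already recorded: ignore the retry */
        post_naive(txn_id, pence);
    }

    int main(void)
    {
        for (int press = 0; press < 4; press++)  /* mashing Enter on a frozen screen */
            post_naive("rcpt-0001", 100000);
        printf("naive: %d records\n", n);        /* 4 records, 3 of them phantom */

        n = 0;
        for (int press = 0; press < 4; press++)
            post_idempotent("rcpt-0001", 100000);
        printf("idempotent: %d records\n", n);   /* 1 record */
        return 0;
    }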


That looks like a different type of multithreading bug, again caused by "decoupling" the UI.


I think about software reliability a lot and try to generalize failure causes (I have about 30 years of experience designing and building systems).

While I do have a number of top bug-causing patterns, the above comment is spot-on: unnecessary complexity, while not being the direct cause of problems, almost always comes up when analyzing failures. You have a much better chance of getting your system right if you avoid complexity.

If you want practical real-life examples: many engineers when building a small embedded device will use a Raspberry Pi or something similar that runs Linux. It's easier for them: they get a full OS that they are used to, there is Python, all familiar. And yet, quite often all this is overkill for the task at hand, which is monitoring temperature, or controlling a bunch of GPIOs. You pull in a humongously complex stack of software, with thousands of quirks, bugs, and quite a bit of emergent behavior that no one understands, even though the task could have been performed by a bare-metal program running on a microcontroller, or (if we want convenience) running under Zephyr.
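
To make that concrete, here's a hypothetical bare-metal sketch of the "whole product as one loop" approach; the memory-mapped registers and their addresses are invented for illustration:

    #include <stdint.h>

    /* Hypothetical memory-mapped peripheral registers. */
    #define TEMP_CENTI_C  (*(volatile uint32_t *)0x40001000u)  /* 0.01 degC units */
    #define FAN_GPIO      (*(volatile uint32_t *)0x40002000u)
    #define WDOG_KICK     (*(volatile uint32_t *)0x40003000u)

    int main(void)
    {
        for (;;) {
            uint32_t t = TEMP_CENTI_C;  /* poll the sensor */
            FAN_GPIO = (t > 4500u);     /* fan on above 45.00 degC */
            WDOG_KICK = 1u;             /* prove the loop is still alive */
            /* No scheduler, no filesystem, no network stack: the entire
               failure surface fits on one screen. */
        }
    }

The Linux-on-a-Pi route buys convenience with millions of lines of kernel and userland underneath the same dozen lines of actual behavior.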


An even worse example would be using full-blown Windows on a small PC just to show a kiosk interface, for example to display ride information in public transit.

They're broken so often due to various issues (boot failures, the Windows desktop showing with no program running, other random Windows issues, and so on).


At a Starbucks years ago, a woman next to me requested to plug into my USB charger, which had an open port. Since any interaction of mine with a new person was doomed to awkwardness, of course I ended up with the sadly comical three-try, two-rotation final insertion. She made a crack about Schroedinger's USB port and I smiled, then she began to explain to me that it was a physics thing, until I replied that, yeah, that was my undergrad. It turned out to be hers as well. Then we found out that we had both gone into programming. She was doing programming in medicine, and, when prompted for further details, mentioned working on the software for one of those multi-beam radiation gizmos.

I mentioned Therac-25 and we were off to the races. I think we talked for about two hours on programming in high-risk situations, hers in those machines, mine in location services and routing for emergency services. That maniacal technician who irradiated that poor boy (the names escape me) was brought up. At one point we hit on the Challenger disaster. I was comforted to meet another programmer who had certain philosophies about making reliable niche software. I'm not at that level, but I think it is something to which I ought to occasionally aspire.


Because the linked page doesn't include a description of what a Therac-25 is:

> "The Therac-25 is a computer-controlled radiation therapy machine produced by Atomic Energy of Canada Limited in 1982. The Therac-25 was involved in at least six accidents between 1985 and 1987, in which some patients were given massive overdoses of radiation. Because of concurrent programming errors (also known as race conditions), it sometimes gave its patients radiation doses that were hundreds of times greater than normal, resulting in death or serious injury. These accidents highlighted the dangers of software control of safety-critical systems."

https://en.wikipedia.org/wiki/Therac-25


You're missing one critical aspect of this nightmare. After an incident:

"The AECL responded in two pages detailing the reasons why radiation overdose was impossible on the Therac-25, stating both machine failure and operator error were not possible."


RBMK Reactors Do Not Explode


There's another massive thing to highlight with this. Atomic Energy had a death and "fixed" the issues. Then they had more deaths.

It really highlights that proper and adequate testing is absolutely required for a system like this, and that if you don't have it and issues do occur, you're essentially just going to keep creating new issues while fixing the ones you created in the first place.

Therac-25 is genuinely horrifying.


Not just testing. Any modern system this safety critical should be proven with formal methods.


Every time I read the story of Therac-25 I feel incredibly frustrated AECL never faced real consequences or (criminal) liability for it.

Maybe I'm retroactively imposing modern day safety culture, but reading the timeline and history, it feels like AECL was completely negligent in waving off the issue as more and more fatalities kept piling up.

Can't believe the devices weren't pulled offline to definitively solve the issue after the first death. Instead, they basically went "can't repro, oh well".


They should have faced consequences for their response as much as for their error-prone device. Multiple patients had complained of extreme burns during their treatment, and autopsies later confirmed the cause of death to have been radiation exposure, yet AECL was still saying things like, "damage could not have been produced by any malfunction of the Therac or by any operator error."

Sure, they were laying under our radiation cannon and then died of extreme radiation exposure, but they probably got it somewhere else.


From my blog[1]:

"In 2017, Leveson revisited those lessons[2] and concluded that modern software systems still suffer from the same issues. In addition, she noted:

* Error prevention and detection must be included from the outset.

* Software designs are often unnecessarily complex.

* Software engineers and human factors engineers must communicate more.

* Blame still falls on operators rather than interface designs.

* Overconfidence in reusing software remains rampant."

[1]: https://dave.autonoma.ca/blog/2019/06/06/web-of-knowledge/

[2]: https://ieeexplore.ieee.org/document/8102762


It took several decades to finally admit "unnecessarily complex" software was a problem? No wonder.


Recently I had to get a panoramic dental x-ray and I was making small talk with the person who was running the machine.

I joked that I'm always cautious about machines like this, even knowing the dosage of radiation is low, simply because of the history of software safety controls and the story of Therac-25. She hadn't heard of it before and I gave her the gist of it, that an issue with the programming made it so it was possible to accidentally dose a patient considerably more than the intended amount (in a few different ways). It was interesting to her but I then had to pause so she could run the machine. I shut up and she did her thing.

Then, after a few minutes of scanning, she sucked her teeth a bit and apologized, saying she needed to run it once more. No worries, let's get it done! She starts it again and as I'm getting scanned she explains that "for whatever reason I was getting an error so I just had to restart it, this happens sometimes and I'm not really sure why." I give a little half-nervous chuckle and then the scan completes. Once I pull my head out of the machine, I finally get to finish my lovely Therac-25 story wherein I explain that one of the issues was... a combination of non-descriptive error codes, insufficient failsafes, and operator error resulting in patient casualties as the procedure was restarted one or more times.

We shared a little laugh and discussed other things, cost of living primarily. I'm still alive, so I'm at least 63% sure I didn't get megadosed or anything, but it's been a funny conversation to revisit now and then.


I had a similar experience recently, the dental assistant told me "we're going to do your x-rays now, but the controller isn't working right so I have to use a workaround, it'll take a little longer." I told her to stop what she was doing, there would be no x-rays for me, and explained why. I'm 99.99% certain I would have been fine, but the Therac-25 story is so horrifying, I decided to give in to my irrational fear.


I am a medical physicist. The Therac-25 disaster is retold and explained to every student of medical physics, of course. So I'm happy to inform you that it is not physically possible for the kind of error that occurred with the Therac-25 to occur with a diagnostic X-ray system.

You see, there are basically two kinds of X-ray machines used in medicine. One is just a standard cathode-ray tube (in fact, CRT televisions are basically the same design and both fall under 21 CFR 1020): a very high voltage is produced between a hot filament cathode and a tungsten anode, and electrons leave the cathode [1] and fly towards the anode where they produce X-rays. Because the energy that accelerates the electrons is entirely contained in the electric field between the cathode and the anode, the electrons cannot reach the required kinetic energy without moving towards the anode. All diagnostic X-ray imaging equipment is basically of this form, although there are a few unreliable machines that come with fancy sales pitches where the hot filament is replaced by some carbon nanotube thing. Nobody recommends these, but they aren't dangerous either.

The other kind of X-ray generator is an accelerator. Here there is also a cathode and an anode, but in between there are additional electric fields which take the form of standing radio waves. The released electrons are synchronized with the radio waves so that they gain far more kinetic energy from the oscillating fields than they do from the static field between the anode and the cathode.

In an accelerator, it is possible for the electrons to fly past the anode and hit the patient directly. This is what happened with the Therac-25. Only about one percent of the electron energy is converted into useful X-rays when fast-moving electrons strike a piece of tungsten (or any other material; tungsten is the most durable).

In some cases, you actually want the electrons to hit the patient. This is called "electron-beam therapy" and it is used to treat skin cancers and other shallow tumors because electrons do not penetrate as deeply as X-rays. In order to do this safely, the electron beam intensity is reduced dramatically, to about 1% (of course) of the "tube current" used to generate X-rays.

You may have already guessed the problem. In the Therac-25, it was possible for the electron beam to be configured at an X-ray intensity while the beam-directing magnets (we say "bending magnets") were aiming it at the patient. This causes a lethal overdose of radiation — a hundred times too much.

However, a diagnostic X-ray tube does not have any magnets directing the beam, nor does it have any standing radio waves ("RF oscillators") accelerating the electrons enough to escape the potential well created by the anode. These machines cannot produce electron beams outside the tube because the electrons are simply "falling" into the anode and there is nothing to "push" them away. It would be a little bit like dropping a rock down a well and seeing it fly back up into the sky.

1: https://en.wikipedia.org/wiki/Thermionic_emission
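
One footnote on the "about one percent" figure: at diagnostic tube voltages, the usual thick-target rule of thumb (Kramers' approximation; treat the constant as from memory) lands in the same place. With V the tube voltage in volts and Z = 74 for tungsten, at a typical 100 kV:

    \eta \approx 1.1 \times 10^{-9} \, Z V
         \approx 1.1 \times 10^{-9} \times 74 \times 10^{5}
         \approx 0.8\%

The other ~99% of the beam energy just heats the anode.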


On the other hand, dental x-rays are overused and do not improve dentists' ability to detect cavities in most patient populations.

And even though a diagnostic x-ray machine can't point an electron beam at you, it can still give you a bigger than intended dose. (Yes, I know that dental x-ray doses are typically very low to start with).


I would not consider it operator error. Hitting backspace and re-typing the mode? That should be as obvious a change for that kind of thing as shifting between 1st and 4th gear.


IMO the operator error is seeing an error code in a safety critical system and simply choosing to ignore it because it happens too often.


Noticed any new abilities lately?


I guess fast-dividing cells are a kind of ability?


I’m told it’s the equivalent of a chest x-ray.


    Question 4: How many rads did you receive in doing this project?
Best question ever.


Well There's Your Problem Podcast (with slides)

Episode 121: Therac-25 https://www.youtube.com/watch?v=7EQT1gVsE6I


I assume from the paper that this simulator does not contain the software patch that fixed it? I'd be curious to see the "patch" for this spelled out in more modern code (not sure what the original Therac was written in).


I fully recognize that there are dangers and risks to which I may be exposed by participating in Reproduction of Therac-25 accidents. The following is a description and/or examples of significant dangers and risks associated with this activity: Acute gullibility, Failure to understand April Fool's jokes, Night terrors associated with medical radiation machines.


I remember reading about the Therac-25 when the first post-mortem explanations appeared. Those poor patients, they could feel something was wrong but were told everything was fine because the operators didn't see anything out of the ordinary.


Struggling to trigger either of the malfunction cases from therac.c, any tips?


I got both malfunctions by running the machine successfully once, changing the Beam Type and then running again.


Set up everything and fire a beam.

Then within 8 seconds change the beam to the other type and fire it again.


They suggest the students should do this exercise in pairs. When doing so, the lucky partner is the one who doesn't have to sit in front of the radiation source.


Related. Others?

The Therac-25 Incident - https://news.ycombinator.com/item?id=38458448 - Nov 2023 (10 comments)

Therac-25 - https://news.ycombinator.com/item?id=37480795 - Sept 2023 (15 comments)

An Investigation of the Therac-25 Accidents (1993) [pdf] - https://news.ycombinator.com/item?id=34636130 - Feb 2023 (1 comment)

The Therac-25 Incident - https://news.ycombinator.com/item?id=26142432 - Feb 2021 (77 comments)

A Brief History Of: The Therac-25 (Short Documentary) - https://news.ycombinator.com/item?id=24388735 - Sept 2020 (1 comment)

The Worst Computer Bugs in History: Race conditions in Therac-25 - https://news.ycombinator.com/item?id=23200759 - May 2020 (1 comment)

The programmer behind the Therac-25 fiasco was never found - https://news.ycombinator.com/item?id=21679287 - Dec 2019 (121 comments)

Worst Computer Bugs in History: Therac-25 (2017) - https://news.ycombinator.com/item?id=17740292 - Aug 2018 (110 comments)

Killed by a Machine: The Therac-25 - https://news.ycombinator.com/item?id=12201147 - Aug 2016 (40 comments)

Medical Devices: The Therac-25 (1995) - https://news.ycombinator.com/item?id=9643054 - June 2015 (18 comments)

An Investigation of the Therac-25 Accidents (1993) [pdf] - https://news.ycombinator.com/item?id=8755469 - Dec 2014 (1 comment)

Therac-25 - https://news.ycombinator.com/item?id=7257005 - Feb 2014 (77 comments)

Therac-25: When software reliability really does matter - https://news.ycombinator.com/item?id=1143776 - Feb 2010 (18 comments)

An Investigation of the Therac-25 Accidents (1993) - https://news.ycombinator.com/item?id=202940 - May 2008 (2 comments)



