Therac-25 (wikipedia.org)
84 points by markmassie on Feb 18, 2014 | 77 comments



In the Computer Science degree I did, every course in the Software Engineering or Formal Methods track started with a "Why Software Engineering is important" lecture and a list of very bad software bugs; this one was always among them. The slides would then go on to explain how the professors believed their course would have prevented them.

Especially funny was the formal verification course that mentioned the Ariane 5. Apparently all the new software in the Ariane 5 was formally modelled and verified, but one part of the system was ported directly from the Ariane 4. Because the Ariane 4 missions had been successful, they did not verify that part (it's an expensive process). The bug that crashed the rocket involved converting a 64-bit value into a 16-bit integer - harmless on the Ariane 4's flight profile, but an overflow on the Ariane 5's - which led to the crash.

You can spend millions on painstaking formal verification, and still pay for the one small part that you did not verify.


An erstwhile colleague of mine worked on the component that destroyed the Ariane 5; it performed exactly to spec, detecting that the vehicle was out of control - due to the integer overflow (in another, unrelated component, NB) - and self-destructing it to prevent it crashing to earth. However, explaining this subtlety to a rabid press was a different story, and the whole thing became a bit of a PR disaster.


I thought the Ariane was destroyed by aerodynamic forces?


Self-destruct components sound interesting. Do you have any info/links on what type of technology is used in this component?


Nothing too fancy, just strategically placed explosive charges to break up the vehicle to stop it from traveling further down range and disperse the propellant to reduce the size of the explosion.


I think you misunderstood - the purpose was to self-destruct the rocket. Although I don't know the details, I'd imagine it was a sensor to detect the vehicle exceeding its flight parameters, coupled to either an explosive packet or something in the fuel system to make the rocket go bang.


Javascript and PHP, mainly.


I opened the comment thread to say exactly this. Every lecturer with any connection to software would bring this up. It felt like the bad (as in, people have died from Therac-25 malfunctions) running gag of my CS studies. Other popular bug choices were the Pentium FDIV bug from 1994 or the Mars Climate Orbiter.


I thought the Ariane 5 problem was due to old code in the guidance systems that they thought would never be called but left in because they didn't want to risk unnecessary change. A "last minute" trajectory and launch timing change due to atmospheric conditions meant that this code did get triggered - it saw the new trajectory as a problem and tried to correct it pushing the rocket out of control (or, at least, outside acceptable parameters for the new launch plan) and causing the perfectly reasonable "I'm not sure what is going on, I'd better blow myself up before I hit something important on the ground" fail-safe to fire. Or am I confusing this with another rocket control error?


It seems you're confusing it with another one. Here's an excerpt from the wiki on flight 501:

> The Ariane 5 reused the inertial reference platform from the Ariane 4, but the Ariane 5's flight path differed considerably from the previous models. Specifically, the Ariane 5's greater horizontal acceleration caused the computers in both the back-up and primary platforms to crash and emit diagnostic data misinterpreted by the autopilot as spurious position and velocity data. Pre-flight tests had never been performed on the inertial platform under simulated Ariane 5 flight conditions so the error was not discovered before launch. During the investigation, a simulated Ariane 5 flight was conducted on another inertial platform. It failed in exactly the same way as the actual flight units.

> The greater horizontal acceleration caused a data conversion from a 64-bit floating point number to a 16-bit signed integer value to overflow and cause a hardware exception. Efficiency considerations had omitted range checks for this particular variable, though conversions of other variables in the code were protected. The exception halted the reference platforms, resulting in the destruction of the flight.

The article partially disagrees with tinco, though: it looks like formal verification was only introduced after flight 501:

> The launch failure brought the high risks associated with complex computing systems to the attention of the general public, politicians, and executives, resulting in increased support for research on ensuring the reliability of safety-critical systems. The subsequent automated analysis of the Ariane code was the first example of large-scale static code analysis by abstract interpretation.


The details are a bit unclear on Wikipedia, and it's taking tens of minutes to download the original report from the European Space Agency, so below is my best understanding without re-reading the report.

My understanding is that the routine in question was used for recalibrating the inertial guidance system on the Ariane 4 in case of an extended hold-down period of up to 40 seconds after ignition. Presumably this routine integrates measured acceleration, which can be divided by the hold-down time to find the average error in inertial bias over the hold-down period. The average error in accelerometer bias (in other words, the rate at which measured ground-relative velocity deviates from true ground-relative velocity) would then be subtracted from the previous bias estimate in order to get the bias estimate to be used for the flight.

Edit: even though recalibration was never intended to be used in the Ariane 5, the integration routine was left in and continued to integrate acceleration measurements for 40 seconds after ignition.

The Ariane 5 is capable of undergoing more acceleration than the Ariane 4, so it was possible within 40 seconds of ignition for the 64-bit velocity (integral of acceleration) to overflow the 16-bit variable it was cast into at some point. With no range checks implemented for this cast, this routine caused the computer handling the inertial guidance to crash and dump. The autopilot saw the crash dump, but misinterpreted it as a position and attitude update. The autopilot then adjusted the rocket nozzles to correct for the misinterpreted attitude, causing the rocket to start flying somewhat sideways through the air at high speed, leading to breakup due to aerodynamic forces.
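
For illustration, here's a minimal C sketch of the kind of range check that was omitted (the real code was Ada on the inertial reference platform; the variable name and the sample value below are invented):

    #include <stdint.h>
    #include <stdio.h>

    /* Narrow a 64-bit horizontal-bias value into the 16-bit field the rest
     * of the system expects, refusing values that cannot be represented.
     * Flight 501's code did the narrowing conversion without such a guard,
     * and the resulting unhandled exception halted both inertial computers. */
    int convert_horizontal_bias(double bias, int16_t *out) {
        if (bias > INT16_MAX || bias < INT16_MIN)
            return -1;                 /* caller decides how to degrade safely */
        *out = (int16_t)bias;
        return 0;
    }

    int main(void) {
        int16_t v;
        /* A value this large was impossible on the Ariane 4's flatter ascent,
         * which is why the check was judged unnecessary. */
        if (convert_horizontal_bias(64000.0, &v) != 0)
            printf("bias out of range - fall back instead of crashing\n");
        return 0;
    }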

Most (all?) modern launch vehicles contain small explosive charges (or lines of detonating cord) to burst fuel tanks, break up the solid fuel grain, and perhaps break up some of the more dangerous pieces if the vehicle deviates too far from the planned flight path. Shortly after aerodynamic forces began to break up the Ariane 5, some internal automated system detected things were going very, very badly and triggered the auto-destruct sequence.


This is a tragedy that bears remembering. Many descriptions of it are cold and clinical, and mention the deaths only statistically. The Therac-25 is regularly used as a soapbox for various software engineering disciplines and machine-checked program verification techniques. But it’s worth keeping this in mind: humans suffered and died because of our mistakes. Whether due to underpowered tooling or mere human error, as a field we are collectively at fault. This is not a fucking game.


This is precisely why the current "hacker culture" scares the hell out of me. The vast majority of new programmers don't understand the basics of software design; we are entering an era where software is written by whoever will work for the lowest wage.


>The vast majority of new programmers don't understand the basics of software design; we are entering an era where software is written by whoever will work for the lowest wage.

It is really easy to pick up languages and libraries with so much information being freely shared. IMHO that ease usually comes with less regard for proper rigor, traded away for fun and for keeping people interested.

Although, I think the vast majority of new programmers aren't working on things like Therac-25 machines, but instead on Flappy Bird, JavaScript library #749237, and Bob's Website. You don't really need much software design knowledge to push products at that level.


This is why we have regulatory agencies like the FDA and development standards like ISO 13485 and IEC 62304. As someone who has worked for most of their career in medical device development, I can tell you that anyone who is a "hacker" rather than a software developer/engineer would not last a day in such a regulated workplace.


As somebody who has been in that field for a little while now myself... it seems like they care more that you document how you intend to fuck up than that, you know, you don't actually fuck up.


Agreed. I've been working in the medical device field for about ten years now. When we hire people from other areas they are dumbfounded at the amount of verification and validation required to release a piece of software.


You and your OP are so very wrong.

Throwing together a website doesn't make you a hacker. Allow me to rephrase.

Using tools somebody else built for you to throw together a website, in exactly the manner they were meant to be used, makes you a user/consumer, not a hacker.

Quite a few of the actual hackers I saw withered in university, because they had already done at home, in their high school days, what the grad students were doing.

I can tell you that a lot of the things you use in production today, even in your field, are the result of these hackers you so disparage.

Edit: rephrased the last paragraph.


I'm not sure why you are being so defensive. I never condemned (the word you're looking for, rather than "condoned") hackers. People who like to learn on-the-fly, tinker, make things, have an important place in this world, including software. I would never deny that.

Also note that the roles of "hacker" and "software developer/engineer" need not be mutually exclusive.

I was pointing out to the parent comment that someone who is not reasonably versed in proper software design, process, and documentation cannot just hack away at products such as Therac-25 anymore. At least, it would be much more difficult for them to do so and bring it to market successfully, due to all the regulatory red-tape.


Nope, not condemn; maybe reproach or disparage are better words for your attitude.

Why don't you rephrase your original text? Had you said "these new programming hipsters" I wouldn't have argued with you. The truth is that everyone nowadays can program, which is why people get so defensive about that "discipline".

But you guys completely misunderstand the meaning behind the word hacking, so I had to correct it.


> The truth is that everyone nowadays can program

yeah, no, try again

get outside your bubble


I don't think that's necessarily true: software is as reliable as it needs to be. Yup, it's annoying when your favorite blog leaks your password. But since there are no legal consequences for doing that, why should they bother trying to prevent it?


Exactly the attitude that would be great to stamp out: "It won't hurt/cost me, so why should I care?"

It's obviously on a far different scale from software causing deaths, I'm not trying to make comparisons between the two. But the motivation of the software developer needs to be a little more moral than just 'will I be in trouble with the law?'


What I'm saying is that if you're the kind of person that cares, you'll find a higher-paying job using that knowledge for something more important.

The kind of programmer that stores passwords in plaintext in a publicly-accessible database is not one that says, "I'm going to do this because it's faster." He or she doesn't even know that that could be a problem. Similarly, those hiring such a person are ignorant of what experience costs, and so they get what they paid for, even if they didn't make the conscious decision to make an insecure website.

(You don't see Google or Microsoft or Apple posting ads on freelancer sites for $5/hour jobs. They know the results they want cost more than that.)


"What I'm saying is that if you're the kind of person that cares, you'll find a higher-paying job using that knowledge for something more important."

It is not just a matter of developers' professional and personal integrity. We have to live with the consequences of bad software produced through the employment of amateurs, even if we, personally, care about doing a good job.


You say that like it is at all a recent phenomenon.

That is the way it has been since computers went commercial.


I don't disagree in the slightest that the software had a deadly bug, or that that bug shouldn't have been allowed to make it into production. But I am surprised no one in this thread has yet identified the real source of the problem. From Wikipedia:

> The accidents occurred when the high-power electron beam was activated instead of the intended low power beam, and without the beam spreader plate rotated into place. Previous models had hardware interlocks in place to prevent this, but Therac-25 had removed them, depending instead on software interlocks for safety.

I am just a self-taught programmer, and certainly not any kind of trained engineer. But when there's something your machine must not be able to do, and it's possible to design it such that it physically cannot do that thing, what possible excuse can there be for not doing so?


That's definitely poor engineering, the exact kind of thing we were told not to do when I took engineering classes.


Another terrible tragedy caused by a software error:

https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...

Over 100 hours of continuous operation, the system's internal clock drifted by about a third of a second, which was enough to introduce a significant error into the Patriot missile defense system. The software was updated the day after the failure, which had let a Scud through and cost 28 soldiers their lives.
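
Back-of-the-envelope, using the commonly cited figures from the post-incident analyses (the per-tick truncation error is a quoted value, not derived from the actual Patriot code):

    #include <stdio.h>

    int main(void) {
        /* The clock counted tenths of a second and converted to seconds by
         * multiplying with a 24-bit approximation of 1/10, which is slightly
         * too small; the tiny per-tick error accumulates with uptime. */
        double error_per_tick = 0.000000095;    /* seconds lost per 0.1 s tick (cited figure) */
        double ticks = 100.0 * 3600.0 * 10.0;   /* ticks in 100 hours of uptime */
        double drift = error_per_tick * ticks;  /* ~0.34 s */

        printf("clock drift after 100 h: ~%.2f s\n", drift);
        /* At Scud closing speeds that shifts the range gate by hundreds of
         * metres, so the radar no longer looked where the missile actually was. */
        return 0;
    }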


The error contributed, SCUD missile caused.


The Patriot missile probably would have failed to destroy the Scud even without the error.

The Patriot was designed to shoot down aircraft, but ballistic missiles fly much faster than an aircraft, and the warhead fusing system they had back then was too slow.

Most Patriots detonated behind the Scuds they were targeting.


This was one example given in a very early lecture about software testing and failures at our uni. This (and others) makes me hope that I never have that much responsibility for human lives as a developer.


Unless your code is the most trivial ever, there's still a tiny possibility that it could result in a death.

Game: User goes into epileptic seizure, hits their head and dies.

Social media: User calls for help with it - "Please send help #carcrash #bleeding" and the message is lost.

As an engineer and a responsible human being, you don't ever want this to happen. But it will. You have to minimize the chances of it happening, with the realization that it will never go to 0.0%

The world runs on software today. With all that that implies.


It is this type of failure that drives me to push for software quality and for moving away from languages like C.

As well as to argue that liability should be part of our industry, as it is in others.

Sadly the drive for profit speaks otherwise.


The issue is not the language used. You can have bugs - potentially fatal, as this case underlines - in any language.

When I studied the Therac case, there was also a case study on the "Killer Robot" - see here: http://www.onlineethics.org/cms/5122.aspx. Well worth a read to understand how software can become dangerous.

Sadly nowadays, a search for Killer Robot turns up all sorts of stories of drones being used to kill people. When I first looked up killer robot a number of years ago, the results were pretty much all about the ethics scenario and not real life people being killed. How society has moved on...


Also relevant (and on HN newest as of now) - Toyota accelerator issue: http://embeddedgurus.com/state-space/2014/02/are-we-shooting...


This one isn't merely an example of a failure of engineering, but also of organizational failure at Toyota, and organizational failure at both the NHTSA and NASA.


> You can have bugs - potentially fatal as this case underlines - in any language.

That is undoubtedly true. However, the commonest classes of bugs can be ruled out by languages with more expressive static verification tools—type systems, effect systems, and so on—leaving us with a stronger baseline from which to work.


> The issue is not because of the language used. You can have bugs - potentially fatal as this case underlines - in any language.

This is true; however, there are languages that allow for more programming errors than others.

I just mentioned C as one possible example.

The goal should be to have programming languages that reduce programming errors, not ones that make it easier to shoot yourself in the foot.

Of course this is only a small step towards better software, as whole system design also plays a big role.


It's explicitly stated that the choice of language was not considered a root cause. And that was even asm, not C.

The software was written in assembly language, which arguably requires more attention to testing and good design. However, the choice of language by itself is not listed as a primary cause in the report.


C was just an example, as it is nothing more than a portable macro assembler.


This is a common misconception, invariably produced by people who've never written a C compiler. C is not just a portable macro assembler.


> This is a common misconception, invariably produced by people who've never written a C compiler.

Wrong on that one. I do have a good background in compiler design.

> C is not just a portable macro assembler.

Given what it offers when compared with safer systems programming languages, it is one according to my own definition. :)


As opposed to being written in actual assembler?


This is the bug that occurred:

> The defect was as follows: a one-byte counter in a testing routine frequently overflowed; if an operator provided manual input to the machine at the precise moment that this counter overflowed, the interlock would fail.[3]

Can you describe how exactly a language other than C would have helped with this?
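
(For reference, a minimal C sketch of the defect described in the quote - the names are invented, and the real code was PDP-11 assembly, so this follows Leveson and Turner's account rather than the actual source:)

    #include <stdint.h>
    #include <stdbool.h>

    /* The set-up test used a one-byte flag where non-zero meant "the
     * turntable/collimator position still has to be verified", and it was
     * incremented on every pass instead of being set to a constant. Every
     * 256th pass the increment wrapped to zero, and if the operator's edits
     * arrived at exactly that moment the verification was silently skipped. */

    static uint8_t setup_check_flag = 0;

    void setup_test_pass(void) {
        setup_check_flag++;              /* wraps back to 0 after 255 */
    }

    bool position_check_required(void) {
        return setup_check_flag != 0;    /* false on the wrap: interlock skipped */
    }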


A higher-level language could have relieved the programmers of low-level details, thus giving them more cognitive space for avoiding logical bugs in other sections of their code.

In general, the more DRY the code, the more mind-space for review / testing.


Remember, this was a PDP-11. Don't know what that was? Well, C WAS its high-level language.


> A higher-level language could have relieved the programmers of low-level details, thus giving them more cognitive space for avoiding logical bugs in other sections of their code.

The same can be said of choosing a simpler design or organizing the code in a better manner, and the bug could also have been avoided through more comprehensive testing -- all of which were pointed out in the official report.


I am not sure I understand your argument. Using a high-level language doesn't mean one wouldn't follow other best practices.


Having the bug end up in production could have been avoided simply through adequate testing and verification. I fail to see how a higher-level language helps with that.

The fact that the programmers might have had less stuff to worry about and, therefore, saved their concentration for more critical parts of the code, is hardly direct evidence that the bug could not have been introduced. It can equally well be argued that a higher-level language would have attracted programmers with less experience in mission-critical code, who would have failed to grasp even the possibility of bugs more subtle than this one.


> In general, the more DRY the code, the more mind-space for review / testing.

It's something that would be considered firmware today, written in assembler.

Trying to be DRY in assembly is a really great way of introducing subtle bugs.


Safer systems programming languages cause a panic on overflow, instead of happily keeping on running.

Unless the programmer has explicitly disabled those checks, that is.

Crashing is better than corrupt data.


"Crashing is better that corrupt data"

If the crashed program is controlling a device that is directing radiation at a person's body, then crashing is likely to be at least as undesirable as data corruption.

I understand your point about programming language choice, and your comments elsewhere about the undesirability of the C language. But the root causes of the Therac-25 malfunctions weren't technical, they were organisational: specifically, poor code review and code reuse practices. Changing the programming language wouldn't fix them.


> Crashing is better than corrupt data.

In embedded control software for medical systems that directly affect the lives of the patients involved, crashing is definitely not an option, and it is not necessarily better than corrupt data.

What is an option is to very carefully model the permitted states of the machine and then to exhaustively test those states and their transitions to make sure there are no undefined states the machine can be in.

One way to do this is to have a watchdog process (or even an entirely different system) that monitors the state of the machine and initiates a safe shutdown if any departure from the defined operation is detected.
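
As a rough sketch of that idea in C (the states and the transition table are invented for illustration, not taken from any real device):

    #include <stdbool.h>

    typedef enum {
        IDLE, BEAM_CONFIGURED, PLATE_IN_POSITION, BEAM_ON, SAFE_SHUTDOWN, NUM_STATES
    } state_t;

    /* Only transitions listed here are defined; anything else is a fault. */
    static const bool allowed[NUM_STATES][NUM_STATES] = {
        [IDLE][BEAM_CONFIGURED]              = true,
        [BEAM_CONFIGURED][PLATE_IN_POSITION] = true,
        [PLATE_IN_POSITION][BEAM_ON]         = true,
        [BEAM_ON][IDLE]                      = true,
        /* every state may always drop to the safe state */
        [IDLE][SAFE_SHUTDOWN]              = true,
        [BEAM_CONFIGURED][SAFE_SHUTDOWN]   = true,
        [PLATE_IN_POSITION][SAFE_SHUTDOWN] = true,
        [BEAM_ON][SAFE_SHUTDOWN]           = true,
    };

    /* The watchdog - ideally a separate process or even a separate processor -
     * observes every reported transition and forces a safe shutdown on any
     * transition that is not in the table. */
    state_t watchdog_step(state_t from, state_t to) {
        if (!allowed[from][to]) {
            return SAFE_SHUTDOWN;   /* undefined operation detected: park the machine */
        }
        return to;
    }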


Kinda curious - which languages panic on overflow?


Ada, here is a small list of validations

http://en.wikibooks.org/wiki/Ada_Programming/Pragmas/Suppres...

Modula-2, a possible documentation link

http://www.excelsior-usa.com/doc/xds/isom203.html#15

Just to cite two examples.


So what happens when the Ada program runs into a failed compiler check? Does it:

* Raise an exception? If so, it's up to the programmer to treat it correctly. There are various ways to detect overflows outside Ada; it's still up to the programmer to treat them correctly. Runtime detection of the overflow (in the hopelessly untested Therac-25 firmware!) would have brought the matter no closer to resolution.

* Abort execution? If so, is the behaviour of the program defined if execution is aborted? Because undefined behaviour from a control system is even worse.


Normally, raise an Exception; the outermost levels of the program should probably put the machine in a "safe" state if an exception is raised during operation, e.g. turn off the radiation beam. I would have thought this is infinitely preferable to silently continuing after an error...

SPARK Ada allows the use of source-code annotations to denote information flow in a program, and can vastly reduce the chances of these sorts of problems (at the expense of more programmer time).

Standard Ada compilers can do a reasonable job of detecting some problems at compile-time, but SPARK seems to be the favoured method for safety-critical stuff.


> I would have thought this is infinitely preferable to silently continuing after an error...

It definitely is! However, it still assumes that the exception is safely handled. The flaw in Therac 25's development process was that testing did not reveal this bug. Consequently, the error recovery process could have itself been incorrect and remain uncaught. It's tempting to think that, if error recovery is similar each time, testing it once is enough, but error recovery is as susceptible to being impeded by race conditions as any other part of the code. No uncertainty is removed this way.

It's also worth noting that a value of 0 (to which the counter overflowed) was actually a legitimate value in the Therac-25 program.


Any chance of an example from a slightly more recent language?

If you are going to say we shouldn't use C because of overflow handling, at least suggest a language that actually does it. Otherwise you haven't really made your case.


Ada is a modern language; it got a new standard revision just last year.

Chances are, when you travel by train or take a plane, the control system was developed in Ada.

When human lives are at risk, C is becoming less and less an option.

And when a company still goes with C, many industries have coding standards in place, like MISRA, that make C look like Ada with C syntax.


Language choice is secondary to methodology, but Ada does a darn fine job of checking an awful lot of methodologically useful things up front.

It's less the language than the ecosystem and the infrastructure. It's eminently possible to develop 'C' that's exhaustively tested - I've done it. I am not sure that is true of C++ or Java.


Fine, I'll grant you Ada, and it does seem like a good language for these kinds of things. But you have to acknowledge it's not going to become a popular language for regular software projects.


Sure, I won't see an SV startup using it.


> and moving away from languages like C

If you look at the reports of serious failures like this one (and other famous ones), none of them are related to the use of C.


A couple of lessons:

Never underestimate the value of a good safety interlock.

This is not the correct way to check for integer overflows: if x + 1 > INT_MAX { ... }
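
A sketch of checks that do work (assuming GCC or Clang for the builtin; in C, signed overflow is undefined behaviour, so the overflowing expression itself can't be relied on):

    #include <limits.h>
    #include <stdbool.h>

    /* Rearrange the comparison so that nothing can overflow while checking. */
    bool can_add_one(int x) {
        return x < INT_MAX;
    }

    /* Or use the compiler's checked arithmetic (GCC >= 5, Clang). The builtin
     * reports whether the addition overflowed; here we return true only when
     * the stored result is valid. */
    bool checked_add(int a, int b, int *result) {
        return !__builtin_add_overflow(a, b, result);
    }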


Hardware interlocks are really important.


With the outsourcing of medical software to overseas 'software factories', it still surprises me that we don't hear more about this. It does happen, and you do read the stories if you search for them, but it's not big enough news. Does anyone know sources for this kind of news? I only know http://www.reddit.com/r/criticalsoftware/



Just FYI, comp.risks (and Peter G. Neumann's moderation) live on at http://catless.ncl.ac.uk/Risks/


The investigation makes for fascinating reading: http://courses.cs.vt.edu/cs3604/lib/Therac_25/Therac_1.html


We read this story as part of the Parallel and Distributed course for the MSc in CS at Manchester University. The whole story, with details, can be read here: http://sunnyday.mit.edu/papers/therac.pdf Very sad story, though.


Here's a podcast about the Therac-25 and software safety: http://disastercast.co.uk/episode-13-therac-25-and-software-...

I found it quite interesting.


Formal verification is all very well, but I find in practice that a lot of the worst problems are specification problems: "of course it wasn't meant to do that".


band name. Called it.


Too soon


Very sad story.

This page introduced me to Nancy Leveson's work a few years back (her paper on the Therac is linked at the bottom of the Wiki page). She's written two excellent books (and a number of papers) on engineering, safety and complex systems which are well worth reading.

http://sunnyday.mit.edu/



