I can sympathise with the Patriot Missile floating point issue. A bug like that ...

alephnil · on Feb 1, 2015

It is more subtle than that. The original code ran on the Ariane 4 rocket, and there the engineers had proved that the error could never happen within the first 80 seconds of that rocket's trajectory, which was the period during which this code ran. The management of the Ariane program decided that the unit was going to be used in Ariane 5 as well, without certifying it for the new rocket. The Ariane 5 is much more powerful and gets much further along its trajectory than Ariane 4 in that time, so an angle big enough to cause an overflow does occur. This was never discovered, because the code was never tested against the Ariane 5 trajectory. Thus it can also be considered a management failure.

The code was also for stabilizing the rocket on the launchpad, and made no sense after that, but it was not shut down until 80 seconds had passed.

Gravityloss · on Feb 2, 2015

Yes, since it was an overflow checker that caused the problem, it's a philosophical issue too.

Let's take a hypothetical: You're flying a rocket on a one-time mission. The rocket is not reusable and there are no redundant engines or any way to abort the mission in an intact way. You then detect an overflow in your control algorithm.

In practice, it almost never makes sense to do anything about these errors. If the error was spurious, the best course is to do nothing. If it was real, the mission will be lost anyway, so it doesn't make sense to spend effort on handling the error.

Your only abort criterion might be the rocket veering onto a path that will take it out of its designated safety zones.

However, if you have redundancy, then doing stuff like shutting down engines starts making sense (like on Saturn V or the Space Shuttle).

rainforest · on Feb 1, 2015

The Ariane 5 failure is a little more nuanced than that though. The conversion bug was in the Ariane 4 too (they reused the IRS), so it was almost "proven in use", but Ariane 5's trajectories put it under much higher acceleration (enough to trigger overflow).

It's an embarrassing failure for sure, but it's more nuanced than a "dumb mistake"; the management, testing methodology, and code all failed when it exploded.

Sanddancer · on Feb 1, 2015

A first-year CS student would have caught it. However, when you're dealing with the embedded world, sometimes you have to do things like convert 64-bit floats to 16-bit ints to get things to fit in the 2k or so of ROM you have to work with, or to run in a reasonable amount of time. The problem lies more in the blind reuse of code and a lack of documentation of the code's constraints. For something like that code, I'd have added a comment to the effect that it only works up to x meters per second, and/or put something in my own code so that if the value exceeds a certain amount, it returns MAXINT or a value like that. Again, yes, it's ugly, but given the environment, sometimes you have to hold your nose while writing the code.
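
A minimal sketch of what that "return MAXINT" clamp might look like in C. The name `saturate_to_i16` and the choice to map NaN to 0 are assumptions made for illustration, not anything taken from the actual flight code (which was not written in C):

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 64-bit float into the int16_t range
 * instead of converting an out-of-range value, which is undefined
 * behavior in C. */
int16_t saturate_to_i16(double v)
{
    if (v != v)                     /* NaN compares unequal to itself */
        return 0;                   /* assumption: treat NaN as 0 */
    if (v >= (double)INT16_MAX)
        return INT16_MAX;           /* clip high values to "MAXINT" */
    if (v <= (double)INT16_MIN)
        return INT16_MIN;           /* clip low values symmetrically */
    return (int16_t)v;              /* in range: truncates toward zero */
}
```

Whether a silently clipped value is any safer than a trap depends entirely on what consumes it, so the limit still has to be written down next to the code.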

Someone · on Feb 2, 2015

I think it is way more likely that they never read the documentation (or at least not all of it) than that there was no documentation.

Also, that 'clip to MAXINT' choice can be a very bad choice, too, so it would have to be documented and that documentation would have to be checked before any reuse of the code in environments with the constraint that the code cannot fail. Because of that, I cannot see how that choice would help to prevent such accidents.

ahelwer · on Feb 1, 2015

> Converting a 64-bit floating point number to a 16-bit int is something a first year computer science student would be embarrassed about.

Try enabling compiler warnings on a legacy system sometime and let me know how many of those embarrassing errors you find :)

cnvogel · on Feb 2, 2015

Probably something along the lines of...

    (...)
    writing legacysystem_firmware.hex (387123 bytes)

    Build finished successfully, with 71245 warnings.

Good luck finding the "unsigned short x = floatval;" line, which doesn't even trigger a warning in the first place ;-).
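
For what it's worth, GCC stays quiet about that line under `-Wall -Wextra`; it only speaks up with `-Wconversion` or `-Wfloat-conversion`. A rough sketch of a checked alternative follows, where `double_to_ushort` is a made-up name and the "reject and let the caller decide" policy is just one assumed option:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: convert only when the truncated value fits in
 * an unsigned short; otherwise report failure and leave *out alone. */
bool double_to_ushort(double v, unsigned short *out)
{
    /* Values in (-1, 0) truncate to 0 and are therefore accepted.
     * The comparison is false for NaN, so NaN is rejected as well. */
    if (!(v > -1.0 && v < (double)USHRT_MAX + 1.0))
        return false;
    *out = (unsigned short)v;       /* in range: truncates toward zero */
    return true;
}
```

A call site would then look like `if (!double_to_ushort(floatval, &x)) { /* handle it */ }`, which at least makes the narrowing visible to whoever greps the code later.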