The Explosion of the Ariane 5

troymc · on March 6, 2013

The Ariane 5 rocket was carrying four ESA spacecraft known as "Cluster" (because they were to work together, in a tetrahedral formation). The bug and subsequent failure give another meaning to the word "clusterf%#k".

https://en.wikipedia.org/wiki/Cluster_%28spacecraft%29

Edit: The above Wikipedia article has the Ada source code that caused the problem.

tobych · on March 6, 2013

I worked on the Cluster project as a software developer for many years, at the University of Sussex in Brighton. My first job. And it blew up. Great first job. I was watching live with hundreds of engineers and scientists at Rutherford Appleton Laboratory. There was silence. All we could hear were birds singing, coming over the satellite link. Quite an experience. Amazingly, the Cluster project got up and running with replacement hardware a few years later, and I was back on the job.

troymc · on March 6, 2013

Indeed, Cluster II was a brilliant success, with over 10 years of successful scientific operations in space. Kudos for any part you played in that.

smackfu · on March 6, 2013

Don't pretty much all launches have insurance? The rate of failure is high enough that it seems necessary unless you are a gambler.

tobinfricke · on March 6, 2013

Monetary compensation isn't everything if you've already put years of effort into a project and the failure means additional years of effort are required to prepare for a new launch. The Wikipedia article notes that replacement spacecraft were not launched until four years later.

InclinedPlane · on March 6, 2013

Commercial payloads, yes; government payloads tend to "self-insure" or not insure at all.

obviouslygreen · on March 6, 2013

+1 for excellent reverse etymology.

Gravityloss · on March 6, 2013

I've heard from people in the space sector that it was the exception, not the overflow per se, that caused the problem. Had the overflow not been trapped, the flight could have made it to orbit (if there weren't other problems). Wikipedia says it was a hardware exception, but http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html says it was a software one. The code was only needed pre-flight, so the overflow itself seems unlikely to have caused problems had it not raised that crippling exception.

These systems have become so big and expensive. This has been the case since ICBMs, and it only got worse with Apollo.

Yet they are so vulnerable, since there is no way to abort intact once you have flown something like 0.1 seconds. (At least Saturn V had some redundancy.) You do not get a second try.

Both issues create a perfect recipe for stagnation - everything has to be checked and rechecked for years before and after a software or hardware change. If someone tries something new, and there is a launch or spacecraft failure, it is a political issue and heads will roll. People's technical and political careers are destroyed.

In short, this way is not likely to reach real spacefaring.

A more organic approach with lots of smaller actors working in parallel and trying and failing a lot more - but with better processes built in to handle said failures (technical, political and cultural) could be much more conducive to real progress like increase in operational flexibility, shortened schedules, better reliability and lowered price.

Reasonable sized reusable rockets with good intact abort capability in a testing and development program could up the launch rate hugely, and all kinds of different solutions could be quickly tested. I find it likely that this will eventually happen, but it is frustrating how long it is taking.

In this "horizontal velocity overflow" case, you could do an intact abort if you had a fallback to some alternate control law or even manual control. Those are not incorporated into current expendable space launchers, but they exist in aircraft. (Saturn V and also the Lunar Module did have manual backup. You could fly the Saturn to orbit. The LM was hard to get to the right orbit where the CM was waiting...)

callahad · on March 6, 2013

The Therac-25 is also a fascinating case of software failures causing tangible loss: http://courses.cs.vt.edu/cs3604/lib/Therac_25/Therac_1

nraynaud · on March 6, 2013

You know, although I'm slightly tired of that story (I mean it stings, my family works in the field), sometimes I feel like it's a good thing. Here in France, people are afraid of risk, and the elite political clique in power even more so. Our constitution was even amended towards risk-averseness. I think blowing up the GNP of an African country had various positive side effects: 1) risk is there, whether you have correctly signed the process paperwork or not; 2) innovation feeds on blowups; 3) be humble, stop being cocky on TV before a test launch. Sending back the champagne and buffet was very painful to watch, and there is no need for that.

avar · on March 6, 2013

They should have celebrated that they learned something that day with that champagne and buffet.

XorNot · on March 6, 2013

I'd really love more details about this. What did the surrounding code look like? Why wasn't there a compiler warning being produced by this code? And so on.

There's certainly a much larger - and probably quite informative - story here.

lmm · on March 6, 2013

It was Ada. There was a trap that would have caught it, but it had been disabled (I believe for performance?). AIUI the decision to disable it had been made when it was Ariane IV code (and if an Ariane IV had ever been travelling that fast then it would have already been lost), and then never reevaluated when the code was reused for Ariane V.

wazoox · on March 6, 2013

Indeed there are code samples on the WP page: https://en.wikipedia.org/wiki/Cluster_%28spacecraft%29

ctdonath · on March 6, 2013

I've told the story to my students many times (probably mangling some minor details over the years, but getting the point across).

AFAIK (correct me if wrong)...

- The code was re-used from the Ariane 4, which was designed for launch only in the northern hemisphere. The Ariane 5 was launched from south of the equator, hence the importance of unsigned vs signed.

- A run-time warning was generated. Problem was where to put the warning, which wasn't specified, and which thru dangling pointer overwrote current position/velocity data. Watchdog code noticed the now-impossible trajectory, decided the system was very very confused, and initiated self-destruct for (relative) safety.

InclinedPlane · on March 6, 2013

Both Ariane 4 and Ariane 5 launch from the same facility (in Kourou, Guyana, almost right on the equator (5.2 deg. N latitude)). The specific problem was that the Ariane 5 had more horizontal motion than the Ariane 4. Within the code there was a conversion of a 64-bit float to a 16-bit signed integer, which overflowed on the Ariane 5 flight causing the inertial measurement unit to start spewing out diagnostic information which the flight control software merrily interpreted as data and drove the vehicle off course.
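To make the conversion described above concrete, here is a minimal Python sketch. It is not the flight code (which was Ada); the function names and sample values are hypothetical and chosen only for illustration. It shows how a floating-point value can fall outside the 16-bit signed range, and contrasts a conversion that raises an error with a saturating one. Saturation is shown only as one possible alternative, not as something the thread says was used.

```python
# Range of a 16-bit signed integer.
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> int:
    """Truncate toward zero and raise if the result does not fit in 16 bits.

    Loosely analogous to a conversion that raises an error on overflow.
    """
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Truncate toward zero and clamp the result to the 16-bit signed range."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    # Hypothetical readings: the first fits, the second does not.
    for reading in (1000.9, 40000.0):
        try:
            print(reading, "->", to_int16_checked(reading))
        except OverflowError as err:
            print("conversion failed:", err)
        print(reading, "saturates to", to_int16_saturating(reading))
```

Which behaviour is safer depends on what the downstream consumer does with the value; the point is only that the out-of-range case has to be decided on purpose.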

HeyLaughingBoy · on March 6, 2013

Nitpick.

Kourou is in French Guiana (Cayenne), not Guyana. They are different countries. The similarity in name arises because three neighboring colonies used to be called "the Guianas": French, British and Dutch. British Guiana became independent in 1966 and changed its name to Guyana.

Minor, and rather pedantic, detail, but seeing the place you were born being constantly mischaracterized is a bit irritating after a while :-)

david_p · on March 6, 2013

A friend of mine did a presentation about this. What he told me is that apparently, the developers who wrote and tested the code that overflowed were designing for a value in miles.

When stored in miles, the value would never go outside the range of a 16-bit unsigned int, but the actual value used was in kilometers, and when converted to kilometers, the value would overflow.

Tloewald · on March 6, 2013

This seems unlikely for a European/French system. Are you perhaps confusing this story with the NASA Mars Observer mission, mentioned elsewhere in the thread, which failed because of a metric vs imperial error?

The value was horizontal velocity. I imagine in metric it would be expressed in m/s while in imperial it's going to be feet per second (mph seems highly improbable).

marco-fiset · on March 6, 2013

If a simple multiplication by 1.60934 (miles to kilometers) can create an overflow, then you are not using a big enough data type. I can't believe that's the real reason.

david_p · on March 6, 2013

Apparently I mixed up the story with Mars Climate Orbiter ...

More details about the Ariane 5 failure here: http://people.cs.clemson.edu/~steve/Spiro/arianesiam.htm

ctdonath · on March 6, 2013

The Mars Climate Orbiter mixed up feet with meters. Seems it calculated how far off the surface it was during landing, and when it decided it was at 0 units above the surface the retro-rockets were shut off as intended. Except that it was, in fact, still a few miles up. Code re-use struck again.

eps · on March 6, 2013

Reminds me of the (alleged) reason why first Soviet Mars missions missed the planet - there was an erroneous period instead of a comma at some part of its nav program written in Fortran.

jacquesm · on March 6, 2013

Reminds me of the (confirmed) reason why US mission http://en.wikipedia.org/wiki/Mars_Climate_Orbiter hit the planet at the wrong angle and disintegrated in the atmosphere - there was a mix-up in imperial and SI units.

Uncompetative · on March 6, 2013

Incorrect.

According to Expert C Programming: Deep C Secrets by Peter van der Linden, page 61:

    'Table 2-2 The Truth About Two Famous Space Software Failures

    WHEN            MISSION     ERROR                        RESULT                               CAUSE

    Summer 1961     Mercury     . used instead of ,          nothing; error found before flight   Flaw in Fortran language

    July 22, 1962   Mariner 1   "R" instead of "R̅"           $12M rocket and probe destroyed      programmer followed error
                    (to Venus)  written in specification                                          in specification'

I have no idea whether the formatting of this table will be preserved; its inclusion is to save you looking for the book. You could either download the following .pdf and search for 'mariner' (it lacks page numbers)

http://www.e-reading-lib.org/bookreader.php/138815/Linden_-_...

or just use what little Google Books provides

http://books.google.co.uk/books?id=4vm2xK3yn34C&q=fortra...

As you can see, neither mission involved Mars, and the faulty . being used instead of a , was found before it caused any harm.

More information follows on the failed Ariane 5 launch - which Bertrand Meyer attributes to "insufficient specification":

http://se.inf.ethz.ch/~meyer/publications/computer/ariane.pd...

http://en.wikipedia.org/wiki/Ariane_5_Flight_501

FoeNyx · on March 6, 2013

Or was it NASA's Mariner 1? http://en.wikipedia.org/wiki/Mariner_1#Overbar_transcription...

Anyway, as always, Wikipedia has an interesting list of software bugs: http://en.wikipedia.org/wiki/List_of_software_bugs

I found the computer crash of F-22 Raptors after crossing the International Date Line interesting.

huhtenberg · on March 6, 2013

They don't have a Mercedes Smart bug where they mixed left and right causing the car to throw itself on its side when going through a turn.

(edit) ... and I'm downvoted. Lovely. Totally makes sense.

CanSpice · on March 6, 2013

You probably got downvoted because you just listed this bug without offering any type of proof that this bug actually happened.

crististm · on March 6, 2013

When did HN turn into Wikipedia?

Someone · on March 6, 2013

I think you mean a Mercedes-Benz A class, in the moose test (http://en.wikipedia.org/wiki/Moose_test)

I have never heard that was software related, but it also is a decent bet software was involved, e.g. for stiffening the suspension. What top of the line car did not have software, in 1997?

nawitus · on March 6, 2013

To have software involved with the suspension requires an active suspension, which was rare in 1997. Mercedes added an electronic stability system only after the Moose test failed, so I don't think it was a software problem in the first place (rather software fixed the problem).

huhtenberg · on March 6, 2013

That's the one, bingo. It was a new line of smaller MBs, so I misremembered it being Smart. The issue though was most certainly with the software overcompensating the roll in the wrong direction. I would've not remembered it otherwise :)

lttlrck · on March 6, 2013

The A class didn't have active suspension. I believe it was a mechanical issue, solved with a stiffer front anti-roll bar and other suspension geometry tweaks. There is a possibility the ESP was tweaked to apply the brake under such circumstances, though I think that could do more harm than good. I could be wrong, it was 1997. IIRC the non-ESP MB Sprinter also had stability issues.

huhtenberg · on March 7, 2013

@sebbi - you are shadow-banned.

rbanffy · on March 6, 2013

The A was as far from top of the line as MB goes. It was a cheap, entry-level model.

Aardwolf · on March 6, 2013

Don't they test these programs in virtual cases with simulations of realistic data, speeds, angles, altitudes, etc...?

jlgreco · on March 6, 2013

I don't know if it is just a mis-telling of the F-22 bug (I suspect it may be), but I have heard of a bug in the autopilot software of some fighter plane that would cause the plane to flip upside down when it was south of the equator on autopilot. Presumably this bug was discovered during simulations, and was never actually accidentally triggered in the wild.

InclinedPlane · on March 6, 2013

In 1962 there wasn't exactly enough computational power anywhere to do a reasonable simulation of a flight.

ctdonath · on March 6, 2013

There was a very expensive stock-trade mistake when someone issued a sell order, typing 'b' instead of 'm' for 'million'. Market dropped a significant fraction in seconds.

fr0sty · on March 6, 2013

If you are talking about the US "Flash Crash" that theory is little more than urban legend.

Here is one of the "b" instead of "m" stories published on the day of the crash:

http://www.cnbc.com/id/36998463

Here is the Wikipedia summary of the event which lists "fat finger trade" as one of the discredited theories:

http://en.wikipedia.org/wiki/2010_Flash_Crash#Early_theories

jhonkola · on March 7, 2013

Another (probable) software failure due to unexpected scenario was the Mars polar lander http://en.wikipedia.org/wiki/Mars_Polar_Lander#Loss_of_commu...

The failure review concluded that the probable cause of loss was that the landing system software apparently interpreted the deployment of the lander's legs as touchdown and shut down the descent engines. The vibrations caused by the deployment of the legs were not taken into account when designing the software.
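As a rough illustration of the failure mode described above, here is a hypothetical Python sketch of a touchdown detector that ignores short transients. It is not the lander's actual logic; the function name, the sample format and the threshold of three consecutive readings are all assumptions made for the example.

```python
def touchdown_detected(samples, required_consecutive=3):
    """Return True only if the sensor reads True for
    `required_consecutive` samples in a row.

    A single spurious True (for example, a vibration transient)
    resets nothing by itself but is not enough to trigger.
    """
    run = 0
    for reading in samples:
        run = run + 1 if reading else 0
        if run >= required_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical sensor traces.
    transient = [False, True, False, False, False]  # one-sample spike
    contact = [False, False, True, True, True]      # sustained signal
    print(touchdown_detected(transient))  # False
    print(touchdown_detected(contact))    # True
```

A naive detector that shuts the engines down on the first True reading would treat both traces the same way, which is the kind of behaviour the review describes.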

neurotech1 · on March 6, 2013

There have been several references saying that SpaceX "fly" their Falcon 9 computer systems to test for bugs like this. The idea is that, as far as the computer is concerned, it is a real flight and it should act accordingly. Most of the problems to date have been related to mechanical issues. During the first docking, there was a minor issue with the sensor "field of vision", but this was fixed.

The point is that SpaceX procedures seem to be able to prevent software bugs like the Ariane 5's from causing a catastrophic abort or failure.

axusgrad · on March 6, 2013

They have to fool the sensor inputs to look like a real flight, right? They can't necessarily think of every realistic but unexpected input. Definitely worth trying, though. I'd actually be surprised if they hadn't done that with the Ariane.

kiba · on March 6, 2013

How come the flight system at Ariane 5 wasn't tested like this?

InclinedPlane · on March 6, 2013

Resources? Afterward they simulated flights using the Ariane 5 systems and duplicated the errors.

kiba · on March 6, 2013

I don't think resources are an issue, given that they spend a lot of money on its development.

InclinedPlane · on March 6, 2013

Most big bureaucratic organizations tend to be "penny wise and pound foolish". They might spend billions on developing a new launch vehicle but balk at spending a few million on a HIL simulation of a real launch.

3327 · on March 6, 2013

Happens to the best of us... If it's any consolation, my first game app crashed after 10k points for a similar reason. Check it out, it should still be on the Android store - Alliegator

nernst · on March 6, 2013

There is a good article examining the various possible causes of the Ariane-5 disaster by Bashar Nuseibeh: "Ariane-5: Who-Dunnit?". See PDF here: http://www.inf.ed.ac.uk/teaching/courses/seoc/2007_2008/reso...

webreac · on March 6, 2013

"R1...More generally, no software function should run during flight unless it is needed."

This means that even when using the most reliable language and testing as much as possible, there is always a risk of an overlooked bug.

jaxb · on March 6, 2013

For more stories like this, get the book by David M. Harland, "Space Systems Failures: Disasters and Rescues of Satellites, Rockets and Space Probes"

youngerdryas · on March 6, 2013

From the linked James Gleick article:

"the programmers had decided that this particular velocity figure would never be large enough to cause trouble. After all, it never had been before. Unluckily, Ariane 5 was a faster rocket than Ariane 4. One extra absurdity: the calculation containing the bug, which shut down the guidance system, which confused the on-board computer, which forced the rocket off course, actually served no purpose once the rocket was in the air. Its only function was to align the system before launch. So it should have been turned off. But engineers chose long ago, in an earlier version of the Ariane, to leave this function running for the first 40 seconds of flight -- a "special feature" meant to make it easy to restart the system in the event of a brief hold in the countdown."

michielvoo · on March 6, 2013

That seems like a contradiction: it was caused by a calculation regarding velocity, but that calculation served no purpose once in the air.

Based on the quote I* have to agree with the developer then: this particular 'horizontal velocity' figure would never be large enough to cause overflow, it would always be zero since the rocket should still be on the platform.

So maybe the existence of the routine was the root cause, and not so much the potential for overflow inside the routine?

* I am not a rocket scientist

InclinedPlane · on March 6, 2013

> "this particular 'horizontal velocity' figure would never be large enough to cause overflow, it would always be zero since the rocket should still be on the platform."

Except the Earth is not stationary, nor is the surface of the Earth moving at a constant velocity.

CognitiveLens · on March 6, 2013

Regardless, the Earth didn't substantially speed up between Ariane 4 and 5, so although the horizontal velocity figure might not be zero, it would at least be approximately constant on the platform.

InclinedPlane · on March 6, 2013

The point I was making is that it is necessary to calibrate an inertial guidance system to the ground; you can't just pretend that the launchpad is stationary. More so, because a launch vehicle's guidance will become entirely internal (dependent only on the on-board systems and commands sent from mission control) well before launch, it's not so easy to have some software routines running on the rocket which don't run during an actual launch. I imagine it was far easier to simply use a timer to allow the routine in question to run up through T+40s than to attempt to programmatically trigger deactivating the routine in the event the vehicle actually left the pad. More so given that running the routine through part of an actual launch, on an Ariane 4 at least, had never been problematic before.

The problem was that nobody had formally detailed all of the assumptions and risks for every part of the code, so when the conditions changed and those assumptions became faulty nobody was the wiser because nobody was aware they were actually making such an assumption about the speed of the rocket.
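One way to picture "formally detailing the assumptions" is to record each one as data that can be re-checked when code is reused on a new vehicle. The Python sketch below is purely illustrative: the `Assumption` class, the `violated` helper, the variable name and every number in it are hypothetical and are not taken from the Ariane software or the inquiry report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    """A documented range assumption about one value in reused code."""
    name: str
    low: float
    high: float
    rationale: str

    def holds_for(self, expected_min: float, expected_max: float) -> bool:
        # The assumption holds if the whole expected envelope fits inside it.
        return self.low <= expected_min and expected_max <= self.high


def violated(assumptions, envelope):
    """Return the names of assumptions that a vehicle's envelope breaks.

    `envelope` maps a value name to its (min, max) expected range.
    """
    broken = []
    for a in assumptions:
        lo, hi = envelope[a.name]
        if not a.holds_for(lo, hi):
            broken.append(a.name)
    return broken


if __name__ == "__main__":
    # Invented numbers, arbitrary units: only the shape of the check matters.
    recorded = [
        Assumption("horizontal_bias", -32768, 32767,
                   "fits a 16-bit signed integer on the older vehicle"),
    ]
    older_vehicle = {"horizontal_bias": (-20000, 20000)}
    newer_vehicle = {"horizontal_bias": (-50000, 50000)}
    print(violated(recorded, older_vehicle))  # []
    print(violated(recorded, newer_vehicle))  # ['horizontal_bias']
```

The design choice here is simply that an assumption written down as data can fail a review automatically, whereas one that lives only in someone's head cannot.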

Martimus · on March 6, 2013

Yeah it certainly seems like that based on the way it's written. However, it does say "horizontal velocity of the rocket with respect to the platform".

Perhaps it had something to do with calibration prior to launch. The report states: "The Operand Error occurred due to an unexpected high value of an internal alignment function result called BH, Horizontal Bias, related to the horizontal velocity sensed by the platform. This value is calculated as an indicator for alignment precision over time."

flagnog · on March 7, 2013

Compare this to the failures experienced recently with SpaceX: Elon's launch, while not perfect, recovered. I think this shows the power of Elon's vision, and how he's going to change the launch market.