Are we shooting ourselves in the foot with stack overflows? (embeddedgurus.com)
267 points by nuriaion on Feb 18, 2014 | 140 comments



If I'm reading the testimony correctly, there is actually no evidence that a stack overflow caused unintended acceleration. The idea is that Toyota used 94% of the stack, and they had recursive functions. If (big if) the recursive functions used enough stack to cause an overflow, memory corruption could happen. If (big if) that memory corruption happened in exactly the right way, it could corrupt variables controlling acceleration. And then, maybe, unintended acceleration could occur.

But that's a far cry from the stack overflow actually causing any cases of unintended acceleration.


I didn't read the linked article but I did look through the slides, which were very interesting and talked about

- the abhorrent state of the engine control module code,

- the RTOS's design,

- the critical data structures right above the stack,

- that those critical data structures weren't mirrored to detect corruption as is standard and as they did for other data,

- that single bit changes in that critical data structure right above the stack can cause the death of tasks in the RTOS whose failsafe capabilities were located in those same tasks, and whose death was tested and confirmed to cause unintended acceleration consistent with accounts and descriptions of the event,

- that the failsafe monitoring CPU was not designed to detect this failure, and in fact Toyota outsourced its design and didn't even have the source code to it...


Effectively this is the total point of the transcript - that the system was fatally flawed in software design. Any number of bad things could have happened to cause catastrophic faults. Unintended acceleration was demonstrated to be a result that could happen on a fault occurrence.


> Unintended acceleration was demonstrated to be a result that could happen on a fault occurrence.

Where was that demonstrated? It was hypothesized that it may be a result, but I'm not seeing anything other than speculation.


Again, in the linked slides....

The article itself was basic and didn't really speak to the specific Toyota case, but the linked slides did... if you're going to speculate about the trial, look at Barr's slides, not that article.

http://i.imgur.com/IGWUXAS.png

http://i.imgur.com/1ZiVJRC.png


It was done in a lab by Barr & his assistants. See the court transcript.


As far as I understood the slides, the stack overflow is only one of several critical problems.

The V850 has 1...256 kB of RAM. Let's assume 256 kB of RAM. Then let's assume they have a stack of 16 kB. At 94% usage, there are only 983 bytes of free stack. A function call with 5 int parameters and 10 int variables needs around 64 bytes. => Only 16 recursive calls are needed to cause a stack overflow.
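
For a feel of the arithmetic, here's roughly what such a frame looks like in C (a sketch; the ~64-byte figure depends on the ABI and compiler, and the function body is made up):

    #include <stdint.h>

    /* Roughly the frame from the estimate above: 5 int parameters and
     * 10 int locals come to ~64 bytes per call on a 32-bit target,
     * including the return address and saved registers. With 983 bytes
     * of stack margin, ~16 nested calls are enough to overflow. */
    static int32_t step(int32_t a, int32_t b, int32_t c, int32_t d, int32_t e)
    {
        volatile int32_t v[10];  /* volatile so the locals aren't optimized away */
        v[0] = a + b; v[1] = c + d; v[2] = e;
        /* ... */
        if (a <= 0)
            return v[0];
        return step(a - 1, v[1], v[2], d, e);  /* each level costs a full frame */
    }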

Further, because the critical data structures in the OS are not protected, the OS will have no idea that something is wrong.

(In several projects we had to keep our critical data twice in RAM, once normal and once inverted. In one critical project we even had to do all computations twice, once with the normal data and once with the inverted data. This way you will also catch some bit flips in the ALU, RAM, etc.)
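
To make that concrete, here is a minimal sketch of the normal-plus-inverted scheme in C (the names are mine, not from any real codebase): a single bit flip in either copy breaks the complement invariant and is caught on the next read.

    #include <stdint.h>

    /* Keep each critical value twice, once normal and once
     * bitwise-inverted, and verify the invariant on every read. */
    typedef struct {
        uint32_t value;      /* normal copy */
        uint32_t value_inv;  /* bitwise complement of the normal copy */
    } mirrored_u32;

    static void mirrored_write(mirrored_u32 *m, uint32_t v)
    {
        m->value = v;
        m->value_inv = ~v;
    }

    /* Returns 0 on success; nonzero means corruption was detected and
     * the caller must take its failsafe path. */
    static int mirrored_read(const mirrored_u32 *m, uint32_t *out)
    {
        if (m->value != ~m->value_inv)
            return -1;  /* a bit flip in either copy breaks the invariant */
        *out = m->value;
        return 0;
    }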

edit: In the slides I found that the stack size is 4 kB => only 245 bytes are free! So you can't make many recursive calls!


I know you don't have any reason to trust me, but I own a Toyota minivan (in Thailand) and I had this unintended acceleration happen to me. I was in stop-and-go traffic, so I only ended up rear-ending a taxi at <5 mph, but it's incredibly scary to be behind the wheel of a car you have no control over. The car's engine was revving and downshifting to overcome my foot slammed on the brake.

Taking the car to the Toyota repair place, they said there was nothing wrong with the car, and nothing strange in the computer's logs.


The problem with these stories is that you omit important details - i.e., what was the year, make, and model of the car? Did you have carpets or floor mats in it? Etc.

And they always, always focus on the car manufacturer of the moment. It used to be Audis, and it's confirmed that a whole bunch of Jeeps actually will do this at car washes due to static discharge (the problem's never really been fixed).


I think the most important detail is that he slammed on the brakes and the car continued to accelerate. Cars with a computer controlled accelerator should be able to stop accelerating if the brake pedal is pushed.

I would also expect that he would recognize it if the accelerator pedal had been stuck on a floor mat.

I think it's interesting that even in a small forum such as HN, we still have an anecdote. It's also interesting that car owners still claim to experience the acceleration problem after the two (mechanical) recalls.

It reminds me of the Therac 25 incident, where there were only "anecdotes" to be found for _years_, and the company concluded that a mechanical switch had to be the fault. It's very interesting that we now have evidence that the software _can_ be at fault.


> I think the most important detail is that he slammed on the brakes and the car continued to accelerate. Cars with a computer controlled accelerator should be able to stop accelerating if the brake pedal is pushed.

Actually this points to his story being incorrect. If he slammed on the brakes and the car continued to accelerate, then he probably slammed on the gas by accident instead. Testing shows that if you slam on the brakes while the engine is pegged at full throttle, you'll still come to a screeching halt. The brake system vastly overpowers the engine even in high-powered cars. A Camry V6 actually takes just 16 feet further to stop from 70 mph with the throttle wide open: 190 ft vs. 174 ft.

And the brake system is not drive-by-wire like the throttle is, so it's not susceptible to faulty ECU problems.


You don't have to talk about me in the third person, man. I'm right here. You can ask.

If I had pushed the accelerator instead of the brake, you can sure bet I would have done more than just "bump" the taxi in front of me.

The brake was fully pressed to the floor. The engine downshifted and revved to try to move forward. Dunno what else to tell you. Sure seemed like the computer going crazy to me when it happened.


That's possible, but I don't think it's likely. The feel of an accelerator is very different from a brake pedal which normally will not go all the way down. Of course, more details about the story would be nice.

Btw, there have also been Prius recalls due to faulty ABS software, so while you'll probably always have some control, braking action can certainly be affected by software issues.


ABS is an independent pump. It can fail and you'll still be able to slam on the brakes.

Also hitting the wrong pedal is not exactly uncommon. People don't like to admit they just made a mistake, hence claims of "unintended acceleration". But just like the claims against Audi back in the day, the modern claims against Toyota appear to be unsubstantiated.


I don't think they appear to be, after numerous recalls affecting millions of Toyota cars. That there have been real cases of unintended acceleration related to Toyota is not really in dispute, unless you think Toyota did the recall mainly as a PR effort.

Some people might be unwilling to admit that they stepped on the wrong pedal, but what's the likelihood that some random HNer would do that and then make a post about it?

ABS brakes work by relieving _or_ reinforcing brake pressure, so depending on the system you could end up with uneven, little or no braking action after a failure. Or you could end up with locked wheels.


You might want to actually take a look at some of those recalls, they're pretty bogus (like the whole "bring it in so we can adjust the floor mats" thing). And yes, I do believe Toyota did the recall as a PR effort. Take a look at what Audi did instead. They, correctly, stated there was nothing wrong with the cars and they took an insane PR hit and sales plummeted. There continues to be no actual evidence of unintended acceleration in Toyota vehicles.

And I don't think HNers are somehow less likely to hit the wrong pedal or refuse to admit it. There's always that possibility that they don't even know they hit the wrong pedal.


So, there is NO actual evidence? As in, this is just anecdotal, as in the driver, dealer and technician were all wrong in this instance: http://www.leftlanenews.com/toyota-avalon-displays-unintende...

..and the 400% increase in unintended acceleration events (from the testimony) starting in 2004 is also a complete accident, or brought on by some witch hunt?

Aren't you stretching the limits of what is a reasonable assumption in order to maintain that opinion?


The problem is that cars have weird incentive schemes around fault: because fault determines insurance liability and cars are expensive, if you have an accident, the absolute last thing you ever do is admit fault.

There's also no specific reason to think the driver would identify various mundane fault conditions - i.e. if your car lurches forward and you hit a taxi, then such an impact would be enough to unwedge a stuck carpet.

There's also lots of unknown detail - i.e., if the car was accelerating out of control while you rode the brake, and you hit something at 5 mph, did the car stop accelerating at that point? Was the engine damaged so that it stopped then? Why not shift into neutral if you're cognizant of the fault condition? Did the car not shift, or what happened, etc.? In these cases the make and model year - in Thailand - is pretty important too, since it's a market with a lot of old used cars. A 1980s minivan is going to have rather different throttle control.

It has all the hallmarks of urban legend, replete with the buy-in line of it being foreign cars - which is how the story always goes and gets played in the media. Sure, I'm willing to believe there are real control system issues, but it seems odd to me that no recall notices go out for ECU updates.


Well, if you read the testimony, it seems clear that Toyota is overconfident in their software. They have updated the ECU software, but still as of 2010 they managed to implement the brake failsafe in the same software task that senses the accelerator, so if that task dies the brake failsafe won't work either.

That there have been tens of deaths and millions of car recalls because of unintended acceleration in Toyota cars is at least not an urban legend. Of course it will be very difficult to locate the exact problem, but this testimony is interesting in that it shows how that can be the case (contrary to what Toyota has claimed) [Edit: And they have actually been able to reproduce unintended acceleration via memory corruption].


> I think the most important detail is that he slammed on the brakes and the car continued to accelerate. Cars with a computer controlled accelerator should be able to stop accelerating if the brake pedal is pushed.

The transcript from the Toyota case reveals that a crash of the control thread (managing acceleration, say) would also take down the brake control system until the car was restarted.


The car at issue in that lawsuit didn't have a brake control system in the ECU at all. But yes, in later (2010) models that was the case.


In Toyotas, the accelerator and the brakes are (were, before this) separate systems.

The downside is that hitting the brake would not cut the gas.

The upside is that you would need two separate, isolated systems to fault in order to have out-of-control acceleration. Even if the accelerator system is broken, slamming on the brakes will generally always win at the hardware level.


> The upside is that you would need two separate, isolated systems to fault in order to have out-of-control acceleration. Even if the accelerator system is broken, slamming on the brakes will generally always win at the hardware level.

IIRC, in the court case the brakes on the particular model of car physically could not stop the car if the engine was at 100% throttle at highway speed.


Indeed. I tried to recreate the scenario afterwards by pushing both the brake and the accelerator at the same time (on an empty road, of course). The car didn't try downshifting to compensate for my pushing the brake, and it stopped the car.

My minivan is a Toyota Innova, dunno the year, a 2011 or 2012. And this happened BEFORE I heard of any software glitches. And right when it happened my first thought was "MY GOD, THE COMPUTER'S GONE CRAZY".

edit: oh and one more thing: I'm a one-foot driver, and I could tell if my foot wasn't on the brake! ;)


My car is a Toyota Innova minivan. The year is either 2011 or 2012, dunno which (no, I didn't buy it for my pleasure, thankyouverymuch).

I mention in another reply: I thought it was the computer's fault, and only months later did I see articles about other Toyotas having the same problem due to a software bug.


This is safety-critical software, and the burden of proof rests on the manufacturer. One of the things that distinguishes an engineering culture from a tinkering one is that in the former, it is not acceptable to assume everything is OK just because we haven't noticed anything going wrong yet.


I'm very nervous because in the Toyota case I have previously encountered people posting "proof" that things happened that seemed more to be marketing for their own company.


Isn't 96% stack usage a bit high when you are dealing with recursion?

You pay a lot for those cars; can't they at least put in better electronic hardware? They probably have less computing power than my phone from 5 years ago.


> You pay a lot for those cars; can't they at least put in better electronic hardware? They probably have less computing power than my phone from 5 years ago.

Car electronics have a very long development process. When the cars in question (models from ~5 years ago) were designed (~10 years ago), the hardware they chose was probably quite decent for that era.

When the next model of the car is designed, they will most likely end up using the same model of computer (or a successor with conservative upgrades) to avoid having to redesign the hardware and software that much.

The cost of the actual hardware is negligible compared to the cost of the redesign.


What cost? Lives? Recalls? This probably is the motivation, but it is a false, myopic accounting that makes this calculation. I have personally witnessed way too many embedded engineers cost-reduce themselves directly into this situation, and do it as a matter of pride. They will use a $2.75 part instead of a $3.80 part, forcing design trade-offs that cause errors like this. The WRT54G was released in 2002; it had 16 MB of RAM and 4 MB of flash, running at 125 MHz. There is absolutely no reason that the ECU couldn't have been running something similar.

I would argue that Toyota squandered billions with this failure. The ECU should sit behind a well-defined protocol so that it can be swapped out at any time, decoupling its evolution from the rest of the machine.

It is a shame, morally and fiscally, that embedded development isn't using safe, provable, and verifiable languages.


> The WRT54G was released in 2002; it had 16 MB of RAM and 4 MB of flash, running at 125 MHz. There is absolutely no reason that the ECU couldn't have been running something similar.

The WRT54G also runs in a comfy corner of your living room, fails every few years, and crashes.


> When you are dealing with recursions?

They shouldn't be dealing with recursion at all. If stack corruption is what caused their failures, inadequate testing played an important role, IMHO.

> You pay a lot for those cars, can't they at least put better electronic hardware.

This isn't how it works for cost-sensitive designs. You don't hear people boasting about how they have a quad-core car computer and how the touchscreens from their motor control are perfect for Facebook interactions.

The way people think about this is: if half your RAM never gets used, then you bought twice as much as you need and your module is more expensive than it should be. CPU use never goes past 20%? Then 80% of its capacity is wasted. And so on.

"Better electronic hardware" (in the sense of "more powerful" or "faster") also introduces additional complexity. This means more difficult constraints in testing, longer and more expensive verification processes, additional non-deterministic behaviour and so on.

Not that their system wasn't at fault. It was, but throwing more hardware at it wouldn't have made it better.


> This isn't how it works for cost-sensitive designs. You don't hear people boasting about how they have a quad-core car computer and how the touchscreens from their motor control are perfect for Facebook interactions.

I work for a company that makes quad-core computers for automotive use, and they do end up being used for Facebook interactions, among other things like the dashboard. The engine management computers will be a separate entity, though. If you look at the big auto shows from the past few years, the car manufacturers clearly do think that this is going to be a major differentiator in the next years, and it's going to be in average consumer models too, not just premium sports cars like today.

But the quad core chips we sell today will be on the road in five or more years. By the time they roll out of the assembly line, the computers will not be spectacular by the standards of that day. A smartphone is 6-18 months from design to production, a car is several times that.

It's not like the car manufacturers were cheap on the hardware.


In a sense they are rather cheap on the hardware.

Instead of using stuff like netbook chipsets, they tend to gravitate towards mobile chipsets. The difference between an i.MX6 and an AMD Jaguar (just examples; you can also look at Intel's chipsets and other boards like Tegra, etc.) is like night and day. Why isn't the Jaguar used in mobile phones? Because it can use tens of watts, compared to just a few watts for the i.MX6.

So at least to me it seems the companies wish to save a few watts of power usage and a few tens of dollars per car.


My car was parked outside in sunny upstate NY a few weeks ago. I started it up, warmed up the engine for 5 minutes, and drove 70 mph to work in the cold. The temperature? -5 F. In a few months, it will be 110+ F in the sun and humidity when the car is parked outside.

Things like engine control modules and even entertainment systems in cars operate outside of the environmental ranges that a netbook is designed to survive in. I don't want my car sidelined because of some unreliable computer.


The safety regulations, standards compliance requirements, and mechanical, thermal, and electrical parameters are significantly more stringent for automotive ICs than for consumer devices. Does AMD manufacture Jaguar chips that can be used in automotive?

There are also issues of logistics at stake, such as maintenance. Unless AMD is willing to manufacture a certain Jaguar chip for the 5-8 years that automotive manufacturers typically require it, the Jaguars wouldn't even be considered for many systems.


But having some additional capacity available gives them the ability to do field upgrades. The firm I worked for a while ago had to undertake a very expensive hardware refresh because there just wasn't any way to get any additional bug fixes into the field -- they were down to 20 bytes free. In something like a car, that you know people are going to drive for 10+ years, you need that extra space not only to make bug fixes, but also to comply with new legislation (such as brake override), and also to offer a few new features to your customers.


Another valuable engineering lesson: Always Have Headroom. If one designs or operates to the limit, there is no margin for error. Resilient systems can get pushed beyond their acceptable limits and recover.


+1.

IMO, you need several sets of limits: standard limits posted to the consumer, engineering limits posted to the techie/maintenance guy/developer/etc., and actual limits... each comfortably beyond the last. Know the actual limit, but design well under it if at all possible, because the system will be misused.


It would be stupid of me to disagree. One should always leave headroom for bugfixes, future expansion and so on; it's only mindlessly throwing hardware at problems that I disagree with, not futureproof engineering.


Actually I slightly disagree with your "You don't hear people boasting about how they have a quad-core car computer and how the touchscreens from their motor control are perfect for Facebook interactions." statement.

Of course you're right if we're talking engine management/internal stuff, but complaining about the laggy/slow/annoying performance in all things entertainment is quite a well-known first world problem in my circles.

In fact, I've been annoyed by every car I've driven over the last 10 years due to their inability to provide sensible hardware (and they charge a huge markup for all these 'official' components on top; think navigation: you can get a 3rd-party system for a fraction of the cost of the supplier-provided one, often with better features, decent updates, and extensibility, while you're stuck with whatever your manufacturer grabbed for pennies).

So in context ("Increase RAM so that the stack doesn't grow into the area where my acceleration value is stored") you're right, this isn't an issue. In general though I haven't seen a car manufacturer that gets consumer electronics/entertainment etc. right.


Mission critical applications tend to use working, proven, components. Why buy more complexity?

What I find confusing about the article is that it describes how to avoid problems on a completely different architecture. ARM (Von Neumann) vs v850 (Harvard), hard stack exception vs none, etc.

These differences and resulting inapplicable recommendations confound what could be an interesting article.


The article is indeed a mish-mash. Not everything that uses the ARM instruction set and architecture is Von Neumann, though. The Cortex-M3 and M4 MCU series are Harvard.


Good point, although it was a bit cheeky of me to generalize anything these days as Von Neumann vs Harvard. Thanks for keeping me honestish. :)


96% may look high, but it's still <= 100%, meaning that if their analysis was correct, there's no way it could overflow.


Similar situation to why you still get spacecraft using chips from the 90s. They need chips which will remain reliable for a significant lifespan, in hostile environments. Way beyond what you're going to expect for a phone or PC.

Quite a challenge when the consequences for failure could be extreme.


The failures were extreme, both in terms of lives and brand.


Their software had bugs that could cause a stack overflow and critical data wasn't protected. Sure, it might not be the real cause, but it's a pretty likely candidate. It's not like these errors are unlikely in general. Apart from the occurrence of the stack overflow itself (which is hard to directly observe), the rest of the problem was demonstrated in testing. Memory corruption in the wrong place kills a certain task and that causes uncontrolled acceleration.


Not only that, but the software complexity and amount of globals was so high that thorough analyses or exhaustive testing of the key logic was impossible.


Even if you work in a GC'd language with a VM and all memory errors are checked, here is the major, MAJOR wisdom you should take with you:

> The crucial aspect in the failure scenario described by Michael is that the stack overflow did not cause an immediate system failure. In fact, an immediate system failure followed by a reset would have saved lives, because Michael explains that even at 60 mph, a complete CPU reset would have occurred within just 11 feet of the vehicle's travel.

We have seen this scenario played out a million times. Some system designers believe it is acceptable to keep the system running after (unexpected) errors occur. "Brush it under the rug, keep going and hope for the best." Never ever do that. Fail fast, fail early. If something unexpected happens, the system must immediately stop.
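
For what it's worth, here is a minimal sketch of what "fail fast" looks like in embedded C. The watchdog register address is made up and the interrupt-masking instruction assumes ARM Cortex-M; every MCU has its own equivalents.

    #include <stdint.h>

    /* Hypothetical watchdog feed register -- the address is illustrative. */
    #define WDT_FEED_REG (*(volatile uint32_t *)0x40001000u)
    #define WDT_FEED_KEY 0xA5A5A5A5u

    /* The main loop feeds the watchdog only while all checks pass. */
    static void wdt_feed(void) { WDT_FEED_REG = WDT_FEED_KEY; }

    /* On any unexpected state: mask interrupts and stop feeding the
     * watchdog, so the hardware forces a clean CPU reset within its
     * timeout instead of letting the system limp on corrupted. */
    static void fatal_error(void)
    {
        __asm volatile ("cpsid i");  /* ARM Cortex-M syntax assumed */
        for (;;) { /* spin until the watchdog reset fires */ }
    }

    #define REQUIRE(cond) do { if (!(cond)) fatal_error(); } while (0)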


When you're talking about a control system "Fail fast, fail early" isn't necessarily an option. In this particular example, while it's arguable that temporary loss of control might be better than the actual outcome, it's still a pretty unacceptable thing to happen.

At the point where you are trading off one failure state for another, you need to think very carefully about which one is truly worse, and sometimes "Fail fast, fail early" is a much much worse choice than the alternative.


When you're talking about a control system, you need to be able to safely recover from a hard crash anyway, because one could happen at any time for purely hardware reasons. Thus, forcing a hard reset in case of inconsistency should be safe.


Experience (which should always trump theories about how things should work) proves that you are wrong. What would have happened in this case if fail-fast was used and the hardware had immediately reset upon the first byte being written outside the stack area? Likely the bug would have been caught and fixed during testing, because debugging a randomly resetting CPU is comparatively easy. It's a hundred times better than people being killed because small unexpected errors are allowed to accumulate.


> Likely the bug would have been caught and fixed during testing

Sure, for testing, you should absolutely fail early and loudly, because there are no consequences for doing so. But in the real world, and especially in control systems, "failing" can cause more damage than persisting in a corrupted state.

> small unexpected errors are allowed to accumulate.

I'm not arguing that you should just ignore errors completely, just that the correct response to a particular error must be very carefully considered, and your blanket suggestion to reset immediately is often a terrible idea.

I think a far more appropriate response to a single byte being written outside of the expected area, is to shut down all unnecessary processes, and tell the driver to pull over. This would be far safer than just randomly disabling the controls, however temporarily.


Sending uncontrollable accelerations for hundreds or thousands of feet is safer than a maximum of 11 ft of coasting?

Yes, there are always trade-offs to be made. But a corrupted system is utterly unpredictable - if a control port can do something, it eventually will, at the speed of electronics. On the other hand, a shutdown is entirely predictable. We design for it, because every system does eventually fail and shut down. It's a 'normal' mode of operation in this sort of system design. That tends to weight the design heavily towards very fast resets and redundancy in mission-critical (people die if you mess up) systems. I can design a system to handle a shut-down control board (limit the robot arm's travel, etc.). I often can't do anything if the arms are waving about randomly, under power.

I'm not opining randomly; I've worked in flight control, robotics, UAVs, and factory machinery. Until my current job, I've always had to worry about killing somebody. You really can't let software that controls dangerous equipment continue to run in a damaged state unless the rest of the system has a supervisor mode that can override and limit the system's behavior. Even then, I'm having trouble thinking of a scenario where I would prefer to leave that process running vs shutting it down.


What does it even mean to shut down the control systems on a moving car? All stop? Coast? What's the physical behaviour that isn't likely to kill someone?


At most, it means you're coasting with no power assist for steering or brakes -- you'll need some extra braking distance, but that's what the shoulder is for, and it'll take a bit more than the usual effort to turn the wheel, which means that getting onto the shoulder, especially if you're in a middle lane when all this goes wrong, is going to need to be smartly done.

Taken all together, the impression I get is of a problem in driving skill that's roughly as difficult as a blown tire at highway speeds -- possibly a bit easier, actually, considering the deleterious effect a blown tire has on steering control. An alert and competent driver should be able to handle either situation without posing a deadly danger to herself or anyone else.

The same, I think, cannot reasonably be said of uncommanded acceleration. With a blown tire or a failed ECU, all you have to do is use your ordinary driving controls to get off the freeway and bring the vehicle to a safe stop. With a throttle stuck wide open, you are suddenly a race car driver, only you're neither in a race car, nor in a race. You can't bring the vehicle to a stop with the ordinary driving controls, at least two of which -- the accelerator and the brake -- are no longer responding properly or at all, which is itself frightening and disorienting to the driver. In order to bring the vehicle to a stop, the driver must shut off the engine with the key, at which point the problem reduces to our failed-ECU worst case above, just with some more speed to burn off.

But there is no circumstance, in either the normal driving regime or even any other abnormal one, where turning off a moving car is a proper or safe response to any situation, which is why almost no driver has ever given the slightest thought to doing so -- and when your car's speeding up past ninety all of a sudden, and you're not telling it to, is maybe not the best time to be thinking up new ways to interact with your car that you've never thought of before. At a guess, I'd say some people whose Toyotas ran away on them were able to come up with the idea in time, and they mostly survived. Others ran out of time before they thought of it, and those unlucky souls mostly died.


I agree with most of what you've said, but I'd point out that you didn't go with the worst case: loss of steering/brakes on ice.

Unwanted acceleration on ice would be worse though. I can't imagine, in rush hour conditions, being unable to control my acceleration. At best, you'd rear-end someone, and their car would stop you.


Yes, it is an option. It should fail as fast as possible, that's why you have contingencies in place. For instance, the Space Shuttle's computers checked their results against one another. If a discrepancy was detected, the computer would be immediately voted out. In case all computers were compromised, there was yet another computer with different hardware and software.

The point is that any erroneous result would cause the offending computer to be voted out immediately. And that's what you want, otherwise the system will appear to work fine, while being in an undefined state.
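
As a shape-of-the-idea sketch (nothing like the Shuttle's actual voting logic), majority voting over redundant channels looks roughly like this in C:

    #include <stdint.h>

    typedef int32_t (*channel_fn)(int32_t input);

    /* Run the same computation on three independent channels and accept
     * the majority. Returns 0 and the voted result, or -1 if no two
     * channels agree -- in which case the system must fail over. */
    static int vote3(channel_fn a, channel_fn b, channel_fn c,
                     int32_t input, int32_t *out)
    {
        int32_t ra = a(input), rb = b(input), rc = c(input);

        if (ra == rb || ra == rc) { *out = ra; return 0; }
        if (rb == rc)             { *out = rb; return 0; }
        return -1;  /* no quorum: vote the offender out / fail fast */
    }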


It is an option, but not always the best option.

I think it was one of the Mars landers where they disabled error detection for landing. If the system reset near the ground then the lander would crash before the controller had rebooted, whereas a small error might not have disrupted the landing at all.

Fail fast is good advice, but should be weighed against other strategies to choose the most appropriate one for the circumstance.


Control system doesn't mean steering. The _unsafe_ choice is to continue. You are speaking in generalities when Toyota software was not constructed to carry on in the face of failure, it just pretended it didn't happen. Huge difference.


> a complete CPU reset would have occurred within just 11 feet of the vehicle's travel.

Then it fails again. Resets again. Fails again...

It may be preferable to restart, but in many failure scenarios, if the source of the failure is unknown, it is not a given that you'll be able to restart cleanly again.

I agree that failing fast to an extent is preferable, but that too requires careful testing and consideration.


Except in a modern car the effect would be that the vehicle stops accelerating and starts coasting - brakes and steering are mechanically connected, so in the event of no control system the default situation is safe.

The missing bit is generally that if it fails, it needs to set a note that it should not resume normal operations.


If it keeps failing, then the car does not function and it gets towed to a repair shop. It could be failing for completely valid reasons, not just a programming error. Failure is expected; we need to account for it.


What makes you think it would fail again right after the reset? It hasn't necessarily been put through the same sequence of state changes.


And the second wisdom to take from this: make your control system boot fast, in just a few hundred ms. Maybe you have time to load a few necessary parameters from flash memory, but after that everything should be good to go. You don't want to wait a minute for Windows to boot.


First I was annoyed at yet another upvote-fishing blog post on Stack Overflow. Then I read it, while annoyed at getting caught by the catchy headline that I consciously despised. Then I saw that it was not at all about some forum on the web, and now I cannot stop smiling.


Isn't recursion—even if it's indirect—disallowed completely when doing embedded C programming for safety-critical devices?

UPDATE: Yup, #70 on the MISRA C rules: http://home.sogang.ac.kr/sites/gsinfotech/study/study021/Lis...


So what's the catch? We have been developing memory architectures, embedded systems, and OSes for decades now. If the solution is as simple as this post says, why hasn't it been implemented before?

I am hoping there are experts here that can shed some light on this


This has been, and typically is, implemented. There are various ways to ensure your stack doesn't thrash your data, and this is one of them. Why Toyota hasn't is not something I can answer, but the principle that the article describes is well-known.

As for the last part of the article, there are ways to get around this issue when using operating systems, too. Some of them depend on certain hardware features being present, of course, and what should be inferred from this is that, where such protection features are critical, hardware should be picked accordingly.

Edit: I've seen a lot of posts here blaming C and the unmanaged runtimes. While a managed runtime can provide some amount of protection, it's worth noting that:

* It requires an MMU. These aren't smartphones, folks. An MCU with an MMU isn't cheap.

* In the absence of correct memory management from the kernel (and that also requires an MMU!), it's perfectly possible to smash data regions with overgrown stacks, even on managed runtimes.

* You can achieve a good amount of protection using special compiler features, which require no special hardware support (see the sketch below).
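
For example (assuming GCC; other toolchains have equivalents), the compiler-feature route looks like this: a canary catches the smash at runtime, and the stack-usage flags feed a worst-case depth analysis at build time.

    /* Build with, e.g.:
     *   gcc -fstack-protector-strong -Wstack-usage=512 -fstack-usage demo.c
     *
     * -fstack-protector-strong places a canary between locals and the
     *  return address; smashing it aborts via __stack_chk_fail() instead
     *  of silently corrupting memory.
     * -Wstack-usage=N warns at compile time about frames larger than N bytes.
     * -fstack-usage emits per-function stack sizes (demo.su), input for a
     *  worst-case stack-depth analysis.
     */
    #include <string.h>

    void smash(const char *src)
    {
        char buf[16];
        strcpy(buf, src);  /* an oversized src trips the canary on return */
    }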


A mix of technological and cultural path dependence, cowboy attitude to performance/optimization, and inability to quantify risks.

(Somewhat interrelated, eg cowboy attitude -> C, path dependency in C usage -> hard to reason about programs -> hard to quantify risks)


This. I've done consulting in embedded for years and years. The discipline of modern desktop/web development has not reached the embedded world, except in a few high-margin areas (such as military and transportation). There are 1,001 reasons for this, ranging from the fact that much of the coding is done by people who never learned source control, open source, or unit testing (iow, people from disciplines besides software engineering, and people who don't keep their skillset current), to the fact that cross-compiling is much harder/slower to debug than workstation debugging, to the fact that resources are more limited in embedded, so code is more terse, with fewer convenient debugging functions.

Other problems are with management- they see embedded code as a "write it and forget about it" schedule item, not a continuous improvement one. Much of this is due to the fact that embedded code is not as portable as workstation code.

All of these shortcomings sufficed when the number and time pressure of embedded projects were much smaller, 5-10 years ago. Now that embedded has exploded in ubiquity, its requirements are increasing and it's getting less time to be perfected in most markets.

So yeah, I guess it's safe to say the embedded world is in a bit of crisis at the moment.


Having worked in a similar environment, I know at least one company where most managers have a hardware/electronics background, because their product was mostly HW when they started work 20 years ago. But now the split is more like 80% SW / 20% HW, and they are managing SW developers without having much idea of what SW development consists of. I know of at least one example of a top engineering manager not understanding what version control was: "Please merge all the features but not the bugs from branch X." (Which is a noble goal in itself, but might be hard to attain!)


> The discipline of modern desktop/web development has not reached the embedded world, except in a few high-margin areas (such as military and transportation).

My experience is that it hasn't penetrated very far there, either. Sure, we had version control and code reviews. The process was so warped and bastardized from industry standards, though, that it became more of a CYA/box-checking burden to keep the SQAs (who really didn't understand code) happy than a tool to improve development. Plus, most of the managing and systems engineers were hardware types[1] who didn't really understand what the software groups were doing or how they operated.

[1] Important note: I'm a hardware type, too. I'm just a hardware type that actually got and wrote software. Made the situation doubly frustrating.


I liked your "cowboy attitude -> C".

Everyone is a C expert, except when they happen to be a tiny cog in a big coding machine.

"I know where all my pointers are memory allocations are!". Sure, but does everyone on your 20+ people team from the intern up to the senior guys know?

Then things like this are bound to happen.


C is used in such applications precisely because it is one of the few languages that allow you to reason about things like stack usage.

Try that in Python, Ruby (or ML for that matter).


Talking about how to catch stack overflows and protect your data against them isn't useless, but it misses the point. There are rules/guidelines, like MISRA[0] (which the testimony mentions 54 times!) for the automotive industry that prohibit recursion, and tools that will check for conformance.

Toyota should not have been using recursion in the first place, and it seems they were too cheap to invest in analysis tools like Coverity.

[0] http://en.wikipedia.org/wiki/MISRA_C


To me banning recursion in order to prevent stack overflows is like banning arrays to prevent buffer overflows. It misses the problem, doesn't it?


Sort of, but not really...

If you statically allocate an array, the compiler will ensure that you get the amount of space that you asked for, or raise a compile-time error. If you dynamically allocate an array (which you probably shouldn't be doing in this case anyway) then you'll either get a pointer to an array, or NULL. Either way, you'll know when it's safe to use the array. With a little bit of discipline, it's not difficult to avoid buffer overflows.

Recursive functions don't have a guarantee of safely running. Yes, there are ways to show that certain kinds of recursion will always terminate, and it might even work when you're calling the function at the top frame, but what happens if it's called further down the stack? What happens if the data structure guiding the recursion changes and now it takes a deeper stack than before?


Most recursions can be written as tail recursions. If the compiler can optimize tail calls, in which case they behave like loops, and can warn whenever a recursion is not a tail call, then the point is moot (see the sketch below). With algebraic data types and pattern matching, often used for encoding the stages of a recursion including the exit condition, the compiler can even warn you that you missed a branch. In fact I find it easier to write complex loops as tail recursions, because in time and with practice it gets easier to reason about all the possible branches and invariants.
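
A minimal illustration in C (which, unlike the languages the parent has in mind, does not guarantee the optimization):

    #include <stdint.h>

    /* Not a tail call: the addition happens after the recursive call
     * returns, so every level needs its own stack frame. */
    static uint32_t sum_naive(uint32_t n)
    {
        if (n == 0) return 0;
        return n + sum_naive(n - 1);
    }

    /* Accumulator form: the recursive call is the very last action, so
     * a compiler that performs tail-call optimization (e.g. GCC at -O2;
     * verify the emitted assembly!) compiles it to a jump and the stack
     * stays flat. */
    static uint32_t sum_tail(uint32_t n, uint32_t acc)
    {
        if (n == 0) return acc;
        return sum_tail(n - 1, acc + n);
    }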

The real problem is that we need a higher-level systems programming language.

> Recursive functions don't have a guarantee of safely running.

Neither do loops for that matter. A loop doesn't have any guarantee that it will ever terminate. Most stack overflows that happen are due to recursions with bad or missing exit conditions, but you can have the same problem with plain loops too.

> With a little bit of discipline, it's not difficult to avoid buffer overflows.

Buffer overflows are among the biggest, most expensive problems in this industry and the primary reason for many of the vulnerabilities you see in the wild.


Completely unrelated to yesterday's "I No Longer Need StackOverflow" https://news.ycombinator.com/item?id=7251169

I was all excited to defend StackOverflow.com.


My first thought was: How could stackoverflow.com be responsible for car crashes?


I suggest we sing one tune to the song of another and re-purpose this thread to be about stackoverflow.com, as if nobody read the article!


I give it at least a 30% chance that this will happen automatically with nobody doing it on purpose.


No soap, radio!


I am ashamed to admit that it took me a full minute to realize that they are not talking about the website.


I was thinking the exact same thing, and only saw the "s" at the end of "Stack Overflows" when I came back to Hacker News.


I was thinking the same thing until I read past the first 2 paragraphs.


What's amusing is that the correct title of the piece (no plural on 'overflow') is more ambiguous. I wonder which way the HN title mods will rule.


I had an "unintended acceleration case" in my old Austin Morris 1300; the cable connecting the pedal to the throttle snapped, jamming it at a fixed (fairly high revs) level, requiring me to control the speed using the brake.

The solution was to pop open the bonnet and swap in a replacement cable, which probably cost a couple of quid.

This recollection combined with the Toyota story merely convinces me that automobile automation has got completely out of control.


I had almost the same problem: I just hit the brake and pulled the car off the road. Then I turned the engine off and saw that the plastic end of the cable was blocking the wire. I just removed it. No danger at all. It could be deadly if you have weak brakes and can't remember which pedal is the clutch :D


I think this issue is more to do with complexity than anything else; my car has an ECU based on an 8-bit CPU from the 70s, and I've never had any problems with it.


So does mine, and it is fuel injected. A cable controls a plate that controls the amount of air that gets into the engine. The only way for that design to fail WFO is for the return spring that keeps the plate closed to snap. At that point one would press in the clutch and turn off the ignition.

This kind of user corrective action is not possible on modern cars which I consider a huge engineering flaw.


I'm not tremendously knowledgeable about automobiles, but in a car with an automatic transaxle, shouldn't dropping it in neutral and switching off the ignition do essentially likewise?

I ask because that's how I practice responding to uncommanded acceleration, which I've done on occasion since I first heard of the failure mode. I've done this a few times a year in each of several cars, and as far as I can tell it's had no ill effect, but if "as far as I can tell" isn't far enough then I'd like to know it.


Switching off ignition is usually also done in software.

Many ECUs store data (user settings, logging, and fault codes) to non-volatile memory before shutting down, so the key only sends a signal; it's not a hard switch. But it could still help if all ECUs except the faulty one shut down correctly.


I know of no specific danger in shifting an AT into neutral. You need to be EXTREMELY careful when switching off the ignition that you don't lock the steering column. It would be better to

* switch to neutral

* safely exit the road and stop

* turn off engine

I am glad you have practiced it.

My concern is over the code running tiptronic transmissions, they are computer controlled manual transmissions where you no longer have an effective physical connection.


There was a programme on TV here in the UK a while back which featured an incredibly cost-conscious (read: mean) guy who turned off the ignition of his holiday hire car "to save money" while going down a winding mountain road in Spain. The steering locked and the car went over the side. Luckily he and his family survived and they were able to slaughter him on national TV over the incident.


My experience has been that switching the ignition from "run" to "accessory" doesn't risk locking the column, but now that I think about it, dropping the transmission into neutral should suffice -- the only concern I'd have there, with a runaway throttle, would be that it'd wreck the engine to have it running flat out with no load.


The engine will only be run at max for 15-30 seconds until you pull off, it will be fine. Better that than a potentially fatal accident. Only hardware after all.


How do you figure? Mechanical failures like that are no better than ECU-based failures. Indeed, doesn't matter how old the car is, the software in the ECU isn't going to develop new bugs over time. But mechanical pieces fail over time, as your story demonstrates.


Actually, the ECU might very well do that if it gets in field updates :)

Mechanical pieces fail over time, but we know a whole lot more about inspecting for, finding and preventing mechanical faults than we know how to create defect free software.


Exactly; and the mechanical case is often "field fixable" by a person with a modicum of experience learned while tinkering with cars (by a mate of mine in a few minutes, in my case), whereas now a tech needs to connect a laptop somewhere on the car to even begin a diagnosis.

Marginally relevant sidepoint: I remember reading that one of the advantages enjoyed by the Americans over the Japanese in WWII was the former's significantly greater expertise in mechanical maintenance of front-line equipment, gleaned from their culture of fixing up and maintaining automobiles at home.


Example 7 on page 18 of "UNIQUE ETHICAL PROBLEMS IN INFORMATION TECHNOLOGY" by Walter Maner seems quite appropriate here:

Quote:

"A program is, as a mechanism, totally different from all the familiar analogue devices we grew up with. Like all digitally encoded information, it has, unavoidably, the uncomfortable property that the smallest possible perturbations -- i.e., changes of a single bit -- can have the most drastic consequences."

http://faculty.usfsp.edu/gkearns/articles_fraud/computer_eth...


Stack overflows hate the elderly:

http://www.forbes.com/2010/03/26/toyota-acceleration-elderly... (forbes.com)


Yes, I thought the whole issue was a myth.

Which does lead into the question: why are we trying to learn about stack overflows and critical system issues from imaginary incidents?

And even if it's not a myth, as noted above there's no proof that a stack overflow was the supposed issue.


The less specialized in SW a company is, the worse its software is.

What we are accustomed to discussing on HN, for example, does not exist in these worlds. Continuous integration? Unit tests? Even complexity analysis?

And very, very old code that's patched over and over and shipped "when it works".

It's usually people who have had only academic contact with programming languages and embedded development and don't know anything about code quality. But you can bet their bosses incentivize CMMI and other BS like that. (Yes, complete and utter BS.)

Not to mention ClearCase, which seems to be a constant; the worse the company, the more they love this completely useless piece of crap.


> does not exist

I've worked in that world for a long time, and I assure you we did continuous integration, unit tests, and complexity analysis. Way back in the early 90's, long before it made it into the general population, so to speak.

I agree that there are terrible groups out there, but in general there is a far greater emphasis on safety, quality, and correctness than in the non-mission critical world.


Yeah, the software methodology at car companies makes a lot of the seat-of-the-pants just-ship-it stuff that HNers are used to look like kindergarten.

The car companies know how to do this. Maybe they messed up in this case (I'm skeptical of the article), but it's not because they don't know software.


Yes and no.

The transcript is very enlightening. It was extremely clear that on this particular project, the software development process was a total trainwreck. No one who was familiar with the SW dev literature had technical leadership and authority over the codebase. As a matter of fact, the transcript is so shocking it could be used as a manual of antipatterns for SW development, inside embedded and out. A friend and I (we both used to work at an embedded systems company) spent an evening going over the transcript and mocking the errors. :-) By and large, the errors were design errors. E.g., too much work on the critical threads. Not separating brake and acceleration threads. Four thousand globals. I think the cyclomatic complexity was something like > 1000 for the control-path function. Etc.

One of the remarks is actually that Toyota had taken some lessons learned from the time the codebase was developed and had been working on improving since then. So that's good.


It's sad that recursion is considered dangerous. Tail calls have been known about for a very long time, and the duality between stack and heap for just about as long.


Recursion is a tool, and like any other tool, you can use it to break windows. It's no more or less dangerous than string manipulation in C.

Transforming recursive algorithms to tail calls only isn't always an option, or it makes the algorithm unnecessarily complex. The problem here is that in any critical control unit, recursion must be verified safe and/or there must be an external measure to detect and recover from stack overflows.


Even if you allow tail recursions, you still may want to disallow general recursions.


Tail calls are only helpful if optimized into jumps instead of function calls.


And this only happens if your compiler will recognize it. Not all compilers are smart enough. And embedded compilers often don't have the love that mainstream compilers get.

In the absence of specific and concrete evidence that your compiler performs this optimization, and without having tested it (including checking the emitted assembly), it is correct to assume that TCO does not happen and to perform stack depth analyses based upon that.


If you also built the computer, wouldn't you already have that specific and concrete evidence?


Beg pardon? I don't know what you mean. Most embedded companies purchase a chip... sometimes the compilers come from the same company that made the chip, sometimes not.

Regardless, it's the engineer's responsibility not to make assumptions about such a critical part of the design.


Would it be considerably expensive to check at runtime that the SP is in an expected range every time it gets moved? This would work with multiple stacks, too.
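
A cheap approximation that needs no hardware support (a sketch; the linker symbols are illustrative, and taking a local's address as the SP is a common embedded idiom rather than strictly portable C):

    #include <stdint.h>

    /* Stack bounds exported by the linker script; names vary by toolchain. */
    extern uint8_t _stack_bottom, _stack_top;

    /* Approximate the current stack pointer with the address of a local.
     * Cheap enough to run at every context switch or periodic tick. */
    static int sp_in_range(void)
    {
        uint8_t marker;
        uint8_t *sp = &marker;
        return sp > &_stack_bottom && sp <= &_stack_top;
    }

    /* e.g.: if (!sp_in_range()) fatal_error(); */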


Well, assuming you run in virtual mode, you could always leave the page/segment directly below the stack unmapped; this way, if the stack overflows, it'll trigger an exception by accessing invalid addresses.
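
On a hosted system with an MMU, POSIX even exposes this directly (a sketch; as the replies below note, many MCUs have no MMU, so this doesn't transfer to small embedded parts):

    #include <pthread.h>

    /* Ask for an unmapped guard region below the thread's stack, so an
     * overflow faults immediately (SIGSEGV) instead of silently
     * scribbling over adjacent data. */
    static int make_guarded_attr(pthread_attr_t *attr)
    {
        if (pthread_attr_init(attr) != 0)
            return -1;
        return pthread_attr_setguardsize(attr, 4096);  /* >= one page */
    }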


There is an enormous class of embedded devices not running on chips capable of address virtualization.


You don't need virtual memory, just memory protection. Many microprocessors and microcontrollers offer just an MPU, without the full fat of address translation machinery needed for an MMU.



Unless you move the stack pointer by more than a page. (The stack in Linux grows automatically: IIRC any access at most 64k below the stack pointer is considered an attempt to grow the stack, because the compiler may very well reorder a change of SP in a function's prelude and an access to the newly allocated stack frame.)

Edit: I'm sorry, I was somewhat wrong. Accessing stack below SP is considered a bug, but x86 enter instruction can make it seem like such an access has taken place if it faults:

http://lxr.free-electrons.com/source/arch/x86/mm/fault.c?v=3...


Some very small microcontrollers don't even have a readable stack pointer. The usual trick in this case is to fill the bottom of the stack space with predefined, well-chosen values and to check that those values are still there.
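
That trick, often called stack painting or watermarking, looks roughly like this (a sketch; the paint pattern and sentinel size are arbitrary choices):

    #include <stdint.h>

    #define STACK_PAINT 0xAAu  /* pattern unlikely to be written by accident */

    /* Called once at boot, before any task runs, on the unused region. */
    static void stack_paint(uint8_t *bottom, uint32_t bytes)
    {
        while (bytes--)
            *bottom++ = STACK_PAINT;
    }

    /* Periodic check: if the paint at the very bottom has been
     * overwritten, the stack has (nearly) overflowed. Counting the
     * surviving paint bytes also yields a high-water mark for free. */
    static int stack_bottom_intact(const uint8_t *bottom, uint32_t sentinel_bytes)
    {
        while (sentinel_bytes--)
            if (*bottom++ != STACK_PAINT)
                return 0;  /* corruption detected: take the failsafe path */
        return 1;
    }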


Another example of C's impact on our daily life.


I haven't waded into all this, and it's been years -- and years -- since my education touched upon systems that physically separate operating instructions from data memory.

But... sooner or later, it seems, we are going to go (back) there.

Instructions will become truly privileged, physically-controlled access. Data may go screwy -- or be screwed with -- but this will not directly affect the operating instructions.

Inconvenient? As development becomes more mature, instructions will become more debugged and "proven in the field". Stability and safety will outweigh ease and frequency of updates.

My 30+ year old microwave chugs along just fine. It doesn't have a turntable nor 1000 W, but I know exactly what it will do, how long to run it for various tasks, and how to rotate the food halfway through to provide even heating.

My 34 year old, pilot-light ignited furnace worked like a champ, aside from yet another blower motor going bad. I listened to the service tech when he strongly suggested replacing it before facing a more severe, "winter crisis" problem.

The new, micro-processor based model is better in theory (multi-stage speeds, and longer run times for more even heating). In practice, it's been a misery. The first, from-the-factory blower motor was defective. When that was replaced, the unit started making loud air-flow noises periodically.

Seeing the blower assembly removed, it's constructed of flimsy sheet metal. The old furnace, by contrast, had substantial metal construction that was not going to hum and vibrate if not positioned absolutely perfectly with brand-new, optimized ductwork.

Past a point, reliability starts to -- far -- outweigh some other optimizations.

This is going to become true in our field, as well.


Are there any downsides to having the memory set up the "safe" way that they describe? It seems like a win-win situation.

[edit] I guess I was thrown off by the shoot-yourself-in-the-foot scenario, where the stack grows toward fixed data structures. If the heap and stack grow towards each other, you have quite a bit of flexibility (though with some danger of collision). If you have the stack grow towards fixed data structures, its size is fixed and it can cause a dangerous overflow. The only disadvantage of the safe example is less flexibility, but for a critical embedded system, that is fine.


Although I use managed languages, I wouldn't want my code audited by NASA.

When 180+ IQ brains analyze your work they're bound to find "horrible defects" that no "competent" programmer would ever make.


It is probably too easy to confound NASA's engineering achievements with just raw intelligence. They have a huge body of standards and methodology on about every level of software engineering: time-proven coding standards, systems design guidelines, testing procedures, quality assurance, and the talent to execute on it.

I just remembered an old article [1] about the software development process at NASA and how it is not putting reliance on the kind of rockstar-programmer-genius culture that is so common in other parts of the industry.

[1]: http://www.fastcompany.com/28121/they-write-right-stuff


I just refactored some code yesterday and I feel pretty confident; I'd love to know what they think :P Though my app is just some silly Ruby code, not real-time, life-critical embedded C software.


Awesome, let's walk through every library, system, and machine call exercised by your ruby code. What happens if you gc while hitting a chunk of ram that was just hit with a corruption causing cosmic ray while your OS paged out your app because your intrusion detection/prevention service went defcon two on a leap millisecond that was announced since your last OS patch and your locale was set to C?

Please support your answer with tracing analysis over a set of one billion Monte Carlo simulations and present an accurate and up-to-date IDEF6 model of the application system.


Parking an SUV on Mars is a lot more complicated than parking one at the grocery store.


Plus, if the one on Mars kills anyone, we all know how that movie goes. One does not simply walk away from killing an extraterrestrial.

I'm going to lick someone with a very bad cold now, just in case.


If anything weird happens I blame Matz, and after that Linus >_>


Universal law: look at your code again in a year and you'll wonder what the hell you were thinking.


2 weeks


I'm always skeptical about non-trivial recursive calls, and I generally pass a "depth" variable in as the first param, increasing it on each call, with some sane cut-off point where it just returns.
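
Something like this, say (a sketch with a made-up tree type; the cut-off value is workload-specific):

    #include <stddef.h>

    #define MAX_DEPTH 32  /* sane cut-off for this workload */

    struct node { struct node *left, *right; };  /* hypothetical data */

    /* Returns -1 when the cut-off is hit, so the caller knows the
     * result is truncated rather than silently wrong. */
    static int count_nodes(const struct node *n, int depth)
    {
        if (n == NULL)
            return 0;
        if (depth >= MAX_DEPTH)
            return -1;

        int l = count_nodes(n->left, depth + 1);
        int r = count_nodes(n->right, depth + 1);
        if (l < 0 || r < 0)
            return -1;
        return 1 + l + r;
    }

    /* first call: count_nodes(root, 0) */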


I think cars should just come with slots where we could put in our phones and bam!, powerful computing that you could carry along.


tl;dr: make the stack bigger! (Then is it really a stack overflow? Oh, and by the way, this won't work in most systems, since virtualized stacks on top of physical memory make concepts such as the ordering of memory regions meaningless... but never mind that.)

The obvious solution to stack overflows is to make the stack bigger. The obvious problem with this solution is that it just kicks the can down the road.


Before you wonder about stack overflows, ask yourself why the occupants never applied the brakes.


[deleted]


Could you at least try to read an article before you comment? Like, at least the first 5 words?



