I read the article from tip to tail, and as a professional engine mechanic it reminds me of Ford's "MyFord Touch" platform.
The system was rolled out across newer fleets without much testing, and in some models it controls practically every feature in the vehicle, from climate control to the OnStar SOS. There was a recall for Platinum-model F-150 trucks because the system could glitch after so many hours of continuous operation and trigger a fault in the four-wheel brake force distribution system. This in turn either completely arrested the brakes or caused them to quietly apply themselves at around 15%... you couldn't undo this unless you pulled the fuse. Even worse, collision detection would be disabled because the system thought you were aware of a potential crash and were already braking.
If a certain Bluetooth phone were paired, it could cause the trailer load and position sensor to erroneously predict grade downshifts. The result was that incoming calls on the highway would either wreck the rear differential or put the truck on the side of the road.
Chrysler had a similar issue [0] this last year: their Pacifica minivan had a software bug that would cause the engine management computer to crash and the vehicle to completely lose power while operating (including at highway speed). In effect you'd have complete loss of electrical power, and even power steering, brake assist, and the hazard warning lights(!) would be lost.
Took them six or more months to fix it as they weren't able to reproduce the software glitch.
On the Stanford Solar Car Project -- a mostly-undergrad student group -- we always built our cars with two physically separate CAN buses, one for nonessential features and one for the motor controller.
That way, we limited our surface area for catastrophic bugs: the only things connected to the safety-critical CAN bus were the motor controller itself and the throttle board.
This for a one-off experimental vehicle.
CAN is a very simple, unauthenticated bus. Any device can send a message that every other device on the bus will receive. Messages are typically just a few bytes long, binary encoded.
The idea of attaching an internet-connected infotainment computer to the same CAN bus as the brakes is absurd. Doing so on a production car, even more so.
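To make the "unauthenticated" part concrete, here's roughly what talking to a CAN bus looks like from a Linux-based head unit using SocketCAN. The interface name, CAN ID and payload below are made up for illustration; the point is how little it takes once you have any foothold on the bus:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/can.h>
    #include <linux/can/raw.h>

    int main(void)
    {
        /* Open a raw CAN socket and bind it to the bus interface ("can0" here). */
        int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
        if (s < 0) { perror("socket"); return 1; }

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "can0", sizeof(ifr.ifr_name) - 1);
        if (ioctl(s, SIOCGIFINDEX, &ifr) < 0) { perror("ioctl"); return 1; }

        struct sockaddr_can addr;
        memset(&addr, 0, sizeof(addr));
        addr.can_family  = AF_CAN;
        addr.can_ifindex = ifr.ifr_ifindex;
        if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind"); return 1;
        }

        /* Any node can claim any ID: nothing on the wire says who sent this. */
        struct can_frame frame;
        memset(&frame, 0, sizeof(frame));
        frame.can_id  = 0x123;   /* made-up arbitration ID */
        frame.can_dlc = 2;
        frame.data[0] = 0xDE;
        frame.data[1] = 0xAD;

        if (write(s, &frame, sizeof(frame)) != sizeof(frame)) {
            perror("write"); return 1;
        }
        close(s);
        return 0;
    }

Nothing in the frame identifies the sender, which is why segmentation and gateways matter so much.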
--
The Ford story is disappointing but not totally new.
> CAN is a very simple, unauthenticated bus. Any device can send a message that every other device on the bus will receive.
True, but in practice messages that are deemed important are secured at higher OSI levels, and the identity of important bus participants is cryptographically checked during vehicle startup.
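For readers unfamiliar with how that's typically done: schemes along the lines of AUTOSAR SecOC append a freshness counter and a truncated MAC to the signal payload, and the receiver recomputes it before acting on the value. A rough sketch, with invented field sizes and a toy stand-in for the real crypto (not any particular vendor's implementation):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy stand-in for AES-128 CMAC; a real ECU would call its crypto
     * library or HSM here. NOT cryptographically sound. */
    static void toy_mac(const uint8_t key[16], const uint8_t *msg, size_t len,
                        uint8_t mac[16])
    {
        memset(mac, 0, 16);
        for (size_t i = 0; i < len; i++)
            mac[i % 16] ^= (uint8_t)(msg[i] + 31u * key[i % 16]);
    }

    /* Sketch of a SecOC-style secured payload: 4 data bytes, the low byte of
     * a freshness counter, and a 3-byte truncated MAC. Widths are invented. */
    struct secured_pdu {
        uint8_t data[4];
        uint8_t freshness;
        uint8_t mac[3];
    };

    static void secure_send(const uint8_t key[16], uint32_t counter,
                            const uint8_t data[4], struct secured_pdu *out)
    {
        uint8_t buf[8], full_mac[16];

        memcpy(out->data, data, 4);
        out->freshness = (uint8_t)counter;          /* truncated on the wire */

        memcpy(buf, data, 4);
        memcpy(buf + 4, &counter, 4);               /* MAC covers the full counter */
        toy_mac(key, buf, sizeof(buf), full_mac);
        memcpy(out->mac, full_mac, 3);              /* only 24 bits transmitted */
    }

    int main(void)
    {
        const uint8_t key[16] = { 0 };              /* shared key, provisioned offline */
        const uint8_t wheel_speed[4] = { 0x01, 0x90, 0x00, 0x00 };
        struct secured_pdu pdu;

        secure_send(key, 42u, wheel_speed, &pdu);
        printf("freshness=%u mac=%02x%02x%02x\n", (unsigned)pdu.freshness,
               pdu.mac[0], pdu.mac[1], pdu.mac[2]);
        return 0;
    }

The receiver keeps its own copy of the counter, rebuilds the MAC, and drops the frame if it doesn't match or the freshness value has already been seen.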
> The idea of attaching an internet-connected infotainment computer to the same CAN bus as the brakes is absurd. Doing so on a production car, even more so.
Which is why all vehicle architectures I'm familiar with have a bunch of CAN (and other) buses.
Maybe it used to be different, and I certainly know only a few bus architectures, but they make an honest attempt at securing the important bits at least.
Industry engineers may not get everything right, but they're not that stupid. Cut them some slack.
IMO there should be separate infotainment and hi-speed system CAN buses. It's lunacy to place engine control on the same bus as something with internet access that doesn't get upgraded, like, ever. I don't care how well it's tested.
Agreed – FYI: Tesla Model S (and likely X and 3) vehicles use a locked-down gateway (offering specific API actions, which result in CAN communication, etc) between their Ethernet network and CAN [1]
That's how it is on my car, but at the same time my car basically has no fancy electronics.
I understand software engineers pulling stunts like this, because there isn't actually a regulated engineering license, but I'd honestly have expected automotive engineers to be vehemently opposed to putting the stereo on the same system as the brakes.
It's not the software engineers that are the problem; it's the automotive design managers who decide that there will be one computer and one bus.
The DOT should have standards for this stuff. The infotainment system shouldn't have control authority over, or share critical buses with, the engine and braking/traction control systems.
I wonder if some of this is to do with weight reduction. Having separate buses must add a lot of extra equipment and wiring.
This means more weight and more power required to operate the electrics and in turn means lower MPG, higher fuel consumption and a more expensive car to design and manufacture.
The net outcome is a less competitive car in a competitive market.
I've worked with software engineers working on the CAN bus, and I had the same questions as you. According to them it has to do with the amount of cabling needed as well as electrical interference. A car has a lot of cabling going all over (a couple of hundred meters in total was the number I got), and if you can put as much of it as possible on the same bus(es), you save yourself a lot of problems.
So I suppose weight could be a factor, but space seems to be the big one.
I mean "a lot" in computer terms is like 4 lbs of equipment but since the brake pads on a truck weigh in at more than that I assume its not weight related.
Most modules have some feature or other that needs to communicate with infotainment or another module that needs to communicate with infotainment. Error reporting is the big one, but there's lots of little ones too. The brake module, for example, may be involved in tire pressure sensing, and also takes configuration for traction control and reports activation.
It's hard to find an appropriate segmentation for an air gap without sacrificing features, and most customers won't trade features for "security".
You are right of course. But still. Once you can talk over the CAN bus you can spoof pretty much anything, if I recall correctly. You could at least have some intermediary module on both buses that acts as a proxy. Then if someone hacks your radio they can't pretend to be the ECM.
The infotainment system should only be allowed to read values from the modules related to the actual driving. Configuring things like traction control will just have to be done via a system separate from the infotainment system.
I understand that it's a little wasteful not to use that big touch screen in the center console for configuring the driving experience, but there's no safe way to do so.
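A sketch of what that proxy/read-only idea could look like on a bridge ECU, with made-up frame IDs (not taken from any real vehicle): only a whitelist of frames is copied from the critical bus to the infotainment bus, and nothing ever flows the other way.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Minimal CAN frame representation for the sketch. */
    struct frame { uint32_t id; uint8_t len; uint8_t data[8]; };

    /* Frame IDs the infotainment side may *read*; the IDs are made up. */
    static const uint32_t readable_ids[] = { 0x1A0 /* wheel speed */,
                                             0x2B0 /* coolant temp */,
                                             0x3C4 /* tire pressure */ };

    static bool id_is_readable(uint32_t id)
    {
        for (size_t i = 0; i < sizeof(readable_ids) / sizeof(readable_ids[0]); i++)
            if (readable_ids[i] == id)
                return true;
        return false;
    }

    /* Called for every frame seen on the critical bus: whitelisted frames are
     * copied to the infotainment bus. Frames arriving *from* the infotainment
     * bus are simply never forwarded, so a compromised head unit can listen to
     * the whitelisted signals but cannot impersonate the ECM or brake module. */
    static void bridge_from_critical(const struct frame *in,
                                     void (*to_infotainment)(const struct frame *))
    {
        if (id_is_readable(in->id))
            to_infotainment(in);
    }

    static void fake_infotainment_sink(const struct frame *f)
    {
        printf("forwarded id 0x%03X (%u bytes)\n", (unsigned)f->id, (unsigned)f->len);
    }

    int main(void)
    {
        struct frame wheel = { .id = 0x1A0, .len = 2, .data = { 0x01, 0x90 } };
        struct frame brake = { .id = 0x220, .len = 1, .data = { 0xFF } };  /* not whitelisted */
        bridge_from_critical(&wheel, fake_infotainment_sink);
        bridge_from_critical(&brake, fake_infotainment_sink);             /* silently dropped */
        return 0;
    }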
> Modern electronic cars use a single bus (CAN bus) that connect all electrical systems. The brakes, the engine, the wipers, the stereo...
Not true.
The automotive bus architectures I know first hand (admittedly far from all of them of course) have multiple CAN buses, plus other bus technologies as well (MOST/LIN/Ethernet).
I don't think it would even be possible from a bandwidth/latency standpoint to push everything across a single CAN bus.
I had Hughes as a lecturer at university (Chalmers, Sweden) and we heavily used QuickCheck in our Haskell courses. He's a great lecturer and a brilliant mind. I believe they still teach new CS students Haskell with QuickCheck in their first course.
There should be at least two if not more CAN buses in a modern vehicle. The NHTSA should mandate that safety critical systems be on their own (redundant) network.
Yeah, that's a terrible idea. You should have at least three separate communication channels, and the most critical should be as isolated as possible from the others.
Critical (e.g. brakes, steering, engine control), important (AC, heat, dashboard), and then everything non-essential.
The people that design cars need to be software engineers but instead they're behaving like mechanical engineers.
This seems like just a mind-bogglingly bad architecture. Is there any real reason a system should be designed this way, or is this organization simply the result of poor thinking applied incrementally to inconvenient realities?
In most cars this is not the case. For example, in my 2005 Opel Astra (Saturn Astra in the US) there are three buses: a high-speed CAN used for critical systems, a low-speed single-wire CAN used for other vehicle systems, and a mid-speed CAN for the entertainment and climate control stuff. All traffic between these buses is "firewalled" by CAN bridges that should only forward relevant frames between the nets.
VW, for example, has a firewall in front of the OBD connector, only allowing traffic for the diagnostic addresses to pass.
However, I expect that in newer "cloudy" cars, they need so much data that these "firewalls" have become very permissive. Remote start via Apps, triggering signal horns from the Internet, OnStar telemetry reporting etc.
Traditionally the car makers have been completely terrible at tech security, but they are slowly improving. In fairness, they've also been hampered to some degree by regulations protecting small local garages, which state that the diagnostic stuff cannot be locked down too hard.
The trend towards monolithic hub systems is baffling. Didn't we learn those lessons while we were struggling with mainframes and timesharing? You don't let critical systems lock up, ever. You don't even directly interact with them; you send them messages, and if they don't receive them they should continue to function as the operator would expect. The collision avoidance system failing to function correctly is nowhere near as dangerous as arbitrarily applying the brakes.
> Didn't we learn those lessons while we were struggling with mainframes and timesharing?
I feel like there are really three kinds of programmers today: those who learned first-hand the aforementioned lessons, those who were taught second-hand these lessons, and those who are destined to learn these lessons first-hand.
There are very few people relatively who experienced the lessons first hand. They taught the lesson to a larger group, who will hopefully do their best to heed its warning.
But what's happened? A whole lot of people came into programming in the last couple of decades. I feel like there literally aren't enough "elders" in the field to teach the influx of new programmers. They're making the same mistakes again because for every one you tell not to, there are 10 more who are willing to try. And to some of them we're giving millions of dollars, and they're forming companies with it, which go on to sell products about which exactly 0 people have asked "Is this a good idea? Has it been tried before? What was the result? What can we learn from that? Has anything changed since then that would lead to a different result?"
That's pretty much the same story as Usenet's Eternal September: as the rate of newcomers increases, it eventually outpaces the rate of teaching. At that point, no culture can be sustained for longer than a year; a new batch will arrive, learn things from scratch, and come up with their own solutions... until the next batch, which overrides any lessons learned by the prior one.
At the very least, here, the knowledge learned is sustained by individuals, and acknowledged (so you still have “experts” and “elders”), but as a general community, you can safely assume that everything is forgotten very rapidly.
Another example I've seen is in video games, where modern designers seem to have forgotten that games existed before 2005: the same design mistakes solved in 1990 crop up again in 2018.
My preferred example being star control 2 vs mass effect: very similar games in design and spirit, two decades apart, and mass effect features many of the same mistakes, and even mistakes solved in star control. As if they had started the design from scratch.
Note that I'm talking about ME1; I never played ME2/3, since I never cared about their combat systems (and it's not very interesting anyways)
Also note that in some fashions ME1 does succeed in its goals, and outdoes SC2, but it doesn't really matter for this; whats interesting is what it failed to learn from its 20 year predecessor
tl;dr: SC2 does less work and gets much further with it, both mechanically and narratively, in a lot of ways that ME should have learned from. Instead, it seems more like they weren't even aware space operas existed outside of film.
The same mistakes include things like the landing/mining minigame, which is terrible in both games in similar fashion: it's kind of interesting for the first couple of rounds until you realize it's repetitive, mechanically simple, and somewhat poorly controlled. In ME, it's a tiny bit alleviated by being in 3D, so driving around is more fun, but the geology remains dull and generally pointless.
Both feature the nuisance of having to iterate over every world, looking at the stats, and dropping 80% of them as unlikely to be worth exploring. SC2 also acknowledges its mediocrity and generally avoids requiring it: very few quests require it (and the coordinates are usually given), and it becomes mostly unnecessary for mining further into the game as you find massive resource deposits. ME's 3D becomes a negative in this regard, as they still use it for questing throughout the game, and while SC2 has a small 2D screen to parse for the quest location, ME requires "exploring" the dead lands to find whatever marker. The lack of variety becomes more obvious due to the 3D environment (and the amount of time you're stuck in it) compared to SC2.
Notably, the existence of shit exploration in SC2, and how memorable it ends up being, should have been a strong indicator to ME not to pull the same shit
Probably much more based on personal preference, but the much more open, broad and simple narrative of SC2 is more compelling than ME's, at least partially because there's less for them to do a poor job in. ME has bigger and more varied politics between the races, and by far is the more serious game, but this also leads to a lot more stupid politics. Trying to give a bigger background to the races just leads to each race feeling not much different from humans, because, well, they basically are humans in a different skin. Same politics, same motives, different colors. SC2 "cheats" by simplifying races to really distinct, bare-bone traits, and going from there.
The Spathi are absurdist cowards, and that's all they are. Their background stories, and all their operations, derive almost entirely from their extreme fear of everything. The Zoq-Fot-Pik are a strange, friendly symbiotic trio of aliens, operating by weird language rules and odder political preferences. A lot of races lack a clear background, told only in minor hints, and this makes them feel more alien than ME could hope to accomplish. It's compounded by never requiring them to walk around and such; they're primarily differentiated by their voices and tiny gif-like animations, leaving the rest to the imagination.
ME makes the mistake that SC2 didn't: aliens are interesting by their very nature; the haze of information is what fuels it. SC2 also has the benefit of being comedic, so it can come up with less plausible stories. ie from SC2's wiki: "For another fifty thousand years, the Zoq, the Fot and the Pik relaxed in the forests, until one day one of each race was walking up a steep path looking for something to eat, when a bolt of lightning struck nearby. The bolt of energy carved a wheel-shaped chunk of granite out of a cliff. As the rock began to roll down the hill, some dry grass got caught in its hole, and since the rock was still hot the grass caught on fire. Thus the Zoq, Fot, and Pik simultaneously discovered the wheel, fire, and religion"
But regardless, it puts out the feeling of Star Trek far better than ME could ever hope to accomplish, at least partially because the game doesn't try as hard. It might be argued that ME was leaning more towards Star Wars (space drama), but even then, it fails by virtue of expunging too much information (and not being grand enough)
And there are other smaller things like SC2 granting a greater degree of freedom, again by virtue of doing less work on its weaker parts. The combat is simpler, and doesn't require it for the most part, and is generally better off for it (ME loves to pretend it has a good enough system; it does not.) SC2 still has the fault of having an annoying amount of combat for what it is, but still far less so than ME (and doesn't really require it for any narrative events). Exploration is more interesting in SC2, again because of the freedom, narrative structure, and it wasting less time on its weaker components.
And of course there are those mistakes that just come out of modern game design, but these are unsurprising:
fast travel: kills any sense of distance and exploration (in ME's case, all travel is fast-travel, so space doesn't feel at all ... distant. The citadel feels larger than space.)
quest-logs: kills any sense of autonomy in the narrative, and background-discovery, and even exploration itself
emphasis on combat missions: ME does not have a good combat system. There's absolutely no reason for it to be emphasizing it.
emphasis on the player character: it's a space opera, and it's focused on upgrading your player character & co? This is just... wrong. Personal preference, but focusing on ship upgrades is much more sensible. But then, the game barely involves space in the first place, so maybe not. It could have been a medieval fantasy and it wouldn't have been too different, in a lot of regards. (Of course, it is BioWare, so maybe it really did derive from medieval fantasy.)
dialogue wheel: adding morality to the choices, and then making it obvious to the player? It's just absurd. Narrative branching on action alone is sufficient, less work, and obviously more correct.
combat-wise there are a lot of mistakes as well, but they're uninteresting for this discussion, and have a lot of other predecessors they failed to learn from. Suffice to say, it's not a shitshow, but it's not well done. What's more interesting is that the designers failed to acknowledge its lack of quality and account for it (or maybe they did so intentionally; it's an EA title, so publishers are likely a good deal at fault for that one).
I'd say a bigger problem from my perspective is all the non-programmers who get into management and project management roles who never learned these lessons and who refuse to listen to programmers trying to teach them. They'll ignore you, and if you don't 'be a team player' you'll end up replaced by someone who will.
Because car companies have always built cars, not secure networks. Those executives have never been to DefCon; they have no idea what level of risk they face from the unsecured electronics components in their cars. They are not just unaware that putting critical systems and non-critical systems on the same bus is a bad idea, but they are also completely ignorant of the difference between real-time systems and the operating systems on their phones and computers.
They need to hire some folks with aviation backgrounds, who can explain to them why the plane does not fall out of the sky when the in-flight entertainment system chokes on a scratched DVD. Even when they get it wrong, it is still less wrong than the auto-makers.
[Edit:]
Aviation people are the only outsiders they're likely to listen to.
They're not perfect. But they are better. They at least have an awareness that security is an issue, even if the ways they handle it are... well... let's just say they're not ideal.
Even though the tech folks who frequent this site are knowledgeable, the people who build stuff don't always respect our expertise. Sometimes they don't even realize that our expertise might apply to their problems. This is how we get black-box electronic voting machines and CANBUS2 and wi-fi light bulbs or security cameras that inadvertently open a back door into your LAN.
I'd be surprised if aviation systems are much better. They're really into putting everything on the same physical network too, but employing these things called "data diodes". As if that's a real thing, practically speaking.
It absolutely could be, though. Or you could use half an ethernet port. The idea of a data diode is fine, it's the [lack of] implementation that's at fault.
I mean, those movies being played on the back of seats aren't being run through a serial port. Given the move to an "on demand" scheme, they're not one way. The data diode is clearly meaningless when there's obviously two way traffic.
I don't understand. You wouldn't store the movies on the avionics systems in the first place. That's all inside the infotainment system, on one side of the diode. The things going over the diode would be stuff like current location and tire pressure.
> "data diodes". As if that's a real thing, practically speaking.
On CAN bus, it actually can be.
I've seen CAN bus participants with the transmit pin not connected. They were physically incapable of writing to the bus (granted, this drastic solution only works in very simple cases).
Maybe we shouldn't use avionics engineers to train the auto industry on system security (the article we're commenting on is about an overflow bug in new Boeing planes that could drop them from the sky).
The other comments have effectively lambasted the manufacturers for their incompetence and foolishness in putting these on the same system, so I'll skip that bit.
The practical reason is that the big central touchscreen is a convenient place to put stuff. On an old truck, you selected 4WD by moving a big transfer case lever, which indicated you were in 4WD by its position. No computer to get in the way. Slightly newer trucks have physical buttons with lights that connect to electrical solenoids, often directly. Tractors have brake pedals for each wheel, but old trucks don't have a similar braking force distribution system - you got locking diffs or dragged all the brakes and hoped it helped. With a touchscreen and a computer in place, it's easy to make compelling sales pitches for putting complex features there instead of on expensive, complicated physical controls.
Complicated on a per-unit basis, that is - I understand that a computer has far more moving parts than any plastic-and-wires assembly, but when it works it's easier to draw buttons and graphics than to make mechanical actuators.
Whether or not the processing happens on the same CPU or communication happens on the same physical bus doesn't matter that much when you have the ability to make selections that affect the engine and drivetrain on the main display.
Let me explain. Attaching a non-essential sub-system with a large attack surface to a mission-critical network is not malicious (I hope); it is incompetence. I would accept a unidirectional serial link with a validated protocol as the sole connection between the two.
I think it is the same kind of thinking that goes into other bad ideas: flickering stop lights, red turn signals. The auto makers just don't really care or think. This is something I don't understand.
There's also a ton of academic lurkers. I've always thought this is because 1) the border between academia and industry is pretty porous in technical fields (super obvious in ML research, but I think it holds more generally), 2) academics borrow from, and contribute to, the same set of methods and technologies, and 3) most individuals have an interest in complex systems (and their hilarious, tragic faults) such that they find articles like this one interesting.
I think it's also useful to look at preceding communities like Slashdot — in that case it started out pretty tightly scoped for Linux enthusiasts, gaming, programming, and the then-nascent social internet but then the cohort of regular readers and contributors got older and they found themselves in policy, academia, and managerial business roles; as long as people remained involved in the community, the content of the site expanded to reflect the (extremely rapidly increasing) role of tech in these domains. (Obviously, Slashdot suffered from several ownership changes, a lame commenting system, and most everyone moved on to other venues).
~30% of submissions are software-dev-only (new TypeScript 3.0 release), but most submissions are just high quality, interesting posts. And the level of discussion is excellent: no memes, no jokes. It's like the opposite of Reddit.
Our heavy-diesel emissions performance machine runs Linux, and after a vendor training on how to use it I became totally fascinated with it. A lot of Linux professionals like HN a lot, and the articles are a good break from the norm on a lunch break.
I came across HN by accident, and what keeps me coming back is the interesting, informative links & threads, and the quality of discussion. Threads about workplace issues are of interest to me, as are general scientific and social topics.
I'm a "scientific" programmer, meaning that I use programming as a problem solving tool, but I don't write commercial software for a living. I turned away from that path in the early 80s.
I've been reading this site since 2009 and can barely write anything more complex than Hello World. The guidelines on submissions say nothing about software engineering, and the definition of hackers as written by Paul Graham does not limit it to software engineers.
The comments are informative, there's no politics, it's more interesting than the news. I'm a sales recruiter, so I don't read the in the weeds software articles, but there are still 2-3 good articles on the front page at any one time.
HN is not a monoculture; the right content on the site can interest anyone, and no one is required to be a software developer to visit this website.
Nobody is required to be anything on any public forum; that doesn't change the fact that communities exist, and the vast majority of people on a forum are there due to a particular taste, and share a common preference for discussion.
That anyone of any interest can visit the webpage is not a sufficient explanation as to why people of every interest do visit it. The question is what's so special about HN that it draws such a varied background, despite clearly having its strongest preference towards software engineering (on any given day, if you counted by category, it would be surprising to find software engineering / programming not as the highest).
A reddit forum has the same public property, but the software engineering channels presumably do not draw such varied backgrounds. So why?
Maybe granularity -- there's only one front page for HN? I usually find that I want to read 10% of articles on the front page, but I'm often surprised by which 10%.
A lot of the content is still relevant for those in other IT roles, or even just hobbyists. I'm a security engineer rather than a dev but I still find many of the posts on the site useful and/or interesting
That's pretty spectacular, I'd love to read some more sources on it.
Makes me glad Tesla did their stuff the way they did. You can reboot the center console and/or dash while driving and not much happens other than the A/C turning off for about 20s.
Tesla pushed out a software update to AutoPilot (Lane Centering) that caused a fatality[0], I hardly think they should be held up as an example of good software hygiene.
There was recently an interesting twitter dump by a former Tesla engineer whose NDA was up; as of a few years ago their software development practices were... pretty bad.
Apparently every Tesla vehicle runs a Kubernetes cluster and is/was remotely accessible to Tesla via SSH?
> I don't know how they can relax at all relative to their work.
They strictly follow procedures. If faulty code gets through to production and somehow causes a fatality, it's a failure of the procedures, not the individual developer. This works so well in aviation and yet seems completely non-existent in the automotive world.
The procedures exist and are 'followed', but there's a lot more time-to-market pressure. Manufacturing lines dictate the schedule and software must keep up no matter what.
When I worked on safety-critical products, it was actually really nice to be able to push back.
"We can't satisfy the safety case yet" is all you need to say to get your manager to cave.
If they want to take the risk upon themselves to sign off on a known-unsafe (technically; in practice it was already pretty good, just not good enough yet) device, and go to jail if something goes sideways, they can be my guest...
In practice, they preferred to come back next week and ask if we were done yet.
The project went wayyyy past the deadline, thanks for asking :-)
If you as an individual are taking on that level of personal responsibility, your organization is broken.
Part of the regulatory process is to demonstrate sufficient levels of resilience in both the systems produced and the development process used to produce them. It should be a really boring process with a lot of crosschecks. You should be sleeping well knowing that your work is being reviewed and tested and attacked by other people.
(Reality is often different, of course. Just don't try to be the hero.)
I recommend reading the "NASA Manager's Handbook for Software Development". It contains some great guidelines for writing safety-critical software. AFAIK, no Space Shuttle mission ever suffered a serious safety incident due to a software defect.
I'm with you, I'm nowhere near confident enough in myself to work directly on safety-critical systems. I imagine an important part of those organizations is removing SPOFs with respect to the engineering team itself; i.e. nobody fails alone, and everybody tests the hell out of everything. Could be wrong though.
If you work for AirBnb your bad software is definitely changing lives and putting people in danger. Software doesn't have to be embedded to materially affect someone's life.
Saying it caused "a fatality" is being disingenuous. It's along the same false pretense as "we shouldn't use self driving cars at all, if they lead to even one fatality".
Fatalities aren't good things, but they're inevitable consequences of driving. And it's disingenuous to expect anyone's code to be guaranteed to work on every mile of road in the US, under all conditions, at all times. If autonomous vehicles substantially reduce overall fatalities we are better off.
(Nobody talked about banning combustion vehicles after the Pinto, or banning Fords over their side-mounted fuel tanks or their Explorers that rolled over when the defective tires blew out.)
> Saying it caused "a fatality" is being disingenuous.
It is accurate.
They pushed out an update that changed how AutoPilot behaved, someone had AutoPilot enabled, AutoPilot accelerated straight into a concrete barrier, and the occupant died.
Here is the NTSB's initial report on the incident:
> It's along the same false pretense as "we shouldn't use self driving cars at all, if they lead to even one fatality".
I didn't say anything remotely like that. I stated what had already occurred. Trying to put those words in my mouth doesn't seem like you're responding in good faith.
"You have no proof that the accident wouldn't have occurred anyway without the update."
Well, the NTSB preliminary report doesn't make any statements about what the software update did, per se. But it does indicate that the driver's hands were not on the wheel at the time of the accident and that the "autopilot" software was activated. So it's fair to say that the "autopilot" software was responsible for the crash.
"Or that their updates didn't save one or more lives."
This is a red herring. It could be true. But there's no evidence for it so it's not worth thinking about. Our system of moral judgements rightly puts a lot of weight on demonstrable causes and effects, and generally ignores hypothetical, but totally unproven speculation like this. Otherwise anyone could get off the hook for anything, by saying, "I may have done bad action X but you can't prove I didn't also do good actions Y and Z which could well outweigh X."
"Not an "accurate" but misleading accusation at the software without considering the hidden variables."
In ordinary life we never know all the hidden variables. We just make judgments based on the best available information (or if necessary, decide to postpone judgment until better information is available). The NTSB preliminary report seems credible and I see no reason not to draw reasonable conclusions from it.
Toyota had a similar brake software issue in 2010 [1]. They had a fix available 3 days after the NHTSA announced the issue. Toyota issued a voluntary recall for over 130,000 cars in the US. Since the recall was voluntary, it likely took years until some cars received that update when they were serviced for other unrelated issues. I have no idea if Tesla is doing things the best way, but Toyota having the ability to push that update out immediately would have indisputably prevented accidents and possibly saved lives.
I’m very glad that Tesla’s instrument cluster is semi-independent. Because mine crashes every few minutes while driving due to a “high priority” firmware regression.
On the latest ADB Podcast, they mention that certain car infotainment systems just go "blue screen" if you feed them a more recent Bluetooth version than what they have hardcoded.
John Hughes likes to talk about the CAN bus when discussing Quickcheck and test generation.
Learning about the crazy complexity and potential for unforeseen interactions made me very unhappy when security researchers remotely controlled a car on the highway as a publicity stunt. There’s simply no way they can know with absolute certainty what might happen.
Uhh, that's kind of terrifying. I'm due for a new car after 10 years, and I think I might specifically try to avoid systems that centralize things like brakes with entertainment systems...
> Your options are to increase the number of bits used, which puts off the overflow, or you could work with infinite precision arithmetic, which would slowly use up the available memory and finally bring the system down.
Yeah, no. Doubling the amount of bits to 64 while keeping the same precision gets you about 3 billion years worth of time, which is probably enough. And I'm going to leave calculating how much time it'd take to fill up any reasonable amount of memory with a single arbitrary-precision integer as an exercise to the reader.
Even if you do use arbitrary precision arithmetic and count nanoseconds, the heat death of the universe is more likely to occur before your number takes 1KiB of RAM.
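To spell out the exercise: a 1 KiB integer is 8192 bits, i.e. values up to 2^8192 ≈ 10^2466. Counting nanoseconds (about 3.15e16 of them per year), that's on the order of 10^2449 years before the counter needs its 1025th byte, versus the commonly quoted ~10^100 years for the heat death of the universe.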
Probably the deeper problem with using arbitrary precision arithmetic is that you end up with a variable-sized datatype, which I believe means at least a modicum of extra hassle & complexity for any language that the control software is likely to be written in. And less predictable timing, which might be a big no-no if this is something that needs to be used in timing-sensitive places.
I was going to comment how you should probably still account for an overflow condition by warning at 70%, beeping at 80%, refusing to take off at 85%, etc., but it turns out that even with nanosecond precision timekeeping (1e-9), a signed 64 bit integer is enough for 292 years of not rebooting.
(2^63)/1e9/3600/24/365 = ~292
Yeah, just go for that 64-bit int and call it a day.
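To put numbers on it (assuming the counter really is in hundredths of a second, which is only the article's guess at what the GCU does):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical uptime counter in hundredths of a second, as the
         * 248-day figure suggests; the real GCU implementation is not public. */
        printf("days until a signed 32-bit counter overflows: %.2f\n",
               (double)INT32_MAX / 100.0 / 86400.0);            /* ~248.55 */
        printf("years a signed 64-bit counter lasts: %.1e\n",
               (double)INT64_MAX / 100.0 / 86400.0 / 365.25);   /* ~2.9e9 */

        /* Incrementing the 32-bit counter past INT32_MAX is signed overflow:
         * undefined behaviour in C, and on real hardware it typically wraps
         * to a large negative value, which is exactly the kind of surprise
         * you don't want in a generator control unit. */
        return 0;
    }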
Specifically, 32 bits of PDP11 core memory cost on the order of $2, and that was per timestamp. Pretty obvious why you wouldn't go 64-bit unless you had to.
I don't think this case needed 64 bits for the clock, because realistically I would hope that the plane would be serviced more often than every 248 days. At each service the computer could be reset, or go into a self-check which would reset the computer and therefore prevent the overflow.
The thing about the error is that it's a very standard sort of error, but one which apparently formal methods didn't catch.
I've just been looking up TLA+ (a model verification language) and I'm not sure how it would deal with avoiding overflow; its approach is proving that a system will always be in a required state, but avoiding overflow in most practical approaches involves effectively staying far enough away from the overflow condition that it never appears (as the other comments here go into). Done right, overflow will never be impossible, just unlikely.
64 bits is certainly the solution I would go for as well. Arbitrary precision numbers are going to involve heap allocation which is something I would normally like to avoid entirely in a system like this.
This is reminiscent of the 1991 missile failure at Dhahran [1], which resulted in 28 deaths. The Patriot missile battery at Dhahran had been in operation for 100 hours, by which time the system's internal clock had drifted by one-third of a second. Due to the missile's speed, this was equivalent to a miss distance of 600 meters.
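(For scale: the widely cited figures are a drift of about 0.34 seconds after 100 hours of uptime and a Scud closing at roughly 1,700 m/s, so 0.34 s × 1,700 m/s ≈ 580 m, hence the ~600-meter tracking error.)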
Two weeks earlier, on February 11, 1991, the Israelis had identified the problem and informed the U.S. Army and the PATRIOT Project Office, the software manufacturer. As a stopgap measure, the Israelis had recommended rebooting the system's computers regularly. The manufacturer supplied updated software to the Army on February 26.
While it is a good lesson from a software perspective- your bits are going to overflow, make your software handle that gracefully- I've always been a bit uncomfortable with the blame for deaths being placed on that bug.
The bug didn't kill anyone. The Scud missile fired with intent to kill did. And of course, it was fired because it was the Gulf War and countries were attacking each other. Blame whom you like for that situation. But all anyone talks about is how it was the Patriot missile bug that led to the deaths.
The software bug failed to prevent the deaths that were going to happen anyway. Lessons learned, yes, but I'd hate to imagine a programmer somewhere living in guilt over it.
I agree. Really there are two alternatives to what happened.
1. The patriot battery does not have the bug, (or was recently rebooted) and now has a BETTER chance of hitting the scud missile. (but not guaranteed) People may or may not die.
2. The patriot battery is turned off, or not there, and nothing stands in the way of the scud missile. People most likely die.
> The software bug failed to prevent the deaths that were going to happen anyway.
It MAY have failed to prevent the deaths. I don't think Patriot has a 100% success rate. If the normal success rate of that intercept was 1%, 50%, or 99% - it changes the wording a bit I think?
> Official assessments of the number of Scuds destroyed by the Patriot missile system in the war have fallen from 100 percent during the war, to 96 percent in testimony to Congress after the war, to 80 percent, 70 percent and, currently, the Army believes that as many as 52 percent of the Scuds were destroyed overall but it only has high confidence that the Patriot destroyed 25 percent of the Scud warheads it targeted.
> Independent review of the evidence in support of the Army claims reveals that, using the Army's own methodology and evidence, a strong case can be made that Patriots hit only 9 percent of the Scud warheads engaged, and there are serious questions about these few hits. It is possible that the Patriots hit more than 9 percent, however, the evidence supporting these claims is even weaker.
>the FAA's new rules require operators to reboot the plane's electrical system every now and then because "all three flight control modules on the 787 might simultaneously reset if continuously powered on for 22 days." The effect of this simultaneous reset "could result in flight control surfaces not moving in response to flight crew inputs for a short time and consequent temporary loss of controllability."
IIRC, 64 bits give you 200 years when you're counting nanoseconds. Why on Earth would they use a 32-bit integer? I doubt this was some kind of microoptimisation. My bet is on some sort of legacy component that is 64-bit-o-phobic.
I was in training brushing up on my embedded programming skills, which thankfully I don't have to use anymore, and the instructor told a story about a bug in a missile that was on the verge of delaying its deployment due to tests failing.
While on the bench and being subjected to a bunch of tests to validate the seeker components against simulated targets and countermeasures, the flight control surfaces would start spazzing out after a few minutes in the test jig.
The problem had something to do with IR (or something) tracking and attitude control, with the tracker rolling over the odometer and sending spurious data to the flight guidance system, which caused the tests to fail and would have led to an in-flight breakup of the missile in the real world.
The onboard hardware was highly resource constrained, and engineers and developers worked for weeks trying to fix the problem, going as far as contemplating a complete redesign of the seeker system.
Then somebody pointed out that the time between launch and either of the missile's two end states (impact with the target or fuel exhaustion) was less than 60 or so seconds.
The only reason the bug was causing problems was that tests were being run back-to-back-to-back in order to speed up things, and the seeker subsystem was being powered on for way longer than it ever would be in the real world.
A missile shouldn't have to be on for more than a couple of minutes, and a Dreamliner shouldn't have to be on for more than 248.55 days so I'm willing to bet they stuck some ancient 8-bit micro in there, reused ancient code that was already tested, and called it a day.
> a story about a bug in a missile that was on the verge of delaying its deployment due to tests failing.
A Patriot missile battery in Saudi Arabia failed to launch against an incoming Scud, which then killed 28 soldiers. The reason was poor handling of rounding errors [1].
The maiden flight of the Ariane 5 rocket resulted in a RUD due to an integer overflow [2]
To be fair, there is little public evidence that Patriots ever killed a SCUD (but plenty of public evidence of the Army lying about their effectiveness).
Why stop there? Why on earth would anyone use an integer instead of a double, given the inherent risk of truncation error? Or for different arguments: Why on earth would anyone use a 16 bit wchar_t? Why on earth would anyone make char unsigned (or signed)? Why on earth would anyone put the little end first?
Machines are machines. They have fixed representations for different types, with tradeoffs. And you have to pick one. And the thing about timeout handling specifically is that everyone along the path from the timer driver up through the app needs to agree on the precision needed, or you'll get an overflow condition.
Arguments of the form "Bugs are bad and we shouldn't write them" have not historically helped with improving software quality.
I think the comment you're replying to makes a solid point, still.
If this bug were to appear every 200 years, that's substantially longer than the lifespan of any single airframe currently in existence (and nearly twice as long as the existence of powered flight) -- and if a Dreamliner were to actually survive that long (most likely as a historical artifact doing heritage flights, maybe), doing this kind of reset would just be part of the routine of getting the "old frame" up and running, not unlike timing certain steps of a WWII fighter plane's startup manually rather than having them done automatically.
Without saying that increasing the width of that variable is the _optimal_ solution, in this case deferring the error leads to safe and predictable operation over the airframe's nominal lifespan.
For what it's worth, quadrupling the variable's width to an int128_t would mean you can store over 10^21 years at nanosecond granularity, effectively future-proofing this bug out of existence, and I'm reasonably sure a modern system should be able to spare a dozen extra bytes.
> They have fixed representations for different types, with tradeoffs. And you have to pick one. And the thing about timeout handling specifically is that everyone along the path from the timer driver up through the app needs to agree on the precision needed, or you'll get an overflow condition.
I agree, but the difference between 32-bit counters and 64-bit counters as a precision refinement is very special. Upgrading from 32-bit counters to 64-bit counters, even when counting nanoseconds, lifts you out of human timescales. No electronic system has ever or will ever need to maintain a microsecond count for 600000 years, for example.
Fun fact: the C standard does not specify the width of wchar_t. While MSVC uses 16-bit wchar_t (UTF-16), gcc (even MinGW on Windows) uses a 32-bit wchar_t.
> Why on earth would anyone use an integer instead of a double, given the inherent risk of truncation error?
Depends on what you're doing.
> Why on earth would anyone use a 16 bit wchar_t?
That's a very good question! Either use something that can hold at least 21 bits, or use UTF-8.
> Why on earth would anyone make char unsigned (or signed)? Why on earth would anyone put the little end first?
These have nothing to do with space, what weird analogy are you trying to make?
> They have fixed representations for different types, with tradeoffs. And you have to pick one.
For most purposes, there is zero benefit for going too small. Pick a number that can't break under the use it's getting. If a fixed-size number can't do that, change your algorithm.
> And the thing about timeout handling specifically is that everyone along the path from the timer driver up through the app needs to agree on the precision needed, or you'll get an overflow condition.
Making sure your data types are compatible should be one of the easiest pieces of analysis you're applying to your safety-important code.
> Arguments of the form "Bugs are bad and we shouldn't write them"
Yeah, this isn't that at all. This is "think about the limits of your data types as you pick them". A computer being left on for a year should be an expected use.
"Well, I figured....that bridge is gonna fall down in 200 years no matter what, so I went ahead and used rubber bands to hold the suspension lines in place. Only cost me $2!"
No, just no. Don't be a hilljack. Build something to spec. 32 bits is clearly not to spec.
Both you and the other replier seem to have missed the point. I'm not arguing for the use of a 32 bit time value. I'm pointing out that the drive-by "why on earth" framing of the bug is unhelpful and probably harmful. Sure, maybe you'd never make this "simple"[1] mistake. But there are a thousand others you would.
You fix that with careful design and testing, not with "why on earth".
[1] As pointed out, it's not simple. Again, the whole stack needs to agree with you and your concept of "to spec", not just the code you're personally typing.
How do you know? What's the spec? At first glance, it seems totally reasonable that the vehicle would be completely rebooted at least once every n << 248 days.
I dunno, but clearly running more than 248 days was not in it, otherwise this would hopefully have been caught and tested. If it was specifically specced to last N days and N < 248, personally I would have asked for my money back. As this appears to have come as a surprise to everyone (caught late), I'm calling this one as I see it: a facepalm.
Of course, in security-relevant areas, proven legacy code always trumps new code. No one wants to lose an airliner because Phil, right out of a CS course, positively needed a Ruby interpreter in the main flight computer...
My favorite recommendation to people: take a C or C++ program they have written and compile it with full MISRA checking. It definitely makes things more interesting.
The hardest thing to change after the program is already finished is getting rid of allocations. But other than that MISRA compliance is not that difficult and avoiding allocations is not that difficult either if you plan for it from the beginning.
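As a sketch of the "plan for it from the beginning" point: instead of malloc'ing records as they arrive, you size a static pool for the worst case up front and handle exhaustion explicitly (the record type and pool size here are arbitrary examples, not from any standard):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Worst-case sizing decided at design time; no heap anywhere. */
    #define MAX_FAULT_RECORDS 64u

    struct fault_record {
        uint32_t timestamp_cs;   /* hundredths of a second since power-up */
        uint16_t code;
        bool     in_use;
    };

    static struct fault_record fault_pool[MAX_FAULT_RECORDS];

    /* Fixed-size pool "allocator": returns NULL when full instead of growing,
     * so memory use is bounded and provable, which is what MISRA-style bans
     * on dynamic allocation are really after. */
    static struct fault_record *fault_record_alloc(void)
    {
        for (uint32_t i = 0u; i < MAX_FAULT_RECORDS; i++) {
            if (!fault_pool[i].in_use) {
                fault_pool[i].in_use = true;
                return &fault_pool[i];
            }
        }
        return NULL;  /* caller must handle "pool exhausted" explicitly */
    }

    int main(void)
    {
        struct fault_record *r = fault_record_alloc();
        if (r != NULL) {
            r->code = 0x0123u;
            printf("allocated record, code 0x%04X\n", (unsigned)r->code);
        }
        return 0;
    }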
If you know what you are doing it is not difficult, but if you are doing it for the first time it is a good tool for learning how to think about all the repercussions of how you program something. MISRA isn't really there to catch problems with C; it is there to catch problems with people using C. So knowing what those problems are and how to avoid them is incredibly useful.
200 years sounded low to me and so I did the math and it is indeed pretty spot on. 63 bits, i.e. signed integers, will last 292.27 years and you obviously get twice that if you are not interested in negative numbers.
Right, thanks. My estimate was closer to 200 because I was thinking of Go's time.Time.UnixNano method[1], which returns undefined results when used before the year 1678 or after the year 2262. But yeah, if you count since the system restart, it's bigger.
Well, the sort of engineer working on this might very well have gotten their start in environments where they were restricted to 5k of RAM. In that case it's easy to paint yourself into a corner by choosing integer representations that are larger than you need.
The space shuttle had a related problem, although not caused by overflow. Some parts of the STS's avionics used a clock that would reset to zero at 00:00:00 on 1st January, while other components had clocks that would continue to count up. If a shuttle mission spanned the new year boundary then systems would panic if they could no longer agree on the time.
"One interesting fact is that the FAA claim that it will take about one hour to reboot the GCUs - so there clearly isn't a reset button."
I am incredibly surprised by this; most higher-level flight control systems have power-up requirements in the seconds, and lower-level actuator or engine controls have power-up requirements in the milliseconds.
A few years back I was travelling on a Virgin Pendolino train (an Alstom Class 390, probably) in the UK that was having a few problems. Speed changes were causing a lot of juddering, and as I remember the interior lighting and air conditioning were being unreliable. After limping on for a while, the train stopped at a small non-scheduled station and the crew announced that we'd be stopped there for a while, and not to be alarmed if the lights went off (it was night). This duly happened: the ventilation fans stopped, the interior doors all moved to the open position, seat booking indicators turned off, etc. After a pause that was probably a minute or two, everything came back to life and we resumed the journey.
Basically, they rebooted the train. It was interesting to realise that such a large, complex, fast-moving machine could be reset like that, with several hundred people embedded inside it.
On trains with the original version of the UK's TPWS protection system you have to reboot them if somebody mistakenly leaves them on top of the TPWS transmitter when shutting down, for example stopping slightly short of the usual position in a terminus station. In this state when the train boots up TPWS is in an error condition, so you have to override that, move it a few metres, then reboot it to get rid of the error. Newer models recover without being rebooted.
Failing to reboot caused at least one accident because the TPWS was just left overridden and so wasn't protecting anything.
Since a series of UK rail disasters involving trains whose safety systems were switched off, running in passenger service without the safeties is prohibited.
So that depends on if you mean a power source temporarily goes down or if a box experiences a power blip. So in the first case each box usually has 3 different sources of power, something like the generators on the engines, then backup on the 24 volt bus, then finally a backup battery.
In the second scenario it depends on how long the blip is. Usually there are holdup requirements that a box will not reset if power is lost for x amount of time. If power is lost for longer it will save state and leave itself in a state that will come back up quickly.
What I am thinking is happening in this case is that some value is getting messed up in NVM and it must be reset by the maintenance crew, so the "reset" they are talking about isn't just rebooting after the error occurs. But if you reboot before the error happens the NVM doesn't get messed up and the value is just updated with the correct number.
So you think the one hour quoted by the article isn't the time it takes for the system to boot but rather the time it takes for the maintenance to access the device, reset it, and close everything?
That seems much more reasonable to me, mostly because planes are rebooted for maintenance every so often between flights, and customers (airlines) would not be okay with a 1-hour reboot time.
There are also usually maintenance tests that can be ran to reset a box on the plane. So the technician would have to put the plane in maintenance mode and run through the test to get the box reset or something.
Yes, they might not be okay with that, but airlines apparently can hold passengers for 8 hours because of delays, to the point of toilets overflowing. I think if you run an airline, you are legally required to go out of your way to provide terrible service.
Hard to say what the actual procedure is. Might need a cold dark airplane for an hour for the GCU to clear memory or you might have to open panels and go back there and physically disconnect power to GCUs to get it done. It seems to me opening panels would probably take more than an hour.
I’m currently working on an aircraft that has IMUs rather than AHRS/RNAV/GPS. I cannot recall ever seeing them take more than 5 minutes on first start up in the morning. Manuals specify 2.5-10 minutes. This is on a 1980s model. I’m sure newer designs are quicker.
> A simple guess suggests the problem is a signed 32-bit overflow, as 2^31 is the number of seconds in 248 days multiplied by 100, i.e. a counter in hundredths of a second.
The question that comes to my (perverted) mind is what the counter is for, or more strictly, why it needs an accuracy of 1/100th of a second.
If it is related to a "periodical" action (a time interval) it makes little sense to have that degree of precision, and on the other hand, if the precision is needed, why not calculate it from a base point?
I mean, if it is related to "boot" time, I presume that no one would ever use a counter, rather a log of some kind with a timestamp, and calculate (properly) the time elapsed...
If the clock is being used for real-time dynamic model propagation, the 100 Hz rate is enough to capture the physical phenomena in the system, such as control system response. 10 Hz is a bit too slow, and 1 kHz is overkill.
During my graduate work in aerospace engineering, 100 Hz was pretty much standard for sensor filtering and simulation work.
The systems are controlling generators, which produce AC current. It's not unreasonable to think the controllers are monitoring the AC waveform, which has a frequency of 50-60Hz. Ergo, if the controller is checking the generator is producing the expected waveform (because a lot of sensitive things depend on the waveform being accurate), it makes sense that it could be using a 100Hz polling interval, which is where the counter comes into play.
It's possible that it's using a system-wide timer for convenience since the embedded hardware is very limited compared to a full computer, where spinning off a separate timer is trivial. When the requirement for different timers hit the program designer's desk, they probably took the most precise use case and designed one timer around it, and neglected to take into account the overflow.
Yep. Purely anecdotal data, but I was once involved in the construction of a (small) hydroelectric plant (I was responsible for the construction/building, not for the electrical parts; our construction company had a couple of electrical partners for that).
For some reasons the tender was for the building and plant, but the client made a separate tender for the actual turbines.
Though I tried (vainly) to convince the manufacturer of the turbines to use a controller from the same firm we had as partner, they decided to use "their" way.
In theory not a problem as the turbines and their controllers would "talk" to the external control system (SCADA, etc.) through a "standard" RS422 connection.
After a couple of days of testing, it was clear that the SCADA was periodically going "berserk".
Though not at all my field/responsibility, I was keen to have the problem solved, and after 3 (three) days of the engineers from the turbine manufacturer and from the control system manufacturer finding nothing (and BTW largely failing to communicate with each other), I started looking with them, one by one, at the signals that were exchanged on the connection.
It was evident that there was some form of overflow, the on-turbine controller sent way too many data and "clogged" the receiving part.
There was a sensor for rotation speed and another one for oil pressure that were polled (and sent data) at a rate of 1000 per second.
The turbines were of the Pelton type, with an external (large) flywheel, and it took (literally) tens of seconds from when you stopped the water flow for the wheel to slow down, and several more tens of seconds for it to stop; vice versa, once it reached the intended speed, it showed practically no variation.
So you had these two sensors polled 1,000 times a second to measure something that would change - maybe - only over intervals of several seconds, and that could just as well be corrected with a delay of tens of seconds.
After reducing the polling rate from 1,000 to 10 times a second (still way overkill for the proper functioning of the system), all the errors went away.
It turned out that these sensors were a new type/make/model, the first capable of kHz polling, and the turbine engineers had set them to the max only because they could.
Aircraft electrical systems operate at 400 Hz to reduce the size and weight of transformers, motors, and power supplies. It's less efficient over distance than 60 Hz, but planes aren't that long anyway.
Anecdotal, like the sibling comment: an interactive inverter product I developed sampled the 60 Hz voltage and current at 19.2 kHz to more accurately determine RMS voltage and current in the face of some really nasty distortion. It was probably overkill, but another guy on the team, who worked on revenue-grade watt-hour meters, thought it was just fine.
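For reference, the basic idea (a rough sketch, not the actual product's firmware): RMS is just the square root of the mean of the squared samples taken over a whole number of cycles, and at 19.2 kHz one 60 Hz cycle is exactly 320 samples, so distortion within the cycle is measured rather than assumed away.

    #include <math.h>

    /* RMS of one 60 Hz cycle sampled at 19.2 kHz (19200 / 60 = 320 samples). */
    #define SAMPLES_PER_CYCLE 320

    double rms(const double samples[SAMPLES_PER_CYCLE]) {
        double sum_sq = 0.0;
        for (int i = 0; i < SAMPLES_PER_CYCLE; i++)
            sum_sq += samples[i] * samples[i];
        return sqrt(sum_sq / SAMPLES_PER_CYCLE);
    }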
I'd assume the software is not event-driven but cyclic. In each cycle you gather inputs, compute, produce outputs. You need a cycle counter because event validity and response time limits are likely defined in terms of cycles (e.g. peripheral sensor A has to report at least once every three cycles or an alarm sounds). Bad things happen when the cycle counter overflows.
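A minimal sketch of that structure (all names, rates, and limits here are invented for illustration, not taken from any real system):

    #include <stdint.h>
    #include <stdbool.h>

    #define SENSOR_A_MAX_STALE_CYCLES 3u    /* hypothetical requirement */

    extern bool sensor_a_fresh_data(void);  /* hypothetical I/O and control hooks */
    extern void raise_alarm(void);
    extern void compute_control_laws(void);
    extern void write_outputs(void);

    static uint32_t cycle_count;            /* wraps after ~497 days at 100 Hz */
    static uint32_t sensor_a_last_cycle;

    void frame(void)                        /* called once per fixed-rate cycle */
    {
        cycle_count++;

        if (sensor_a_fresh_data())
            sensor_a_last_cycle = cycle_count;

        /* Staleness limit expressed in cycles, as the requirement is written.
         * Unsigned subtraction keeps this check correct across wraparound;
         * signed counters or absolute comparisons are where it goes wrong. */
        if (cycle_count - sensor_a_last_cycle > SENSOR_A_MAX_STALE_CYCLES)
            raise_alarm();

        compute_control_laws();
        write_outputs();
    }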
At the very least, the HUMS and CV/FDR systems need sub-second precision for maintenance and crash investigation reasons.
As for why they don't calculate it in a particular way, probably just an amateur mistake.
Lots of aerospace-related applications are moving towards the methodology where you cook up a model in matlab and then hit the AutoCode button. Things like this are the reason why.
The thing is, do they just count all the lines of code in all the libraries they use, as well as the OS (all of it compiled from source, I would guess)? I imagine they do, since that's the only reasonable way that 100 million lines of code could be reached.
So on a project like this, you do not use many standard libraries. It is usually explicitly stated that you can't use anything from the std c library and such.
Also source lines of code are usually calculated as lines of code that need to be verified. So if a line of code is in the software, it needs to have a test that covers that line of code.
Then that really does raise some questions as to what 100 million lines are for. What operations exactly does an airplane need to execute that require that much code?
So this lines-of-code estimate is spread across all the systems on the aircraft. One reason for all of the lines is that everything is done in C.
The more exact reason, really, is that every command on an aircraft is monitored. You have complex control laws to control a surface, and then probably 4 or 5 different monitors that make sure the surface isn't oscillating, the actuators are not fighting each other, and all kinds of other things. It comes from the mentality that you not only design something not to go wrong, you also monitor for failures when it does and handle them accordingly.
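As a toy illustration of what one of those monitors might look like (names, window, and threshold entirely made up, nothing like actual flight code):

    #include <stdbool.h>

    /* Toy oscillation monitor: flag the surface if the commanded position
     * reverses direction too many times within a window of recent frames. */
    #define WINDOW_FRAMES 50    /* hypothetical: 0.5 s of history at 100 Hz */
    #define MAX_REVERSALS 6     /* hypothetical threshold */

    bool surface_oscillating(const double cmd[WINDOW_FRAMES]) {
        int reversals = 0;
        int last_sign = 0;
        for (int i = 1; i < WINDOW_FRAMES; i++) {
            double delta = cmd[i] - cmd[i - 1];
            int sign = (delta > 0.0) - (delta < 0.0);
            if (sign != 0 && last_sign != 0 && sign != last_sign)
                reversals++;
            if (sign != 0)
                last_sign = sign;
        }
        return reversals > MAX_REVERSALS;
    }

And that is just one monitor on one surface; multiply by every surface, actuator, and failure mode and the line count climbs quickly.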
Finally you have maintenance, where a technician can run a test to pinpoint any issue on the aircraft. So there have to be tests and monitors built into the software to determine the root cause of an issue down to a single box, actuator, sensor, or wiring problem.
Modern avionics is insanely complex. This isn't something you throw 1-year, self-taught JS devs at.
When I worked on next-gen turbofans, we had several dozen engineers just managing the requirements for the software, let alone writing the software itself.
You have the main avionics software managing the flight itself, the engine software managing fuel consumption, and the 3x safety factor required to be certified.
"Everything" isn't telling much. I'd love to, out of curiosity, browse through a codebase of an avionics system. Is anything like this publicly available?
Most likely there is not much that you can look at, but this overview of the Space Shuttle can give you a lot of insight into how software and hardware are designed for flight-critical applications. I believe this was posted here a while ago.
One takeaway is that software and hardware are nearly impossible to separate in a flight-critical application.
Also one note, the space shuttle was incredibly complex for its day. So I would say the complexity of the space shuttle would give you a decent idea of what commercial aviation does now.
> Also one note, the space shuttle was incredibly complex for its day. So I would say the complexity of the space shuttle would give you a decent idea of what commercial aviation does now.
I definitely missed this if it was posted here. Thanks!
I was kind of hoping that some legacy avionics system somewhere would have sources available on-line, but I guess companies writing this code don't open-source projects just because they get old.
In addition to every component that has some kind of data bus you also have the fly by wire system which includes all of Airbus’ ‘laws’ for the different flight envelopes. You also have modern navigation gear and the displays and interfaces that come with that. Just look at the cockpit and realize everything in there is software and it becomes clearer.
Think of it as a thousand independent, sometimes redundant, embedded systems. 10,000 lines of code is something you can write alone in 2 years for a medium-complexity project. It's not a single monolithic database application.
I don't work in this field, but my understanding is that code for critical systems like avionics isn't permitted to (among other things) allocate/free memory dynamically. That probably nixes a lot of functions in the standard library.
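Right - a typical consequence (a generic sketch of the static-allocation style, not any particular program's code) is that everything is sized and allocated up front, and "out of room" is handled explicitly rather than by growing memory at runtime:

    #include <stddef.h>
    #include <stdbool.h>

    #define MAX_MSGS 64   /* hypothetical worst-case message count, fixed at design time */

    typedef struct { unsigned char payload[8]; size_t len; } msg_t;

    static msg_t  msg_pool[MAX_MSGS];   /* statically allocated, never realloc'd */
    static size_t msg_count;

    bool msg_enqueue(const msg_t *m) {
        if (msg_count >= MAX_MSGS)
            return false;               /* overflow is an explicit, testable path */
        msg_pool[msg_count++] = *m;
        return true;
    }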
It would bring in whatever crap the standard libraries from the compiler or the system happen to have.
And those libraries aren't written with the mindset of "when this code runs on an A380".
Even most non-certified projects will grow their own standard library once they get past a certain size, and this is merely for convenience, consistency, and features.
Well, here you get into a circular argument: if you did include Linux, you would need to verify every line of code being used. Which is why you would never use Linux and why it wouldn't be included.
*This is in reference to DAL A software. DAL D might be able to run Linux.
But if you want to write a sensational "100 million lines of code" then you count the code that's running the in flight entertainment system (and others) which doesn't have the same safety constraints.
The code would most likely still be counted, but that software would probably be either DAL D or DAL E.
Here is the wiki for the standard that software has to meet for commercial aircraft. It explains the DAL (Design Assurance Level) levels and what goes into them.
https://en.wikipedia.org/wiki/DO-178C
On a recent family trip a few miles from our destination, all my vehicle's dash controls went out, then reappeared with a charging system error indicator.
Everything seemed fine - I watched the battery gauge and hoped I'd make it. When I got to the destination, I stopped and restarted the engine; everything looked fine, and the charging system indicator went back to normal.
I noticed afterward that the "Engine Hours", which had been getting close to 10,000, was now in single digits. No other internal counters were reset.
I wondered if it was an overflow condition, but it appears more mundane - many vehicle owners report seemingly random resets. The surprising thing seems to be that it hadn't reset before getting close to 10,000 hours!
30 odometer miles per engine hour seems about average (to one significant digit - varies with proportion of highway vs city vs idle hours). That would suggest all your 20+ year old cars are under about 120K miles? Or they have a very high mix of freeway miles. Either way, they're not likely getting the average 10-15k miles per year.
Conversely, that rough evaluation is making me question whether my recollection of nearing 10K hours was correct - the vehicle is under 200K miles, which would suggest <20mph average.
My office building has a destination-dispatch elevator system: you select your floor on a panel and it directs you to a preassigned elevator. Every month or so, the system begins to slow down and lag - the user enters a floor, the system pauses (during the morning rush hour this causes delays), flashes the selected elevator, and returns to the entry screen. The longer the interval since the last reboot, the longer the pause. Presumably some resource is leaking or some buffer is filling up that the software never clears, so a manual reboot is required to make it work normally again.
Do civilian aircraft not have requirements for maximum reboot times?
Naval vessels usually have sub-minute times from power restored to shooting back, and they're legacy code and legacy hardware nightmares. A 787 doesn't have the layers upon layers of legacy code that most naval applications do. While the stakes are definitely a little lower, you'd think they would be able to do a full reboot in less than the time it takes to fall out of the sky, by a comfortable margin.
You would certainly hope; and yet, long uptimes unforeseen by designers can have devastating consequences:
> A government investigation revealed that the failed intercept at Dhahran had been caused by a software error in the system's handling of timestamps. The Patriot missile battery at Dhahran had been in operation for 100 hours, by which time the system's internal clock had drifted by one-third of a second. Due to the missile's speed this was equivalent to a miss distance of 600 meters...the Scud impacted on a makeshift barracks in an Al Khobar warehouse, killing 28 soldiers.
> As a stopgap measure, the Israelis had recommended rebooting the system's computers regularly.
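The usual explanation of the mechanism (per Skeel's note on the GAO report; details from memory, so treat the exact figures as approximate): time was counted in tenths of a second and converted to seconds by multiplying with 0.1 chopped to a 24-bit fixed-point value, and since 0.1 has no exact binary representation, the tiny per-tick error compounds with uptime:

    #include <stdio.h>

    int main(void) {
        /* 0.1 chopped to 23 fractional bits, the commonly cited 24-bit register. */
        double chopped  = (double)(long)(0.1 * 8388608.0) / 8388608.0;
        double err_tick = 0.1 - chopped;              /* ~9.5e-8 s per tenth */
        double ticks    = 100.0 * 3600.0 * 10.0;      /* tenths of a second in 100 h */
        double drift    = err_tick * ticks;           /* ~0.34 s */
        /* ~1700 m/s is an approximate closing speed for the Scud. */
        printf("drift: %.2f s, miss: %.0f m\n", drift, drift * 1700.0);
        return 0;
    }

which lines up with the one-third of a second and the roughly 600 meters in the quote.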
There are several types of routine checks - I think Boeing calls them A/B/C/D (with subtypes, depending); Airbus has a different nomenclature, but the idea is the same.
Forgive me if it's a stupid idea - but why not shut down these plane computers every night when they aren't in service and boot them again in the morning before the first flight? Let all the counters reset, let all the memory creep and leaks go away, start fresh...
I remember an issue with the Oracle DB client libraries for Linux (RHEL 2 or 3 I think) where if the system had something like a 160 day uptime any software using the library to connect to the DB would just hang. I saw the effects with strace where it would just get stuck in a loop doing some time related syscall. We managed to get an unofficial patch from Oracle that I could wedge into our servers, since reboots were under strict change control. I remember this issue came up when I was being interviewed for a position at the Guardian a couple of months later and they seemed amused by what I had to do to fix it since they also encountered the same issue but fixed it by making sure they reboot more than twice a year.
Remember the Ariane 5 rocket - also a piece of verified code running on a verified OS, on a verified MCU, made in verified silicon, but the obvious flaw (an unprotected conversion of a 64-bit float to a 16-bit integer, in code reused from Ariane 4) was still missed.
The problem with formal verification is that for a complex system the number of constraints goes through the roof, and it is no longer possible for a human to judge whether any one of them still makes sense in context.
It might very well be that something like "register A should never overflow when input B is below value C" was put in the validation rules, but nobody gave any importance to understanding what it was for. Or worse, somebody was simply too lazy to change it, fearing it would upset 20 other even more obscure validation constraints.
I wonder why the GCU needs a time counter in the first place, and how it is used such that the whole controller shuts down on overflow. I bet 16 bits would be enough if handled properly.
Software on airplanes is usually periodic, not event-driven. Every 10 ms or some other fixed interval it computes all of the control laws and monitors and so on. So that 32-bit counter is most likely the period counter or a maintenance-interval counter.
I imagine the software needs to remember some state variables with timestamps, to ramp power up at a certain rate or to implement PID control of the power level. Crucially, it never needs an absolute timestamp; it only needs to know how old the data is.
The simplest(!) approach is to use an unsigned integer as the timer/counter and let it overflow all the time. Age is computed by unsigned subtraction, ignoring wraparound. With a 16-bit timer and 10 ms resolution, ages up to almost 11 minutes can be represented. Why would a turbo-generator even need to remember what it did 11 minutes ago?
In other words, the mistake is probably not the narrow counter, it's the signed arithmetic and the subsequent failure when it computes a negative age. Someone else said "they probably stuck a crappy old 8-bit micro in there." That would surprise me - people programming 8-bit micros tend to know how to use unsigned arithmetic where it makes sense.
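A sketch of the wraparound-tolerant idiom being described (the textbook pattern, not the actual GCU code):

    #include <stdint.h>

    /* Free-running 16-bit tick counter, incremented every 10 ms by a timer;
     * it wraps roughly every 11 minutes, and that's fine. */
    static volatile uint16_t tick_10ms;

    /* Age of a timestamped sample. Unsigned subtraction is defined modulo 2^16,
     * so the result stays correct across wraparound as long as the true age is
     * under about 655 s. Do this with signed ints and you get negative ages. */
    uint16_t age_ticks(uint16_t stamped_at)
    {
        return (uint16_t)(tick_10ms - stamped_at);
    }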
Naive question: Would a reasonable way to avoid this scenario be, to increment a secondary counter when the primary counter reaches max-1, then reset the primary counter to zero?
1: Size the type used to store the value so that it cannot overflow even in corner cases such as a plane being on for several months, e.g. going from a 32-bit int to a 64-bit one.
2: Have a flag that you set when the counter overflows, so you can calculate the actual time. Given the length of time that planes are on, this would give you enough headroom.
3: Lessen the precision of the counter (probably not an option, since the precision is usually a requirement).
What if the secondary counter overflows? And should all software now check two counters when they want to know the actual time?
Just use a 64 bit integer, at least for this purpose (since this apparently needs only 10ms precision): that is twice as large as a normal integer, so basically the same as using two counters, except you don't need extra logic in your code.
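For scale (just the arithmetic): at 100 ticks per second, a 64-bit counter has so much headroom that overflow stops being a design concern at all.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const double tick_hz = 100.0;                 /* 10 ms resolution */
        double years = (double)UINT64_MAX / tick_hz / 86400.0 / 365.25;
        printf("%.2e years to overflow\n", years);    /* ~5.8 billion years */
        return 0;
    }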
To be honest this isn't that big a deal. Why would anyone leave an airplane running continuously for days?
Just a quick calculation: if it takes 248 days = 248 * 24 * 3600 seconds for the counter to go from 1 to 2^32, then the sampling period is about 5 ms, i.e. measurements at roughly 200 Hz (with a signed counter overflowing at 2^31, it would be 10 ms and 100 Hz).
Not related to anything but it's nice to know I guess.
This is quite old news though. I remember it was mentioned in last year's CS50 lecture about integers.
> Your options are to increase the number of bits used, which puts off the overflow, or you could work with infinite precision arithmetic, which would slowly use up the available memory and finally bring the system down.
Pedantically true for an infinite universe, but merely moving the counter to 64 bits would give 2^32 * 248 days; even if we only use 30 of the extra bits, we're still on the order of a billion years for 4 extra bytes.
> Your options are to increase the number of bits used, which puts off the overflow,
Just adding one more byte multiplies the range by 256, putting the overflow off by about 175 years - should be long enough.
The question that comes to my mind is: now that they know that Boeing is sloppy are they going to thoroughly audit the code to see if any other overflows are lurking in it? And will they do the same to Airbus? And if not then why not?
> We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane
So the 787 doesn't have a mechanical backup for the flight controls? So much for the Boeing fanboys talking about their mechanical yokes. Though realistically that says very little about actual reliability and safety; just look at the 737 rudder issues.
When you say "mechanical", in a 787 or any large airliner it doesn't mean straightforwardly what it might mean in a small plane.
In any large airliner the pilot's controls are indirectly linked to the flight surfaces by either hydraulic or electrical systems. There's no way a person could produce the forces needed to handle a plane that large adequately.
So I suppose it depends on the specific system, but if electrical power were to be totally lost, perhaps they would be dead in the water. (air)
> In any large airliner the pilot's controls are indirectly linked to the flight surfaces by either hydraulic or electrical systems.
I understand the 777 and 787 are special cases, but my impression of the rest of their line was that the flight controls are directly linked by control cables and are assisted by mechanical-hydraulic servos, a bit like hydraulic power steering on cars. So with a total loss of electrical systems they would still have normal primary flight controls.
Of course hydraulic servos can fail (for example the aforementioned 737), and even triple hydraulic systems can be taken out by something like the uncontained engine failure of United Airlines Flight 232.
I tend to think that electrical and electronic systems can ultimately be more reliable and maintainable, but I worry that the safety culture is much worse among the designers of such systems than their mechanical equivalents (no wonder, we've been tinkering with the latter for thousands of years and the former only a few decades).
A company I worked for licensed a managed language VM, which was being used to operate an airport people mover, which was crashing after just shy of 50 days. (Crashing software-wise, not physically - it would just awkwardly stop.) It turned out to be an integer overflow. If this was a 32-bit register counting milliseconds, that would be about right: 2^32 ms is roughly 49.7 days.
"You may be used to rebooting a server every so often to ensure that it doesn't crash because of some resource problem"
Is this something that people have to do? I maintain a few Linux servers, and I never need to reboot them. They only go down when the hosting company needs to do some kind of hardware maintenance.
That was once the reason put on a support ticket for a Windows 2000 Server which I reported as running slowly - something like: "Server had been running for over 30 days - needed reboot". I suspect this was down to the bespoke app, and not Win2K.
While this is pretty sketchy, I DO like the design principle of ephemeral subsystems. It just makes sense to assume a service can go down at any time for any reason, from its conception, and bake that into how it's built. You could argue periodically restarting things is just a natural extension of that.
In this case, does it really matter? I'd be very surprised if it's running nonstop for more than a couple days, let alone almost a year. Or is this some system that normally doesn't power down when the rest of the plane does?
Planes like the 787 are probably very rarely completely powered down. At the gate, they're on ground power and there are probably some power buses left energized even if the plane is going to sit unused until the next day. These generator control units (since the article says they take an hour to boot up...running through a ton of tests, I guess?) are probably energized even when most other aircraft systems are turned off. If the aircraft is being towed to a hangar or away from a gate for parking, they probably want its lights on while in motion on the field so will stay powered from battery or fire up the APU (auxiliary power unit) before disconnecting from ground power to tow it around. When it makes it to the hangar, perhaps it gets plugged back in to ground power. So it's conceivable that this particular component could remain powered up continuously for a looong time.
That said: as others have stated, some forms of maintenance would power down the plane completely, and these likely happen more often than this overflow bug occurs. If I remember from the articles when this bug was first discovered, it was discovered through simulation or specific testing and not because an operator encountered it, which would imply that yes, operators do "reboot" the whole airplane often enough where this doesn't naturally happen.
This kind of plane isn't "not in use" very often. In fact, any time it's not flying people around is costing the airline loads, and they try to minimise that.
Plus, when it's not flying passengers, it's being fueled, cleaned, maintained, and inspected. All of those are easier if there's power and systems online.
Unfortunately, evidence [1] suggests that the rate of bug detection by reviewers does not scale linearly with the number of reviewers; adding more than 4 reviewers uncovers bugs at a much lower rate [2].
However I absolutely agree that better / more extensive / diligent code reviews are part of the solution to improve code quality and eliminate these kinds of defects. It's tough to create the right kind of incentive structures for reviewers internally at a company; maybe the future will have specialist firms that provide review-as-a-service for a fee, or perhaps firms could trade review-hours (all under strict NDAs I'd imagine).
Well usually the difference between open source and proprietary is that the proprietary vendors are slow to fix security issues because it doesn't increase their profit. In some cases they even try to hide the fact that the software is insecure or even sue the person who reported the vulnerability. Meanwhile most OSS software gets fixed as soon as the vulnerability is found.
In the case of aircraft, though, a company could find a bug and do their best to fix it quickly. But if the plane is already flying, any changes to the software require a re-certification, which can take months.
It should be fairly easy for anyone to figure out that planes generally aren't used around-the-clock. For example, I routinely take a flight which arrives at 11pm, and I know that nothing is flying out of that airport again until the morning.
Having seen a similar bug on humble Netscalers, and found a much shorter reboot quite difficult to schedule, I do believe the operational challenge here.
The difference being that setting up NS HA, then failing over to the secondary and rebooting in turns is a lot easier than doing the equivalent on a plane. So even considering the operational requirements are the same (no time to reboot but if you don't, people die), you still have a lot more flexibility.
The problem with an airplane is that by the time you find such an issue it becomes prohibitively expensive to fix if an "architecture" change is required. This is why workarounds are provided instead.
I know it's a joke but the mental picture of an airplane with its wings removed showing up at a bus stop, dropping in for a broken bus, is very amusing. Would probably be national news and good PR if they actually pull this off, might be doable with a short enough plane. One can dream...
United doesn't, as far as I can tell, do maintenance until something breaks.
In order to be allowed to fly in the US, and other countries, you better believe United does maintenance.
This is one of my favorite reddit comments ever, and goes over aircraft maintenance from a car analogy, explaining what you'd have to do to a car in order to meet the mandatory maintenance standards imposed on airlines as a matter of law:
JFYI, I often use as a reference the actual VW Owner's manual for the Beetle, you can find a copy of the 1952 version here, relevant pages are 7 and 8 "Operating Instructions":