I read the article from tip to tail, and as a professional engine mechanic it reminds me of Ford's "MyFord Touch" platform.
The system was rolled out across newer fleets without much testing, and in some models it controls practically every feature in the vehicle, from climate control to the OnStar SOS. There was a recall for Platinum-model F-150 trucks because the system could glitch after so many hours of continuous operation and trigger a fault in the four-wheel brake force distribution system. This in turn either completely arrested the brakes or caused them to quietly apply themselves at around 15%... you couldn't undo this unless you pulled the fuse. Even worse, collision detection would be disabled because the system thought you were aware of a potential crash and were already braking.
If a certain Bluetooth phone were paired, it could cause the trailer load and position sensor to erroneously predict grade downshifts. The result was that incoming calls on the highway would either wreck the rear differential or put the truck on the side of the road.
Chrysler had a similar issue [0] this last year: their Pacifica minivan had a software bug that would cause the engine management computer to crash and the vehicle to completely lose power while operating (including at highway speed). In effect you'd have complete loss of electrical power, and even power steering, brake assist, and the hazard warning lights(!) would be lost.
Took them six or more months to fix it as they weren't able to reproduce the software glitch.
On the Stanford Solar Car Project -- a mostly-undergrad student group -- we always built our cars with two physically separate CAN buses, one for nonessential features and one for the motor controller.
That way, we limited our surface area for catastrophic bugs: the only things connected to the safety-critical CAN bus were the motor controller itself and the throttle board.
This for a one-off experimental vehicle.
CAN is a very simple, unauthenticated bus. Any device can send a message that every other device on the bus will receive. Messages are typically just a few bytes long, binary encoded.
The idea of attaching an internet-connected infotainment computer to the same CAN bus as the brakes is absurd. Doing so on a production car, even more so.
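To make the "unauthenticated" part concrete, here's roughly what talking to a CAN bus looks like from a Linux-based head unit using SocketCAN. The interface name, CAN ID and payload below are made up for illustration; the point is how little it takes once you have any foothold on the bus:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/can.h>
    #include <linux/can/raw.h>

    int main(void)
    {
        /* Open a raw CAN socket and bind it to the bus interface ("can0" here). */
        int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
        if (s < 0) { perror("socket"); return 1; }

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "can0", sizeof(ifr.ifr_name) - 1);
        if (ioctl(s, SIOCGIFINDEX, &ifr) < 0) { perror("ioctl"); return 1; }

        struct sockaddr_can addr;
        memset(&addr, 0, sizeof(addr));
        addr.can_family  = AF_CAN;
        addr.can_ifindex = ifr.ifr_ifindex;
        if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind"); return 1;
        }

        /* Any node can claim any ID: nothing on the wire says who sent this. */
        struct can_frame frame;
        memset(&frame, 0, sizeof(frame));
        frame.can_id  = 0x123;   /* made-up arbitration ID */
        frame.can_dlc = 2;
        frame.data[0] = 0xDE;
        frame.data[1] = 0xAD;

        if (write(s, &frame, sizeof(frame)) != sizeof(frame)) {
            perror("write"); return 1;
        }
        close(s);
        return 0;
    }

Nothing in the frame identifies the sender, which is why segmentation and gateways matter so much.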
--
The Ford story is disappointing but not totally new.
> CAN is a very simple, unauthenticated bus. Any device can send a message that every other device on the bus will receive.
True, but in practice messages that are deemed important are secured at higher OSI levels, and the identity of important bus participants is cryptographically checked during vehicle startup.
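For readers unfamiliar with how that's typically done: schemes along the lines of AUTOSAR SecOC append a freshness counter and a truncated MAC to the signal payload, and the receiver recomputes it before acting on the value. A rough sketch, with invented field sizes and a toy stand-in for the real crypto (not any particular vendor's implementation):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy stand-in for AES-128 CMAC; a real ECU would call its crypto
     * library or HSM here. NOT cryptographically sound. */
    static void toy_mac(const uint8_t key[16], const uint8_t *msg, size_t len,
                        uint8_t mac[16])
    {
        memset(mac, 0, 16);
        for (size_t i = 0; i < len; i++)
            mac[i % 16] ^= (uint8_t)(msg[i] + 31u * key[i % 16]);
    }

    /* Sketch of a SecOC-style secured payload: 4 data bytes, the low byte of
     * a freshness counter, and a 3-byte truncated MAC. Widths are invented. */
    struct secured_pdu {
        uint8_t data[4];
        uint8_t freshness;
        uint8_t mac[3];
    };

    static void secure_send(const uint8_t key[16], uint32_t counter,
                            const uint8_t data[4], struct secured_pdu *out)
    {
        uint8_t buf[8], full_mac[16];

        memcpy(out->data, data, 4);
        out->freshness = (uint8_t)counter;          /* truncated on the wire */

        memcpy(buf, data, 4);
        memcpy(buf + 4, &counter, 4);               /* MAC covers the full counter */
        toy_mac(key, buf, sizeof(buf), full_mac);
        memcpy(out->mac, full_mac, 3);              /* only 24 bits transmitted */
    }

    int main(void)
    {
        const uint8_t key[16] = { 0 };              /* shared key, provisioned offline */
        const uint8_t wheel_speed[4] = { 0x01, 0x90, 0x00, 0x00 };
        struct secured_pdu pdu;

        secure_send(key, 42u, wheel_speed, &pdu);
        printf("freshness=%u mac=%02x%02x%02x\n", (unsigned)pdu.freshness,
               pdu.mac[0], pdu.mac[1], pdu.mac[2]);
        return 0;
    }

The receiver keeps its own copy of the counter, rebuilds the MAC, and drops the frame if it doesn't match or the freshness value has already been seen.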
> The idea of attaching an internet-connected infotainment computer to the same CAN bus as the brakes is absurd. Doing so on a production car, even more so.
Which is why all vehicle architectures I'm familiar with have a bunch of CAN (and other) buses.
Maybe it used to be different, and I certainly know only a few bus architectures, but they make an honest attempt at securing the important bits at least.
Industry engineers may not get everything right, but they're not that stupid. Cut them some slack.
IMO there should be separate infotainment and hi-speed system CAN buses. It's lunacy to place engine control on the same bus as something with internet access that doesn't get upgraded, like, ever. I don't care how well it's tested.
Agreed – FYI: Tesla Model S (and likely X and 3) vehicles use a locked-down gateway (offering specific API actions, which result in CAN communication, etc) between their Ethernet network and CAN [1]
That's how it is on my car, but at the same time my car basically has no fancy electronics.
I understand software engineers pulling stunts like this, because there isn't actually a regulated engineering license, but I'd honestly have expected automotive engineers to be vehemently opposed to putting the stereo on the same system as the brakes.
It's not the software engineers that are the problem; it's the automotive design managers who decide that there will be one computer and one bus.
The DOT should have standards for this stuff. The infotainment system shouldn't have control authority over, or share critical buses with, the engine and braking/traction control systems.
I wonder if some of this is to do with weight reduction. Having separate buses must add a lot of extra equipment and wiring.
This means more weight and more power required to operate the electrics and in turn means lower MPG, higher fuel consumption and a more expensive car to design and manufacture.
The net outcome is a less competitive car in a competitive market.
I've worked with software engineers working on the CAN bus, and I had the same questions as you. According to them it has to do with the amount of cabling needed as well as electrical interference. A car has a lot of cabling going all over (a couple of hundred meters in total was the number I got), and if you can put as much of it as possible on the same bus(es), you save yourself a lot of problems.
So I suppose weight could be a factor, but space seems to be the big one.
I mean "a lot" in computer terms is like 4 lbs of equipment but since the brake pads on a truck weigh in at more than that I assume its not weight related.
Most modules have some feature or other that needs to communicate with infotainment or another module that needs to communicate with infotainment. Error reporting is the big one, but there's lots of little ones too. The brake module, for example, may be involved in tire pressure sensing, and also takes configuration for traction control and reports activation.
It's hard to find an appropriate segmentation for an air gap without sacrificing features, and most customers won't trade features for "security".
You are right of course. But still. Once you can talk over the CAN bus you can spoof pretty much anything, if I recall correctly. You could at least have some intermediary module on both buses that acts as a proxy. Then if someone hacks your radio they can't pretend to be the ECM.
The infotainment system should only be allowed to read values from the modules related to the actual driving. Configuring things like traction control will just have to be done via a system separate from the infotainment system.
I understand that it's a little wasteful not to use that big touch screen in the center console for configuring the driving experience, but there's no safe way to do so.
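A sketch of what that proxy/read-only idea could look like on a bridge ECU, with made-up frame IDs (not taken from any real vehicle): only a whitelist of frames is copied from the critical bus to the infotainment bus, and nothing ever flows the other way.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Minimal CAN frame representation for the sketch. */
    struct frame { uint32_t id; uint8_t len; uint8_t data[8]; };

    /* Frame IDs the infotainment side may *read*; the IDs are made up. */
    static const uint32_t readable_ids[] = { 0x1A0 /* wheel speed */,
                                             0x2B0 /* coolant temp */,
                                             0x3C4 /* tire pressure */ };

    static bool id_is_readable(uint32_t id)
    {
        for (size_t i = 0; i < sizeof(readable_ids) / sizeof(readable_ids[0]); i++)
            if (readable_ids[i] == id)
                return true;
        return false;
    }

    /* Called for every frame seen on the critical bus: whitelisted frames are
     * copied to the infotainment bus. Frames arriving *from* the infotainment
     * bus are simply never forwarded, so a compromised head unit can listen to
     * the whitelisted signals but cannot impersonate the ECM or brake module. */
    static void bridge_from_critical(const struct frame *in,
                                     void (*to_infotainment)(const struct frame *))
    {
        if (id_is_readable(in->id))
            to_infotainment(in);
    }

    static void fake_infotainment_sink(const struct frame *f)
    {
        printf("forwarded id 0x%03X (%u bytes)\n", (unsigned)f->id, (unsigned)f->len);
    }

    int main(void)
    {
        struct frame wheel = { .id = 0x1A0, .len = 2, .data = { 0x01, 0x90 } };
        struct frame brake = { .id = 0x220, .len = 1, .data = { 0xFF } };  /* not whitelisted */
        bridge_from_critical(&wheel, fake_infotainment_sink);
        bridge_from_critical(&brake, fake_infotainment_sink);             /* silently dropped */
        return 0;
    }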
> Modern electronic cars use a single bus (CAN bus) that connect all electrical systems. The brakes, the engine, the wipers, the stereo...
Not true.
The automotive bus architectures I know first hand (admittedly far from all of them of course) have multiple CAN buses, plus other bus technologies as well (MOST/LIN/Ethernet).
I don't think it would even be possible from a bandwidth/latency standpoint to push everything across a single CAN bus.
I had Hughes as a lecturer at university (Chalmers, Sweden) and we heavily used QuickCheck in our Haskell courses. He's a great lecturer and a brilliant mind. I believe they still teach new CS students Haskell with QuickCheck in their first course.
There should be at least two if not more CAN buses in a modern vehicle. The NHTSA should mandate that safety critical systems be on their own (redundant) network.
Yeah, that's a terrible idea. You should have at least three separate communication channels, and the most critical should be as isolated as possible from the others.
Critical (e.g. brakes, steering, engine control), important (AC, heat, dashboard), and then everything non-essential.
The people that design cars need to be software engineers but instead they're behaving like mechanical engineers.
This seems like just a mind-bogglingly bad architecture. Is there any real reason a system should be designed this way, or is this organization simply the result of poor thinking applied incrementally to inconvenient realities?
In most cars this is not the case. For example, in my 2005 Opel Astra (Saturn Astra in the US) there are three buses: a high-speed CAN used for critical systems, a low-speed single-wire CAN used for other vehicle systems, and a mid-speed CAN for the entertainment and climate control stuff. All traffic between these buses is "firewalled" by CAN bridges that should only forward relevant frames between the nets.
VW, for example, has a firewall in front of the OBD connector, only allowing traffic for the diagnostic addresses to pass.
However, I expect that in newer "cloudy" cars, they need so much data that these "firewalls" have become very permissive. Remote start via Apps, triggering signal horns from the Internet, OnStar telemetry reporting etc.
Traditionally the car makers have been completely terrible at tech security, but they are slowly improving. In fairness, they've also been hampered to some degree by regulations protecting small local garages, which state that the diagnostic stuff cannot be locked down too hard.
The trend towards monolithic hub systems is baffling. Didn't we learn those lessons while we were struggling with mainframes and timesharing? You don't let critical systems lock up, ever. You don't even directly interact with them; you send them messages, and if they don't receive them they should continue to function as the operator would expect. The collision avoidance system failing to function correctly is nowhere near as dangerous as arbitrarily applying the brakes.
> Didn't we learn those lessons while we were struggling with mainframes and timesharing?
I feel like there are really three kinds of programmers today: those who learned first-hand the aforementioned lessons, those who were taught second-hand these lessons, and those who are destined to learn these lessons first-hand.
There are very few people relatively who experienced the lessons first hand. They taught the lesson to a larger group, who will hopefully do their best to heed its warning.
But what's happened? A whole lot of people came into programming in the last couple of decades. I feel like there literally aren't enough "elders" in the field to teach the influx of new programmers. They're making the same mistakes again because for every one you tell not to, there are 10 more who are willing to try. And to some of them we're giving millions of dollars, and they're forming companies with it, which go on to sell products about which exactly 0 people have asked "Is this a good idea? Has it been tried before? What was the result? What can we learn from that? Has anything changed since then that would lead to a different result?"
That's pretty much the same story as Usenet's Eternal September: as the rate of newcomers increases, it eventually outpaces the rate of teaching. At that point, no culture can be sustained for longer than a year; a new batch will arrive, learn things from scratch, and come up with their own solutions... until the next batch, which overrides any lessons learned by the prior one.
At the very least, here, the knowledge learned is sustained by individuals, and acknowledged (so you still have “experts” and “elders”), but as a general community, you can safely assume that everything is forgotten very rapidly.
Another example I've seen is in video games, where modern designers seem to have forgotten that games existed before 2005: the same design mistakes solved in 1990 crop up again in 2018.
My preferred example being star control 2 vs mass effect: very similar games in design and spirit, two decades apart, and mass effect features many of the same mistakes, and even mistakes solved in star control. As if they had started the design from scratch.
Note that I'm talking about ME1; I never played ME2/3, since I never cared about their combat systems (and it's not very interesting anyways)
Also note that in some fashions ME1 does succeed in its goals, and outdoes SC2, but it doesn't really matter for this; whats interesting is what it failed to learn from its 20 year predecessor
tl;dr: SC2 does less work and gets much further with it, both mechanically and narratively, in a lot of ways that ME should have learned from. Instead, it seems more like they weren't even aware space operas existed outside of film.
The same mistakes include things like the landing/mining minigame, which is terrible in both games in similar fashion: it's kind of interesting for the first couple of rounds until you realize it's repetitive, mechanically simple, and somewhat poorly controlled. In ME, it's a tiny bit alleviated by being in 3D, so driving around is more fun, but the geology remains dull and generally pointless.
Both feature the nuisance of having to iterate over every world, looking at the stats, and dropping 80% of them as unlikely to be worth exploring. SC2 also acknowledges its mediocrity and generally avoids requiring it: very few quests require it (and the coordinates are usually given), and it becomes mostly unnecessary for mining further into the game as you find massive resource deposits. ME's 3D becomes a negative in this regard, as they still use it for questing throughout the game, and while SC2 has a small 2D screen to parse for the quest location, ME requires "exploring" the dead lands to find whatever marker. The lack of variety becomes more obvious due to the 3D environment (and the amount of time you're stuck in it) compared to SC2.
Notably, the existence of shit exploration in SC2, and how memorable it ends up being, should have been a strong indicator to ME not to pull the same shit
Probably much more based on personal preference, but the much more open, broad and simple narrative of SC2 is more compelling than ME's, at least partially because there's less for them to do a poor job in. ME has bigger and more varied politics between the races, and by far is the more serious game, but this also leads to a lot more stupid politics. Trying to give a bigger background to the races just leads to each race feeling not much different from humans, because, well, they basically are humans in a different skin. Same politics, same motives, different colors. SC2 "cheats" by simplifying races to really distinct, bare-bone traits, and going from there.
The Spathi are absurdist cowards, and that's all they are. Their background stories, and all their operations, derive almost entirely from their extreme fear of everything. The Zoq-Fot-Pik are a strange, friendly symbiotic trio of aliens, operating by weird language rules and odder political preferences. A lot of races lack a clear background, told only in minor hints, and this makes them feel more alien than ME could hope to accomplish. It's compounded by never requiring them to walk around and such; they're primarily differentiated by their voices and tiny gif-like animations, leaving the rest to the imagination.
ME makes the mistake that SC2 didn't: aliens are interesting by their very nature; the haze of information is what fuels it. SC2 also has the benefit of being comedic, so it can come up with less plausible stories. ie from SC2's wiki: "For another fifty thousand years, the Zoq, the Fot and the Pik relaxed in the forests, until one day one of each race was walking up a steep path looking for something to eat, when a bolt of lightning struck nearby. The bolt of energy carved a wheel-shaped chunk of granite out of a cliff. As the rock began to roll down the hill, some dry grass got caught in its hole, and since the rock was still hot the grass caught on fire. Thus the Zoq, Fot, and Pik simultaneously discovered the wheel, fire, and religion"
But regardless, it puts out the feeling of Star Trek far better than ME could ever hope to accomplish, at least partially because the game doesn't try as hard. It might be argued that ME was leaning more towards Star Wars (space drama), but even then, it fails by virtue of expunging too much information (and not being grand enough)
And there are other smaller things like SC2 granting a greater degree of freedom, again by virtue of doing less work on its weaker parts. The combat is simpler, and doesn't require it for the most part, and is generally better off for it (ME loves to pretend it has a good enough system; it does not.) SC2 still has the fault of having an annoying amount of combat for what it is, but still far less so than ME (and doesn't really require it for any narrative events). Exploration is more interesting in SC2, again because of the freedom, narrative structure, and it wasting less time on its weaker components.
And of course there are those mistakes that just come out of modern game design, but these are unsurprising:
fast travel: kills any sense of distance and exploration (in ME's case, all travel is fast-travel, so space doesn't feel at all ... distant. The citadel feels larger than space.)
quest-logs: kills any sense of autonomy in the narrative, and background-discovery, and even exploration itself
emphasis on combat missions: ME does not have a good combat system. There's absolutely no reason for it to be emphasizing it.
emphasis on the player character: it's a space opera, and it's focused on upgrading your player character & co? This is just... wrong. Personal preference, but focusing on ship upgrades is much more sensible. But then, the game barely involves space in the first place, so maybe not. It could have been a medieval fantasy and it wouldn't have been too different, in a lot of regards. (Of course, it is BioWare, so maybe it really did derive from medieval fantasy.)
dialogue wheel: adding morality to the choices, and then making it obvious to the player? It's just absurd. Narrative branching on action alone is sufficient, less work, and obviously more correct.
combat-wise there are a lot of mistakes as well, but they're uninteresting for this discussion, and have a lot of other predecessors they failed to learn from. Suffice to say, it's not a shitshow, but it's not well done. What's more interesting is that the designers failed to acknowledge its lack of quality and account for it (or maybe they did so intentionally; it's an EA title, so publishers are likely a good deal at fault for that one).
I'd say a bigger problem from my perspective is all the non-programmers who get into management and project management roles who never learned these lessons and who refuse to listen to programmers trying to teach them. They'll ignore you, and if you don't 'be a team player' you'll end up replaced by someone who will.
Because car companies have always built cars, not secure networks. Those executives have never been to DefCon; they have no idea what level of risk they face from the unsecured electronics components in their cars. They are not just unaware that putting critical systems and non-critical systems on the same bus is a bad idea, but they are also completely ignorant of the difference between real-time systems and the operating systems on their phones and computers.
They need to hire some folks with aviation backgrounds, who can explain to them why the plane does not fall out of the sky when the in-flight entertainment system chokes on a scratched DVD. Even when they get it wrong, it is still less wrong than the auto-makers.
[Edit:]
Aviation people are the only outsiders they're likely to listen to.
They're not perfect. But they are better. They at least have an awareness that security is an issue, even if the ways they handle it are... well... let's just say they're not ideal.
Even though the tech folks who frequent this site are knowledgeable, the people who build stuff don't always respect our expertise. Sometimes they don't even realize that our expertise might apply to their problems. This is how we get black-box electronic voting machines and CANBUS2 and wi-fi light bulbs or security cameras that inadvertently open a back door into your LAN.
I'd be surprised if aviation systems are much better. They're really into putting everything on the same physical network too, but employing these things called "data diodes". As if that's a real thing, practically speaking.
It absolutely could be, though. Or you could use half an ethernet port. The idea of a data diode is fine, it's the [lack of] implementation that's at fault.
I mean, those movies being played on the back of seats aren't being run through a serial port. Given the move to an "on demand" scheme, they're not one way. The data diode is clearly meaningless when there's obviously two way traffic.
I don't understand. You wouldn't store the movies on the avionics systems in the first place. That's all inside the infotainment system, on one side of the diode. The things going over the diode would be stuff like current location and tire pressure.
> "data diodes". As if that's a real thing, practically speaking.
On CAN bus, it actually can be.
I've seen CAN bus participants with the transmit pin not connected. They were physically incapable of writing to the bus (granted, this drastic solution only works in very simple cases).
Maybe we shouldn't use avionics engineers to train the auto industry on system security (the article we're commenting on is about an overflow bug in new Boeing planes that could drop them from the sky).
The other comments have effectively lambasted the manufacturers for their incompetence and foolishness in putting these on the same system, so I'll skip that bit.
The practical reason is that the big central touchscreen is a convenient place to put stuff. On an old truck, you selected 4WD by moving a big transfer case lever, which indicated you were in 4WD by its position. No computer to get in the way. Slightly newer trucks have physical buttons with lights that connect to electrical solenoids, often directly. Tractors have brake pedals for each wheel, but old trucks don't have a similar braking force distribution system - you got locking diffs or dragged all the brakes and hoped it helped. With a touchscreen and a computer in place, it's easy to make compelling sales pitches for putting complex features there instead of on expensive, complicated physical controls.
Complicated on a per-unit basis, that is - I understand that a computer has far more moving parts than any plastic-and-wires assembly, but when it works it's easier to draw buttons and graphics than to make mechanical actuators.
Whether or not the processing happens on the same CPU or communication happens on the same physical bus doesn't matter that much when you have the ability to make selections that affect the engine and drivetrain on the main display.
Let me explain. Attaching a non-essential sub-system with a large attack surface to a mission-critical network is not malicious (I hope); it is incompetence. I would accept a unidirectional serial link with a validated protocol as the sole connection between the two.
I think it is the same kind of thinking that goes into other bad ideas: flickering stop lights, red turn signals. The auto makers just don't really care or think. This is something I don't understand.
There's also a ton of academic lurkers. I've always thought this is because 1) the border between academia and industry is pretty porous in technical fields (super obvious in ML research, but I think it holds more generally), 2) academics borrow from, and contribute to, the same set of methods and technologies, and 3) most individuals have an interest in complex systems (and their hilarious, tragic faults) such that they find articles like this one interesting.
I think it's also useful to look at preceding communities like Slashdot — in that case it started out pretty tightly scoped for Linux enthusiasts, gaming, programming, and the then-nascent social internet but then the cohort of regular readers and contributors got older and they found themselves in policy, academia, and managerial business roles; as long as people remained involved in the community, the content of the site expanded to reflect the (extremely rapidly increasing) role of tech in these domains. (Obviously, Slashdot suffered from several ownership changes, a lame commenting system, and most everyone moved on to other venues).
~30% of submissions are software-dev-only (new TypeScript 3.0 release), but most submissions are just high quality, interesting posts. And the level of discussion is excellent: no memes, no jokes. It's like the opposite of Reddit.
Our heavy-diesel emissions performance machine runs Linux, and after a vendor training on how to use it I became totally fascinated with it. A lot of Linux professionals like HN a lot, and the articles are a good break from the norm on a lunch break.
I came across HN by accident, and what keeps me coming back is the interesting, informative links & threads, and the quality of discussion. Threads about workplace issues are of interest to me, as are general scientific and social topics.
I'm a "scientific" programmer, meaning that I use programming as a problem solving tool, but I don't write commercial software for a living. I turned away from that path in the early 80s.
I've been reading this site since 2009 and can barely write anything more complex than Hello World. The guidelines on submissions say nothing about software engineering, and the definition of hackers as written by Paul Graham does not limit it to software engineers.
The comments are informative, there's no politics, it's more interesting than the news. I'm a sales recruiter, so I don't read the in the weeds software articles, but there are still 2-3 good articles on the front page at any one time.
HN is not a monoculture; the right content on the site can interest anyone, and no one is required to be a software developer to visit this website.
Nobody is required to be anything on any public forum; that doesn't change the fact that communities exist, and the vast majority of people on a forum are there due to a particular taste, and share a common preference for discussion.
That anyone of any interest can visit the webpage is not a sufficient explanation as to why people of every interest do visit it. The question is what's so special about HN that it draws such a varied background, despite clearly having its strongest preference towards software engineering (on any given day, if you counted by category, it would be surprising to find software engineering / programming not as the highest).
A reddit forum has the same public property, but the software engineering channels presumably do not draw such varied backgrounds. So why?
Maybe granularity -- there's only one front page for HN? I usually find that I want to read 10% of articles on the front page, but I'm often surprised by which 10%.
A lot of the content is still relevant for those in other IT roles, or even just hobbyists. I'm a security engineer rather than a dev but I still find many of the posts on the site useful and/or interesting
That's pretty spectacular, I'd love to read some more sources on it.
Makes me glad Tesla did their stuff the way they did. You can reboot the center console and/or dash while driving and not much happens other than the A/C turning off for about 20s.
Tesla pushed out a software update to AutoPilot (Lane Centering) that caused a fatality[0], I hardly think they should be held up as an example of good software hygiene.
There was recently an interesting twitter dump by a former Tesla engineer whose NDA was up; as of a few years ago their software development practices were... pretty bad.
Apparently every Tesla vehicle runs a Kubernetes cluster and is/was remotely accessible to Tesla via SSH?
> I don't know how they can relax at all relative to their work.
They strictly follow procedures. If faulty code gets through to production and somehow causes a fatality, it's a failure of the procedures, not the individual developer. This works so well in aviation and yet seems completely non-existent in the automotive world.
The procedures exist and are 'followed', but there's a lot more time-to-market pressure. Manufacturing lines dictate the schedule and software must keep up no matter what.
When I worked on safety-critical products, it was actually really nice to be able to push back.
"We can't satisfy the safety case yet" is all you need to say to get your manager to cave.
If they want to take the risk upon themselves to sign off on a known-unsafe (technically; in practice it was already pretty good, just not good enough yet) device, and go to jail if something goes sideways, they can be my guest...
In practice, they preferred to come back next week and ask if we were done yet.
The project went wayyyy past the deadline, thanks for asking :-)
If you as an individual are taking on that level of personal responsibility, your organization is broken.
Part of the regulatory process is to demonstrate sufficient levels of resilience in both the systems produced and the development process used to produce them. It should be a really boring process with a lot of crosschecks. You should be sleeping well knowing that your work is being reviewed and tested and attacked by other people.
(Reality is often different, of course. Just don't try to be the hero.)
I recommend reading the "NASA Manager's Handbook for Software Development". It contains some great guidelines for writing safety-critical software. AFAIK, no Space Shuttle mission ever suffered a serious safety incident due to a software defect.
I'm with you, I'm nowhere near confident enough in myself to work directly on safety-critical systems. I imagine an important part of those organizations is removing SPOFs with respect to the engineering team itself; i.e. nobody fails alone, and everybody tests the hell out of everything. Could be wrong though.
If you work for AirBnb your bad software is definitely changing lives and putting people in danger. Software doesn't have to be embedded to materially affect someone's life.
Saying it caused "a fatality" is being disingenuous. It's along the same false pretense as "we shouldn't use self driving cars at all, if they lead to even one fatality".
Fatalities aren't good things, but they're inevitable consequences of driving. And it's disingenuous to expect anyone's code to be guaranteed to work on every mile of road in the US, under all conditions, at all times. If autonomous vehicles substantially reduce overall fatalities we are better off.
(Nobody talked about banning combustion vehicles after the Pinto, or banning Fords over their side-mounted fuel tanks or their Explorers that rolled over when the defective tires blew out.)
> Saying it caused "a fatality" is being disingenuous.
It is accurate.
They pushed out an update that changed how AutoPilot behaved, someone had AutoPilot enabled, AutoPilot accelerated straight into a concrete barrier, and the occupant died.
Here is the NTSB's initial report on the incident:
> It's along the same false pretense as "we shouldn't use self driving cars at all, if they lead to even one fatality".
I didn't say anything remotely like that. I stated what had already occurred. Trying to put those words in my mouth doesn't seem like you're responding in good faith.
"You have no proof that the accident wouldn't have occurred anyway without the update."
Well, the NTSB preliminary report doesn't make any statements about what the software update did, per se. But it does indicate that the driver's hands were not on the wheel at the time of the accident and that the "autopilot" software was activated. So it's fair to say that the "autopilot" software was responsible for the crash.
"Or that their updates didn't save one or more lives."
This is a red herring. It could be true. But there's no evidence for it so it's not worth thinking about. Our system of moral judgements rightly puts a lot of weight on demonstrable causes and effects, and generally ignores hypothetical, but totally unproven speculation like this. Otherwise anyone could get off the hook for anything, by saying, "I may have done bad action X but you can't prove I didn't also do good actions Y and Z which could well outweigh X."
"Not an "accurate" but misleading accusation at the software without considering the hidden variables."
In ordinary life we never know all the hidden variables. We just make judgments based on the best available information (or if necessary, decide to postpone judgment until better information is available). The NTSB preliminary report seems credible and I see no reason not to draw reasonable conclusions from it.
Toyota had a similar brake software issue in 2010 [1]. They had a fix available 3 days after the NHTSA announced the issue. Toyota issued a voluntary recall for over 130,000 cars in the US. Since the recall was voluntary, it likely took years until some cars received that update when they were serviced for other unrelated issues. I have no idea if Tesla is doing things the best way, but Toyota having the ability to push that update out immediately would have indisputably prevented accidents and possibly saved lives.
I’m very glad that Tesla’s instrument cluster is semi-independent. Because mine crashes every few minutes while driving due to a “high priority” firmware regression.
On the latest ADB Podcast, they mention that certain car infotainment systems just go "blue screen" if you feed them a more recent Bluetooth version than what they have hardcoded.
John Hughes likes to talk about the CAN bus when discussing Quickcheck and test generation.
Learning about the crazy complexity and potential for unforeseen interactions made me very unhappy when security researchers remotely controlled a car on the highway as a publicity stunt. There’s simply no way they can know with absolute certainty what might happen.
Uhh, that's kind of terrifying. I'm due for a new car after 10 years, and I think I might specifically try to avoid systems that centralize things like brakes with entertainment systems...
> Your options are to increase the number of bits used, which puts off the overflow, or you could work with infinite precision arithmetic, which would slowly use up the available memory and finally bring the system down.
Yeah, no. Doubling the amount of bits to 64 while keeping the same precision gets you about 3 billion years worth of time, which is probably enough. And I'm going to leave calculating how much time it'd take to fill up any reasonable amount of memory with a single arbitrary-precision integer as an exercise to the reader.
Even if you do use arbitrary precision arithmetic and count nanoseconds, the heat death of the universe is more likely to occur before your number takes 1KiB of RAM.
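To spell out the exercise: a 1 KiB integer is 8192 bits, i.e. values up to 2^8192 ≈ 10^2466. Counting nanoseconds (about 3.15e16 of them per year), that's on the order of 10^2449 years before the counter needs its 1025th byte, versus the commonly quoted ~10^100 years for the heat death of the universe.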
Probably the deeper problem with using arbitrary precision arithmetic is that you end up with a variable-sized datatype, which I believe means at least a modicum of extra hassle & complexity for any language that the control software is likely to be written in. And less predictable timing, which might be a big no-no if this is something that needs to be used in timing-sensitive places.
I was going to comment how you should probably still account for an overflow condition by warning at 70%, beeping at 80%, refusing to take off at 85%, etc., but it turns out that even with nanosecond precision timekeeping (1e-9), a signed 64 bit integer is enough for 292 years of not rebooting.
(2^63)/1e9/3600/24/365 = ~292
Yeah, just go for that 64-bit int and call it a day.
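To put numbers on it (assuming the counter really is in hundredths of a second, which is only the article's guess at what the GCU does):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical uptime counter in hundredths of a second, as the
         * 248-day figure suggests; the real GCU implementation is not public. */
        printf("days until a signed 32-bit counter overflows: %.2f\n",
               (double)INT32_MAX / 100.0 / 86400.0);            /* ~248.55 */
        printf("years a signed 64-bit counter lasts: %.1e\n",
               (double)INT64_MAX / 100.0 / 86400.0 / 365.25);   /* ~2.9e9 */

        /* Incrementing the 32-bit counter past INT32_MAX is signed overflow:
         * undefined behaviour in C, and on real hardware it typically wraps
         * to a large negative value, which is exactly the kind of surprise
         * you don't want in a generator control unit. */
        return 0;
    }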
Specifically, 32 bits of PDP11 core memory cost on the order of $2, and that was per timestamp. Pretty obvious why you wouldn't go 64-bit unless you had to.
I don't think this case needed 64 bits for the clock, because realistically I would hope that the plane would be serviced more often than every 248 days. At each service the computer could be reset, or go into a self-check which would reset the computer and therefore prevent the overflow.
The thing about the error is that it's a very standard sort of error, but one which apparently formal methods didn't catch.
I've just been looking up TLA+ (a model verification language) and I'm not sure how it would deal with avoiding overflow; its approach is proving that a system will always be in a required state, but avoiding overflow in most practical approaches involves effectively staying far enough away from the overflow condition that it never appears (as the other comments here go into). Done right, overflow will never be impossible, just unlikely.
64 bits is certainly the solution I would go for as well. Arbitrary precision numbers are going to involve heap allocation which is something I would normally like to avoid entirely in a system like this.
This is reminiscent of the 1991 missile failure at Dhahran [1], which resulted in 28 deaths. The Patriot missile battery at Dhahran had been in operation for 100 hours, by which time the system's internal clock had drifted by one-third of a second. Due to the missile's speed, this was equivalent to a miss distance of 600 meters.
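(For scale: the widely cited figures are a drift of about 0.34 seconds after 100 hours of uptime and a Scud closing at roughly 1,700 m/s, so 0.34 s × 1,700 m/s ≈ 580 m, hence the ~600-meter tracking error.)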
Two weeks earlier, on February 11, 1991, the Israelis had identified the problem and informed the U.S. Army and the PATRIOT Project Office, the software manufacturer. As a stopgap measure, the Israelis had recommended rebooting the system's computers regularly. The manufacturer supplied updated software to the Army on February 26.
While it is a good lesson from a software perspective- your bits are going to overflow, make your software handle that gracefully- I've always been a bit uncomfortable with the blame for deaths being placed on that bug.
The bug didn't kill anyone. The Scud missile fired with intent to kill did. And of course, it was fired because it was the Gulf War and countries were attacking each other. Blame whom you like for that situation. But all anyone talks about is how it was the Patriot missile bug that led to the deaths.
The software bug failed to prevent the deaths that were going to happen anyway. Lessons learned, yes, but I'd hate to imagine a programmer somewhere living in guilt over it.
I agree. Really there are two alternatives to what happened.
1. The patriot battery does not have the bug, (or was recently rebooted) and now has a BETTER chance of hitting the scud missile. (but not guaranteed) People may or may not die.
2. The patriot battery is turned off, or not there, and nothing stands in the way of the scud missile. People most likely die.
> The software bug failed to prevent the deaths that were going to happen anyway.
It MAY have failed to prevent the deaths. I don't think Patriot has a 100% success rate. If the normal success rate of that intercept was 1%, 50%, or 99% - it changes the wording a bit I think?
> Official assessments of the number of Scuds destroyed by the Patriot missile system in the war have fallen from 100 percent during the war, to 96 percent in testimony to Congress after the war, to 80 percent, 70 percent and, currently, the Army believes that as many as 52 percent of the Scuds were destroyed overall but it only has high confidence that the Patriot destroyed 25 percent of the Scud warheads it targeted.
> Independent review of the evidence in support of the Army claims reveals that, using the Army's own methodology and evidence, a strong case can be made that Patriots hit only 9 percent of the Scud warheads engaged, and there are serious questions about these few hits. It is possible that the Patriots hit more than 9 percent, however, the evidence supporting these claims is even weaker.
>the FAA's new rules require operators to reboot the plane's electrical system every now and then because "all three flight control modules on the 787 might simultaneously reset if continuously powered on for 22 days." The effect of this simultaneous reset "could result in flight control surfaces not moving in response to flight crew inputs for a short time and consequent temporary loss of controllability."
IIRC, 64 bits give you 200 years when you're counting nanoseconds. Why on Earth would they use a 32-bit integer? I doubt this was some kind of microoptimisation. My bet is on some sort of legacy component that is 64-bit-o-phobic.
I was in training brushing up on my embedded programming skills, which thankfully I don't have to use anymore, and the instructor told a story about a bug in a missile that was on the verge of delaying its deployment due to tests failing.
While on the bench and being subjected to a bunch of tests to validate the seeker components against simulated targets and countermeasures, the flight control surfaces would start spazzing out after a few minutes in the test jig.
The problem had something to do with IR (or something) tracking and attitude control, with the tracker rolling over the odometer and sending spurious data to the flight guidance system, which caused the tests to fail and would have led to an in-flight breakup of the missile in the real world.
The onboard hardware was highly resource constrained, and engineers and developers worked for weeks trying to fix the problem, going as far as contemplating a complete redesign of the seeker system.
Then somebody pointed out that the time between launch and either of the missile's two end states (impact with the target or fuel exhaustion) was less than 60 or so seconds.
The only reason the bug was causing problems was that tests were being run back-to-back-to-back in order to speed up things, and the seeker subsystem was being powered on for way longer than it ever would be in the real world.
A missile shouldn't have to be on for more than a couple of minutes, and a Dreamliner shouldn't have to be on for more than 248.55 days so I'm willing to bet they stuck some ancient 8-bit micro in there, reused ancient code that was already tested, and called it a day.
> a story about a bug in a missile that was on the verge of delaying its deployment due to tests failing.
A Patriot missile battery in Saudi Arabia failed to launch against an incoming Scud, which then killed 28 soldiers. The reason was poor handling of rounding errors [1].
The maiden flight of the Ariane 5 rocket resulted in a RUD due to an integer overflow [2]
To be fair, there is little public evidence that Patriots ever killed a SCUD (but plenty of public evidence of the Army lying about their effectiveness).
Why stop there? Why on earth would anyone use an integer instead of a double, given the inherent risk of truncation error? Or for different arguments: Why on earth would anyone use a 16 bit wchar_t? Why on earth would anyone make char unsigned (or signed)? Why on earth would anyone put the little end first?
Machines are machines. They have fixed representations for different types, with tradeoffs. And you have to pick one. And the thing about timeout handling specifically is that everyone along the path from the timer driver up through the app needs to agree on the precision needed, or you'll get an overflow condition.
Arguments of the form "Bugs are bad and we shouldn't write them" have not historically helped with improving software quality.
I think the comment you're replying to makes a solid point, still.
If this bug were to appear every 200 years, that's substantially longer than the lifespan of any single airframe currently in existence (and nearly twice as long as the existence of powered flight) -- and if a Dreamliner were to actually survive that long (most likely as a historical artifact doing heritage flights, maybe), doing this kind of reset would just be part of the routine of getting the "old frame" up and running, not unlike timing certain steps of a WWII fighter plane's startup manually rather than having them done automatically.
Without saying that increasing the width of that variable is the _optimal_ solution, in this case deferring the error leads to safe and predictable operation over the airframe's nominal lifespan.
For what it's worth, quadrupling the variable's width to an int128_t would mean you can store over 10^21 years at nanosecond granularity, effectively future-proofing this bug out of existence, and I'm reasonably sure a modern system should be able to spare a dozen extra bytes.
> They have fixed representations for different types, with tradeoffs. And you have to pick one. And the thing about timeout handling specifically is that everyone along the path from the timer driver up through the app needs to agree on the precision needed, or you'll get an overflow condition.
I agree, but the difference between 32-bit counters and 64-bit counters as a precision refinement is very special. Upgrading from 32-bit counters to 64-bit counters, even when counting nanoseconds, lifts you out of human timescales. No electronic system has ever or will ever need to maintain a microsecond count for 600000 years, for example.
Fun fact: the C standard does not specify the width of wchar_t. While MSVC uses 16-bit wchar_t (UTF-16), gcc (even MinGW on Windows) uses a 32-bit wchar_t.
> Why on earth would anyone use an integer instead of a double, given the inherent risk of truncation error?
Depends on what you're doing.
> Why on earth would anyone use a 16 bit wchar_t?
That's a very good question! Either use something that can hold at least 21 bits, or use UTF-8.
> Why on earth would anyone make char unsigned (or signed)? Why on earth would anyone put the little end first?
These have nothing to do with space, what weird analogy are you trying to make?
> They have fixed representations for different types, with tradeoffs. And you have to pick one.
For most purposes, there is zero benefit for going too small. Pick a number that can't break under the use it's getting. If a fixed-size number can't do that, change your algorithm.
> And the thing about timeout handling specifically is that everyone along the path from the timer driver up through the app needs to agree on the precision needed, or you'll get an overflow condition.
Making sure your data types are compatible should be one of the easiest pieces of analysis you're applying to your safety-important code.
> Arguments of the form "Bugs are bad and we shouldn't write them"
Yeah, this isn't that at all. This is "think about the limits of your data types as you pick them". A computer being left on for a year should be an expected use.
"Well, I figured....that bridge is gonna fall down in 200 years no matter what, so I went ahead and used rubber bands to hold the suspension lines in place. Only cost me $2!"
No, just no. Don't be a hilljack. Build something to spec. 32 bits is clearly not to spec.
Both you and the other replier seem to have missed the point. I'm not arguing for the use of a 32 bit time value. I'm pointing out that the drive-by "why on earth" framing of the bug is unhelpful and probably harmful. Sure, maybe you'd never make this "simple"[1] mistake. But there are a thousand others you would.
You fix that with careful design and testing, not with "why on earth".
[1] As pointed out, it's not simple. Again, the whole stack needs to agree with you and your concept of "to spec", not just the code you're personally typing.
How do you know? What's the spec? At first glance, it seems totally reasonable that the vehicle would be completely rebooted at least once every n << 248 days.
I dunno, but clearly running more than 248 days was not in it, otherwise this would hopefully have been caught and tested. If it was specifically specced to last N days and N < 248, personally I would have asked for my money back. As this appears to have come as a surprise to everyone (caught late), I'm calling this one as I see it: a facepalm.
Of course, in security-relevant areas, proven legacy code always trumps new code. No one wants to lose an airliner because Phil, right out of a CS course, positively needed a Ruby interpreter in the main flight computer...
My favorite recommendation to people: take a C or C++ program they have written and compile it with full MISRA checking. It definitely makes things more interesting.
The hardest thing to change after the program is already finished is getting rid of allocations. But other than that MISRA compliance is not that difficult and avoiding allocations is not that difficult either if you plan for it from the beginning.
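As a sketch of the "plan for it from the beginning" point: instead of malloc'ing records as they arrive, you size a static pool for the worst case up front and handle exhaustion explicitly (the record type and pool size here are arbitrary examples, not from any standard):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Worst-case sizing decided at design time; no heap anywhere. */
    #define MAX_FAULT_RECORDS 64u

    struct fault_record {
        uint32_t timestamp_cs;   /* hundredths of a second since power-up */
        uint16_t code;
        bool     in_use;
    };

    static struct fault_record fault_pool[MAX_FAULT_RECORDS];

    /* Fixed-size pool "allocator": returns NULL when full instead of growing,
     * so memory use is bounded and provable, which is what MISRA-style bans
     * on dynamic allocation are really after. */
    static struct fault_record *fault_record_alloc(void)
    {
        for (uint32_t i = 0u; i < MAX_FAULT_RECORDS; i++) {
            if (!fault_pool[i].in_use) {
                fault_pool[i].in_use = true;
                return &fault_pool[i];
            }
        }
        return NULL;  /* caller must handle "pool exhausted" explicitly */
    }

    int main(void)
    {
        struct fault_record *r = fault_record_alloc();
        if (r != NULL) {
            r->code = 0x0123u;
            printf("allocated record, code 0x%04X\n", (unsigned)r->code);
        }
        return 0;
    }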
If you know what you are doing it is not difficult, but if you are doing it for the first time it is a good tool for learning how to think about all the repercussions of how you program something. MISRA isn't really there to catch problems with C; it is there to catch problems with people using C. So knowing what those problems are and how to avoid them is incredibly useful.
200 years sounded low to me and so I did the math and it is indeed pretty spot on. 63 bits, i.e. signed integers, will last 292.27 years and you obviously get twice that if you are not interested in negative numbers.
Right, thanks. My estimate was closer to 200 because I was thinking of Go's time.Time.UnixNano method[1], which returns undefined results when used before the year 1678 or after the year 2262. But yeah, if you count since the system restart, it's bigger.
Well, the sort of engineer working on this might very well have gotten their start in environments where they were restricted to 5k of RAM. In that case it's easy to paint yourself into a corner by choosing integer representations that are larger than you need.
The space shuttle had a related problem, although not caused by overflow. Some parts of the STS's avionics used a clock that would reset to zero at 00:00:00 on 1st January, while other components had clocks that would continue to count up. If a shuttle mission spanned the new year boundary then systems would panic if they could no longer agree on the time.
"One interesting fact is that the FAA claim that it will take about one hour to reboot the GCUs - so there clearly isn't a reset button."
I am incredibly surprised by this; most higher-level flight control systems have power-up requirements in the seconds, and lower-level actuator or engine controls have power-up requirements in the milliseconds.
A few years back I was travelling on a Virgin Pendolino train (an Alstom Class 390, probably) in the UK that was having a few problems. Speed changes were causing a lot of juddering, and as I remember the interior lighting and air conditioning were being unreliable. After limping on for a while, the train stopped at a small non-scheduled station and the crew announced that we'd be stopped there for a while, and not to be alarmed if the lights went off (it was night). This duly happened: the ventilation fans stopped, the interior doors all moved to the open position, seat booking indicators turned off, etc. After a pause that was probably a minute or two, everything came back to life and we resumed the journey.
Basically, they rebooted the train. It was interesting to realise that such a large, complex, fast-moving machine could be reset like that, with several hundred people embedded inside it.
On trains with the original version of the UK's TPWS protection system you have to reboot them if somebody mistakenly leaves them on top of the TPWS transmitter when shutting down, for example stopping slightly short of the usual position in a terminus station. In this state when the train boots up TPWS is in an error condition, so you have to override that, move it a few metres, then reboot it to get rid of the error. Newer models recover without being rebooted.
Failing to reboot caused at least one accident because the TPWS was just left overridden and so wasn't protecting anything.
Since a series of UK rail disasters involving trains whose safety systems were switched off, running in passenger service without the safeties is prohibited.
So that depends on if you mean a power source temporarily goes down or if a box experiences a power blip. So in the first case each box usually has 3 different sources of power, something like the generators on the engines, then backup on the 24 volt bus, then finally a backup battery.
In the second scenario it depends on how long the blip is. Usually there are holdup requirements that a box will not reset if power is lost for x amount of time. If power is lost for longer it will save state and leave itself in a state that will come back up quickly.
What I am thinking is happening in this case is that some value is getting messed up in NVM and it must be reset by the maintenance crew, so the "reset" they are talking about isn't just rebooting after the error occurs. But if you reboot before the error happens the NVM doesn't get messed up and the value is just updated with the correct number.
So you think the one hour quoted by the article isn't the time it takes for the system to boot but rather the time it takes for the maintenance to access the device, reset it, and close everything?
That seems much more reasonable to me, mostly because planes are rebooted for maintenance every so often between flights, and customers (airlines) would not be okay with a 1-hour reboot time.
There are also usually maintenance tests that can be ran to reset a box on the plane. So the technician would have to put the plane in maintenance mode and run through the test to get the box reset or something.
Yes, they might not be okay with that, but airlines apparently can hold passengers for 8 hours because of delays, to the point of toilets overflowing. I think if you run an airline, you are legally required to go out of your way to provide terrible service.
Hard to say what the actual procedure is. Might need a cold dark airplane for an hour for the GCU to clear memory or you might have to open panels and go back there and physically disconnect power to GCUs to get it done. It seems to me opening panels would probably take more than an hour.
I’m currently working on an aircraft that has IMUs rather than AHRS/RNAV/GPS. I cannot recall ever seeing them take more than 5 minutes on first start up in the morning. Manuals specify 2.5-10 minutes. This is on a 1980s model. I’m sure newer designs are quicker.
> A simple guess suggests the problem is a signed 32-bit overflow, as 2^31 is the number of seconds in 248 days multiplied by 100, i.e. a counter in hundredths of a second.
The question that comes to my (perverted) mind is what the counter is for, or more strictly, why it needs an accuracy of 1/100th of a second.
If it is related to a "periodical" action (a time interval) it makes little sense to have that degree of precision, and on the other hand, if the precision is needed, why not calculate it from a base point?
I mean, if it is related to "boot" time, I presume that no one would ever use a counter, rather a log of some kind with a timestamp, and calculate (properly) the time elapsed...
If the clock is being used for real-time dynamic model propagation, the 100 Hz rate is enough to capture the physical phenomena in the system, such as control system response. 10 Hz is a bit too slow, and 1 kHz is overkill.
During my graduate work in aerospace engineering, 100 Hz was pretty much standard for sensor filtering and simulation work.
The systems are controlling generators, which produce AC current. It's not unreasonable to think the controllers are monitoring the AC waveform, which has a frequency of 50-60Hz. Ergo, if the controller is checking the generator is producing the expected waveform (because a lot of sensitive things depend on the waveform being accurate), it makes sense that it could be using a 100Hz polling interval, which is where the counter comes into play.
It's possible that it's using a system-wide timer for convenience since the embedded hardware is very limited compared to a full computer, where spinning off a separate timer is trivial. When the requirement for different timers hit the program designer's desk, they probably took the most precise use case and designed one timer around it, and neglected to take into account the overflow.
Yep. Purely anecdotal data, but I was once involved in the construction of a (small) hydroelectric plant (I was responsible for the construction/building, not for the electrical parts; our construction company had a couple of electrical partners for that).
For some reasons the tender was for the building and plant, but the client made a separate tender for the actual turbines.
Though I tried (vainly) to convince the manufacturer of the turbines to use a controller from the same firm we had as partner, they decided to use "their" way.
In theory not a problem as the turbines and their controllers would "talk" to the external control system (SCADA, etc.) through a "standard" RS422 connection.
After a couple of days of testing, it was clear that the SCADA was periodically going "berserk".
Though not at all my field/responsibility, I was keen to have the problem solved, and after 3 (three) days of the engineers from the turbine manufacturer and from the control system manufacturer finding nothing (and BTW largely failing to communicate with each other), I started looking with them, one by one, at the signals that were exchanged on the connection.
It was evident that there was some form of overflow, the on-turbine controller sent way too many data and "clogged" the receiving part.
There was a sensor for rotation speed and another one for oil pressure that were polled (and sent data) at a rate of 1000 per second.
The turbines were of the Pelton type, with an external (large) flywheel, and it took (literally) tens of seconds from when you stopped the water flow for the wheel to slow down, and several more tens of seconds for it to stop; vice versa, once it reached the intended speed, it showed practically no variation.
So you had these two sensors polled 1,000 times a second to measure something that would change - maybe - only over intervals of several seconds, and that could just as well be corrected with a delay of tens of seconds.
After reducing the polling rate from 1,000 to 10 times a second (still way overkill for the proper functioning of the system), all the errors went away.
It turned out that these sensors were a new type/make/model, the first capable of kHz polling, and the turbine engineers had set them to the max only because they could.
Aircraft electrical systems operate at 400 Hz to reduce the size and weight of transformers, motors, and power supplies. It's less efficient over distance than 60 Hz, but planes aren't that long anyway.
Anecdotal, like the sibling comment: an interactive inverter product I developed sampled the 60 Hz voltage and current at 19.2 kHz to more accurately determine RMS voltage and current in the face of some really nasty distortion. It was probably overkill, but another guy on the team, who worked on revenue-grade watt-hour meters, thought it was just fine.
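For reference, the basic idea (a rough sketch, not the actual product's firmware): RMS is just the square root of the mean of the squared samples taken over a whole number of cycles, and at 19.2 kHz one 60 Hz cycle is exactly 320 samples, so distortion within the cycle is measured rather than assumed away.

    #include <math.h>

    /* RMS of one 60 Hz cycle sampled at 19.2 kHz (19200 / 60 = 320 samples). */
    #define SAMPLES_PER_CYCLE 320

    double rms(const double samples[SAMPLES_PER_CYCLE]) {
        double sum_sq = 0.0;
        for (int i = 0; i < SAMPLES_PER_CYCLE; i++)
            sum_sq += samples[i] * samples[i];
        return sqrt(sum_sq / SAMPLES_PER_CYCLE);
    }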
I'd assume the software is not event-driven but cyclic. In each cycle you gather inputs, compute, produce outputs. You need a cycle counter because event validity and response time limits are likely defined in terms of cycles (e.g. peripheral sensor A has to report at least once every three cycles or an alarm sounds). Bad things happen when the cycle counter overflows.
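A minimal sketch of that structure (all names, rates, and limits here are invented for illustration, not taken from any real system):

    #include <stdint.h>
    #include <stdbool.h>

    #define SENSOR_A_MAX_STALE_CYCLES 3u    /* hypothetical requirement */

    extern bool sensor_a_fresh_data(void);  /* hypothetical I/O and control hooks */
    extern void raise_alarm(void);
    extern void compute_control_laws(void);
    extern void write_outputs(void);

    static uint32_t cycle_count;            /* wraps after ~497 days at 100 Hz */
    static uint32_t sensor_a_last_cycle;

    void frame(void)                        /* called once per fixed-rate cycle */
    {
        cycle_count++;

        if (sensor_a_fresh_data())
            sensor_a_last_cycle = cycle_count;

        /* Staleness limit expressed in cycles, as the requirement is written.
         * Unsigned subtraction keeps this check correct across wraparound;
         * signed counters or absolute comparisons are where it goes wrong. */
        if (cycle_count - sensor_a_last_cycle > SENSOR_A_MAX_STALE_CYCLES)
            raise_alarm();

        compute_control_laws();
        write_outputs();
    }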
At the very least, the HUMS and CV/FDR systems need sub-second precision for maintenance and crash investigation reasons.
As for why they don't calculate it in a particular way, probably just an amateur mistake.
Lots of aerospace-related applications are moving towards the methodology where you cook up a model in matlab and then hit the AutoCode button. Things like this are the reason why.
The thing is, do they just count all the lines of code in all the libraries they use, as well as the OS (all of it compiled from source, I would guess)? I imagine they do, since that's the only reasonable way that 100 million lines of code could be reached.
So on a project like this, you do not use many standard libraries. It is usually explicitly stated that you can't use anything from the std c library and such.
Also source lines of code are usually calculated as lines of code that need to be verified. So if a line of code is in the software, it needs to have a test that covers that line of code.
Then that really does raise some questions as to what 100 million lines are for. What operations exactly does an airplane need to execute that require that much code?
So this lines-of-code estimate is spread across all the systems on the aircraft. One reason for all of the lines is that everything is done in C.
The more exact reason, really, is that every command on an aircraft is monitored. You have complex control laws to control a surface, and then probably 4 or 5 different monitors that make sure the surface isn't oscillating, the actuators are not fighting each other, and all kinds of other things. It comes from the mentality that you not only design something not to go wrong, you also monitor for failures when it does and handle them accordingly.
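As a toy illustration of what one of those monitors might look like (names, window, and threshold entirely made up, nothing like actual flight code):

    #include <stdbool.h>

    /* Toy oscillation monitor: flag the surface if the commanded position
     * reverses direction too many times within a window of recent frames. */
    #define WINDOW_FRAMES 50    /* hypothetical: 0.5 s of history at 100 Hz */
    #define MAX_REVERSALS 6     /* hypothetical threshold */

    bool surface_oscillating(const double cmd[WINDOW_FRAMES]) {
        int reversals = 0;
        int last_sign = 0;
        for (int i = 1; i < WINDOW_FRAMES; i++) {
            double delta = cmd[i] - cmd[i - 1];
            int sign = (delta > 0.0) - (delta < 0.0);
            if (sign != 0 && last_sign != 0 && sign != last_sign)
                reversals++;
            if (sign != 0)
                last_sign = sign;
        }
        return reversals > MAX_REVERSALS;
    }

And that is just one monitor on one surface; multiply by every surface, actuator, and failure mode and the line count climbs quickly.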
Finally you have maintenance, where a technician can run a test to pinpoint any issue on the aircraft. So there have to be tests and monitors built into the software to determine the root cause of an issue down to a single box, actuator, sensor, or wiring problem.
Modern avionics is insanely complex. This isn't something you throw 1-year, self-taught JS devs at.
When I worked on next-gen turbofans, we had several dozen engineers just managing the requirements for the software, let alone writing the software itself.
You have the main avionics software managing the flight itself, the engine software managing fuel consumption, and the 3x safety factor required to be certified.
"Everything" isn't telling much. I'd love to, out of curiosity, browse through a codebase of an avionics system. Is anything like this publicly available?
Most likely there is not much that you can look at, but this overview of the Space Shuttle can give you a lot of insight into how software and hardware are designed for flight-critical applications. I believe this was posted here a while ago.
One takeaway is that software and hardware are nearly impossible to separate in a flight-critical application.
Also one note, the space shuttle was incredibly complex for its day. So I would say the complexity of the space shuttle would give you a decent idea of what commercial aviation does now.
> Also one note, the space shuttle was incredibly complex for its day. So I would say the complexity of the space shuttle would give you a decent idea of what commercial aviation does now.
I definitely missed this if it was posted here. Thanks!
I was kind of hoping that some legacy avionics system somewhere would have sources available on-line, but I guess companies writing this code don't open-source projects just because they get old.
In addition to every component that has some kind of data bus you also have the fly by wire system which includes all of Airbus’ ‘laws’ for the different flight envelopes. You also have modern navigation gear and the displays and interfaces that come with that. Just look at the cockpit and realize everything in there is software and it becomes clearer.
Think of it as a thousand independent, sometimes redundant, embedded systems. 10,000 lines of code is something you can write alone in 2 years for a medium-complexity project. It's not a single monolithic database application.
I don't work in this field, but my understanding is that code for critical systems like avionics isn't permitted to (among other things) allocate/free memory dynamically. That probably nixes a lot of functions in the standard library.
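Right - a typical consequence (a generic sketch of the static-allocation style, not any particular program's code) is that everything is sized and allocated up front, and "out of room" is handled explicitly rather than by growing memory at runtime:

    #include <stddef.h>
    #include <stdbool.h>

    #define MAX_MSGS 64   /* hypothetical worst-case message count, fixed at design time */

    typedef struct { unsigned char payload[8]; size_t len; } msg_t;

    static msg_t  msg_pool[MAX_MSGS];   /* statically allocated, never realloc'd */
    static size_t msg_count;

    bool msg_enqueue(const msg_t *m) {
        if (msg_count >= MAX_MSGS)
            return false;               /* overflow is an explicit, testable path */
        msg_pool[msg_count++] = *m;
        return true;
    }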
It would bring in whatever crap the standard libraries from the compiler or the system happen to have.
And those libraries aren't written with the mindset of "when this code runs on an A380".
Even most non-certified projects will grow their own standard library once they get past a certain size, and this is merely for convenience, consistency, and features.
Well, here you get into a circular argument: if you did include Linux, you would need to verify every line of code being used. Which is why you would never use Linux and why it wouldn't be included.
*This is in reference to DAL A software. DAL D might be able to run Linux.
But if you want to write a sensational "100 million lines of code" then you count the code that's running the in flight entertainment system (and others) which doesn't have the same safety constraints.
The code would most likely still be counted, but that software would probably be either DAL D or DAL E.
Here is the wiki for the standard that software has to meet for commercial aircraft. It explains the DAL (Design Assurance Level) levels and what goes into them.
https://en.wikipedia.org/wiki/DO-178C
On a recent family trip a few miles from our destination, all my vehicle's dash controls went out, then reappeared with a charging system error indicator.
Everything seemed fine - I watched the battery gauge and hoped I'd make it. When I got to the destination, I stopped and restarted the engine; everything looked fine, and the charging system indicator went back to normal.
I noticed afterward that the "Engine Hours", which had been getting close to 10,000, was now in single digits. No other internal counters were reset.
I wondered if it was an overflow condition, but it appears more mundane - many vehicle owners report seemingly random resets. The surprising thing seems to be that it hadn't reset before getting close to 10,000 hours!
30 odometer miles per engine hour seems about average (to one significant digit - varies with proportion of highway vs city vs idle hours). That would suggest all your 20+ year old cars are under about 120K miles? Or they have a very high mix of freeway miles. Either way, they're not likely getting the average 10-15k miles per year.
Conversely, that rough evaluation is making me question whether my recollection of nearing 10K hours was correct - the vehicle is under 200K miles, which would suggest <20mph average.
My office building has a destination-dispatch elevator system: you select your floor on a panel and it directs you to a preassigned elevator. Every month or so, the system begins to slow down and lag - the user enters a floor, the system pauses (during the morning rush hour this causes delays), flashes the selected elevator, and returns to the entry screen. The longer the interval since the last reboot, the longer the pause. Presumably some resource is leaking or some buffer is filling up that the software never clears, so a manual reboot is required to make it work normally again.
Do civilian aircraft not have requirements for maximum reboot times?
Naval vessels usually have sub-minute times from power restored to shooting back, and they're legacy code and legacy hardware nightmares. A 787 doesn't have the layers upon layers of legacy code that most naval applications do. While the stakes are definitely a little lower, you'd think they would be able to do a full reboot in less than the time it takes to fall out of the sky, by a comfortable margin.
You would certainly hope; and yet, long uptimes unforeseen by designers can have devastating consequences:
> A government investigation revealed that the failed intercept at Dhahran had been caused by a software error in the system's handling of timestamps. The Patriot missile battery at Dhahran had been in operation for 100 hours, by which time the system's internal clock had drifted by one-third of a second. Due to the missile's speed this was equivalent to a miss distance of 600 meters...the Scud impacted on a makeshift barracks in an Al Khobar warehouse, killing 28 soldiers.
> As a stopgap measure, the Israelis had recommended rebooting the system's computers regularly.
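The usual explanation of the mechanism (per Skeel's note on the GAO report; details from memory, so treat the exact figures as approximate): time was counted in tenths of a second and converted to seconds by multiplying with 0.1 chopped to a 24-bit fixed-point value, and since 0.1 has no exact binary representation, the tiny per-tick error compounds with uptime:

    #include <stdio.h>

    int main(void) {
        /* 0.1 chopped to 23 fractional bits, the commonly cited 24-bit register. */
        double chopped  = (double)(long)(0.1 * 8388608.0) / 8388608.0;
        double err_tick = 0.1 - chopped;              /* ~9.5e-8 s per tenth */
        double ticks    = 100.0 * 3600.0 * 10.0;      /* tenths of a second in 100 h */
        double drift    = err_tick * ticks;           /* ~0.34 s */
        /* ~1700 m/s is an approximate closing speed for the Scud. */
        printf("drift: %.2f s, miss: %.0f m\n", drift, drift * 1700.0);
        return 0;
    }

which lines up with the one-third of a second and the roughly 600 meters in the quote.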
There are several types of routine checks - I think Boeing calls them A/B/C/D (with subtypes, depending); Airbus has a different nomenclature, but the idea is the same.
Forgive me if it's a stupid idea - but why not shut down these plane computers every night when they aren't in service and boot them again in the morning before the first flight? Let all the counters reset, let all the memory creep and leaks go away, start fresh...
I remember an issue with the Oracle DB client libraries for Linux (RHEL 2 or 3 I think) where if the system had something like a 160 day uptime any software using the library to connect to the DB would just hang. I saw the effects with strace where it would just get stuck in a loop doing some time related syscall. We managed to get an unofficial patch from Oracle that I could wedge into our servers, since reboots were under strict change control. I remember this issue came up when I was being interviewed for a position at the Guardian a couple of months later and they seemed amused by what I had to do to fix it since they also encountered the same issue but fixed it by making sure they reboot more than twice a year.
Remember the Ariane 5 rocket - also a piece of verified code running on a verified OS, on a verified MCU, made in verified silicon, but the obvious flaw (an unprotected conversion of a 64-bit float to a 16-bit integer, in code reused from Ariane 4) was still missed.
The problem with formal verification is that for a complex system the number of constraints goes through the roof, and it is no longer possible for a human to judge whether any one of them still makes sense in context.
It might very well be that something like "register A should never overflow when input B is below value C" was put in the validation rules, but nobody gave any importance to understanding what it was for. Or worse, somebody was simply too lazy to change it, fearing it would upset 20 other even more obscure validation constraints.
I wonder why the GCU needs a time counter in the first place, and how it is used such that the whole controller shuts down on overflow. I bet 16 bits would be enough if handled properly.
Software on airplanes is usually periodic, not event-driven. Every 10 ms or some other fixed interval it computes all of the control laws and monitors and so on. So that 32-bit counter is most likely the period counter or a maintenance-interval counter.
I imagine the software needs to remember some state variables with timestamps, to ramp power up at a certain rate or to implement PID control of the power level. Crucially, it never needs an absolute timestamp; it only needs to know how old the data is.
The simplest(!) approach is to use an unsigned integer as the timer/counter and let it overflow all the time. Age is computed by unsigned subtraction, ignoring wraparound. With a 16-bit timer and 10 ms resolution, ages up to almost 11 minutes can be represented. Why would a turbo-generator even need to remember what it did 11 minutes ago?
In other words, the mistake is probably not the narrow counter, it's the signed arithmetic and the subsequent failure when it computes a negative age. Someone else said "they probably stuck a crappy old 8-bit micro in there." That would surprise me - people programming 8-bit micros tend to know how to use unsigned arithmetic where it makes sense.
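A sketch of the wraparound-tolerant idiom being described (the textbook pattern, not the actual GCU code):

    #include <stdint.h>

    /* Free-running 16-bit tick counter, incremented every 10 ms by a timer;
     * it wraps roughly every 11 minutes, and that's fine. */
    static volatile uint16_t tick_10ms;

    /* Age of a timestamped sample. Unsigned subtraction is defined modulo 2^16,
     * so the result stays correct across wraparound as long as the true age is
     * under about 655 s. Do this with signed ints and you get negative ages. */
    uint16_t age_ticks(uint16_t stamped_at)
    {
        return (uint16_t)(tick_10ms - stamped_at);
    }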
Naive question: Would a reasonable way to avoid this scenario be, to increment a secondary counter when the primary counter reaches max-1, then reset the primary counter to zero?
1: Size the type used to store the value so that it cannot overflow even in corner cases such as a plane being on for several months, e.g. going from a 32-bit int to a 64-bit one.
2: Have a flag that you set when the counter overflows, so you can calculate the actual time. Given the length of time that planes are on, this would give you enough headroom.
3: Lessen the precision of the counter (probably not an option, since the precision is usually a requirement).
What if the secondary counter overflows? And should all software now check two counters when they want to know the actual time?
Just use a 64 bit integer, at least for this purpose (since this apparently needs only 10ms precision): that is twice as large as a normal integer, so basically the same as using two counters, except you don't need extra logic in your code.
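For scale (just the arithmetic): at 100 ticks per second, a 64-bit counter has so much headroom that overflow stops being a design concern at all.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const double tick_hz = 100.0;                 /* 10 ms resolution */
        double years = (double)UINT64_MAX / tick_hz / 86400.0 / 365.25;
        printf("%.2e years to overflow\n", years);    /* ~5.8 billion years */
        return 0;
    }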
To be honest this isn't that big a deal. Why would anyone leave an airplane running continuously for days?
Just a quick calculation: if it takes 248 days = 248 * 24 * 3600 seconds for the counter to go from 1 to 2^32, then the sampling period is about 5 ms, i.e. measurements at roughly 200 Hz (with a signed counter overflowing at 2^31, it would be 10 ms and 100 Hz).
Not related to anything but it's nice to know I guess.
This is quite old news though. I remember it was mentioned in last year's CS50 lecture about integers.
> Your options are to increase the number of bits used, which puts off the overflow, or you could work with infinite precision arithmetic, which would slowly use up the available memory and finally bring the system down.
Pedantically true for an infinite universe, but merely moving the counter to 64 bits would give 2^32 * 248 days; even if we only use 30 of the extra bits, we're still on the order of a billion years for 4 extra bytes.
> Your options are to increase the number of bits used, which puts off the overflow,
Just adding one more byte multiplies the range by 256, putting the overflow off by about 175 years - should be long enough.
The question that comes to my mind is: now that they know that Boeing is sloppy are they going to thoroughly audit the code to see if any other overflows are lurking in it? And will they do the same to Airbus? And if not then why not?
> We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane
So the 787 doesn't have a mechanical backup for the flight controls? So much for the Boeing fanboys talking about their mechanical yokes. Though realistically that says very little about actual reliability and safety; just look at the 737 rudder issues.
When you say "mechanical", in a 787 or any large airliner it doesn't mean straightforwardly what it might mean in a small plane.
In any large airliner the pilot's controls are indirectly linked to the flight surfaces by either hydraulic or electrical systems. There's no way a person could produce the forces needed to handle a plane that large adequately.
So I suppose it depends on the specific system, but if electrical power were to be totally lost, perhaps they would be dead in the water. (air)
> In any large airliner the pilot's controls are indirectly linked to the flight surfaces by either hydraulic or electrical systems.
I understand the 777 and 787 are special cases, but my impression of the rest of their line was that the flight controls are directly linked by control cables and are assisted by mechanical-hydraulic servos, a bit like hydraulic power steering on cars. So with a total loss of electrical systems they would still have normal primary flight controls.
Of course hydraulic servos can fail (for example the aforementioned 737), and even triple hydraulic systems can be taken out by something like the uncontained engine failure of United Airlines Flight 232.
I tend to think that electrical and electronic systems can ultimately be more reliable and maintainable, but I worry that the safety culture is much worse among the designers of such systems than their mechanical equivalents (no wonder, we've been tinkering with the latter for thousands of years and the former only a few decades).
A company I worked for licensed a managed language VM, which was being used to operate an airport people mover, which was crashing after just shy of 50 days. (Crashing software-wise, not physically - it would just awkwardly stop.) It turned out to be an integer overflow. If this was a 32-bit register counting milliseconds, that would be about right: 2^32 ms is roughly 49.7 days.
"You may be used to rebooting a server every so often to ensure that it doesn't crash because of some resource problem"
Is this something that people have to do? I maintain a few Linux servers, and I never need to reboot them. They only go down when the hosting company needs to do some kind of hardware maintenance.
That was once the reason put on a support ticket for a Windows 2000 Server which I reported as running slowly - something like: "Server had been running for over 30 days - needed reboot". I suspect this was down to the bespoke app, and not Win2K.
While this is pretty sketchy, I DO like the design principle of ephemeral subsystems. It just makes sense to assume a service can go down at any time for any reason, from its conception, and bake that into how it's built. You could argue periodically restarting things is just a natural extension of that.
In this case, does it really matter? I'd be very surprised if it's running nonstop for more than a couple days, let alone almost a year. Or is this some system that normally doesn't power down when the rest of the plane does?
Planes like the 787 are probably very rarely completely powered down. At the gate, they're on ground power and there are probably some power buses left energized even if the plane is going to sit unused until the next day. These generator control units (since the article says they take an hour to boot up...running through a ton of tests, I guess?) are probably energized even when most other aircraft systems are turned off. If the aircraft is being towed to a hangar or away from a gate for parking, they probably want its lights on while in motion on the field so will stay powered from battery or fire up the APU (auxiliary power unit) before disconnecting from ground power to tow it around. When it makes it to the hangar, perhaps it gets plugged back in to ground power. So it's conceivable that this particular component could remain powered up continuously for a looong time.
That said: as others have stated, some forms of maintenance would power down the plane completely, and these likely happen more often than this overflow bug occurs. If I remember from the articles when this bug was first discovered, it was discovered through simulation or specific testing and not because an operator encountered it, which would imply that yes, operators do "reboot" the whole airplane often enough where this doesn't naturally happen.
This kind of plane isn't "not in use" very often. In fact, any time it's not flying people around is costing the airline loads, and they try to minimise that.
Plus, when it's not flying passengers, it's being fueled, cleaned, maintained, and inspected. All of those are easier if there's power and systems online.
Unfortunately, evidence [1] suggests that the rate of bug detection by reviewers does not scale linearly with the number of reviewers; adding more than 4 reviewers uncovers bugs at a much lower rate [2].
However I absolutely agree that better / more extensive / diligent code reviews are part of the solution to improve code quality and eliminate these kinds of defects. It's tough to create the right kind of incentive structures for reviewers internally at a company; maybe the future will have specialist firms that provide review-as-a-service for a fee, or perhaps firms could trade review-hours (all under strict NDAs I'd imagine).
Well usually the difference between open source and proprietary is that the proprietary vendors are slow to fix security issues because it doesn't increase their profit. In some cases they even try to hide the fact that the software is insecure or even sue the person who reported the vulnerability. Meanwhile most OSS software gets fixed as soon as the vulnerability is found.
In the case of aircraft, though, a company could find a bug and do their best to fix it quickly. But if the plane is already flying, any changes to the software require a re-certification, which can take months.
It should be fairly easy for anyone to figure out that planes generally aren't used around-the-clock. For example, I routinely take a flight which arrives at 11pm, and I know that nothing is flying out of that airport again until the morning.
Having seen a similar bug on humble Netscalers, and found a much shorter reboot quite difficult to schedule, I do believe the operational challenge here.
The difference being that setting up NS HA, then failing over to the secondary and rebooting in turns is a lot easier than doing the equivalent on a plane. So even considering the operational requirements are the same (no time to reboot but if you don't, people die), you still have a lot more flexibility.
The problem with an airplane is that by the time you find such an issue it becomes prohibitively expensive to fix if an "architecture" change is required. This is why workarounds are provided instead.
I know it's a joke but the mental picture of an airplane with its wings removed showing up at a bus stop, dropping in for a broken bus, is very amusing. Would probably be national news and good PR if they actually pull this off, might be doable with a short enough plane. One can dream...
United doesn't, as far as I can tell, do maintenance until something breaks.
In order to be allowed to fly in the US, and other countries, you better believe United does maintenance.
This is one of my favorite reddit comments ever, and goes over aircraft maintenance from a car analogy, explaining what you'd have to do to a car in order to meet the mandatory maintenance standards imposed on airlines as a matter of law:
JFYI, I often use as a reference the actual VW Owner's manual for the Beetle, you can find a copy of the 1952 version here, relevant pages are 7 and 8 "Operating Instructions":