I'm still amused at how the west coast is starting to pay attention to "those guys in the rust belt". ;-) I'm reminded of when I interviewed at TI and they were just coming across some foreign concepts. Some people have a practice of refreshing peripheral registers (like port direction and such) on a regular basis. What happens if the bit flips, and now that button on your door doesn't work? But then for some reason you disconnect the battery and everything is fine? Why would the bit flip? Transients, latch-up? And you're going to be connected to a battery for 6-10 years right? This stuff has all been seen, and it's not always just a reboot to fix it.
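For anyone who hasn't seen the register-refresh practice, here is a minimal sketch of the idea, simulated in Python rather than real firmware. The register names (`PORT_A_DIR`, `PORT_B_DIR`), their values, and the `refresh_registers` helper are all made up for illustration; on a real part this would be memory-mapped I/O written from a periodic task.

```python
# Hypothetical register names and values, for illustration only.
# A real MCU would expose these as memory-mapped I/O, not a dict.
EXPECTED_CONFIG = {
    "PORT_A_DIR": 0b00001111,
    "PORT_B_DIR": 0b11110000,
}

def refresh_registers(registers, expected=EXPECTED_CONFIG):
    """Rewrite every configuration register with its known-good value.

    Returns the names of registers that had drifted, so they can be logged.
    """
    corrected = []
    for name, good_value in expected.items():
        if registers.get(name) != good_value:
            corrected.append(name)
        # Write unconditionally: the point of a refresh is not to trust the readback.
        registers[name] = good_value
    return corrected

# Simulated hardware state with one flipped bit in PORT_A_DIR.
registers = dict(EXPECTED_CONFIG)
registers["PORT_A_DIR"] ^= 0b00000100
drifted = refresh_registers(registers)
```

Called from a periodic task, this puts a flipped direction bit back without needing the battery to be disconnected. It does nothing for a latch-up, which generally needs the power removed.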
At the company I work for, we've been struggling with the fact that nobody has made a new-design CPU or SoC that's suitable for our safety-critical, "high integrity at the source" application with modest radiation exposure (avionics) for over ten years. We finally just gave up waiting and designed our own. We're told by our senior folks that they've received calls from their counterparts at several west coast companies (Apple, Qualcomm, etc.) expressing interest & best wishes for what we are doing.
10 years isn't true. In the last few years, we have finally seen a successor to the RAD750, the RAD5500. And it's multi core finally. But rad hard technology does move at a glacial pace and is 10-20 years behind commercial.
Our problem doesn't have to do with radiation hardening. In fact, we don't even need radiation hardened cores -- our problem is actually due to design decisions that have been made in modern cores that sacrifice determinism for speed.
Maybe it is true for him. This isn't my area, but the Wikipedia article for the RAD5500 says it's for "high radiation environments experienced on board satellites and spacecraft," but the grandparent only stated a need for tolerance to "modest radiation exposure (avionics)." Maybe the RAD5500 is overkill for his application, and there actually hasn't been anything new (in his niche) for 10 years.
Well, a semiconductor company can design a general purpose processor and sell millions or billions of units.
Or, it can design a safety critical and rad-hard chip and sell you maybe a couple thousand? :-)
At least, there is now the option to get defense-grade FPGAs and throw some radiation hardening techniques (triple-modular-redundancy, scrubbing, etc) on it.
I don't understand why you would radiation harden a chip...?
Why not just have three exactly identical chips, run the exact same code on them, and compare results? If results ever differ, you power cycle the bad one, self test it, and you're good to go again.
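To make the proposal concrete, here is a rough sketch of the compare step in Python. The `vote` function and its return shape are my own invention for illustration, not anything from a real flight system, and it assumes the voter itself never fails.

```python
from collections import Counter

def vote(results):
    """Majority-vote over the outputs of three redundant units.

    Returns (agreed_value, faulty_indices). Raises if no two units agree.
    Assumption: results are hashable and the voter itself is fault-free.
    """
    if len(results) != 3:
        raise ValueError("expected exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three units disagree")
    # Units whose output differs from the majority are candidates for a power cycle.
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty

agreed, to_reset = vote([42, 7, 42])
```

The indices in `to_reset` are the units you would power cycle and self test. Note that the sketch has nothing to say when all three disagree, and it says nothing about what happens if the voter takes the hit.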
Sure you have to not use on-chip random number generators, and make sure you don't have any non-deterministic things in your silicon (eg. things that depend on PLL locking time), but that sounds pretty trivial.
Because there are two distinct types of radiation effects in space: ionizing and upsets. Multiple computers voting solves the upset problem, but not the ionizing one. Ionizing dose is measured in krad: how much radiation a part can be exposed to over its life and keep working. Ionizing radiation changes semiconductors in fun ways (e.g., gate voltage thresholds shift). Rad-hard parts are fabbed in exotic ways like silicon on sapphire. When you want a satellite to last 20 years at geostationary orbit, you account for this! For example, I believe the Juno probe to Jupiter's life will be limited by ionizing radiation.
Three systems might not fit into your design or be too expensive for what the project is. How do you combine the votes? The chip that is doing that work needs to be absolutely immune from any bit-flips otherwise the voting is worthless...
There are three main things to protect against with radiation hardening: lifetime dosage, latch-ups, and upsets.
Chips can only go for so long being blasted with heavy particles until they cease to function properly. Smaller, newer technology is more affected by this (thicker traces and larger gates might last longer), so designers tend to use older chip designs when making new radiation-hardened stuff.
This design choice also leads to less chance of latch-ups. Large particles can short out traces and cause the chip to halt and draw a lot of current. Larger tech makes this less likely to happen.
Upsets are a lot different. You have to mitigate these a few different ways, an easy one being thick layers of insulating substrate to isolate gates better. This can help with latch-ups too. Next are things like memory voting and error correction, methods that can be done in software. One layer up from that are redundant systems, which are very robust against bit-flips but add more complexity and cost to the system.
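As a rough illustration of memory voting done in software, here is a sketch that keeps three copies of a buffer and scrubs them with a bitwise majority vote. The function names and the three-list layout are assumptions made for the example; real systems often use ECC instead of, or on top of, this.

```python
def bitwise_majority(a, b, c):
    """Return the value whose every bit matches at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)

def scrub(copies):
    """Repair three redundant copies of a buffer in place.

    `copies` is a list of three equal-length lists of ints (hypothetical layout).
    Returns the number of words that had to be rewritten.
    """
    repaired = 0
    for i in range(len(copies[0])):
        voted = bitwise_majority(copies[0][i], copies[1][i], copies[2][i])
        for copy in copies:
            if copy[i] != voted:
                copy[i] = voted
                repaired += 1
    return repaired

original = [0xA5, 0x3C]
copies = [list(original), list(original), list(original)]
copies[1][0] ^= 0x10  # simulated upset in copy 1, word 0
copies[2][1] ^= 0x01  # simulated upset in copy 2, word 1
fixed = scrub(copies)
```

Because the vote is per bit, it survives upsets in different copies of the same word as long as they hit different bit positions. Two copies flipping the same bit before the next scrub will win the vote with the wrong value, which is why scrub interval matters.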
Redundant elements at the system or circuit level are usually part of radiation hardening.
>Why not just
The problem with 'just' is that it's usually not true. Hardening versus redundancy is a typical safety engineering problem where the details and the balance between different requirements, verification and maintenance are important.
When I was doing this for military jets the concern was staying in the air during a nuclear explosion and the subsequent EM field that comes out of that. You could have 30 processors and they would all die at the same time, but the one with radiation hardening would still be chugging along.
> Why not just have three exactly identical chips, run the exact same code on them, and compare results? If results ever differ, you power cycle the bad one, self test it, and you're good to go again.
Easier said than done :) But yes, that is a common solution.
For automotive they're starting to make interesting things like dual-core CPUs running in lock-step. The outputs are compared at each cycle and a fault is declared if they differ. The problem is that doesn't generally allow you to keep running (perhaps a reset if that helps). They still like to rely on fallbacks like manual steering and braking, and the fact that you can pull over to the side of the road when something quits working. None of that is relevant for fully autonomous self-driving cars where the driver may be reading a book and not ready to take over quickly.
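A toy model of the lock-step idea, in Python for readability: two "cores" are just callables fed the same input each cycle, and a fault is declared on the first mismatch. `LockstepFault` and `run_lockstep` are names I made up, and real lock-step parts do this comparison in hardware, not software.

```python
class LockstepFault(Exception):
    """Raised when the two cores produce different outputs for the same input."""

def run_lockstep(core_a, core_b, inputs):
    """Feed the same inputs to both cores and compare outputs every cycle.

    Returns the agreed outputs, or raises LockstepFault at the first divergence.
    """
    outputs = []
    for cycle, value in enumerate(inputs):
        out_a = core_a(value)
        out_b = core_b(value)
        if out_a != out_b:
            # With only two cores there is no way to tell which one is wrong.
            raise LockstepFault(f"outputs diverged at cycle {cycle}: {out_a!r} != {out_b!r}")
        outputs.append(out_a)
    return outputs

def healthy(x):
    return x * 2

def glitchy(x):
    # Simulated single-event upset: wrong answer for one particular input.
    return x * 2 + 1 if x == 3 else x * 2

good_run = run_lockstep(healthy, healthy, [1, 2, 3])
```

Running `run_lockstep(healthy, glitchy, [1, 2, 3])` raises at cycle 2. That is the limitation mentioned above: two cores can detect a fault but cannot decide who is right, so the system stops or resets instead of carrying on.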
What do you mean starting? Bosch has been using Infineon TriCores in ECUs for almost 10 years now.
Friendly advice - stay away from Infineon offerings, they are still alive in automotive due to inertia and top down decision making, not because they are any good.
To your comment on Infineon... I thought the TriCore was stupid, but my German boss at the time insisted. We're using one of their SoCs today, but it's actually very very good for what we're doing and it's fairly cheap. We could get cheaper but that would require the addition of other parts on the board. If we can secure the volumes we may go custom - and that's something I haven't seen in person in automotive (I'm beginning to hear about it though).
Your chip is better than Gaisler's Leon4-FT? Or for what reasons was it not good enough? The European Space Agency funded Gaisler's stuff for their use. I'm not a hardware engineer, but I still collect info for others.
Note: If speed isn't a big deal, there's the formally-verified AAMP7G from Rockwell and the old 1802 from Intersil.
An FPGA running a soft core was out of the question? There are plenty of rad-hard FPGAs out there. Combine it with some rad-hard flash or eeprom and you've got what's flying in a lot of satellites these days.
Consumer hardware folks often think about performance envelopes and product lifetime.
"We can run this $gadget at $somany GHz and most of them will fail at 4 years, or we can get to 5 years if we reduce the number of gigglehertzes." (Imagine software people gnashing teeth here, but they always do, so ignore them).
Then it might turn out that the models were conservative, so the hardware folks could say "Every machine gets an additional 50 gigglehertzes" (note that nobody tells the consumer about that lifetime projection, though).
It's not just silicon, of course. It's the nature of hardware to die: Electrolytic caps go dry, or coils buzz themselves open, or electromigration murders a metal run on a chip, or fans seize, or thermal cycling cracks some solder. Memento mori, don't expect your grandchildren to play on your current generation game console.
They're really solid state, i.e. they're transistors, not vacuum tubes. But "digital" components are not always digital ("digital" is a leaky abstraction).
The metal atoms actually move over time simply as a result of electrical current flow! Parts will fail because of this, depending on how they're designed.
https://en.m.wikipedia.org/wiki/Electromigration
It's not just structures within silicon you need to worry about. Lead-free solder is susceptible to growing tin whiskers [0], which can cause short circuits. Often you can't even see the growth of these whiskers as they're under a BGA or on a microscopic level.
IIRC, NASA still requires all space hardware to be manufactured with leaded solder to avoid having a multi-million/billion dollar mission ruined by some tin whiskers.
The same is required for many projects that sit in exceedingly cold environments. In these situations, leaded solder is required to ensure that the solder joints don't decay (or crumble away completely) from tin pest [0] in the low temperatures.