Ask HN: Why don't transistors in microchips fail?
194 points by franciscop on July 9, 2015 | 111 comments
Considering that a Quad-core + GPU Core i7 Haswell has 1.4e9 transistors inside, even given a really small probability of one of them failing, wouldn't this be catastrophic?

Wouldn't a single transistor failing mean the whole chip stops working? Or are there protections built-in so only performance is lost over time?




There is another mechanism called "Single Event Upset" (SEU) or "Single Event Effects" (SEE) (basically synonymous terms). This is due to cosmic rays. On the surface of the earth, the effect is mostly abated by the atmosphere - except for neutrons. As you go higher (say on a mountaintop, in an airplane, or in space) it becomes worse, because other charged particles are no longer attenuated by the atmosphere.

The typical issue at sea level is from neutrons hitting silicon atoms. If a neutron hits a nucleus somewhere in the microprocessor circuitry, the nucleus recoils, basically leaving an ionizing trail several microns long. Given that transistors are now measured in tens of nanometers, that ionizing path can cross many nodes in the circuit and create some sort of state change. Best case it happens in a single bit of a memory that has error correction and you never notice it. Worst case it causes latchup (a power-to-ground short) in your processor and your CPU overheats and fries. Generally you would just notice it as a sudden error that causes the system to lock up, you'd reboot and it would come back up and be fine, leaving you with a vague thought of, "That was weird".
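
For a flavor of how the "memory that has error correction" case absorbs a single flipped bit, here is a minimal Hamming(7,4) sketch in Python. Real ECC DRAM uses wider SECDED codes over 64-bit words, but the principle is the same; the nibble and the flipped bit position below are just illustrative.

    # Minimal single-error-correcting Hamming(7,4) code.
    def hamming74_encode(d):
        """d: 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(c):
        """c: 7-bit codeword, possibly with one flipped bit -> corrected data bits."""
        p1, p2, d1, p3, d2, d3, d4 = c
        s1 = p1 ^ d1 ^ d2 ^ d4                  # recompute the three parity checks
        s2 = p2 ^ d1 ^ d3 ^ d4
        s3 = p3 ^ d2 ^ d3 ^ d4
        syndrome = s1 + (s2 << 1) + (s3 << 2)   # 0 = clean, else 1-based error position
        if syndrome:
            c = list(c)
            c[syndrome - 1] ^= 1                # flip the offending bit back
        return [c[2], c[4], c[5], c[6]]

    # Simulate a single-event upset on a stored nibble.
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    stored[5] ^= 1                              # neutron strike flips one stored bit
    assert hamming74_decode(stored) == word     # the read-out data is still correct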


How often does this sort of thing actually happen in real life? Or rather, what's the chance that some given computer will experience one of these events in its operational lifetime (or, if the chance is actually high enough, how many such events would it be expected to see on average given a lifespan of several years)?


Somewhere in the range that your laptop will almost certainly never see even a single event, but a very large datacenter or colo will have multiple events a month.

There is a lot of disagreement on bitflips from ionizing radiation. They are unequivocally real, and unequivocally very rare. Even when they do happen, a large portion of the chip is dark a lot of the time, and a lot of the live data in the chip is simply thrown away and never used. (Think prefetching) Some bits, if flipped, will break something but will not corrupt the disk and the machine will be able to recover.

Nobody really knows for certain exactly how big of a problem they are and how often they happen- it's all statistics, and it depends on things like where on the globe your computer is, what your building is made of, and what phase of the solar cycle we are in. It even depends on workload. Anybody who claims to know for certain...


Try multiple times a second. This guy made a hobby of taking advantage of cosmic-radiation bit flips that send DNS lookups to the wrong domain, registering those one-bit-off domains and capturing the resulting traffic.

http://dinaburg.org/bitsquatting.html


Multiple times a second, if your pool of hardware is "all the internet connected hardware in the world"! Neat experiment.

Also, FWIW that experiment will include people subject to bit errors in DRAM, not just in the CPU - and I would even guess that bit errors are more common in DRAM than SRAM given their electrical characteristics (a tiny floating capacitor vs. two inverters driving each other).


Bit flips aren't solely caused by radiation - they can also be caused by clock skew or a failing crystal oscillator on an overheating router or something...


When I talked to somebody at Blue Waters (petascale supercomputer at UIUC), she told me that they had uncorrectable errors once or twice per day, even with ECC. Blue Waters has 22640 compute nodes that contain 16 cores each (we'll ignore the GPU nodes). So even if your typical home computer had 16 cores, you would expect 12 hours * 22640 = 31 years between uncorrectable errors.

Caveats: most computers don't have ECC, and I don't remember if Blue Waters was completely installed when I visited.
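
Restating that arithmetic as a quick back-of-the-envelope (the error rate and node count are just the figures quoted above):

    # Per-node scaling of the Blue Waters numbers quoted above.
    NODES = 22_640
    ERRORS_PER_DAY = 2.0                 # "once or twice per day" across the machine

    hours_between_errors_per_node = 24 / ERRORS_PER_DAY * NODES   # ~271,680 hours
    years = hours_between_errors_per_node / (24 * 365)
    print(f"~{years:.0f} years between uncorrectable errors per 16-core node")  # ~31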


Heh. When I went to college one of my professors told us cosmic rays made computer systems with more than four megabytes completely impractical.

Oh, and get off my lawn.


> Oh, and get off my lawn.

Indeed, that had to be, what, in the early to late 70s? I remember the Cray X-MP in the early 80s supported up to 16MB and frequently came with 4 to start.


On CCD/CMOS sensors it can happen enough to be visible as dead columns or pixels. Pixels can be calibrated out.

Most notably, gamma rays were the suspected cause of dead columns that showed up while filming Superman a few years back. As a result, cameras were shipped by sea rather than by air to minimize the exposure (a bit of an overreaction to that particular incident, and not generally what happens).


Forgot to mention... in talking with NASA about deploying cameras on the space station, the issue of being able to fix pixel death caused by gamma rays is more relevant. Water is apparently one defense against it, but not so practical. How much water would you need?


Water has a halving thickness of ~18cm for gamma rays. So a fair bit.

Unfortunately, most other things are also of the same order of magnitude of ~20g/cm^2 - with gamma rays the single most important thing is just "how much mass is in the way". Which is exactly what you don't want.
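
To put that halving thickness in perspective, a quick calculation (the 18 cm figure is the one quoted above):

    # Water thickness needed for a given gamma attenuation factor,
    # given a halving thickness of ~18 cm.
    from math import log2

    HALVING_CM = 18.0

    for factor in (10, 100, 1000):
        thickness_m = HALVING_CM * log2(factor) / 100
        print(f"{factor:>5}x attenuation: ~{thickness_m:.1f} m of water")
    # ->   10x ~0.6 m,  100x ~1.2 m,  1000x ~1.8 m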


This is a good paper on this topic: http://dl.acm.org/citation.cfm?id=1555372.

Another slightly older paper with similar data: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115...


It happens often enough that NASA has asked FPGA synthesis tool vendors to implement an error-correction feature for this. Basically, when the state machine ends up in an illegal or error state, the system is reset to a known state.
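
A toy sketch of that pattern (the state encoding and the little machine here are made up for illustration; in real designs this lives in the HDL, not Python):

    # Any state encoding not explicitly handled - e.g. after an SEU flips a bit
    # of the state register - falls through to a reset to a known-good state.
    IDLE, BUSY, DONE = 0b001, 0b010, 0b100     # one-hot encoding

    def next_state(state, start, finished):
        if state == IDLE:
            return BUSY if start else IDLE
        if state == BUSY:
            return DONE if finished else BUSY
        if state == DONE:
            return IDLE
        return IDLE                            # illegal encoding: recover to known state

    # Simulate an upset flipping a bit of the state register mid-operation.
    state = BUSY
    state ^= 0b100                             # now 0b110: not a legal state
    state = next_state(state, start=False, finished=False)
    assert state == IDLE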


NASA also needs hardware to operate under extreme conditions. For example, anything that goes into orbit is exposed to significantly more ionizing radiation (because it's no longer protected by the atmosphere). I'd also imagine that they have a much lower tolerance for faults than your average consumer machine (because they're doing things that are much more critically important).


The figure I've seen quoted in the context of embedded systems is 1 bit flip a month.


> This is due to cosmic rays. On the surface of the earth, the effect is mostly abated by the atmosphere - except for neutrons.

So, if we had more hydrogen (either free or compounds) in the air this would not be the case, right?

The column of air on top of your head is equivalent (in terms of mass) to a column of water 10 meters tall with the same cross-sectional area. But the composition is quite different, of course - the only major component they have in common is oxygen.


Cosmic rays have energies in the tens to hundreds of MeV range, where pair production is the dominant attenuation mechanism. The probability of a photon inducing pair production in a material is roughly proportional to the square of the proton number, ergo cosmic rays don't give a shit about hydrogen (which has the smallest proton number possible). Even if the atmosphere were pure hydrogen gas at STP, the average distance traveled by a 20 MeV cosmic ray would be around 17 km. http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z01...

Hydrogen is pretty good at moderating neutrons down to thermal energies (eV range, ie room temperature) via elastic scattering, but gasses don't really have enough density to do a very good job. If you really want to protect something from neutrons you just coat it with boron. A mm coating of the stuff will keep out pretty much any common source of neutrons.


Right, I was thinking about the neutrons. Maybe something like methane would be more efficient than pure hydrogen. Anyway, it's just a thought.

It's very surprising to see how efficient boron is. I thought neutron shields (paraffin, water) are supposed to be very thick. Maybe boron does the job via a different mechanism?


Boron is almost exclusively an absorber, so the boron nucleus captures the neutron and it basically disappears. Hydrogen is primarily a moderator, so it reduces the energy via elastic scattering, but a significant number of very low energy neutrons still escape (some absorption to produce deuterium also occurs). Thermal neutrons can still cause damage, but they're easy to block with a subsequent thin layer of lead or something like that.

Boron's probability to capture a neutron is astronomically high, that's why you can get away with so little. Environmental sources of neutrons are actually pretty rare normally and most neutrons you do see will be pretty low energy and won't have a huge amount of penetrative power. A thin layer of boron will pretty much stop them. Pyrex (like the stuff baking dishes are made of, which is borosilicate glass - glass with boron added) is actually commonly used as control material in nuclear reactors.


US Patent 7309866 - Cosmic ray detectors for integrated circuit chips - Intel (applied for 2004-06, issued 2007-12)

http://www.google.com/patents/US7309866

Is Intel using any such things in their commodity chips?


Next time a surprising bug pops up on my server I'll just blame it on cosmic rays and reboot.


"Sunspots" is a very old sysadmin joke.


I actually have had one user take me seriously with that one. It's been a while, but I believe I attributed her problem to the phase of the moon as well.


I tend to deduce and eventually blame either gravity or gremlins. I was never wrong.


As process nodes shrink and this happens more often, perhaps we'll eventually have to move to a more probability-based model of computation, giving up classical predictability.


Well, this is what researchers were predicting 12-14 years ago. There was a lot of work in fault tolerant and probabilistic computer architectures as a result of this. (I believed the prediction and contributed to some of this work.)

The prediction was that chips below 32nm wouldn't work reliably and the only option would be to use these exotic architectures.

Well, here we are at 14nm and everything seems to be going okay.


As a former hardware engineer who worked on automated test equipment that tested ASICs (and did ASIC dev), there are a lot of different methods used to avoid this.

As others mentioned, most of these problems are caught when testing the chips. Most of the transistors on a chip are actually used for caching or RAM, and in those cases the chips have built in methods for disabling the portions of memory that are non-functional. I don't recall any instances of CPUs/firmware doing this dynamically, but I wouldn't be surprised if there are. A lot of chips have some self diagnostics.

Most ASICs also have extra transistors sprinkled around so they can bypass and fix errors in the manufacturing process. Making chips is like printing money where some percentage of your money is defective. It pays to try and fix them after printing.

Also, as someone who has ordered lots of parts there are many cases where you put a part into production and then find an abnormally high failure rate. I once did a few months of high temperature and vibration testing on our boards to try and discover these sorts of issues, and then you spend a bunch of time convincing the manufacturer that their parts are not meeting spec.

Fun times... thanks for the trip down memory lane.


In my previous job I did failure analysis on returned ASICs for a high-volume commercial application. As the parent says, because a modern ASIC has structural test coverage with built-in self-test for the overwhelming majority of its digital circuits, you don't ship a lot of parts with bad transistors. The vast majority of returned ASICs that I saw had failed due to electrical overstress, which is a polite way of saying the customer blew up the part.

But even more to the point, in modern electronics the ratio of problems caused by software or assembly-level issues is very high compared to ASIC-level hardware problems. High-volume ASICs are designed with relatively large margins to ensure good product yield. Further, the cost of investigating and root-causing an issue down to the transistor level is so high that, unless you have a measurable trend in the defect rate, any actual random failure not related to design would probably be attributed to something else first.

That is not to say that there are no HW design level issues with ASICs, but generally when they are discovered you would try and change the low level software to either make the problem not happen ever, or make it very very infrequent. You might also simply screen (test) the parts so you don't ship any that exhibit the undesired behavior.

So it's not that the kind of event the OP mentions can never happen; it's just so infrequent compared to all the other types of problems that can happen in modern electronics that, unless you have a very specific reason to believe it's being caused by hardware, you would never be able to differentiate it from a random software bug.


Well, not quite - people certainly add spare gates, but not for fixing individual errors. Instead you add maybe 1% extra gates; if you find a bug in your design you can redo the upper metal layers, using the extra gates to fix it. That changes ALL the chips you make, rather than fixing one bad transistor in a particular chip.


You're right, that was inelegantly written and kinda conflates two different things. Gates are spread around for fixing design errors as you describe. There is also often redundant logic and memory built in to allow fixing individual chips though.

Here's a quick presentation I found on laser repairs: http://www.ee.ncu.edu.tw/~jfli/memtest/lecture/ch07.pdf


Oh, they do fail.

The last time I worked with some hardware folks speccing a system-on-a-chip, they were modeling device lifetime versus clock speed.

"Hey software guys, if we reduce the clock rate by ten percent we get another three years out of the chip." Or somesuch, due to electromigration and other things, largely made worse by heat.

Since it was a gaming console, we wound up at some kind of compromise that involved guessing what the Competition would also be doing with their clock rate.


It is very interesting that they do that for cheap mass-produced consumer goods. I can understand making such tradeoffs in very expensive stuff that is expected to last for decades (industrial machines, space probes, etc.), but that the manufacturer cares enough about the lifetime of their goods beyond the minimum warranty period is somewhat surprising in this day and age.


If you want to minimize warranty expenses in order to maintain anything resembling profit, then you need to engineer your product so that the average useful life is well beyond your warranty period. The math is brutal: a solid net profit margin for a typical manufacturer is around 5-7%, so even a warranty claim rate of 5% would send you deep into losses. So the average durability of your product needs to be 2 standard deviations above your warranty period. Of course, not everyone takes advantage of warranties, so you might discount durability to account for that.
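
To make that concrete, a quick sketch assuming (as above) a roughly normal distribution of lifetimes; the one-year warranty and the half-year spread are made-up illustrative numbers:

    # Fraction of units failing inside the warranty for different mean lifetimes.
    from statistics import NormalDist

    WARRANTY_YEARS = 1.0
    SIGMA_YEARS = 0.5                       # illustrative spread of lifetimes

    for mean_life in (1.5, 2.0, 2.5):       # 1, 2 and 3 sigma above the warranty
        claims = NormalDist(mean_life, SIGMA_YEARS).cdf(WARRANTY_YEARS)
        print(f"mean life {mean_life} yr -> {claims:.1%} warranty claims")
    # 1 sigma: ~15.9% (ruinous at a 5-7% net margin), 2 sigma: ~2.3%, 3 sigma: ~0.1%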

Also, with regard to the GP's specific point about this being a gaming console: they want the product to last as long as possible. Each additional functioning unit in existence counts toward their installed base and increases the attractiveness to third-party developers.


> If you want to minimize warranty expenses in order to maintain anything resembling profit, then you need to engineer you product so that the average useful life is well beyond your warranty period.

Or the flip side, if you want to go that way. Higher reliability allows you to offer a longer warranty, attracting either higher prices or more customers.

This is part of how Japan made themselves mainstream car sellers in the US. They calculated that the cars had to be more reliable, because in the early days without a lot of infrastructure, recalls would be very expensive. So they made the cars reliable. Allowing them to offer longer warranties.

I just bought a two year old Toyota, after driving a Honda that I bought new for 25 years.


You are assuming a normal distribution of failures. A sufficiently evil company would design their failure curves to be as flat as possible until the warranty period expires and then rapidly increase to 100%.

The practice still makes no real long-term sense though. What do you do after `warranty_period` expires and no one wants to buy your products anymore?


Presumably, ceteris paribus, companies do attempt to design their failure curves to be as flat as possible. Otherwise they are wasting money on components which will survive longer than the whole product. (There is no One-Hoss Shay (http://holyjoe.org/poetry/holmes1.htm))


Well, it depends... even if an Xbox One has a -10% profit margin and a 10% failure rate, Microsoft can still make back their money on game licensing and Xbox Live fees.

For something like a car or a microwave this is not the case, but this is why the profit margin on a microwave is closer to 50% than 5%.


Be careful not to confuse gross profit with net profit. Gross profit is how much money you make less the raw cost of materials. Net profit is your final profit after accounting for labor, marketing, support, overhead, taxes, and so forth.

For example, take a look at the financials [1] for Black & Decker, a manufacturer of tools. For the quarter ending 4/4/2015, they had gross revenue of 2.6 billion. Less cost of goods sold, they have a gross profit of almost 1 billion, or about 61% gross margin. But then observe all of the numerous expenses and taxes they have, which causes their final Net Income (aka net profit) to drop to 162 million, or a net margin of about 9%.

9% can be considered an outstanding rate of return in this industry. If you could come up with a way to build a manufacturing company with such a return, you could have your own IPO.

[1] https://www.google.com/finance?q=NYSE%3ASWK&fstype=ii&ei=-Am...


Someone more cynical might conclude that they are interested in the MTBF because failures on the early end of the tail that ARE within the warranty period will cost them money.


Why be cynical? That's engineering. Real engineering is about materials, cost and time (if you leave one of those out, you're either doing research or just aimlessly puttering around ... maybe both :-) ). There are other dimensions, too. In the consumer market, engineering is also about the reliability that you want your customers to experience.

Companies that make console hardware want you to be happy; they're not going to ship you a hunk of hardware that generates a Warranty Expired Interrupt at one-year-and-one-day because they want you to keep buying games and services. Having to buy a new console is a hassle. Maybe you'll buy the competitor's console instead, who knows?

On the other hand, console margins (and yes, generally they have margins these days and are not sold at a loss) are razor thin. There are knife-fights in meetings over three cent changes to components because at production scale those pennies rapidly turn into millions of dollars. You don't make a console with a reliability of 20 years because it would cost way too much and be obsolete long before the failure curve started to inflect.

So you optimize the product lifetime for user expectation of value, how long you think the technology will remain relevant, what the market will bear, and a bunch of other things (chip sourcing, cost of manufacture, architectural headroom and so on). This involves hard-won experience, spreadsheets, testing, figuring out how vendors are lying to you, plane trips to godforsaken industrial parks, and fist-fights in hallways. It's awesome :-)

Are there companies selling stuff that will break the moment the warranty is over, or earlier? Sure; they are betting that actually getting warranty service is so inconvenient that you won't bother. On the other hand, having watched (from the outside!) the execution of a product-recall-scale warranty, I was impressed with the company's professionalism, and it didn't strike me as a company that didn't care about repeat customers. It sucks on both sides for something like this to happen, but there's very little cynicism involved.


Yes, that's all true and the use of the word cynical here is more nuanced. GP seemed to be impressed that a company was producing a quality product as an end in itself - providing consumer surplus for not necessarily any benefit to the company. A more cynical view is that they are doing that because it will produce a better outcome for them as well. Yes, it's a beautiful system and everyone wins, but, generally, altruism would be viewed as a higher motive than self-interest.

I can provide a different and less cynical interpretation for your example too. Companies that sell stuff that only lasts until the warranty is over are providing a valuable service for customers who don't want to pay as much as they would have to for a higher quality product which would outlast its warranty.


A console is often a loss leader, so the company may want it to last much longer. Ideally, you want them to last ~2 generations. So the used market keeps companies producing older generation games for as long as possible.


I wouldn't think it was that surprising. Businesses, especially consoles, want to be seen as reliable and good quality in order to develop a strong relationship with their customers. If it commonly fails right after the minimum warranty, then why would someone want to buy something again that is unreliable? They wouldn't and would go right to the competition.


The most interesting part is modeling the workload, and what you do with that. If you're selling a laptop, the odds the customer is running it 24/7 at maximum load are slim. In fact, it would be simply wasteful to engineer a laptop chip meant for always-on server average workloads- nobody benefits from that!

OK, so what is the average workload of a laptop? What is the real-world worst case workload of a laptop? Don't forget to consider that different software operates different pieces of the chip! Ok, now that you have all that information, start modeling the odds of a weak[1] chip ending up in the hands of a worst case user...

The rabbit hole is very deep.

[1] chips have a failure distribution, like hard drives


Real-world worst case? The laptop I have from work that, when told to hibernate, often wakes up while in my backpack, and then waits at the 'enter full-disk encryption code' prompt (with the lid closed, in a backpack)

I don't think there's any power management or throttling due to heat active at that prompt.

Luckily, the thing isn't plugged in at such times. Now, the laptop battery is dead after an hour or so.


The mistake there really lies with the laptop manufacturer, not the CPU manufacturer. Stupid motherboard firmware and stupid operating system drivers can undo any chip designer's hard work, and that's why laptops are so failure prone (especially where GPUs are concerned).


Yes, they can fail. Lots and lots of them fail immediately due to manufacturing defects. And over time, electromigration (where metal atoms in the interconnect get knocked out of position by momentum transfer from the flowing electrons) will slowly degrade performance. And sometimes they fail due to specific events like overheating or electrostatic discharge.

But the failure rate after initial burn-in is phenomenally low. They're solid state devices, after all, and the only moving parts are electrons.


I work as a software engineer for a chip manufacturer. The fab (silicon manufacturing company) gives only a 5 year guarantee for the smartphone/tablet chips (with presumably some allowance).

As years go by, the chip slowly degrades, and some of the high-performance chips start to run at higher temperatures, consume more power, need higher voltages, etc. The power management software counters this by keeping the clocks lower and the voltages higher, trading performance degradation over time for avoiding catastrophic failure.

When the same chips are used in products with higher reliability requirements, they are clocked down and more conservative power management software is utilized.

disclaimer: not my area of expertise, I work on something completely different than power management.


> The fab (silicon manufacturing company) gives only a 5 year guarantee for the smartphone/tablet chips (with presumably some allowance).

I'm not sure that bodes well for smart watches selling at 4+ figures.


> I'm not sure that bodes well for smart watches selling at 4+ figures.

I'm pretty sure the smart watch makers don't expect them to last for too many years, definitely not decades. After all, they want to be selling you a smart-er watch in just a few years.

This consumerism drives the whole thing, if your new watch was to last decades it would be designed in a whole different manner. And it's not only the chips, you won't be able to get a compatible display, battery, PCB or case or anything to replace a broken/worn out one in just a few years.

This sad state of consumerism is why I do woodworking to balance my mind. The pinewood dovetail box I built last week will still be there when I'm dead.


To be fair, if you don't use the latest and greatest manufacturing processes - which you don't really need to do in smart watches - chips can be very robust and long-lasting. Plus, given the battery requirements, you don't really want to use high-performance components in watches anyway, Apple's ridiculous battery life notwithstanding.

As for the whole market segment of "this watch will pass through generations", I guess the honest thing to say is that we just don't have that kind of experience with integrated circuits yet... besides, does this type of traditional watch never need repairs? They must have failures as well.


It's different for simple BT notification buzzers, but "maximalist" smartwatches like the Apple Watch surely call out for the latest and greatest semiconductor processes. They face harsh trade-offs between capability, size and battery life, harsh enough to help make them still marginal as mainstream consumer products, and those dilemmas would be significantly eased if performance-per-watt and size were improved. They're also high-margin products so manufacturing at fancy fabs should be affordable.


It'll be horribly obsolete in five years. It would cease to be a status symbol, like trying to impress someone by being able to get Outlook on your Blackberry. The purchasers at 4+ figures generally know this.


Now I understand why my iPhone gets slower after a year!


Honestly I'm pretty sure your iPhone slows down after a year because they build the software that way on purpose so as to incentivize you to buy a new one.


More likely they write software with the new hardware in mind, but, not wanting to fragment the market too much, they offer the newer software to people with older hardware, and along with that come some performance issues.


Which is what he said... they thought this over.


I am also pretty sure that Apple is intentionally adding more and more features to iOS to make it Do More Stuff so that they can sell more iPhones.


The iPhone ships with a serious memory handicap; the CPUs have been running just fine since the iPhone 4. That's been Apple's sales pattern for decades.

That said, it would be really interesting to see just how much the CPU actually degrades over time. I'd guess it's around a few percent.


the phone slowdowns are more likely due to flash storage filling up which makes new writes excruciatingly slow.


Yeah, this is a real concern for computer enthusiasts who heavily overclock their CPUs. Higher clock rates need more voltage to be stable, but higher voltages increase electromigration. So maybe bumping the voltage from 1.3v to 1.5v stops your computer from crashing, but it could also cause your CPU to fry itself in 2 years instead of 10.
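
For intuition, here's a rough sketch using Black's equation, the standard empirical model for electromigration lifetime; the activation energy, exponent, and operating points are generic illustrative values, not figures for any particular process:

    # Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T)).
    # We only look at the ratio between two operating points, so A cancels out.
    from math import exp

    K_EV = 8.617e-5      # Boltzmann constant, eV/K
    EA = 0.7             # activation energy, eV (illustrative)
    N = 2.0              # current-density exponent (illustrative)

    def relative_mttf(j_ratio, t_old_c, t_new_c):
        """Lifetime at the new operating point relative to the old one."""
        t_old, t_new = t_old_c + 273.15, t_new_c + 273.15
        return j_ratio ** (-N) * exp(EA / K_EV * (1 / t_new - 1 / t_old))

    # Overclock raises current density ~15% and die temperature from 60C to 85C:
    print(relative_mttf(1.15, 60, 85))   # ~0.14, i.e. roughly 7x shorter life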


I ran an i7 920 at 4GHz for roughly 2 years at around 1.35V with water cooling (max was 55C at full load on all 4 cores).

After the 2-year mark the chip became unstable, and over the next 6 months the clock speed it would reliably maintain dropped to 3.2GHz. On that progression it would have been below its default rated speed of 2.6GHz in presumably another 6 months, or perhaps have failed outright.

The current estimate is that Intel targets about 15 years of CPU life at the clock speeds they ship; overclocking can vastly decrease that.


>They're solid state devices, after all, and the only moving parts are electrons.

SSDs are solid state (duh) as well, and yet they degrade over time.


As far as I understand, chips are designed with many redundant pathways for exactly this reason. I think it's a rather important part of the design process, but someone who knows better should probably say.


They're designed for manufacturing defects - this process is called DFM (Design For Manufacturing) and usually revolves around performing a Critical Area Analysis for defects of a given size. Design software is used to determine what would happen if a speck of dust of a certain diameter landed on the die in given locations. Then, the critical areas are spaced out and moved around to attempt to balance design constraints with yield.

For large, expensive parts or parts in which a single common defect could easily blow the whole yield (for example DRAM, especially when embedded, CPUs with lots of cache or cores, and so on), regions (or rows and columns of memory) are generally fused off so that if one specific region fails qualification, it can be disabled without discarding the whole chip. This is the source of most 3-core CPUs, as well as the difference between most models in a single CPU family (they're often binned off based on how much of their L2 cache actually works).

However, once parts are manufactured and qualified, they're pretty much done. Some hardware has BISR (Built In Self Repair) but as far as I know it's not particularly common outside of DRAM.


Built-in self-repair is a key feature in old spinning-rust hard drives and, in particular, newer flash memory.


In neither case is the device "repairing" a failure like the ones posited here, though. Storage devices "repair" failures by detecting them and recovering the data using some form of ECC, and then rewriting it to a location that has not failed.

No one has yet figured out a way to have a shorted polysilicon feature un-short itself in situ. :)


Not exactly. Redundancy is designed into chip hardware not to account for cosmic rays striking the wrong trace, but for manufacturing defects.

For example, Nvidia runs its GTX 980 and GTX 970 production at the same time. The only difference is that GTX 970s can have up to 2 of their compute units non-functional.

This is very commonly done in the industry. Remember the Phenom dual, tri, and quad cores? They were the same chip; it was just expected that only 5% of produced chips would be fully featured quad cores, and the rest would be sold at other core counts. The same was done with the 27xx-series i5s, which were 34xx Core i7s with hyper-threading disabled due to yield issues on die shrinks.

If a single transistor fails, normally the whole thing dies.


A slightly related thing is random bit errors in RAM. There was an interesting article published a few years ago where some guy registered domains that differed by one bit from some popular domains and recorded the traffic that hit them. Kinda scary to think what else is going wrong in your RAM... Too bad that ECC is still restricted to servers and serious workstations.

http://dinaburg.org/bitsquatting.html
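
The core trick from that post, roughly sketched (only the second-level label is mutated here; the real experiment also had to check which candidates were valid, unregistered domains):

    # Generate every hostname that differs from a target label by one flipped bit.
    import string

    VALID = set(string.ascii_lowercase + string.digits + "-")

    def bitsquats(label):
        squats = set()
        for i, ch in enumerate(label):
            for bit in range(8):
                flipped = chr(ord(ch) ^ (1 << bit)).lower()
                if flipped in VALID and flipped != ch:
                    squats.add(label[:i] + flipped + label[i + 1:])
        return sorted(squats)

    for name in bitsquats("example"):
        print(name + ".com")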


Cool! The author of the post you linked, also linked to this paper https://www.cs.princeton.edu/~appel/papers/memerr.pdf. From the paper: "..experimental study showing that soft memory errors can lead to serious security vulnerabilities in Java and .NET virtual machines"


The first author in the princeton paper is my brother!!


Nearly all chips experienced transistor failures, rendering them useless, back in the day. Intel is the monster it is because they were the guys who first found out how to sorta "temper" chips to vastly reduce that failure rate (most failures were gross enough to be instant, back then, and Intel started with memory chips.) Because their heat treatment left no visible mark, Intel didn't patent it, but kept it as a trade secret giving them an incredible economic advantage, for many years. They all but swept the field. I've no doubt misremembered some details.


They're extremely simple, have no moving parts, and the materials and processes of semiconductor fabs are optimized to ensure they get made right. The whole chip will often fail if transistors are fabbed incorrectly, and the rest end up in errata sheets where you work around them. Environmental effects are reduced with silicon-on-insulator (SOI), rad-hard methods, immunity-aware programming, and so on. Architectures such as Tandem's NonStop assumed there'd be plenty of failures and just ran things in lockstep with redundant components.

So, simplicity and hard work by fab designers is 90+% of it. There's whole fields and processes dedicated to the rest.


Errata are usually caused by bugs in the logical design of the chip, not in the manufacturing or physical behavior of transistors. If you have a source for an errata that was issued due to systematically buggy transistors, I'd be curious to hear that story!


Can't the erratic behavior of the transistors (e.g. flawed ones overheating) cause their logical function to fail?

Past that question, I've still got plenty to learn about hardware and will take your word for it about errata sheets. Sounds right given the things described in them.


"Cant the erratic behavior of the transistors (eg flawed ones overheating)"

Yes but that is either a manafacturing defect (if persistant) or a transient error, or running out of speced tolerances. Or simply details of reality.

Whereas a logical design flaw, is more the actual design/implemenation, is fundementally wrong more akin to a bug.


I gotcha. Appreciate the tip.


Generally, yes, a failing transistor can be a fatal problem. This relates to "chip yield" on a waferfull of chips.

Faults don't always manifest as a binary pass/fail result; as chip temperature increases, transistors that have faults will "misfire" more often. As long as the temperature at which the faults show up is high enough, these lower-grade chips can be sold as lower-end processors that never reach those temperatures in practice.

I'm not aware of any redundancy units in current microprocessor offerings, but it would not surprise me; Intel did something of this nature with their 80386 line, though it was more of a labeling thing ("16 BIT S/W ONLY").

Solid state drives, on the other hand, are built around this kind of protection; when a block fails after too many write cycles, the controller retires and remaps that portion of the flash, drawing down its spare capacity but keeping the rest of the device going.
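
A toy sketch of that remapping idea (the class, block counts, and spare pool are made up for illustration; real flash translation layers are far more involved):

    # When a block wears out, steer its logical address to a spare block
    # instead of failing the whole device.
    class ToyFlash:
        def __init__(self, n_blocks, n_spares):
            self.remap = {}                                   # logical -> spare physical block
            self.spares = list(range(n_blocks, n_blocks + n_spares))

        def mark_bad(self, block):
            if not self.spares:
                raise RuntimeError("worn out: no spare blocks left")
            self.remap[block] = self.spares.pop()

        def physical(self, block):
            return self.remap.get(block, block)

    flash = ToyFlash(n_blocks=1024, n_spares=16)
    flash.mark_bad(7)                               # block 7 exceeded its write cycles
    print(flash.physical(7), flash.physical(8))     # -> 1039 8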


> Am not aware of any redundancy units in current microprocessor offerings

Sure there are. That's why Intel sells chips with a thousand different cache sizes, for example. Bad bit in the cache? Just turn that block off. Likewise for whole cores in some of the bigger chips, I believe.


The Xbox360 had four cores (IBM Power) but only three cores were enabled. And older AMD CPUs were released with 6 enabled cores out of 8 cores.


I spoke with the project lead for the XBox 360 chip, and I asked him: Why three cores? It's a strange number. He said it was to make Christmas.


Wow, that's interesting! Do you know if somebody tried to enable one of these "extra cores"?


Haha, of course they did.

http://www.tomsguide.com/forum/id-2142984/amd-phenom-710-4th...

AMD used to software-disable the 4th core on the Phenoms, but then they switched to disabling them by hardware, so that rendered any software means useless.

AMD only disabled the cores to save on manufacturing - most of the time, these cores were not working properly. But as the process got better, people got 4 or more cores for the price of 2-3 :-D


The PS3 disabled 1/8 SPUs for yield and you can try to reenable it: http://ps3devwiki.com/ps3/Unlocking_the_8th_SPE

Good luck if it turns out the disabled SPU is actually bad!


I seem to recall the PS3's Cell shipping with only 6 of its 8 SPEs available to games: one disabled because of the failure rate per wafer, and one reserved for the OS.


AMD was particularly fond of selling dual- and even triple-core CPUs that were quad-core chips with some cores disabled. And people were unlocking these cores for profit, because they speculated that AMD disabled cores for a whole batch even if only a single unit from it failed the test, so there could well be plenty of good cores disabled along the way.


Others have answered why; here is the 'what would happen'. Heat your CPU up by pointing a hair dryer at it (you may want to treat this as a thought experiment, as you could destroy your computer). At some point it begins to fail because transistors are pushed past their operating conditions. Another way to push it to failure is to overclock. The results are... variable. Sometimes you won't notice the problems; computations will just come out wrong. Sometimes the computer will blue screen or spontaneously reboot. And so on. It just depends where the failure occurs, and whether the currently running software depends on that part of the chip. If a transistor responsible for instruction dispatch fails, it's probably instant death. If a transistor responsible for helping compute the least significant bit of a sin() result fails, well, you may never notice.


I remember playing around with overclocking my old Pentium 4 and how it would boot fine into Windows, but then you'd run Prime95 on it and the benchmark would start failing because the FPU was returning incorrect results.


When I was studying EE, a professor said on this subject that about 20% of the transistors in a chip are used for self-diagnostics. Manufacturing failures are a given. The diagnostics tell the company what has failed, and they segment the chips into different product/price classes based upon what works and what doesn't. After being deployed into a product, I assume that chips would follow a standard Bathtub Curve: https://en.wikipedia.org/wiki/Bathtub_curve

As geometries fall, the effects of "wear" at the atomic level will go up.


Your dual core CPU is often the same quad core die with two of the cores disabled because one of the cores has a defect and is sold at a discount to increase yield.


I think the proportion is nowhere near that high for most ASICs - JTAG boundary scan adds a few percent to give you this testability, maybe 1-5%.


Full-chip scan provides logic that links EVERY flip-flop in a design into one or more scan chains (not just boundary scan, which checks pin bonding) - mostly it's a mux on each input - and the overhead is more like 5%ish.

You test by scanning in a bit pattern, issuing a single clock, and scanning out the result.

Smart software generates the minimal set of test vectors that tests every wire and gate between the flops.

Chip testers are expensive (million-ish), so minimising tester time minimises chip cost - we make special testing logic for things like SRAMs.
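
A toy sketch of that scan-in / clock / scan-out / compare loop (the three-flop circuit and the stuck-at fault are invented; real ATPG works on the gate-level netlist and picks a minimal pattern set):

    # Compare the captured next-state of the device under test against a golden
    # model for a set of scanned-in patterns; a defect shows up as a mismatch.
    from itertools import product

    def golden_logic(a, b, c):
        return (a ^ b, b & c, a | c)          # next-state logic of a made-up 3-flop circuit

    def faulty_logic(a, b, c):
        n0, _, n2 = golden_logic(a, b, c)
        return (n0, 0, n2)                    # middle output stuck at 0 (a defect)

    def scan_test(dut_logic):
        for pattern in product((0, 1), repeat=3):               # scan in each pattern
            if dut_logic(*pattern) != golden_logic(*pattern):   # one clock, scan out, compare
                return f"FAIL on pattern {pattern}"
        return "PASS"

    print(scan_test(golden_logic))   # PASS
    print(scan_test(faulty_logic))   # FAIL on pattern (0, 1, 1)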


As stated, two big variables are clock rate and feature size, which both affect mean time between failures (MTBF). Being more conservative increases this metric. I know from working in a fab that there are many electrical inspection steps along the process, so failures are caught during manufacturing (reducing the chance that you see them in the final product). Once the chip is packaged, and assuming it is operated in a nominal environment, failures are not that common.


Speaking of the effects of component failure on chips, a couple years ago researchers demonstrated self-healing chips [1]. Large parts of the chips could be destroyed and the remaining components would reconfigure themselves to find an alternative way to accomplish their task.

[1] http://www.caltech.edu/news/creating-indestructible-self-hea...


The 10-series Altera FPGAs are going to be a little like this, although it is more for SEUs and also IP segmentation/security. You will have lots of little islands, all of which can detect configuration errors, report them, and even recover using partial reconfiguration. Maybe someday we can make it smart enough to replace and reroute around physically bad LEs automatically.


Others have already mentioned one failure mechanism that causes transistor degradation over time: electromigration. Other important aging mechanisms are negative-bias temperature instability (NBTI) and hot carrier injection (HCI). I've seen papers claim that the dual of NBTI - PBTI - is now an issue in the newest process nodes.

This seems to be a nice overview of aging effects: http://spectrum.ieee.org/semiconductors/processors/transisto....


This is why we usually slightly underclock stuff that has to live on boats.


Boats specifically because there's a higher turnaround time to replace failing hardware?


Yeah, also corrosion and voltage spikes due to old/untuned gensets or even taking a lightning strike on the hull.


In 2011, Intel released the 6 series chipset with an incorrectly sized transistor that would ultimately fail if used extensively. A massive recall followed.

http://www.anandtech.com/show/4142/intel-discovers-bug-in-6s...


They do fail. Linus Torvalds talked about this in 2007 http://yarchive.net/comp/linux/cpu_reliability.html


So would that mean we need to ensure that systems in critical areas (not nuclear and the like, but banks and transaction-critical systems) get a mandatory tech refresh every 4-5 years? Especially once 7nm production starts.


> Considering that a Quad-core + GPU Core i7 Haswell has 1.4e9 transistors inside, even given a really small probability of one of them failing, wouldn't this be catastrophic?

Yes, generally speaking it would be. Depending on where it is inside the chip.

> Wouldn't a single transistor failing mean the whole chip stops working? Or are there protections built-in so only performance is lost over time?

Not necessarily. It might be somewhere that never or rarely gets used, in which case the failure won't make the chip stop working. It might mean that you start seeing wrong values on a particular cache line, or that your branch prediction gets worse (if it's in the branch predictor) or that your floating point math doesn't work quite right anymore.

But most of the failures are either manufacturing errors meaning that the chip NEVER works right, or they're "infant mortality" meaning that the chip dies very soon after it's packaged up and tested. So if you test long enough, you can prevent this kind of problem from making it to customers.

Once the chip is verified to work at all, and it makes it through the infant mortality period, the lifetime is actually quite good. There are a few reasons:

1. there are no moving parts so traditional fatigue doesn't play a role

2. all "parts" (transistors) are encased in multiple layers of silicon dioxide so that you can lay the metal layers down

3. the whole silicon die is encased yet again in another package which protects the die from the atmosphere

4. even if it was exposed to the atmosphere, and the raw silicon oxidized, it would make silicon dioxide, which is a protective insulator

5. there is a degradation curve for the transistors, but the manufacturers generally don't push up against the limits too hard because it's fairly easy and cheap to underclock and the customer doesn't really know what they're missing

6. since most people don't stress their computers too egregiously, the slide down the degradation curve is slow; it's largely governed by temperature, and temperature is driven by a) the higher voltage required for higher clock speeds and b) higher CPU utilization

Once you add all these up you're left with a system that's very, very robust. The failure rates are serious but only measured over decades. If you tried to keep a thousand modern CPUs running very hot for decades you'd be sorely disappointed in the failure rate. But for the few years that people use a computer and the relative low load that they place on them (as personal computers) they never have a big enough sample space to see failures. Hard drives and RAM fail far sooner, at least until SSDs start to mature.


Transistors don't fail for the same reason the 70 year old wires in my house don't fail. The electrons flowing through the transistors doesn't disturb the molecular structure of the doped silicon.


Sorry, but that's just plain wrong:

https://en.wikipedia.org/wiki/Electromigration

At the scale of your house wiring the effect is not so noticeable but for integrated circuits it is definitely a factor.

As for your house wiring, if it is really 70 years old you might want to worry about the insulation, not the copper.


What's wrong about it? Transistors don't work by electromigration.


Electromigration will move atoms on the interconnects between transistors and eventually cause an open.


He said transistors, not ICs. And electromigration is a problem with metal interconnects at high current densities. So you're talking solely about a failure type that can occur on some integrated circuits under some conditions, or when the designers screwed up - but which doesn't actually happen much in practice.

The person's actual question is roughly why transistors last so long compared to other kinds of mechanisms. No one in the comments made an attempt to answer that, at all.


Extremely good R&D done by semiconductor companies. It's frankly amazing how good they are.


They do.

That's why our boxen have power-on self tests.



