Why Chips Die (semiengineering.com)
97 points by Lind5 on Nov 25, 2018 | 30 comments



Seems like a very light and theoretical article. I remember reading papers produced by IBM and HP where they analyzed chip failures - and they were much more specific in terms of both deep analysis and proximate cause (heat being essentially the overwhelming proximate cause.)


Do you have any references that you can share?


Wilhelm G. Spruth's "The Design of a Microprocessor" has some interesting material on the entire design-to-ship lifecycle of a chip, including failure analysis. I have some newer books in my library on this kind of thing as well that I'd have to go re-familiarize myself with, but that one is striking in that it is first-hand knowledge of a crown-jewels kind of project, not academic work or the account of an outsider.


Thank you. It seems like a great read! I have obtained the pdf.


I used to read a lot of different hardware journals back when I worked for a semiconductor manufacturer (including lots of unpublished analysis from our big customers). I don't remember any of them anymore.


Electrical design seems to have a lot of dead ends where we can't go any further. Does anyone have any insights as to what the current status is for transistors based on light? I would like to get started with chip design (I did microchip programming in CS); is it even feasible to achieve something in that field? Are there any good places/papers to start with?


> Does anyone have any insights as to what the current status is for transistors based on light?

It's a good question, but there is something often ignored in press-releases about photonic transistors: How exactly does a speedup occur?

For example: Photons do not interact with each other, so a photonic transistor will rely on electrons or holes to effect some sort of interaction between photons. That interaction might occur in a photorefractive material, but ultimately the photorefraction is a result of photons interacting with electrons of the underlying material. So why are the electrons in the photorefractive material faster than the electrons in a conventional transistor? It also might be worth noting that the fastest fT in conventional transistors is around 0.5 THz, so the bar is not particularly low.

FD: I have a patent on photonic transistors from Bell Labs days.

FD2: I have become an old cynic on photonic transistors.

FD3: The above is really a rant on high-speed transistors. Photonic transistors might well have a superiority in different areas, eg quantum computing.


fT for the fastest transistor probably isn't the parameter limiting electronic circuits - it's more about wire length and capacitance, and the power density of a circuit. That's true both inside circuits and outside them, at the "memory wall" - which, on the surface at least, seems particularly fitting for photonics.


> it's more about wire length and capacitance

At the fastest speeds, interconnect on crucial lines is engineered as a transmission line, where the inductance balances out the capacitance. When you look at the propagation in the transmission line, the energy of the traveling wave is dominated by the electric and magnetic fields outside the metal. What is not generally appreciated is that transmission-line interconnect is already photonic, and moving at photonic speeds. As you might expect, for most transmission-line geometries, the speed of the signal is not terribly far from the speed of light in the material.
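
To put rough numbers on that, here's a quick back-of-the-envelope sketch in Python; the dielectric constants are purely illustrative (an SiO2-like oxide versus a low-k film), not figures for any particular process:

    # Signal speed on a transmission line vs. light in vacuum.
    # eps_r values are illustrative: SiO2-like oxide vs. a low-k film.
    import math

    C0 = 299_792_458.0  # speed of light in vacuum, m/s

    def propagation_speed(eps_r):
        """Phase velocity of a TEM-like wave in a uniform surrounding dielectric."""
        return C0 / math.sqrt(eps_r)

    for eps_r in (3.9, 2.7):
        v = propagation_speed(eps_r)
        print(f"eps_r = {eps_r}: v = {v:.3g} m/s "
              f"({v / C0:.0%} of c in vacuum), ~{v * 1e-6:.0f} um per ps")

Roughly half to two thirds of vacuum light speed either way, which is exactly the "already photonic" point.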

Moreover, the intuition that "metal wires" are slowing down the signal is not taking into account that an "all photonic transmission" still requires some sort of confinement. If you were to eliminate the metal, you would still need some sort of waveguide to confine the photonic energy. Inevitably, the waveguide will have some region of a higher dielectric, which as you know will slow down the propagation and also have some loss. As it turns out, the final propagation speed is close to a transmission line.

So again, it's important to quantify exactly how the speedup occurs by switching to "photonic interconnect", because the reality is that it is already photonic.

Where photonic interconnect might have had a use is where the lines are very long, and the loss associated with a transmission line gets impractically large due to the confining metals. At that point, it's an engineering tradeoff: Converting to and from photons at each end of the line is non-trivial, and not all materials lend themselves to converting photons to electrons. There have been efforts since the 80s to put GaAs on silicon just to address this issue. (I was part of a team that got to 100MHz for GaAs-on-Si. The fact that GaAs-on-Si photonic transmission is still pretty much confined to the lab tells you everything you need to know about the manufacturability.)


Thank you. That was most interesting !


Thanks for the response!

Did you end up using that patent in any product? What happened to it?


To my knowledge, the concept was never used in a product. I left Bell before any kind of prototyping etc., so I don't have any idea whether it was pursued in my absence (but I'd guess probably not).


Optical switches are not a contender for general computing. Scaling beyond the diffraction limit (approximately half the wavelength) requires plasmonics, which is arguably more lossy than traditional transistors. If anything ever supplants CMOS, it will still likely be based on electron transport.
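
To put rough numbers on that scale mismatch, here's a quick sketch; the telecom wavelength (1550 nm), the silicon index (n ~ 3.5), and the ~50 nm gate pitch are all just illustrative order-of-magnitude assumptions:

    # Rough scale comparison for the diffraction-limit argument.
    # All numbers are illustrative: 1550 nm telecom light, a silicon
    # waveguide (n ~ 3.5), and ~50 nm standing in for a modern gate pitch.
    wavelength_nm = 1550.0
    n_si = 3.5
    half_wavelength_vacuum_nm = wavelength_nm / 2
    half_wavelength_in_si_nm = wavelength_nm / (2 * n_si)
    gate_pitch_nm = 50.0

    print(f"lambda/2 in vacuum:  ~{half_wavelength_vacuum_nm:.0f} nm")
    print(f"lambda/2 in silicon: ~{half_wavelength_in_si_nm:.0f} nm")
    print(f"modern gate pitch:   ~{gate_pitch_nm:.0f} nm "
          f"(the in-silicon limit is still ~{half_wavelength_in_si_nm / gate_pitch_nm:.0f}x larger)")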

Electrons have a lot of great properties for switching. They are massive enough to be localizable into small spaces, but light enough to be cheap (energy-wise) to accelerate up to high speeds. They are ubiquitous, and they interact strongly with matter, electric fields, and magnetic fields.

Light is great for signal transport over long distances in a large part because it interacts so weakly with matter. But that same property makes it difficult to switch.


Not really what I expected. My cousin worked as a chemical engineer for an onshore chip manufacturer in the '70s/'80s doing polymer research into chip packaging material, and the general impression I got from informal discussion was that the main long-term enemy was water and gas (air) infiltration, and that ceramic chips were essentially inert compared to plastic DIP material. She was happy when her theoretical model indicated the pins on the outside of the chip would corrode off before most interiors would contaminate and fail; I have no idea if the team working on pin metallurgy had a similar goal of not corroding until her plastic seal failed, LOL. She also said something about EPROM UV windows being her nemesis or impossible or something similar.

Possibly what she, as a polymer chemist, saw as the main long-term enemy of chip life was not the overall system's main enemy of long-term chip life.


My friend studied chip design and was taught how to design for planned failure after X years, curiously without any ethical considerations. Are these techniques actually used in practice?

(That was my first thought after reading 'death by design')


I work in chip design, and we certainly don't design for planned failure, but we do design for 20-year reliability. This means we over-design to ensure reliability up to 20 years. If a market does not need such longevity then it would be reasonable to design for less time, to ensure you are not over-designing. Chip design is all about trade-offs - power, area, performance, reliability, flexibility, integration cost, high/low-temperature operation, yield, etc. - and designing for 20 years certainly affects those trade-offs.


What's the upper limit of longevity testing, say for space use? 100 years?


Good question, it depends on a few factors. E.g. BTI (bias temperature instability) is aging that alters transistor performance over time when a transistor has a fixed bias, so if you've got loads of margin for this it should not impact you, or you can mitigate it by periodically altering the bias to recover. TDDB (time-dependent dielectric breakdown) is an issue at higher-than-normal voltages, so if the environment is controlled this can be avoided. For high-reliability situations older process nodes are usually used, which are less susceptible to these failures, and the environment is well controlled. Space has other issues like a much higher probability of random bit flips, so design for space is challenging. Note our 20-year lifetime also assumes worst-case conditions, so I'd wager many of our chips would last much longer in practice.


What does changing the bias to recover look like in practice? Any resources on this?


I'm not sure about resources but I can explain the basics. BTI happens when there is a constant voltage across the gate/source of a transistor for a long time. An example for a digital circuit could be an inverter held in one state. Either the PMOS or NMOS will have a large Vgs, and over time this bias will shift the threshold voltage (Vt). Now if this inverter's delay is critical, BTI will cause the delay to change and could break timing. To fix it, if the inverter is toggled after some time it will mostly recover to its normal delay. Some people mitigate it by having very low-frequency toggling for 'off' gates, but usually only if those gates are critical/matched. Alternatively people just margin the timing for the worst case. Analog circuitry has to be designed for it also, and it's an issue in differential pairs for example.
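
Here's a toy numerical sketch of that qualitative behavior in Python. The power-law stress term, the exponential recovery, and every constant below are made-up illustrative assumptions, not a calibrated BTI model:

    # Toy BTI model: constant gate bias slowly shifts Vt; removing the stress
    # (toggling the gate) lets most of the shift relax. The power-law stress
    # term, exponential recovery, and all constants are illustrative only.
    import math

    A = 5e-3         # stress prefactor, V (made up)
    N = 0.2          # stress-time exponent (typical BTI fits are ~0.1-0.25)
    TAU_REC = 500.0  # recovery time constant, hours (made up)

    def delta_vt(stress_hours, recovery_hours=0.0):
        """Vt shift (V) after DC stress, followed by an optional recovery window."""
        shift = A * stress_hours ** N
        return shift * math.exp(-recovery_hours / TAU_REC)

    always_on = delta_vt(stress_hours=5 * 8760)                     # ~5 years of DC bias
    toggled = delta_vt(stress_hours=5 * 8760, recovery_hours=1000)  # same stress, then relaxed

    print(f"dVt after 5 years of constant bias: {always_on * 1e3:.0f} mV")
    print(f"dVt after the same stress plus a recovery window: {toggled * 1e3:.0f} mV")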


There's another issue: metastability. In some designs with multiple clock domains it's not possible to create absolutely reliable chips - things fail (dynamically, usually without any damage to the chip, though there's actually a minute chance of runaway power-eating cascades of failure, something we try to avoid).

For example, the graphics back end of a video card runs its framebuffer memory at a different clock rate from the video dot clock, so at some point pixel data has to move from one clock domain to the other. A metastability failure might cause an occasional bad pixel on the screen; you can do the math and trade latency (and more gates) for reliability so that you see that pixel burble once a year.

Other times it could be worse - I worked on a chip where we did the math on whether the PCI interface would suffer synchroniser failure and what the worst-case failure was (maybe bus lockup?). In the end the boss signed off on once a year ... which at the time was ~100 times the mean Win95 uptime.

So we do sort of do that math, knowing that you can't 'fix' metastability issues, just make them rare. The sketch below shows roughly what that math looks like.
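
The usual form of that math is a synchronizer MTBF estimate, MTBF = exp(t_res / tau) / (Tw * f_clk * f_data). Here's a Python sketch; tau, Tw, the per-stage timing overhead, and the clock rates are all made-up illustrative values, not figures from the chip described above:

    # Synchronizer MTBF sketch: MTBF = exp(t_res / tau) / (Tw * f_clk * f_data).
    # All device and timing constants below are illustrative, not from a real process.
    import math

    TAU = 100e-12    # metastability resolution time constant, s
    TW = 100e-12     # effective metastability capture window, s
    F_CLK = 200e6    # receiving clock domain, Hz
    F_DATA = 50e6    # rate of asynchronous input transitions, Hz
    OVERHEAD = 2e-9  # clk-to-q + routing + setup eaten out of each period, s

    SECONDS_PER_YEAR = 3600.0 * 24 * 365

    def mtbf_years(n_stages):
        """MTBF when the synchronizer chain allows n_stages clock periods to resolve."""
        t_res = n_stages / F_CLK - OVERHEAD
        mtbf_s = math.exp(t_res / TAU) / (TW * F_CLK * F_DATA)
        return mtbf_s / SECONDS_PER_YEAR

    for n in (1, 2):
        print(f"{n} synchronizer stage(s): MTBF ~ {mtbf_years(n):.1e} years")
    # Each extra flop costs one clock of latency but multiplies the MTBF by
    # exp((1 / F_CLK) / TAU), about e^50 here - which is why "make it rare" works.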


It's not designed to fail at point X, it's designed for proper operation until at least point X. You always add some safety margin so that the failure rate within the designed life is low enough. After that, most of the devices may still work for a long time.


Well, I was told once that exposing a chip to stress-generating heat (for example, running a game with a bad-quality cooling fan) drastically reduces its life span, and will also make it slower.

Apparently chips "age", but they still function because of error mitigation, which is witnessed as a performance drop.

To me it's the only explanation for why I see so many computers turn slow, despite the fact that I have reinstalled them, cleaned them, etc. I have argued many times against the "just put in an SSD and add RAM, Windows 10 is just slower" advice, but in my mind, I just cannot explain why a newer OS becomes so much slower, memory-hungry and unresponsive, even when running very basic programs.

I think there is a myth that chips will always perform exactly the same as long as they work. Chip engineering is much more complicated than it seems, and I cannot trust any IT support person telling me to "upgrade".

I'm sure military-grade chips have different designs and cooling requirements, which doesn't turn them into domestic/obsolete products after 3 years.


You are mistaken; this is not a myth. Synchronous digital chips have a clock with a fixed frequency, and their behavior is fixed. So if it works, a chip performs exactly the same, regardless of age. If something's slower, it's software or peripherals. Probably software (and that's why some people are so upset: hardware is getting faster, yet observed performance keeps degrading. I have to wait 10 minutes for my computer before I can start working, every day).


The cooling system ages - dust accumulates and cooling paste becomes less effective - which in turn can cause more frequent throttling of the CPU. (Technically a peripheral to the chip.)


This is an important and often overlooked aspect of system aging. Heat dissipation is the primary performance bottleneck of modern CPUs (in truly CPU-bound tasks). There's a reason liquid nitrogen/helium cooling can enable dramatic overclocking.


Not entirely true either. Transistor aging is real and is becoming more relevant at the latest nodes. But what is important to remember is that a slower transistor is still being operated in a circuit switching at the original frequency, and the result of that is incorrect calculations.

It won't take any longer to start up, it just won't start up in a usable state. Startup/tasks taking longer is definitely the software industry's doing.

https://semiengineering.com/transistor-aging-intensifies-10n... for more.


A few months ago I pulled out a machine with an AMD Athlon XP 2200+ processor that had lain unused for four or five years. It had difficulty booting, failing to POST a couple of times, and BSODing early in the Windows 2000 boot process several times, even once almost finishing booting. But the fascinating thing about it was that the clock frequency the BIOS was reporting was 532MHz, which is only double the FSB (266MHz), rather than the intended 13.5× (1800MHz).

After about ten attempts, I was no longer able to get it to POST at all. So I pulled it apart and scavenged its disk’s magnets, which was what I had been planning on doing anyway.


Failed capacitors, corrupted/bad CMOS data - nothing to do with aging silicon.


Yes, this CPU generation falls right into the golden age of the great capacitor plague.



