
> 1) Use ECC memory

Not exactly. When I was in telco, the place I had this problem was in FPGAs. We had all ECC memory and I never linked any problems to bit flips in RAM. As I remember, the FPGAs we had used a type of SRAM cell to hold their programming, but because that isn't a memory module it had no ECC, so the FPGA programming could bit flip. The product therefore had a checksum function that read back the program on a cycle and reset itself if the program no longer matched the checksum. We would see 1-2 crashes / restarts per week in our FPGAs that we believed were bit flips.
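To make the readback-and-reset idea concrete, here is a minimal sketch in Python. It is not the vendor's actual mechanism: the `Board` class, its method names, and the use of CRC-32 as the checksum are all assumptions for illustration.

```python
import zlib


class Board:
    """Hypothetical stand-in for a board with an SRAM-configured FPGA."""

    def __init__(self, golden: bytes):
        self.golden = golden                    # known-good program image
        self.expected = zlib.crc32(golden)      # checksum of the good image
        self.config = bytearray(golden)         # live (upsettable) config SRAM
        self.resets = 0

    def flip_bit(self, bit_index: int) -> None:
        """Simulate a single-event upset in the live configuration."""
        self.config[bit_index // 8] ^= 1 << (bit_index % 8)

    def reset(self) -> None:
        """Reload the whole program, as a crash/restart would."""
        self.config = bytearray(self.golden)
        self.resets += 1

    def check_cycle(self) -> bool:
        """Read back the program; reset if it no longer matches the checksum."""
        if zlib.crc32(self.config) != self.expected:
            self.reset()
            return False
        return True
```

Calling `check_cycle()` periodically gives the behaviour described above: a flipped bit goes unnoticed until the next readback, at which point the board restarts with a clean program.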

We then ran an analysis on any of these that had higher than expected error rates to try to identify actually bad hardware and replace it.
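One way such an analysis could look, as a sketch only: treat restarts per board as Poisson-distributed around the fleet median and flag boards whose count would be very unlikely under that rate. The Poisson model, the median baseline, the threshold, and the function names are my assumptions, not what we actually ran.

```python
import math
import statistics


def poisson_tail(k: int, mean: float) -> float:
    """P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def flag_suspect_boards(restarts: dict, threshold: float = 0.001) -> list:
    """Return board ids whose restart count is improbably high.

    `restarts` maps a board id to its restart count over the same period.
    The fleet median serves as the expected (background bit-flip) rate.
    """
    baseline = statistics.median(restarts.values())
    return sorted(
        board
        for board, count in restarts.items()
        if count > baseline and poisson_tail(count, baseline) < threshold
    )
```

For example, with a fleet where most boards restart 5-7 times in a period and one restarts 30 times, only the last one is flagged. The median is used rather than the mean so that a few bad boards don't inflate the baseline.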

I think the vendor eventually came up with a way to reprogram the FPGA without just crashing and rebooting the entire board.




Many modern FPGAs now include dedicated logic for config SRAM "scrubbing." This logic continuously checks config frame checksums to identify upsets. These can then be fixed in real-time either using the error correction properties of the checksum technique, or from the non-volatile config memory (typically NOR flash). It's also important to note that only a subset of the SRAM config bits are critical for a given application. Usually this is a small percentage of the overall array.
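As a rough software model of the scrubbing loop (the real thing is dedicated hardware), the sketch below checks a CRC per config frame and rewrites any upset frame from a golden copy, standing in for the NOR flash path. The frame size, class name, and use of CRC-32 are assumptions for illustration; real devices use their own frame formats and often ECC that can correct in place.

```python
import zlib

FRAME_SIZE = 16  # hypothetical frame size in bytes


def split_frames(image: bytes, size: int = FRAME_SIZE) -> list:
    """Split a config image into mutable fixed-size frames."""
    return [bytearray(image[i:i + size]) for i in range(0, len(image), size)]


class Scrubber:
    def __init__(self, golden: bytes):
        # Golden copy plays the role of the non-volatile config memory.
        self.golden_frames = split_frames(golden)
        self.crcs = [zlib.crc32(frame) for frame in self.golden_frames]

    def scrub(self, live_frames: list) -> list:
        """Check every frame once; repair upset frames in place.

        Returns the indices of the frames that were repaired.
        """
        repaired = []
        for i, frame in enumerate(live_frames):
            if zlib.crc32(frame) != self.crcs[i]:
                frame[:] = self.golden_frames[i]  # rewrite only this frame
                repaired.append(i)
        return repaired
```

The point of working per frame is that only the damaged frame gets rewritten while the rest of the design keeps running, instead of reloading the whole device.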

https://www.xilinx.com/support/documentation/application_not...

If even higher levels of reliability are needed, there are rad-hard-by-design FPGA families (e.g. Xilinx Virtex-5QV). These have a special config SRAM cell that has more charge storage sites than a conventional SRAM cell. It is less area efficient than a conventional SRAM cell, but the geometry of the charge storage sites ensures that a single cosmic ray can't flip the state of a majority of them. Essentially the cell can self-correct, no scrubbing required.
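The majority idea can be illustrated with a toy model. This is not the actual circuit: the three-node cell, the class name, and the explicit `settle` step are assumptions chosen to show why a single upset node can't change the stored value.

```python
def majority(nodes: list) -> bool:
    """True if more than half of the storage nodes hold a 1."""
    return sum(nodes) * 2 > len(nodes)


class RedundantCell:
    """Toy model of a config bit stored on several charge storage nodes."""

    def __init__(self, value: bool, n_nodes: int = 3):
        self.nodes = [bool(value)] * n_nodes

    def upset(self, index: int) -> None:
        """Simulate a particle strike flipping one storage node."""
        self.nodes[index] = not self.nodes[index]

    def read(self) -> bool:
        """The cell's value is whatever the majority of nodes hold."""
        return majority(self.nodes)

    def settle(self) -> None:
        """Model the self-correction: the majority restores the upset node."""
        value = self.read()
        self.nodes = [value] * len(self.nodes)
```

With three nodes, one strike leaves two nodes agreeing, so `read()` still returns the original value and `settle()` restores the third. It would take simultaneous upsets on a majority of nodes to corrupt the bit, which is what the cell geometry is meant to rule out for a single particle.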


Interesting, and that makes sense. Do you have any additional references you would suggest, particularly in the context of FPGAs?

Would you say this quick reference is a good overview? https://www.intel.com/content/dam/www/programmable/us/en/pdf...


Sorry, I should have mentioned this was quite a few years ago, so I'm very out of date and don't have any known-good references handy. That link you shared seems pretty good on a quick scan through, and in line with what I remember. I'm pretty sure I dug up similar resources for other vendors, including one that I think was looking at satellite hardware.


Compaq/HP handled this with many redundant resources / cores: https://en.wikipedia.org/wiki/Tandem_Computers



