DRAM errors vastly more frequent than previously thought

proee · on Oct 6, 2009

I've worked for a major DRAM company and it's a wonder that the parts are reliable at all. The storage cells are being pushed to their limit as they are often the bottleneck in scaling the overall size of the bit structure. Not to mention the engineers are pressed to the max to get them out the door, often before they are fully qualified. Throw in the millions of process steps that are required to make the part, and the fact that your computer boots at all is remarkable.

Then throw in a high speed bus, processor and disk array....

uiohnuipb · on Oct 6, 2009

And try and convince a programmer that it's possible that their program's memory can be wrong.

They understand in theory but refuse to code for the possibility. Especialy when you get into HPC and there are clusters of 50-60 machines with 4Gb each, the chance of not having corrupt memory is almost 0.

jbellis · on Oct 7, 2009

> And try and convince a programmer that it's possible that their program's memory can be wrong. They understand in theory but refuse to code for the possibility.

Because the hardware can still detect multi-bit errors, just not transparently correct them. So you shut the machine down automatically until you get new dram installed.

Programmers _are_ coding for machines-will-fail-temporarily, but coding to "handle" random memory errors instead of buying the right ECC hardware would be insane.

Psyonic · on Oct 7, 2009

I'm honestly curious... what kind of defensive programming techniques could you use to try and deal with this?

dmm · on Oct 7, 2009

Do everything twice and make sure the results match.

pyre · on Oct 7, 2009

What happens when the code doing the comparison becomes corrupted? Do the comparison twice? What happens when the code controlling the evaluation of both comparisons becomes corrupted?

Your data and your instruction set are in the same memory. Even if they are separated into different areas of memory to prevent buffer overflow exploits, it's all still in memory. Once the memory starts going, you're kind of screwed. It's the same as how -- in respect to computer security -- once someone has physical access to the machine, you're screwed.

With respect to memory errors in distributed environments, usually such environments are distributed to increased the processing power for number crunching. If you run all calculations twice and have code comparing them for acceptance, you're more than doubling your processing requirements.

But at the end of the day, it's all a matter of what level of risk is acceptable (or tolerable). There is no magic bullet to fix these issues.

wmf · on Oct 7, 2009

You're ignoring that the voting code would be a very small fraction of your RAM and thus less likely to be corrupted. But it's academic since no one runs twice to avoid the cost of ECC.

pyre · on Oct 7, 2009

I realize that. My point is that there is no 100% solution.

ppereira · on Oct 7, 2009

The IBM z9 processor, if I remember correctly, can do the comparison in hardware. When the cores consistently fail to match, the system can probably also call the IBM service technician.

moe · on Oct 7, 2009

The z-series stuff is indeed some amazing piece of kit.

Yes, the machine can and does call the technician when something fails. And iirc everything, including the CPUs, is hot-swappable. That means you can physically remove a CPU-book (containing processors and RAM) and your OS will keep running.

Quite a nerds dream, if you have the spare change...

gnaritas · on Oct 7, 2009

Yea, programmers don't do that because in the vast majority of programs written, this would be absurd, costly, and not at all worth the effort.

mey · on Oct 7, 2009

To be safe you need to do everything 3 times and call a vote. Typically with 3 complete systems of identical nature.

pyre · on Oct 7, 2009

Well, once you've done it twice and the results don't match you'll probably re-run it a third time. It wouldn't make sense to just choose the result that 'makes sense' at that point.

btilly · on Oct 7, 2009

OK, you run it twice and the results differ. You go back and look at what the starting state should be and the starting states differ. Where do you get the definitely correct data from for the third run?

basugasubaku · on Oct 7, 2009

I remember djb exhorting people to buy ECC memory and supported motherboards back in 2001 (http://cr.yp.to/hardware/ecc.html) for their standard workstations and me thinking he must be somewhat crazy since no one else seemed to be making a fuss about it.

Perhaps he was right after all.

jrockway · on Oct 7, 2009

He is right.

Have you ever had fsck detect errors on a filesystem that you haven't abused? Guess what, that's memory corruption -- saved to disk forever.

I remember having a machine with especially flaky memory (memtest86 failed in about 30 seconds)... I detected it because dpkg's database was corrupted enough for it to cause errors in the application. I never even tried to save that filesystem...

dkarl · on Oct 7, 2009

As a user of hardware, not a hardware engineer, I wonder if I've seen these errors. Crashes -- I don't see any of those except the ones related to web browsers. I do a lot of long compiles and don't see any crashes from gcc. Corrupted data -- well, I've had several large downloads this year that didn't match the advertised md5 sums. I redownloaded and got matching checksums. Does any of this have to do with DRAM errors? There are lots of other potential sources of error in my computers. I could name half a dozen off the top of my head, but I'm sure I would only prove that I'm ignorant of another half dozen that are an order of magnitude more important than the ones I named. I await an answer to one question: who should care about these DRAM errors? Does that group include me?

tillulen · on Oct 7, 2009

Did you check whether the downloads had been truncated? Downloads finishing prematurely is a frequent cause for a checksum mismatch.

pmorici · on Oct 7, 2009

What's the difference between a "hard" and "soft" error that the article mentions?

Kadin · on Oct 7, 2009

I've heard the term used in two different ways, and I'm not sure which way is the 'correct' one:

In one usage, "soft" errors are ones that are 'caught' and transparently fixed by ECC, and thus have no effect (on a system that has ECC memory). "Hard" errors, by contrast, are ones that affect multiple bits and aren't corrected by ECC.

In the other usage, which I think is the more technically correct one, a "soft" error is a transient condition (bit flipped by cosmic ray, etc.) and the memory cell continues to operate normally on the next cycle. A "hard" error is where the cell is basically stuck in one state or another, and indicates that it's probably time to replace the module. I think you detect a "hard" error by looking for a series of "soft" errors, although maybe some architectures/chipsets detect the difference and report them in different ways...?

If anyone can substantiate either set of definitions, I'd be interested as well.

wmf · on Oct 7, 2009

Soft errors are temporary (e.g. a bit flip caused by cosmic rays) so rebooting will eliminate them.

Hard errors are permanent (e.g. a bit is always bad) and that's when you throw the DIMM away.

Herring · on Oct 7, 2009

by definition you can recover from a soft error, eg by automatically retrying a calculation