I'll run memtest86 for a full day when I build a system or get new ram. I've nev...

sliken · on May 11, 2023

In my experience, after handling many dozens of systems (out of 3000) with ECC errors, it's quite rare for the problem to be reproducible in memtest86. Running in production often triggers many errors per day. Our alarm threshold is over 100 correctable errors or 1 uncorrectable error per day. After a system produces zero errors in 24 hours of memtest86, I usually see more errors once it is returned to production. Not sure if it's heat/cooling related, the access pattern, or maybe memtest86 tests the memory value, but not the memory address. I.e. if you fill memory with 0xdeadbeef you'll read 0xdeadbeef successfully even if the address is off.
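To make the value-versus-address idea concrete, here is a toy Python sketch. It is only an illustration of the hypothesis, not a description of what memtest86 actually does. The `FaultyMemory` class, the stuck address bit, and both test functions are made up for the example.

```python
class FaultyMemory:
    """Toy RAM model where one address bit is stuck at 0 (hypothetical fault)."""

    def __init__(self, size, stuck_bit):
        self.cells = [0] * size
        # Clearing the stuck bit makes two addresses alias to the same cell.
        self.mask = ~(1 << stuck_bit)

    def _decode(self, addr):
        return addr & self.mask

    def write(self, addr, value):
        self.cells[self._decode(addr)] = value

    def read(self, addr):
        return self.cells[self._decode(addr)]


def constant_pattern_test(mem, size, pattern=0xDEADBEEF):
    """Fill every address with the same value, then count mismatches."""
    for addr in range(size):
        mem.write(addr, pattern)
    return sum(1 for addr in range(size) if mem.read(addr) != pattern)


def address_pattern_test(mem, size):
    """Store each address as its own value, then count mismatches."""
    for addr in range(size):
        mem.write(addr, addr)
    return sum(1 for addr in range(size) if mem.read(addr) != addr)


if __name__ == "__main__":
    size = 16
    print("constant pattern mismatches:",
          constant_pattern_test(FaultyMemory(size, stuck_bit=2), size))
    print("address pattern mismatches:",
          address_pattern_test(FaultyMemory(size, stuck_bit=2), size))
```

The constant fill can't see the aliasing because every cell holds the same value. Writing each address as its own data exposes it, since the aliased writes overwrite each other.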

DIMMs can "fail" but still have low error rates. Stories are pretty common: LinusT had one in the last year, spent something like a week tracking it down, and assumed it was a new kernel error. Tracking down memory errors without ECC is a painful process that involves ruling out numerous other possible causes of crashes.

Sure, bitflips are easy to ignore, but weird things do happen. Linux crashes, processes crash, suddenly things are weird, the desktop doesn't work like you expect. People assume that "the system is buggy", or that it's related to patching or some other user space error, but sometimes it's just a bitflip.

Sadly there are few cases where a bit flip is obvious, but if you keep 10k files around for a decade it's pretty common to find a few corrupted. Various weird behaviors have been tracked down to memory issues. For years GCC had an error, reported as an internal compiler error, that was nearly always ... triggered by a memory problem. One famous speed run of a game had something very weird happen, which was tracked to a single bit memory error.

Sure, can you get away without ECC? Certainly. But the minimal price increase is, in my opinion, quite reasonable. I'd much rather get a "single bit error corrected on dimm #3" message in dmesg than occasional (or not so occasional) process, kernel, or filesystem crashes.
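On Linux, when an EDAC driver is loaded for the memory controller, the corrected and uncorrected error counts are also exposed under /sys/devices/system/edac/mc/, so you don't have to grep dmesg. Below is a rough Python sketch of a daily check against the thresholds mentioned earlier. The function names are made up, and I'm assuming the threshold means "more than 100 correctable or at least 1 uncorrectable" per day. The counters are cumulative and reset on reboot, so a real check would need to handle that.

```python
from pathlib import Path

CE_LIMIT = 100  # assumption: alarm when MORE than 100 correctable errors per day
UE_LIMIT = 1    # assumption: alarm on 1 or more uncorrectable errors per day


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Sum ce_count / ue_count across all memory controllers under root."""
    totals = {"ce": 0, "ue": 0}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        for kind in totals:
            counter = mc / f"{kind}_count"
            if counter.is_file():
                totals[kind] += int(counter.read_text().strip())
    return totals


def should_alarm(previous, current):
    """Compare two snapshots taken a day apart (hypothetical helper)."""
    ce_delta = current["ce"] - previous["ce"]
    ue_delta = current["ue"] - previous["ue"]
    return ce_delta > CE_LIMIT or ue_delta >= UE_LIMIT


if __name__ == "__main__":
    # With no EDAC driver loaded this simply prints zeros.
    print(read_edac_counts())
```

You'd save the snapshot once a day (cron, a systemd timer, whatever) and call `should_alarm(yesterday, today)`.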

I'm typing this from a 2015 system with a Xeon e3-1230 CPU that was cheaper than the equivalent i7, but has ECC support. Sure, I spent a bit more on the motherboard and DIMMs, something around $120. Money well spent. I find system crashes quite disruptive; even one fewer per year is valuable to me.

duffyjp · on May 12, 2023

My server is quite similar, an e3-1275L v3 with 32GB of ECC from a trashcan Mac Pro. I had been planning to migrate my i9 desktop into that role at some point, but since it doesn't support ECC you're making me second-guess that option. Maybe I'll ebay the lot and get a Ryzen. ;)

sliken · on May 12, 2023

I believe Alder Lake and newer Intel desktop chips support ECC, if you get the right motherboard/chipset.

Desktop Ryzens have supported ECC for several generations, but ECC support varies by chipset. AMD forums often discuss it. Some Ryzen motherboards mention the support and certify ECC DIMMs.

So ECC is possible, but annoying.

bcrl · on May 10, 2023

Anecdotes are not real world statistics. Most people wouldn't identify a flipped bit of memory that caused a web page to glitch in their browser as a hardware problem. They'll just write it off as general enshittification of the web.