
I'll run memtest86 for a full day when I build a system or get new RAM. I've never seen an error outside of the situation where a stick has gone bad and it errors like crazy. I don't use ECC other than on my homelab server either.

Tons of people do this and have no issues with bit flips or we'd be hearing about it.




In my experience, after handling many dozens of systems (out of 3000) with ECC errors, it's quite rare for the problem to be reproducible in memtest86. In production, the same machines often log many errors per day. Our alarm threshold is over 100 correctable errors or 1 uncorrectable error per day. After a machine produces zero errors in 24 hours of memtest86, I usually see more errors once it's returned to production. I'm not sure whether it's heat/cooling related, the access pattern, or maybe that memtest86 tests the memory value but not the memory address. I.e., if you fill memory with 0xdeadbeef, you'll read 0xdeadbeef back successfully even if the address is off.
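To make the value-vs-address point concrete, here is a toy Python sketch. It says nothing about what memtest86 actually does; it only models the failure mode I'm speculating about. `FaultyMemory` and its stuck address bit are a made-up fault model, not real hardware behavior.

```python
class FaultyMemory:
    """Toy RAM whose address bit `stuck_bit` is stuck at 0 (hypothetical fault model)."""

    def __init__(self, size, stuck_bit):
        self.cells = [0] * size
        self.mask = ~(1 << stuck_bit)

    def _decode(self, addr):
        # The faulty decoder drops one address bit, so two addresses share a cell.
        return addr & self.mask

    def write(self, addr, value):
        self.cells[self._decode(addr)] = value

    def read(self, addr):
        return self.cells[self._decode(addr)]


def constant_pattern_test(mem, size, pattern=0xDEADBEEF):
    """Fill every address with the same value, then verify. Returns failing addresses."""
    for addr in range(size):
        mem.write(addr, pattern)
    return [addr for addr in range(size) if mem.read(addr) != pattern]


def address_in_address_test(mem, size):
    """Write each address's own number into it, then verify. Returns failing addresses."""
    for addr in range(size):
        mem.write(addr, addr)
    return [addr for addr in range(size) if mem.read(addr) != addr]


if __name__ == "__main__":
    mem = FaultyMemory(size=16, stuck_bit=2)
    print("constant pattern failures:", constant_pattern_test(mem, 16))
    print("address-in-address failures:", address_in_address_test(mem, 16))
```

The constant fill passes because every aliased cell holds the same value anyway, while the address-in-address pass fails on the cells that were overwritten through their alias.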

DIMMs can "fail" but still have low error rates. Stories are pretty common. LinusT had one in the last year: he spent something like a week tracking it down and assumed it was a new kernel error. Tracking down memory errors without ECC is a painful process that involves ruling out numerous other possible causes of crashes.

Sure, bit flips are easy to ignore, but weird things do happen. Linux crashes, processes crash, suddenly things are weird, the desktop doesn't work like you expect. People assume that "the system is buggy" or it's related to patching, or some other user space error, but sometimes it's just a bit flip.

Sadly, there are few cases where a bit flip is obvious, but if you keep 10k files around for a decade, it's pretty common to find a few corrupted. Various weird behaviors have been tracked down to memory issues. For years, GCC had an error, an internal compiler error, that was nearly always triggered by a memory problem. One famous speedrun of a game had something very weird happen, which was tracked to a single-bit memory error.

Sure, can you get away without ECC? Certainly. But the minimal price increase is, in my opinion, quite reasonable. I'd much rather get a "single bit error corrected on dimm #3" message in dmesg than occasional (or not so occasional) process, kernel, or filesystem crashes.
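For anyone who wants to watch those corrected-error counts rather than grep dmesg, here is a rough Python sketch of the threshold check I described above. It assumes the Linux EDAC driver is loaded and exposes `ce_count` and `ue_count` per memory controller under `/sys/devices/system/edac/mc/`. The function names are mine, and reading "over 100 correctable or 1 uncorrectable" as "more than 100" and "at least 1" is my interpretation.

```python
from pathlib import Path

# Thresholds from the comment above; the exact comparison is an assumption.
CE_PER_DAY_LIMIT = 100  # alarm when MORE than this many correctable errors per day
UE_PER_DAY_LIMIT = 1    # alarm when AT LEAST this many uncorrectable errors per day


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Sum correctable/uncorrectable counters over all memory controllers under root."""
    ce = ue = 0
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        ce += int((mc / "ce_count").read_text())
        ue += int((mc / "ue_count").read_text())
    return ce, ue


def should_alarm(yesterday, today):
    """Compare two (ce, ue) snapshots taken roughly 24 hours apart."""
    ce_delta = today[0] - yesterday[0]
    ue_delta = today[1] - yesterday[1]
    return ce_delta > CE_PER_DAY_LIMIT or ue_delta >= UE_PER_DAY_LIMIT


if __name__ == "__main__":
    # Prints the cumulative counts since the driver was loaded.
    print(read_edac_counts())
```

You'd snapshot the counts once a day (cron or similar) and feed consecutive snapshots to `should_alarm`. The counters are cumulative, so a reboot or driver reload resets them and the delta needs handling for that case.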

I'm typing this from a 2015 system with a Xeon E3-1230 CPU that was cheaper than the equivalent i7 but has ECC support. Sure, I spent a bit more on the motherboard and DIMMs, something around $120. Money well spent. I find system crashes quite disruptive; even one fewer per year is valuable to me.


My server is quite similar, an E3-1275L v3 with 32GB of ECC from a trashcan Mac Pro. I had been planning to migrate my i9 desktop into that role at some point, but since it doesn't support ECC, you're making me second-guess that option. Maybe I'll eBay the lot and get a Ryzen. ;)


I believe Alder Lake and newer Intel desktop chips support ECC, if you get the right motherboard/chipset.

Desktop Ryzens have supported ECC for several generations, but ECC support varies by chipset. It's often discussed on the AMD forums. Some Ryzen motherboards mention the support and certify ECC DIMMs.

So ECC is possible, but annoying.


Anecdotes are not real world statistics. Most people wouldn't identify a flipped bit of memory that caused a web page to glitch in their browser as a hardware problem. They'll just write it off as general enshittification of the web.



