Cosmic Rays Flipping Bits (johndcook.com)
26 points by tentacleuno on Sept 22, 2021 | 13 comments



Full error correction/parity checking seemed to go out of fashion in the 1990s because people started cheapskating on memory. We even had dummy parity chips on memory to fool parity checking. I was involved in many wars with suppliers at the time to stop them supplying me with 'fake'/pretend-parity-is-working memory. It was a constant battle until I issued a tightly written memory specification that was sent to memory suppliers with orders, and even then some still tried to fool me, or their suppliers tried to fool them.

It seems to me that if particles flipping bits is now becoming a significant problem, then perhaps we ought to make memory with full error correction as a matter of course, as we once did long ago. It makes sense to do things properly even if it costs a little more.

(Incidentally, the memory/particle debate was had in the 1980s, and supporters of parity checking like me eventually lost out because memory seemed reliable enough. I've never doubted that error/parity checking would eventually come back when chip densities reached a certain point. Perhaps we're getting close to that point now.)
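To illustrate why pretend-parity memory defeats the check, here is a minimal Python sketch. It assumes the fake modules regenerate the parity bit from whatever data is read back instead of storing it at write time; the function names are hypothetical and this models the idea only, not any particular chip.

```python
def parity_bit(byte: int) -> int:
    """Even parity: 1 if the byte has an odd number of set bits, else 0."""
    return bin(byte & 0xFF).count("1") & 1


def read_real(stored_data: int, stored_parity: int) -> bool:
    """Real parity memory: compare parity of the data read back with the
    parity bit that was stored when the byte was written."""
    return parity_bit(stored_data) == stored_parity


def read_fake(stored_data: int) -> bool:
    """Assumed 'fake parity' behaviour: the parity bit is regenerated from
    the data at read time, so the comparison can never fail."""
    regenerated = parity_bit(stored_data)
    return parity_bit(stored_data) == regenerated


data = 0b10110010
parity = parity_bit(data)       # stored alongside the data on a real module
corrupted = data ^ 0b00000100   # a single bit flips while in memory

print(read_real(corrupted, parity))  # the real module notices the mismatch
print(read_fake(corrupted))          # the fake module reports everything is fine
```

Note that even real parity only detects an odd number of flipped bits and cannot say which bit flipped; correcting errors needs a stronger code.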


I mean the only place where full ECC is not common in consumer computers is DRAM.

Every CPU cache layer has ECC and the registers have checksums. SSDs use ECC. It's hard to say ECC has fallen out of fashion when the number of subsystems using it has increased.

DRAM's lack of commonplace ECC is a real problem. I think it draws enough attention that maybe things will start steering in the right way soon.


ECC DRAM (not parity but at least SECDED) has been the norm in servers for 40 years, and memory mirroring and other RAID-like schemes that add redundancy across entire chips, word lines, etc. have been around for decades too.

Even the next consumer DRAM specification (DDR5) will have internal ECC for the memory arrays. Reliability and yields quite likely mean that even for cheap consumer-grade devices it will become cheaper to manufacture cells with lower reliability and cover them with redundancy at higher levels (this has been the case for NAND almost since the beginning).

For the past several decades, ECC on consumer DRAM hasn't really been required. Non-ECC memory is cheaper and "good enough". I'm guessing software and disks (HDDs, removable media, even early SSDs) were almost certainly the biggest causes of reliability problems, data loss, and corruption.


Didn't LPDDR3/4/5 also have on-chip ECC?


Just wanted to point out previous HN discussions on this topic if anyone is interested:

Cosmic rays and computers (1998) (nature.com) https://news.ycombinator.com/item?id=25218687


There are much better articles on the subject than this one. Many are linked from Wikipedia: https://en.m.wikipedia.org/wiki/Soft_error


Veritasium recently did a good video on this topic: https://m.youtube.com/watch?v=AaZ_RSt0KP8


Somewhat related, there's a pretty fun & interesting case being made [1] that a flipped bit might have helped during a Mario 64 speedrun.

1. https://www.thegamer.com/how-ionizing-particle-outer-space-h...


There was a good Defcon talk about some security implications of this phenomenon (specifically, DNS squatting) around a decade ago: https://www.youtube.com/watch?v=aT7mnSstKGs
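For context, the idea is that a single flipped bit in a domain name held in memory can turn it into a different, still valid name that an attacker has registered. Here is a rough Python sketch that lists such one-bit neighbours; the function name is hypothetical, `example.com` is just a placeholder, and only the first label is varied.

```python
import string

# Characters allowed in a DNS hostname label (letters, digits, hyphen).
VALID = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_candidates(domain: str) -> list[str]:
    """Return domains whose first label differs from `domain` by one flipped bit."""
    name, dot, rest = domain.lower().partition(".")
    found = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()  # DNS is case-insensitive
            if flipped == ch or flipped not in VALID:
                continue
            candidate = name[:i] + flipped + name[i + 1:]
            # Labels may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            found.add(candidate + dot + rest)
    return sorted(found)


for d in bitsquat_candidates("example.com"):
    print(d)
```

Flips that only change letter case are skipped because they resolve to the same name.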


This is a documented case of a bit flip, not a documented case of a bit flip caused by a cosmic ray. I'm still skeptical that cosmic rays actually cause issues for computers inside Earth's atmosphere.


If you want to learn about it, I'm sure there are dozens of research studies that go over it. ECC is expensive, and so are the efforts made to reduce "cosmic ray" interference. Generally those costs, in the software world, have good, measured reasons for their existence.


Parent is saying that bit flips are not necessarily caused by cosmic rays, not that bit flips don't exist.

RowHammer is a good example: bit flips that can be triggered through specific memory access patterns.


I don't believe a RowHammer attack can be precise enough to target a single bit.



