Were you around for enough DRAM generations to notice an effect of DRAM density / cell-size on reported ECC error rate?
I’ve always believed that, ECC aside, DRAM made intentionally with big cells would be less prone to spurious bit-flips (and that this is one of the things NASA means when they talk about “radiation hardening” a computer: sourcing memory with ungodly-large DRAM cells, willingly trading away memory capacity for a higher per-cell level-shift activation energy).
If that’s true, it would mean the per-cell error rate has actually been increasing over the years as DRAM cell size decreased, the same way shrinking cells and tighter voltage levels have increased error rates for flash memory. Combine that with the fact that we simply have N times more memory now, and you’d expect a roughly quadratic increase in faults compared to 40 years ago: per-cell rate up, and cell count up, multiplied together. But do we see one? It doesn’t seem like it.
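To make the arithmetic behind that worry concrete, here’s a toy back-of-envelope; every number below is invented, not measured:

    # Back-of-envelope for the "quadratic" worry (every number here is invented):
    # total faults/hour ~= (per-cell fault rate) * (number of cells),
    # and the worry is that both factors have grown over 40 years.
    bits_then = 64 * 1024 * 8      # ~64 KiB of DRAM, counted in bits
    bits_now  = 32 * 2**30 * 8     # ~32 GiB of DRAM, counted in bits
    rate_then = 1e-15              # hypothetical per-cell faults per hour
    rate_now  = 1e-13              # hypothetically 100x worse per-cell rate

    print(bits_then * rate_then)   # expected faults/hour then
    print(bits_now  * rate_now)    # expected faults/hour now: both factors multiply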
I’ve also heard a counter-effect proposed, though: maybe there really are far more “raw” bit-flips going on, but far less of main memory is now in the causal chain for corrupting a workload than it used to be. In the 80s, on an 8-bit micro, POKEing any random address might wreck a program, since there are only 64K addresses to POKE and most of the writable ones are in use for something critical. Today, most RAM is some sort of cache or buffer that’s going to be used once to produce some ephemeral IO effect (e.g. the compressed data for a video frame, which might decompress incorrectly but only cause 16ms of glitchiness before the next frame comes along to paper over it); or, if it’s functional data, it’s part of a fault-tolerant component (e.g. a TCP packet that’s going to checksum-fail when passed to the Ethernet controller and so never even be sent, causing the client to retry the request; or, even if it accidentally checksums correctly, the server will choke on the malformed request, send an error... and the client will retry the request anyway. One generic retry-on-exception handler around your net request, and you get memory fault-tolerance for free!)
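A minimal sketch of what I mean by that generic retry wrapper, assuming Python; the names here (with_retries, send_request) are made up for illustration:

    import time

    def with_retries(request_fn, attempts=3, delay=0.5):
        # Hypothetical generic retry wrapper: re-issue a request on any
        # exception, which masks transient corruption (a bit-flipped packet
        # that fails a checksum, or that draws an error from the server).
        for attempt in range(attempts):
            try:
                return request_fn()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)

    # usage (send_request is a made-up stand-in for whatever does the IO):
    # result = with_retries(lambda: send_request(payload))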
If both effects are real, this would imply that regular PCs without ECC should still seem quite stable, but that it would be a far worse idea to run a non-ECC machine as a densely-packed multitenant VM hypervisor today (i.e. to tile main memory with OS kernels) than it would have been ~20 years ago, when memory densities were lower. Can anyone attest to this?
(I’d just ask for actual numbers on whether per-cell per-second errors have increased over the years, but I don’t expect anyone has them.)
I think it's been nominally quadratic, but with a pretty low contribution from the second-order term.
Think of the number of events that can flip a bit. If you make the bits smaller, a modestly larger number of events in a given area become capable of flipping a bit, but they're spread across a larger number of bits in that area.
That is, it's flip event rate * memory die area, not flip event rate * number of memory bits.
In recent generations, I understand it's even been a bit paradoxical-- smaller geometries mean less of the die is actual memory bits, so you can actually end up with fewer flips from shrinking geometries.
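Here's a toy version of that scaling argument, with made-up numbers, just to show why the bit count drops out:

    # Two candidate models for soft errors per die (all numbers invented):
    #   per-area model: flips/hour = upsets_per_mm2 * die_area_mm2 * cell_fraction
    #   per-bit model:  flips/hour = upsets_per_bit * number_of_bits
    die_area_mm2   = 50.0   # die area, roughly constant across generations (assumption)
    upsets_per_mm2 = 1e-6   # made-up upset rate per mm^2 per hour
    cell_fraction  = 0.6    # share of the die that is actual storage cells (assumption)

    flips_per_die_per_hour = upsets_per_mm2 * die_area_mm2 * cell_fraction
    # The bit count never appears: shrinking cells packs more bits into the same
    # vulnerable area, so flips per die stay roughly flat while flips per bit drop.
    # And if cell_fraction shrinks too (more periphery/logic per die), total flips
    # per die can even go down.
    print(flips_per_die_per_hour)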
And sure, your other effect is true: there are a whole lot fewer bitflips that "matter". Flip a bit in some framebuffer used in compositing somewhere-- and that's a lot of my memory-- and I don't care.
Sorry, I don't have the numbers you asked for. But afaik one other effect is that "modern" semiconductor processes like FinFET and Fully-Depleted Silicon-on-Insulator are less prone to single-event upsets, and in particular a single alpha particle tends to flip only a single bit rather than draining a whole region of transistors.