To be clear, this is about corruption in the CPU/GPU/memory complex. There's a whole separate set of techniques (some of which I worked on) to detect and correct data corruption on disk.
I'm in the same boat and my takeaway is that the vast majority of "silent" on-disk corruption actually happens on the way to the storage, i.e. the data gets corrupted in some RAM it passes through and then just ends up being written out in a corrupted state. This is because virtually all modern drives implement per-sector FEC coding, so if a bit does flip on the disk, you will either get back the original data (now FEC-corrected) or you will get a read error.
That is, the so-called "bitrot" phenomenon is largely mis-attributed. Bitrot doesn't happen at rest. It happens in transit.
I can state categorically that bitrot on disk does exist, because that's one of the parts I worked on. It's pretty rare - unfortunately I don't think I can give you the numbers - but across enough exabytes it does happen often enough to justify slow scans to detect it.
The only correct way to test for bitrot is to read the data back immediately after it was written and the cache was flushed. If it's the same as the original, we know it made it to the disk undamaged. Then re-read it again after some time. If it doesn't match, re-read immediately, ideally using a different physical memory block, and compare again. If it still doesn't match, take the disk to another machine and re-read it there. If it still doesn't match, only then is it actual at-rest bitrot... OR a bug in the drive's firmware, because corrupted data must either be corrected or not be returned at all.
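Something like the following is a rough sketch of that procedure on Linux (the path and sizes are placeholders; a real test would also vary the physical memory used for the read buffer and repeat the read on a second machine):

    import hashlib
    import os

    PATH = "/mnt/testdisk/probe.bin"       # placeholder path on the disk under test

    payload = os.urandom(1 << 20)          # 1 MiB of random test data
    expected = hashlib.sha256(payload).hexdigest()

    # Write, force it to the device, then drop the cached pages so that the
    # immediate re-read really comes from the disk and not from RAM.
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    os.write(fd, payload)
    os.fsync(fd)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    os.close(fd)

    def read_back() -> str:
        with open(PATH, "rb") as f:
            # Evict any cached copy so the bytes come from the platters.
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
            return hashlib.sha256(f.read()).hexdigest()

    # 1. Immediate re-read: a mismatch here means the data was damaged on the
    #    way in (RAM, bus, controller), not at rest.
    assert read_back() == expected, "corrupted in transit, not at rest"

    # 2. Come back days or weeks later and compare again. On a mismatch,
    #    re-read (ideally into different physical memory), then retry on
    #    another machine before concluding it is at-rest bitrot rather than
    #    a firmware bug.
    if read_back() != expected:
        print("candidate at-rest bitrot, escalate per the steps above")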
Because we had checks for in-flight corruption. Also, more often than not these same blocks had been checked before and found to be fine.
> The only correct way to test for bitrot is to read the data back immediately
No, the only correct way is to read it back after some time has passed. Mis-written data is not the same as bitrot.
> must be corrected or it must not be returned
Every error-correction technique has a limit to how many simultaneous errors it can correct. Beyond that, bits can be flipped in a way that seems valid but in fact is not (detectable only by cross-checking against other erasure-coded fragments of the same block on other machines). Just because you haven't seen it doesn't mean it doesn't happen. As I said, and as others have said many times, with sufficient scale and time even the most unlikely scenarios become almost inevitable. Why do you persist in telling me I didn't see what I saw with my own eyes? Are you assuming that my thirty years in storage gave me less understanding or insight regarding these issues than whatever experience (if any) you have?
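To put a rough number on the scale point (the figures below are purely illustrative, not measured rates):

    # Expected-count arithmetic behind "unlikely becomes inevitable at scale".
    # The per-block rate is a hypothetical placeholder, not a measured figure.
    blocks_in_fleet = (10 ** 18) // 4096            # roughly an exabyte of 4 KiB blocks
    p_per_block_per_year = 1e-12                    # a "one in a trillion" event per block per year
    print(blocks_in_fleet * p_per_block_per_year)   # ~244 such events per year, fleet-wide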
>> The only correct way to test for bitrot is to read the data back immediately
> No, the only correct way is to read it back after some time has passed. Mis-written data is not the same as bitrot.
Well, no. If you want to check for at-rest bitrot, you need to make sure that you've written out the correct thing in the first place. Otherwise it's not possible to tell at-rest corruption from corruption that happened on the way in.
> Every error-correction technique has a limit to how many simultaneous errors it can correct.
But it can detect the case when it can't recover, which is why it will either produce correct output or an error.
> As I said, and as others have said many times, with sufficient scale and time even the most unlikely scenarios become almost inevitable.
This is not an argument if it goes against how things actually work.
> Why do you persist in telling me I didn't see what I saw with my own eyes?
I am merely curious about your exact testing technique, because at-rest bitrot is vanishingly unlikely, even at the exabyte scale. For it to happen, the data and its ECC (7-11% of the data size) both need to be corrupted in a coordinated way. That is exceedingly unlikely (rough numbers below), especially given that academic papers have found that on-disk corruption is nearly always clustered and is either small-scale or a full-sector failure.
So when you say you ran into a lot of these cases, it's only natural to ask for details. And "scale" is not a detail.
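To put rough numbers on the "exceedingly unlikely" claim above, here is a back-of-envelope sketch that assumes (purely for illustration) a 4096-byte sector with ~10% ECC overhead and a linear code, and ignores the miscorrection case discussed further down the thread:

    from math import log10

    # Fraction of uniformly random corruptions of a stored sector that land on
    # another valid codeword and therefore pass the drive's check unnoticed.
    # For a linear code with r redundancy bits, that fraction is 2 ** -r.
    sector_bits = 4096 * 8
    r = int(sector_bits * 0.10)        # ~10% ECC overhead, roughly 3276 redundancy bits
    print(-r * log10(2))               # about -986, i.e. a probability on the order of 10^-986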
> Are you assuming that my thirty years in storage gave me less understanding or insight regarding these issues than whatever experience (if any) you have?
I have no way to tell. But given your experience, can you explain how at-rest bitrot, should it occur, can slip through the on-disk error correction? I am not talking about RAID-style setups, just the plain ECC record in a disk sector [1].
> But it can detect that the case when it can't recover.
That is simply not true. For any parity/ECC/FEC/erasure-code scheme carrying M data bits in N total (N greater than M but less than 2M), there must be multiple data patterns that match the same error checks. That's just mathematics (the pigeonhole principle). Also, bear in mind that the ECC bits can be corrupted too. This opens up the distinct possibility of something that looks like a correctable error, but the "correction" leads to a wrong result (see the toy sketch below). I've seen such issues in many kinds of storage systems, from low level to high. Anyone who has actually worked in this area, instead of deriving their "expertise" from a quick scan of Wikipedia, would be utterly unsurprised by the idea that disk firmware might do such a thing, or have bugs in its ECC implementation, or not follow a spec.
Whatever the causes, whatever the merely-theoretical probabilities, the fact remains that I've seen these. I've been paged for them. I've done the analyses of possible causes. A bit pattern was written and repeatedly verified over quite a long period of time (ruling out data-path issues), then at some point a different bit pattern was read, and would persistently be read thereafter. How is that not real bitrot? How does it matter, beyond ruling out everything above the disk level, what the precise causes are? If you can't answer those questions, you're just posting noise.
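As a toy illustration of the miscorrection point, here is a Hamming(7,4) sketch (real drives use far stronger Reed-Solomon/LDPC codes, but the failure mode past the design limit is the same): flip two bits, one more than the code can handle, and the decoder cheerfully returns wrong data with no error.

    # Hamming(7,4): corrects any single-bit error, silently miscorrects some
    # double-bit errors into a different, valid-looking codeword.

    def encode(d):                           # d = [d1, d2, d3, d4]
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]  # bit positions 1..7

    def decode(c):                           # returns (data_bits, corrected_position)
        c = c[:]                             # do not mutate the caller's list
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]       # parity over positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]       # parity over positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]       # parity over positions 4,5,6,7
        pos = s1 + 2 * s2 + 4 * s3           # syndrome, read as an error position
        if pos:
            c[pos - 1] ^= 1                  # "correct" the indicated bit
        return [c[2], c[4], c[5], c[6]], pos

    data = [1, 0, 1, 1]
    stored = encode(data)

    stored[0] ^= 1                           # two flipped bits: one beyond the
    stored[5] ^= 1                           # code's single-error design limit

    decoded, pos = decode(stored)
    print(decoded == data)                   # False: wrong data returned
    print(pos)                               # nonzero: an innocent bit was "fixed"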
It indeed is not. I had to reread the theory and I stand corrected: RS-style ECC can't reliably detect errors in excess of the redundancy count.
> How is that not real bitrot?
It is and I can see how it can happen.
> How does it matter, beyond ruling out everything above the disk level, what the precise causes are?
It would've mattered if a drive could detect on-disk bitrot reliably, which was what the stats I worked with (also in exabytes, funnily enough) and the IEEE papers I read led me to believe.
Didn't you answer your own question above? It's firmware bugs. The disk reports a successful write to block X but actually writes the data to block Y. Later you read block Y and get the data meant for X. The block-level ECC codes are consistent (sketched below). You also stand a low but non-zero chance that you requested a read at block X and were served up some other block, again with matching checksums. And of course there's always the possibility that your firmware simply has a bug in its ECC checking code.
The paper "Parity Lost and Parity Regained" assigns a probability of 1.88e-5 to misdirected writes among disks, so if you have a warehouse full of disks you now have this nightmare.
Fun question: what if a relocation table gets corrupted? And what protection is there against that possibility? You can bet it's not the same ECC as on data blocks. The rest is left as an exercise for the reader. ;)
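A toy sketch of the misdirected-write case (the dict-as-disk and block numbers are obviously made up): the checksum that travels with the sector stays self-consistent, while a ZFS-style checksum kept in the parent block pointer catches the stale read.

    import hashlib

    def csum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    # Each "sector" stores its payload together with an embedded checksum,
    # the way a drive's per-sector ECC travels with the sector itself.
    disk = {7: (b"old contents of block 7", csum(b"old contents of block 7"))}

    # We ask the firmware to update block 7, but the write lands at block 9.
    intended = b"new contents meant for block 7"
    disk[9] = (intended, csum(intended))     # misdirected write

    # Reading block 7 now returns stale data whose embedded checksum still
    # matches, so a drive-level check sees nothing wrong.
    data, embedded = disk[7]
    print(embedded == csum(data))            # True: stale data looks perfectly valid

    # A ZFS-style block pointer keeps the checksum of the intended contents in
    # the parent, so the stale read is flagged immediately.
    parent_ptr_csum = csum(intended)
    print(parent_ptr_csum == csum(data))     # False: misdirected write caught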
uh oh, I recognize this one. Love to have a file corrupt after months at rest with no access logged and no mtime changes because a file on a neighboring track needed rewriting (SMR, of course).
ZFS and other checksumming filesystems can detect bitrot in data at rest. When data is read back, the block is checksummed and compared against the checksum recorded when it was written, before the read is returned.
You can periodically scrub the entire pool to find and even fix these issues (in a pool with redundancy).
Sure. However, the main purpose of scrubbing is to flush out deteriorating media and to prompt the drive to relocate salvageable sectors and report completely dead ones.
I found many comments on ZFS, but not so many on dm-integrity, BlueStore/Cephfs, and others. So I am thinking of looking into ZFS, but if anyone has recommendations, I would appreciate the advice.
I am experimenting with Git LFS and git-annex, but I like the filesystem UI better, so I am looking for filesystem-like solutions.