To add to this: even though bad blocks are allowed by the contract between a device and the OS, they are rare enough that the code paths dealing with them are riddled with bugs.
For example, if you power off a disk improperly, every sector with an un-acked write is allowed to end up holding the new data, the old data, or nothing readable at all.
If, on Linux, you then try to read one of those unreadable sectors, you get an IO error - as designed. If you then try to write that sector, you won't be able to. Why? Because the kernel's page cache works in 4 KiB blocks while the disk exposes 512-byte sectors, so to write a single disk sector the kernel first has to read the neighbouring sectors to populate the cache - and that read fails with the same IO error.
That pretty much means every improper power-off leaves you with a bunch of unreadable data and filesystem errors, even though the filesystem is journaled and should cope with a sudden power-off just fine.
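
One way out of that trap is to overwrite the whole cache-sized block through direct I/O, so the kernel never needs to read the unreadable sectors first and the drive gets a chance to remap them on write. A minimal sketch of the idea, assuming Linux and Python, a 4 KiB-aligned offset, and a hypothetical /dev/sdX - it zeroes real data, so treat it purely as an illustration:

    import mmap
    import os

    DEV = "/dev/sdX"          # hypothetical device - writing here destroys data
    OFFSET = 123456 * 4096    # hypothetical byte offset, must stay 4 KiB-aligned
    BLOCK = 4096              # overwrite a whole page-cache-sized block at once

    def overwrite_block(dev, offset):
        # An anonymous mmap is page-aligned and zero-filled, which satisfies
        # O_DIRECT's buffer-alignment requirement.
        buf = mmap.mmap(-1, BLOCK)
        # O_DIRECT bypasses the page cache, so the kernel never has to read the
        # unreadable neighbouring sectors before it can write this one.
        fd = os.open(dev, os.O_WRONLY | os.O_DIRECT)
        try:
            os.pwrite(fd, buf, offset)
        finally:
            os.close(fd)

    if __name__ == "__main__":
        overwrite_block(DEV, OFFSET)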
Every hard disk has bad blocks, plus a reserve of good blocks to remap to when one of the bad blocks is written. The drive's firmware also keeps an internal list of bad blocks discovered during the low-level format performed at the factory, which is why consumers should never perform a low-level format themselves unless they understand this and know how to write the updated bad-block list back into the drive.
Either way, bad blocks on a hard disk are an unavoidable reality, and the same is true for SSDs. Running into them during normal use of the drive is expected: correctly programmed firmware remaps the pointer to a good block from the reserve and either tells the OS driver to attempt the write again (if the drive's write cache has been disabled, as it should be) or silently remaps and writes the good block with the data from the drive's write cache. If power is lost in the middle of that, no amount of redundancy will save you; the only solution is something like ZFS or Oracle ASM.
It has always been this way, and it will stay this way as long as we don't have media that cannot develop bad "blocks", for lack of a better term.
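
Since those bad blocks only announce themselves when something actually reads them, a periodic read scan is how you find them before a rebuild does. A rough sketch, assuming Linux and Python and treating EIO as the signal for an unreadable region; the chunk size is arbitrary and a failing chunk is only reported, not pinpointed to the sector:

    import errno
    import os
    import sys

    CHUNK = 1024 * 1024  # read 1 MiB at a time

    def scan(dev):
        """Sequentially read a block device and report regions that return EIO."""
        fd = os.open(dev, os.O_RDONLY)
        try:
            size = os.lseek(fd, 0, os.SEEK_END)  # block device size in bytes
            offset = 0
            while offset < size:
                length = min(CHUNK, size - offset)
                try:
                    os.pread(fd, length, offset)
                except OSError as exc:
                    if exc.errno != errno.EIO:
                        raise
                    # The drive could not return this region; writing it later is
                    # what normally triggers the firmware's remap to a reserve block.
                    print(f"unreadable region in bytes [{offset}, {offset + length})")
                offset += length
        finally:
            os.close(fd)

    if __name__ == "__main__":
        scan(sys.argv[1])  # e.g. python3 scan.py /dev/sdX, needs read access to the device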
As the article tries to state, the problem isn't that bad blocks can happen. The problems with md's bad blocks log are:
- Most of the time the entries don't correspond to actual bad blocks.
- It's buggy: it copies the BBL between devices, producing yet more entries that say blocks are bad when they actually aren't.
- Once you've worked out that the entries are bogus, it is still very hard to remove them, or the BBL as a whole.
- It operates far too quietly. I monitor my syslogs, but many people don't, and there are documented cases of people carrying entries in a BBL for years without knowing (a small check you can run yourself follows this list).
- Once md thinks a block is bad, it renders that block on that device useless for recovery purposes, so in a two-device array you have just lost redundancy.
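
On the "quiet" point: the per-device log is visible through sysfs, so it is easy to check for yourself. A monitoring sketch in Python, assuming the /sys/block/md*/md/dev-*/bad_blocks layout described in the kernel's md documentation and a "start-sector length" line format; mdadm's --examine-badblocks on a member device shows the same log from the metadata:

    import glob

    def report_md_bad_blocks():
        found = False
        for path in glob.glob("/sys/block/md*/md/dev-*/bad_blocks"):
            try:
                with open(path) as fh:
                    entries = fh.read().split()
            except OSError:
                continue  # member disappeared or attribute not supported
            if not entries:
                continue
            found = True
            print(f"{path}:")
            # entries are assumed to come in (start-sector, length) pairs
            for sector, length in zip(entries[::2], entries[1::2]):
                print(f"  {length} sector(s) marked bad starting at sector {sector}")
        if not found:
            print("no md bad-block log entries found")

    if __name__ == "__main__":
        report_md_bad_blocks()

Run from cron, that turns "carrying entries for years without knowing" into a nagging report.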
So this article has very little to do with the concept of bad blocks; it's about how md's BBL feature can be dangerous and why I don't want it enabled until it's fixed.
I don't understand the mentality of dealing with bad blocks in the first place.
Storing data is easy. Storing data reliably is a whole other thing.