When writing a post titled "Preserving data integrity", I find it weird not to actually demonstrate that the data integrity is preserved.
I assume md-raid is proven so removing a disk isn't of primary concern, but at the very least power down the system, and overwrite a few random spots in the data partition on a couple of the disks. Power the system back up and see how the pool reacts when reading everything. Verify that the data is correct.
I mean, write and read performance means little without actual integrity, or if you need to perform some arcane incantations to get the system running smoothly again.
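Something like the following would be a minimal sketch of such a test for the md/dm-integrity stack described in the article (device names, offsets and the checksum file are all made up; don't run this against disks holding data you care about):

    # with the array stopped, scribble over a few arbitrary spots on two members
    dd if=/dev/urandom of=/dev/sdb2 bs=1M count=1 seek=500 conv=notrunc
    dd if=/dev/urandom of=/dev/sdc2 bs=1M count=1 seek=9000 conv=notrunc

    # reassemble and mount, then force a read of everything and verify it
    sha256sum -c /root/known-good.sha256     # checksum list created before the test
    dmesg | grep -iE 'integrity|md0'         # did anything get kicked or repaired?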
This does not provide the same integrity as ZFS. Dm-integrity only protects against corruption on the disk itself, while ZFS attempts to protect against all sources of corruption, including software/firmware bugs and errors in RAM and the I/O path.
ZFS uses a Merkle tree, so it knows which checksum to expect before reading the data. With raidz/mirror, when it detects corruption, it attempts to read from the other drive(s) and tries combinations of disks until it finds one that yields the correct checksum. If it can't find a combination that works, only whatever depends on that block becomes unavailable. If it can recover the block, it attempts to repair the corrupted disk(s) by writing the correct data back to them.
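For reference, triggering and inspecting that self-healing on a ZFS pool is just (the pool name "tank" is a placeholder):

    zpool scrub tank          # read and verify every block against its checksum
    zpool status -v tank      # shows repaired bytes and any files with errors
    zpool clear tank          # reset the error counters once you're satisfied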
Standard mdraid cannot repair an array in case of corruption, since it doesn't know which copy of the data is correct; you need to figure out which disk has the corrupt data, manually remove that drive from the array, and hope none of the other disks is corrupted. In theory, in case of a read error, mdraid could attempt to fix it, but it doesn't and just removes the drive from the array. Dm-integrity provides more guarantees that the data you get is correct, but since mdraid and dm-integrity are different layers, they don't actually work together. There is no attempt to repair in case of corruption; dm-integrity will just return a read error and mdraid will remove the drive from the array.
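For what it's worth, the manual dance described above looks roughly like this with mdadm (device names are placeholders):

    mdadm --detail /dev/md0                               # see which member is suspect/failed
    mdadm /dev/md0 --fail /dev/sdb2 --remove /dev/sdb2    # drop the suspect member
    mdadm /dev/md0 --add /dev/sdb2                        # re-add (or add a replacement) and resync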
Whether it's an uncorrectable read error (bad sector) reported by drive firmware, or dm-integrity detecting corruption, the affected LBAs are propagated up to md. Md can then determine the location of a copy (from a mirror, or by reconstructing from parity) and overwrite the bad location.
This mechanism is often thwarted with consumer drives, when their SCT ERC timeout can't be set or is longer than the kernel's SCSI command timer default of 30 seconds. Once a command hasn't returned a result of some kind within 30s, the SCSI driver does a link reset. On SATA this has the pernicious effect of clearing the entire command queue, not just the one command that was hung up in "deep recovery". No LBAs are returned, so it's indeterminate where the problem was or what caused it, and no fix-up happens. This results in bad sector accumulation. This misconfiguration is common, and routinely costs people their data.
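To check whether a given drive is affected, and to work around it, the usual knobs are (sda is a placeholder):

    # query / set the drive's error recovery timeout (in tenths of a second)
    smartctl -l scterc /dev/sda
    smartctl -l scterc,70,70 /dev/sda          # 7s read / 7s write, if the firmware allows it

    # the kernel's per-device SCSI command timer, in seconds (default 30)
    cat /sys/block/sda/device/timeout
    echo 180 > /sys/block/sda/device/timeout   # fallback when ERC can't be shortened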
This may be easier and more reliable to do with a udev rule, but the concept is the same. Also, while this is from the linux-raid@ wiki, it doesn't only apply to mdadm raid, but to LVM and Btrfs as well. I don't know if it applies to ZoL, because I don't know about all the layers ZFS implements itself separate from Linux. But if it depends at all on the SCSI driver for error handling, it would be at risk of this misconfiguration as well.
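A udev-based version might look something like this (the file name, the sd[a-z] match and the 7-second value are all arbitrary choices; drives whose firmware rejects SCT ERC would still need the longer SCSI command timer instead):

    # /etc/udev/rules.d/60-scterc.rules  (hypothetical example)
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"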
I understand the licensing issues, but having a system with 4 components (ext4, dm-crypt, mdraid, dm-integrity) instead of a single integrated one (ZFS) can hardly be said to be simpler. On a distribution like Ubuntu, adding and maintaining ZFS is completely painless.
Well, ZFS isn't exactly monolithic if you look under the hood: it has the ZPL (files, directories), DMU (objects, transactions on those objects), SPA (actual disk I/O).
A potato-quality video from 2008 with Moore and Bonwick, the creators (timestamped to relevant section):
> > On a distribution like Ubuntu, adding and maintaining ZFS is completely painless.
> There are packages for most distros (they generally leverage DKMS):
As a Debian user using ZFS, I can assure you that it's absolutely not painless. Definitely worth the pain, however. (Same story with Wireguard + Debian currently too)
Ubuntu actually ships ZFS (and Wireguard) in their main line. Debian (and others that depend on DKMS) do not.
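For reference, the Debian/DKMS route is just a couple of packages once the contrib component is enabled; the occasional pain is mostly kernel headers lagging behind or DKMS rebuilds failing on kernel updates:

    apt install linux-headers-amd64 zfs-dkms zfsutils-linux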
> [...] whereas ZFS, the most prominent candidate for this type of features, is not without hassle and it must be recompiled for every kernel update (although automation exists).
I have the feeling that this setup is even more of a hassle than ZFS.
>>ZFS Compression, Incompressible Data and Performance
>>You may be tempted to set “compression=off” on datasets which primarily have incompressible data on them, such as folders full of video or audio files. We generally recommend against this – for one thing, ZFS is smart enough not to keep trying to compress incompressible data, and to never store data compressed, if doing so wouldn’t save any on-disk blocks.
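If you're unsure whether compression is paying off on a particular dataset, it's easy to check rather than guess (dataset name is a placeholder):

    zfs set compression=lz4 tank/media            # lz4 is cheap enough to leave on
    zfs get compression,compressratio tank/media  # see what it's actually saving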
> As ZFS did not adhere to conv=fdatasync, the main memory was restricted to 1GB [...]
I guess this might explain the latency tail for ZFS. As far as I understand, ZFS relies heavily on its caches for performance, both by design and implementation choices.
I'm also a bit puzzled by the read performance.
Would be nice to see a re-run of the benchmarks with 2 or 4 GB of memory, to see the effect of the cache.
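For anyone wanting to reproduce this at home, the ARC size can be capped via the zfs_arc_max module parameter, either at runtime or persistently via modprobe.d (the 2 GB value is just an example):

    # runtime
    echo $((2*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max

    # persistent across reboots
    echo "options zfs zfs_arc_max=2147483648" > /etc/modprobe.d/zfs.conf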
As I understand it, dm-integrity is supposed to graft ZFS-like checksumming onto any filesystem. How good is it at detecting corruption and fixing it in a RAID setup?
AIUI, dm-integrity doesn't handle repair at all; it "just" causes checksum errors to bubble up as I/O errors on the affected blocks, so whatever you layer on top of it (e.g. Linux MD) can handle it as if the disk had actually errored out.
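A minimal sketch of that layering, assuming standalone dm-integrity (rather than the dm-crypt-integrated mode) and made-up device names:

    # per disk: format and open a dm-integrity mapping on top of the partition
    integritysetup format /dev/sdb2 --integrity sha256
    integritysetup open   /dev/sdb2 int-sdb2 --integrity sha256
    integritysetup format /dev/sdc2 --integrity sha256
    integritysetup open   /dev/sdc2 int-sdc2 --integrity sha256

    # then build the md array out of the integrity-mapped devices
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/mapper/int-sdb2 /dev/mapper/int-sdc2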
> One thing to note is that checks and resyncs of the proposed setup are significantly prolonged (by a factor of 3 to 4) compared to an mdraid without the integrity layer underneath. The investigation has so far not revealed the underlying cause. It is not CPU-bound, indicating that the read performance is not held back by checksumming; the latency figures above do not imply an increase in latency significant enough to cause a 3-4x longer check time; and disabling the journal did not change this either (as one would expect, since the journal is unrelated to reads, which should be the only relevant mode for a raid check).
> ZFS was roughly on par with mdraid without the integrity-layer underneath with regard to raid-check time.
> If anyone has an idea of the root cause of this behavior, feel encouraged to contact me, I’d be intrigued to know.
Note that by default ZFS is tuned to prioritize online transactions at the expense of scrub/resilver speed, as this is a sensible choice for its intended use-case of business storage appliances. Businesses can't stop the world for a resilver.
For your average home NAS, you can tune ZFS to scrub/resilver faster; it's not really a big deal, as you are not really hammering the pool anyway.
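The exact tunables have changed between OpenZFS releases, so treat this as a sketch and check the module-parameter documentation for your version; the general idea is letting scans use more of each transaction group's time:

    # current values
    cat /sys/module/zfs/parameters/zfs_scrub_min_time_ms
    cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms

    # let scrub/resilver run harder (example values)
    echo 3000 > /sys/module/zfs/parameters/zfs_scrub_min_time_ms
    echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms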
It's funny that when using cloud storage (like S3) the probability of me losing my data due to a billing issue or user error is way higher than any underlying technical risk. It's like being afraid of dying in a plane crash but seeing no issue riding a motorcycle to the airport.
I've had a lot of good luck with ZFS over the past decade and a half. I just started building my third file server last week, actually. Going to try it with NVMe drives this time!
How is the on-disk data repaired once an error is found?
If dm-integrity returns an error to mdraid, will mdraid rewrite the errored blocks once it determines the correct value from a mirrored copy or via parity?
"In kernels prior to about 2.6.15, a read error would cause the same effect as a write error. In later kernels, a read-error will instead cause md to attempt a recovery by overwriting the bad block. i.e. it will find the correct data from elsewhere, write it over the block that failed, and then try to read it back again. If either the write or the re-read fail, md will treat the error the same way that a write error is treated, and will fail the whole device."