When writing a post titled "Preserving data integrity", I find it weird not to actually demonstrate that the data integrity is preserved.
I assume md-raid is proven so removing a disk isn't of primary concern, but at the very least power down the system, and overwrite a few random spots in the data partition on a couple of the disks. Power the system back up and see how the pool reacts when reading everything. Verify that the data is correct.
I mean, write and read performance means little without actual integrity, or if you need to perform some arcane incantations to get the system running smoothly again.
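Something like the following would be a minimal sketch of such a test for the md/dm-integrity stack described in the article (device names, offsets and the checksum file are all made up; don't run this against disks holding data you care about):

    # with the array stopped, scribble over a few arbitrary spots on two members
    dd if=/dev/urandom of=/dev/sdb2 bs=1M count=1 seek=500 conv=notrunc
    dd if=/dev/urandom of=/dev/sdc2 bs=1M count=1 seek=9000 conv=notrunc

    # reassemble and mount, then force a read of everything and verify it
    sha256sum -c /root/known-good.sha256     # checksum list created before the test
    dmesg | grep -iE 'integrity|md0'         # did anything get kicked or repaired?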
This does not provide the same integrity as ZFS. Dm-integrity only protects against corruption on the disk itself, while ZFS attempts to protect against all sources of corruption, including software/firmware bugs and errors in RAM and the I/O path.
ZFS uses a Merkle tree, so it knows which checksum to expect before reading the data. With raidz/mirror, when it detects corruption, it attempts to read from the other drive(s) and tries combinations of disks until it finds one that yields the correct checksum. If it can't find a combination that works, only whatever depends on that block becomes unavailable. If it can recover the block, it attempts to repair the corrupted disk(s) by writing the correct data back to them.
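For reference, triggering and inspecting that self-healing on a ZFS pool is just (the pool name "tank" is a placeholder):

    zpool scrub tank          # read and verify every block against its checksum
    zpool status -v tank      # shows repaired bytes and any files with errors
    zpool clear tank          # reset the error counters once you're satisfied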
Standard mdraid cannot repair an array in case of corruption, since it doesn't know which copy of the data is correct; you need to figure out which disk has the corrupt data, manually remove that drive from the array, and hope none of the other disks is corrupted. In theory, in case of a read error, mdraid could attempt to fix it, but it doesn't and just removes the drive from the array. Dm-integrity provides more guarantees that the data you get is correct, but since mdraid and dm-integrity are different layers, they don't actually work together. There is no attempt to repair in case of corruption; dm-integrity will just return a read error and mdraid will remove the drive from the array.
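For what it's worth, the manual dance described above looks roughly like this with mdadm (device names are placeholders):

    mdadm --detail /dev/md0                               # see which member is suspect/failed
    mdadm /dev/md0 --fail /dev/sdb2 --remove /dev/sdb2    # drop the suspect member
    mdadm /dev/md0 --add /dev/sdb2                        # re-add (or add a replacement) and resync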
Whether it's an uncorrectable read error (bad sector) reported by drive firmware, or dm-integrity detecting corruption, the affected LBAs are propagated up to md. Md can then determine the location of a copy (from a mirror, or by reconstructing from parity) and overwrite the bad location.
This mechanism is often thwarted with consumer drives, when their SCT ERC timeout can't be set or is longer than the kernel's SCSI command timer default of 30 seconds. Once a command hasn't returned a result of some kind within 30s, the SCSI driver does a link reset. On SATA this has the pernicious effect of clearing the entire command queue, not just the one command that was hung up in "deep recovery". No LBAs are returned, so it's indeterminate where the problem was or what caused it, and no fix-up happens. This results in bad sector accumulation. This misconfiguration is common, and routinely costs people their data.
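To check whether a given drive is affected, and to work around it, the usual knobs are (sda is a placeholder):

    # query / set the drive's error recovery timeout (in tenths of a second)
    smartctl -l scterc /dev/sda
    smartctl -l scterc,70,70 /dev/sda          # 7s read / 7s write, if the firmware allows it

    # the kernel's per-device SCSI command timer, in seconds (default 30)
    cat /sys/block/sda/device/timeout
    echo 180 > /sys/block/sda/device/timeout   # fallback when ERC can't be shortened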
This may be easier and more reliable to do with a udev rule, but the concept is the same. Also, while this is from the linux-raid@ wiki, it doesn't only apply to mdadm raid, but to LVM and Btrfs as well. I don't know if it applies to ZoL, because I don't know about all the layers ZFS implements itself separate from Linux. But if it depends at all on the SCSI driver for error handling, it would be at risk of this misconfiguration as well.
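A udev-based version might look something like this (the file name, the sd[a-z] match and the 7-second value are all arbitrary choices; drives whose firmware rejects SCT ERC would still need the longer SCSI command timer instead):

    # /etc/udev/rules.d/60-scterc.rules  (hypothetical example)
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"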
I understand the licensing issues, but having a system with 4 components (ext4, dm-crypt, mdraid, dm-integrity) instead of a single integrated one (ZFS) can hardly be said to be simpler. On a distribution like Ubuntu, adding and maintaining ZFS is completely painless.
Well, ZFS isn't exactly monolithic if you look under the hood: it has the ZPL (files, directories), DMU (objects, transactions on those objects), SPA (actual disk I/O).
A potato-quality video from 2008 with Moore and Bonwick, the creators (timestamped to relevant section):
> > On a distribution like Ubuntu, adding and maintaining ZFS is completely painless.
> There are packages for most distros (they generally leverage DKMS):
As a Debian user using ZFS, I can assure you that it's absolutely not painless. Definitely worth the pain, however. (Same story with Wireguard + Debian currently too)
Ubuntu actually ships ZFS (and Wireguard) in their main line. Debian (and others that depend on DKMS) do not.
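For reference, the Debian/DKMS route is just a couple of packages once the contrib component is enabled; the occasional pain is mostly kernel headers lagging behind or DKMS rebuilds failing on kernel updates:

    apt install linux-headers-amd64 zfs-dkms zfsutils-linux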
> [...] whereas ZFS, the most prominent candidate for this type of features, is not without hassle and it must be recompiled for every kernel update (although automation exists).
I have the feeling that this setup is even more of a hassle than ZFS.
>>ZFS Compression, Incompressible Data and Performance
>>You may be tempted to set “compression=off” on datasets which primarily have incompressible data on them, such as folders full of video or audio files. We generally recommend against this – for one thing, ZFS is smart enough not to keep trying to compress incompressible data, and to never store data compressed, if doing so wouldn’t save any on-disk blocks.
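If you're unsure whether compression is paying off on a particular dataset, it's easy to check rather than guess (dataset name is a placeholder):

    zfs set compression=lz4 tank/media            # lz4 is cheap enough to leave on
    zfs get compression,compressratio tank/media  # see what it's actually saving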
> As ZFS did not adhere to conv=fdatasync, the main memory was restricted to 1GB [...]
I guess this might explain the latency tail for ZFS. As far as I understand, ZFS relies heavily on its caches for performance, both by design and implementation choices.
I'm also a bit puzzled by the read performance.
Would be nice to see a re-run of the benchmarks with 2 or 4 GB of memory, to see the effect of the cache.
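For anyone wanting to reproduce this at home, the ARC size can be capped via the zfs_arc_max module parameter, either at runtime or persistently via modprobe.d (the 2 GB value is just an example):

    # runtime
    echo $((2*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max

    # persistent across reboots
    echo "options zfs zfs_arc_max=2147483648" > /etc/modprobe.d/zfs.conf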
As I understand it, dm-integrity is supposed to graft ZFS-like checksumming onto any filesystem. How good is it at detecting corruption and fixing it in a RAID setup?
AIUI, dm-integrity doesn't handle repair at all; it "just" causes checksum errors to bubble up as I/O errors on the affected blocks, so whatever you layer on top of it (e.g. Linux MD) can handle it as if the disk had actually errored out.
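A minimal sketch of that layering, assuming standalone dm-integrity (rather than the dm-crypt-integrated mode) and made-up device names:

    # per disk: format and open a dm-integrity mapping on top of the partition
    integritysetup format /dev/sdb2 --integrity sha256
    integritysetup open   /dev/sdb2 int-sdb2 --integrity sha256
    integritysetup format /dev/sdc2 --integrity sha256
    integritysetup open   /dev/sdc2 int-sdc2 --integrity sha256

    # then build the md array out of the integrity-mapped devices
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/mapper/int-sdb2 /dev/mapper/int-sdc2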
> One thing to note is that checks and resyncs of the proposed setup are significantly prolonged (by a factor of 3 to 4) compared to an mdraid without the integrity layer underneath. The investigation has so far not revealed the underlying cause. It is not CPU-bound, indicating that the read performance is not held back by checksumming; the latency figures above do not imply an increase in latency significant enough to cause a 3-4x longer check time; and disabling the journal did not change this either (as one would expect, since the journal is unrelated to reads, which should be the only relevant mode for a raid check).
> ZFS was roughly on par with mdraid without the integrity-layer underneath with regard to raid-check time.
> If anyone has an idea of the root cause of this behavior, feel encouraged to contact me, I’d be intrigued to know.
Note that by default ZFS is tuned to prioritize online transactions at the expense of scrub/resilver speed, as this is a sensible choice for its intended use-case of business storage appliances. Businesses can't stop the world for a resilver.
For your average home NAS, you can tune ZFS to scrub/resilver faster; it's not really a big deal, as you are not really hammering the pool anyway.
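The exact tunables have changed between OpenZFS releases, so treat this as a sketch and check the module-parameter documentation for your version; the general idea is letting scans use more of each transaction group's time:

    # current values
    cat /sys/module/zfs/parameters/zfs_scrub_min_time_ms
    cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms

    # let scrub/resilver run harder (example values)
    echo 3000 > /sys/module/zfs/parameters/zfs_scrub_min_time_ms
    echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms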
It's funny that when using cloud storage (like S3) the probability of me losing my data due to a billing issue or user error is way higher than any underlying technical risk. It's like being afraid of dying in a plane crash but seeing no issue riding a motorcycle to the airport.
I've had a lot of good luck with ZFS over the past decade and a half. I just started building my third file server last week, actually. Going to try it with NVMe drives this time!
How is the on-disk data repaired once an error is found?
If dm-integrity returns an error to mdraid, will mdraid rewrite the errored blocks once it determines the correct value from a mirrored copy or via parity?
"In kernels prior to about 2.6.15, a read error would cause the same effect as a write error. In later kernels, a read-error will instead cause md to attempt a recovery by overwriting the bad block. i.e. it will find the correct data from elsewhere, write it over the block that failed, and then try to read it back again. If either the write or the re-read fail, md will treat the error the same way that a write error is treated, and will fail the whole device."