The thing I never see mentioned when the pros and cons of ZFS are discussed is that ZFS is not zero-copy for things like sendfile. This is one reason why we (Netflix) use UFS for serving content rather than ZFS.
This is because ZFS is cached by the ARC, not the normal page cache. ARC is weird, and operates in 8K blocks (like sparc page size), rather than 4K pages. Zero-copy things like sendfile depend on referencing pages in the page cache, and have never been adapted to deal with ARC. So making sendfile zero-copy with ZFS is a hard project that would involve either teaching ARC to "loan" pages to sendfile, or ripping out ARC caching from ZFS and making it use the same page cache that all other filesystems use.
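For anyone who hasn't used it, here is roughly what the sendfile path looks like from userspace. This is only a minimal sketch, not what Netflix runs: the function name `send_file` is made up for illustration, and it uses Python's `socket.sendfile()`, which calls `os.sendfile()` where the platform supports it and falls back to plain `send()` otherwise. Whether the kernel actually avoids the copy depends on the OS and on the filesystem underneath, which is the whole point of the comment above.

```python
import socket


def send_file(sock: socket.socket, path: str) -> int:
    """Send the whole file at `path` over a connected stream socket.

    Returns the number of bytes sent. `send_file` is a hypothetical
    helper name used only for this illustration.
    """
    with open(path, "rb") as f:
        # socket.sendfile() asks the kernel to move the data from the file
        # to the socket via os.sendfile() when that is available, so the
        # bytes never pass through a userspace buffer. On platforms without
        # os.sendfile() it silently falls back to read() + send().
        return sock.sendfile(f)
```

The application code is the same on UFS and ZFS. The difference is in the kernel: on a page-cache filesystem the socket can reference the cached pages directly, while with ARC the data has to be copied first.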
Thanks for the comment. I think it's a bigger issue: people advocating ZFS only promote the good features and aren't open about the downsides (or even try to downplay them with clear bullshit).
It all depends on the circumstances and requirements: (small) business application or some home-build NAS?
In your case, how much does it matter that some node experiences bitrot, and how big are those risks?
Our use case is very different from a home NAS (hundreds of TB of disk). I replied to that simply because it was also talking about potential downsides to ZFS.
For our use, bit rot is pretty low risk. We have tooling to catch corrupted files (and it happens surprisingly rarely). We don't care about any of the RAID-like features (if a drive dies, we tell clients to get their video elsewhere).
For our use case, ZFS would be attractive mainly because of the ability to keep metadata in the L2 ARC. One of our bigger sources of P99 latency is uncached metadata reads from mechanical drives. Our FS guys are currently solving that problem in other ways.
I use zfs in the home on one machine that runs nfsd and samba.
I started doing that because I saw corruptions on magnetic disks at home. Some files were silently corrupted. I had no redundancy. I didn't know which files were "good" either.
So now I have one multi disk machine running FreeBSD with zfs. Works well. Hardware isn't especially fancy. In the time since I have seen it catch hardware failures. I have seen it call out specific files as corrupt. This is a huge improvement over how I saw bad disks surface in the past.
I've now seen two SATA controller failures thanks to zfs. One on my own machine and one on my parents'. Both presented as if a drive had failed, but it was actually the port on the controller that failed. It'd start up fine, but randomly reads (and probably writes) would just be corrupted and the disk would eventually stop talking. Changing disks wouldn't fix it, which is how I knew it was the controller, but replacing the controller made everything happy. It then resilvered with no issues.
Memory has far fewer “moving parts” than disk. You cannot get a bad “memory cable”, and (AFAIK) there are no cases where an overloaded power supply caused memory errors.
Obviously it's not connected by cable, but you certainly get the equivalent to cable errors, and I've been plagued by them on certain systems. It's revealing when you have a lot of systems with monitoring of ECC errors. I wouldn't like to say whether multiple DIMMs are necessarily more reliable even than rotating disks.
Silent data corruption is a real thing; I have a few thousand files as evidence.
If you are building a home NAS from quality rackmounted server parts, then maybe you are fine. But this was not an option for me, as I did not have dedicated server space and needed something quiet. And once you have to start messing with desktop cases full of hard drives, it is very easy to get corrupted data.
I run ZFS on my home NAS. Yes, it (probably) eats too much RAM, and it is (probably) not the fastest thing, but at least my data is intact. I had to piece together my photo collection from multiple backups, and it was not fun at all.
It provides another example that silent data corruption is a thing, and that it can happen. While the SATA protocol has error detection, it is pretty weak (32-bit CRC) and it does not always help, especially since there is no way to tell how often the packets are retried.
I had a few cases of data damage. One of the worst ones was when I moved to a different place, and had to leave much of my stuff behind. I had half a dozen or so smaller drives in my PC (SATA + IDE) which were working just fine. I got about three new drives (I believe SATA 1TB?), installed them into the PC, and copied all the files to the new drives. I then left the old PC, and only took the new drives with me.
This was Linux, ext3 and JBOD (no RAID or anything). I did not have a good filing system, so some data got copied multiple times.
Once I got into the new place, I bought a new PC and installed the hard drives I had. I noticed that some files were damaged. I had some checksumming scripts, and I had been recording checksums, and found out that some checksums would not match - and each copy had a different set of damaged files. I ended up cherry-picking files from multiple copies to assemble a good set.
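The cherry-picking step can be scripted once you have recorded checksums. The following is a rough sketch of the idea, not the scripts I actually used: the names `file_digest` and `pick_good_copies` and the manifest layout (relative path mapped to a SHA-256 hex digest) are assumptions made up for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def pick_good_copies(manifest, copy_roots):
    """Find an intact copy of every file listed in the manifest.

    manifest:   dict of relative path -> expected hex digest (assumed format)
    copy_roots: directories that each hold one (possibly damaged) copy

    Returns (good, missing): `good` maps each relative path to the first
    copy whose digest matches, `missing` lists paths with no intact copy.
    """
    good, missing = {}, []
    for rel, expected in manifest.items():
        for root in copy_roots:
            candidate = Path(root) / rel
            if candidate.is_file() and file_digest(candidate) == expected:
                good[rel] = candidate
                break
        else:
            # No copy matched the recorded checksum.
            missing.append(rel)
    return good, missing
```

This only works if the checksums were recorded while the data was still known to be good, which is exactly the part ZFS does for you automatically.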
I don't know the exact reason, but I am fairly sure they were transfer errors. The original PC was working fine for a long time, so source data was likely clean. The new PC did not show any more data corruptions, and it was reading the same data every time. So my theory is either transfer errors while copying files, or silent data corruption on disk.
I don't know of any solutions that would have helped here except custom data checksum tools or ZFS (I suppose btrfs might have helped too, but I heard too many horror stories about it).
I actually had this come up the second time: when I moved again, and built another NAS box (desktop motherboard, 5x 4TB drives with ZFS), I started copying files off the old SATA drives (ext3) and saw the data transfer mismatch. It was pretty freaky: "rsync" the file, flush caches, md5sum source and destinations -- and they are different. kernel log was quiet, memtest was not showing any errors, so I got a beefier power supply and replaced all the SATA cables. This helped.
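If you want to repeat the "copy, flush caches, checksum both sides" test without dropping the whole system cache as root, something like the sketch below is one way to approximate it. It is an illustration under assumptions, not the exact procedure I ran: `copy_and_verify` and `digest_from_disk` are made-up names, it uses SHA-256 rather than md5sum, and `os.posix_fadvise` with `POSIX_FADV_DONTNEED` is only a hint that the kernel may ignore (and it does not exist on every platform).

```python
import hashlib
import os
import shutil


def digest_from_disk(path, chunk_size=1 << 20):
    """Hash a file after asking the kernel to drop its cached pages."""
    h = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # Best-effort hint: discard cached pages so the read below
                # is more likely to come from the drive. Not a guarantee.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass
        while True:
            block = os.read(fd, chunk_size)
            if not block:
                break
            h.update(block)
    finally:
        os.close(fd)
    return h.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, flush dst to storage, then compare digests."""
    shutil.copyfile(src, dst)
    # Dirty pages cannot be dropped from the cache, so flush them first.
    with open(dst, "rb+") as f:
        os.fsync(f.fileno())
    return digest_from_disk(src) == digest_from_disk(dst)
```

A `False` result tells you something between the two drives is corrupting data, but not which part. In my case it took swapping the power supply and the cables to find out.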
> ZFS is fine, but it is overkill for most home applications
Yeah, not having silent data corruption is "overkill", sure. /s Why not use ZFS? It takes 15 seconds to install, and its CLI is fairly intuitive. Works fine. Costs $0. Why not, even for "home" applications?
I could see how it could be unsuitable for "enterprise" applications where there are strict performance requirements etc, but for home, I wish I could use ZFS everywhere.
You can add them to RAID0/1 vdevs; you can't add them to RAIDZ vdevs. Which you don't have if you don't run ZFS. You might have RAID5, but then you also have a write hole.
Strongly disagree that the CLI is intuitive. It’s very easy to kick off unintended actions or back oneself into a corner while doing things that seem reasonable. It’s like using Git in that it’s very hard to do well without a good understanding of the fundamentals, which feel almost nothing like managing disks normally. Lots of things require multiple steps that aren’t obviously related and you’d better not screw them up.
It’s way at the far end of the “must RTFM to use safely, and then probably brush up on it again before actually doing anything unless you use it daily” spectrum of intuitiveness.
I like my ZFS mass storage volumes for my home server. I worry I’ll screw them up and/or burn an hour googling and reading the manual every time I have to touch them, though.
I've been burned by BTRFS too many times to trust it.
At work, our servers will get into a state where they just hang for a span from minutes to hours while BTRFS does "something". I'm not one of the admins though, so I don't know the exact details. I just know that this is a vendor-supported configuration and the vendor has been unable to tell us why this happens or offer any solution that makes it not happen. Our answer to this issue has been to rebuild servers with ext4 when things get bad. This has happened on multiple servers hosting different applications - the only commonality seems to be that write-heavy loads get it into this state. Servers that just have their OS on BTRFS but do all of their work on NFS volumes are fine.
At home, I once rebooted my OpenSuse Tumbleweed laptop and ended up with a BTRFS filesystem that couldn't be mounted RW. Fortunately I was able to mount it RO after booting off installation media and copy my data off, but I couldn't get the filesystem back into a state where it could be mounted RW. I ended up reinstalling. I never did figure out the root cause, but I suspect that some BTRFS-related process was running when I rebooted.
On the flip side, ZFS has never let me down in this way, but to be fair I've never subjected it to the same use-cases. Unfortunately, the inability to resize/reshape the filesystem is an issue for me. I believe that it's being worked on, but I don't think that work is production-ready yet.
Last I checked BTRFS RAID5/6 was a dumpster fire and unusable in production. Have they actually open sourced the ability to fix bitrot detection with mdraid? If not, it's kind of irrelevant.
So... once again down votes without response - BTRFS raid still isn't recommended and the file healing isn't compatible with MDRAID I assume and you just don't like the fact I pointed it out? The "I'm downvoting because you pointed out a flaw in my logic" @HN is disappointing.
If you're using RAID-Z on zfs, your comparison isn't fair. Rather than use RAID56 with btrfs, the equivalent would be to get 1 or 2 disk redundancy with raid1 or raid1c3.
RAID-Z is the equivalent of RAID-5. RAID-Z2 is the equivalent of RAID-6. RAID-Z3 would be the equivalent of RAID-7 (or whatever the standard is named for 3-disk parity).
This is strictly speaking to how it deals with data and parity, the implementations are obviously different.
BTRFS raid1 isn't mirroring drives though; it means there are two copies of each extent spread across the whole set of 2+ drives. BTRFS raid1c3 and raid1c4 keep 3 and 4 copies respectively.
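To put numbers on the parity versus copies difference, here is a back-of-the-envelope sketch. It assumes equal-sized disks and ignores metadata, padding and allocation overhead, so real usable space will be somewhat lower. The function names are made up for this example and do not correspond to any ZFS or BTRFS tool.

```python
def raidz_usable(n_disks, disk_tb, parity):
    """Approximate usable TB of one RAID-Z vdev (parity = 1, 2 or 3)."""
    if parity not in (1, 2, 3):
        raise ValueError("RAID-Z parity level must be 1, 2 or 3")
    if n_disks <= parity:
        raise ValueError("need more disks than parity devices")
    # Parity schemes give up `parity` disks' worth of space.
    return (n_disks - parity) * disk_tb


def btrfs_raid1_usable(n_disks, disk_tb, copies=2):
    """Approximate usable TB for BTRFS raid1 (copies=2), raid1c3, raid1c4."""
    if copies not in (2, 3, 4):
        raise ValueError("copies must be 2, 3 or 4")
    if n_disks < copies:
        raise ValueError("need at least as many disks as copies")
    # Every extent is stored `copies` times somewhere in the pool.
    return n_disks * disk_tb / copies
```

For five 4 TB drives, `raidz_usable(5, 4, 1)` gives 16 while `btrfs_raid1_usable(5, 4)` gives 10.0, both with single-disk redundancy. That gap is the price of avoiding parity RAID on BTRFS.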
ZFS is fine, but it is overkill for most home applications and it has a pitfall related to extensibility.
https://louwrentius.com/what-home-nas-builders-should-unders...
https://louwrentius.com/the-hidden-cost-of-using-zfs-for-you...
So it really depends on your needs. Statements such as "ZFS is the best filesystem" are so meaningless.
P.S. SSD caching often has no tangible practical benefit for most applications. More RAM is often the better investment. As is with any filesystem.