
I would advise people to think about what they need.

ZFS is fine, but it is overkill for most home applications, and it has a pitfall around expanding your storage later.

https://louwrentius.com/what-home-nas-builders-should-unders...

https://louwrentius.com/the-hidden-cost-of-using-zfs-for-you...

So it really depends on your needs. Statements like "ZFS is the best filesystem" are meaningless.

P.S. SSD caching often has no tangible practical benefit for most applications. More RAM is often the better investment, as with any filesystem.




The thing I never see mentioned when the pros and cons of ZFS are discussed is that ZFS is not zero-copy for things like sendfile. This is one reason why we (Netflix) use UFS for serving content rather than ZFS.

This is because ZFS is cached by the ARC, not the normal page cache. ARC is weird, and operates in 8K blocks (like the SPARC page size), rather than 4K pages. Zero-copy things like sendfile depend on referencing pages in the page cache, and have never been adapted to deal with ARC. So making sendfile zero-copy with ZFS is a hard project that would involve either teaching ARC to "loan" pages to sendfile, or ripping out ARC caching from ZFS and making it use the same page cache that all other filesystems use.
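
For anyone who hasn't worked with it, the send loop in question looks roughly like this -- a minimal sketch using Python's os.sendfile wrapper just to show the shape of the call (the real thing is in-kernel C; serve_file and its layout here are purely illustrative). The point is that sendfile() hands the kernel a file descriptor and lets it push already-cached pages straight to the socket, which is exactly the step that doesn't line up with ARC:

    import os, socket

    def serve_file(conn: socket.socket, path: str) -> None:
        # sendfile() asks the kernel to move data from the file straight to
        # the socket, referencing pages already sitting in the page cache,
        # so userspace never copies the bytes. With ZFS the cached data
        # lives in the ARC instead, which is why this path isn't zero-copy
        # there (per the comment above).
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                sent = os.sendfile(conn.fileno(), fd, offset, size - offset)
                if sent == 0:
                    break
                offset += sent
        finally:
            os.close(fd)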


Thanks for the comment. I think it's a bigger issue: people advocating ZFS only promote the good features and aren't open about the downsides (or even try to downplay them with clear bullshit).

It all depends on the circumstances and requirements: a (small) business application or some home-built NAS?

In your case, how much does it matter that some node experiences bitrot, and how big are those risks?


Our use case is far different from a home NAS (hundreds of TB of disk); I replied to that simply because it was also talking about potential downsides to ZFS.

For our use, bit rot is pretty low risk. We have tooling to catch corrupted files (and it happens surprisingly rarely). We don't care about any of the RAID-like features (if a drive dies, we tell clients to get their video elsewhere).

For our use case, ZFS would be attractive mainly because of the ability to keep metadata in the L2ARC. One of our bigger sources of P99 latency is uncached metadata reads from mechanical drives. Our FS guys are currently solving that problem in other ways.


Thank you for this explanation, very interesting. If that's something you can share, it would make a nice blog post.


UFS on what OS?


Netflix is a heavy FreeBSD shop, so I'd assume FreeBSD.


> Netflix is a heavy FreeBSD shop, so I'd assume FreeBSD.

Kind of: their edge-cache appliances run FreeBSD. IIRC they run Linux on their Amazon cloud for all their 'internal' stuff.

If you do some searches, there are some good presentations on their work on getting encrypted streaming to go Very Fast:

* https://www.phoronix.com/scan.php?page=news_item&px=Netflix-...

* https://2019.eurobsdcon.org/talk-speakers/#numa

* https://netflixtechblog.com/serving-100-gbps-from-an-open-co...


UFS on FreeBSD. And that's me :)


FreeBSD.


I use zfs in the home on one machine that runs nfsd and samba.

I started doing that because I saw corruption on magnetic disks at home. Some files were silently corrupted. I had no redundancy. I didn't know which files were "good" either.

So now I have one multi-disk machine running FreeBSD with ZFS. Works well. Hardware isn't especially fancy. In the time since, I have seen it catch hardware failures. I have seen it call out specific files as corrupt. This is a huge improvement over how I saw bad disks surface in the past.


I've now seen two SATA controller failures thanks to ZFS: one on my own machine and one on my parents'. Both presented as if a drive had failed, but it was actually the port on the controller that failed. It'd start up fine, but randomly reads (and probably writes) would just be corrupted and the disk would eventually stop talking. Changing disks wouldn't fix it, which is how I knew it was the controller, but replacing the controller made everything happy. It then resilvered with no issues.


Disagree that data integrity is “overkill for most home applications.”


Tell that to all those Mac, Linux and Windows users on their daily desktops and laptops.

Almost none of them even have ECC memory.


Memory has far fewer “moving parts” than disk. You cannot get a bad “memory cable”, and (AFAIK) there are no cases where an overloaded power supply caused memory errors.


Obviously it's not connected by a cable, but you certainly get the equivalent of cable errors, and I've been plagued by them on certain systems. It's revealing when you have a lot of systems with monitoring of ECC errors. I wouldn't like to say whether multiple DIMMs are necessarily more reliable even than rotating disks.


Memory in a regular desktop/laptop is basically the only part not protected by some kind of ECC algorithm.

If you care so much about data integrity, please start there.


Silent data corruption is a real thing; I have a few thousand files as evidence.

If you are building a home NAS from quality rackmounted server parts, then maybe you are fine. But this was not an option for me, as I did not have dedicated server space and needed something quiet. And once you start to mess with desktop cases full of hard drives, it is very easy to get corrupted data.

I run ZFS on my home NAS. Yes, it (probably) eats too much RAM, and it is (probably) not the fastest thing, but at least my data is intact. I had to piece together my photo collection from multiple backups; it was not fun at all.


The question is always: what exactly happened, and how would other solutions have fared?

A plain statement like this doesn't prove anything.


It provides another example that silent data corruption is a thing, and that it can happen. While the SATA protocol has error detection, it is pretty weak (a 32-bit CRC) and it does not always help, especially since there is no way to tell how often packets are retried.

I had a few cases of data damage. One of the worst was when I moved to a different place and had to leave much of my stuff behind. I had half a dozen or so smaller drives in my PC (SATA + IDE) which were working just fine. I got about three new drives (I believe SATA 1TB?), installed them in the PC, and copied all the files to the new drives. I then left the old PC behind and only took the new drives with me.

This was Linux, ext3 and JBOD (no RAID or anything). I did not have a good filing system, so some data got copied multiple times.

Once I got to the new place, I bought a new PC and installed the hard drives I had. I noticed that some files were damaged. I had some checksumming scripts and was recording checksums, and found out that some checksums would not match - and each copy had a different set of damaged files. I ended up cherry-picking files from multiple copies to assemble a good set.

I don't know the exact reason, but I am fairly sure they were transfer errors. The original PC had been working fine for a long time, so the source data was likely clean. The new PC did not show any more data corruption, and it was reading the same data every time. So my theory is either transfer errors while copying files, or silent data corruption on disk.

I don't know of any solutions that would have helped here except custom data checksum tools or ZFS (I suppose btrfs might have helped too, but I heard too many horror stories about it).

I actually had this come up a second time: when I moved again and built another NAS box (desktop motherboard, 5x 4TB drives with ZFS), I started copying files off the old SATA drives (ext3) and saw data transfer mismatches. It was pretty freaky: "rsync" the file, flush caches, md5sum source and destination -- and they are different. The kernel log was quiet and memtest was not showing any errors, so I got a beefier power supply and replaced all the SATA cables. That helped.
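
The checksum tooling doesn't need to be anything fancy, by the way; something along these lines is enough to catch a bad copy (a rough sketch, not the exact script I used, with made-up paths):

    import hashlib, os, sys

    def file_md5(path, bufsize=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def compare_trees(src_root, dst_root):
        # Hash every file under src_root and compare it against the copy
        # that should exist at the same relative path under dst_root.
        for dirpath, _dirs, files in os.walk(src_root):
            for name in files:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dst_root, os.path.relpath(src, src_root))
                if not os.path.exists(dst):
                    print("MISSING ", dst)
                elif file_md5(src) != file_md5(dst):
                    print("MISMATCH", src, dst)

    if __name__ == "__main__":
        # e.g.: python compare_trees.py /mnt/old-disk /tank/photos
        compare_trees(sys.argv[1], sys.argv[2])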


> ZFS is fine, but it is overkill for most home applications

Yeah, not having silent data corruption is "overkill", sure. /s Why not use ZFS? It takes 15 seconds to install, and its CLI is fairly intuitive. Works fine. Costs $0. Why not, even for "home" applications?

I could see how it could be unsuitable for "enterprise" applications where there are strict performance requirements etc., but for home, I wish I could use ZFS everywhere.


For one, because you can't add an extra disk to your pool if you need to. GP is saying "plan better"; no need to get snarky.


What do you mean? I've done this not so long ago on my pool.

https://unix.stackexchange.com/questions/530968/adding-disks...


You can't add disks to a vdev, though, which is what most people with a home NAS (including me) want.


You can add them to RAID0/1 vdevs; you can't add them to RAIDZ vdevs. Which you don't have if you don't run ZFS. You might have RAID5, but then you also have a write hole.


Strongly disagree that the CLI is intuitive. It’s very easy to kick off unintended actions or back oneself into a corner while doing things that seem reasonable. It’s like using Git in that it’s very hard to do well without a good understanding of the fundamentals, which feel almost nothing like managing disks normally. Lots of things require multiple steps that aren’t obviously related and you’d better not screw them up.

It’s way at the far end of the “must RTFM to use safely, and then probably brush up on it again before actually doing anything unless you use it daily” spectrum of intuitiveness.

I like my ZFS mass storage volumes for my home server. I worry I’ll screw them up and/or burn an hour googling and reading the manual every time I have to touch them, though.


Or install BTRFS and be able to detect bitrot as well, but also be able to grow your storage by adding more disks (even of different sizes).


I've been burned by BTRFS too many times to trust it.

At work, our servers will get into a state where they just hang, for anywhere from minutes to hours, while BTRFS does "something". I'm not one of the admins though, so I don't know the exact details. I just know that this is a vendor-supported configuration and the vendor has been unable to tell us why this happens or offer any solution that makes it not happen. Our answer to this issue has been to rebuild servers with ext4 when things get bad. This has happened on multiple servers hosting different applications - the only commonality seems to be that write-heavy loads get it into this state. Servers that just have their OS on BTRFS but do all of their work on NFS volumes are fine.

At home, I once rebooted my openSUSE Tumbleweed laptop and ended up with a BTRFS filesystem that couldn't be mounted RW. Fortunately I was able to mount it RO after booting off installation media and copy my data off, but I couldn't get the filesystem back into a state where it could be mounted RW. I ended up reinstalling. I never did figure out the root cause, but I suspect that some BTRFS-related process was running when I rebooted.

On the flip side, ZFS has never let me down in this way, but to be fair I've never subjected it to the same use-cases. Unfortunately, the inability to resize/reshape the filesystem is an issue for me. I believe that it's being worked on, but I don't think that work is production-ready yet.


RAID-Z expansion is being worked on: https://github.com/openzfs/zfs/pull/8853

Last I checked, BTRFS RAID5/6 was a dumpster fire and unusable in production. Have they actually open sourced the ability to repair detected bitrot when running on top of mdraid? If not, it's kind of irrelevant.

So... once again, downvotes without a response. BTRFS RAID still isn't recommended, and the file healing isn't compatible with mdraid, I assume - and you just don't like the fact that I pointed it out? The "I'm downvoting because you pointed out a flaw in my logic" attitude on HN is disappointing.


If you're using RAID-Z on zfs, your comparison isn't fair. Rather than use RAID56 with btrfs, the equivalent would be to get 1 or 2 disk redundancy with raid1 or raid1c3.


RAID-Z is the equivalent of RAID-5. RAID-Z2 is the equivalent of RAID-6. RAID-Z3 would be the equivalent of RAID-7 (or whatever the standard is named for 3-disk parity).

This is strictly speaking to how it deals with data and parity, the implementations are obviously different.

RAID-1 would be a mirror in ZFS parlance.


BTRFS raid1 isn't mirroring drives, though; it means there are two copies of each extent across the whole set of 2+ drives. And BTRFS raid1c3 and raid1c4 are 3 and 4 copies.


Not really - RAIDZ is basically RAID5 without the write hole problem. ZFS equivalent to RAID1 is called 'mirror', and is... well, a mirror.


Yes, and btrfs's raid1 isn't a mirror. Btrfs's raid1c3 and raid1c4 are its alternative to raid5/6 without the write hole.


No, it's still a mirror, just spread over more than two devices. Size overhead is still that of a mirror, ie quite a bit higher than that of raidz.


Just avoid RAID 5/6...


Just avoid RAID 5/6


If it's CLI, it's a nonstarter.

Even if it has a GUI, it's probably a nonstarter unless there are literally two options.



