Insightful article. ZFS is the best file system. You know you don't have silent file corruption. You can run without RAID controllers. Snapshots are super fast to take, with no waiting, and use very little extra space. You can use an SSD cache with ZFS to accelerate reads from the physical disks. Transparent file compression. A good command line interface. And with ZED you can now take useful actions, for example sending a notification to an alert system like Slack or a ticketing system when a disk fails, or kicking off a disk scrub/rebuild.
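To make the ZED part concrete, here is a minimal sketch of a notification hook, assuming a Slack incoming webhook (the URL is a placeholder) and the ZEVENT_* environment variables described in the zed(8) man page. ZED runs any executable placed in /etc/zfs/zed.d/ whose name prefix matches an event class, so a name like all-notify-slack.py is hypothetical:

    #!/usr/bin/env python3
    # Hypothetical zedlet, e.g. /etc/zfs/zed.d/all-notify-slack.py.
    # ZED passes event details via ZEVENT_* environment variables.
    import json
    import os
    import urllib.request

    WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def main():
        event = os.environ.get("ZEVENT_CLASS", "unknown")  # e.g. ereport.fs.zfs.checksum
        pool = os.environ.get("ZEVENT_POOL", "unknown")
        vdev = os.environ.get("ZEVENT_VDEV_PATH", "")
        text = f"ZFS event {event} on pool {pool} {vdev}".strip()
        req = urllib.request.Request(
            WEBHOOK,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)

    if __name__ == "__main__":
        main()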

If you are on Linux I can highly recommend ZFS and Minio: ZFS for local storage, Minio for S3-like storage.




I don't understand where minio suddenly comes from?

Minio is barely documented. I had to ask in Slack to interpret what it meant when minio said "your cluster is 5 red and 7 yellow", as the colours aren't even documented.

Every minio cluster I hosted had data loss. Each one led to a reported issue on their GitHub that hasn't been closed to date. Nothing about recovery is documented. Documentation is slim in general. I'm really confused why you're naming it next to ZFS, which is sublimely documented and has withstood the test of time.

I'd advise something battletested through time like Ceph for object storage instead.


I'm super confused how anyone could recommend Minio for the foundation of a production system as well. I've understood it to be primarily for development purposes and have used it for some local mock tests for S3 compatibility behind a firewall or to simulate error test cases more reliably in my applications.

Even Ceph has its warts for distributed object storage, as does basically... anything in existence worth considering that's OSS-ish (GlusterFS, HDFS, Lustre), but comparing Minio to ZFS is confusing given how drastically the two projects differ in purpose, functionality, and engineering hardening. With that said, the AWS S3 team really is impressive in what they've built out and deserves more shout-outs from people outside Amazon.


Could you expand more re: your minio experience? We're about to enter production with minio as our S3 interface-providing file storage system, and have found documentation to be sufficient.

However, I'm slightly concerned that we haven't tested it enough, and that doc and support may prove to be lacking (like in your case) when we hit edge cases and failure scenarios.


I'm checking it out too.

No LTS version. Bugs will be fixed in a new version, which may include new bugs. Some upgrades need a full cluster restart. If you ask too many questions you may get "it's better to start a subscription".


We've been running it for about a year now and have had zero problems with it. (~20TB total, a few single instances, serving as the S3 backend to a Restic backup service with handmade HA)


What do you mean by "handmade HA"?


Perhaps a high-availability system they made themselves?


That's the whole point of using it in the first place. And the Reed-Solomon encoding/replication.


I highly recommend you look at leofs instead — it’s been absolutely rock solid for us.


It's a bit of a non sequitur, but I am also looking at using MinIO as an S3 interface to a ZFS filesystem. I would be interested to hear from others about this use case for MinIO and possible alternatives.


I would advise people to think about what they need.

ZFS is fine, but it is overkill for most home applications and it has a pitfall related to extensibility.

https://louwrentius.com/what-home-nas-builders-should-unders...

https://louwrentius.com/the-hidden-cost-of-using-zfs-for-you...

So it really depends on your needs. Statements like "ZFS is the best filesystem" are meaningless on their own.

P.S. SSD caching often has no tangible practical benefit for most applications. More RAM is often the better investment - as with any filesystem.


The thing I never see mentioned when the pros and cons of ZFS are discussed is that ZFS is not zero-copy for things like sendfile. This is one reason why we (Netflix) use UFS for serving content rather than ZFS.

This is because ZFS is cached by the ARC, not the normal page cache. ARC is weird, and operates in 8K blocks (like the SPARC page size), rather than 4K pages. Zero-copy things like sendfile depend on referencing pages in the page cache, and have never been adapted to deal with ARC. So making sendfile zero-copy with ZFS is a hard project that would involve either teaching ARC to "loan" pages to sendfile, or ripping out ARC caching from ZFS and making it use the same page cache that all other filesystems use.
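For context, here is a minimal sketch of what zero-copy serving looks like from the application side (Python, with a hypothetical already-connected socket): os.sendfile() asks the kernel to push pages straight from the page cache to the socket, which is exactly the step that has no ARC-backed equivalent.

    import os
    import socket

    def serve_file(conn: socket.socket, path: str) -> None:
        """Send a whole file over a connected socket without copying it
        through userspace buffers (wraps the sendfile(2) syscall)."""
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            offset = 0
            while offset < size:
                sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
                if sent == 0:
                    break  # peer closed or nothing left to send
                offset += sent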


Thanks for the comment. I think it's a bigger issue: people advocating ZFS only promote the good features and aren't open about the downsides (or even try to downplay them with clear bullshit).

It all depends on the circumstances and requirements: (small) business application or some home-build NAS?

In your case, how much does it matter that some node experiences bitrot, and how big are those risks?


Our use case is far different from a home NAS (hundreds of TB of disk); I replied to that simply because it was also talking about potential downsides of ZFS.

For our use, bit rot is pretty low risk. We have tooling to catch corrupted files (and it happens surprisingly rarely). We don't care about any of the raid like features (if a drive dies, we tell clients to get their video elsewhere).

For our use case, ZFS would be attractive mainly because of the ability to keep metadata in the L2 ARC. One of our bigger sources of P99 latency is uncached metadata reads from mechanical drives. Our FS guys are currently solving that problem in other ways.
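If anyone wants to experiment with that, steering the L2ARC toward metadata is a per-dataset property; a minimal sketch, assuming a dataset named "tank" (the name is a placeholder):

    import subprocess

    # Cache only metadata (not file data) on the L2ARC for this dataset.
    # Equivalent CLI: zfs set secondarycache=metadata tank
    subprocess.run(["zfs", "set", "secondarycache=metadata", "tank"], check=True)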


Thank you for this explanation, very interesting. If that's something you can share, it would make a nice blog post.


UFS on what OS?


NetFlix is a heavy FreeBSD shop, so I'd assume FreeBSD.


> NetFlix is a heavy FreeBSD shop, so I'd assume FreeBSD.

Kind of: their edge-cache appliances run FreeBSD. IIRC they run Linux on their Amazon cloud for all their 'internal' stuff.

If you do some searches there are some good presentations on their work on getting encrypted streaming to go Very Fast:

* https://www.phoronix.com/scan.php?page=news_item&px=Netflix-...

* https://2019.eurobsdcon.org/talk-speakers/#numa

* https://netflixtechblog.com/serving-100-gbps-from-an-open-co...


UFS on FreeBSD. And that's me :)


FreeBSD.


I use zfs in the home on one machine that runs nfsd and samba.

I started doing that because I saw corruption on magnetic disks at home. Some files were silently corrupted. I had no redundancy. I didn't know which files were "good" either.

So now I have one multi-disk machine running FreeBSD with ZFS. It works well. The hardware isn't especially fancy. In the time since, I have seen it catch hardware failures. I have seen it call out specific files as corrupt. This is a huge improvement over how I saw bad disks surface in the past.


I've now seen two SATA controller failures thanks to ZFS. One on my own machine and one on my parents'. Both presented as if a drive had failed, but it was actually the port on the controller that failed. It'd start up fine, but random reads (and probably writes) would just be corrupted and the disk would eventually stop talking. Changing disks wouldn't fix it, which is how I knew it was the controller, but replacing the controller made everything happy. It then resilvered with no issues.


Disagree that data integrity is “overkill for most home applications.”


Tell that to all those Mac, Linux and Windows users on their daily desktops and laptops.

Almost none of them have even ECC memory.


Memory has far fewer “moving parts” than disk. You cannot get a bad “memory cable”, and (AFAIK) there are no cases where an overloaded power supply caused memory errors.


Obviously it's not connected by a cable, but you certainly get the equivalent of cable errors, and I've been plagued by them on certain systems. It's revealing when you have a lot of systems with monitoring of ECC errors. I wouldn't like to say whether multiple DIMMs are necessarily more reliable even than rotating disks.


Memory in a regular desktop/laptop is basically the only part not protected by some kind of ECC algorithm.

If you care so much about data integrity, please start there.


Silent data corruption is a real thing; I have a few thousand files as evidence.

If you are building a home NAS from quality rack-mounted server parts, then maybe you are fine. But this was not an option for me, as I did not have dedicated server space and needed something quiet. And once you have to start messing with desktop cases full of hard drives, it is very easy to end up with corrupted data.

I run ZFS on my home NAS. Yes, it (probably) eats too much RAM, and it's (probably) not the fastest thing, but at least my data is intact. I once had to piece together my photo collection from multiple backups; it was not fun at all.


The question is always: what exactly happened, and how would other solutions have fared?

A plain statement like this doesn't prove anything.


It provides another example that silent data corruption is a thing, and that it can happen. While the SATA protocol has error detection, it is pretty weak (a 32-bit CRC) and it does not always help, especially since there is no way to tell how often packets are retried.

I had a few cases of data damage. One of the worst ones was when I moved to a different place, and had to leave much of my stuff behind. I had half a dozen or so smaller drives in my PC (SATA + IDE) which were working just fine. I got about three new drives (I believe SATA 1TB?), installed them into PC, and copied all the files to the new drives. I then left the old PC, and only took the new drives with me.

This was Linux, ext3 and JBOD (no RAID or anything). I did not have a good filing system, so some data got copied multiple times.

Once I got to the new place, I bought a new PC and installed the hard drives I had. I noticed that some files were damaged. I had some checksumming scripts and had been recording checksums, and found that some checksums would not match - and each copy had a different set of damaged files. I ended up cherry-picking files from multiple copies to assemble a good set.

I don't know the exact reason, but I am fairly sure they were transfer errors. The original PC had been working fine for a long time, so the source data was likely clean. The new PC did not show any further data corruption, and it read the same data every time. So my theory is either transfer errors while copying files, or silent data corruption on disk.

I don't know of any solutions that would have helped here except custom data checksum tools or ZFS (I suppose btrfs might have helped too, but I heard too many horror stories about it).

I actually had this come up a second time: when I moved again and built another NAS box (desktop motherboard, 5x 4TB drives with ZFS), I started copying files off the old SATA drives (ext3) and saw data transfer mismatches. It was pretty freaky: "rsync" the file, flush caches, md5sum source and destination -- and they are different. The kernel log was quiet and memtest was not showing any errors, so I got a beefier power supply and replaced all the SATA cables. That helped.
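For anyone who wants to do the same kind of verification, the checksumming scripts don't need to be anything fancy; here is a minimal sketch of the approach (the path is whatever directory you point it at):

    import hashlib
    import os
    import sys

    def file_md5(path, bufsize=1 << 20):
        """Stream the file through MD5 so large files don't need to fit in RAM."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def checksum_tree(root):
        """Print 'checksum  relative/path' for every file under root."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                print(f"{file_md5(path)}  {os.path.relpath(path, root)}")

    if __name__ == "__main__":
        checksum_tree(sys.argv[1])

Run it over the source and over each copy, then diff the outputs: any line that differs points at a file to cherry-pick from another copy.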


> ZFS is fine, but it is overkill for most home applications

Yeah, not having silent data corruption is "overkill", sure. /s Why not use ZFS? It takes 15 seconds to install, and its CLI is fairly intuitive. Works fine. Costs $0. Why not, even for "home" applications?

I could see how it could be unsuitable for "enterprise" applications where there are strict performance requirements etc., but for home, I wish I could use ZFS everywhere.


For one, because you can't add an extra disk to your pool if you need to. The GP is saying "plan better"; no need to get snarky.


What do you mean? I've done this not so long ago on my pool.

https://unix.stackexchange.com/questions/530968/adding-disks...


You can't add disks to a vdev, though, which is what most people with a home NAS (including me) want.


You can add them to RAID0/1 vdevs; you can't add them to RAIDZ vdevs. Which you don't have if you don't run ZFS. You might have RAID5, but then you also have a write hole.
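To make that concrete, here's a rough sketch of what does and doesn't work (pool and device names are placeholders): you can attach a disk to widen a mirror or add a whole new vdev, but you can't attach a disk to an existing RAIDZ vdev.

    import subprocess

    def zpool(*args):
        """Thin wrapper around the zpool CLI."""
        subprocess.run(("zpool",) + args, check=True)

    # Turn a single-disk vdev into a mirror (or widen an existing mirror):
    #   zpool attach tank ada1 ada2
    zpool("attach", "tank", "ada1", "ada2")

    # Grow the pool by adding a brand-new vdev (here a two-disk mirror):
    #   zpool add tank mirror ada3 ada4
    zpool("add", "tank", "mirror", "ada3", "ada4")

    # There is no equivalent for widening an existing RAIDZ vdev; until the
    # RAIDZ expansion work lands, you either replace every disk with a bigger
    # one or add another vdev alongside it.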


Strongly disagree that the CLI is intuitive. It’s very easy to kick off unintended actions or back oneself into a corner while doing things that seem reasonable. It’s like using Git in that it’s very hard to do well without a good understanding of the fundamentals, which feel almost nothing like managing disks normally. Lots of things require multiple steps that aren’t obviously related and you’d better not screw them up.

It’s way at the far end of the “must RTFM to use safely, and then probably brush up on it again before actually doing anything unless you use it daily” spectrum of intuitiveness.

I like my ZFS mass storage volumes for my home server. I worry I’ll screw them up and/or burn an hour googling and reading the manual every time I have to touch them, though.


Or install BTRFS and be able to detect bitrot as well, but also be able to grow your storage by adding more disks (even of different sizes).


I've been burned by BTRFS too many times to trust it.

At work, our servers will get into a state where they just hang for anywhere from minutes to hours while BTRFS does "something". I'm not one of the admins though, so I don't know the exact details. I just know that this is a vendor-supported configuration and the vendor has been unable to tell us why this happens or offer any solution that makes it not happen. Our answer to this issue has been to rebuild servers with ext4 when things get bad. This has happened on multiple servers hosting different applications - the only commonality seems to be that write-heavy loads get it into this state. Servers that just have their OS on BTRFS but do all of their work on NFS volumes are fine.

At home, I once rebooted my OpenSuse Tumbleweed laptop and ended up with a BTRFS filesystem that couldn't be mounted RW. Fortunately I was able to mount it RO after booting off installation media and copy my data off, but I couldn't get the filesystem back into a state where it could be mounted RW. I ended up reinstalling. I never did figure out the root cause, but I suspect that some BTRFS-related process was running when I rebooted.

On the flip side, ZFS has never let me down in this way, but to be fair I've never subjected it to the same use-cases. Unfortunately, the inability to resize/reshape the filesystem is an issue for me. I believe that it's being worked on, but I don't think that work is production-ready yet.


RAID-Z expansion is being worked on: https://github.com/openzfs/zfs/pull/8853

Last I checked BTRFS RAID5/6 was a dumpster fire and unusable in production. Have they actually open sourced the ability to repair detected bitrot when sitting on top of mdraid? If not, it's kind of irrelevant.

So... once again, downvotes without a response - BTRFS RAID still isn't recommended, and the file healing isn't compatible with mdraid I assume, and you just don't like the fact that I pointed it out? The "I'm downvoting because you pointed out a flaw in my logic" attitude on HN is disappointing.


If you're using RAID-Z on zfs, your comparison isn't fair. Rather than use RAID56 with btrfs, the equivalent would be to get 1 or 2 disk redundancy with raid1 or raid1c3.


RAID-Z is the equivalent of RAID-5. RAID-Z2 is the equivalent of RAID-6. RAID-Z3 would be the equivalent of RAID-7 (or whatever the standard is named for 3-disk parity).

This is strictly speaking to how they deal with data and parity; the implementations are obviously different.

RAID-1 would be a mirror in ZFS parlance.


BTRFS raid1 isn't mirroring drives, though; it means there are two copies of each extent across the whole set of 2+ drives. And BTRFS raid1c3 and raid1c4 are 3 and 4 copies.


Not really - RAIDZ is basically RAID5 without the write hole problem. The ZFS equivalent to RAID1 is called 'mirror', and is... well, a mirror.


Yes, and btrfs's raid1 isn't a mirror. Btrfs's raid1c3 and raid1c4 are its alternative to raid5/6 without the write hole.


No, it's still a mirror, just spread over more than two devices. Size overhead is still that of a mirror, ie quite a bit higher than that of raidz.


Just avoid RAID 5/6...



If it's CLI, it's a non-starter.

Even if it has a GUI, it's probably a non-starter unless there are literally two options.


Can you comment on btrfs by comparison? Also, what is the status of the license incompatibility/integration of ZFS with distros? Canonical seems to think that shipping it with Ubuntu is legal, but other distros seem less sure.


Fedora is proposing to default to btrfs for its next release: https://fedoraproject.org/wiki/Changes/BtrfsByDefault


Contrast with Debian's warnings on the topic:

* https://wiki.debian.org/Btrfs#Warnings


Btrfs has pretty much the same features listed above, minus the cache drive handling. (You can still get the cache drive behaviour with a bcache device.) The RAID support has different modes, so you'll have to decide if what's available is enough for you.

You can check the status of each feature here: https://btrfs.wiki.kernel.org/index.php/Status


It’s also still considered experimental and according to kernel.org wiki “under heavy development.” ZFS was released for production use 14 years ago.


https://github.com/torvalds/linux/blob/master/fs/btrfs/Kconf...

None of the options are marked experimental. (Specific features are marked unstable on the status page)

"Under heavy development" does not mean anything about stability. The kernel itself is under heavy development. ZoL is under heavy development. The disk format is stable, which is what matters.

SUSE provides commercial support for btrfs and uses it as the default. That's pretty much as non-experimental as is gets.


In the past, when data integrity issues emerged, btrfs devs have stated that it is not ready for production. Has there been an announcement to the contrary? If btrfs is production ready, is this clearly stated somewhere? Solaris has defaulted to ZFS at least since version 11, first released eight years ago.

Update - also, it appears SUSE Enterprise uses btrfs for the root OS filesystem but XFS for everything else, including /home by default. To me, this seems telling. If it's so solid, why not use it for /home?


It's trivial to use ZFS on NixOS, including on the root partition.

The legal position is that ZFS is open source under a copyleft license but that many people think that it's illegal to bundle it with the Linux kernel because of some (I think unintended) incompatibilities between its license and the Linux kernel's license. Canonical (and some others) disagree, and think that it's legal. It's only the bundling that's at issue - everyone agrees that it's legal to use with Linux once you have both.

See https://ubuntu.com/blog/zfs-licensing-and-linux for Canonical's opinion.


The incompatibility was very much intended: Sun needed a way to compete with Linux and didn't want to be assimilated into the Linux ecosystem, so they released OpenSolaris under the CDDL instead of the GPL or MIT. Oracle hasn't re-licensed it, for their own reasons.

Here is some more info on the subject: https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/


> The incompatibility was very much intended

Nope. Simon Phipps, Sun's Chief Open Source Officer at the time, and the boss of Danese Cooper—who is the source of the claims it was intended—has stated it was not:

* https://en.wikipedia.org/wiki/Common_Development_and_Distrib...

Bryan Cantrill, the co-author of DTrace, has also stated that they were expecting the Linux folks to potentially incorporate their work:

* https://old.reddit.com/r/IAmA/comments/31ny87#cq3bs9z

Do you have any citations that can corroborate the/your claim that incompatibility was intended?

> Here is some more info on the subject:

See also:

* https://softwarefreedom.org/resources/2016/linux-kernel-cddl...


Do you happen to know where the claim that it was intended came from?


Because Danese Cooper said so. (See video link from the Wikipedia article: https://en.wikipedia.org/wiki/Common_Development_and_Distrib...)

I think there's also a general suspicion that Sun could have just chosen the GPL if they cared about compatibility. Although, for various reasons, it's probably at least somewhat more complicated than that because of patent protection, etc.


> I think there's also a general suspicion that Sun could have just chosen the GPL if they cared about compatibility.

There were 'technical' reasons why they did not go with the GPL, specifically GPLv2 (GPLv3 was not out yet). IIRC, they did consider waiting for GPLv3, but it was unknown when it would be out, and one thing they desired was a patent grant, which v2 does not have.

Another condition was that they wanted a file-based copyright rather than a work-based copyright (i.e., applies to any individual files of ZFS as opposed to "ZFS" in aggregate).

* https://nawilson.com/2007/12/02/why-the-dislike-for-cddl/


I had forgotten about some of the reasons they specifically wanted file-based copyright. Sun were clients at the time and I spoke fairly frequently with the open source folks there. But I didn't remember all the details and was certainly not privy to all the internal discussions.


"Sun could have just chosen the GPL if they cared about compatibility."

That's a very loaded statement. I've seen it said quite a lot over the years. But, have you thought about its implications?

The implicit assumption here is the primacy of the GPL over all other open source licences. Why should other companies and organisations treat it as "more special" than any other free/open source licence when it comes down to interoperability?

When it comes down to compatibility, the GPL is one of the last licences you should choose, because by its very nature it is deliberately and intentionally incompatible with everything other than the most permissive licences. The problem with "viral" licences like the GPL is that "there can only be one", because they are mutually incompatible by nature. Why should the MPL/Apache/CDDL licences make special exemptions to lessen their requirements so that they can be GPL-compatible?


I should have written compatibility with the GPL (or really the Linux kernel which was what was most relevant from the perspective of Solaris). And, of course, Sun could have chosen a fully permissive license but AFAIK nothing like that was seriously considered.

Nit: Apache 2.0 is compatible with GPLv3 (but not v2).


I agree it was intended and I remember Sun talking about that intention at the time. Sun specifically removed the multiple license compatibility language (section 13) from the Mozilla MPL when creating the CDDL:

https://web.archive.org/web/20060816050912/http://www.sun.co...


Interesting, given that the old MPL is GPL-incompatible too.

How would it work anyway? It's not CDDL or MPL that causes the incompatibility, it's the GPL.


FWIW, the CDDL is pretty much just the Mozilla license. The incompatibility is caused by the GPL, which, according to the FSF, cannot be linked against anything whose license isn't a subset of the GPL. You'd get the same incompatibility if ZFS were covered by GPLv3, for example.


https://www.gnu.org/licenses/license-list.html#CDDL

"This means a module covered by the GPL and a module covered by the CDDL cannot legally be linked together."

I don't think the incompatibilities are really unintended. If anyone has enough lawyers and money to relicense something correctly, it's Oracle.


Except then you'd lose the patent protection provided by the CDDL. And you'd cause licensing problems for literally everyone else but Linux. And you probably wouldn't gain anything in the long run anyway; AdvFS was released under the GPL and went nowhere.


Even if Canonical were wrong, the chance that anyone with an interest in suing Canonical over an unintended legal technicality would actually do so seems nil.

ZFS on Linux does not break the intention and spirit of the kernel licence.


Relying on Oracle not to sue people for questionable reasons seems like a precarious position.


So basically the problem with ZFS is that Oracle would sue someone for GPL violation?


Or the Linux kernel people. If you are distributing their work, you can do so because the GPL allows it. But for the GPL to allow it, you cannot break or ignore it; otherwise you lose the distribution rights.


I have a home NAS that has an SSD on it for the OS and four HDDs in RAIDZ. Does anyone know if/how I can use a small part/partition of the SSD for the cache? I don't need an entire SSD's worth of cache, and I'd rather not have to buy an extra one.


Most of the time you don't need the extra SSD cache. If you do use one for the ZIL (writes), it will only help synchronous writes, it should be redundant if possible, and it should be extremely low latency like Intel Optane. L2ARC (reads) on an SSD is not as good as just adding more RAM.


You can add partitions to a pool, either for storage, for SLOG or L2ARC.

Unless you're OK with data loss in the case of a power outage, you'll want to use a mirror for the SLOG at the very least. You can do that by making a mirror from two partitions on separate SSDs and then adding that as the SLOG. The partitions do not have to be very large, just enough to handle a minute or so of writes, so 10GB or so is often plenty.

Also keep in mind that ZFS does a lot of shuffling for the L2ARC. I had <1% L2ARC hits on my pool with a 128GB L2ARC partition, but almost a TB of writes per day to the L2ARC due to ZFS rotating data in and out of it.
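For the record, a minimal sketch of what adding those partitions looks like - the pool name "tank" and the partition paths are placeholders:

    import subprocess

    def zpool(*args):
        """Thin wrapper around the zpool CLI."""
        subprocess.run(("zpool",) + args, check=True)

    # Mirrored SLOG from two small partitions on separate SSDs:
    #   zpool add tank log mirror /dev/sda2 /dev/sdb2
    zpool("add", "tank", "log", "mirror", "/dev/sda2", "/dev/sdb2")

    # L2ARC on another partition (cache devices aren't mirrored; losing one
    # only loses cached copies, never the data itself):
    #   zpool add tank cache /dev/sda3
    zpool("add", "tank", "cache", "/dev/sda3")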


I would really advise against using the SSD for cache: you'll only wear it out, and do you have any evidence that you would benefit from it?


Probably not; since it's a home NAS, the reads won't be very repeatable. I wanted to prevent the hard disks from waking up, but it's mostly writes, so it doesn't help there either.


Add the SSD to its own pool, set up as a working share, then use a utility or script to sync contents a few times a day?


That would probably be easier, yep, thanks. It's more straightforward to just make a partition on it and rsync twice a day.


Yes, you can. It's called L2ARC.


Thanks for the tip, I'll look into it and deploy it!


> ZFS is the best file system.

Of course it isn't. There can't be a single best filesystem. For the use cases ZFS targets, yes, it is the best filesystem.



