Linus: Don't Use ZFS (realworldtech.com)
572 points by rbanffy on Jan 9, 2020 | 555 comments



Here's his reasoning:

"honestly, there is no way I can merge any of the ZFS efforts until I get an official letter from Oracle that is signed by their main legal counsel or preferably by Larry Ellison himself that says that yes, it's ok to do so and treat the end result as GPL'd.

Other people think it can be ok to merge ZFS code into the kernel and that the module interface makes it ok, and that's their decision. But considering Oracle's litigious nature, and the questions over licensing, there's no way I can feel safe in ever doing so.

And I'm not at all interested in some "ZFS shim layer" thing either that some people seem to think would isolate the two projects. That adds no value to our side, and given Oracle's interface copyright suits (see Java), I don't think it's any real licensing win either."


Btrfs crashed for me on two occasions. The last time, around 2 years back, I installed ZFS (which I have been using for ~10 years on a FreeBSD server), and it has worked like a charm since then.

I understand Linus' reasoning, but there is just no way I will install btrfs, like ever. I'd rather not update the kernel (I have ZFS on a Fedora root with regular kernel updates and scripts which verify that everything is fine with the kernel modules prior to reboot) than use a file system that crashed twice in two years.

Yes, it is very annoying if an update crashes the fs, but currently:

- in 2 years, btrfs crashed itself twice

- in the following 2 years, an update never broke zfs

As far as I am concerned, the case for zfs is clear.

This might be helpful to someone: https://www.csparks.com/BootFedoraZFS/index.md

Anyway, Linus is going too far with his GPL agenda. The MODULE_LICENSE requirement when writing kernel modules explains why hardware is less supported on Linux - instead of devs focusing on getting more support from 3rd party companies, they try to force them to go GPL. Once you set MODULE_LICENSE to non-GPL, you quickly figure out that you can't use most of the kernel calls. Not the code. The calls.
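To make that concrete, here is a minimal sketch of an out-of-tree module (the module name and messages are placeholders). The point is just where MODULE_LICENSE sits and what it gates: any string the kernel doesn't recognise as GPL-compatible (e.g. "CDDL" or "Proprietary") taints the kernel and cannot link against symbols exported with EXPORT_SYMBOL_GPL().

    /* hello_mod.c - minimal out-of-tree module sketch (names are placeholders) */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    static int __init hello_init(void)
    {
            pr_info("hello_mod loaded\n");
            return 0;
    }

    static void __exit hello_exit(void)
    {
            pr_info("hello_mod unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

    /* "GPL" resolves every exported symbol; a non-GPL-compatible string
     * (e.g. "CDDL") taints the kernel and blocks EXPORT_SYMBOL_GPL() symbols. */
    MODULE_LICENSE("GPL");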


The Linux kernel has been released under GPL2 license since day 1, and I don't think that's ever going to change. Linus is more pragmatic than many of his detractors think - he thankfully refused to migrate to GPL3 because the stricter clauses would have scared away a lot of for-profit users and contributors.

Relaxing to anything more permissive than GPL2 would instead mean the end of Linux as we know it. A more permissive license means that nothing would prevent Google or Microsoft from releasing their own closed-source Linux, or replacing the source code of most of the modules with hex blobs.

I believe that GPL2 is a good trade-off for a project like Linux, and it's good that we don't compromise on anything less than that.

Even though I agree on the superiority of ZFS for many applications, I think that the blame for the missed inclusion in the kernel is on Oracle's side. The lesson learned from NTFS should be that if a filesystem is good and people want to use it, then you should make sure that the drivers for that filesystem are as widely available as possible. If you don't do it, then someone sooner or later will reverse engineer the filesystem anyway. The success of a filesystem is measured by the number of servers that use it, not by the amount of money that you can make out of it. For once Oracle should act more like a tech company and less like a legal firm specialised in patent exploitation.


The blame is on Oracle's side for sure. No question about it.

> or replacing the source code of most of the modules with hex blobs.

OK, good point. I am no longer pissed off about MODULE_LICENSE; I hadn't even thought about that.


I agree with the stance on btrfs. Around the same time (2 years back), it crashed on me while I was trying to use it for an external hard disk attached to a Raspberry Pi. Nothing fancy. Since then, I can't tolerate fs crashes; for a user, it's supposed to be one of the most reliable layers.


Concerning the BTRFS fs:

I did use it as well many years ago (probably around 2012-2015) in a raid5 configuration, after reading a lot of positive comments about this next-gen fs => after a few weeks my raid started falling apart (while performing normal operations!) as I got all kinds of weird problems => my conclusion was that the raid was corrupt and couldn't be fixed => no big problem, as I did have a backup, but that definitely ruined my initial BTRFS experience. During those times, even though the fs was new and even though there were warnings about it (being new), everybody was very optimistic/positive about it, but in my case that experiment was a disaster.

That event has held me back until today from trying it again. I admit that today it might be a lot better than in the past, but since people were already positive about it back then (and in my case it still broke), it's difficult for me now to say "aha - now the general positive opinion is probably more realistic than in the past", due e.g. to that bug that can potentially still destroy a raid (the "write hole" bug): personally I think that if BTRFS still makes that raid functionality available while it has such a big bug, and at the same time advertises it as a great feature of the fs, the "unrealistically positive" behaviour is still present, therefore I still cannot trust it. Additionally, that bug being open since forever makes me think that it's really hard to fix, which in turn makes me think that the foundation and/or code of BTRFS is bad (which is the reason why that bug cannot be fixed quickly) and that therefore potentially in the future some even more complicated bugs might show up.

Concerning alternatives:

For a looong time I have been writing and testing a program which ends up creating a big database (using "Yandex ClickHouse" for the main DB) distributed on multiple hosts, where each one uses multiple HDDs to save the data, and which at the same time is able to fight against potential "bitrot" ( https://en.wikipedia.org/wiki/Data_degradation ) without having to resync the whole local storage each time a byte on some HDD loses its value. Excluding BTRFS, the only other candidate I found that performs checksums on data is ZFSoL (both XFS and NILFS2 do checksums, but only on metadata).

Excluding BTRFS because of the reasons mentioned above, I was left only with ZFS.

I've now been using ZFSoL for a couple of months, and so far everything has gone very well (a bit difficult to understand & deal with at the beginning, but extremely flexible), and performance is good as well (but to be fair, that's easy in combination with the ClickHouse DB, as the DB itself already writes data in a CoW way, therefore blocks of a table stored on ZFS are always very likely to be contiguous).

On one hand, technically, I'm happy now. On the other hand, I do admit that the problems around licensing and the non-integration of ZFSoL in the kernel carry risks. Unfortunately, I just don't see any alternative.

I do donate monthly to https://www.patreon.com/bcachefs but I don't have high hopes - not much is happening, and BCACHE (even though it is currently integrated in the kernel) hasn't been very good in my experience (https://github.com/akiradeveloper/dm-writeboost worked A LOT better, but I'm not using it anymore as I no longer have a use case for it, and it was a risk as well since it's not yet included in the kernel), therefore BCACHEFS might end up being the same.

Bah :(


I'd avoid making an argument for or against a filesystem on the basis of anecdotal evidence.


For your own personal use, your own personal anecdotes are really all that matter.


Your personal anecdotes are indeed all that matter when it comes to describing your past.

When it comes to predicting your future, though, your personal anecdotes may not hold up against more substantial data.


Btrfs, like OCFS, is pretty much junk. You can do everything you need to on a local disk with XFS, and if you need clever features, buy a NetApp.


Both ZFS and BTRFS are essentially Oracle now. BTRFS was an effort, largely from Oracle, to copy Sun's ZFS advantages in a crappy way, which became moot once they acquired Sun. ZFS also requires (a lot of) ECC memory for reliable operation. It's great tech; pity it's dying a slow death.


I'd argue that other file systems also require ECC RAM to maximize reliability. ZFS just makes it much more explicit in its docs and surfaces errors rather than silently handing back memory-corrupted data.


ZFS needs ECC just as much as any other file system. That is, it has no way of detecting in-memory errors. So if you want your data to actually be written correctly, it's a good idea to use ECC. But the myth that you "need" ECC with ZFS is completely wrong. It would be better if you did have ECC, but don't let that stop you from using ZFS.

As far as it needing a lot of memory, that is also not true. The ARC will use your memory if it's available, because it's available! You paid good money for it, so why not actually use it to make things faster?


I worked at Sun when ZFS was "invented", and the emphasis on a large amount of proper ECC memory was strong, especially in conjunction with Solaris Zones. I can't recall if it was 1 GB of RAM per 1 TB of storage or something similar, due to how it performed deduplication and stored indices in hot memory. And that was also the reason for insisting on ECC: to make sure you won't get your stored indices and shared blocks messed up, leading to major uncorrectable errors.


I can see how a (perhaps, less than competitive) hardware company would want you to think that :)


Sure, all about internal marketing, right? :D

But there was nothing like that on the market at that time anyway.


I have examined all the counterarguments against ZFS myself and none of them have been confirmed. ZFS is stable and not RAM-hungry, as is constantly claimed. It has sensible defaults, namely to use all RAM that is available and to release it quickly when it is needed elsewhere. ZFS on a Raspberry Pi? No problem. I myself have a dual-socket, 24-core Intel server with 128 GB RAM and a virtual Windows SQL Server instance running on it. For fun, I limited the amount of RAM for ZFS to 40 MB. Runs without problems.
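For reference, capping the ARC like that is just the zfs_arc_max module parameter on ZFS on Linux (value in bytes). A sketch - the 40 MB figure is from the experiment above, adjust to taste:

    # /etc/modprobe.d/zfs.conf - cap the ARC at module load (value in bytes)
    options zfs zfs_arc_max=41943040    # ~40 MB, as in the experiment above

    # or on a running system:
    # echo 41943040 > /sys/module/zfs/parameters/zfs_arc_max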


That's his reasoning for not merging ZFS code, not for generally avoiding ZFS.


Here are his reasons for generally avoiding ZFS from what I consider most important to least.

- The kernel team may break it at any time, and won't care if they do.

- It doesn't seem to be well-maintained.

- Performance is not that great compared to the alternatives.

- Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)


I'm baffled by such arguments.

> It doesn't seem to be well-maintained.

The last commit is from 3 hours ago: https://github.com/zfsonlinux/zfs/commits/master. They have dozens of commits per month. The last minor release, 0.8, brought significant improvements (my favorite: FS-level encryption).

Or maybe this refers to the (initial) 5.0 kernel incompatibility? That wasn't the ZFS dev team's fault.

> Performance is not that great compared to the alternatives.

There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).

> The kernel team may break it at any time, and won't care if they do.

That's true; however, the amount of breakage is no different from any other out-of-tree module, and it's unlikely to happen with a patch version of a working kernel (in fact, it happened with the 5.0 release).

> Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)

"Using" it won't open to lawsuits; ZFS has a CDDL license, which is a free and open-source software license.

The problem is (taking Ubuntu as representative) shipping the compiled module along with the kernel, which is an entirely different matter.

---

[¹] https://btrfs.wiki.kernel.org/index.php/Main_Page#Stability_...


> ZFS has a CDDL license

Java is GPLv2+CPE. That didn't stop Oracle because, as Linus pointed out in the email, Oracle regards their APIs as a separate entity to their code.


Google's Java implementation wasn't GPL-licensed, so neither its implementation nor its interface could have been covered by the OpenJDK being GPLv2. I don't think RMS would sit by idly either if someone took GCC and forked it under the Apache license.


But Google didn't fork the OpenJDK; they forked Apache Harmony, which was already Apache-licensed.

So it's not comparable with GCC, but it is comparable to forking clang and keeping clang's license. I doubt RMS would be able to say anything.


> There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).

Note that they don't mean "it's unstable," just "there are significant improvements between versions." Most importantly:

> The filesystem disk format is stable; this means it is not expected to change unless there are very strong reasons to do so. If there is a format change, filesystems which implement the previous disk format will continue to be mountable and usable by newer kernels.

...and only _new features_ are expected to stabilise:

> As with all software, newly added features may need a few releases to stabilize.

So overall, at least as far as their own claims go, this is not "heavy development" as in "don't use."


Some features such as RAID5 were still firmly in "don't use if you value your data" territory last I looked. So it is important to be informed as to what can be used and what might be more dangerous with btrfs.


Keep in mind that RAID5 isn’t feasible with multi-TB disks (the probability of failed blocks when rebuilding the array is far too high). That said, RAID6 also suffers the same write-hole problem with Btrfs. Personally I choose RAIDZ2 instead.


> Keep in mind that RAID5 isn’t feasible with multi-TB disks (the probability of failed blocks when rebuilding the array is far too high).

What makes you say that? I've seen plenty of people make this claim based on URE rates, but I've also not seen any evidence that it is a real problem for a 3-4 drive setup. Modern drives are specced at 1 URE per 10^15 bits read (or better), so less than 1 URE in 125 TB read. Even if a rebuild did fail, you could just start over from a backup. Sure, if the array is mission critical and you have the money, use something with more redundancy, but I don't think RAID5 is infeasible in general.


Last time I checked (a few years ago, I must say), a 10^15 URE rate was only for enterprise-grade drives and not for consumer-level, where most drives have a 10^14 URE rate. Which means your rebuild is almost guaranteed to fail on a large-ish raid setup. So yeah, RAID is still feasible with multi-TB disks if you have the money to buy disks with the appropriate reliability. For the common folk, raid is effectively dead with today's disk sizes.
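For what it's worth, here is the usual back-of-the-envelope model behind both of those claims. It's only a sketch: it assumes independent errors at exactly the quoted spec rates and a 4x8 TB RAID5 rebuild that reads the three surviving drives; real failure modes are messier.

    /* ure_estimate.c - rough odds of hitting a URE during a RAID5 rebuild */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double bits_read = 3 * 8e12 * 8;          /* 3 surviving 8 TB drives, in bits */
        double rates[]   = { 1e-15, 1e-14 };      /* URE per bit: enterprise vs consumer spec */

        for (int i = 0; i < 2; i++) {
            double expected = bits_read * rates[i];
            double p_any    = 1.0 - exp(-expected);   /* P(at least one URE) */
            printf("spec %.0e: expected %.2f UREs, ~%.0f%% chance of one during rebuild\n",
                   rates[i], expected, 100.0 * p_any);
        }
        return 0;   /* prints ~17% at 1e-15 and ~85% at 1e-14 */
    }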


Theoretically, if you have a good RAID5, without serious write-hole and similar issues, then it is strictly better than no RAID and worse than RAID6 and RAID1.

* All localized errors are correctable, unless they overlap on different disks or result in drive ejection. This precisely fixes the UREs of non-raid drives.

* If a complete drive fails, then you have a chance of losing some data from the UREs / localized errors. This is approximately the same as if you used no RAID.

As for URE incidence rate - people use multi-TB drives without RAID, yet data loss does not seem prevalent. I'd say it depends .. a lot.

If you use a crappy RAID5 that ejects a drive on a partial/transient/read failure, then yes, it's bad - even worse than no RAID.

That being said, I have no idea whether a good RAID5 implementation is available - one that is well interfaced with or integrated into the filesystem.


I have a couple of Seagate IronWolf drives that are rated at 1 URE per 10^15 bits read and, sure, depending on the capacity you want (basically 8 TB and smaller desktop drives are super cheap), they do cost up to 40% more than their Barracuda cousins, but we're still well within the realm of cheap SATA storage.


Manufacturer-specified UBE rates are extremely conservative. If UBE were a thing then you'd notice transient errors during ZFS scrubs, which are effectively a "rebuild" that doesn't rebuild anything.


To be sure, it's entirely feasible, just not prudent with today's typical disk capacities.


Feasible is different than possible, and carries a strong connotation of being suitable/able to be done successfully. Many things are possible, many of those things are not feasible.


Btrfs has many more problems than dataloss with RAID5.

It has terrible performance problems under many typical usage scenarios. This is a direct consequence of the choice of core on-disk data structures. There's no workaround without a complete redesign.

It can become unbalanced and cease functioning entirely. Some workloads can trigger this in a matter of hours. Unheard of for any other filesystem.

It suffers from critical dataloss bugs in setups other than RAID5. They have solved a number of these, but when reliability is its key selling point many of us have concerns that there is still a high chance that many still exist, particularly in poorly-exercised codepaths which are run in rare circumstances such as when critical faults occur.

And that's only getting started...


There are differing opinions on BTRFS's suitability in production - it's the default filesystem of SUSE on one hand; on the other, Red Hat has deprecated BTRFS support because they see it as not being production-ready, and they don't see it becoming production-ready in the near future. They also feel that the more legacy Linux filesystems have added features to compete.



But then, your personal requirements/use cases might not be the same as Facebook's. (And this does not only apply to Btrfs[1]/ZFS, it also applies to GlusterFS, use of specific hardware, ...)

[1] which I used for nearly two years on a small desktop machine on a daily basis; ended up with (minor?) errors on the file system that could not be repaired and decided to switch to ZFS. No regrets, nor similar errors since.


It's also the default file system of millions of Synology NASes running in consumer hands (although Synology shimmed in their own RAID5/6 support).


Kroger (and their subsidiaries like QFC, Fred Meyer, Fry's Marketplace, etc), Walmart, Safeway (and Albertsons/Randalls) all use Suse with BTRFS for their point of sale systems.


Synology uses standard linux md (for btrfs too). Even SHR (Synology Hybrid RAID) is just different partitions on the drive allocated to different volumes, so you can use mixed-capacity drives effectively.


Right, instead of BTRFS RAID5/6, they use Linux md raid, but I believe they have custom patches to BTRFS to "punch through" information from md, so that when BTRFS has a checksum mismatch it can use the md raid mirror disk for repair.


Check what features of BTRFS SUSE actually uses and considers supported/supportable.


bcachefs should be heavily supported - it doesn't get nearly enough for what it sets out to do: https://www.patreon.com/bcachefs


I've been looking forward to using bcachefs as I had a few bad experiences with btrfs.

Is bcachefs more-or-less ready for some use cases now? Does it still support caching layers like bcache did?


It's quite usable, but of course, do not trust it with your unique unbacked-up data yet. I use it as a main FS for a desktop workstation and I'm pretty happy with it. Waiting impatiently for EC to be implemented for efficient pooling of multiple devices.

Regarding caching: "Bcachefs allows you to specify disks (or groups thereof) to be used for three categories of I/O: foreground, background, and promote. Foreground devices accept writes, whose data is copied to background devices asynchronously, and the hot subset of which is copied to the promote devices for performance."


To my knowledge, caching layers are supported but require some setup and don't have much documentation right now.

If all you need is a simple root FS that is CoW and checksummed, bcachefs works pretty well, in my experience. I've been using it productively as a root and home FS for about two years or so.


Many of the advanced features aren't implemented yet though, like compression, encryption, snapshots, RAID5/6....


Compression and encryption have been implemented, but not snapshots and RAID5/6.


why would you want to embed raid5/6 in the filesystem layer? Linux has battle-tested mdraid for this, I'm not going to trust a new filesystem's own implementation over it.

Same for encryption, there are already existing crypto layers both on the block and filesystem (as an overlay) level.


Because the FS can be deeply integrated with the RAID implementation. With a normal RAID, if the data at some address differs between the two disks, there's no way for the fs to tell which is correct, because the RAID code essentially just picks one; it can't even see the other. With ZFS for example, there is a checksum stored with the data, so when you read, zfs will check the data on both and pick the correct one. It will also overwrite the incorrect version with the correct one, and log the error. It's the same kind of story with encryption: if it's built in, you can do things like incremental backups of an encrypted drive without ever decrypting it on the target.
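A toy sketch of that self-healing idea, with nothing ZFS-specific in it (the checksum, block size and layout are made up purely for illustration): a checksum is kept separately from the data it covers, reads are verified against it, and a bad copy is repaired from a good one.

    /* self_heal_demo.c - conceptual only, not ZFS code */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define BLK 16

    static uint32_t cksum(const uint8_t *b, size_t n)      /* toy FNV-1a checksum */
    {
        uint32_t c = 2166136261u;
        for (size_t i = 0; i < n; i++) { c ^= b[i]; c *= 16777619u; }
        return c;
    }

    int main(void)
    {
        uint8_t copy[2][BLK];                    /* two mirrored copies of one block */
        memset(copy[0], 'A', BLK);
        memset(copy[1], 'A', BLK);
        uint32_t stored = cksum(copy[0], BLK);   /* checksum kept apart from the data */

        copy[0][3] = 'X';                        /* silent corruption of the first copy */

        for (int i = 0; i < 2; i++) {            /* read path: verify, fall back, heal */
            if (cksum(copy[i], BLK) != stored) {
                printf("copy %d failed checksum, trying the other one\n", i);
                continue;
            }
            printf("copy %d verified OK\n", i);
            if (i != 0) {
                memcpy(copy[0], copy[i], BLK);   /* overwrite the bad copy */
                printf("healed copy 0 from copy %d\n", i);
            }
            return 0;
        }
        printf("both copies bad: unrecoverable read error\n");
        return 1;
    }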


> when you read, zfs will check the data on both and pick the correct one.

Are you sure about that? Always reading both doubles read I/O, and benchmarks show no such effect.

> there's no way for the fs to tell which is correct

This is not an immutable fact that precludes keeping the RAID implementation separate. If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID). I've seen it detect and fix bad blocks, many times. It works, even with separate layers.

This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH. Understanding other complex subsystems is hard, and it's more fun to write new code instead.


> If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID).

This is all great, and I assume it works great. But it is in no way generalizable to all the filesystems Linux has to support (at least at the moment). I could only see this working in a few specific instances with a particular set of FS setups. Even more complicating is the fact that most RAIDs are hardware-based, so just using ioctls to pull individual blocks wouldn't work for many (all?) drivers. Convincing everyone to switch over to software RAID would take a lot of effort.

There is a legitimate need for these types of tools in the sub-PB, non-clustered storage arena. If you're working on a sufficiently large storage system, these tools and techniques are probably par for the course. That said, I definitely have lost 100GBs of data to bit rot on a multi-PB storage system of a top-500 HPC system. (One bad byte in a compressed data file left the data after the bad byte unrecoverable.) This would not have happened on ZFS.

ZFS was/is a good effort to bring this functionality lower down the storage hierarchy. And it worked because it had knowledge about all of the storage layers. Checksumming files/chunks helps best if you know about the file system and which files are still present. And it only makes a difference if you can access the lower level storage devices to identify and fix problems.


> it is no way generalizable to all the filesystems Linux has to support

Why not? If it's a standard LVM API then it's far more general than sucking everything into one filesystem like ZFS did. Much of this block-mapping interface already exists, though I'm not sure whether it covers this specific use case.


> This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH.

At the time that ZFS was written (early 2000s) and released to the public (2006), this was not a thing and the idea was somewhat novel / 'controversial'. Jeff Bonwick, ZFS co-creator, lays out their thinking:

* https://blogs.oracle.com/bonwick/rampant-layering-violation

Remember: this was a time when Veritas Volume Manager (VxVM) and other software still ruled the enterprise world.

* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation


I debated some of this with Bonwick (and Cantrill who really had no business being involved but he's pernicious that way) at the time. That blog post is, frankly, a bit misleading. The storage "stack" isn't really a stack. It's a DAG. Multiple kinds of devices, multiple filesystems plus raw block users (yes they still exist and sometimes even have reason to), multiple kinds of functionality in between. An LVM API allows some of this to have M users above and N providers below, for M+N total connections instead of M*N. To borrow Bonwick's own condescending turn of phrase, that's math. The "telescoping" he mentions works fine when your storage stack really is a stack, which might have made sense in a not-so-open Sun context, but in the broader world where multiple options are available at every level it's still bad engineering.


> ... but in the broader world where multiple options are available at every level it's still bad engineering.

When Sun added ZFS to Solaris, they did not get rid of UFS and/or SVM, nor prevent Veritas from being installed. When FreeBSD added ZFS, they did not get rid of UFS or GEOM either.

If an admin wanted or wants (or needs) to use the 'old' way of doing things they can.


Sorry, I'm pernicious in what way, exactly?


Heh. I was wondering if you were following (perhaps participating in) this thread. "Pernicious" was perhaps a meaner word than I meant. How about "ubiquitous"?


The fact that traditionally RAID, LVM, etc. are not part of the filesystem is just an accident of history. It's just that no one wanted to rewrite their single disk filesystems now that they needed to support multiple disks. And the fact that administering storage is so uniquely hard is a direct result of that.


However it happened, modularity is still a good thing. It allows multiple filesystems (and other things that aren't quite filesystems) to take advantage of the same functionality, even concurrently, instead of each reinventing a slightly different and likely inferior wheel. It should not be abandoned lightly. Is "modularity bad" really the hill you want to defend?


> However it happened, modularity is still a good thing.

It may be a good thing, and it may not. Linux has a bajillion file systems, some more useful than others, and that is unique in some ways.

Solaris and other enterprise-y Unixes at the time only had one. Even the BSDs generally only have a few that they run on instead of ext2/3/4, XFS, ReiserFS (remember when that was going to take over?), btrfs, bcachefs, etc, etc, etc.

At most, a company may have purchased a license for Veritas:

* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation

By rolling everything together, you get ACID writes, atomic space-efficient low-overhead snapshots, storage pools, etc. All this just by removing one layer of indirection and doing some telescoping:

* https://blogs.oracle.com/bonwick/rampant-layering-violation

It's not "modularity bad", but that to achieve the same result someone would have had to write/expand a layer-to-layer API to achieve the same results, and no one did. Also, as a first-order estimate of complexity: how many lines of code (LoC) are there in mdraid/LVM/ext4 versus ZFS (or UFS+SVM on Solaris).


Other than esoteric high performance use cases, I'm not really sure why you would really need a plethora of filesystems. And the list of them that can be actually trusted is very short.


I'd like to agree, but I don't think the exceptions are all that esoteric. Like most people I'd consider XFS to be the default choice on Linux. It's a solid choice all around, and also has some features like project quota and realtime that others don't. OTOH, even in this thread there's plenty of sentiment around btrfs and bcachefs because of their own unique features (e.g. snapshots). Log-structured filesystems still have a lot of promise to do better on NVM, though that promise has been achingly slow to materialize. Most importantly, having generic functionality implemented in a generic subsystem instead of in a specific filesystem allows multiple approaches to be developed and compared on a level playing field, which is better for innovation overall. Glomming everything together stifles innovation on any specific piece, as network/peripheral-bus vendors discovered to their chagrin long ago.


>I work on one of the four or five largest storage systems in the world

What would you recommend over zfs for small-scale storage servers? XFS with mdraid?

I'd also love to hear your opinion on the Reiser5 paper.


> With a normal RAID, if the data at some address is different between the two disks, there's no way for the fs to tell which is correct, because the RAID code essentially just picks one, it can't even see the other.

That's a problem only with RAID1, only when copies=2 (granted, the most often used case), and only when the underlying device cannot report which sector has gone bad.


> why would you want to embed raid5/6 in the filesystem layer?

There are valid reasons, most having to do with filesystem usage and optimization. Off the top of my head:

- more efficient re-syncs after failure (don't need to re-sync every block, only the blocks that were in use on the failed disk)

- can reconstruct data not only on disk self-reporting, but also on filesystem metadata errors (CRC errors, inconsistent dentries)

- different RAID profiles for different parts of the filesystem (think: parity raid for large files, raid10 for database files, no raid for tmp, N raid1 copies for filesystem metadata)

and for filesystem encryption:

- CBC ciphers have a common weakness: the block size is constant. If you use FS-object encryption instead of whole-FS encryption, the block size, offset and even the encryption keys can be varied across the disk.


I think to even call volume management a "layer" as though traditional storage was designed from first principles, is a mistake.

Volume management is just a hack. We had all of these single-disk filesystems, but single disks were too small. So volume management was invented to present the illusion (in other words, lie) that they were still on single disks.

If you replace "disk" with "DIMM", it's immediately obvious that volume management is ridiculous. When you add a DIMM to a machine, it just works. There's no volume management for DIMMs.


Indeed there is no volume management for RAM. You have to reboot to rebuild the memory layout! RAM is higher in the caching hierarchy and can be rebuilt at smaller cost. You can't resize RAM while keeping data because nobody bothered to introduce volume management for RAM.

Storage is at the bottom of the caching hierarchy where people get inventive to avoid rebuilding. Rebuilding would be really costly there. Hence we use volume management to spare us the cost of rebuilding.

RAM also tends to have uniform performance. Which is not true for disk storage. So while you don't usually want to control data placement in RAM, you very much want to control what data goes on what disk. So the analogy confuses concepts rather than illuminating commonalities.


One of my old co-workers said that one of the most impressive things he's seen in his career was a traveling IBM tech demo in the back of a semi truck where they would physically remove memory, CPUs, and disks from the machine without impacting the live computation being executed apart from making it slower, and then adding those resources back to the machine and watching them get recognized and utilized again.


> why would you want to embed raid5/6 in the filesystem layer?

One of the creators of ZFS, Jeff Bonwick, explained it in 2007:

> While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.

* https://blogs.oracle.com/bonwick/rampant-layering-violation


It's not about ZFS. It's about CoW filesystems in general; since they offer functionalities beyond the FS layer, they are both filesystems and logical volume managers.


Why does ZFS do RAIDZ in the filesystem layer?


It doesn't.

RAIDZ is part of the VDEV (Virtual Device) layer. Layered on top of this is the ZIO (ZFS I/O layer). Together, these form the SPA (Storage Pool Allocator).

On top of this layer we have the ARC, L2ARC and ZIL. (Adaptive Replacement Caches and ZFS Intent Log).

Then on top of this layer we have the DMU (Data Management Unit), and then on top of that we have the DSL (Dataset and Snapshot Layer). Together, the SPA and DSL layers implement the Meta-Object Set layer, which in turn provides the Object Set layer. These implement the primitives for building a filesystem and the various file types it can store (directories, files, symlinks, devices etc.) along with the ZPL and ZAP layers (ZFS POSIX Layer and ZFS Attribute Processor), which hook into the VFS.

ZFS isn't just a filesystem. It contains as many levels of layering as (if not more than) any RAID and volume management setup composed of separate parts like mdraid+LVM or similar, but much better integrated with each other.

It can also store stuff that isn't a filesystem. ZVOLs are fixed size storage presented as block devices. You could potentially write additional storage facilities yourself as extensions, e.g. an object storage layer.


Honestly just use ZFS. We've wasted enough effort over obscure licensing minutia.


> We've wasted enough effort over obscure licensing minutia.

Which was precisely Sun/Oracle's goal when they released ZFS under the purposefully GPL-incompatible CDDL. Sun was hoping to make OpenSolaris the next Linux whilst ensuring that no code from OpenSolaris could be moved back to Linux. I can't think of another plausible reason why they would write a new open source license for their open source operating system and make such a license incompatible with the GPL.


https://en.wikipedia.org/wiki/Common_Development_and_Distrib...

Some people argue that Sun (or the Sun engineer) as creator of the license made the CDDL intentionally GPL incompatible.[13] According to Danese Cooper one of the reasons for basing the CDDL on the Mozilla license was that the Mozilla license is GPL-incompatible. Cooper stated, at the 6th annual Debian conference, that the engineers who had written the Solaris kernel requested that the license of OpenSolaris be GPL-incompatible.[18]

    Mozilla was selected partially because it is GPL incompatible. That was part
    of the design when they released OpenSolaris. ... the engineers who wrote Solaris 
    ... had some biases about how it should be released, and you have to respect that.


And the very next paragraph states:

> Simon Phipps (Sun's Chief Open Source Officer at the time), who had introduced Cooper as "the one who actually wrote the CDDL",[19] did not immediately comment, but later in the same video, he says, referring back to the license issue, "I actually disagree with Danese to some degree",[20] while describing the strong preference among the engineers who wrote the code for a BSD-like license, which was in conflict with Sun's preference for something copyleft, and that waiting for legal clearance to release some parts of the code under the then unreleased GNU GPL v3 would have taken several years, and would probably also have involved mass resignations from engineers (unhappy with either the delay, the GPL, or both—this is not clear from the video). Later, in September 2006, Phipps rejected Cooper's assertion in even stronger terms.[21]

So of the available licenses at the time, Engineering wanted BSD and Legal wanted GPLv3, so the compromise was CDDL.


Wow... talk about cutting off your nose to spite your face. Oracle ended up abandoning OpenSolaris within a year or so.

Edit: Nevermind, debunked by Bryan Cantrill. It was to allow for proprietary drivers.


Not at all really. Danese Cooper says that Cantrill is not a reliable witness and one can say he also has an agenda to distort the facts in this way [1].

[1] https://news.ycombinator.com/item?id=22008921


And Cooper's boss:

> Simon Phipps (Sun's Chief Open Source Officer at the time), who had introduced Cooper as "the one who actually wrote the CDDL",[19] did not immediately comment, but later in the same video, he says, referring back to the license issue, "I actually disagree with Danese to some degree",[20] while describing the strong preference among the engineers who wrote the code for a BSD-like license, which was in conflict with Sun's preference for something copyleft, and that waiting for legal clearance to release some parts of the code under the then unreleased GNU GPL v3 would have taken several years, and would probably also have involved mass resignations from engineers (unhappy with either the delay, the GPL, or both—this is not clear from the video). Later, in September 2006, Phipps rejected Cooper's assertion in even stronger terms.[21]

* https://en.wikipedia.org/wiki/Common_Development_and_Distrib...

So of the available licenses at the time, Engineering wanted BSD and Legal wanted (to wait for) GPLv3, so the compromise was CDDL.


There were genuine reasons for the CDDL - it wasn't an anti-gpl thing. https://www.youtube.com/watch?v=-zRN7XLCRhc&feature=youtu.be...


Danese Cooper, one of the people at Sun who helped create the CDDL, responded in the comment section of that very video:

Lovely except it really was decided to explicitly make OpenSolaris incompatible with GPL. That was one of the design points of the CDDL. I was in that room, Bryan and you were not, but I know its fun to re-write history to suit your current politics. I pleaded with Sun to use a BSD family license or the GPL itself and they would consider neither because that would have allowed D-Trace to end up in Linux. You can claim otherwise all you want...this was the truth in 2005.


This needs to be more widely known. Sun was never as open or innovative as its engineer/advertisers claim, and the revisionism is irksome. I saw what they had copied from earlier competitors like Apollo and then claimed as their own ideas. I saw the protocol fingerprinting their clients used to make non-Sun servers appear slower than they really were. They did some really good things, and they did some really awful things, but to hear proponents talk it was all sunshine and roses except for a few misguided execs. Nope. It was all up and down the organization.


The thing is - it was a time of pirates. In an environment defined by the ruthlessness of characters like Gates, Jobs, and Ellison, they were among the best-behaved of the bunch. Hence the reputation for being nice: they were markedly nicer than the hive of scum and villainy that the sector was at the time. And they did some interesting things that arguably changed the landscape (Java etc), even if they failed to fully capitalize on them.

(In many ways, it still is a time of pirates, we just moved a bit higher in the stack...)


> In an environment ... they were among the best-behaved

I wouldn't say McNealy was that different than any of those, though others like Joy and Bechtolsheim had a more salutary influence. To the extent that there was any overall difference, it seemed small. Working on protocol interop with DEC products and Sun products was no different at all. Sun went less-commodity with SPARC and SBus, they got in bed with AT&T to make their version of UNIX seem more standard than competitors' even though it was more "unique" in many ways, there were the licensing games, etc. Better than Oracle, yeah, but I wouldn't go too much further than that.


> Sun was never as open or innovative as its engineer/advertisers claim, and the revisionism is irksome.

For (the lack of) openness, I agree, but the claim that they were not innovative needs stronger evidence.


Just to be clear, I'm not saying they weren't innovative. I'm saying they weren't as innovative as they claim. Apollo, Masscomp, Pyramid, Sequent, Encore, Stellar, Ardent, Elxsi, Cydrome, and others were also innovating plenty during Sun's heyday, as were DEC and even HP. To hear ex-Sun engimarketers talk, you'd think they were the only ones. Reality is that they were in the mix. Their fleetingly greater success had more to do with making some smart (or lucky?) strategic choices than with any overall level of innovation or quality, and mistaking one for the other is a large part of why that success didn't last.


Java was pretty innovative. The world's most advanced virtual machine, a JIT that often outperforms C in long-running server scenarios, and the foundation of probably 95% of enterprise software.


ANDF had already done (or at least tried to do) the "write once, run anywhere" thing. The JVM followed in the footsteps of similar longstanding efforts at UCSD, IBM and elsewhere. There was some innovation, but "world's most advanced virtual machine" took thousands of people (many of them not at Sun) decades to achieve. Sun's contribution was primarily in popularizing these ideas. Technically, it was just one more step on an established path.


Sure, plenty of the ideas in Java were invented before - standing on the shoulders of giants and all that. The JIT came from Self, the object system from Smalltalk, but Java was the first implementation that put all of those together into a coherent platform.


Yeah, it's hard to understand this without context. Sun saw DTrace and ZFS as the differentiators of Solaris from Linux, a massive competitive advantage that they simply could not (and would not) relinquish. Open-sourcing was a tactical move; they were not going to give away their crown jewels with it.

The whole open-source steer by Sun was a very disingenuous strategy, forced by the changed landscape in order to try and salvage some semblance of relevance. Most people saw right through it, which is why Sun ended up as it did shortly thereafter: broke, acquired, and dismantled.


And Cooper's boss:

> Simon Phipps (Sun's Chief Open Source Officer at the time), who had introduced Cooper as "the one who actually wrote the CDDL",[19] did not immediately comment, but later in the same video, he says, referring back to the license issue, "I actually disagree with Danese to some degree",[20] while describing the strong preference among the engineers who wrote the code for a BSD-like license, which was in conflict with Sun's preference for something copyleft, and that waiting for legal clearance to release some parts of the code under the then unreleased GNU GPL v3 would have taken several years, and would probably also have involved mass resignations from engineers (unhappy with either the delay, the GPL, or both—this is not clear from the video). Later, in September 2006, Phipps rejected Cooper's assertion in even stronger terms.[21]

So of the available licenses at the time, Engineering wanted BSD and Legal wanted GPLv3, so the compromise was CDDL.


I stand corrected!


I don't think something that is the subject of an ongoing multi-billion-dollar lawsuit can rightly be called "obscure licensing minutia." It is high-profile and its actual effects have proven pretty significant.


> Honestly just use ZFS. We've wasted enough effort over obscure licensing minutia.

I am willing to bet that Google had the same thought. And I am also willing to bet that Google is regretting that thought now.


It's not just licensing. ZFS has some deep-rooted flaws that can only be solved by block pointer rewrite, something that has an ETA of "maybe eventually".


Care to elaborate?


You can't make a copy-on-write copy of a file. You can't deduplicate existing files, or existing snapshots. You can't defragment. You can't remove devices from a pool.

That last one is likely to get some kind of hacky workaround. But nobody wants to do the invasive changes necessary for actual BPR to enable that entire list.


Wow. As a casual user - someone who at one point was trying to choose between RAID, LVM and ZFS for an old NAS - some of those limitations of ZFS seem pretty basic. I would have taken it for granted that I could remove a device from a pool or defragment.


> There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).

Unless you are living in 2012 on a RHEL/CentOS 6/7 machine, btrfs has been stable for a long time now. I have been using btrfs as the sole filesystem on my laptop in standard mode, on my desktop as RAID0, and on my NAS as RAID1 for more than two years. I have experienced absolutely zero data loss. In fact, btrfs recovered my laptop and desktop from broken package updates many times.

You might have had some issues when you tried btrfs on distros like RHEL that did not backport the patches to their stable versions because they don't support btrfs commercially. Try something like openSUSE, which backports btrfs patches to stable versions, or use something like Arch.

> That's true; however, the amount of breakage is no different from any other out-of-tree module, and it's unlikely to happen with a patch version of a working kernel (in fact, it happened with the 5.0 release).

This is a filesystem we are talking about. Under no circumstances will any self-respecting sysadmin use a file system that has even a small chance of breaking with a system update.


I also used btrfs not too long ago in RAID1. I had a disk failure and voila, the array would be read-only from then on, and I would have to recreate it from scratch and copy the data over. I even utilized the different data recovery methods (at some point the array would not be mountable no matter what), and in the end that resulted in around 5% of the data being corrupt. I won't rule out my own stupidity in the recovery steps, but after this, and the two other times when my RAID1 array went read-only _again_, I just can't trust btrfs for anything other than single-device DUP mode operation.

Meanwhile ZFS has survived disk failures, removing 2 disks from an 8 disk RAIDZ3 array and then putting them back, random SATA interface connection issues that were resolved by reseating the HDD, and will probably survive anything else that I throw at it.


I believe he's referring to the raid 5/6 issues


The RAID 5/6 issue is the write hole, which is common to all software RAID 5/6 implementations. If it is a problem for you, use either a BBU or a UPS.

RAIDZ/Z2 avoids the issue by having slightly different semantics. That's why it is Z/Z2, not 5/6.


A former employer was threatened by Oracle because some downloads for the (only free for noncommercial use) VirtualBox Extension Pack came from an IP block owned by the organization. Home users are probably safe, but Oracle's harassment engine has incredible reach.


My employer straight up banned the use of VirtualBox entirely _just in case_. They'd rather pay for VMWare Fusion licenses than deal with any potential crap from Oracle.


Anecdotal, but VirtualBox has always been a bit flaky for me.

VMWare Fusion, on the other hand, powers the desktop environment I've used as a daily work machine for the last 6 months, and I've had absolutely zero problems other than trackpad scrolling getting emulated as mouse wheel events (making pixel-perfect scroll impossible).

Despite that one annoyance, it's definitely worth paying for if you're using it for any serious or professional purpose.


On the other hand, the VMWare Fusion kernel extension is the only culprit for the kernel panics I've seen on a Mac.


This is throwing the baby out with the bathwater.

VirtualBox itself is GPL. There is no lawsuit risk.

What requires "commercial considerations" is the extension pack.

The extension pack is required for:

> USB 2.0 and USB 3.0 devices, VirtualBox RDP, disk encryption, NVMe and PXE boot for Intel cards

If licensing needs to be considered (ie. in a corporate environment), but one doesn't need the functionalities above, then there's no issue.


> This is throwing the baby along with the bathwater.

It might be, but let's just say that Oracle aren't big fans of $WORK, and our founders are big fans of them. Thus our legal department are rather tetchy about anything that could give them even the slightest chance of doing anything.

> What requires "commercial considerations" is the extension pack.

And our legal department are nervous about that being installed, even by accident, so they prefer to minimise the possibility.


Well ... that sounds initially unreasonable, but then if I think about it a bit more I'm not sure how you'd actually enforce a non-commercial use only license without some basic heuristic like "companies are commercial".

Is the expectation here that firms offering software under non-commercial-use-is-free licenses just run it entirely on the honour system? And isn't it true that many firms use unlicensed software, hence the need for audits?


IIRC VirtualBox offers to download the Extension Pack without stating it's not free for commercial use. There isn't even a link to the EULA in the download dialog as far as I can tell (from Google Images, at least). Conversely, VirtualBox itself is free for commercial use. Feels more like a honeypot than license auditing.

They can also apply stronger heuristics, like popping up a dialogue box if the computer is centrally-managed (e.g.: Mac MDM, Windows domain, Windows Pro/Enterprise, etc.).


Wait is this the pack that gets screen resizing and copy/paste working?


You're thinking of the Guest Additions which is part of the base Virtualbox package and free for commercial use.

The (commercially licensed) Extension Pack provides "Support for USB 2.0 and USB 3.0 devices, VirtualBox RDP, disk encryption, NVMe and PXE boot for Intel cards"[1] and some other functionality, e.g. webcam passthrough [2]. There may be additional functionality enabled by the Extension Pack that I cannot find at a glance, but those are the main things.

[1] https://www.virtualbox.org/wiki/Downloads [2] https://www.virtualbox.org/manual/ch01.html#intro-installing


A tad offtopic, but on my 2017 Macbook Pro the "pack" was called VMWare Fusion.

With my MBP as host and Ubuntu as guest, I found that VirtualBox (with and without guest extensions installed) had a lot of graphical performance issues that Fusion did not.


They harass universities about it too. Which is ludicrous, because universities often have residence halls, and people who live there often download VirtualBox extensions.


Their PUEL license even has a grant specifically for educational use.


It does, but it's not 100% clear if administrative employees of universities count as educational. Sure, if you are teaching a class with it, go for it; but running a VM in it for the university accounting office is not as clear.


Education might not be the same as research in this license's terms. And there are even software vendors picking nits about whether writing a thesis is research or education, depending on their mood and purse fill level...


> There is no conceivable reason that Oracle would want to threaten me with a lawsuit.

I don't think it has to be conceivable with Oracle...

Unfortunately I have to agree with Linus on this one. Messing with Oracle's stuff is dangerous if you can't afford a comparable legal team.


"Oracle's stuff" can most often be described more accurately as "what Oracle considers its stuff".


Linus is distributing the kernel, a very different beast from using a kernel module. I can't imagine Oracle targeting someone for using ZFS on Linux without first establishing that the distribution of ZFS on Linux is illegal.


> there is no conceivable reason that Oracle would want to threaten me with a lawsuit.

Money. Anecdotally that's the primary reason Oracle do anything.


If anyone thinks this is hyperbole :

I worked for a tiny startup (>2 devs full time) where Oracle tried to extract money from us because we used MariaDB on AWS.

If you think this sounds ridiculous you probably got it right.

(Why? Because someone inexperienced with Oracle had filled out the form while downloading the mySQL client.)


Re-reading my comment in daylight, I realize I got one detail almost exactly wrong: we were always <= 2 developers, but it seems everyone understood the point anyway - we were tiny, but not too tiny for Oracle's licensing department.


Well... Serves you about right for choosing MySQL over PostgreSQL :)


In my defense it wasn't my choice ;-)


"there is no conceivable reason that Oracle would want to threaten me with a lawsuit."

Don't be so sure about this.


None of these are good reasons to purposely hinder the optional use of ZFS as a third party module by users, which is what Linux is doing.


Can you expand? I'm no expert - I use Linux daily but have always just used the distro default file system. Linus' reasons for not integrating seem pretty sensible to me. Oracle certainly has form on the litigation front.


Linus' reasons for not integrating ZFS are absolutely valid, and there's no doubt that ZFS can never be included in the mainline kernel. There's absolutely no debate there.

However the person he is replying to was not actually asking to have ZFS included in the mainline kernel. As noted above, that could never happen, and I believe that Linus is only bringing it up to deflect from the real issue. What they were actually asking is for Linux to revert a change that was made for no other reason than to hinder the use of ZFS.

Linux includes a system which restricts what APIs are available to each module based on the license of the module. GPL modules get the full set of APIs whereas non-GPL modules get a reduced set. This is done strictly for political reasons and has no known legal basis as far as I'm aware.

Not too long ago a change was made to reduce the visibility of a certain API required by ZFS so only GPL modules could use it. It's not clear why the change was made, but it was certainly not to improve the functionality of the kernel in any way. So the only plausible explanation to me is that it was done just to hinder the use of ZFS with Linux, which has been a hot political issue for some time now.


If I remember correctly, the reasoning for the GPL module stuff was/is that if kernel modules integrate deeply with the kernel, they fall under the GPL. So the GPL flag is basically a guideline of what kernel developers believe is safe to use from non-GPL-compatible modules.


But from what I can see, marking the "save SIMD registers" function as GPL is a blatant lie by a kernel developer that wanted to spite certain modules.

Saving and restoring registers is an astoundingly generic function. If you list all the kernel exports and sort by how much they make your work derivative, it should be near the very bottom.


You are not supposed to use FP/SSE in kernel mode.

It was always frowned upon:

> In other words: it's still very much a special case, and if the question was "can I just use FP in the kernel" then the answer is still a resounding NO, since other architectures may not support it AT ALL.

> Linus Torvalds, 2003

and these specific functions, that were marked as GPL were already deprecated for well over a decade.


> You are not supposed to use FP/SSE in kernel mode.

> It was always frowned upon

Whether it's frowned upon is a completely different issue from whether it intertwines your data so deeply with the kernel that it makes your code a derivative work subject to the GPL license. Which it doesn't.

> if the question was "can I just use FP in the kernel" then the answer is still a resounding NO, since other architectures may not support it AT ALL.

It's not actually using floating point, it's using faster instructions for integer math, and it has a perfectly viable fallback for architectures that don't have those instructions. But why use the slower version when there's no real reason to?

> and these specific functions, that were marked as GPL were already deprecated for well over a decade.

But the GPL export is still there, isn't it? It's not that functionality is being removed, it's that functionality is being shifted to only have a GPL export with no license-based justification for doing so.
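
For context, the usual in-kernel pattern for that kind of thing looks roughly like this on x86: bracket the vectorised path with kernel_fpu_begin()/kernel_fpu_end() so the SIMD register state gets saved and restored, and keep a plain-C fallback for contexts (or architectures) where that isn't possible. checksum_simd here is a hypothetical helper, not a real kernel function; this is just a sketch of the pattern being discussed:

    #include <linux/types.h>
    #include <asm/fpu/api.h>  /* irq_fpu_usable(), kernel_fpu_begin/end() on x86 */

    /* Plain integer fallback - works everywhere, no special registers needed. */
    static u32 checksum_scalar(const void *buf, size_t len)
    {
            const u8 *p = buf;
            u32 sum = 0;

            while (len--)
                    sum += *p++;
            return sum;
    }

    /* Hypothetical vectorised version (SSE/AVX inner loop) defined elsewhere. */
    u32 checksum_simd(const void *buf, size_t len);

    static u32 my_checksum(const void *buf, size_t len)
    {
            u32 sum;

            if (irq_fpu_usable()) {
                    kernel_fpu_begin();              /* save SIMD/FPU register state */
                    sum = checksum_simd(buf, len);   /* fast vector path             */
                    kernel_fpu_end();                /* restore register state       */
            } else {
                    sum = checksum_scalar(buf, len); /* plain C path                 */
            }
            return sum;
    }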


So what meets the criteria of being a "special case" and what doesn't? One of the examples that Linus gives is RAID checksumming. How come RAID checksumming is a special case but ZFS checksumming isn't? I don't think it has anything to do with the nature of the usage, the only problem is that the user is ZFS.


RAID checksumming is in the kernel, and when Linus says jump, the RAID folks ask back how high.

He is not going to ask people outside the kernel whether he is allowed to change something that may break their module. On the contrary, they must live with any breakage that is thrown at them.

Again, that symbol was deprecated for well over a decade. How long does it take to be allowed to remove it?


Sometimes in life we do things even though we are not explicitly obligated to do them. Nobody is asking for ZFS to get explicitly maintained support in the Linux kernel. They are simply asking for this one small inconsequential change to be reverted just this one time, since it would literally be no harm to the kernel developers to do so, and it would provide substantial benefits to any user wanting to use ZFS. Furthermore the amount of time that kernel developers have spent arguing in favour of this change has been significantly greater than the time it would have taken to just revert it.

> Again, that symbol was deprecated for well over a decade.

But not the GPL equivalent of the symbol. That symbol is not deprecated.


This is the commonly recited argument but I don't believe it was ever proven to be legally necessary. Furthermore, even if it was, it's not clear what level of integration is "too deep". So in practice, it's just a way for kernel developers to add political restrictions as they see fit.


Proven legally necessary, as in, a court telling them to stop doing something? I'm pretty sure they don't want it to get to that point.


Proven legally necessary, as in, a court ever telling anyone in that situation to stop doing it. Or even to start doing it in the first place. There's just no legal justification behind it whatsoever.


"Proven" is a maybe impossible standard: Kernel devs hint at the GPLonly exports having been useful in certain cases they prefer not to discuss on a ML. https://lore.kernel.org/lkml/20190110131132.GC20217@kroah.co...

One can interpret this as something legally significant, or an embarrassing private anecdote, or nothing substantial at all, maybe even just talk. However, I'd give them the benefit of the doubt. Not the least since they could be the ones against Oracle's legal dept...


What he is referring to is the use of the GPL export restriction to strong-arm companies into releasing their code as GPL. It's nothing to do with a legal requirement, he is just an open source licensing hardhead. See: https://lwn.net/Articles/603145/


Surely the kernel developers can do whatever the hell they like.

If you don’t like that don’t use it.


>This is done strictly for political reasons and has no known legal basis as far as I'm aware.

Let me stop you right there. This being Oracle, with its litigious nature, how can you truly be aware or sure?

Linus is literally saying there is a legal basis.


> This being "Oracle," and its litigious nature, how can you truly be aware or sure?

The functionality I'm describing has absolutely nothing to do with ZFS or Oracle in any way. If you really think the reach of Oracle is so great, then why not block all Oracle code from ever running on the OS? That seems to me to be just as justified as this change.


> why not block all Oracle code from ever running on the OS?

...to be fair, I would probably run that module.


I think that it would be a mandated module by many companies.


Oracle sued Google for copying the names of its functions.


And I believe Oracle copied the Amazon S3 API.

I can't make an informed opinion, but my uninformed gut feeling is that Oracle has done exactly what they are suing Google for having done.


This wasn't a case of "purposely hindering"; rather, the ZFS module broke because of some kernel changes. The kernel is careful to never break userspace and never break its own merged modules. But if you're a third-party module then you're on your own. The kernel developers can't be responsible for maintaining compatibility with your stuff.


The changes conveniently accomplished nothing except for breaking ZFS. Furthermore, just because they don't officially support ZFS doesn't mean they must stonewall all the users who desire the improved compatibility. Reverting this small change would not be a declaration that ZFS is officially supported.


> - Performance is not that great compared to the alternatives.

CoW filesystems do trade performance for data safety. Or did you mean there are other _stable/production_ CoW filesystems with better performance? If so, please do point them out!


XFS on LVM thin pool LV. Stable and performant as far as I can tell.


My terrible experiences with thin pools make me see btrfs as a pool of wonderful, trouble-free and perfect code.

Just ask yourself: what happens when a thin pool runs out of actual, physical disk blocks?


Isn't this a problem for any over-provisioned storage pool? You can avoid that if you want by not over-provisioning and checking the space consumed by CoW snapshots. Also, what does ZFS do if you run out of blocks?

I have actually managed to run out of blocks on XFS on a thin LV and it's an interesting experience. XFS always survived just fine, but some files basically vanished. Looks like mostly those that were open and being written to at exhaustion time, like for example a mariadb database backing store. Files that were just sitting there were perfectly fine as far as I could tell.

Still, you definitely should never put data on a volume where a pool can be exhausted, without a backup as I don't think there is really a bulletproof way for a filesystem to handle that happening suddenly.


>Isn't this a problem for any over-provisioned storage pool?

ZFS doesn't over-provision anything by default. The only case I'm aware of where you can over-provision with ZFS is when you explicitly choose to thin provision zvols (virtual block devices with a fixed size). This can't be done with regular file systems which grow as needed, though you can reserve space for them.

File systems do handle running out of space (for a loose definition of handle) but they never expect the underlying block device to run out of space, which is what happens with over-provisioning. That's a problem common to any volume manager that allows you to over provision.


Can't you over-provision even just by creating too many snapshots? Even if you never make the filesystems bigger than the backing pool, the snapshots will allocate some blocks from the pool and over time, boom.


Snapshots can't cause over-provisioning, not for file systems. If I mutate my data and keep snapshots forever, eventually my pool will run out of free space. But that's not a problem of over-provisioning, that's just running out of space.

With ZFS, if I take a snapshot and then delete 10GB of data my file system will appear to have shrunk by 10GB. If I compare the output of df before and after deleting the data, df will tell me that "size" and "used" have decreased by 10GB while "available" remained constant. Once the snapshot is deleted that 10GB will be made available again and the "size" and "available" columns in df will increase. It avoids over-provisioning by never promising more available space than it can guarantee you're able to write.

I think you're trying to relate ZFS too much to how LVM works, where LVM is just a volume manager that exposes virtual devices. The analogue to thin provisioned LVM volumes is thin-provisioned zvols, not regular ZFS file systems. I can choose to use ZFS in place of LVM as a volume manager with XFS as my file system. Over-provisioned zvols+XFS will have functionally equivalent problems as over-provisioned LVM+XFS.


ZFS doesn't work this way. The free blocks in the ZFS pool are available to all datasets (filesystems). The datasets themselves don't take up any space up front until you add data to them. Snapshots don't take up any space initially. They only take up space when the original dataset is modified, and altered blocks are moved onto a "deadlist". Since the modification allocates new blocks, if the pool runs out of space it will simply return ENOSPC at some point. There's no possibility of over-provisioning.

ZFS has quotas and reservations. The former is a maximum allocation for a dataset. The latter is a minimum guaranteed allocation. Neither actually allocate blocks from the pool. These don't relate in any comparable way to how LVM works. They are just numbers to check when allocating blocks.


LVM thin pools had (maybe still have - I haven't used them recently) another issue though, where running out of metadata space caused the volumes in the thinpool to become corrupt and unreadable.


ZFS does overprovision all filesystems in a zpool by default. Create 10 new filesystems and 'df' will now display 10x the space of the parent fs. A full fs is handled differently than your volume manager running out of blocks. But the normal case is overprovisioning.


That's not really overprovisioning. That's just a factor of the space belonging to a zpool, but 'df' not really having a sensible way of representing that.


That is not over-provisioning, it's just that 'df' doesn't have the concept of pooled storage. With pools it's possible for different file systems to share their "available" space. BTRFS also has its own problems with output when using df, and gives strange results.

If I have a 10GB pool and I create 10 empty file systems, the sizes reported in df will total 100GB. It's not quite a lie either, because each of those 10 file systems does in fact have 10GB of space available; I could write 10GB to any one of them. If I write 1GB to one of those file systems, the "size" and "available" values for the other nine will all shrink despite not having a single byte of data written to them.

With ZFS and df the "size" column is really only measuring the maximum possible size (at this point in time, assuming nothing else is written) so it isn't very meaningful, but the "used" and "available" columns do measure something useful.


This is exactly what overprovisioning is: The sum of possible future allocations is greater than available space.


In my example the sum of possible future allocations for ZFS is still only 10GB total. Each of the ten file systems, considered individually, does truthfully have 10GB available to it before any data is written. The difference is that with over-provisioning (like LVM+XFS), if I write 10GB of data to one file system the others will still report 10GB of free space, but with ZFS or BTRFS they'll report 0GB available, so I can never actually attempt to allocate 100GB of data.

You could build a pool-aware version of DF that reflects this, by grouping file systems in a pool together and reporting that the pool has 10GB available. But frankly there's not enough benefit to doing that because people with storage pools already understand summing up all the available space from df's output is not meaningful. Tools like zpool list and BTRFS's df equivalent already correctly report the total free space in the pool.


XFS is not copy on write.


XFS has supported reflinks for some time already; it's just that the deduplication is kind of experimental.

Supporting reflinks is actually more than can be said about ZoL (see zfsonlinux#405).


I think that you're both right - under normal conditions XFS is not CoW, but when using the reflink option it does use CoW => kind of a mix.


>- Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)

No. Distributing it (i.e. a precompiled distro with ZFS) will. You are free to run any software on your machine as you so desire.


This reminds me of the adaptation of a Churchill quote that "ZFS is the worst of the file systems, except for all others."


The problem with ZFS is that it isn't part of Linux kernel.

Linux project maintains compatibility with userspace software but it does not maintain compatibility with 3rd party modules and for a good reason.

Since modules have access to any internal kernel API, it is not possible to change anything within the kernel without considering 3rd party code, if you want to keep that code working.

For this reason the decision was made that if you want your module to keep working you need to make it part of the Linux kernel, and then if anybody refactors anything they need to consider the modules that would be affected by the change.

Not allowing the module to be part of the kernel is a disservice to your user base. While there are modules like that that are maintained moderately successfully (Nvidia, vmware, etc.) this is all at the cost of the user and userspace maintainers who have to deal with it.


It isn't just ZFS. All sorts of drivers get broken because Linux refuses to offer a stable API, saying your code should be in the kernel, but also often refuses to accept drivers into the kernel, even open-source code with no particular quality issues (e.g. quickcam, reiserfsv4).

Use FreeBSD where there's a stable ABI and you don't have these problems.


Plenty of drivers get rejected because the kernel developers have no confidence that they will be maintained going forward, which would mean the driver would be removed fairly quickly again.


FreeBSD does not really have a stable ABI; every major release breaks the ABI, so it's only stable for 2 years.

https://wiki.freebsd.org/VendorInformation


Stable for each major and minor release is still a vast step up on Linux.

Having a stable ABI for two years is vastly easier to support than an ABI which changes every two weeks. This is reflected by the number of binary modules which are packaged for FreeBSD in the ports tree and provided by third-party vendors. This stability makes it possible to provide proper support for a reasonable timeframe, and vendors are doing so.


Honestly, I don't like binary modules and I am happy with a policy that lets me have a functional operating system with modern hardware and source code that I have access to (well... except the firmware, which even Linux can't do anything about until open-source hardware projects get more traction).

It is enough that almost all devices around me have a bunch of running code that I have absolutely no control over. I need at least one computer I can trust to do MY bidding.


The problem I have with this is that Linux shoots itself in the foot here. It's conflating two different problems: (1) supporting third-party modules and (2) supporting proprietary modules. All modules are ultimately binary; only a small subset are both proprietary and binary-only.

If you look at FreeBSD, the majority of third-party modules are free software. It's stuff like graphics drivers, newer ZFS modules, esoteric HBAs etc. Proprietary modules, like nVidia's graphics driver, are the minority.

I can see and understand why things are the way they are, and indeed I agreed with the approach for many years. Today, I see it being as short sighted as the GCC vs LLVM approach to modular architecture.

Linux is nearly 30 years old now. To not have stable internal interfaces seems to me to be indicative of either bad initial design or ill discipline on the part of its maintainers. Every other major kernel seems to manage to have a stable ABI for third-party functionality, and Linux is an outlier in its approach. Having to upgrade the kernel for a new GPU driver is painful. Not only do I have to wait for a new kernel release, I have to hope that none of the other changes in that release cause breakage or change the behaviour in unexpected ways. Upgrading a third-party module is much less risky.


I don't see how it shoots itself in the foot, given that these rules have been in place basically forever and it is currently the most popular open source operating system by a huge margin.


Well, I left Linux in part because a lot of my hardware stopped working - FreeBSD probably has a fraction of the developers that Linux does, yet I actually have more faith in its hardware support because of this issue. YMMV I guess.


Parent updated their post and my comment is no longer relevant.


I don't see how it's an insult to the users. It's saying that not allowing ZFS code to be distributed under the GPL and be maintained as part of the Linux kernel, is a disservice to ZFSonLinux users. Which I think is clearly right.


I edited it out before I saw your comment.


And he was doing fine up to that point. For IMO good reasons, ZFS will likely never be merged into Linux. And filesystem kernel modules from third parties have a pretty long history of breakage issues going back to some older Unixes.

That's going to be plenty of reason not to use ZFS for most people. The licensing by itself is also certainly a showstopper for many.

But I'm not sure his other comments are really fair and, had Oracle relicensed ZFS n years back, ZFS would almost certainly be shipping with Linux, whether or not as the typical default I can't say. It certainly wasn't just a buzzword and there were a number of interesting aspects to its approach.


Well, he says

> It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me.

So presumably the licensing problem mentioned by your parent's comment is weighing heavily here. I think this "don't use ZFS" statement is most accurately targeted at distro maintainers. Anyone not actually redistributing Linux and ZFS in a way that would (maybe) violate the GPL is not at any risk. That means even large enterprises can get away with using ZoL.


It's exactly that, when combined with the longstanding practice of maintaining compatibility with userspace, but reserving the right to refactor kernel-space code whenever and wherever needed. If ZFS-on-linux breaks in a subtle or obvious way due to a change in linux, he can't afford to care about that - keeping the linux kernel codebase sane while adding new features, supported hardware, optimizations, and fixes at an honestly scary rate, is not that easy.

See also https://www.kernel.org/doc/html/latest/process/stable-api-no...

(fuse is a stable user-space API if you want one ... it won't have the same performance and capabilities of course ...)


> he can't afford to care about that - keeping the linux kernel codebase sane while adding new features, supported hardware, optimizations, and fixes at an honestly scary rate, is not that easy.

Maybe, but the complaints seem to be more related to the (problematic) changes not being of a technical nature that accidentally broke ZFS, but being more of a political nature. With speculation that it might have been meant to _intentionally_ break ZFS and then pretend it was an accident, because ZFS isn't (and can never be) maintained in tree. Basically along the lines of "we don't like out-of-tree kernel modules so we make life hard for them". No idea if this is actually the case or people are just spinning things together. Even if it is the case I'm not sure what I should think about it, because it's at least partially somewhat understandable.


Linus is rather tolerant (or apathetic) about non-GPL modules, but what he doesn't care to do is ensure that there is an appropriate set of non-GPL-marked exports available for external modules. If some other developer happens to mark some export GPL and it happens to be one key export needed by a non-GPL external module, Linus doesn't care, because he doesn't care about external modules.

This has come up many times in the past. Keep in mind that linux has always been GPLv2-only, it is not LGPL or anything like that.

https://lwn.net/Articles/769471/

https://lwn.net/Articles/603131/

https://lkml.org/lkml/2012/2/7/451


"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me."

When he says that, I think of the $500 million Sun spent on advertising Java.


Sun isn't going to sue anyone into oblivion any time soon, but Oracle sure will


Sun is all but defunct; I don't think I would characterize it as a subsidiary of Oracle.


That's kinda nonsensical IMO. If Oracle, the parent company, is trigger-happy, there are no guarantees they won't go deeper to protect their child companies' IP if they feel it's being infringed.


I was thinking more of "the buzzword" bit, and how it got to be such a well known technology.


Well he had this:

> as far as I can tell, it has no real maintenance behind it either any more

Which simply isn't true. They just released a new ZFS version with encryption built in (no more ZFS + LUKS) and they removed the SPL dependency (which didn't support Linux 5.0+ anyway).

I use ZFS on my Linux machines for my storage and I've been rather happy with it.


Same, for at least 6 years in a 4-drive raidz array. It always reads and writes at full gigabit ethernet speeds and I haven't had any downtime other than maintaining FreeBSD updates, which are trivial even when going from 10.x to 11 to 12.


"Same" for the last ~4 years, starting with 8 disks and as of 2018, the 24-bay enclosure is full. Each vdev is a mirrored pair split across HBAs to sedate my paranoia. I've replaced a few drives after watching unreadable sector count slowly increase over a few months. I've also switched out most of the original 3TB pairs to 8TB and 10TB pairs. ~42TB usable and the box only has 16GB of RAM (because I can't get the used 32GB sticks to work, it's a picky mainboard and difficult to find matching ECC memory here in Europe). I haven't powered down much except to attempt to replace the RAM or during extremely hot days. Read/write speed is more or less max gigabit, even during rebuild after hot-swapping drives.


Same here (4-drive raidz for many years), though I do have an issue where deleting large files (~1 GB) takes around a minute and nobody seems to know why (I have plenty of free space and RAM)...


do you have lots of snapshots? every snapshotting FS I've worked with has really slow deletes, especially when the volume is near capacity.


Snapshots are one thing ZFS is fast at. All the blocks for a given snapshot are placed on a "deadlist". Snapshot deletion is essentially just returning this list of blocks back to the free pool. A terabyte snapshot will take a short while (in the background) to recycle those blocks. But the deletion itself is near instantaneous.


I think you misunderstand: file deletions are what is slow (I don't use ZFS, my reference is WAFL, but my understanding is that all snapshotting file systems have this problem).


Even this should have minimal overhead. If the file is present in the snapshot, then it's simply moving the blocks over to the deadlist which is a very cheap operation. If it's not in the snapshot then the blocks will get recycled in the background. In both cases you should have the unlink complete almost immediately.

All of the snapshot functionality is based upon simple transaction number comparisons plus the deadlist of blocks owned by the snapshot. Only the recycling of blocks should have a bit of overhead, and that's done by a background worker--you see the free space increase for a few minutes after a gargantuan snapshot or dataset deletion, but the actual deletion completed immediately.


I've been promised many things by vendors and they always fall back to "hey! look! cool CS file system theory". I test my systems carefully and report the results back; they often don't agree.

I should point out again that I don't have enough direct experience with ZFS to say if this is the case, my experience was with an enterprise NetApp server at a large company that was filling the disk up (>95%) in addition to doing hourly snapshots.


I have 400 in total, though none on the slow volume :/ That shouldn't affect it, right?


A single 5400 rpm drive (the likes of a WD Red) should be able to saturate gigabit ethernet. A 4-drive array should be basically idling.


Relevant bits:

"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me.

The benchmarks I've seen do not make ZFS look all that great. And as far as I can tell, it has no real maintenance behind it either any more, so from a long-term stability standpoint, why would you ever want to use it in the first place?"


> The benchmarks I've seen do not make ZFS look all that great.

The thing about ZFS that actually appeals to me is how much error-checking it does. Checksums/hashes are kept of both data and metadata, and those checksums are regularly checked to detect and fix corruption. As far as I know, it (and filesystems with similar architectures) are the only ones that can actually protect against bit rot.

https://github.com/zfsonlinux/zfs/wiki/Checksums

> And as far as I can tell, it has no real maintenance behind it either any more, so from a long-term stability standpoint, why would you ever want to use it in the first place?

It has as much maintenance as any open source project: http://open-zfs.org/. IIRC, it has more development momentum behind it than the competing btrfs project.


> those checksums are regularly checked to detect and fix corruption.

I don't believe that's true. They are checked on access, but if left alone, nothing will verify them. From what I've read, you need to setup a cron job that runs scrubbing on some regular schedule.


Yes. Those cron jobs are installed by default by all major vendors that supply/support ZFS.


The setup instructions for ZFS always include the "how to setup regular scrubs" step.


Linus is just wrong as far as maintenance, as a look at the linux-zfs lists would show.

From my perspective, it has no real competitor under Linux, which is why I use it. I don't consider btrfs mature enough for critical data. (Others can reasonably disagree, I have intentionally high standards for data durability.)

Aside from legal issues, he's talking out of his ass.


I don't care about my data, so I use ext4, and like most non-ZFS peasants I lose files every other day.


Bitrot is a real thing and deduplication is actually very useful for many usecases, so your sarcasm is ill-advised. ZFS has legitimate useful features that ext4 does not.


Not sure where that belief comes from. But it might be that many benchmarks are naive and compare it against other filesystems in single-disc setups with zero tuning. Since its metadata overheads are higher, it's definitely slower in this scenario. However, put a pool onto an array of discs and tune it a little, and the performance scales up and up, leaving all Linux-native filesystems, and LVM/dm/mdraid, well behind. It's a shame that Linux has nothing compelling to compete with this.


Last time I used ZFS, write performance was terrible compared to an ordinary RAID5. IIRC writes in a raidz are always limited to a single disk's performance. The only way to get better write speed is to combine multiple raidzs - which means you need a boatload of disks.


We had a bunch of Thumpers (SunFire X4200) with 48 disks at work, running ZFS on Solaris. It was dog slow and awful, tuning performance was complicated and took ages. One had to use just the right disks in just the right order in RaidZs with striping over them. Swap in a hotspare: things slow to a crawl (i.e. not even Gbit/s).

After EoL a colleague installed Linux with dmraid, LVM and xfs on the same hardware: much faster, more robust. Sorry, don't have numbers around anymore, stuff has been trashed since.

Oh, and btw., snapshots and larger numbers of filesystems (which Sun recommended instead of the missing Quota support) also slow things down to a crawl. ZFS is nice on paper and maybe nice to play with. Definitely simpler to use than anything else. But performance-wise it sucked big time, at least on Solaris.


ZFS, on Solaris, not robust?

ZFS for “play”?!

This... is just plain uninformed.

Not just me and my employer, but many (many) others rely on ZFS for critical production storage, and have done so for many years.

It’s actually very robust on Linux as well - the fact that FreeBSD has started to use the ZoL code base is quite telling.

Would FreeBSD also be in the “play” and “not robust” category, hanging out together with Solaris?

Will it perform better than all in terms of writes/s? Most likely not - although by staying away from de-dup, having enough RAM and adhering to the pretty much general recommendation to use only mirror vdevs in your pools, it can be competitive.

Something solid with data integrity guarantees? You can’t beat ZFS, imo.


> Something solid with data integrity guarantees? You can’t beat ZFS, imo.

This reminds me. We had one file server used mostly for package installs that used ZFS for storage. One day our java package stops installing. The package had become corrupt. So I force a manual ZFS scrub. No dice. Ok fine I’ll just replace the package. It seems to work but the next day it’s corrupt again. Weird. Ok I’ll download the package directly from Oracle again. The next day again it’s corrupt. I download a slightly different version. No problems. I grab the previous problematic package and put it in a different directory (with no other copies on the file system) - again it becomes corrupt.

There was something specific about the java package that ZFS just thought it needed to “fix”. If I had to guess it was getting the file hash confused. I’m pretty sure we had dedupe turned on so that may have factored into it.

Anyway that’s the first and only time I’ve seen a file system munge up a regular file for no reason - and it was on ZFS.


Performance wasn't robust, especially on dead disks and rebuilds, but also on pools with many (>100) filesystems or snapshots. Performance would often degrade heavily and unpredictably on such occasions. We didn't lose data more often than with other systems.

"play" comes from my distinct impression that the most vocal ZFS proponents are hobbyists and admins herding their pet servers (as opposed to cattle). ZFS comes at low/no cost nowadays and is easy to use, therefore ideal in this world.


Fair enough, I can’t argue with your personal experience, but I can assure you that ZFS is used “for real” at many shops.

I’ve only used ZFS in two- or three-way mirror setups, on beefy boxes, where the issues you describe are minimal. Also JBOD only.

The thing is that without checksumming you’ve actually no idea if you lose data. I’ve had several pools over the years report automatic resilvering on checksum mismatches. Usually it’s been disks acting up well before SMART can tell, and reporting this has been invaluable.


Sounds like you turned on dedupe, or had an absurdly wide stripe size. You do need to match your array structure to your needs as well as tune ZFS.

Our backup servers (45 disks, 6-wide Z2 stripes) easily handle wire-speed 10G with a 32G ARC.

And you're just wrong about snapshots and filesystem counts.

ZFS is no speed demon, but it performs just fine if you set it up correctly and tune it.


Stripe size could have been a problem, though we just went with the default there afair. Most of the first tries just followed the Sun docs; we later only changed things until performance was sufficient. Dedupe wasn't even implemented back then.

Maybe you also don't see as massive an impact because your hardware is a lot faster. X4200s were predominantly meant to be cheap, not fast. No cache, insufficient RAM, slow controllers, etc.


X4200s were the devil's work. Terrible BMC, raid controller, even the disk caddies were poorly designed.

The BMC controller couldn't speak to the disk controller so you had no out-of-band storage management.

I had to run a fleet of 300 of them, truly an awful time.


ZFS performs quite well if you give it boatloads of RAM. It uses its own cache layer, and it eats RAM for breakfast. XFS OTOH is as fast as the hardware can go with any amount of RAM.


Sort of. But no snapshots.

Wanna use LVM for snapshots? 33% performance hit for the entire LV per snapshot, by implementation.

ZFS? ~1% hit. I've never been able to see any difference at the workloads I run, whereas with LVM it was pervasive and inescapable.


That was with the old LVM snapshots. Modern CoW snapshots have a much smaller impact. Plus XFS developers are working on internal snapshots, multi-volume management, and live fsck (live check already works, live repair to come).


I don't doubt this but do you have any documentation?

Asking for a friend who uses XFS on LVM for disk heavy applications like database, file server, etc.


You would have to look at the implementation directly. The user documentation isn't great for documenting performance considerations, sadly.

Essentially it comes down to this: a snapshot LV contains copies of old blocks which have been modified in the source LV. Whenever a block is updated in the source LV, LVM will need to check whether that block has been previously copied into all corresponding snapshot LVs. For each snapshot LV where this is not the case, it will need to copy the block to that snapshot LV.

This means that there is O(n) complexity in the checking and copying. And in the case of "thin" LVs, it will also need to allocate the block to copy to, potentially for every snapshot LV in existence, making the process even slower. The effect is write amplification effectively proportional to the total number of snapshots.

ZFS snapshots, in comparison, cost essentially the same no matter how many you create, because the old blocks are put onto a "deadlist" of the most recent snapshot, and it doesn't need repeating for every other snapshot in existence. Older snapshots can reference them when needed, and if a snapshot is deleted, any blocks still referenced are moved to the next oldest snapshot. Blocks are never copied and only have a single direct owner. This makes the operations cheap.
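
A toy model of those two bookkeeping strategies (not real LVM or ZFS code, just to make the complexity difference concrete):

    #include <stdbool.h>
    #include <stddef.h>

    #define NBLOCKS 4096

    /* LVM-style: every snapshot keeps its own copy-on-write table, so one
     * overwrite in the origin may have to check (and copy into) each snapshot. */
    struct cow_snapshot { bool copied[NBLOCKS]; };

    static void lvm_style_overwrite(struct cow_snapshot *snaps, size_t nsnaps,
                                    size_t block)
    {
            for (size_t i = 0; i < nsnaps; i++) {      /* O(nsnaps) work per write */
                    if (!snaps[i].copied[block]) {
                            /* real code would copy the old block's data here and,
                             * for thin pools, allocate space for it as well */
                            snaps[i].copied[block] = true;
                    }
            }
    }

    /* ZFS-style: the superseded block pointer is appended once to the newest
     * snapshot's deadlist; older snapshots reference it from there. */
    struct deadlist { size_t blocks[NBLOCKS]; size_t count; };

    static void zfs_style_overwrite(struct deadlist *newest, size_t old_block)
    {
            newest->blocks[newest->count++] = old_block; /* O(1) work per write */
    }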


FreeNAS has good documentation on which hardware to pick and how to set up ZFS.


That's for the old "fat" LVM snapshots, right? No way the new CoW thin LVs have such a big overhead for snapshots.


There will be a much bigger overhead in accounting for all of the allocations from the "thin pool".

The overlying filesystem also lacks knowledge of the underlying storage. The snapshot must be able to accommodate writes up to and including the full size of the parent block device in order to remain readable, just like the old-style snapshots did. That's the fundamental problem with LVM snapshots; they can go read-only at any point in time if the space is exhausted, due to the implicit over-commit which occurs every time you create a snapshot.

The overheads with ZFS snapshots are completely explicit and all space is fully and transparently accounted for. You know exactly what is using space from the pool, and why, with a single command. With LVM separating the block storage from the filesystem, the cause of space usage is almost completely opaque. Just modifying files on the parent LV can kill a snapshot LV, while with ZFS this can never occur.


"After EoL a colleague installed Linux with dmraid, LVM and xfs on the same hardware: much faster, more robust."

Please let me know which company this is, so I can ensure that I never end up working there by accident. Much obliged in advance, thank you kindly.


Why? What is bad about playing around with leftover hardware?


Nothing at all; it's what was done to that hardware that's the travesty here. It takes an extraordinary level of incompetence and ignorance to even get the idea to slap Linux with dmraid and LVM onto that hardware, and then claim that it was faster and more robust, without understanding how unreliable and fragile that constellation is and that it was faster only because all the reliability was gone.


dmraid RAID5/6 loses data, sometimes catastrophically, in normal failure scenarios that the ZFS equivalent handles just fine. If a sector goes bad between the time when you last scrubbed and the time when you get a disk failure (which is pretty much inevitable with modern disk sizes), you're screwed.


> Writes in a raidz are always limited to a single disk’s performance

what? no. why would that be the case? You lose a single disk's performance due to the checksumming.

just from my personal NAS I can tell you that I can do transfers from my scratch drive (NVMe SSD) to the storage array at more than twice the speed of any individual drive in the array... and that's in rsync which is notably slower than a "native" mv or cp.

The one thing I will say is that it does struggle to keep up with NVMe SSDs, otherwise I've always seen it run at drive speed on anything spinning, no matter how many drives.


> what? no. why would that be the case? You lose a single disk's performance due to the checksumming.

I think they are probably referring to the write performance of a RAIDZ VDEV being constrained by the performance of the slowest disc within the VDEV.


true, if you have 7 fast disks and one slow disk in a raidz, you get 7 x slow disk performance.


Have you seen any benchmarks for the scenario you've described?


Have you got any info on how to do the required tuning that's geared towards a home NAS?


Group your disks in bunches of 4 or 5 per RAIDz, no more, and have each bunch on the same controller or SAS expander. Use striping over the bunches. Don't use hotspares; for performance, maybe avoid RAIDz6. Try out and benchmark a lot. Get more RAM, lots more RAM.


Back when I set up my last ZFS box running on OmniOS, 5 disks was not optimal, though I am running RAIDZ2.

But yes, lots of RAM.


I think the optimal number of RAIDz5 disks is 3, if you just want performance. But this wastes lots of space of course. Also, the number of SAS/SATA channels per controller and the topology of expanders is important. That's why I don't think there is a recipe; you have to try it out for each new kind of hardware.

And as another thread pointed out, stripe size is also an important parameter.


I think you mean RAIDZ1, not 5.


Yes. RAIDz (without the "1") was the original RAID5 equivalent, and RAIDz2 is equivalent to RAID6. However, since nobody really knows what the hell z1 and z2 are, and z1 is easy to mix up with RAID1 for non-ZFS people, calling them z5 and z6 is far less confusing.


It's the number of parity disks, pretty simple. There has been occasional talk of making the number arbitrary, though presently only raidz1, raidz2, and raidz3 exist.


I think speed is not the primary reason many (most?) people use ZFS; I think it's mostly about stability, reliability and maintainability.


> And I'm not at all interested in some "ZFS shim layer" thing either

If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

Expecting all Linux drivers to be GPL-licensed is unrealistic and just leads to crappy user experiences. nVidia is never going to release full-featured GPL'd drivers, and even cooperative vendors sometimes have NDAs which preclude releasing open source drivers.

Linux is able to run proprietary userspace software. Even most open source zealots agree that this is necessary. Why are all drivers expected to use the GPL?

---

P.S. Never mind the fact that ZFS is open source, just not GPL compatible.

P.P.S. There's a lot of technical underpinnings here that I'll readily admit I don't understand. If I speak out of ignorance, please feel free to correct me.


I am also not an expert in this space - but if I understand correctly the reason the linux Nvidia driver sucks so much is that it is not GPL'd (or open source at all).

There is little incentive for Nvidia to maintain a linux specific driver, but because it is closed source the community cannot improve/fix it.

> Why are all drivers expected to use the GPL?

I think the answer to this is: drivers are expected to use the GPL if they want to be mainlined and maintained - as Linus said: other than that you are "on your own".


> drivers are expected to use the GPL if they want to be mainlined and maintained

I think parent comment wasn't asking for third party, non-GPL drivers to be mainlined, but for a stable interface for out-of-tree drivers.


There is just no incentive for this that I can see. Linux is an open source effort. Linus had said that he considers open source "the only right way to do software". Out of tree drivers are tolerated, but the preferred outcome is for drivers to be open sourced and merged to the main Linux tree.

The idea that Linux needs better support for out of tree drivers is like someone going to church and saying to the priest "I don't care about this Jesus stuff but can I have some free wine and cookies please".

Full disclosure my day job is to write out of tree drivers for Linux :)


I would expect a large fraction of Nvidia's GPU sales to be from customers wanting to do machine learning. What platform do these customers typically use? Windows?

How do the Linux and Windows drivers compare on matters related to CUDA?


Nvidia has a proprietary Linux driver that works just fine for GPGPU purposes. But because it's not GPLed, it will never be mainlined into the kernel, so you have to install it separately. This is in contrast to AMD GPUs, for which the driver lives in the Linux kernel itself.


Critically, Nvidia has a GPL'd shim in the kernel code, which lets them keep a stable ABI. The kind of shim Linus isn't interested in for ZFS.


CUDA works fine, and I have found (completely non-rigorously) that a lot of the time where the workload is somewhat mixed between GPU and CPU you'll get better performance on Linux.

The _desktop_ situation is worse, though perfectly functional. But I boot into Windows when I want battery life and quiet fans on my laptop.


You make it sound like the idea is "if you GPL your driver, we'll maintain it for you", which is kinda bullshit. For one, kernel devs only really maintain what they want to maintain. They'll do enough work to make it compile but they aren't going to go out of their way to test it. Regressions do happen. More importantly though, they very purposefully do not maintain any stability in the driver ABI. The policy is actively hostile to the concept of proprietary drivers.

Which is really kind of hilarious considering that so much modern hardware requires proprietary firmware blobs to run.


My experience is that linux nvidia drivers are better than the competitors open source drivers.


Nvidia proprietary drivers work OK for me, mostly (I needed to spoof the video card ID so KVM could lie to the Windows drivers in my home VFIO setup, but it wasn't hard.)

But it means I can't use Wayland. Wayland isn't critical for me, but since NVidia is refusing to implement GBM and using EGLStream instead, there's nothing I can do about it. It simply isn't worth NVidia's time to make Wayland work, so I'm stuck using X. If the driver were open source, someone would have submitted a GBM patch and I wouldn't be stuck in this predicament.

I can't wait for NVidia to have real competition in the ML space so I can ditch them.


No you can use Wayland as long as your window manager/environment supports GBM. Gnome and KDE both do (Which for most Linux users is all that is needed).

Now you can't use something like Sway but their lead developer is too evangelical for my taste so even if I had an AMD/Intel card I would never use it.


> No you can use Wayland as long as your window manager/environment supports GBM.

You can do that on Intel and AMD drivers and other open source graphics drivers, which due to being open source allow 3rd parties like redhat to patch in GBM support in drivers and mesa when required.

Nvidia driver does not support GBM code paths. Therefore wayland does not work on nvidia. And because nvidia driver is not open source, someone else cannot patch GBM in.


I'm fairly sure parent meant 'EGLStream', not GBM. KDE and GNOME's Wayland compositors both support EGLStream.


Technically, you can use Wayland.

What you cannot use is applications that use OpenGL or Vulkan acceleration. GBM is used for sharing buffers across APIs handled by GPU. If your Wayland clients use just shm to communicate with compositor, it will work.


Is that experience recent? AMD drivers used to be terrible and Intel isn't even competition.


Depends also on the AMD GPU. Vega is fine, Raven Ridge had weird bugs last time I looked, with rx590 I couldn't even boot the proxmox 6.1 installer (it worked when I swapped in rx580 instead).

Why is Intel not competition? In laptops, I want only Intel, nothing else. It is the smoothest/most reliable/least buggy thing you can have.


Performance wise, Intel is streets behind.


I know. But do you need that performance for what you do on the computer?

For most uses, Intel GPU is fine.


But if you do need that performance, Intel isn't an option. If you don't, there is no reason to even consider Nvidia. They serve different needs.


I'm currently running a AMD card because I thought the drivers were better. I was mistaken, I still have screen tearing that I can't fix.

No doubt someone more knowledgeable about Linux could fix this issue, but I never had any issues with my nVidia blobs. That's not to say nVidia don't have their own issues.


this was my experience as well. I eventually bought an NVidia card to replace it so I could stop having problems. It's been smooth ever since.


I have both an Nvidia and an AMD card. AMDGPU is the gold standard.


This was true until relatively recently, but no longer.


> Expecting all Linux drivers to be GPL-licensed is unrealistic and just leads to crappy user experiences. nVidia is never going to release full-featured GPL'd drivers, and even corporative vendors sometimes have NDAs which preclude releasing open source drivers.

Nvidia is pretty much the only remaining holdout here on the hardware driver front. I don't see why they should get special treatment when the 100%-GPL model works for everyone else.


ZFS is not really GPL-incompatible either, but it doesn't matter. Between FUD and Oracle's litigiousness, the end result is that there is no way to overcome the impression that it is GPL-incompatible.

But it is a problem that you can't reliably have out-of-tree modules.

Also, Linus is wrong: there's no reason that the ZoL project can't keep the ZFS module in working order, with some lag relative to updates to the Linux mainline, so as long as you stay on supported kernels and the ZoL project remains alive, then of course you can use ZFS. And you should use ZFS because it's awesome.


> But it is a problem that you can't reliably have out-of-tree modules.

That is the bit I'm trying to get at. Yes it would be best if ZFS was just part of Linux, and maybe some day it can be after Oracle is dead and gone (or under a new leadership and strategy). But it's almost beside the point.

Every other OS supports installing drivers that aren't "part" of the OS. I don't understand why Linux is so hostile to this very real use case. Sure it's not ideal, but the world is full of compromises.


I'm not sure Linux is especially hostile. A new OS version of, say, Windows can absolutely break drivers from a previous version.


Linux absolutely is especially hostile. Windows will generally try to support existing drivers, even binary-only ones, and give plenty of notice for API changes. FreeBSD has dedicated compatibility with previous ABIs going several versions back. Linux explicitly refuses to offer any kind of stability for its API (i.e. they can and will break APIs even in minor patches), let alone its ABI.


Linux is generally not happy about seeing any out of tree drivers.

But that is also not without reason: Linux balances in a field where it is, and wants to stay, open source, while a lot of its users (and sometimes the companies paying some "contributors", too) are companies which are not always that happy about open source. So if it were easy to ship proprietary drivers and still get a good experience out of it, those companies would have very little incentive to ever make any in-tree GPL drivers, and Linux would run the risk of becoming a skeleton you can't use without accepting/buying drivers from multiple 3rd parties.

Though take that argument with a (large) grain of salt; there are counterarguments too. E.g. the LLVM project, which is much more permissive and still well maintained, but then it's also a very different kind of software.


There's a unique variable here and that's Oracle.

That shouldn't actually matter; it should just depend on the license. But millions in legal fees says otherwise.


>If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

As a Linux user and an ex android user, I absolutely disagree and would add that the GPL requirement for drivers is probably the biggest feature Linux has!


Yes, the oftentimes proprietary Android Linux drivers are such a pain. Not only do they make it harder to reuse the hardware outside of Android (e.g. in a laptop or similar), but they also tend to cause delays with Android updates and sometimes make it impossible to update a phone to a newer Android version even if the phone producer wants to do so.

Android did start making this less of a problem with HALs and such, but it's still a problem, just a smaller one.


There is a big difference between a company distributing a proprietary Linux driver, and the Linux project merging software under a GPL-incompatible license. In the first case it is the Linux developers who can raise the issue of copyright infringement, and it is the company that has to defend its right to distribute. In the latter, the roles are reversed, with the Linux developers having to argue that they are in compliance with the copyright license.

A shim layer is a poor legal bet. It assumes that a judge who might not have much technical knowledge will agree that putting this little bit of technical trickery between the two incompatible works somehow turns them from a single combined work into two cleanly separated works. It could work, but it could also very easily be seen as meaningless obfuscation.

> Why are all drivers expected to use the GPL

Because a driver is tightly dependent on the kernel. It is this relationship that distinguishes two works from a single work. An easy way to see this is how a music video works. If I create a file with a video part and an audio part, and distribute it, legally this will be seen as me distributing a single work. I also need additional copyright permission in order to create such a derivative work, rights that go beyond just distributing the different parts. If I were to argue in court that I am just distributing two different works, then the relationship between the video and the music would be put into question.

Userspace software is generally seen as independent work. One reason is that such software can run on multiple platforms, but the primary reason is that people simply don't see it as an extension of the kernel.


There is an "approved" method - write and publish your own kernel module. However, if your module is not GPL licensed it cannot be published in the Linux kernel itself, and you must keep up with the maintenance of the code. This is a relatively fair requirement imo.


...which is what the ZFS on Linux team are doing?

The issue here is that which parts of the kernel API are available to non-GPL modules has been made a moving target from version to version, which might as well be interpreted as "just don't bother anymore".


I wonder if this was exactly what they intended, i.e. "just don't bother writing out-of-tree drivers anymore; put them under the GPL and into the tree". And ZFS might just have been accidentally hit by this, but it is in a situation where it can't put things into the tree...


> If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

It's a feature, not a bug. Linux is intentionally hostile to binary-blob drivers. Torvalds described his decision to go with the GPLv2 licence as "the best thing I ever did". [0]

This licensing decision sets Linux apart from BSD, and is probably the reason Linux has taken over the world. It's not that Linux is technically superior to FreeBSD or OpenSolaris.

> Expecting all Linux drivers to be GPL-licensed is unrealistic and just leads to crappy user experiences

'Unrealistic'? Again, Linux took over the world!

As for nVidia's proprietary graphics drivers, they're an unusual case. To quote Linus: "I personally believe that some modules may be considered to not be derived works simply because they weren't designed for Linux and don't depend on any special Linux behaviour" [1]

> Why are all drivers expected to use the GPL?

Because of the 'derived works' concept.

The GPL wasn't intended to overreach to the point that a GPL web server would require that only GPL-compatible web browsers could connect to it, but it was intended to block the creation of a non-free fork of a GPL codebase. There are edge-cases, as there are with everything, such as the nVidia driver situation I mentioned above.

[0] https://en.wikipedia.org/w/index.php?title=History_of_Linux&...

[1] https://en.wikipedia.org/w/index.php?title=Linux_kernel&oldi...


> If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

The problem is already addressed: if someone wants to contribute code to the project then its licensing must be compatible with the prior work contributed to the project. That's it.


But why are all drivers expected to be "part of the project"? We don't treat userspace Linux software that way. We don't consider Windows drivers part of Windows.


It's pretty simple: once they expose such an API they'd have to support it forever, hindering options for refactoring (which happens all the time). With all the drivers in the tree, they can simply update every driver at the same time to whatever new in-kernel API they're rolling out or removing. And since the majority of drivers would arguably have to be GPL anyway, and thus open source, the advantages of keeping all the drivers in tree are high, and the disadvantages low.

With that, they do expose a userspace filesystem driver interface, FUSE. There used to be a FUSE ZFS driver, though I believe it's mostly dead now (But I never used it, so I don't know for sure). While it's not the same as an actual kernel FS driver (performance in particular), it effectively allows what you're asking for by exposing an API you can write a filesystem driver against without it being part of the kernel code.
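
To give a feel for how small a FUSE filesystem can be, here's a sketch of a read-only filesystem exposing a single file, written against the classic libfuse 2.x API (something like `gcc hello.c $(pkg-config --cflags --libs fuse)` should build it; this mirrors the well-known libfuse hello example rather than anything ZFS-specific):

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>
    #include <string.h>
    #include <errno.h>

    static const char *msg  = "hello from userspace\n";
    static const char *path = "/hello";

    /* Report a root directory containing exactly one read-only file. */
    static int hello_getattr(const char *p, struct stat *st)
    {
            memset(st, 0, sizeof(*st));
            if (strcmp(p, "/") == 0) {
                    st->st_mode = S_IFDIR | 0755;
                    st->st_nlink = 2;
            } else if (strcmp(p, path) == 0) {
                    st->st_mode = S_IFREG | 0444;
                    st->st_nlink = 1;
                    st->st_size = strlen(msg);
            } else {
                    return -ENOENT;
            }
            return 0;
    }

    static int hello_readdir(const char *p, void *buf, fuse_fill_dir_t fill,
                             off_t off, struct fuse_file_info *fi)
    {
            if (strcmp(p, "/") != 0)
                    return -ENOENT;
            fill(buf, ".", NULL, 0);
            fill(buf, "..", NULL, 0);
            fill(buf, path + 1, NULL, 0);   /* "hello" */
            return 0;
    }

    static int hello_read(const char *p, char *buf, size_t size, off_t off,
                          struct fuse_file_info *fi)
    {
            size_t len = strlen(msg);
            if (strcmp(p, path) != 0)
                    return -ENOENT;
            if ((size_t)off >= len)
                    return 0;
            if (off + size > len)
                    size = len - off;
            memcpy(buf, msg + off, size);
            return (int)size;
    }

    static const struct fuse_operations hello_ops = {
            .getattr = hello_getattr,
            .readdir = hello_readdir,
            .read    = hello_read,
    };

    int main(int argc, char *argv[])
    {
            /* Mount with: ./hello /some/mountpoint */
            return fuse_main(argc, argv, &hello_ops, NULL);
    }

The convenience is obvious; the performance cost of bouncing every VFS operation through userspace is the trade-off people complain about.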


Given that the kernel is nearly 30 years old, do you not find it slightly incredible that there has been no effort to stabilise the internal ABI while every other major kernel has managed it, including FreeBSD?

There are ways and means to do this. It would be perfectly possible to have a versioned VFS interface and permit filesystems to provide multiple implementations to interoperate with different kernel versions.

I can understand the desire to be unconstrained by legacy technical debt and be able to change code at will. I would find that liberating. However, this is no longer a project run by dedicated amateurs. It made it to the top, and at this point in time, it seems undisciplined and anachronistic.


You know, come to think of it, is there anything stopping Linux from having a... FKSE (Filesystem in Kernel SpacE) standard API?

Presumably, such a thing would just be a set of kernel APIs that would parallel the FUSE APIs, but would exist for (DKMS) kernel modules to use, rather than for userland processes to use. Due to the parallel, it would only be the work of a couple hours to port any existing FUSE server over into being such a kernel module.

And, given how much code could be shared with FUSE support, adding support for this wouldn't even require much of a patch.

Seems like an "obvious win", really.


It's not the context switch that kills you for the most part, but the nature of the API and its lack of direct access to the buffer cache and VMM layer. Making a stable FKSE leads to the same issues.

That's why Microsoft moved WSL2 to being a full kernel running on Hyper-V rather than in-kernel. Their IFS (installable filesystem driver) stack gets the placement of the buffer cache manager wrong, and it was pretty much impossible to change. At that point, the real apples-to-apples comparison left NT lacking: running a full kernel in another VM ended up being faster because of this.


I mean, it doesn't really work that way, you can't just port a userspace program into a kernel module. For starters, there's no libc in the kernel - what do you do when you want to call `malloc`? ;)

With that, I doubt the performance issues are directly because it runs in userspace, they're likely due to the marshaling/transferring from the in-kernel APIs into the FUSE API (And the complexity that comes with talking to userspace for something like a filesystem), as well as the fact that the FUSE program has to call back into the kernel via the syscall interface. Both of those things are not easily fixable - FKSE would still effectively be using the FUSE APIs, and syscalls don't translate directly into callable kernel functions (and definitely not the ones you should be using).


The hard part isn't the "FKSE API", the hard part is for the "FKSE driver" to be able to do anything other than talk to that API. Like, scheduling, talking to storage, the network, whatever is needed to actually implement a useful filesystem.


The problem is that nobody is interested in doing that and that's why we are in this situation in the first place. If Oracle wanted to integrate ZFS into Linux they would just relicense it.


> With that, they do expose a userspace filesystem driver interface, FUSE.

Yes, which Linus has also pooh-poohed:

"People who think that userspace filesystems are realistic for anything but toys are just misguided."


I mean, he's right. VFS, VMM, and buffer cache are all three sides of the same coin. Nearly every system that puts the FS in user space has abysmal performance; the one exception I can think of off the top of my head is XOK's native FS which is very very very different than traditional filesystems at every layer in the stack, and has abysmal performance again once two processes are accessing the same files.


Oh, I totally agree. But between that statement and this one about ZFS, the takeaway seems to be: for filesystems on Linux, go GPL or go home. Which is fine if that's his attitude, but if so I do wish he'd be more direct about it rather than making claims that are questionable at best (e.g. "ZFS is not maintained"--wtf?).


And yet people use them all the damn time because they're incredibly useful and even more importantly are relatively easy to put together compared to kernel modules.

Linus is just plain wrong on this one.


You should read the full quote, he really doesn't disagree with you:

> fuse works fine if the thing being exported is some random low-use interface to a fundamentally slow device. But for something like your root filesystem? Nope. Not going to happen.

His point is that FUSE is useful and fine for things that aren't performance critical, but it's fundamentally too slow for cases where performance is relevant.


The problem with FUSE file systems is not that they aren't part of the kernel's VCS repo, but that it requires a context switch to user-space.


> But why are all drivers expected to be "part of the project"? We don't treat userspace Linux software that way.

It's the policy of Linux development at work. The Linux kernel doesn't break userspace: you can safely upgrade the kernel and your userspace will keep working. But the kernel breaks internal APIs freely, and kernel developers take responsibility for all the in-tree code. So if a patch in the memory management subsystem breaks some drivers, kernel developers will find the breakages and fix them.

> We don't consider Windows drivers part of Windows.

Yeah, because the Windows kernel breaks backward compatibility in kernel space less frequently, and because hardware vendors are willing to maintain drivers for Windows.


You can license both kernel modules or FUSE implementation any way you see fit. That's a non-issue.

https://www.kernel.org/doc/html/latest/process/license-rules...

It seems that some people are oblivious to the actual problem, which is that they want their code to be mixed into the source code of a software project without complying with the rights holder's wishes, as if that will shouldn't be respected.

> We don't consider Windows drivers part of Windows.

I'm not sure you can commit your source code to the Windows kernel project.


No, no one wants to force ZFS into the Linux kernel. I think everyone agrees that it needs to be out-of-tree the way things currently stand.

The problem is the nature of the changes, and people questioning whether there is any good _technical_ reason for some of them to be done the way they are.


Because running proprietary binaries in kernel space is not a good idea nor is it compatible with the vision of Linux?


ZFS isn't proprietary; it's merely incompatible with the GPL.


> If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

It's less a thing Linux can work on than a thing lawmakers/courts would have to make binding decisions on, which would make it clear whether this usage is OK or not. But in practice this can only be decided on a case-by-case basis.

The only way Linux could work on this is by:

1. Adding an exception to their GPL license to exclude kernel modules from GPL constraints (which obviously won't happen, for a bunch of reasons).

2. Turning Linux into a microkernel with user-land drivers and driver interfaces that are not license-encumbered (which again won't happen, because this would be a completely different system).

3. Oracle re-licensing ZFS under a permissive open source license (e.g. dual-licensing it; it doesn't need to be GPL, just GPL-compatible, e.g. Apache v2). Guess what, that won't happen either, or at least I would be very surprised. Oracle is running out of products people _want_ to buy from them and increasingly (ab-)uses the license/copyright/patent system to earn its money and force people to buy its products (or at least somehow pay license fees).


>[...] that seems like a major problem that Linux should be working to address [...] Why are all drivers expected to use the GPL?

Vendors are expected to merge their drivers in mainline because that is the path to getting a well-supported and well-tested driver. Drivers that get merged are expected to use a GPL2-compatible license because that is the license of the Linux kernel. If you're wondering why the kernel community does not care about supporting an API for use in closed-source drivers, it's because it's fundamentally incompatible with the way kernel development actually works, and the resulting experience is even more crappy anyway. Variations of this question get asked so often that there are multiple pages of documentation about it [0] [1].

The tl;dr is that closed-source drivers get pinned to the kernel version they're built for and lag behind. When the vendor decides to stop supporting the hardware, the drivers stop being built for new kernel versions and you can basically never upgrade your kernel after that. In practice it means you are forced to use that vendor's distro if you want things to work properly.

>[...] nVidia is never going to release full-featured GPL'd drivers.

All that says to me is that if you want your hardware to be future-proof, never buy nvidia. All the other Linux vendors have figured out that it's nonsensical to sell someone a piece of hardware that can't be operated without secret bits of code. If you ever wondered why Linus was flipping nvidia the bird in that video that was going around a few years ago... well now you know.

[0]: https://www.kernel.org/doc/html/latest/process/kernel-driver...

[1]: https://www.kernel.org/doc/html/latest/process/stable-api-no...


> Linux is able to run proprietary userspace software. Even most open source zealots agree that this is necessary. Why are all drivers expected to use the GPL?

To answer your excellent question (and ignore the somewhat unfortunate slam on people who seem to differ with your way of thinking), it is an intentional goal of software freedom. The idea of a free software license is to allow people to obtain a license to the software if they agree not to distribute changes to that software in such a way that downstream users have fewer options than they would with the original software.

Some people are at odds with the options available with licenses like the GPL. Some think they are too restrictive. Some think they are too permissive. Some think they are just right. With respect to your question, it's neither here nor there whether the GPL is hitting a sweet spot or not. What's important is that the original author has decided that it did and has chosen the license. I don't imagine that you intend to argue that a person should not be able to choose the license that is best for them, so I'll just leave it at that.

The root of the question is "What determines a change to the software". Is it if we modify the original code? What if we add code? What if we add a completely new file to the code? What if we add a completely new library and simply link it to the code? What if we interact with a module system at runtime and link to the code that way?

The answers to these questions are not well defined. Some of them have been tested in court, while others have not. There are many opinions on which of these constitutes changing of the original software. These opinions vary wildly, but we won't get a definitive answer until the issues are brought up in court.

Before that time period, as a third party who wishes to interact with the software, you have a few choices. You can simply take your chances and do whatever you want. You might be sued by someone who has standing to sue. You might win the case even if you are sued. It's a risk. In some cases the risk is higher than others (probably roughly ordered in the way I ordered the questions).

Another possibility is that you can follow the intent of the original author. You can ask them, "How do you define changing of the software". You may agree with their ideas or not, but it is a completely valid course of action to choose to follow their intent regardless of your opinion.

Your question is: why are all drivers expected to use the GPL? The answer is because drivers are considered by the author to be an extension of the software and hence to be covered by the same license. You are absolutely free to disagree, but it will not change the original author's opinion. You are also able to decide not to abide by the author's opinion. This may open you up to the risk of being sued. Or it may not.

Now, the question unasked is probably the more interesting question. Why does Linus want the drivers to be considered an extension of the original software? I think the answer is that he sees more advantages in the way people interact in that system than disadvantages. There are certainly disadvantages and things that we currently can't use, but for many people this is not a massive hardship. I think the question you might want to put to him is, what advantages have you realised over the years from maintaining the license boundaries as they are? I don't actually know the answer to this question, but would be very interested to hear Linus's opinion.


Sorry for using the term "zealots", I didn't intend it as a pejorative. I should probably have said "hardliners". I meant only to refer to people at the extreme end of the spectrum on this issue.

> The root of the question is "What determines a change to the software". [...] The answers to these questions are not well defined.

And that's fair, but what confuses me is that I never see this question raised on non-Linux platforms. No one considers Windows drivers a derivative of Windows, or Mac kernel extensions a derivative of Darwin.

Should the currently-in-development Windows ZFS port reach maturity and gain widespread adoption (which feels possible!), do you foresee a possibility of Oracle suing? If not, why is Linux different?


>No one considers Windows drivers a derivative of Windows, or Mac kernel extensions a derivative of Darwin.

Perhaps they do, but the difference is that their licensing does not regard their status of derivative works as being important. Those platforms have their own restrictions on what drivers they want to allow. In particular, Mac doesn't even allow unsigned drivers anymore, and any signed drivers have to go through a manual approval process. And don't forget iOS, which doesn't even support user-loadable drivers at all.

>Should the currently-in-development Windows ZFS port reach maturity and gain widespread adoption (which feels possible!), do you foresee a possibility of Oracle suing? If not, why is Linux different?

I'm not sure, I haven't used Windows in many years and I don't know their policies. But see what I said earlier: the simple answer is that the license is different from the license of Linux. For more details, the question you should be asking is: Is the CDDL incompatible with Windows licensing?


Thank you!

Just to clarify one little thing, because it appears to be something of a common misconception:

> Mac doesn't even allow unsigned drivers anymore

You can absolutely still install unsigned drivers (kernel extensions) on macOS, the user just needs to run a Terminal command from recovery mode. This is a one-time process that takes all of five minutes if you know what you're doing.

You can theoretically replace the Darwin kernel with your own version too. macOS is not iOS, you can completely open it up if you want.


This is nonsense. The problem is not getting ZFS bundled with Linux like he is implying here. The problem is how Linux artificially restricts what APIs your module is able to access based on the license, so you wouldn't be able to use ZFS even by your own prerogative like he is suggesting.

He is claiming that it comes down to the user's choice, which would be just fine if that were true. The only problem here is that Linux has purposely taken steps to hinder that choice.


I'll give up Linux on my servers before I give up ZFS

especially so given the recent petulant attitude that broke API compatibility in the LTS branch just to spite the ZFS developers: https://news.ycombinator.com/item?id=20186458

compete honestly on technical merit, rather than pulling dirty tricks that you'd expect of Oracle or 1990's MS


Pretty much my view as well. If Linux becomes incompatible with ZFS in any way, I'll switch to FreeBSD.

That said, after the Oracle Java debacle, I can see why Linus would not be receptive towards merging ZFS into the kernel. I just wish he argued the point on legal issues alone instead of making up stories about non-existent technical flaws in ZFS. The whole thing is basically a work of art. Oracle should consider GPL-ing it and integrating it into Linux directly.


I have been using FreeBSD for its better ZFS support for years, and it's great. Highly recommended.


Same here. I immediately ditched Solaris (hated it) for Linux when KQ Infotech ported ZFS over about a decade ago, but after a few years switched all my file servers to FreeBSD. ZFS > Linux. Did I mention I love Linux?


FreeBSD has been using the ZFS on Linux code for a while.


Not true, not yet.


Thanks for the correction, they decided to go that route in 2018, seems like it’s not live yet.


> Oracle should consider GPL-ing it and integrating it into Linux directly

I think Apache (or BSD/MIT) would be far more palatable, as GPL'ing it would cut off the BSDs as well as OpenSolaris, which would certainly be a bummer.


Por que no los dos? It's their IP, they can license it under several licenses to maximize adoption. This being Oracle though, I don't think it'll ever happen.


> If Linux becomes incompatible with ZFS in any way, I'll switch to FreeBSD

Isn't FreeBSD now using ZFSOnLinux project as well?


no. see above


Yeah... I use FreeBSD for file servers because I don't even have to pay attention to this constant ZFSonLinux drama. I treat them almost like appliances. Linux servers are more than happy to use them on the back-end.


how do you structure that? block images served by iscsi? shares served by nfs/samba?


All of the above, depending on what I'm trying to accomplish. Usually Samba for Windows clients and general bulk storage (pretty much everything can do CIFS mounts at some basic level these days), NFS for *nix and VMWare clients (esp where I can leverage NFSv4), iSCSI for various block image needs, scp/sftp/rsync. All of this is basic out of box for FreeBSD.

In operations, I treat them as semi-black-box (grey-box?) appliance where they only do the file storage function and are moral equivalents of Network Appliance NAS boxes. I don't try to convince the Linux or Windows teams to migrate other workloads to FreeBSD, and mostly don't want them to since that would mean Yet Another app environment to support.

FreeBSD has a really efficient network stack, so I can attach the FreeBSD stores at 10GB (usually 2-4 bonded 10GB links) and it keeps up fine. I have colleagues who are doing the same with 40GB links (40-160GB aggregate) to backbone networks, and apparently there are many shops hooking up FreeBSD with 100GB links. The limiting factor seems to be ZFS and the storage subsystems supporting it, not the network, which is interesting, but I don't have the ability to benchmark the 40GB and higher stuff.
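
For the NFS side at least, ZFS's share property keeps the exports painless. A rough sketch, not an actual config; the dataset and network are made up:

    # /etc/rc.conf needs nfs_server_enable="YES" (plus rpcbind/mountd)
    zfs create tank/exports
    zfs set sharenfs="-maproot=root -network=10.0.0.0/24" tank/exports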


The core development team hasn't really been fully trustworthy since they spent years pretending their CPU scheduler wasn't hot garbage for desktop usage, denied the need for a pluggable scheduler that would allow multiple schedulers to be selected from, then seemingly an age later implemented something in the same vein as CK while giving zero credit.


As a heavy user of ZFS and Linux, what else is there that even comes close to what ZFS offers?

I want cheap and reliable snapshots, export & import of file systems like ZFS datasets, simple compression, caching facilities (like SLOG and ARC), and decent performance.


Bcachefs is probably the only thing that will get there. The codebase is clean and well maintained, built from solid technology (bcache), and will include most of the ZFS niceties. I just wish more companies would sponsor the project and stop wasting money on BTRFS.


Yes, I’m eagerly waiting for Bcachefs to get there at some point, but it is several years away (rightly so, because it is hard and the developer is doing an amazing job) if my understanding of its state is correct.

I have heard of durability issues with btrfs, and do not want to touch it if it fails with its primary job.


Which is why ZFS is still a thing today - there are no other alternatives. Everything is coming "soon" while ZFS is actually here and clocking up a stability track-record.


>Bcachefs is probably the only thing that will get there.

Or Bcachefs is probably the only thing that might get there.

The amount of engineering hours that went into ZFS is insane. It is easy to get a project that has 80% similarity on the surface, but then you spend as much time on the last 20% and the edge cases as you did getting from 0 to 80%. ZFS has been battle-tested by many. Rsync.net runs on ZFS.

The number of petabytes stored safely in ZFS over the years gives peace of mind.

Speaking of rsync.net, a ZFS topic on HN normally has him resurface. Haven't seen any reply from him yet.


I’m looking forward to bcachefs becoming feature complete and upstreamed. We finally have a good chance of having a modern and reliable FS in the Linux kernel. My wish list includes snapshots and per volume encryption.


What if the main purpose of BTRFS is to have something "good enough" so no one starts working on a project that can compete with large commercial storage offerings?

Does anyone remember the parity patches they rejected in 2014?

> Your work is very very good, it just doesn’t fit our business case.

I haven't followed it much. Does it have anything more than mirroring (that's stable) these days?


>stop wasting money on BTRFS

You're saying they should stop supporting a project that was considered stable by the time the other started being developed. Why do that? What makes Bcachefs a better choice?


Take a cursory look into both codebases, and at the stability of every feature at launch and under maintenance. It's not hard to see BTRFS is a doomed project. Bcachefs is more like PostgreSQL: the developer doesn't add features until he has a solid, well-thought-out design. Hence why he hasn't implemented snapshots yet.

I don't think too many people consider it stable enough for production, either. (Unless you count a very limited subset of its functionality).

I'd rather run Bcachefs today than Btrfs, by a mile. At least with bcachefs I won't lose my data.


If you are on BTRFS and you encounter an unrecoverable bug (which seems to be reasonably common), the developers will most likely recommend you wipe the drive and restore from backups (because you had backups, right?)

Even if the data is still on the drive and a bugfix would make the filesystem recoverable again, they don't have the time/knowledge/resources to untangle that codebase and make fixes. Even BTRFS developers don't trust the filesystem with their own data.

If you are on Bcachefs and you encounter an unrecoverable bug, the developer will ask for some logs, or reproduction steps, or potentially even remote debugging access to your corrupt filesystem.

And then he will fix the bug, releasing a new version that can read/repair your filesystem. He knows his codebase like the back of his hand.

In my research, I couldn't find any examples of someone actually losing data due to Bcachefs. All the bugs appeared to be "data has been written to drive, but bug prevented reading"

While I would still hesitate to trust Bcachefs, I would trust it way more than BTRFS.


Just want to note that bcachefs looks great (I was sort of tangentially aware but hadn't dedicated significant attention to it).

Definitely something to try out (backing up my home servers is just about to reach viability for me, so I'd definitely consider switching to it in that use case).

Thanks!


Btrfs is the only FS I used that resulted in complete FS corruption losing nearly all data on disk, not once, but 3 times.

After that, none of the features like compression, snapshots, COW or checksums meant anything to me. I'm much happier with ext4 and xfs on lvm.


Anecdote, I know, but I have about a dozen machines with BtrFS volumes, all active with varying loads, and I have never experienced data loss. It seems some features are more mature than others - only two of the volumes span more than one disk and none has files larger than a physical volume (even though one of the multi-device volumes is striped).


In the 26 years or so I have used Linux, I have had corrupted filesystems with reiserfs, XFS, btrfs, and ext[23]. In the case of reiserfs and XFS it was practically impossible to recover the filesystem (IIRC reiserfs would reattach anything that resembled a B-tree). For ext[23], it was surprisingly easy to get back most of the data. Never had any corruption with ZFS or ext4. I didn't try to fix the btrfs filesystem, since it was a machine that had to be repurposed anyway.


My experience with recovering btrfs is that you get back most of your files, but with the content replaced with random gibberish. Which is not too useful.

In a way, I would rather it bomb out and declare a total loss than to keep sinking more time into it as it leads you along.


When was it that XFS got corrupted on you? Since Red Hat has embraced XFS, I assume it's quite good now.


Somewhere between being merged in mainline and 2009.


Funny, the other day on another HN thread someone was saying btrfs is good; even though I said Red Hat has abandoned the btrfs ship, he said Facebook has been using it heavily.

But seeing how so many people had lost data using it, I will never use btrfs...


I don't think BTRFS has ever been considered stable.

I think they just said: "The on-disk data structure is stable" and lots of people misinterpreted that as "the whole thing is stable"

A stable on-disk data structure just means it's been frozen and can't be changed in non-backwards compatible ways. It says nothing about code quality, feature completeness or if the frozen data structure was any good.


The finalization of the on-disk data structure came soon after Btrfs was announced and happened before 2010. I meant that by 2010s when Bcachefs started development, Btrfs was considered a supported filesystem for "big name" server distros such as Oracle and SUSE.


Snapshots don't seem to be done yet.


Kent has admitted (many times) that snapshots are one of the more difficult features to add in a reliable and safe way, and will require significant work to do right, especially for what he wants to see them do (I assume "really damn fast and low overhead" is a major one, plus some other tricks he has up his sleeve.) So he has intentionally not tackled them yet, instead going after a slew of other features first. Reflink, full checksum, replication, caching, compression, native encryption, etc. All of that works today.

Snapshots are a huge feature for sure, but it's not like bcachefs is completely incapable without them.

There was a very recent update he gave in late December (2019) that mentioned he's actively chipping away at the roadblocks for snapshots.


They're being worked on ATM: Dec 29, 2019 "Just finished a major rework that gets us a step closer to snapshots: the btree code is incrementally being changed to handle extents like regular keys." https://www.patreon.com/posts/towards-32698961


That's exactly why I said it's probably the only one that will get there.


Heh, BTRFS deja vu. Been hearing about the ZFS alternative "not quite there, but catching up" for about as long as high-speed rail. I wonder which will arrive first :)


BTRFS is never going to become stable. Ever. Just take a quick dive into the codebase.

Bcachefs has never had an unrecoverable data error AFAIK, even though it isn't even considered stable enough to merge into the kernel. The bcachefs on-disk format won't be considered stable until he merges his code into mainline, though he doesn't feel he will need to adjust it further.

Features that currently work:

- full data checksumming

- compression

- multiple device support

- tiering/writeback caching

- RAID1/RAID10

All of these are stable, tested, and mostly bug free. Honestly, once the code gets mainlined you'll be able to start using it very quickly.

The main issue right now is performance, as it is about as slow as BTRFS, which isn't inspiring. However, the author has stated that he's going for correctness first, then he'll begin optimizing.


I know this isn't an option for everyone, but this is part of why I run FreeBSD instead of Linux for servers where I need ZFS.


This isn't why I started running FreeBSD, but it is also one of the reasons I continue to run FreeBSD.


Yes, I run Linux for business but keep a FreeBSD box personally so I'm used to it in case I need ZFS for business.


I agree that ZFS has a lot to offer. But the legal difficulties in merging ZFS support into the mainline kernal are understandable. It's a shame but I think he is making the right call.


Merging into the mainline kernel is not what the person he is replying to was even asking for. All they were asking is for Linux to stop putting APIs behind DRM that prevents non-GPL modules like ZFS from using them. That doesn't mean ZFS must be bundled with Linux.

I think everyone is in agreement that ZFS can't be included in the mainline kernel. The question is just if users should be able to install and use it themselves or not.


Thanks, I should have read more into this.

The follow up actually clears things up pretty well. https://www.realworldtech.com/forum/?threadid=189711&curpost...


Kernal? If you can merge zfs support into 8KB kernal then you are not a mere mortal, so no need to worry about any legal difficulties.


XFS on an LVM thin pool LV should give you a very robust FS, cheap CoW snapshots, and multi-device support. If you want, you can put the thin pool on RAID via LVM RAID underneath it.

For import/export, IIRC XFS has support for it (xfsdump/xfsrestore), and you can dump from LV snapshots to get atomicity.

For caching there is LVM cache, which again should be possible to combine with the thin pool & RAID. Or you can use it separately for a normal LV.

All this is functionality tested by years of production use.

For compression/deduplication, that is AFAIK work in progress upstream based on the open sourced VDO code.
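
For illustration, a minimal sketch of that stack (assuming a volume group named vg0 already exists; all names are made up):

    lvcreate --type thin-pool -L 500G -n pool0 vg0    # the thin pool
    lvcreate --thin -V 1T -n data vg0/pool0           # thin LV (can overcommit)
    mkfs.xfs /dev/vg0/data
    lvcreate --snapshot --name data_snap vg0/data     # cheap CoW snapshot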


Interesting combination of tools I have used independently but never as a replacement of my beloved ZFS.

Never made snapshots with LVM. Always used LVM as a way to carve up logical storage from a pool of physical devices but nothing more. I need to RTFM on how snapshotting would work there - could I restore just a few files from an hour ago while letting everything else be as they are?

With ZFS, I use RAM as a read cache (ARC) and an Optane disk as a sync write cache (SLOG). I wonder if LVM cache would let me do such a thing. Again, a pointer for more manual reading for me.

Compression is a nice to have for me at this moment. Good to know that it is being worked on at the LVM layer.


IIRC you can mount any of the snapshots & copy files from it without influencing the others & the thin LV itself. As for RAM caching, I'm not sure LVM would allow an LVM cache residing on a RAM disk PV, but isn't the regular transparent Linux page cache sufficient, actually?
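
Roughly like this, I think (names made up; thin snapshots skip activation by default, hence the -K, and XFS needs nouuid to mount a snapshot alongside its origin):

    lvchange -ay -K vg0/data_snap                      # activate the thin snapshot
    mount -o ro,nouuid /dev/vg0/data_snap /mnt/restore
    cp -a /mnt/restore/path/to/file /srv/data/path/to/file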

For some reading about LVM thin provisioning:

http://man7.org/linux/man-pages/man7/lvmthin.7.html

https://access.redhat.com/documentation/en-us/red_hat_enterp...


Call me when somebody like a major cloud provider has used this system to drive millions of hard drives. I'm not going to patch my data security together like that.

There is a difference between 'all these tools have been used in production' and 'this is an integrated tool that has been used for 15+ years in the biggest storage installations in the world'.


Yes! The problem with the LVM approach to replicating anything ZFS does is that you have to use a myriad of different tools. And then you have to pray that they all work correctly together; if one has a bug, you may lose all your data because of the data corruption that can emerge from it.


Honestly asking: how does Btrfs compare to ZFS?

There's also Lustre but it's a different beast altogether for a different scenario.


On the surface, btrfs is pretty close to zfs.

Once you actually use them, you discover all the ways that btrfs is a pain and zfs is a (minor) joy:

- snapshot management

- online scrub

- data integrity

- disk management

I lost data from perfectly healthy-appearing btrfs systems twice. I've never lost data on maintained zfs systems, and I now trust a lot more data to zfs than I ever have to btrfs.


At least disk management is far easier with btrfs. You can restripe at will while zfs has severe limitations around resizing, adding and removing devices.

Granted, at enterprise scale this hardly matters because you can just send-receive to rebuild pools if you have enough spares, but for consumer-grade deployments it's a non-negligible annoyance.


Restriping is a source of unsafety, though. A lot of ZFS's data safety comes from the fact that it doesn't support overwriting anything in place, so normal operation can't introduce unrecoverable corruption. In fact, all writes are copy-on-write into new blocks.


ZFS wanted to have that too (the mythical block pointer rewrite), but it never happened; instead they added clunky workarounds like indirection tables.


It was treated more like "ok, yet another person complaining about it - here's what you need to implement, and why you won't".

The indirection tables are survivable for fixing short term mistakes, though.


Actually, this matters a lot in many enterprises. Beancounters hate excess capacities, so there are never enough spares and everything is always almost full.

Maybe SV is different...


Since the plural of anecdote is data, I'll provide mine here. ZFS is the only file-system from which I've lost data on hardware that was functioning properly, though that does come with a caveat.

Twice btrfs ended up in a non-mountable situation, but both times it was due to a known issue and #btrfs on freenode was able to walk me through getting it working again.

With ZFS, I ended up with a non-mountable system, and the response in both #zfs and #zfsonlinux to me posting the error message was, "that sucks, hope you had backups." Since I had backups, and it was my laptop 2000 miles from home that was my only computing device, I didn't dig deeper to see if I could discover the problem. FWIW, I've been using ZFS on that same hardware for almost 2 years since with no issues.


Thanks for your answer and sorry for your data loss.

> I lost data from perfectly healthy-appearing btrfs systems twice.

I still consider btrfs as beta-level software. This is why I never looked into it very seriously and asked this question.

Looks like btrfs needs something like another five years to be considered serious at the scale where ZFS is just starting to warm up.


The one thing I can't understand about btrfs is the unknown answer to the question "How much disk space do I have left?". I don't get why that's a "this much, maybe" answer.


    # btrfs filesystem usage /
    Overall:
        Device size:         142.86GiB
        Device allocated:     48.05GiB
        Device unallocated:   94.81GiB
        Device missing:          0.00B
        Used:                 37.75GiB
        Free (estimated):    103.94GiB  (min: 103.94GiB)
        Data ratio:               1.00
        Metadata ratio:           1.00
        Global reserve:       82.20MiB  (used: 0.00B)


"Free (estimated)"


btrfs is such a mess that for a database or VM to be marginally stable, you have to disable the CoW featureset for those files with the +C attribute. It's nowhere near a serious solution.


Btrfs has eaten my data, and once that happens I will never, ever, ever, literally ever go back to that system. It's unacceptable to me that a system eats data, especially after multiple rounds of "it's stable now".

But in the end it always turns out that it's only not going to eat your data if you 'use' it correctly.

I used ZFS for far longer and had far fewer issues.


Stratis and VDO have a lot of promise, although it's still a little early. The approach that Stratis has taken is refreshing. It's very simple and reuses lots of already existing stuff so by the time it's released it will already be mature (since the underlying code has been running for many years).

Once a little more guidance comes out about how to properly use VDO and Stratis together, I'll move my personal stuff to it.


So besides the obvious btrfs answer, what about ceph as clustered storage with very fast connectivity?

There is also BeeGFS, I haven't used it but /r/datahoarders sometimes touts it.

Not for Linux, but I have been keeping an eye on Matt Dillon's DragonFly BSD, where he has been working on HAMMER2, which is very interesting.

I don't know much but bcachefs has been making more waves lately also.

I think the bottom line is that people need to have good backup in place regardless.


Does btrfs meet your requirements?


I've tried btrfs without much luck.

btrfs still has a write hole for RAID5/6 (the kind I primarily use) [0] and has since at least 2012.

For a filesystem to have a bug leading to data loss unpatched for over 8 years is just plain unacceptable.

I've also had issues even without RAID, particularly after power outages. Not minor issues but "your filesystem is gone now, sorry" issues.

[0]: https://btrfs.wiki.kernel.org/index.php/RAID56


It's not a bug, but an unimplemented feature. They never made any promise that raid5 is production-ready.

Pretty much all software RAID systems suffer from it unless they explicitly patch over it via journaling. Hardware RAID gets away with it if it has a battery backup; if it doesn't, it suffers from exactly the same problem.


... hence the desire to use ZFS, which skips trying to present a single coherent block device and performs parity at the file (chunk) level.


My home NAS runs btrfs in RAID 5. The key is to use software RAID / LVM to present a single block device to btrfs. That way you never use btrfs's screwed-up RAID 5/6 implementation.


If you use LVM/mdadm for RAID, it's not possible for btrfs to correct checksum mismatches (i.e. protect against bitrot).


That's a good point, though Synology (my brand of NAS) claims that they've developed analogous corruption checks operating at the LVM level, so you get the benefits of btrfs (including checksum checks and RAID scrubbing) without having to actually use its RAID implementation.

https://www.synology.com/en-global/knowledgebase/DSM/help/DS...


I wasn't actually able to find any real documentation on how Synology's SHR works.

Their recovery documentation [0] indicates that SHR is just plain mdadm + LVM and a couple of NAS recovery sites [1,2] indicate the same.

In the end I got a Reddit post [3] with a response from a Synology representative who says that the btrfs filesystem will request a read from a redundant copy from mdadm in order to correct checksum errors.

I wonder whether this is unique to Synology or whether the change has been upstreamed into the main Linux kernel.

[0]: https://www.synology.com/en-global/knowledgebase/DSM/tutoria...

[1]: https://support.reclaime.com/kb/article/8-synology-shr-raid/

[2]: http://www.nas-recovery.com/kb_hybrydraid.php

[3]: https://www.reddit.com/r/DataHoarder/comments/5yb13m/anyone_...


Why use RAID5/6? RAID10 is much safer because you drastically reduce the chance of a cascading resilvering failure. Yes, you get less capacity per drive, but drives are (relatively) cheap.

I thought I wanted RAID5, but after reading horror stories of drives failing when replacing a failed drive, I decided it just wasn't worth the risk.

I currently run RAID1, and when I need more space, I'll double my drives and set up RAID10. I don't need most of the features of ZFS, so BTRFS works for me.


I use RAID6 because it gives me highly efficient utilization of my available storage capacity while still giving me some degree of redundancy should a disk fail. My workload is also mostly sequential, so random read/write performance isn't too important to me.

If a disk fails and resilvering causes a cascading failure, I can restore from a backup.

I think you might be mistaking RAID for a backup, which is a mistake. RAID is very much not a backup or any kind of substitute for a backup. A backup ensures durability and integrity of your data by providing an independent fallback should your primary storage fail. RAID ensures availability of your data by keeping your storage online when up to N disks fail.

RAID won't protect you from an accidental "rm -Rf /", ransomware or other malware, bugs in your software or many other common causes of data loss.

I might consider RAID10 if I were running a business-critical server where availability was paramount, or where I needed decent random read/write performance but even so I'd still want a hot-failover and a comprehensively tested backup strategy.


btrfs is not at all reliable, so if you care about your files staying working files, it probably doesn't meet your requirements. It is like the MongoDB 0.1 of filesystems.


Seems pretty reliable these days. Are you commenting based upon personal experience? If so, when was it that you used btrfs?


When it comes to file systems, “pretty reliable” these days does not sound very good. Reliability has to be a fundamental requirement in the design of a file system. If not, it sounds like putting lipstick on a pig.

Red Hat throwing in the towel on supporting its development does not instill confidence either.

Nothing personally against Btrfs. Just an end user making a file system choice saying what I care about.


re Redhat deprecating btrfs:

> People are making a bigger deal of this than it is. Since I left Red Hat in 2012 there hasn't been another engineer to pick up the work, and it is _a lot_ of work.

https://news.ycombinator.com/item?id=14909843


I have a laptop running opensuse, with root on btrfs. Twice I have had to reinstall because it managed to corrupt the file system.


btrfs + dm-cache? throw in dm-raid if you want raid5.


Hardware RAID controllers can do most if not all of these things.


I've lost more data in hardware RAID than in ZFS but I have lost data in both.

Hardware RAID has very poor longevity. Vendor support and battery backup replacement collide in BIOS and host management badly.

Disclaimer: I work on Dell rackmounts, which means that rather than native SAS I get 'Dell's hack on SAS', which is a problem, and I know it's possible to 'downgrade' back to native.


Yeah we started ordering the ones with the supercap so we didn’t have to replace batteries anymore.

Somewhat recently I dealt with LSI and Dell cards. Longevity seemed just fine for a normal 3 year server lifecycle. The only time we had an issue is when the power went down in the data center. The power spike fried a few of the cards. Luckily we had spares.

Way way back I dealt with the Compaq/hp smartarrays. Those were awful. Also anything consumer grade is awful.


The problem with most of these is you have to bring the system down to do maintenance. You can do a scrub on zfs while it's up.


Most non-hobbyist RAID hardware does online-scrub just fine (not that I would recommend wasting money on such hw).

Btw, ZFS scrub is not only a RAID block check but also a partial fsck, so it's not really comparable.


We used the LSI 9286CV-8e (or dell equivalent) which was somewhere between $1000-$1500 back in the day. Worth it compared to babysitting any software RAID IMO.


Pay more for less safety and put all your data into the hands of the guy who wrote the firmware for that thing. I'm sure that software is well maintained open source code.


"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me." - Linus

I have a strong feeling Linus has never actually used ZFS.


Probably not, given that statement. There's a reason why nearly everything I deal with today in large enterprise uses ZFS.


I also think he took a reasonable licence issue and conflated it with personal opinion not backed by experience. Nobody who has actually run ZFS says it's just buzzwords.


The license issue is not actually so clear. The actual license is a good one. Oracle itself is the bigger problem.


Don't know.

Certainly about a decade ago the ZFS' chief architect, Jeff Bonwick, and Linus have met:

https://blogs.oracle.com/bonwick/casablanca-v2


his position is about legality, nothing else. I can see his point.


I once read this story about the problems of trying to support ZFS - was it in the Linux kernel, though? Can't remember. Sadly, I can't seem to dig it up right now, but the article walking readers through the various clashes in constraints between the different systems and implementations was bordering on the humorous.


I've never used it. What makes it so good?


He's not wrong. ext4 is actually maintained. This matters. ZFS hasn't kept up with SSDs. ZFS partitions are also almost impossible to resize, which is a huge deal in today's world of virtualized hardware.

Honestly Linus's attitude is refreshing. It's a sign that Linux hasn't yet become some stiff design-by-committee thing. One guy ranting still calls the shots. I love it. Protect this man at all costs.


> ZFS hasn't kept up with SSDs.

Pretty sure this is false. ZFS does support trim (FreeBSD had trim support for quite a while, but ZoL has it now as well), as well as supporting l2arc and zil/slog on ssd.
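
For reference, wiring an SSD in as SLOG or L2ARC is a one-liner each, and ZoL 0.8 added TRIM. A sketch; pool and device names are made up:

    zpool add tank log nvme0n1p1      # SLOG (sync write log, "ZIL device")
    zpool add tank cache nvme0n1p2    # L2ARC read cache
    zpool set autotrim=on tank        # or one-shot: zpool trim tank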

> ZFS partitions are also almost impossible to resize

You can grow zfs partitions just fine (and even online expand). You just can't shrink them.


> You just can't shrink them.

That's not even entirely true, though it requires shuffling around with multiple vdevs temporarily and doesn't presently support raidz. Also, vdev removal is primarily made to support an accidental "oops, I added a disk I shouldn't have" rather than removing a long-lived device -- there's no technical restriction against the latter case, though the redirect references could hamper performance.

The official stance has always been to send/receive to significantly change a pool's geometry where it isn't possible online.


yup, true enough. You can accomplish great things with a combination of zfs send/receive and time. ;)


I've just last week used btrfs shrink to upgrade to a newer Fedora after making a minimal backup. Very useful for my purposes... I don't plan to look at ZFS until it's in the mainline kernel. Having any Linux install media usable as a rescue disk is very handy.


> ZFS partitions are also almost impossible to resize

I'm not sure you've actually used ZFS very much, because any way I can read this, it is actually pretty straightforward and simple to resize partitions with ZFS pools and volumes within ZFS pools.

For example, if you mean that you have a root zpool on a device using only half the device, you just have to resize the partition and then turn on `autoexpand` for the pool.


We are talking about something resembling adding an extra disk to a RAID5. That can easily be done with mdadm RAID, and then you just need to resize LVM, or whatever you run on top of it. It cannot be done in ZFS, not in its RAID5/6 (raidz) mode.


You're confusing extending vdevs with extending pools and stripes.

It's kind of apples to oranges, really.

FreeNAS documentation[0] makes it pretty clear.

In ZFS, you cannot add devices to a vdev after it has been created -- however, you CAN add more vdevs to a pool.

So basically, your complaint is that ZFS wants to have stripes of vdevs and that instead of adding 1 drive to a 3 drive RAID5 to make a 4 drive RAID5, you have to add 3 drives to a RAIDZ1 for a 6 drive RAIDZ+0 that is equivalent to a RAID50 on a hardware controller.

Yes, it's more enterprisey, but it's not especially more difficult and the result is different and perhaps better depending on your use case.
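
In commands, the difference looks roughly like this (device names made up):

    zpool create tank raidz1 da0 da1 da2    # a pool with one 3-disk RAIDZ1 vdev
    # you can't grow that vdev to 4 disks, but you can stripe another vdev in:
    zpool add tank raidz1 da3 da4 da5       # now effectively a "RAID50"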


"ZFS hasn't kept up with SSDs."

What does that mean?


ZFS (or at least ZoL) doesn't scale well to NVMEs:

https://github.com/zfsonlinux/zfs/issues/8381


Probably something about TRIM.


That they've never used SSDs for a ZIL or a zpool, I would wager.


He is wrong. He's focused on performance; people use ZFS for its features, not its performance.


At work we used ZFS with snapshots for a container build machine, for performance reasons. We had some edge cases that made the Docker copy-on-write filesystem unsuitable.
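
Presumably something along these lines (a sketch; dataset names are made up) - clones are instant and share blocks with the base image:

    zfs snapshot tank/images/base@v1
    zfs clone tank/images/base@v1 tank/builds/job-1234   # per-build working copy
    # ... run the build in the clone ...
    zfs destroy tank/builds/job-1234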


zfsonlinux added support for TRIM last year. Are you referring to something else?


"last year" means it's in very few distributions at this time. Encryption is another feature that is technically supported, but just been added. When I built my NAS last year, I had to use dm-crypt because zfs didn't have it. Some features indeed lag pretty badly in zfs


Is the first party that is Oracle making any efforts to develop zfs further at this point? Is ZoL the primary development team at this time?


OpenZFS/ZFS on Linux diverged after version 28. They now use version 5000, and use feature flags instead of version numbers to allow different implementations to add their own improvements. Oracle has added some features since then as well, notably: encryption, large blocks, resilvering performance improvements, and device removal. All of these features have also been implemented in ZFS On Linux.

https://en.wikipedia.org/wiki/ZFS#Detailed_release_history

http://www.open-zfs.org/wiki/Feature_Flags


I don't blame Linus, but I use ZFS a lot.

I'll drop ZFS the moment I have an alternative with the same features:

- disk management with simple commands that can create raids in any modern configuration

- zero cost snapshots

- import/export (zfs send/recv)

- COW and other data integrity niceties

- compression, encryption, dedup, checksums
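
For reference, roughly what that feature list looks like day to day (a sketch; pool/dataset names are made up):

    zpool create tank mirror sda sdb mirror sdc sdd      # striped mirrors
    zfs set compression=lz4 tank
    zfs create -o encryption=on -o keyformat=passphrase tank/private
    zfs snapshot -r tank@nightly                         # zero-cost snapshot
    zfs send -R tank@nightly | zfs recv -d backup        # export/import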

I am very grateful to the OpenZFS community, and I think they deserve praises for their work. Saying the code is not maintained is quite unfair.


> Saying the code is not maintained is quite unfair

IMO it's best viewed as a legal positioning of professing ignorance such that Oracle's thugs don't go after him for "copying" ZFS features, knowingly developing software that will be mixed with CDDL code, etc.

It's similar to how it's not a good idea for an engineer to read patents.


So why drop ZFS then?


I don't want to drop ZFS for technical reasons, but I do share some of Linus's concerns about licensing.


He mentioned that he didn't think it was being maintained. It's more or less been forked, no?

Has Linus not seen the work that the OpenZFS folks are doing?

ZFS is amazing and I would sooner go to a BSD flavor with a fun set of userland utilities than give it up.


> He mentioned that he didn’t think it was being maintained.

This news would come as a surprise to the folks at LLNL who work on ZoL:

* https://github.com/zfsonlinux/zfs

* https://zfsonlinux.org/


Maybe meant not maintained by the first party?


ZoL is now the first-party for open sourced ZFS.


>Has Linus not seen the work that the OpenZFS folks are doing?

That's what he meant by Oracle licensing issues: the Java API infringement case against Google.


He also feels ZFS "was always more of a buzzword than anything else". Yikes.


Honestly, I wouldn't bash him for this comment. Not everyone runs a 10+ TB array at their home for storage and backup purposes.

ZFS doesn't primarily target single disks and small arrays anyway. :)


ZFS worked wonders for me on very small servers (appliances) with SSDs that were forced to operate in remote areas on unstable power supplies - where other file systems were dropping bytes and bricking them.


It's great on small disks. Using ZFS root on Solaris 11 in my day job, I can tell you it makes management a lot easier. Patching and rollbacks are like eating a nice dessert.


people probably will, in a few years.

rotational disks are getting cheaper and cheaper; 10TB disks in two years might cost as little as 2TB disks do today (I got a 2TB disk for like 50€ off Amazon).


> people probably will, in a few years.

Yes, but without the array, as you stated. We have 300+ 10TB disks at our datacenter today, and ZFS is relevant at this disk count, I/O and client load.

Running ZFS at small scale is like raising a cow at home for a bucket of raw milk. It's more of a fun curiosity than a production-level operation.

I'd run LVM or md or something similar at home instead of a full blown ZFS setup for practical reasons.


I feel ZFS is much better and easier than md or LVM - at least where it's properly supported (I have never tried ZoL).

CoW and cheap snapshots are game-changers, and checksums as well, though maybe not from a practicality and home-user standpoint. This holds just as well for PB-scale storage as for a 512 GB OS drive or a 2 GB thumb drive (not that I would use ZFS on a thumb drive - again because of the lack of proper support across different OSes).


Checksums are amazing for when you do have a problem, because a scrub will tell you what you lost. Knowing what's been damaged is practically more important than actually fixing it, and ZFS is great at this.

All the Linux alternatives' answers to this problem are always "is your data okay? Don't know! It'll be a surprise when you get there".
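
Concretely, the "what did I lose" part is just (a sketch; pool name assumed):

    zpool scrub tank
    zpool status -v tank   # lists any files with permanent (unrecoverable) errors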


I know, and personally that is very important for me.

But in practice, it is likely to be less than once in a decade problem - and you should have backups anyway.

So I can understand someone having different priorities. (not me though, data integrity is as important as it gets, I'd gladly pay performance/money for it)


What about performance? Is ZFS in the same ballpark as an equivalent (data-protection-wise) 'md' layout?


There are ways to improve the performance, but the copy-on-write architecture does come with performance taxes in my experience. The trade-off is a much richer experience than a simple ext4 partition, for example.


I think ZFS - or at least the set of features ZFS provides - is relevant at any size or disk count all the way down to a single disk in a laptop. I've previously run ZFS on single block devices, though nowadays all my personal machines use at least ZFS mirroring. Without redundancy it can't recover from damage on its own, but checksums and free snapshots are irreplaceable to me.

It doesn't have to be ZFS in particular, I'll gladly switch my Linux systems over once a proper alternative is in the kernel. But right now it's the only working, mature solution. Bcachefs isn't ready yet and BTRFS isn't trustworthy.


Yeah, if there were a similar GPL-blessed effort with most if not all of the main features of ZFS that was also robust and trustable (which likely takes years of production use), I would be all for it. Projects like Red Hat's Stratis might fill this gap. I'm not a ZFS zealot, I just love what it provides.


LVM/mdraid configuration is much more daunting than ZFS on BSD or Illumos, especially given the availability of things like FreeNAS.


> Running ZFS at small scale is raising a cow at home for a bucket of raw milk. It's more of a fun curiosity rather than a production level operation.

Well on one hand this is true, but on the other hand...

If you're running more than one disk at home, maybe you're some kind of enthusiast (homelabber?) and willing to put some effort into it. Under this scenario, the same amount of time spent learning ZFS yields better results vs LVM/mdadm.


The risk of bit rot is still a thing at home or in the data center. And the other niceties of ZFS, like snapshots and such, are a boon too. Instead of a few various layers you have one whole subsystem to do all of it - all of which you know well from using it in the data center. I just use it at home too.


OpenZFS would still be in danger of a potential Oracle lawsuit.


For what? It's under the CDDL.


API Infringement.


That's not even a thing (yet).

ZFS was freely relicensed under the CDDL by Sun. Oracle can do nothing to take back any of the rights granted under the terms of the licence retrospectively. They haven't got any grounds whatsoever to curtail anyone's use or modification of the ZFS code.


It's been a thing for a decade[1] now. If you don't have a Google-sized team of lawyers handy it's a concern. Fingers crossed that Oracle loses in the end.

[1] https://en.wikipedia.org/wiki/Google_v._Oracle_America


Yes, but it won't be an actual thing to worry about until there's a legal precedent set. Right now, without any conclusions from the trial, it's not a problem.


Isn't that beside the point? OpenZFS is still CDDL.


I think it is beside the point for the risks of merging anything into Linux. He's right on that topic, of course.

But it is a separate point he made about using zfs in general, and it's certainly not correct if you take one look at the activity in the zfsonlinux project on GitHub.


It's pretty clear that Linus simply doesn't have a clue about ZFS, and he just exposed himself as somebody who repeats stuff he read in some Linux forum or something.

There is no way, after any technical evaluation by himself he would come to those conclusions.


Remember kids: Oracle has no customers, only hostages!


Amen to that


Switching to FreeBSD now for my storage. I have an 8TB database setup. The primary DB runs on LVM(cache)/XFS, which gives very satisfying speed, but I really love my secondary mirror DB, whose storage is on ZFS. I do daily snapshots and daily incremental backups via send/recv to another ZFS location. No other FS I am aware of provides this functionality this easily. Linus seems to have never used ZFS. Although I can understand his issues with the licensing, he is ranting about ZFS, and that's a shame.
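
That daily cycle boils down to something like this (a sketch; dataset, host and snapshot names are made up):

    zfs snapshot tank/db@2020-01-09
    zfs send -i tank/db@2020-01-08 tank/db@2020-01-09 | \
        ssh backuphost zfs recv -u backup/db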


Alright, Linus, I'll make you a deal: I'll consider dropping ZFS when you ship a production-grade BTRFS (or reiserfs or anything else with the same features).


Is reiserfs still maintained? After Hans went to prison I didn't think there was much left beyond stagnation.


I think his parole hearing is coming up this year...


They really needed to rename it.


Hans is eligible for parole this month.


He still murdered a person. He is quite literally the worst sort of domestic abuser. The kind who kills their partner.


I'm no expert but it seems unlikely that he'd get parole on his first chance. I don't think that is common for murder.


rename to parolefs


bcachefs is aiming to replace btrfs.


Yep :) I'd forgotten about that one, but when it gets merged I will have to seriously consider it! Unfortunately, that's probably years out, so I'm stuck on ZFS for now. I also have some portability concerns (ZFS works on FreeBSD, NetBSD, illumos, and Linux, and this was a selling point for me), but I'll probably get over it or at least mostly switch to bcachefs when it goes mainline.


Do you know that FreeBSD is actually a usable modern OS these days? ;) I run a FreeBSD desktop with Nvidia drivers without any issues. OpenJDK works great, among other things. And it supports ZFS natively, and root on ZFS is the default installation option.


Unfortunately some popular software (like Docker) doesn't work (afaik?) on FreeBSD which might hold a lot of people back.


ZFS functionality dwarfs the minor issues Linus has with it in my opinion. I find it to be well maintained, and not just bug fixes but new features keep being added as well. If I couldn't use ZFS on linux anymore, I wouldn't hesitate to setup another system just so I could keep using ZFS.


Probably should add Java to that list. The sooner Oracle stops existing, the better.


OpenJDK is on pretty solid legal ground, no?


Haha, OpenJDK is wholly owned by Oracle. There is no separate legal entity. License wise it's GPL w/ CPE.


Plus the OpenJDK Community TCK Licensing Agreement.

No-one's seriously concerned that adopting OpenJDK could land them in legal trouble, as far as I know. I don't think a separate legal entity is always necessary.


Well, depends what you mean by solid legal ground then.

Nobody is / should be seriously concerned by using the open source OpenZFS modules with their Linux distro of choice.


> depends what you mean by solid legal ground then

I'm referring to having confidence that using OpenJDK, in the absence of any licensing agreement with Oracle, will not land your company in legal trouble.

I'm not seeing an ambiguity in my use of solid legal ground.

> Nobody is / should be seriously concerned by using the open source OpenZFS modules with their Linux distro of choice.

Linus isn't convinced that it's legally safe to do this, and neither are some people in this thread, myself included.

It may be true that even Oracle are unlikely to go after you for using ZFS from Linux, but if it's not safe enough for Linus, it's probably not safe enough for the legal departments of large companies.


No Java in the kernel!


The BPF has bytecode and a JIT. The JVM has a very good JIT. Let's add Java to the kernel and see!


Google and Microsoft too!


Microsoft and Google contribute more to open source than any other company.

Source: https://www.techrepublic.com/google-amp/article/microsoft-ma... https://techcrunch.com/2019/01/17/google-remains-the-top-ope...


"Oracle's litigious nature". What a beautifully short and concise phrase. I immediately had to write this down and stash it as an argument for the next time someone at work pushes to go for "Oracle $product" after having received a bottle of wine from their sales team.


Honestly, at this point, if I can't get ZFS in Linux I would move to FreeBSD whenever I need a big filesystem. How does the Linux® Binary Compatibility layer work on FreeBSD?


It implements the x86 and x86_64 Linux system call ABI. Linux ELF binaries get vectored to an alternate system call table implemented by the compatibility layer. There are some other components like an implementation of a Linux-compatible procfs. How well it works in practice really depends on how far off the beaten path you go. There are lots of non-essential pieces that are not implemented, but for example I know of people running Steam on FreeBSD.
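
For the curious, enabling it is something like the following on a recent FreeBSD release (the Linux userland package name varies by release, so treat these as an illustrative sketch rather than exact instructions):

  # load the 64-bit Linux ABI module and enable it at boot
  kldload linux64
  sysrc linux_enable="YES"

  # mount linprocfs/linsysfs and friends via the rc script
  service linux start

  # optionally install a Linux userland base (CentOS 7 based here)
  pkg install linux_base-c7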


I've run Oracle JRE and OpenJDK with it and both work OK, as do some BlackBerry SDK tools built for Linux. I'm sure there are some rough edges, and I don't know about performance, but once I mounted the appropriate filesystems, things were working, and that was good enough for me. I think you do have to pick between a current release of FreeBSD with 64-bit Linux binaries or an older release of FreeBSD with 32-bit Linux binaries, with no way to support both sizes on the same host; but I might be misremembering that.


The main thing I want is for OnlyOffice or Collabora to work on FreeBSD in some capacity and I haven't been able to do it (both have open issues that receive very little attention). I want to run my self hosted office solution on the same machine as the data, and I'd really rather avoid VMs.

So, I use Linux because Docker and BTRFS work just fine for my use case. I prefer FreeBSD, but unfortunately I'm unable to solve my problems easily with just FreeBSD, so I'm using something else.


Linus is correct in his arguments. As the lead of the Linux project, he shouldn't merge things that he feels aren't up to snuff from a license point of view.

That's why we have different software.

If you asked Theo to merge an encryption algorithm for example into OpenSSH and OpenBSD - he's going to have an opinion about it - and that's his thing.

Why would this be controversial at all?


Because people like ZFS and Linux so they want to combine the two.


Then they can go ahead and combine the two! ZFS on linux has a whole team of maintainers.

Linus can do what he wants in regards to his branch (which because he's the lead, becomes official Linux), but there's no reason any one else (or any distro) can't do the integration. That's how open source works!

Of course, whoever does the integration may incur Oracle's wrath. Tread at your own discretion. If those people like it so much that they will put up their own money when Oracle's lawyers come calling, that's completely up to them.

In my opinion, people who constantly clamour for such things against the technical judgment of open source maintainers are freeloaders. They can propose ideas, but just because the maintainer doesn't want to do it doesn't mean they can scream bloody murder. Just put up your own money and fork it and/or maintain your own fork, which is exactly what the ZFS on linux community is doing - which is the right thing.

Anyone else can work with the ZFS on linux maintainers to take a bit of the burden on, whether it's rebasing or updating docs on how the integration works, etc. It's a group effort.


It's too bad. ZFS is amazing and so ridiculously simple to use and manage if you can find the recent docs among all the old docs online.

Anyone using Stratis? Just noticed it recently went to 2.0. https://en.wikipedia.org/wiki/Stratis_(configuration_daemon) Curious to know how it handles device failures, removals, additions and so on.


Polite reminder that Kent Overstreet is still plugging away at a new copy-on-write FS for Linux called bcachefs. One day, I hope it'll replace ZFS for my uses.

I'm not involved with the project in any way, apart from sending him a few bucks a month on patreon. It's literally the only open source thing I sponsor; it seems like a really worthwhile effort especially considering Linus' advice here...


Perhaps if one wanted to use ZFS they should just use a kernel that supports it?

ZFS is certainly "nice to have" on desktop, but the main use case is going to be servers and NAS. You can use BSD there, it won't bite.


Or do use ZFS, just know that Oracle sucks and you have to jump through hoops because of it...

Also, while ZFS for me has been performant, that seems like a silly reason to decide to use it or not use it. I think ZFS pools and snapshots would be among the deciding factors to use it or not.

FWIW, as some other commenters have said, I'd rather drop Linux than drop ZFS. I'm actually only even running Linux on my home server right now because I decided to try Proxmox out on it months ago and it was soooo obscenely easy to install I haven't bothered to reset it yet (though I need to for various reasons; Proxmox itself being the first, ha).

Really all I care about for my host OS these days is the ability to do virtualization and GPU pass-through... Linux is an option but not the only one. Having a robust storage system where drives can fail, appear as one logical volume, are replicated, and have snapshots (including RAM), and where I literally don't have to worry about it -- that's really only available with ZFS.
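
The pool-level part of that is a few commands, something like this (device and dataset names made up for illustration):

  # create a mirrored pool from two whole disks
  zpool create tank mirror /dev/sda /dev/sdb

  # replace a failed disk; the pool keeps serving data while it resilvers
  zpool replace tank /dev/sda /dev/sdc
  zpool status tank

  # cheap point-in-time snapshot of a dataset
  zfs snapshot tank/vms@before-upgrade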


Oracle has not actually done anything yet. It's the lack of belief in the license.


Yet another user testimony: For simple volumes and snapshots at home I switched from ZFS to BTRFS because ZFS on Fedora was giving me too many issues. It simply wasn't integrated well enough into the system. Had nothing to do with the ZFS FS implementation, merely the packaging.

Either way, BTRFS works for everything I need it to do and it's native.


More technical version of the same discussion here: https://lore.kernel.org/lkml/CAB9dFdsZb-sZixeOzrt8F50h1pnUK2...


Typical Linus bullshit ...

"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me."

Yeah, just use XFS or ext4 without ANY data consistency ... you could also use FAT32; it's a filesystem on a similar level.


Until I have a viable alternative that gives me snapshotting (so I can make consistent backups), that advice is worthless to me.


Yeah. There's no decent replacement for ZFS. I use ZFS + KVM + Sanoid + Borg(matic) + Borgbase. With the native encryption and TRIM support added to ZFS in 0.8 there isn't anything close in terms of ease of use.

Linus seems out of touch on this one IMHO.


XFS on an LVM thin-pool thin LV gives you fast CoW snapshots and is rock solid. Really, try it. :)
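
A minimal sketch of that setup, assuming an existing volume group and made-up names:

  # carve a thin pool out of volume group "vg0"
  lvcreate --type thin-pool -L 500G -n pool0 vg0

  # create a thin LV and put XFS on it
  lvcreate -V 200G --thinpool vg0/pool0 -n data vg0
  mkfs.xfs /dev/vg0/data

  # thin snapshots are CoW and need no preallocated space
  lvcreate -s -n data_snap vg0/data
  lvchange -ay -K vg0/data_snap   # thin snapshots skip activation by default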


I don't remember the details, but when I looked into switching to LVM snapshots, I ran into some sort of blocker.

My use-case is that I run Sandstorm, and want to be able to back it up while it's running. That means:

  - Ensure there aren't any existing snapshots
  - Take a snapshot
  - Mount the snapshot as a filesystem
  - Run tarsnap against that filesystem
  - Release the snapshot
I think the trouble I ran into was at the mount step.
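
For what it's worth, a sketch of that loop with a classic (non-thin) LVM snapshot and made-up VG/LV names; if the origin filesystem is XFS, the snapshot carries the same UUID as the origin, so mounting needs nouuid, which is a common blocker at exactly that step (ext4 doesn't need it):

  lvcreate -s -L 10G -n sandstorm_snap vg0/sandstorm

  # nouuid only needed for XFS; harmless to omit for ext4
  mount -o ro,nouuid /dev/vg0/sandstorm_snap /mnt/snap

  tarsnap -c -f "sandstorm-$(date +%F)" /mnt/snap

  umount /mnt/snap
  lvremove -y vg0/sandstorm_snap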


At the application level, snapshotting is not by itself a way to do consistent backups. A consistent backup is a backup with a planned or known state when restoring.


Sure it is. Quiesce your application, take a snapshot, then resume application. Then you can back up the snapshot. The alternative can be a lengthy downtime for your application.


I can't do anything about partial file writes in the general case, but it's close enough—and any ACID databases should be able to restore from such a snapshot.


LVM snapshots have been good enough for consistent backups for the last 20 years. Then there are also thin snapshots if you feel fancy.


Maybe off topic, but I'm impressed by the rest of the conversation that generated that message from Linus: there is a whole thread in which Linus explains in detail various locking mechanisms in the kernel, their pros and cons, etc.

We don't normally see this happen; situations in which the technical leads for the Apple or MS kernels answer questions and explain things at this level of detail.

I also think that is even more interesting than the original blog post that started the thread. Someone should harvest all these Linus comments and organize them into some kind of "lectures" library.


His reasons may not be perfect but he's right.

If you use a filesystem that isn't mainline and it breaks, it's all on you to figure it out and fix it. Having used experimental filesystems before and been burned, I would rather stick with what I know and won't change overnight.

I'll keep an eye open for new filesystems, of course, but if it's not mainline, then unless it's for personal hacking, no.

> I own several SBCs (single-board computers) that do not run mainline kernels. The company that makes them provides its own patched kernel, so if it breaks I'm up the creek, but I know who to bitch at.


I kind of wish somebody with money would take Oracle to court over ZFS and establish that it doesn't carry Oracle taint, so we can move on. Java would be good too, but that fight went to the wrong point of law. OpenSolaris... I feel meh about it, but perhaps it needs this too.

Money doesn't solve all problems, but money can solve legal problems; cf. the lawsuits companies like Newegg file to get rid of the IP leeches. (And Cloudflare?)


So what would be a viable ZFS alternative on a Linux based NAS I'm planning to build in a few months? I'm currently running Nas4Free on a giant rack sized thing I built using a Mini-ITX Atom board years ago, and ZFS works like a charm, but I also intend to move some day to the ARM architecture, which unfortunately the *BSD based NAS software doesn't support (yet). I'm tempted by this smaller hardware in particular: https://wiki.kobol.io/helios64/intro/ So far the only viable option would be Openmediavault, which supports ZFS only through external modules, which I wouldn't be entirely comfortable with. I'd only arrange disks as RAID1 pairs, however.


I use BTRFS and it works fine. Just don't use RAID5/6 and you should be good to go.
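
A mirrored setup is only a few commands; for example, with made-up device names:

  # mirror both data and metadata across two disks
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
  mount /dev/sdb /srv/data

  # or convert an existing single-disk filesystem after adding a second device
  btrfs device add /dev/sdc /srv/data
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /srv/data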

I've also heard success stories using ZFS on Linux, but I haven't bothered because I'd just rather use something that's in the kernel instead of something outside it.


Are there any lawyers who can verify the legal claims being made? Anyone can file a lawsuit at any time for whatever reason; that doesn't mean the lawsuit is valid. Sure, using ZFS on your home system or even in a small deployment is not a big deal, but where there's a company with money, lawsuits happen.

Given the wide availability and usage of ZFS, and the fact that this is no longer 20 years ago when companies would sue as if they were 19th-century tycoons, I would think there must be some sort of "you let ZFS be used this long without suing; you can't leave it open for so long and then sue once it's profitable" statute in American law. Again, I'm not a lawyer, and I know enough about law to know I don't know enough about law.


To be honest, given Oracle's history, I'm not sure I would even trust a lawyer's opinion, since it could still cost you a lot of money to defend against an Oracle lawsuit even if you win.


Don't go too far, people. Linus's criticism of ZFS is concise: buzzword & licensing.

Here, Linus is putting the emphasis on the license, not on whatever technical details of ZFS. He clearly doesn't use ZFS and is not even interested in the problem ZFS solves. He is only "interested" in his (and the community's) control over the ZFS source code.

So, his logic basically becomes this:

ZFS is not mainline-able, so veto it until Oracle changes its attitude - a simple old FOSS infestation tactic.

So, please, move on, people. The discussion is not even about file system...


Actually, what stops mainline integration is the community's belief in the strength of open source licensing.


mdadm + LVM + ext4 does everything I need and more: thin and thick provisioning, SSD caching, snapshots. If I were to use another file system it would be something like Ceph or GlusterFS.
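
The stacking looks roughly like this, with hypothetical device and VG names (SSD caching can be layered on afterwards via lvmcache(7)):

  # RAID1 mirror out of two disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

  # LVM on top of the md device
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0

  # thin pool + overprovisioned thin volume, then ext4
  lvcreate --type thin-pool -L 400G -n pool0 vg0
  lvcreate -V 1T --thinpool vg0/pool0 -n home vg0
  mkfs.ext4 /dev/vg0/home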


> and given Oracle's interface copyright suits (see Java)

This seems pretty ironic, given that the whole problem is Linux developers trying to claim and enforce that only other GPL code is allowed to use its APIs (which, IMHO, goes beyond both the intent and letter of the license). The issues aren't exactly the same, but Linux sure seems to be a lot closer to Oracle than to Google here.


Oracle doesn't own all the rights to OpenZFS; people are intentionally adding their own copyrights to ensure it stays under a copyleft license, so Oracle can't make it closed source by owning the rights (as they did with the old version).


The title is misleading, tbh. I know it's a direct quote from the article, but it takes it out of the context of "I don't want to support your third-party code".


Nicely written, Linus: very soft and considerate of others' feelings. Not so funny anymore, but overall it will make more people happy :)


Both my main servers at my house use ZFS, neither use Linux on bare metal. FreeNAS (multiple ZFS mirrors) and SmartOS running a ZFS mirror.


I hope bcachefs will get to usable state soon.


The situation rules out Linux for many potential applications and side projects I might undertake.


Reminds me of Facebook and React licensing scandal back in 2016/17.

You can never trust someone who so easily changes their licenses from private to public. Who knows, in the future they might change it back to private!?


Yes, don't use ZFS! Use FreeBSD and ZFS!


The Linux kernel broke user space by egregiously decreeing that kernel modules cannot use certain CPU features unless they are GPL. Derived work, my ass.


So is Linus back from his break now?


linus mouthing off on something he doesn't care to learn about: news at eleven.


Honestly, ext4 is fine for most use cases, even on SSDs. If you really need more performance, look at HAMMER; it's meant for high availability. At that point you shouldn't be running Linux anyway; even with RT_PREEMPT it's not going to be the most performant for those kinds of RTOS workloads.


HAMMER2 is now the default on DragonflyBSD. If I made a bunch of money during the boom and could spend my days doing open source (like Matt Dillon), porting HAMMER2 might be one of the projects I'd pick up.


That would be a neat project. It’s hard to set aside the time when you don’t have much in the time bank. I’ve been wanting to do a lot more research on kernel scheduling, writing my own alternative scheduler, and more research on RTOS design and real-time computing in general.


So my take on this is that the future is Ceph, and you would do better running single-node Ceph than ZFS or BTRFS.


[flagged]


No personal attacks on HN, please. Maybe you don't owe Linus better (though why not?), but you owe the community better if you want to post here.

Also, "typical autistic savant type" breaks the site guideline against calling names. Please don't do that.

https://news.ycombinator.com/newsguidelines.html

Edit: you've unfortunately been doing this repeatedly:

https://news.ycombinator.com/item?id=21889883

https://news.ycombinator.com/item?id=21089837

Can you please not? Making an account to be anonymous on HN is fine in principle, but people sometimes start breaking the rules after they do so, and that is not cool.


Oracle rears its head again lolz


I'm glad Linus is now acting as legal counsel for Linux. It's scary that he is implying he's making these decisions without the aid of counsel.


ZFS threatens the power of Linux and therefore Linus’ job. That’s the long and short of it. Mac and Windows have been able to maintain stable interfaces for binary kernel drivers for 20 years.


i wouldn't use ZFS either. my guess is 90% of ZFS users have never run failure scenarios and grappled with potential failure modes of ZFS, nor even know that you really need ECC RAM to run ZFS without fear of existential data corruption due to bit flips.

furthermore, the allure of ZFS means people aren't testing their disaster plans until it's too late, bc ZFS is "resilient".

lastly, data recovery is expensive as all hell if even possible. i am talking order of magnitude four figures for 100s of GBs and sketchy probabilities.

ZFS is the ultimate "pet" in the pets vs. cattle continuum. in a world where shoddy engineering and "break things fast" is the zeitgeist, i'm happy to use a classic dumb FS like ext4 and pathologically backing it up and testing said backups.

i would not risk any of my personal treasured data to ZFS due to inherent existential threats. i would implore ZFS users to evaluate and test their setups, and especially use ECC RAM - like, starting now - to protect their assets.


> you really need ECC RAM to run ZFS

This is FUD. ZFS does as well as, if not better than, the average file system, with its focus on integrity, online scrubs, etc. On the other hand, "use ECC RAM" is standard best practice for any mission-critical data; no file system magic is going to fix computer RAM lying to you 100% of the time. It's the standard recommendation for ZFS because ZFS is rarely deployed in environments that can tolerate data corruption.

> pathologically backing it up and testing said backups.

ZFS doesn't remove the need for backups, and no one seriously makes that argument. Snapshots + send/receive do make backups very easy to do in ZFS, though.
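
The online scrub mentioned above is a single command (pool name made up):

  # walks the pool, verifies every block's checksum, and repairs
  # from redundancy where a good copy exists
  zpool scrub tank
  zpool status -v tank   # shows scrub progress and any errors found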


I've detected broken memory chips thanks to BTRFS checksumming finding errors, luckily before it had a chance to corrupt any written data. So if anything, a properly checksummed filesystem makes non-ECC RAM less dangerous.
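
For anyone who wants to check their own filesystem, something along these lines (mount point made up) surfaces those errors:

  # re-read everything and verify checksums against the metadata
  btrfs scrub start /mnt/data
  btrfs scrub status /mnt/data

  # per-device error counters (read/write/corruption) accumulated so far
  btrfs device stats /mnt/data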


> ZFS is the ultimate "pet" in the pets vs. cattle continuum. in a world where shoddy engineering and "break things fast" is the zeitgeist,

Live storage is never 'cattle'; your running filesystem IS, in fact, a pet. Hard drives are the 'cattle', and that's exactly what ZFS treats as cattle.

ZFS was born out of long frustration with file systems and was systematically designed to protect against data corruption and bad hardware. It is literally the exact opposite of 'move fast and break things'.

Go and actually watch the videos where the designers show it for the first time. They speak very clearly about how and why they designed it.

> i would not risk any of my personal treasured data to ZFS due to inherent existential threats. i would implore ZFS users to evaluate and test their setups, and especially use ECC RAM - like, starting now - to protect their assets.

ZFS has always recommended ECC to its users. No filesystem can protect you from not having it.


> you really need ECC RAM to run ZFS without fear of existential data corruption due to bit flips

https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=...

http://www.open-zfs.org/wiki/User:Mahrens


The problem is that ZFS doesn't have an offline repair tool. A (granted, unlikely) bit flip in an important data structure that gets written to disk makes the whole fs unmountable, and that's it (I don't know whether it has a tool to rescue file data from an unmountable pool? Maybe we should ask Gandi...).

With e.g. ext4 you can get back to a mountable state pretty much guaranteed with e2fsck. You might lose a few files, or find them in lost+found, etc., but at least you have something.

The reason ZFS doesn't have an offline repair tool is pretty convincing. Once you have zettabytes (that's the marketing) of data, running that repair tool would take too long, so you have to do everything to prevent needing it in the first place: checksumming everything, storing everything redundantly, and using ECC RAM.


AFAIK it stores multiple copies of those important data structures though, so it should take more than a single bit flip.
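
Right, metadata automatically gets extra "ditto" copies. You can also raise the redundancy of data blocks per dataset, even on a single disk; a small example with a made-up dataset name (only affects newly written data):

  # keep two copies of every data block in this dataset
  zfs set copies=2 tank/important
  zfs get copies tank/important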


So better to use ext4 and let it silently corrupt your data?

ZFS does indeed catch memory errors. If you are running without ECC, most filesystems will happily write that corrupt data to disk. Unless the corruption is in the metadata, you will be none the wiser.


ZFS has seen me through 6 disk failures since I started using it on Nexenta about 10 years ago; zero data loss.

It's not a backup by itself, but it makes a fine backup target if it's located somewhere else, since it's both redundant (hard to lose data by accident) and snapshotted (hard to lose data by mistake) - it was my local CrashPlan target (alongside cloud) back when CrashPlan supported home users.


So, is Linus pretty much the same guy? I know he took some time off, the kernel team adopted a code of conduct, and he sent that introspective e-mail ... but now that he's back ... is it any different?


Because he didn't attack any individual, use derogatory language, or break the code of conduct?

He never agreed to roll over and agree with every technological persuasion, he agreed to be nicer to people. This was nice to people, but rude to a technology (ZFS), that seems consistent.


This post and a few related ones in the chain seem perfectly fine to me. Does something here seem offensive or harsh to you?


Looking at reddit.com/r/linusrants ...

Most of his worst rants were directed at maintainers who committed changes that resulted in bugs in the kernel.

The OP of this thread is from a user, so perhaps we need to wait until another big bug gets committed to see the results of his hiatus.

