Linus: Don't Use ZFS (realworldtech.com)
572 points by rbanffy on Jan 9, 2020 | 555 comments



Here's his reasoning:

"honestly, there is no way I can merge any of the ZFS efforts until I get an official letter from Oracle that is signed by their main legal counsel or preferably by Larry Ellison himself that says that yes, it's ok to do so and treat the end result as GPL'd.

Other people think it can be ok to merge ZFS code into the kernel and that the module interface makes it ok, and that's their decision. But considering Oracle's litigious nature, and the questions over licensing, there's no way I can feel safe in ever doing so.

And I'm not at all interested in some "ZFS shim layer" thing either that some people seem to think would isolate the two projects. That adds no value to our side, and given Oracle's interface copyright suits (see Java), I don't think it's any real licensing win either."


Btrfs crashed for me on two occasions. The last time, around 2 years back, I installed ZFS (which I have been using for ~10 years on a FreeBSD server), and it has worked like a charm since then.

I understand Linus' reasoning, but there is just no way I will install btrfs, like ever. I'd rather not update the kernel (I have ZFS on a Fedora root with regular kernel updates and scripts which verify that everything is fine with the kernel modules prior to reboot) than use a file system that crashed twice in two years.

Yes, it is very annoying if an update crashes the fs, but currently:

- in 2 years, btrfs crashed itself twice

- in the following 2 years, an update never broke zfs

As far as I am concerned, the case for zfs is clear.

This might be helpful to someone: https://www.csparks.com/BootFedoraZFS/index.md

Anyway, Linus is going too far with his GPL agenda. The MODULE_LICENSE requirement when writing kernel modules explains why hardware is less supported on Linux - instead of devs focusing on getting more support from 3rd party companies, they try to force them to go GPL. Once you set MODULE_LICENSE to non-GPL, you quickly figure out that you can't use most of the kernel calls. Not the code. The calls.
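To make that concrete, here is a minimal sketch of an out-of-tree module (the module name and messages are placeholders). The point is just where MODULE_LICENSE sits and what it gates: any string the kernel doesn't recognise as GPL-compatible (e.g. "CDDL" or "Proprietary") taints the kernel and cannot link against symbols exported with EXPORT_SYMBOL_GPL().

    /* hello_mod.c - minimal out-of-tree module sketch (names are placeholders) */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    static int __init hello_init(void)
    {
            pr_info("hello_mod loaded\n");
            return 0;
    }

    static void __exit hello_exit(void)
    {
            pr_info("hello_mod unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

    /* "GPL" resolves every exported symbol; a non-GPL-compatible string
     * (e.g. "CDDL") taints the kernel and blocks EXPORT_SYMBOL_GPL() symbols. */
    MODULE_LICENSE("GPL");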


The Linux kernel has been released under GPL2 license since day 1, and I don't think that's ever going to change. Linus is more pragmatic than many of his detractors think - he thankfully refused to migrate to GPL3 because the stricter clauses would have scared away a lot of for-profit users and contributors.

Relaxing to anything more permissive than GPL2 would instead mean the end of Linux as we know it. A more permissive license means that nothing would prevent Google or Microsoft from releasing their own closed-source Linux, or replacing the source code of most of the modules with hex blobs.

I believe that GPL2 is a good trade-off for a project like Linux, and it's good that we don't compromise on anything less than that.

Even though I agree on the superiority of ZFS for many applications, I think that the blame for the missed inclusion in the kernel is on Oracle's side. The lesson learned from NTFS should be that if a filesystem is good and people want to use it, then you should make sure that the drivers for that filesystem are as widely available as possible. If you don't do it, then someone sooner or later will reverse engineer the filesystem anyway. The success of a filesystem is measured by the number of servers that use it, not by the amount of money that you can make out of it. For once Oracle should act more like a tech company and less like a legal firm specialised in patent exploitation.


The blame is on Oracle's side for sure. No question about it.

> or replacing the source code of most of the modules with hex blobs.

OK, good point. I am no longer pissed off about MODULE_LICENSE; I hadn't even thought about that.


I agree with the stance on btrfs. Around the same time (2 years back), it crashed on me while I was trying to use it for an external hard disk attached to a Raspberry Pi. Nothing fancy. Since then, I can't tolerate fs crashes; for a user, it's supposed to be one of the most reliable layers.


Concerning the BTRFS fs:

I did use it as well many years ago (probably around 2012-2015) in a raid5 configuration, after reading a lot of positive comments about this next-gen fs => after a few weeks my raid started falling apart (while performing normal operations!) as I got all kinds of weird problems => my conclusion was that the raid was corrupt and couldn't be fixed => no big problem, as I did have a backup, but that definitely ruined my initial BTRFS experience. During those times, even though the fs was new and even though there were warnings about it (being new), everybody was very optimistic/positive about it, but in my case that experiment was a disaster.

That event has held me back until today from trying it again. I admit that today it might be a lot better than in the past, but since people were already positive about it back then (and in my case it still broke), it's difficult for me now to say "aha - now the general positive opinion is probably more realistic than in the past", due e.g. to that bug that can potentially still destroy a raid (the "write hole" bug): personally I think that if BTRFS still makes that raid functionality available while it has such a big bug, and at the same time advertises it as a great feature of the fs, the "unrealistically positive" behaviour is still present, therefore I still cannot trust it. Additionally, that bug being open since forever makes me think that it's really hard to fix, which in turn makes me think that the foundation and/or code of BTRFS is bad (which is the reason why that bug cannot be fixed quickly) and that therefore potentially in the future some even more complicated bugs might show up.

Concerning alternatives:

For a looong time I have been writing and testing a program which ends up creating a big database (using "Yandex ClickHouse" for the main DB) distributed on multiple hosts, where each one uses multiple HDDs to save the data, and which at the same time is able to fight against potential "bitrot" ( https://en.wikipedia.org/wiki/Data_degradation ) without having to resync the whole local storage each time a byte on some HDD loses its value. Excluding BTRFS, the only other candidate I found that performs checksums on data is ZFSoL (both XFS and NILFS2 do checksums, but only on metadata).

Excluding BTRFS because of the reasons mentioned above, I was left only with ZFS.

I've now been using ZFSoL for a couple of months, and so far everything has gone very well (a bit difficult to understand & deal with at the beginning, but extremely flexible), and performance is good as well (but to be fair, that's easy in combination with the ClickHouse DB, as the DB itself already writes data in a CoW way, therefore blocks of a table stored on ZFS are always very likely to be contiguous).

On one hand, technically, I'm happy now. On the other hand, I do admit that the problems around licensing and the non-integration of ZFSoL in the kernel carry risks. Unfortunately, I just don't see any alternative.

I do donate monthly to https://www.patreon.com/bcachefs but I don't have high hopes - not much is happening, and BCACHE (even though it is currently integrated in the kernel) hasn't been very good in my experience (https://github.com/akiradeveloper/dm-writeboost worked A LOT better, but I'm not using it anymore as I no longer have a use case for it, and it was a risk as well since it's not yet included in the kernel), therefore BCACHEFS might end up being the same.

Bah :(


I'd avoid making an argument for or against a filesystem on the basis of anecdotal evidence.


For your own personal use, your own personal anecdotes are really all that matter.


Your personal anecdotes are indeed all that matter when it comes to describing your past.

When it comes to predicting your future, though, your personal anecdotes may not hold up against more substantial data.


Btrfs, like OCFS, is pretty much junk. You can do everything you need to on a local disk with XFS, and if you need clever features, buy a NetApp.


Both ZFS and BTRFS are essentially Oracle now. BTRFS was an effort, largely from Oracle, to copy Sun's ZFS advantages in a crappy way, which became moot once they acquired Sun. ZFS also requires (a lot of) ECC memory for reliable operation. It's great tech; pity it's dying a slow death.


I'd argue that other file systems also require ECC RAM to maximize reliability. ZFS just makes it much more explicit in its docs and surfaces errors rather than silently handing back memory-corrupted data.


ZFS needs ECC just as much as any other file system. That is, it has no way of detecting in-memory errors. So if you want your data to actually be written correctly, it's a good idea to use ECC. But the myth that you "need" ECC with ZFS is completely wrong. It would be better if you did have ECC, but don't let that stop you from using ZFS.

As far as it needing a lot of memory, that is also not true. The ARC will use your memory if it's available, because it's available! You paid good money for it, so why not actually use it to make things faster?


I worked at Sun when ZFS was "invented", and the emphasis on a large amount of proper ECC memory was strong, especially in conjunction with Solaris Zones. I can't recall if it was 1 GB of RAM per 1 TB of storage or something similar, due to how it performed deduplication and stored indices in hot memory. And that was also the reason for insisting on ECC: to make sure you won't get your stored indices and shared blocks messed up, leading to major uncorrectable errors.


I can see how a (perhaps, less than competitive) hardware company would want you to think that :)


Sure, all about internal marketing, right? :D

But there was nothing like that on the market at that time anyway.


I have examined all the counterarguments against ZFS myself and none of them have been confirmed. ZFS is stable and not RAM-hungry, as is constantly claimed. It has sensible defaults, namely to use all RAM that is available and to release it quickly when it is needed elsewhere. ZFS on a Raspberry Pi? No problem. I myself have a dual-socket, 24-core Intel server with 128 GB RAM and a virtual Windows SQL Server instance running on it. For fun, I limited the amount of RAM for ZFS to 40 MB. Runs without problems.
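For reference, capping the ARC like that is just the zfs_arc_max module parameter on ZFS on Linux (value in bytes). A sketch - the 40 MB figure is from the experiment above, adjust to taste:

    # /etc/modprobe.d/zfs.conf - cap the ARC at module load (value in bytes)
    options zfs zfs_arc_max=41943040    # ~40 MB, as in the experiment above

    # or on a running system:
    # echo 41943040 > /sys/module/zfs/parameters/zfs_arc_max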


That's his reasoning for not merging ZFS code, not for generally avoiding ZFS.


Here are his reasons for generally avoiding ZFS from what I consider most important to least.

- The kernel team may break it at any time, and won't care if they do.

- It doesn't seem to be well-maintained.

- Performance is not that great compared to the alternatives.

- Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)


I'm baffled by such arguments.

> It doesn't seem to be well-maintained.

The last commit is from 3 hours ago: https://github.com/zfsonlinux/zfs/commits/master. They have dozens of commits per month. The last minor release, 0.8, brought significant improvements (my favorite: FS-level encryption).

Or maybe this refers to the (initial) 5.0 kernel incompatibility? That wasn't the ZFS dev team's fault.

> Performance is not that great compared to the alternatives.

There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).

> The kernel team may break it at any time, and won't care if they do.

That's true; however, the amount of breakage is no different from any other out-of-tree module, and it's unlikely to happen with a patch version of a working kernel (in fact, it happened with the 5.0 release).

> Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)

"Using" it won't open to lawsuits; ZFS has a CDDL license, which is a free and open-source software license.

The problem is (taking Ubuntu as representative) shipping the compiled module along with the kernel, which is an entirely different matter.

---

[¹] https://btrfs.wiki.kernel.org/index.php/Main_Page#Stability_...


> ZFS has a CDDL license

Java is GPLv2+CPE. That didn't stop Oracle because, as Linus pointed out in the email, Oracle regards their APIs as a separate entity to their code.


Google's Java implementation wasn't GPL-licensed, so neither its implementation nor its interface could have been covered by the OpenJDK being GPLv2. I don't think RMS would sit by idly either if someone took GCC and forked it under the Apache license.


But Google didn't fork the OpenJDK; they forked Apache Harmony, which was already Apache-licensed.

So it's not comparable with GCC, but it is comparable to forking clang and keeping clang's license. I doubt RMS would be able to say anything.


> There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).

Note that they don't mean "it's unstable," just "there are significant improvements between versions." Most importantly:

> The filesystem disk format is stable; this means it is not expected to change unless there are very strong reasons to do so. If there is a format change, filesystems which implement the previous disk format will continue to be mountable and usable by newer kernels.

...and only _new features_ are expected to stabilise:

> As with all software, newly added features may need a few releases to stabilize.

So overall, at least as far as their own claims go, this is not "heavy development" as in "don't use."


Some features such as RAID5 were still firmly in "don't use if you value your data" territory last I looked. So it is important to be informed as to what can be used and what might be more dangerous with btrfs.


Keep in mind that RAID5 isn’t feasible with multi-TB disks (the probability of failed blocks when rebuilding the array is far too high). That said, RAID6 also suffers the same write-hole problem with Btrfs. Personally I choose RAIDZ2 instead.


> Keep in mind that RAID5 isn’t feasible with multi-TB disks (the probability of failed blocks when rebuilding the array is far too high).

What makes you say that? I've seen plenty of people make this claim based on URE rates, but I've also not seen any evidence that it is a real problem for a 3-4 drive setup. Modern drives are specced at 1 URE per 10^15 bits read (or better), so less than 1 URE in 125 TB read. Even if a rebuild did fail, you could just start over from a backup. Sure, if the array is mission critical and you have the money, use something with more redundancy, but I don't think RAID5 is infeasible in general.


Last time I checked (a few years ago, I must say), a 10^15 URE rate was only for enterprise-grade drives and not for consumer-level, where most drives have a 10^14 URE rate. Which means your rebuild is almost guaranteed to fail on a large-ish raid setup. So yeah, RAID is still feasible with multi-TB disks if you have the money to buy disks with the appropriate reliability. For the common folk, raid is effectively dead with today's disk sizes.
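For what it's worth, here is the usual back-of-the-envelope model behind both of those claims. It's only a sketch: it assumes independent errors at exactly the quoted spec rates and a 4x8 TB RAID5 rebuild that reads the three surviving drives; real failure modes are messier.

    /* ure_estimate.c - rough odds of hitting a URE during a RAID5 rebuild */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double bits_read = 3 * 8e12 * 8;          /* 3 surviving 8 TB drives, in bits */
        double rates[]   = { 1e-15, 1e-14 };      /* URE per bit: enterprise vs consumer spec */

        for (int i = 0; i < 2; i++) {
            double expected = bits_read * rates[i];
            double p_any    = 1.0 - exp(-expected);   /* P(at least one URE) */
            printf("spec %.0e: expected %.2f UREs, ~%.0f%% chance of one during rebuild\n",
                   rates[i], expected, 100.0 * p_any);
        }
        return 0;   /* prints ~17% at 1e-15 and ~85% at 1e-14 */
    }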


Theoretically, if you have a good RAID5, without serious write-hole and similar issues, then it is strictly better than no RAID and worse than RAID6 and RAID1.

* All localized errors are correctable, unless they overlap on different disks or result in drive ejection. This precisely fixes the UREs of non-raid drives.

* If a complete drive fails, then you have a chance of losing some data from the UREs / localized errors. This is approximately the same as if you used no RAID.

As for URE incidence rate - people use multi-TB drives without RAID, yet data loss does not seem prevalent. I'd say it depends .. a lot.

If you use a crappy RAID5 that ejects a drive on a partial/transient/read failure, then yes, it's bad - even worse than no RAID.

That being said, I have no idea whether a good RAID5 implementation is available - one that is well interfaced with or integrated into the filesystem.


I have a couple of Seagate IronWolf drives that are rated at 1 URE per 10^15 bits read and, sure, depending on the capacity you want (basically 8 TB and smaller desktop drives are super cheap), they do cost up to 40% more than their Barracuda cousins, but we're still well within the realm of cheap SATA storage.


Manufacturer-specified UBE rates are extremely conservative. If UBE were a thing then you'd notice transient errors during ZFS scrubs, which are effectively a "rebuild" that doesn't rebuild anything.


To be sure, it's entirely feasible, just not prudent with today's typical disk capacities.


Feasible is different than possible, and carries a strong connotation of being suitable/able to be done successfully. Many things are possible, many of those things are not feasible.


Btrfs has many more problems than dataloss with RAID5.

It has terrible performance problems under many typical usage scenarios. This is a direct consequence of the choice of core on-disk data structures. There's no workaround without a complete redesign.

It can become unbalanced and cease functioning entirely. Some workloads can trigger this in a matter of hours. Unheard of for any other filesystem.

It suffers from critical dataloss bugs in setups other than RAID5. They have solved a number of these, but when reliability is its key selling point many of us have concerns that there is still a high chance that many still exist, particularly in poorly-exercised codepaths which are run in rare circumstances such as when critical faults occur.

And that's only getting started...


There are differing opinions on BTRFS's suitability in production - it's the default filesystem of SUSE on one hand; on the other, Red Hat has deprecated BTRFS support because they see it as not being production-ready, and they don't see it becoming production-ready in the near future. They also feel that the more legacy Linux filesystems have added features to compete.



But then, your personal requirements/use cases might not be the same as Facebook's. (And this does not only apply to Btrfs[1]/ZFS, it also applies to GlusterFS, use of specific hardware, ...)

[1] which I used for nearly two years on a small desktop machine on a daily basis; ended up with (minor?) errors on the file system that could not be repaired and decided to switch to ZFS. No regrets, nor similar errors since.


It's also the default file system of millions of Synology NASes running in consumer hands (although Synology shimmed in their own RAID5/6 support).


Kroger (and their subsidiaries like QFC, Fred Meyer, Fry's Marketplace, etc), Walmart, Safeway (and Albertsons/Randalls) all use Suse with BTRFS for their point of sale systems.


Synology uses standard linux md (for btrfs too). Even SHR (Synology Hybrid RAID) is just different partitions on the drive allocated to different volumes, so you can use mixed-capacity drives effectively.


Right, instead of BTRFS RAID5/6, they use Linux md raid, but I believe they have custom patches to BTRFS to "punch through" information from md, so that when BTRFS has a checksum mismatch it can use the md raid mirror disk for repair.


Check what features of BTRFS SUSE actually uses and considers supported/supportable.


bcachefs should be heavily supported - it doesn't get nearly enough for what it sets out to do: https://www.patreon.com/bcachefs


I've been looking forward to using bcachefs as I had a few bad experiences with btrfs.

Is bcachefs more-or-less ready for some use cases now? Does it still support caching layers like bcache did?


It's quite usable, but of course, do not trust it with your unique unbacked-up data yet. I use it as a main FS for a desktop workstation and I'm pretty happy with it. Waiting impatiently for EC to be implemented for efficient pooling of multiple devices.

Regarding caching: "Bcachefs allows you to specify disks (or groups thereof) to be used for three categories of I/O: foreground, background, and promote. Foreground devices accept writes, whose data is copied to background devices asynchronously, and the hot subset of which is copied to the promote devices for performance."


To my knowledge, caching layers are supported but require some setup and don't have much documentation right now.

If all you need is a simple root FS that is CoW and checksummed, bcachefs works pretty well, in my experience. I've been using it productively as a root and home FS for about two years or so.


Many of the advanced features aren't implemented yet though, like compression, encryption, snapshots, RAID5/6....


Compression and encryption have been implemented, but not snapshots and RAID5/6.


why would you want to embed raid5/6 in the filesystem layer? Linux has battle-tested mdraid for this, I'm not going to trust a new filesystem's own implementation over it.

Same for encryption, there are already existing crypto layers both on the block and filesystem (as an overlay) level.


Because the FS can be deeply integrated with the RAID implementation. With a normal RAID, if the data at some address differs between the two disks, there's no way for the fs to tell which is correct, because the RAID code essentially just picks one; it can't even see the other. With ZFS for example, there is a checksum stored with the data, so when you read, zfs will check the data on both and pick the correct one. It will also overwrite the incorrect version with the correct one, and log the error. It's the same kind of story with encryption: if it's built in, you can do things like incremental backups of an encrypted drive without ever decrypting it on the target.
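A toy sketch of that self-healing idea, with nothing ZFS-specific in it (the checksum, block size and layout are made up purely for illustration): a checksum is kept separately from the data it covers, reads are verified against it, and a bad copy is repaired from a good one.

    /* self_heal_demo.c - conceptual only, not ZFS code */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define BLK 16

    static uint32_t cksum(const uint8_t *b, size_t n)      /* toy FNV-1a checksum */
    {
        uint32_t c = 2166136261u;
        for (size_t i = 0; i < n; i++) { c ^= b[i]; c *= 16777619u; }
        return c;
    }

    int main(void)
    {
        uint8_t copy[2][BLK];                    /* two mirrored copies of one block */
        memset(copy[0], 'A', BLK);
        memset(copy[1], 'A', BLK);
        uint32_t stored = cksum(copy[0], BLK);   /* checksum kept apart from the data */

        copy[0][3] = 'X';                        /* silent corruption of the first copy */

        for (int i = 0; i < 2; i++) {            /* read path: verify, fall back, heal */
            if (cksum(copy[i], BLK) != stored) {
                printf("copy %d failed checksum, trying the other one\n", i);
                continue;
            }
            printf("copy %d verified OK\n", i);
            if (i != 0) {
                memcpy(copy[0], copy[i], BLK);   /* overwrite the bad copy */
                printf("healed copy 0 from copy %d\n", i);
            }
            return 0;
        }
        printf("both copies bad: unrecoverable read error\n");
        return 1;
    }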


> when you read, zfs will check the data on both and pick the correct one.

Are you sure about that? Always reading both doubles read I/O, and benchmarks show no such effect.

> there's no way for the fs to tell which is correct

This is not an immutable fact that precludes keeping the RAID implementation separate. If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID). I've seen it detect and fix bad blocks, many times. It works, even with separate layers.

This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH. Understanding other complex subsystems is hard, and it's more fun to write new code instead.


> If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID).

This is all great, and I assume it works great. But it is in no way generalizable to all the filesystems Linux has to support (at least at the moment). I could only see this working in a few specific instances with a particular set of FS setups. Even more complicating is the fact that most RAIDs are hardware-based, so just using ioctls to pull individual blocks wouldn't work for many (all?) drivers. Convincing everyone to switch over to software RAID would take a lot of effort.

There is a legitimate need for these types of tools in the sub-PB, non-clustered storage arena. If you're working on a sufficiently large storage system, these tools and techniques are probably par for the course. That said, I definitely have lost 100GBs of data to bit rot on a multi-PB storage system of a top-500 HPC system. (One bad byte in a compressed data file left the data after the bad byte unrecoverable.) This would not have happened on ZFS.

ZFS was/is a good effort to bring this functionality lower down the storage hierarchy. And it worked because it had knowledge about all of the storage layers. Checksumming files/chunks helps best if you know about the file system and which files are still present. And it only makes a difference if you can access the lower level storage devices to identify and fix problems.


> it is no way generalizable to all the filesystems Linux has to support

Why not? If it's a standard LVM API then it's far more general than sucking everything into one filesystem like ZFS did. Much of this block-mapping interface already exists, though I'm not sure whether it covers this specific use case.


> This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH.

At the time that ZFS was written (early 2000s) and released to the public (2006), this was not a thing and the idea was somewhat novel / 'controversial'. Jeff Bonwick, ZFS co-creator, lays out their thinking:

* https://blogs.oracle.com/bonwick/rampant-layering-violation

Remember: this was a time when Veritas Volume Manager (VxVM) and other software still ruled the enterprise world.

* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation


I debated some of this with Bonwick (and Cantrill who really had no business being involved but he's pernicious that way) at the time. That blog post is, frankly, a bit misleading. The storage "stack" isn't really a stack. It's a DAG. Multiple kinds of devices, multiple filesystems plus raw block users (yes they still exist and sometimes even have reason to), multiple kinds of functionality in between. An LVM API allows some of this to have M users above and N providers below, for M+N total connections instead of M*N. To borrow Bonwick's own condescending turn of phrase, that's math. The "telescoping" he mentions works fine when your storage stack really is a stack, which might have made sense in a not-so-open Sun context, but in the broader world where multiple options are available at every level it's still bad engineering.


> ... but in the broader world where multiple options are available at every level it's still bad engineering.

When Sun added ZFS to Solaris, they did not get rid of UFS and/or SVM, nor prevent Veritas from being installed. When FreeBSD added ZFS, they did not get rid of UFS or GEOM either.

If an admin wanted or wants (or needs) to use the 'old' way of doing things they can.


Sorry, I'm pernicious in what way, exactly?


Heh. I was wondering if you were following (perhaps participating in) this thread. "Pernicious" was perhaps a meaner word than I meant. How about "ubiquitous"?


The fact that traditionally RAID, LVM, etc. are not part of the filesystem is just an accident of history. It's just that no one wanted to rewrite their single disk filesystems now that they needed to support multiple disks. And the fact that administering storage is so uniquely hard is a direct result of that.


However it happened, modularity is still a good thing. It allows multiple filesystems (and other things that aren't quite filesystems) to take advantage of the same functionality, even concurrently, instead of each reinventing a slightly different and likely inferior wheel. It should not be abandoned lightly. Is "modularity bad" really the hill you want to defend?


> However it happened, modularity is still a good thing.

It may be a good thing, and it may not. Linux has a bajillion file systems, some more useful than others, and that is unique in some ways.

Solaris and other enterprise-y Unixes at the time only had one. Even the BSDs generally only have a few that they run on instead of ext2/3/4, XFS, ReiserFS (remember when that was going to take over?), btrfs, bcachefs, etc, etc, etc.

At most, a company may have purchased a license for Veritas:

* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation

By rolling everything together, you get ACID writes, atomic space-efficient low-overhead snapshots, storage pools, etc. All this just by removing one layer of indirection and doing some telescoping:

* https://blogs.oracle.com/bonwick/rampant-layering-violation

It's not "modularity bad", but that to achieve the same result someone would have had to write/expand a layer-to-layer API to achieve the same results, and no one did. Also, as a first-order estimate of complexity: how many lines of code (LoC) are there in mdraid/LVM/ext4 versus ZFS (or UFS+SVM on Solaris).


Other than esoteric high performance use cases, I'm not really sure why you would really need a plethora of filesystems. And the list of them that can be actually trusted is very short.


I'd like to agree, but I don't think the exceptions are all that esoteric. Like most people I'd consider XFS to be the default choice on Linux. It's a solid choice all around, and also has some features like project quota and realtime that others don't. OTOH, even in this thread there's plenty of sentiment around btrfs and bcachefs because of their own unique features (e.g. snapshots). Log-structured filesystems still have a lot of promise to do better on NVM, though that promise has been achingly slow to materialize. Most importantly, having generic functionality implemented in a generic subsystem instead of in a specific filesystem allows multiple approaches to be developed and compared on a level playing field, which is better for innovation overall. Glomming everything together stifles innovation on any specific piece, as network/peripheral-bus vendors discovered to their chagrin long ago.


>I work on one of the four or five largest storage systems in the world

What would you recommend over zfs for small-scale storage servers? XFS with mdraid?

I'd also love to hear your opinion on the Reiser5 paper.


> With a normal RAID, if the data at some address is different between the two disks, there's no way for the fs to tell which is correct, because the RAID code essentially just picks one, it can't even see the other.

That's a problem only with RAID1, only when copies=2 (granted, the most often used case), and only when the underlying device cannot report which sector has gone bad.


> why would you want to embed raid5/6 in the filesystem layer?

There are valid reasons, most having to do with filesystem usage and optimization. Off the top of my head:

- more efficient re-syncs after failure (don't need to re-sync every block, only the blocks that were in use on the failed disk)

- can reconstruct data not only on disk self-reporting, but also on filesystem metadata errors (CRC errors, inconsistent dentries)

- different RAID profiles for different parts of the filesystem (think: parity raid for large files, raid10 for database files, no raid for tmp, N raid1 copies for filesystem metadata)

and for filesystem encryption:

- CBC ciphers have a common weakness: the block size is constant. If you use FS-object encryption instead of whole-FS encryption, the block size, offset and even the encryption keys can be varied across the disk.


I think to even call volume management a "layer" as though traditional storage was designed from first principles, is a mistake.

Volume management is just a hack. We had all of these single-disk filesystems, but single disks were too small. So volume management was invented to present the illusion (in other words, lie) that they were still on single disks.

If you replace "disk" with "DIMM", it's immediately obvious that volume management is ridiculous. When you add a DIMM to a machine, it just works. There's no volume management for DIMMs.


Indeed there is no volume management for RAM. You have to reboot to rebuild the memory layout! RAM is higher in the caching hierarchy and can be rebuilt at smaller cost. You can't resize RAM while keeping data because nobody bothered to introduce volume management for RAM.

Storage is at the bottom of the caching hierarchy where people get inventive to avoid rebuilding. Rebuilding would be really costly there. Hence we use volume management to spare us the cost of rebuilding.

RAM also tends to have uniform performance. Which is not true for disk storage. So while you don't usually want to control data placement in RAM, you very much want to control what data goes on what disk. So the analogy confuses concepts rather than illuminating commonalities.


One of my old co-workers said that one of the most impressive things he's seen in his career was a traveling IBM tech demo in the back of a semi truck where they would physically remove memory, CPUs, and disks from the machine without impacting the live computation being executed apart from making it slower, and then adding those resources back to the machine and watching them get recognized and utilized again.


> why would you want to embed raid5/6 in the filesystem layer?

One of the creators of ZFS, Jeff Bonwick, explained it in 2007:

> While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.

* https://blogs.oracle.com/bonwick/rampant-layering-violation


It's not about ZFS. It's about CoW filesystems in general; since they offer functionalities beyond the FS layer, they are both filesystems and logical volume managers.


Why does ZFS do RAIDZ in the filesystem layer?


It doesn't.

RAIDZ is part of the VDEV (Virtual Device) layer. Layered on top of this is the ZIO (ZFS I/O layer). Together, these form the SPA (Storage Pool Allocator).

On top of this layer we have the ARC, L2ARC and ZIL. (Adaptive Replacement Caches and ZFS Intent Log).

Then on top of this layer we have the DMU (Data Management Unit), and then on top of that we have the DSL (Dataset and Snapshot Layer). Together, the SPA and DSL layers implement the Meta-Object Set layer, which in turn provides the Object Set layer. These implement the primitives for building a filesystem and the various file types it can store (directories, files, symlinks, devices etc.) along with the ZPL and ZAP layers (ZFS POSIX Layer and ZFS Attribute Processor), which hook into the VFS.

ZFS isn't just a filesystem. It contains as many levels of layering as (if not more than) any RAID and volume management setup composed of separate parts like mdraid+LVM or similar, but much better integrated with each other.

It can also store stuff that isn't a filesystem. ZVOLs are fixed size storage presented as block devices. You could potentially write additional storage facilities yourself as extensions, e.g. an object storage layer.


Honestly just use ZFS. We've wasted enough effort over obscure licensing minutia.


> We've wasted enough effort over obscure licensing minutia.

Which was precisely Sun/Oracle's goal when they released ZFS under the purposefully GPL-incompatible CDDL. Sun was hoping to make OpenSolaris the next Linux whilst ensuring that no code from OpenSolaris could be moved back to Linux. I can't think of another plausible reason why they would write a new open source license for their open source operating system and make such a license incompatible with the GPL.


https://en.wikipedia.org/wiki/Common_Development_and_Distrib...

Some people argue that Sun (or the Sun engineer) as creator of the license made the CDDL intentionally GPL incompatible.[13] According to Danese Cooper one of the reasons for basing the CDDL on the Mozilla license was that the Mozilla license is GPL-incompatible. Cooper stated, at the 6th annual Debian conference, that the engineers who had written the Solaris kernel requested that the license of OpenSolaris be GPL-incompatible.[18]

    Mozilla was selected partially because it is GPL incompatible. That was part
    of the design when they released OpenSolaris. ... the engineers who wrote Solaris 
    ... had some biases about how it should be released, and you have to respect that.


And the very next paragraph states:

> Simon Phipps (Sun's Chief Open Source Officer at the time), who had introduced Cooper as "the one who actually wrote the CDDL",[19] did not immediately comment, but later in the same video, he says, referring back to the license issue, "I actually disagree with Danese to some degree",[20] while describing the strong preference among the engineers who wrote the code for a BSD-like license, which was in conflict with Sun's preference for something copyleft, and that waiting for legal clearance to release some parts of the code under the then unreleased GNU GPL v3 would have taken several years, and would probably also have involved mass resignations from engineers (unhappy with either the delay, the GPL, or both—this is not clear from the video). Later, in September 2006, Phipps rejected Cooper's assertion in even stronger terms.[21]

So of the available licenses at the time, Engineering wanted BSD and Legal wanted GPLv3, so the compromise was CDDL.


Wow... talk about cutting off your nose to spite your face. Oracle ended up abandoning OpenSolaris within a year or so.

Edit: Nevermind, debunked by Bryan Cantrill. It was to allow for proprietary drivers.


Not at all really. Danese Cooper says that Cantrill is not a reliable witness and one can say he also has an agenda to distort the facts in this way [1].

[1] https://news.ycombinator.com/item?id=22008921


And Cooper's boss:

> Simon Phipps (Sun's Chief Open Source Officer at the time), who had introduced Cooper as "the one who actually wrote the CDDL",[19] did not immediately comment, but later in the same video, he says, referring back to the license issue, "I actually disagree with Danese to some degree",[20] while describing the strong preference among the engineers who wrote the code for a BSD-like license, which was in conflict with Sun's preference for something copyleft, and that waiting for legal clearance to release some parts of the code under the then unreleased GNU GPL v3 would have taken several years, and would probably also have involved mass resignations from engineers (unhappy with either the delay, the GPL, or both—this is not clear from the video). Later, in September 2006, Phipps rejected Cooper's assertion in even stronger terms.[21]

* https://en.wikipedia.org/wiki/Common_Development_and_Distrib...

So of the available licenses at the time, Engineering wanted BSD and Legal wanted (to wait for) GPLv3, so the compromise was CDDL.


There were genuine reasons for the CDDL - it wasn't an anti-gpl thing. https://www.youtube.com/watch?v=-zRN7XLCRhc&feature=youtu.be...


Danese Cooper, one of the people at Sun who helped create the CDDL, responded in the comment section of that very video:

Lovely except it really was decided to explicitly make OpenSolaris incompatible with GPL. That was one of the design points of the CDDL. I was in that room, Bryan and you were not, but I know its fun to re-write history to suit your current politics. I pleaded with Sun to use a BSD family license or the GPL itself and they would consider neither because that would have allowed D-Trace to end up in Linux. You can claim otherwise all you want...this was the truth in 2005.


This needs to be more widely known. Sun was never as open or innovative as its engineer/advertisers claim, and the revisionism is irksome. I saw what they had copied from earlier competitors like Apollo and then claimed as their own ideas. I saw the protocol fingerprinting their clients used to make non-Sun servers appear slower than they really were. They did some really good things, and they did some really awful things, but to hear proponents talk it was all sunshine and roses except for a few misguided execs. Nope. It was all up and down the organization.


The thing is - it was a time of pirates. In an environment defined by the ruthlessness of characters like Gates, Jobs, and Ellison, they were among the best-behaved of the bunch. Hence the reputation for being nice: they were markedly nicer than the hive of scum and villainy that the sector was at the time. And they did some interesting things that arguably changed the landscape (Java etc), even if they failed to fully capitalize on them.

(In many ways, it still is a time of pirates, we just moved a bit higher in the stack...)


> In an environment ... they were among the best-behaved

I wouldn't say McNealy was that different than any of those, though others like Joy and Bechtolsheim had a more salutary influence. To the extent that there was any overall difference, it seemed small. Working on protocol interop with DEC products and Sun products was no different at all. Sun went less-commodity with SPARC and SBus, they got in bed with AT&T to make their version of UNIX seem more standard than competitors' even though it was more "unique" in many ways, there were the licensing games, etc. Better than Oracle, yeah, but I wouldn't go too much further than that.


> Sun was never as open or innovative as its engineer/advertisers claim, and the revisionism is irksome.

For (the lack of) openness, I agree, but the claim that they were not innovative needs stronger evidence.


Just to be clear, I'm not saying they weren't innovative. I'm saying they weren't as innovative as they claim. Apollo, Masscomp, Pyramid, Sequent, Encore, Stellar, Ardent, Elxsi, Cydrome, and others were also innovating plenty during Sun's heyday, as were DEC and even HP. To hear ex-Sun engimarketers talk, you'd think they were the only ones. Reality is that they were in the mix. Their fleetingly greater success had more to do with making some smart (or lucky?) strategic choices than with any overall level of innovation or quality, and mistaking one for the other is a large part of why that success didn't last.


Java was pretty innovative. The world's most advanced virtual machine, a JIT that often outperforms C in long-running server scenarios, and the foundation of probably 95% of enterprise software.


ANDF had already done (or at least tried to do) the "write once, run anywhere" thing. The JVM followed in the footsteps of similar longstanding efforts at UCSD, IBM and elsewhere. There was some innovation, but "world's most advanced virtual machine" took thousands of people (many of them not at Sun) decades to achieve. Sun's contribution was primarily in popularizing these ideas. Technically, it was just one more step on an established path.


Sure, plenty of the ideas in Java were invented before - standing on the shoulders of giants and all that. The JIT came from Self, the object system from Smalltalk, but Java was the first implementation that put all of those together into a coherent platform.


Yeah, it's hard to understand this without context. Sun saw DTrace and ZFS as the differentiators of Solaris from Linux, a massive competitive advantage that they simply could not (and would not) relinquish. Open-sourcing was a tactical move; they were not going to give away their crown jewels with it.

The whole open-source steer by Sun was a very disingenuous strategy, forced by the changed landscape in order to try and salvage some semblance of relevance. Most people saw right through it, which is why Sun ended up as it did shortly thereafter: broke, acquired, and dismantled.


And Cooper's boss:

> Simon Phipps (Sun's Chief Open Source Officer at the time), who had introduced Cooper as "the one who actually wrote the CDDL",[19] did not immediately comment, but later in the same video, he says, referring back to the license issue, "I actually disagree with Danese to some degree",[20] while describing the strong preference among the engineers who wrote the code for a BSD-like license, which was in conflict with Sun's preference for something copyleft, and that waiting for legal clearance to release some parts of the code under the then unreleased GNU GPL v3 would have taken several years, and would probably also have involved mass resignations from engineers (unhappy with either the delay, the GPL, or both—this is not clear from the video). Later, in September 2006, Phipps rejected Cooper's assertion in even stronger terms.[21]

So of the available licenses at the time, Engineering wanted BSD and Legal wanted GPLv3, so the compromise was CDDL.


I stand corrected!


I don't think something that is the subject of an ongoing multi-billion-dollar lawsuit can rightly be called "obscure licensing minutia." It is high-profile and its actual effects have proven pretty significant.


> Honestly just use ZFS. We've wasted enough effort over obscure licensing minutia.

I am willing to bet that Google had the same thought. And I am also willing to bet that Google is regretting that thought now.


It's not just licensing. ZFS has some deep-rooted flaws that can only be solved by block pointer rewrite, something that has an ETA of "maybe eventually".


Care to elaborate?


You can't make a copy-on-write copy of a file. You can't deduplicate existing files, or existing snapshots. You can't defragment. You can't remove devices from a pool.

That last one is likely to get some kind of hacky workaround. But nobody wants to do the invasive changes necessary for actual BPR to enable that entire list.


Wow. As a casual user - someone who at one point was trying to choose between RAID, LVM and ZFS for an old NAS - some of those limitations of ZFS seem pretty basic. I would have taken it for granted that I could remove a device from a pool or defragment.


> There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).

Unless you are living in 2012 on a RHEL/CentOS 6/7 machine, btrfs has been stable for a long time now. I have been using btrfs as the sole filesystem on my laptop in standard mode, on my desktop as RAID0, and on my NAS as RAID1 for more than two years. I have experienced absolutely zero data loss. In fact, btrfs recovered my laptop and desktop from broken package updates many times.

You might have had some issues when you tried btrfs on distros like RHEL that did not backport the patches to their stable versions because they don't support btrfs commercially. Try something like openSUSE, which backports btrfs patches to stable versions, or use something like Arch.

> That's true; however, the amount of breakage is no different from any other out-of-tree module, and it's unlikely to happen with a patch version of a working kernel (in fact, it happened with the 5.0 release).

This is a filesystem we are talking about. Under no circumstances will any self-respecting sysadmin use a file system that has even a small chance of breaking with a system update.


I also used btrfs not too long ago in RAID1. I had a disk failure and voila, the array would be read-only from then on, and I would have to recreate it from scratch and copy the data over. I even utilized the different data recovery methods (at some point the array would not be mountable no matter what), and in the end that resulted in around 5% of the data being corrupt. I won't rule out my own stupidity in the recovery steps, but after this, and the two other times when my RAID1 array went read-only _again_, I just can't trust btrfs for anything other than single-device DUP mode operation.

Meanwhile ZFS has survived disk failures, removing 2 disks from an 8 disk RAIDZ3 array and then putting them back, random SATA interface connection issues that were resolved by reseating the HDD, and will probably survive anything else that I throw at it.


I believe he's referring to the raid 5/6 issues


The RAID 5/6 issue is the write hole, which is common to all software RAID 5/6 implementations. If it is a problem for you, use either a BBU or a UPS.

RAIDZ/Z2 avoids the issue by having slightly different semantics. That's why it is Z/Z2, not 5/6.


A former employer was threatened by Oracle because some downloads for the (only free for noncommercial use) VirtualBox Extension Pack came from an IP block owned by the organization. Home users are probably safe, but Oracle's harassment engine has incredible reach.


My employer straight up banned the use of VirtualBox entirely _just in case_. They'd rather pay for VMWare Fusion licenses than deal with any potential crap from Oracle.


Anecdotal, but VirtualBox has always been a bit flaky for me.

VMWare Fusion, on the other hand, powers the desktop environment I've used as a daily work machine for the last 6 months, and I've had absolutely zero problems other than trackpad scrolling getting emulated as mouse wheel events (making pixel-perfect scroll impossible).

Despite that one annoyance, it's definitely worth paying for if you're using it for any serious or professional purpose.


On the other hand, the VMWare Fusion kernel extension is the only culprit for the kernel panics I've seen on a Mac.


This is throwing the baby out with the bathwater.

VirtualBox itself is GPL. There is no lawsuit risk.

What requires "commercial considerations" is the extension pack.

The extension pack is required for:

> USB 2.0 and USB 3.0 devices, VirtualBox RDP, disk encryption, NVMe and PXE boot for Intel cards

If licensing needs to be considered (ie. in a corporate environment), but one doesn't need the functionalities above, then there's no issue.


> This is throwing the baby along with the bathwater.

It might be, but let's just say that Oracle aren't big fans of $WORK, and our founders are big fans of them. Thus our legal department are rather tetchy about anything that could give them even the slightest chance of doing anything.

> What requires "commercial considerations" is the extension pack.

And our legal department are nervous about that being installed, even by accident, so they prefer to minimise the possibility.


Well ... that sounds initially unreasonable, but then if I think about it a bit more I'm not sure how you'd actually enforce a non-commercial use only license without some basic heuristic like "companies are commercial".

Is the expectation here that firms offering software under non-commercial-use-is-free licenses just run it entirely on the honour system? And isn't it true that many firms use unlicensed software, hence the need for audits?


IIRC VirtualBox offers to download the Extension Pack without stating it's not free for commercial use. There isn't even a link to the EULA in the download dialog as far as I can tell (from Google Images, at least). Conversely, VirtualBox itself is free for commercial use. Feels more like a honeypot than license auditing.

They can also apply stronger heuristics, like popping up a dialogue box if the computer is centrally-managed (e.g.: Mac MDM, Windows domain, Windows Pro/Enterprise, etc.).


Wait is this the pack that gets screen resizing and copy/paste working?


You're thinking of the Guest Additions which is part of the base Virtualbox package and free for commercial use.

The (commercially licensed) Extension Pack provides "Support for USB 2.0 and USB 3.0 devices, VirtualBox RDP, disk encryption, NVMe and PXE boot for Intel cards"[1] and some other functionality, e.g. webcam passthrough [2]. There may be additional functionality enabled by the Extension Pack that I cannot find at a glance, but those are the main things.

[1] https://www.virtualbox.org/wiki/Downloads [2] https://www.virtualbox.org/manual/ch01.html#intro-installing


A tad offtopic, but on my 2017 Macbook Pro the "pack" was called VMWare Fusion.

With my MBP as host and Ubuntu as guest, I found that VirtualBox (with and without guest extensions installed) had a lot of graphical performance issues that Fusion did not.


They harass universities about it too. Which is ludicrous, because universities often have residence halls, and people who live there often download VirtualBox extensions.


Their PUEL license even has a grant specifically for educational use.


It does, but it's not 100% clear if administrative employees of universities count as educational. Sure, if you are teaching a class with it, go for it; but running a VM in it for the university accounting office is not as clear.


Education might not be the same as research in this license's terms. And there are even software vendors picking nits about whether writing a thesis is research or education, depending on their mood and purse fill level...


> There is no conceivable reason that Oracle would want to threaten me with a lawsuit.

I don't think it has to be conceivable with Oracle...

Unfortunately I have to agree with Linus on this one. Messing with Oracle's stuff is dangerous if you can't afford a comparable legal team.


"Oracle's stuff" can most often be described more accurately as "what Oracle considers its stuff".


Linus is distributing the kernel, a very different beast from using a kernel module. I can't imagine Oracle targeting someone for using ZFS on Linux without first establishing that the distribution of ZFS on Linux is illegal.


> there is no conceivable reason that Oracle would want to threaten me with a lawsuit.

Money. Anecdotally that's the primary reason Oracle do anything.


If anyone thinks this is hyperbole :

I worked for a tiny startup (>2 devs full time) where Oracle tried to extract money from us because we used MariaDB on AWS.

If you think this sounds ridiculous you probably got it right.

(Why? Because someone inexperienced with Oracle had filled out the form while downloading the mySQL client.)


Re-reading my comment in daylight, I realize I got one detail almost exactly wrong: we were always <= 2 developers, but it seems everyone understood the point anyway - we were tiny, but not too tiny for Oracle's licensing department.


Well... Serves you about right for choosing MySQL over PostgreSQL :)


In my defense it wasn't my choice ;-)


"there is no conceivable reason that Oracle would want to threaten me with a lawsuit."

Don't be so sure about this.


None of these are good reasons to purposely hinder the optional use of ZFS as a third party module by users, which is what Linux is doing.


Can you expand? I'm no expert - I use Linux daily but have always just used the distro default file system. Linus' reasons for not integrating seem pretty sensible to me. Oracle certainly has form on the litigation front.


Linus' reasons for not integrating ZFS are absolutely valid, and there's no doubt that ZFS can never be included in the mainline kernel. There's absolutely no debate there.

However the person he is replying to was not actually asking to have ZFS included in the mainline kernel. As noted above, that could never happen, and I believe that Linus is only bringing it up to deflect from the real issue. What they were actually asking is for Linux to revert a change that was made for no other reason than to hinder the use of ZFS.

Linux includes a system which restricts what APIs are available to each module based on the license of the module. GPL modules get the full set of APIs whereas non-GPL modules get a reduced set. This is done strictly for political reasons and has no known legal basis as far as I'm aware.

Not too long ago a change was made to reduce the visibility of a certain API required by ZFS so only GPL modules could use it. It's not clear why the change was made, but it was certainly not to improve the functionality of the kernel in any way. So the only plausible explanation to me is that it was done just to hinder the use of ZFS with Linux, which has been a hot political issue for some time now.


If I remember correctly, the reasoning for the GPL module stuff was/is that if kernel modules integrate deeply with the kernel, they fall under the GPL. So the GPL flag is basically a guideline of what kernel developers believe is safe to use from non-GPL-compatible modules.


But from what I can see, marking the "save SIMD registers" function as GPL is a blatant lie by a kernel developer that wanted to spite certain modules.

Saving and restoring registers is an astoundingly generic function. If you list all the kernel exports and sort by how much they make your work derivative, it should be near the very bottom.


You are not supposed to use FP/SSE in kernel mode.

It was always frowned upon:

> In other words: it's still very much a special case, and if the question was "can I just use FP in the kernel" then the answer is still a resounding NO, since other architectures may not support it AT ALL.

> Linus Torvalds, 2003

and these specific functions, that were marked as GPL were already deprecated for well over a decade.


> You are not supposed to use FP/SSE in kernel mode.

> It was always frowned upon

Whether it's frowned upon is a completely different issue from whether it intertwines your data so deeply with the kernel that it makes your code a derivative work subject to the GPL license. Which it doesn't.

> if the question was "can I just use FP in the kernel" then the answer is still a resounding NO, since other architectures may not support it AT ALL.

It's not actually using floating point, it's using faster instructions for integer math, and it has a perfectly viable fallback for architectures that don't have those instructions. But why use the slower version when there's no real reason to?

> and these specific functions, that were marked as GPL were already deprecated for well over a decade.

But the GPL export is still there, isn't it? It's not that functionality is being removed, it's that functionality is being shifted to only have a GPL export with no license-based justification for doing so.
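
For context, the usual in-kernel pattern for that kind of thing looks roughly like this on x86: bracket the vectorised path with kernel_fpu_begin()/kernel_fpu_end() so the SIMD register state gets saved and restored, and keep a plain-C fallback for contexts (or architectures) where that isn't possible. checksum_simd here is a hypothetical helper, not a real kernel function; this is just a sketch of the pattern being discussed:

    #include <linux/types.h>
    #include <asm/fpu/api.h>  /* irq_fpu_usable(), kernel_fpu_begin/end() on x86 */

    /* Plain integer fallback - works everywhere, no special registers needed. */
    static u32 checksum_scalar(const void *buf, size_t len)
    {
            const u8 *p = buf;
            u32 sum = 0;

            while (len--)
                    sum += *p++;
            return sum;
    }

    /* Hypothetical vectorised version (SSE/AVX inner loop) defined elsewhere. */
    u32 checksum_simd(const void *buf, size_t len);

    static u32 my_checksum(const void *buf, size_t len)
    {
            u32 sum;

            if (irq_fpu_usable()) {
                    kernel_fpu_begin();              /* save SIMD/FPU register state */
                    sum = checksum_simd(buf, len);   /* fast vector path             */
                    kernel_fpu_end();                /* restore register state       */
            } else {
                    sum = checksum_scalar(buf, len); /* plain C path                 */
            }
            return sum;
    }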


So what meets the criteria of being a "special case" and what doesn't? One of the examples that Linus gives is RAID checksumming. How come RAID checksumming is a special case but ZFS checksumming isn't? I don't think it has anything to do with the nature of the usage, the only problem is that the user is ZFS.


RAID checksumming is in the kernel, and when Linus says jump, the RAID folks ask back how high.

He is not going to ask people outside the kernel whether he is allowed to change something that may break their module. On the contrary, they must live with any breakage that is thrown at them.

Again, that symbol was deprecated for well over a decade. How long does it take to be allowed to remove it?


Sometimes in life we do things even though we are not explicitly obligated to do them. Nobody is asking for ZFS to get explicitly maintained support in the Linux kernel. They are simply asking for this one small inconsequential change to be reverted just this one time, since it would literally be no harm to the kernel developers to do so, and it would provide substantial benefits to any user wanting to use ZFS. Furthermore the amount of time that kernel developers have spent arguing in favour of this change has been significantly greater than the time it would have taken to just revert it.

> Again, that symbol was deprecated for well over a decade.

But not the GPL equivalent of the symbol. That symbol is not deprecated.


This is the commonly recited argument but I don't believe it was ever proven to be legally necessary. Furthermore, even if it was, it's not clear what level of integration is "too deep". So in practice, it's just a way for kernel developers to add political restrictions as they see fit.


Proven legally necessary, as in, a court telling them to stop doing something? I'm pretty sure they don't want it to get to that point.


Proven legally necessary, as in, a court ever telling anyone in that situation to stop doing it. Or even to start doing it in the first place. There's just no legal justification behind it whatsoever.


"Proven" is a maybe impossible standard: Kernel devs hint at the GPLonly exports having been useful in certain cases they prefer not to discuss on a ML. https://lore.kernel.org/lkml/20190110131132.GC20217@kroah.co...

One can interpret this as something legally significant, or an embarrassing private anecdote, or nothing substantial at all, maybe even just talk. However, I'd give them the benefit of the doubt. Not the least since they could be the ones against Oracle's legal dept...


What he is referring to is the use of the GPL export restriction to strong-arm companies into releasing their code as GPL. It's nothing to do with a legal requirement, he is just an open source licensing hardhead. See: https://lwn.net/Articles/603145/


Surely the kernel developers can do whatever the hell they like.

If you don’t like that don’t use it.


>This is done strictly for political reasons and has no known legal basis as far as I'm aware.

Let me stop you right there. This being Oracle, with its litigious nature, how can you truly be aware or sure?

Linus is literally saying there is a legal basis.


> This being "Oracle," and its litigious nature, how can you truly be aware or sure?

The functionality I'm describing has absolutely nothing to do with ZFS or Oracle in any way. If you really think the reach of Oracle is so great, then why not block all Oracle code from ever running on the OS? That seems to me to be just as justified as this change.


> why not block all Oracle code from ever running on the OS?

...to be fair, I would probably run that module.


I think that it would be a mandated module by many companies.


Oracle sued Google for copying the names of its functions.


And I believe Oracle copied the Amazon S3 API.

I can't make an informed opinion, but my uninformed gut feeling is that Oracle has done exactly what they are suing Google for having done.


This wasn't a case of "purposely hindering"; rather, the ZFS module broke because of some kernel changes. The kernel is careful to never break userspace and never break its own merged modules. But if you're a third-party module then you're on your own. The kernel developers can't be responsible for maintaining compatibility with your stuff.


The changes conveniently accomplished nothing except for breaking ZFS. Furthermore, just because they don't officially support ZFS doesn't mean they must stonewall all the users who desire the improved compatibility. Reverting this small change would not be a declaration that ZFS is officially supported.


> - Performance is not that great compared to the alternatives.

CoW filesystems do trade performance for data safety. Or did you mean there are other _stable/production_ CoW filesystems with better performance? If so, please do point them out!


XFS on LVM thin pool LV. Stable and performant as far as I can tell.


My terrible experiences with thin pools make me see btrfs as a pool of wonderful, trouble-free and perfect code.

Just ask yourself: what happens when a thin pool runs out of actual, physical disk blocks?


Isn't this a problem for any over-provisioned storage pool? You can avoid that if you want by not over-provisioning and checking the space consumed by CoW snapshots. Also, what does ZFS do if you run out of blocks?

I have actually managed to run out of blocks on XFS on a thin LV and it's an interesting experience. XFS always survived just fine, but some files basically vanished. Looks like mostly those that were open and being written to at exhaustion time, like for example a mariadb database backing store. Files that were just sitting there were perfectly fine as far as I could tell.

Still, you definitely should never put data on a volume where a pool can be exhausted, without a backup as I don't think there is really a bulletproof way for a filesystem to handle that happening suddenly.


>Isn't this a problem for any over-provisioned storage pool?

ZFS doesn't over-provision anything by default. The only case I'm aware of where you can over-provision with ZFS is when you explicitly choose to thin provision zvols (virtual block devices with a fixed size). This can't be done with regular file systems which grow as needed, though you can reserve space for them.

File systems do handle running out of space (for a loose definition of handle) but they never expect the underlying block device to run out of space, which is what happens with over-provisioning. That's a problem common to any volume manager that allows you to over provision.


Can't you over-provision even just by creating too many snapshots? Even if you never make the filesystems bigger than the backing pool, the snapshots will allocate some blocks from the pool and over time, boom.


Snapshots can't cause over-provisioning, not for file systems. If I mutate my data and keep snapshots forever, eventually my pool will run out of free space. But that's not a problem of over-provisioning, that's just running out of space.

With ZFS, if I take a snapshot and then delete 10GB of data my file system will appear to have shrunk by 10GB. If I compare the output of df before and after deleting the data, df will tell me that "size" and "used" have decreased by 10GB while "available" remained constant. Once the snapshot is deleted that 10GB will be made available again and the "size" and "available" columns in df will increase. It avoids over-provisioning by never promising more available space than it can guarantee you're able to write.

I think you're trying to relate ZFS too much to how LVM works, where LVM is just a volume manager that exposes virtual devices. The analogue to thin provisioned LVM volumes is thin-provisioned zvols, not regular ZFS file systems. I can choose to use ZFS in place of LVM as a volume manager with XFS as my file system. Over-provisioned zvols+XFS will have functionally equivalent problems as over-provisioned LVM+XFS.


ZFS doesn't work this way. The free blocks in the ZFS pool are available to all datasets (filesystems). The datasets themselves don't take up any space up front until you add data to them. Snapshots don't take up any space initially. They only take up space when the original dataset is modified, and altered blocks are moved onto a "deadlist". Since the modification allocates new blocks, if the pool runs out of space it will simply return ENOSPC at some point. There's no possibility of over-provisioning.

ZFS has quotas and reservations. The former is a maximum allocation for a dataset. The latter is a minimum guaranteed allocation. Neither actually allocate blocks from the pool. These don't relate in any comparable way to how LVM works. They are just numbers to check when allocating blocks.


LVM thin pools had (maybe still have - I haven't used them recently) another issue though, where running out of metadata space caused the volumes in the thinpool to become corrupt and unreadable.


ZFS does overprovision all filesystems in a zpool by default. Create 10 new filesystems and 'df' will now display 10x the space of the parent fs. A full fs is handled differently than your volume manager running out of blocks. But the normal case is overprovisioning.


That's not really overprovisioning. That's just a factor of the space belonging to a zpool, but 'df' not really having a sensible way of representing that.


That is not over-provisioning, it's just that 'df' doesn't have the concept of pooled storage. With pools it's possible for different file systems to share their "available" space. BTRFS also has its own problems with output when using df, and gives strange results.

If I have a 10GB pool and I create 10 empty file systems, the sizes reported in df will total 100GB. It's not quite a lie either, because each of those 10 file systems does in fact have 10GB of space available; I could write 10GB to any one of them. If I write 1GB to one of those file systems, the "size" and "available" values for the other nine will all shrink despite not having a single byte of data written to them.

With ZFS and df the "size" column is really only measuring the maximum possible size (at this point in time, assuming nothing else is written) so it isn't very meaningful, but the "used" and "available" columns do measure something useful.


This is exactly what overprovisioning is: The sum of possible future allocations is greater than available space.


In my example the sum of possible future allocations for ZFS is still only 10GB total. Each of the ten file systems, considered individually, does truthfully have 10GB available to it before any data is written. The difference is that with over-provisioning (like LVM+XFS), if I write 10GB of data to one file system the others will still report 10GB of free space, but with ZFS or BTRFS they'll report 0GB available, so I can never actually attempt to allocate 100GB of data.

You could build a pool-aware version of DF that reflects this, by grouping file systems in a pool together and reporting that the pool has 10GB available. But frankly there's not enough benefit to doing that because people with storage pools already understand summing up all the available space from df's output is not meaningful. Tools like zpool list and BTRFS's df equivalent already correctly report the total free space in the pool.


XFS is not copy on write.


XFS has supported reflinks for some time already; it's just that the deduplication is kind of experimental.

Supporting reflinks is actually more than can be said about ZoL (see zfsonlinux#405).


I think that you're both right - under normal conditions XFS is not CoW, but when using the reflink option it does use CoW => kind of a mix.


>- Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)

No. Distributing it (i.e. a precompiled distro with ZFS) will. You are free to run any software on your machine as you so desire.


This reminds me of the adaptation of a Churchill quote that "ZFS is the worst of the file systems, except for all others."


The problem with ZFS is that it isn't part of Linux kernel.

Linux project maintains compatibility with userspace software but it does not maintain compatibility with 3rd party modules and for a good reason.

Since modules have access to any internal kernel API, it is not possible to change anything within the kernel without considering 3rd party code, if you want to keep that code working.

For this reason the decision was made that if you want your module to keep working you need to make it part of the Linux kernel, and then if anybody refactors anything they need to consider the modules that would be affected by the change.

Not allowing the module to be part of the kernel is a disservice to your user base. While there are modules like that that are maintained moderately successfully (Nvidia, vmware, etc.) this is all at the cost of the user and userspace maintainers who have to deal with it.


It isn't just ZFS. All sorts of drivers get broken because Linux refuses to offer a stable API, saying your code should be in the kernel, but also often refuses to accept drivers into the kernel, even open-source code with no particular quality issues (e.g. quickcam, reiserfsv4).

Use FreeBSD where there's a stable ABI and you don't have these problems.


Plenty of drivers get rejected because the kernel developers have no confidence that they will be maintained going forward, which would mean the driver would be removed fairly quickly again.


FreeBSD does not really have a stable ABI; every major release breaks the ABI, so it's only stable for 2 years.

https://wiki.freebsd.org/VendorInformation


Stable for each major and minor release is still a vast step up on Linux.

Having a stable ABI for two years is vastly easier to support than an ABI which changes every two weeks. This is reflected by the number of binary modules which are packaged for FreeBSD in the ports tree and provided by third-party vendors. This stability makes it possible to provide proper support for a reasonable timeframe, and vendors are doing so.


Honestly, I don't like binary modules and I am happy with a policy that lets me have a functional operating system with modern hardware and source code that I have access to (well... except the firmware, which even Linux can't do anything about until open-source hardware projects get more traction).

It is enough that almost all devices around me have a bunch of running code that I have absolutely no control over. I need at least one computer I can trust to do MY bidding.


The problem I have with this is that Linux shoots itself in the foot here. It's conflating two different problems: (1) supporting third-party modules and (2) supporting proprietary modules. All modules are ultimately binary; only a small subset are both proprietary and binary-only.

If you look at FreeBSD, the majority of third-party modules are free software. It's stuff like graphics drivers, newer ZFS modules, esoteric HBAs etc. Proprietary modules, like nVidia's graphics driver, are the minority.

I can see and understand why things are the way they are, and indeed I agreed with the approach for many years. Today, I see it being as short sighted as the GCC vs LLVM approach to modular architecture.

Linux is nearly 30 years old now. To not have stable internal interfaces seems to me to be indicative of either bad initial design or ill discipline on the part of its maintainers. Every other major kernel seems to manage to have a stable ABI for third-party functionality, and Linux is an outlier in its approach. Having to upgrade the kernel for a new GPU driver is painful. Not only do I have to wait for a new kernel release, I have to hope that none of the other changes in that release cause breakage or change the behaviour in unexpected ways. Upgrading a third-party module is much less risky.


I don't see how it shoots itself in the foot, given that these rules have been in place basically forever and it is currently the most popular open source operating system by a huge margin.


Well, I left Linux in part because a lot of my hardware stopped working - FreeBSD probably has a fraction of the developers that Linux does, yet I actually have more faith in its hardware support because of this issue. YMMV I guess.


Parent updated their post and my comment is no longer relevant.


I don't see how it's an insult to the users. It's saying that not allowing ZFS code to be distributed under the GPL and be maintained as part of the Linux kernel, is a disservice to ZFSonLinux users. Which I think is clearly right.


I edited it out before I saw your comment.


And he was doing fine up to that point. For IMO good reasons, ZFS will likely never be merged into Linux. And filesystem kernel modules from third parties have a pretty long history of breakage issues going back to some older Unixes.

That's going to be plenty of reason not to use ZFS for most people. The licensing by itself is also certainly a showstopper for many.

But I'm not sure his other comments are really fair and, had Oracle relicensed ZFS n years back, ZFS would almost certainly be shipping with Linux, whether or not as the typical default I can't say. It certainly wasn't just a buzzword and there were a number of interesting aspects to its approach.


Well, he says

> It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me.

So presumably the licensing problem mentioned by your parent's comment is weighing heavily here. I think this "don't use ZFS" statement is most accurately targeted at distro maintainers. Anyone not actually redistributing Linux and ZFS in a way that would (maybe) violate the GPL is not at any risk. That means even large enterprises can get away with using ZoL.


It's exactly that, when combined with the longstanding practice of maintaining compatibility with userspace, but reserving the right to refactor kernel-space code whenever and wherever needed. If ZFS-on-linux breaks in a subtle or obvious way due to a change in linux, he can't afford to care about that - keeping the linux kernel codebase sane while adding new features, supported hardware, optimizations, and fixes at an honestly scary rate, is not that easy.

See also https://www.kernel.org/doc/html/latest/process/stable-api-no...

(fuse is a stable user-space API if you want one ... it won't have the same performance and capabilities of course ...)


> he can't afford to care about that - keeping the linux kernel codebase sane while adding new features, supported hardware, optimizations, and fixes at an honestly scary rate, is not that easy.

Maybe, but the complaints seem to be more related to the (problematic) changes not being of a technical nature that accidentally broke ZFS, but being more of a political nature. With speculation that it might have been meant to _intentionally_ break ZFS and then pretend it was an accident, because ZFS isn't (and can never be) maintained in tree. Basically along the lines of "we don't like out-of-tree kernel modules so we make life hard for them". No idea if this is actually the case or people are just spinning things together. Even if it is the case I'm not sure what I should think about it, because it's at least partially somewhat understandable.


Linus is rather tolerant (or apathetic) about non-GPL modules, but what he doesn't care to do is ensure that there is an appropriate set of non-GPL-marked exports available for external modules. If some other developer happens to mark some export GPL and it happens to be one key export needed by a non-GPL external module, Linus doesn't care, because he doesn't care about external modules.

This has come up many times in the past. Keep in mind that linux has always been GPLv2-only, it is not LGPL or anything like that.

https://lwn.net/Articles/769471/

https://lwn.net/Articles/603131/

https://lkml.org/lkml/2012/2/7/451


"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me."

When he says that, I think of the $500 million Sun spent on advertising Java.


Sun isn't going to sue anyone into oblivion any time soon, but Oracle sure will


Sun is all but defunct; I don't think I would characterize it as a subsidiary of Oracle.


That's kinda nonsensical IMO. If Oracle, the parent company, is trigger-happy, there are no guarantees they won't go deeper to protect their child companies' IP if they feel it's being infringed.


I was thinking more of "the buzzword" bit, and how it got to be such a well known technology.


Well he had this:

> as far as I can tell, it has no real maintenance behind it either any more

Which simply isn't true. They just released a new ZFS version with encryption built in (no more ZFS + LUKS) and they removed the SPL dependency (which didn't support Linux 5.0+ anyway).

I use ZFS on my Linux machines for my storage and I've been rather happy with it.


Same, for at least 6 years in a 4-drive raidz array. It always reads and writes at full gigabit ethernet speeds and I haven't had any downtime other than maintaining FreeBSD updates, which are trivial even when going from 10.x to 11 to 12.


"Same" for the last ~4 years, starting with 8 disks and as of 2018, the 24-bay enclosure is full. Each vdev is a mirrored pair split across HBAs to sedate my paranoia. I've replaced a few drives after watching unreadable sector count slowly increase over a few months. I've also switched out most of the original 3TB pairs to 8TB and 10TB pairs. ~42TB usable and the box only has 16GB of RAM (because I can't get the used 32GB sticks to work, it's a picky mainboard and difficult to find matching ECC memory here in Europe). I haven't powered down much except to attempt to replace the RAM or during extremely hot days. Read/write speed is more or less max gigabit, even during rebuild after hot-swapping drives.


Same here (4-drive raidz for many years), though I do have an issue where deleting large files (~1 GB) takes around a minute and nobody seems to know why (I have plenty of free space and RAM)...


do you have lots of snapshots? every snapshotting FS I've worked with has really slow deletes, especially when the volume is near capacity.


Snapshots are one thing ZFS is fast at. All the blocks for a given snapshot are placed on a "deadlist". Snapshot deletion is essentially just returning this list of blocks back to the free pool. A terabyte snapshot will take a short while (in the background) to recycle those blocks. But the deletion itself is near instantaneous.


I think you misunderstand: file deletions are what is slow (I don't use ZFS, my reference is WAFL, but my understanding is that all snapshotting file systems have this problem).


Even this should have minimal overhead. If the file is present in the snapshot, then it's simply moving the blocks over to the deadlist which is a very cheap operation. If it's not in the snapshot then the blocks will get recycled in the background. In both cases you should have the unlink complete almost immediately.

All of the snapshot functionality is based upon simple transaction number comparisons plus the deadlist of blocks owned by the snapshot. Only the recycling of blocks should have a bit of overhead, and that's done by a background worker--you see the free space increase for a few minutes after a gargantuan snapshot or dataset deletion, but the actual deletion completed immediately.


I've been promised many things by vendors and they always fall back to "hey! look! cool CS file system theory". I test my systems carefully and report the results back; they often don't agree.

I should point out again that I don't have enough direct experience with ZFS to say if this is the case, my experience was with an enterprise NetApp server at a large company that was filling the disk up (>95%) in addition to doing hourly snapshots.


I have 400 in total, though none on the slow volume :/ That shouldn't affect it, right?


A single 5400 rpm drive (the likes of a WD Red) should be able to saturate gigabit ethernet. A 4-drive array should be basically idling.


Relevant bits:

"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me.

The benchmarks I've seen do not make ZFS look all that great. And as far as I can tell, it has no real maintenance behind it either any more, so from a long-term stability standpoint, why would you ever want to use it in the first place?"


> The benchmarks I've seen do not make ZFS look all that great.

The thing about ZFS that actually appeals to me is how much error-checking it does. Checksums/hashes are kept of both data and metadata, and those checksums are regularly checked to detect and fix corruption. As far as I know, it (and filesystems with similar architectures) are the only ones that can actually protect against bit rot.

https://github.com/zfsonlinux/zfs/wiki/Checksums

> And as far as I can tell, it has no real maintenance behind it either any more, so from a long-term stability standpoint, why would you ever want to use it in the first place?

It has as much maintenance as any open source project: http://open-zfs.org/. IIRC, it has more development momentum behind it than the competing btrfs project.


> those checksums are regularly checked to detect and fix corruption.

I don't believe that's true. They are checked on access, but if left alone, nothing will verify them. From what I've read, you need to setup a cron job that runs scrubbing on some regular schedule.


Yes. Those cron jobs are installed by default by all major vendors that supply/support ZFS.


The setup instructions for ZFS always include the "how to setup regular scrubs" step.


Linus is just wrong as far as maintenance, as a look at the linux-zfs lists would show.

From my perspective, it has no real competitor under Linux, which is why I use it. I don't consider btrfs mature enough for critical data. (Others can reasonably disagree, I have intentionally high standards for data durability.)

Aside from legal issues, he's talking out of his ass.


I don't care about my data, so I use ext4, and like most non-ZFS peasants I lose files every other day.


Bitrot is a real thing and deduplication is actually very useful for many usecases, so your sarcasm is ill-advised. ZFS has legitimate useful features that ext4 does not.


Not sure where that belief comes from. But it might be that many benchmarks are naive and compare it against other filesystems in single-disc setups with zero tuning. Since its metadata overheads are higher, it's definitely slower in this scenario. However, put a pool onto an array of discs and tune it a little, and the performance scales up and up, leaving all Linux-native filesystems, and LVM/dm/mdraid, well behind. It's a shame that Linux has nothing compelling to compete with this.


Last time I used ZFS, write performance was terrible compared to an ordinary RAID5. IIRC writes in a raidz are always limited to a single disk's performance. The only way to get better write speed is to combine multiple raidzs - which means you need a boatload of disks.


We had a bunch of Thumpers (SunFire X4200) with 48 disks at work, running ZFS on Solaris. It was dog slow and awful, tuning performance was complicated and took ages. One had to use just the right disks in just the right order in RaidZs with striping over them. Swap in a hotspare: things slow to a crawl (i.e. not even Gbit/s).

After EoL a colleague installed Linux with dmraid, LVM and xfs on the same hardware: much faster, more robust. Sorry, don't have numbers around anymore, stuff has been trashed since.

Oh, and btw., snapshots and larger numbers of filesystems (which Sun recommended instead of the missing Quota support) also slow things down to a crawl. ZFS is nice on paper and maybe nice to play with. Definitely simpler to use than anything else. But performance-wise it sucked big time, at least on Solaris.


ZFS, on Solaris, not robust?

ZFS for “play”?!

This... is just plain uninformed.

Not just me and my employer, but many (many) others rely on ZFS for critical production storage, and have done so for many years.

It’s actually very robust on Linux as well - the fact that FreeBSD has started to use the ZoL code base is quite telling.

Would FreeBSD also be in the “play” and “not robust” category, hanging out together with Solaris?

Will it perform better than all in terms of writes/s? Most likely not - although by staying away from de-dup, having enough RAM and adhering to the pretty much general recommendation to use only mirror vdevs in your pools, it can be competitive.

Something solid with data integrity guarantees? You can’t beat ZFS, imo.


> Something solid with data integrity guarantees? You can’t beat ZFS, imo.

This reminds me. We had one file server used mostly for package installs that used ZFS for storage. One day our java package stops installing. The package had become corrupt. So I force a manual ZFS scrub. No dice. Ok fine I’ll just replace the package. It seems to work but the next day it’s corrupt again. Weird. Ok I’ll download the package directly from Oracle again. The next day again it’s corrupt. I download a slightly different version. No problems. I grab the previous problematic package and put it in a different directory (with no other copies on the file system) - again it becomes corrupt.

There was something specific about the java package that ZFS just thought it needed to “fix”. If I had to guess it was getting the file hash confused. I’m pretty sure we had dedupe turned on so that may have factored into it.

Anyway that’s the first and only time I’ve seen a file system munge up a regular file for no reason - and it was on ZFS.


Performance wasn't robust, especially on dead disks and rebuilds, but also on pools with many (>100) filesystems or snapshots. Performance would often degrade heavily and unpredictably on such occasions. We didn't lose data more often than with other systems.

"play" comes from my distinct impression that the most vocal ZFS proponents are hobbyists and admins herding their pet servers (as opposed to cattle). ZFS comes at low/no cost nowadays and is easy to use, therefore ideal in this world.


Fair enough, I can’t argue with your personal experience, but I can assure you that ZFS is used “for real” at many shops.

I’ve only used ZFS in two- or three-way mirror setups, on beefy boxes, where the issues you describe are minimal. Also JBOD only.

The thing is that without checksumming you’ve actually no idea if you lose data. I’ve had several pools over the years report automatic resilvering on checksum mismatches. Usually it’s been disks acting up well before SMART can tell, and reporting this has been invaluable.


Sounds like you turned on dedupe, or had an absurdly wide stripe size. You do need to match your array structure to your needs as well as tune ZFS.

Our backup servers (45 disks, 6-wide Z2 stripes) easily handle wire-speed 10G with a 32G ARC.

And you're just wrong about snapshots and filesystem counts.

ZFS is no speed demon, but it performs just fine if you set it up correctly and tune it.


Stripe size could have been a problem, though we just went with the default there afair. Most of the first tries just followed the Sun docs; we later only changed things until performance was sufficient. Dedupe wasn't even implemented back then.

Maybe you also don't see as massive an impact because your hardware is a lot faster. X4200s were predominantly meant to be cheap, not fast. No cache, insufficient RAM, slow controllers, etc.


X4200s were the devil's work. Terrible BMC, raid controller, even the disk caddies were poorly designed.

The BMC controller couldn't speak to the disk controller so you had no out-of-band storage management.

I had to run a fleet of 300 of them, truly an awful time.


ZFS performs quite well if you give it boatloads of RAM. It uses its own cache layer, and it eats RAM for breakfast. XFS OTOH is as fast as the hardware can go with any amount of RAM.


Sort of. But no snapshots.

Wanna use LVM for snapshots? 33% performance hit for the entire LV per snapshot, by implementation.

ZFS? ~1% hit. I've never been able to see any difference at the workloads I run, whereas with LVM it was pervasive and inescapable.


That was with the old LVM snapshots. Modern CoW snapshots have a much smaller impact. Plus XFS developers are working on internal snapshots, multi-volume management, and live fsck (live check already works, live repair to come).


I don't doubt this but do you have any documentation?

Asking for a friend who uses XFS on LVM for disk heavy applications like database, file server, etc.


You would have to look at the implementation directly. The user documentation isn't great for documenting performance considerations, sadly.

Essentially it comes down to this: a snapshot LV contains copies of old blocks which have been modified in the source LV. Whenever a block is updated in the source LV, LVM will need to check whether that block has been previously copied into all corresponding snapshot LVs. For each snapshot LV where this is not the case, it will need to copy the block to that snapshot LV.

This means that there is O(n) complexity in the checking and copying. And in the case of "thin" LVs, it will also need to allocate the block to copy to, potentially for every snapshot LV in existence, making the process even slower. The effect is write amplification effectively proportional to the total number of snapshots.

ZFS snapshots, in comparison, cost essentially the same no matter how many you create, because the old blocks are put onto a "deadlist" of the most recent snapshot, and it doesn't need repeating for every other snapshot in existence. Older snapshots can reference them when needed, and if a snapshot is deleted, any blocks still referenced are moved to the next oldest snapshot. Blocks are never copied and only have a single direct owner. This makes the operations cheap.
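
A toy model of those two bookkeeping strategies (not real LVM or ZFS code, just to make the complexity difference concrete):

    #include <stdbool.h>
    #include <stddef.h>

    #define NBLOCKS 4096

    /* LVM-style: every snapshot keeps its own copy-on-write table, so one
     * overwrite in the origin may have to check (and copy into) each snapshot. */
    struct cow_snapshot { bool copied[NBLOCKS]; };

    static void lvm_style_overwrite(struct cow_snapshot *snaps, size_t nsnaps,
                                    size_t block)
    {
            for (size_t i = 0; i < nsnaps; i++) {      /* O(nsnaps) work per write */
                    if (!snaps[i].copied[block]) {
                            /* real code would copy the old block's data here and,
                             * for thin pools, allocate space for it as well */
                            snaps[i].copied[block] = true;
                    }
            }
    }

    /* ZFS-style: the superseded block pointer is appended once to the newest
     * snapshot's deadlist; older snapshots reference it from there. */
    struct deadlist { size_t blocks[NBLOCKS]; size_t count; };

    static void zfs_style_overwrite(struct deadlist *newest, size_t old_block)
    {
            newest->blocks[newest->count++] = old_block; /* O(1) work per write */
    }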


FreeNAS has good documentation on which hardware to pick and how to set up ZFS.


That's for the old "fat" LVM snapshots, right? No way the new CoW thin LVs have such a big overhead for snapshots.


There will be a much bigger overhead in accounting for all of the allocations from the "thin pool".

The overlying filesystem also lacks knowledge of the underlying storage. The snapshot must be able to accommodate writes up to and including the full size of the parent block device in order to remain readable, just like the old-style snapshots did. That's the fundamental problem with LVM snapshots; they can go read-only at any point in time if the space is exhausted, due to the implicit over-commit which occurs every time you create a snapshot.

The overheads with ZFS snapshots are completely explicit and all space is fully and transparently accounted for. You know exactly what is using space from the pool, and why, with a single command. With LVM separating the block storage from the filesystem, the cause of space usage is almost completely opaque. Just modifying files on the parent LV can kill a snapshot LV, while with ZFS this can never occur.


"After EoL a colleague installed Linux with dmraid, LVM and xfs on the same hardware: much faster, more robust."

Please let me know which company this is, so I can ensure that I never end up working there by accident. Much obliged in advance, thank you kindly.


Why? What is bad about playing around with leftover hardware?


Nothing at all; it's what was done to that hardware that's the travesty here. It takes an extraordinary level of incompetence and ignorance to even get the idea to slap Linux with dmraid and LVM onto that hardware, and then claim that it was faster and more robust, without understanding how unreliable and fragile that constellation is and that it was faster only because all the reliability was gone.


dmraid RAID5/6 loses data, sometimes catastrophically, in normal failure scenarios that the ZFS equivalent handles just fine. If a sector goes bad between the time when you last scrubbed and the time when you get a disk failure (which is pretty much inevitable with modern disk sizes), you're screwed.


> Writes in a raidz are always limited to a single disk’s performance

what? no. why would that be the case? You lose a single disk's performance due to the checksumming.

just from my personal NAS I can tell you that I can do transfers from my scratch drive (NVMe SSD) to the storage array at more than twice the speed of any individual drive in the array... and that's in rsync which is notably slower than a "native" mv or cp.

The one thing I will say is that it does struggle to keep up with NVMe SSDs, otherwise I've always seen it run at drive speed on anything spinning, no matter how many drives.


> what? no. why would that be the case? You lose a single disk's performance due to the checksumming.

I think they are probably referring to the write performance of a RAIDZ VDEV being constrained by the performance of the slowest disc within the VDEV.


true, if you have 7 fast disks and one slow disk in a raidz, you get 7 x slow disk performance.


Have you seen any benchmarks for the scenario you've described?


Have you got any info on how to do the required tuning that's geared towards a home NAS?


Group your disks in bunches of 4 or 5 per RAIDz, no more, and have each bunch on the same controller or SAS expander. Use striping over the bunches. Don't use hotspares; for performance, maybe avoid RAIDz6. Try out and benchmark a lot. Get more RAM, lots more RAM.


Back when I set up my last ZFS box running on OmniOS, 5 disks was not optimal, though I am running RAIDZ2.

But yes, lots of RAM.


I think the optimal number of RAIDz5 disks is 3, if you just want performance. But this wastes lots of space of course. Also, the number of SAS/SATA channels per controller and the topology of expanders is important. That's why I don't think there is a recipe; you have to try it out for each new kind of hardware.

And as another thread pointed out, stripe size is also an important parameter.


I think you mean RAIDZ1, not 5.


Yes. RAIDz (without the "1") was the original RAID5 equivalent, and RAIDz2 is equivalent to RAID6. However, since nobody really knows what the hell z1 and z2 are, and z1 is easy to mix up with RAID1 for non-ZFS people, calling them z5 and z6 is far less confusing.


It's the number of parity disks, pretty simple. There has been occasional talk of making the number arbitrary, though presently only raidz1, raidz2, and raidz3 exist.


I think speed is not the primary reason many (most?) people use ZFS; I think it's mostly about stability, reliability and maintainability.


> And I'm not at all interested in some "ZFS shim layer" thing either

If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

Expecting all Linux drivers to be GPL-licensed is unrealistic and just leads to crappy user experiences. nVidia is never going to release full-featured GPL'd drivers, and even cooperative vendors sometimes have NDAs which preclude releasing open source drivers.

Linux is able to run proprietary userspace software. Even most open source zealots agree that this is necessary. Why are all drivers expected to use the GPL?

---

P.S. Never mind the fact that ZFS is open source, just not GPL compatible.

P.P.S. There's a lot of technical underpinnings here that I'll readily admit I don't understand. If I speak out of ignorance, please feel free to correct me.


I am also not an expert in this space - but if I understand correctly the reason the linux Nvidia driver sucks so much is that it is not GPL'd (or open source at all).

There is little incentive for Nvidia to maintain a linux specific driver, but because it is closed source the community cannot improve/fix it.

> Why are all drivers expected to use the GPL?

I think the answer to this is: drivers are expected to use the GPL if they want to be mainlined and maintained - as Linus said: other than that you are "on your own".


> drivers are expected to use the GPL if they want to be mainlined and maintained

I think parent comment wasn't asking for third party, non-GPL drivers to be mainlined, but for a stable interface for out-of-tree drivers.


There is just no incentive for this that I can see. Linux is an open source effort. Linus had said that he considers open source "the only right way to do software". Out of tree drivers are tolerated, but the preferred outcome is for drivers to be open sourced and merged to the main Linux tree.

The idea that Linux needs better support for out of tree drivers is like someone going to church and saying to the priest "I don't care about this Jesus stuff but can I have some free wine and cookies please".

Full disclosure my day job is to write out of tree drivers for Linux :)


I would expect a large fraction of Nvidia's GPU sales to be from customers wanting to do machine learning. What platform do these customers typically use? Windows?

How do the Linux and Windows drivers compare on matters related to CUDA?


Nvidia has a proprietary Linux driver that works just fine for GPGPU purposes. But because it's not GPLed, it will never be mainlined into the kernel, so you have to install it separately. This is in contrast to AMD GPUs, for which the driver lives in the Linux kernel itself.


Critically, Nvidia has a GPL'd shim in the kernel code, which lets them keep a stable ABI. The kind of shim Linus isn't interested in for ZFS.


CUDA works fine, and I have found (completely non-rigorously) that a lot of the time where the workload is somewhat mixed between GPU and CPU you'll get better performance on Linux.

The _desktop_ situation is worse, though perfectly functional. But I boot into Windows when I want battery life and quiet fans on my laptop.


You make it sound like the idea is "if you GPL your driver, we'll maintain it for you", which is kinda bullshit. For one, kernel devs only really maintain what they want to maintain. They'll do enough work to make it compile but they aren't going to go out of their way to test it. Regressions do happen. More importantly though, they very purposefully do not maintain any stability in the driver ABI. The policy is actively hostile to the concept of proprietary drivers.

Which is really kind of hilarious considering that so much modern hardware requires proprietary firmware blobs to run.


My experience is that linux nvidia drivers are better than the competitors open source drivers.


Nvidia proprietary drivers work OK for me, mostly (I needed to spoof the video card ID so KVM could lie to the Windows drivers in my home VFIO setup, but it wasn't hard.)

But it means I can't use Wayland. Wayland isn't critical for me, but since NVidia is refusing to implement GBM and using EGLStream instead, there's nothing I can do about it. It simply isn't worth NVidia's time to make Wayland work, so I'm stuck using X. If the driver were open source, someone would have submitted a GBM patch and I wouldn't be stuck in this predicament.

I can't wait for NVidia to have real competition in the ML space so I can ditch them.


No you can use Wayland as long as your window manager/environment supports GBM. Gnome and KDE both do (Which for most Linux users is all that is needed).

Now you can't use something like Sway but their lead developer is too evangelical for my taste so even if I had an AMD/Intel card I would never use it.


> No you can use Wayland as long as your window manager/environment supports GBM.

You can do that on Intel and AMD drivers and other open source graphics drivers, which due to being open source allow 3rd parties like redhat to patch in GBM support in drivers and mesa when required.

Nvidia driver does not support GBM code paths. Therefore wayland does not work on nvidia. And because nvidia driver is not open source, someone else cannot patch GBM in.


I'm fairly sure parent meant 'EGLStream', not GBM. KDE and GNOME's Wayland compositors both support EGLStream.


Technically, you can use Wayland.

What you cannot use is applications that use OpenGL or Vulkan acceleration. GBM is used for sharing buffers across APIs handled by GPU. If your Wayland clients use just shm to communicate with compositor, it will work.


Is that experience recent? AMD drivers used to be terrible and Intel isn't even competition.


Depends also on the AMD GPU. Vega is fine, Raven Ridge had weird bugs last time I looked, with rx590 I couldn't even boot the proxmox 6.1 installer (it worked when I swapped in rx580 instead).

Why is Intel not competition? In laptops, I want only Intel, nothing else. It is the smoothest/most reliable/least buggy thing you can have.


Performance wise, Intel is streets behind.


I know. But do you need that performance for what you do on the computer?

For most uses, Intel GPU is fine.


But if you do need that performance, Intel isn't an option. If you don't, there is no reason to even consider Nvidia. They serve different needs.


I'm currently running a AMD card because I thought the drivers were better. I was mistaken, I still have screen tearing that I can't fix.

No doubt someone more knowledgeable about Linux could fix this issue, but I never had any issues with my nVidia blobs. That's not to say nVidia don't have their own issues.


this was my experience as well. I eventually bought an NVidia card to replace it so I could stop having problems. It's been smooth ever since.


I have both an Nvidia and an AMD card. AMDGPU is the gold standard.


This was true until relatively recently, but no longer.


> Expecting all Linux drivers to be GPL-licensed is unrealistic and just leads to crappy user experiences. nVidia is never going to release full-featured GPL'd drivers, and even corporative vendors sometimes have NDAs which preclude releasing open source drivers.

Nvidia is pretty much the only remaining holdout here on the hardware driver front. I don't see why they should get special treatment when the 100%-GPL model works for everyone else.


ZFS is not really GPL-incompatible either, but it doesn't matter. Between FUD and Oracle's litigiousness, the end result is that there is no way to overcome the impression that it is GPL-incompatible.

But it is a problem that you can't reliably have out-of-tree modules.

Also, Linus is wrong: there's no reason that the ZoL project can't keep the ZFS module in working order, with some lag relative to updates to the Linux mainline, so as long as you stay on supported kernels and the ZoL project remains alive, then of course you can use ZFS. And you should use ZFS because it's awesome.


> But it is a problem that you can't reliably have out-of-tree modules.

That is the bit I'm trying to get at. Yes it would be best if ZFS was just part of Linux, and maybe some day it can be after Oracle is dead and gone (or under a new leadership and strategy). But it's almost beside the point.

Every other OS supports installing drivers that aren't "part" of the OS. I don't understand why Linux is so hostile to this very real use case. Sure it's not ideal, but the world is full of compromises.


I'm not sure Linux is especially hostile. A new OS version of, say, Windows can absolutely break drivers from a previous version.


Linux absolutely is especially hostile. Windows will generally try to support existing drivers, even binary-only ones, and give plenty of notice for API changes. FreeBSD has dedicated compatibility with previous ABIs going several versions back. Linux explicitly refuses to offer any kind of stability for its API (i.e. they can and will break APIs even in minor patches), let alone its ABI.


Linux is generally not happy about seeing any out of tree drivers.

But that is also not without reason: Linux balances in a field where it is, and wants to stay, open source, while a lot of its users (and sometimes the companies paying some "contributors", too) are companies which are not always that happy about open source. So if it were easy to ship proprietary drivers and still get a good experience out of it, those companies would have very little incentive to ever make any in-tree GPL drivers, and Linux would run the risk of becoming a skeleton you can't use without accepting/buying drivers from multiple 3rd parties.

Though take that argument with a (large) grain of salt; there are counterarguments too. E.g. the LLVM project, which is much more permissive and still well maintained, but then it's also a very different kind of software.


There's a unique variable here and that's Oracle.

That shouldn't actually matter; it should just depend on the license. But millions in legal fees says otherwise.


>If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

As a Linux user and an ex android user, I absolutely disagree and would add that the GPL requirement for drivers is probably the biggest feature Linux has!


Yes, the oftentimes proprietary Android Linux drivers are such a pain. Not only do they make it harder to reuse the hardware outside of Android (e.g. in a laptop or similar), but they also tend to cause delays with Android updates and sometimes make it impossible to update a phone to a newer Android version even if the phone producer wants to do so.

Android did start making this less of a problem with HALs and such, but it's still a problem, just a smaller one.


There is a big difference between a company distributing a proprietary Linux driver, and the Linux project merging software under a GPL-incompatible license. In the first case it is the Linux developers who can raise the issue of copyright infringement, and it is the company that has to defend its right to distribute. In the latter, the roles are reversed, with the Linux developers having to argue that they are in compliance with the copyright license.

A shim layer is a poor legal bet. It assumes that a judge who might not have much technical knowledge will agree that putting this little bit of technical trickery between the two incompatible works somehow turns them from a single combined work into two cleanly separated works. It could work, but it could also very easily be seen as meaningless obfuscation.

> Why are all drivers expected to use the GPL

Because a driver is tightly dependent on the kernel. It is this relationship that distinguishes two works from a single work. An easy way to see this is how a music video works. If I create a file with a video part and an audio part, and distribute it, legally this will be seen as me distributing a single work. I also need additional copyright permission in order to create such a derivative work, rights that go beyond just distributing the different parts. If I were to argue in court that I am just distributing two different works, then the relationship between the video and the music would be put into question.

Userspace software is generally seen as independent work. One reason is that such software can run on multiple platforms, but the primary reason is that people simply don't see it as an extension of the kernel.


There is an "approved" method - write and publish your own kernel module. However, if your module is not GPL licensed it cannot be published in the Linux kernel itself, and you must keep up with the maintenance of the code. This is a relatively fair requirement imo.


...which is what the ZFS on Linux team are doing?

The issue here is that which parts of the kernel API are available to non-GPL modules has been made a moving target from version to version, which might as well be interpreted as "just don't bother anymore".


I wonder if this was exactly what they intended, i.e. "just don't bother writing out-of-tree drivers anymore; put them under the GPL and into the tree". And ZFS might just have been accidentally hit by this, but it is in a situation where it can't put things into the tree...


> If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

It's a feature, not a bug. Linux is intentionally hostile to binary-blob drivers. Torvalds described his decision to go with the GPLv2 licence as "the best thing I ever did". [0]

This licensing decision sets Linux apart from BSD, and is probably the reason Linux has taken over the world. It's not that Linux is technically superior to FreeBSD or OpenSolaris.

> Expecting all Linux drivers to be GPL-licensed is unrealistic and just leads to crappy user experiences

'Unrealistic'? Again, Linux took over the world!

As for nVidia's proprietary graphics drivers, they're an unusual case. To quote Linus: "I personally believe that some modules may be considered to not be derived works simply because they weren't designed for Linux and don't depend on any special Linux behaviour" [1]

> Why are all drivers expected to use the GPL?

Because of the 'derived works' concept.

The GPL wasn't intended to overreach to the point that a GPL web server would require that only GPL-compatible web browsers could connect to it, but it was intended to block the creation of a non-free fork of a GPL codebase. There are edge-cases, as there are with everything, such as the nVidia driver situation I mentioned above.

[0] https://en.wikipedia.org/w/index.php?title=History_of_Linux&...

[1] https://en.wikipedia.org/w/index.php?title=Linux_kernel&oldi...


> If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

The problem is already addressed: if someone wants to contribute code to the project then its licensing must be compatible with the prior work contributed to the project. That's it.


But why are all drivers expected to be "part of the project"? We don't treat userspace Linux software that way. We don't consider Windows drivers part of Windows.


It's pretty simple: once they expose such an API they'd have to support it forever, hindering options for refactoring (which happens all the time). With all the drivers in the tree, they can simply update every driver at the same time to whatever new in-kernel API they're rolling out or removing. And since the majority of drivers would arguably have to be GPL anyway, and thus open source, the advantages of keeping all the drivers in tree are high, and the disadvantages low.

With that, they do expose a userspace filesystem driver interface, FUSE. There used to be a FUSE ZFS driver, though I believe it's mostly dead now (But I never used it, so I don't know for sure). While it's not the same as an actual kernel FS driver (performance in particular), it effectively allows what you're asking for by exposing an API you can write a filesystem driver against without it being part of the kernel code.
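
To give a feel for how small a FUSE filesystem can be, here's a sketch of a read-only filesystem exposing a single file, written against the classic libfuse 2.x API (something like `gcc hello.c $(pkg-config --cflags --libs fuse)` should build it; this mirrors the well-known libfuse hello example rather than anything ZFS-specific):

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>
    #include <string.h>
    #include <errno.h>

    static const char *msg  = "hello from userspace\n";
    static const char *path = "/hello";

    /* Report a root directory containing exactly one read-only file. */
    static int hello_getattr(const char *p, struct stat *st)
    {
            memset(st, 0, sizeof(*st));
            if (strcmp(p, "/") == 0) {
                    st->st_mode = S_IFDIR | 0755;
                    st->st_nlink = 2;
            } else if (strcmp(p, path) == 0) {
                    st->st_mode = S_IFREG | 0444;
                    st->st_nlink = 1;
                    st->st_size = strlen(msg);
            } else {
                    return -ENOENT;
            }
            return 0;
    }

    static int hello_readdir(const char *p, void *buf, fuse_fill_dir_t fill,
                             off_t off, struct fuse_file_info *fi)
    {
            if (strcmp(p, "/") != 0)
                    return -ENOENT;
            fill(buf, ".", NULL, 0);
            fill(buf, "..", NULL, 0);
            fill(buf, path + 1, NULL, 0);   /* "hello" */
            return 0;
    }

    static int hello_read(const char *p, char *buf, size_t size, off_t off,
                          struct fuse_file_info *fi)
    {
            size_t len = strlen(msg);
            if (strcmp(p, path) != 0)
                    return -ENOENT;
            if ((size_t)off >= len)
                    return 0;
            if (off + size > len)
                    size = len - off;
            memcpy(buf, msg + off, size);
            return (int)size;
    }

    static const struct fuse_operations hello_ops = {
            .getattr = hello_getattr,
            .readdir = hello_readdir,
            .read    = hello_read,
    };

    int main(int argc, char *argv[])
    {
            /* Mount with: ./hello /some/mountpoint */
            return fuse_main(argc, argv, &hello_ops, NULL);
    }

The convenience is obvious; the performance cost of bouncing every VFS operation through userspace is the trade-off people complain about.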


Given that the kernel is nearly 30 years old, do you not find it slightly incredible that there has been no effort to stabilise the internal ABI while every other major kernel has managed it, including FreeBSD?

There are ways and means to do this. It would be perfectly possible to have a versioned VFS interface and permit filesystems to provide multiple implementations to interoperate with different kernel versions.

I can understand the desire to be unconstrained by legacy technical debt and be able to change code at will. I would find that liberating. However, this is no longer a project run by dedicated amateurs. It made it to the top, and at this point in time, it seems undisciplined and anachronistic.


You know, come to think of it, is there anything stopping Linux from having a... FKSE (Filesystem in Kernel SpacE) standard API?

Presumably, such a thing would just be a set of kernel APIs that would parallel the FUSE APIs, but would exist for (DKMS) kernel modules to use, rather than for userland processes to use. Due to the parallel, it would only be the work of a couple hours to port any existing FUSE server over into being such a kernel module.

And, given how much code could be shared with FUSE support, adding support for this wouldn't even require much of a patch.

Seems like an "obvious win", really.


It's not the context switch that kills you for the most part, but the nature of the API and its lack of direct access to the buffer cache and VMM layer. Making a stable FKSE leads to the same issues.

That's why Microsoft moved WSL2 to being a full kernel running on Hyper-V rather than in-kernel. Their IFS (installable filesystem driver) stack gets the placement of the buffer cache manager wrong, and it was pretty much impossible to change. At that point, the real apples-to-apples comparison left NT lacking: running a full kernel in another VM ended up being faster because of this.


I mean, it doesn't really work that way, you can't just port a userspace program into a kernel module. For starters, there's no libc in the kernel - what do you do when you want to call `malloc`? ;)

With that, I doubt the performance issues are directly because it runs in userspace, they're likely due to the marshaling/transferring from the in-kernel APIs into the FUSE API (And the complexity that comes with talking to userspace for something like a filesystem), as well as the fact that the FUSE program has to call back into the kernel via the syscall interface. Both of those things are not easily fixable - FKSE would still effectively be using the FUSE APIs, and syscalls don't translate directly into callable kernel functions (and definitely not the ones you should be using).


The hard part isn't the "FKSE API", the hard part is for the "FKSE driver" to be able to do anything other than talk to that API. Like, scheduling, talking to storage, the network, whatever is needed to actually implement a useful filesystem.


The problem is that nobody is interested in doing that and that's why we are in this situation in the first place. If Oracle wanted to integrate ZFS into Linux they would just relicense it.


> With that, they do expose a userspace filesystem driver interface, FUSE.

Yes, which Linus has also pooh-poohed:

"People who think that userspace filesystems are realistic for anything but toys are just misguided."


I mean, he's right. VFS, VMM, and buffer cache are all three sides of the same coin. Nearly every system that puts the FS in user space has abysmal performance; the one exception I can think of off the top of my head is XOK's native FS which is very very very different than traditional filesystems at every layer in the stack, and has abysmal performance again once two processes are accessing the same files.


Oh, I totally agree. But between that statement and this one about ZFS, the takeaway seems to be: for filesystems on Linux, go GPL or go home. Which is fine if that's his attitude, but if so I do wish he'd be more direct about it rather than making claims that are questionable at best (e.g. "ZFS is not maintained"--wtf?).


And yet people use them all the damn time because they're incredibly useful and even more importantly are relatively easy to put together compared to kernel modules.

Linus is just plain wrong on this one.


You should read the full quote, he really doesn't disagree with you:

> fuse works fine if the thing being exported is some random low-use interface to a fundamentally slow device. But for something like your root filesystem? Nope. Not going to happen.

His point is that FUSE is useful and fine for things that aren't performance critical, but it's fundamentally too slow for cases where performance is relevant.


The problem with FUSE file systems is not that they aren't part of the kernel's VCS repo, but that it requires a context switch to user-space.


> But why are all drivers expected to be "part of the project"? We don't treat userspace Linux software that way.

It's the policy of Linux development at work. The Linux kernel doesn't break userspace: you can safely upgrade the kernel and your userspace will keep working. But the kernel breaks internal APIs freely, and kernel developers take responsibility for all the in-tree code. So if a patch in the memory management subsystem breaks some drivers, kernel developers will find the breakages and fix them.

> We don't consider Windows drivers part of Windows.

Yeah, because the Windows kernel breaks backward compatibility in kernel space less frequently, and because hardware vendors are willing to maintain drivers for Windows.


You can license both kernel modules or FUSE implementation any way you see fit. That's a non-issue.

https://www.kernel.org/doc/html/latest/process/license-rules...

It seems that some people are oblivious to the actual problem, which is that they want their code to be mixed into the source code of a software project without complying with the rights holder's wishes, as if that will shouldn't be respected.

> We don't consider Windows drivers part of Windows.

I'm not sure you can commit your source code to the Windows kernel project.


No, no one wants to force ZFS into the Linux kernel. I think everyone agrees that it needs to be out-of-tree the way things currently stand.

The problem is the nature of the changes, and people questioning whether there is any good _technical_ reason for some of them to be done the way they are.


Because running proprietary binaries in kernel space is not a good idea nor is it compatible with the vision of Linux?


ZFS isn't proprietary; it's merely incompatible with the GPL.


> If there is no "approved" method for creating Linux drivers under licenses other than the GPL, that seems like a major problem that Linux should be working to address.

It's less a thing Linux can work on than a thing lawmakers/courts would have to make binding decisions on, which would make it clear whether this usage is OK or not. But in practice this can only be decided on a case-by-case basis.

The only way Linux could work on this is by:

1. Adding an exception to their GPL license to exclude kernel modules from GPL constraints (which obviously won't happen, for a bunch of reasons).

2. Turning Linux into a microkernel with user-land drivers and driver interfaces that are not license-encumbered (which again won't happen, because this would be a completely different system).

3. Oracle re-licensing ZFS under a permissive open source license (e.g. dual-licensing it; it doesn't need to be GPL, just GPL-compatible, e.g. Apache v2). Guess what, that won't happen either, or at least I would be very surprised. Oracle is running out of products people _want_ to buy from them and increasingly (ab-)uses the license/copyright/patent system to earn its money and force people to buy its products (or at least somehow pay license fees).


>[...] that seems like a major problem that Linux should be working to address [...] Why are all drivers expected to use the GPL?

Vendors are expected to merge their drivers in mainline because that is the path to getting a well-supported and well-tested driver. Drivers that get merged are expected to use a GPL2-compatible license because that is the license of the Linux kernel. If you're wondering why the kernel community does not care about supporting an API for use in closed-source drivers, it's because it's fundamentally incompatible with the way kernel development actually works, and the resulting experience is even more crappy anyway. Variations of this question get asked so often that there are multiple pages of documentation about it [0] [1].

The tl;dr is that closed-source drivers get pinned to the kernel version they're built for and lag behind. When the vendor decides to stop supporting the hardware, the drivers stop being built for new kernel versions and you can basically never upgrade your kernel after that. In practice it means you are forced to use that vendor's distro if you want things to work properly.

>[...] nVidia is never going to release full-featured GPL'd drivers.

All that says to me is that if you want your hardware to be future-proof, never buy nvidia. All the other Linux vendors have figured out that it's nonsensical to sell someone a piece of hardware that can't be operated without secret bits of code. If you ever wondered why Linus was flipping nvidia the bird in that video that was going around a few years ago... well now you know.

[0]: https://www.kernel.org/doc/html/latest/process/kernel-driver...

[1]: https://www.kernel.org/doc/html/latest/process/stable-api-no...


> Linux is able to run proprietary userspace software. Even most open source zealots agree that this is necessary. Why are all drivers expected to use the GPL?

To answer your excellent question (and ignore the somewhat unfortunate slam on people who seem to differ with your way of thinking), it is an intentional goal of software freedom. The idea of a free software license is to allow people to obtain a license to the software if they agree not to distribute changes to that software in such a way that downstream users have fewer options than they would with the original software.

Some people are at odds with the options available with licenses like the GPL. Some think they are too restrictive. Some think they are too permissive. Some think they are just right. With respect to your question, it's neither here nor there whether the GPL is hitting a sweet spot or not. What's important is that the original author has decided that it did and has chosen the license. I don't imagine that you intend to argue that a person should not be able to choose the license that is best for them, so I'll just leave it at that.

The root of the question is "What determines a change to the software". Is it if we modify the original code? What if we add code? What if we add a completely new file to the code? What if we add a completely new library and simply link it to the code? What if we interact with a module system at runtime and link to the code that way?

The answers to these questions are not well defined. Some of them have been tested in court, while others have not. There are many opinions on which of these constitutes changing of the original software. These opinions vary wildly, but we won't get a definitive answer until the issues are brought up in court.

Before that time period, as a third party who wishes to interact with the software, you have a few choices. You can simply take your chances and do whatever you want. You might be sued by someone who has standing to sue. You might win the case even if you are sued. It's a risk. In some cases the risk is higher than others (probably roughly ordered in the way I ordered the questions).

Another possibility is that you can follow the intent of the original author. You can ask them, "How do you define changing of the software". You may agree with their ideas or not, but it is a completely valid course of action to choose to follow their intent regardless of your opinion.

Your question is: why are all drivers expected to use the GPL? The answer is because drivers are considered by the author to be an extension of the software and hence to be covered by the same license. You are absolutely free to disagree, but it will not change the original author's opinion. You are also able to decide not to abide by the author's opinion. This may open you up to the risk of being sued. Or it may not.

Now, the question unasked is probably the more interesting question. Why does Linus want the drivers to be considered an extension of the original software? I think the answer is that he sees more advantages in the way people interact in that system than disadvantages. There are certainly disadvantages and things that we currently can't use, but for many people this is not a massive hardship. I think the question you might want to put to him is, what advantages have you realised over the years from maintaining the license boundaries as they are? I don't actually know the answer to this question, but would be very interested to hear Linus's opinion.


Sorry for using the term "zealots", I didn't intend it as a pejorative. I should probably have said "hardliners". I meant only to refer to people at the extreme end of the spectrum on this issue.

> The root of the question is "What determines a change to the software". [...] The answers to these questions are not well defined.

And that's fair, but what confuses me is that I never see this question raised on non-Linux platforms. No one considers Windows drivers a derivative of Windows, or Mac kernel extensions a derivative of Darwin.

Should the currently-in-development Windows ZFS port reach maturity and gain widespread adoption (which feels possible!), do you foresee a possibility of Oracle suing? If not, why is Linux different?


>No one considers Windows drivers a derivative of Windows, or Mac kernel extensions a derivative of Darwin.

Perhaps they do, but the difference is that their licensing does not regard their status of derivative works as being important. Those platforms have their own restrictions on what drivers they want to allow. In particular, Mac doesn't even allow unsigned drivers anymore, and any signed drivers have to go through a manual approval process. And don't forget iOS, which doesn't even support user-loadable drivers at all.

>Should the currently-in-development Windows ZFS port reach maturity and gain widespread adoption (which feels possible!), do you foresee a possibility of Oracle suing? If not, why is Linux different?

I'm not sure, I haven't used Windows in many years and I don't know their policies. But see what I said earlier: the simple answer is that the license is different from the license of Linux. For more details, the question you should be asking is: Is the CDDL incompatible with Windows licensing?


Thank you!

Just to clarify one little thing, because it appears to be something of a common misconception:

> Mac doesn't even allow unsigned drivers anymore

You can absolutely still install unsigned drivers (kernel extensions) on macOS, the user just needs to run a Terminal command from recovery mode. This is a one-time process that takes all of five minutes if you know what you're doing.

You can theoretically replace the Darwin kernel with your own version too. macOS is not iOS, you can completely open it up if you want.


This is nonsense. The problem is not getting ZFS bundled with Linux like he is implying here. The problem is how Linux artificially restricts what APIs your module is able to access based on the license, so you wouldn't be able to use ZFS even by your own prerogative like he is suggesting.

He is claiming that it comes down to the user's choice, which would be just fine if that were true. The only problem here is that Linux has purposely taken steps to hinder that choice.


I'll give up Linux on my servers before I give up ZFS

especially so given the recent petulant attitude that broke API compatibility in the LTS branch just to spite the ZFS developers: https://news.ycombinator.com/item?id=20186458

compete honestly on technical merit, rather than pulling dirty tricks that you'd expect of Oracle or 1990's MS


Pretty much my view as well. If Linux becomes incompatible with ZFS in any way, I'll switch to FreeBSD.

That said, after the Oracle Java debacle, I can see why Linus would not be receptive towards merging ZFS into the kernel. I just wish he argued the point on legal issues alone instead of making up stories about non-existent technical flaws in ZFS. The whole thing is basically a work of art. Oracle should consider GPL-ing it and integrating it into Linux directly.


I have been using FreeBSD for its better ZFS support for years, and it's great. Highly recommended.


Same here. I immediately ditched Solaris (hated it) for Linux when KQ Infotech ported ZFS over about a decade ago, but after a few years switched all my file servers to FreeBSD. ZFS > Linux. Did I mention I love Linux?


FreeBSD has been using the ZFS on Linux code for a while.


Not true, not yet.


Thanks for the correction, they decided to go that route in 2018, seems like it’s not live yet.


> Oracle should consider GPL-ing it and integrating it into Linux directly

I think Apache (or BSD/MIT) would be far more palatable, as GPL'ing it would cut off the BSDs as well as OpenSolaris, which would certainly be a bummer.


Por que no los dos? It's their IP, they can license it under several licenses to maximize adoption. This being Oracle though, I don't think it'll ever happen.


> If Linux becomes incompatible with ZFS in any way, I'll switch to FreeBSD

Isn't FreeBSD now using ZFSOnLinux project as well?


no. see above


Yeah... I use FreeBSD for file servers because I don't even have to pay attention to this constant ZFSonLinux drama. I treat them almost like appliances. Linux servers are more than happy to use them on the back-end.


how do you structure that? block images served by iscsi? shares served by nfs/samba?


All of the above, depending on what I'm trying to accomplish. Usually Samba for Windows clients and general bulk storage (pretty much everything can do CIFS mounts at some basic level these days), NFS for *nix and VMWare clients (esp where I can leverage NFSv4), iSCSI for various block image needs, scp/sftp/rsync. All of this is basic out of box for FreeBSD.

In operations, I treat them as semi-black-box (grey-box?) appliance where they only do the file storage function and are moral equivalents of Network Appliance NAS boxes. I don't try to convince the Linux or Windows teams to migrate other workloads to FreeBSD, and mostly don't want them to since that would mean Yet Another app environment to support.

FreeBSD has a really efficient network stack, so I can attach the FreeBSD stores at 10GB (usually 2-4 bonded 10GB links) and it keeps up fine. I have colleagues who are doing the same with 40GB links (40-160GB aggregate) to backbone networks, and apparently there are many shops hooking up FreeBSD with 100GB links. The limiting factor seems to be ZFS and the storage subsystems supporting it, not the network, which is interesting, but I don't have the ability to benchmark the 40GB and higher stuff.
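
For the NFS side at least, ZFS's share property keeps the exports painless. A rough sketch, not an actual config; the dataset and network are made up:

    # /etc/rc.conf needs nfs_server_enable="YES" (plus rpcbind/mountd)
    zfs create tank/exports
    zfs set sharenfs="-maproot=root -network=10.0.0.0/24" tank/exports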


The core development team hasn't really been fully trustworthy since they spent years pretending their CPU scheduler wasn't hot garbage for desktop usage, denied the need for a pluggable scheduler that would allow multiple schedulers to be selected from, then seemingly an age later implemented something in the same vein as CK while giving zero credit.


As a heavy user of ZFS and Linux, what else is there that even comes close to what ZFS offers?

I want cheap and reliable snapshots, export & import of file systems like ZFS datasets, simple compression, caching facilities (like SLOG and ARC), and decent performance.


Bcachefs is probably the only thing that will get there. The codebase is clean and well maintained, built from solid technology (bcache), and will include most of the ZFS niceties. I just wish more companies would sponsor the project and stop wasting money on BTRFS.


Yes, I’m eagerly waiting for Bcachefs to get there at some point, but it is several years away (rightly so, because it is hard and the developer is doing an amazing job) if my understanding of its state is correct.

I have heard of durability issues with btrfs, and do not want to touch it if it fails with its primary job.


Which is why ZFS is still a thing today - there are no other alternatives. Everything is coming "soon" while ZFS is actually here and clocking up a stability track-record.


>Bcachefs is probably the only thing that will get there.

Or Bcachefs is probably the only thing that might get there.

The amount of engineering hours that went into ZFS is insane. It is easy to get a project that has 80% similarity on the surface, but then you spend as much time on the last 20% and the edge cases as you did getting from 0 to 80%. ZFS has been battle-tested by many. Rsync.net runs on ZFS.

The number of petabytes stored safely in ZFS over the years gives peace of mind.

Speaking of rsync.net, a ZFS topic on HN normally has him resurface. Haven't seen any reply from him yet.


I’m looking forward to bcachefs becoming feature complete and upstreamed. We finally have a good chance of having a modern and reliable FS in the Linux kernel. My wish list includes snapshots and per volume encryption.


What if the main purpose of BTRFS is to have something "good enough" so no one starts working on a project that can compete with large commercial storage offerings?

Does anyone remember the parity patches they rejected in 2014?

> Your work is very very good, it just doesn’t fit our business case.

I haven't followed it much. Does it have anything more than mirroring (that's stable) these days?


>stop wasting money on BTRFS

You're saying they should stop supporting a project that was considered stable by the time the other started being developed. Why do that? What makes Bcachefs a better choice?


Take a cursory look into both codebases, and at the stability of every feature at launch and under maintenance. It's not hard to see BTRFS is a doomed project. Bcachefs is more like PostgreSQL: the developer doesn't add features until he has a solid, well-thought-out design. Hence why he hasn't implemented snapshots yet.

I don't think too many people consider it stable enough for production, either. (Unless you count a very limited subset of its functionality).

I'd rather run Bcachefs today than Btrfs, by a mile. At least with bcachefs I won't lose my data.


If you are on BTRFS and you encounter an unrecoverable bug (which seems to be reasonably common), the developers will most likely recommend you wipe the drive and restore from backups (because you had backups, right?)

Even if the data is still on the drive and a bugfix would make the filesystem recoverable again, they don't have the time/knowledge/resources to untangle that codebase and make fixes. Even BTRFS developers don't trust the filesystem with their own data.

If you are on Bcachefs and you encounter an unrecoverable bug, the developer will ask for some logs, or reproduction steps, or potentially even remote debugging access to your corrupt filesystem.

And then he will fix the bug, releasing a new version that can read/repair your filesystem. He knows his codebase like the back of his hand.

In my research, I couldn't find any examples of someone actually losing data due to Bcachefs. All the bugs appeared to be "data has been written to drive, but bug prevented reading"

While I would still hesitate to trust Bcachefs, I would trust it way more than BTRFS.


Just want to note that bcachefs looks great (I was sort of tangentially aware but hadn't dedicated significant attention to it).

Definitely something to try out (backing up my home servers is just about to reach viability for me, so I'd definitely consider switching to it in that use case).

Thanks!


Btrfs is the only FS I used that resulted in complete FS corruption losing nearly all data on disk, not once, but 3 times.

After that, none of the features like compression, snapshots, COW or checksums meant anything to me. I'm much happier with ext4 and xfs on lvm.


Anecdote, I know, but I have about a dozen machines with BtrFS volumes, all active with varying loads, and I have never experienced data loss. It seems some features are more mature than others - only two of the volumes span more than one disk and none has files larger than a physical volume (even though one of the multi-device volumes is striped).


In the 26 years or so I have used Linux, I have had corrupted filesystems with reiserfs, XFS, btrfs, and ext[23]. In the case of reiserfs and XFS it was practically impossible to recover the filesystem (IIRC reiserfs would reattach anything that resembled a B-tree). For ext[23], it was surprisingly easy to get back most of the data. Never had any corruption with ZFS or ext4. I didn't try to fix the btrfs filesystem, since it was a machine that had to be repurposed anyway.


My experience with recovering btrfs is that you get back most of your files, but with the content replaced with random gibberish. Which is not too useful.

In a way, I would rather it bomb out and declare a total loss than to keep sinking more time into it as it leads you along.


When was it that XFS got corrupted on you? Since Red Hat has embraced XFS, I assume it's quite good now.


Somewhere between being merged in mainline and 2009.


Funny, the other day on another HN thread someone was saying btrfs is good; even though I said Red Hat has abandoned the btrfs ship, he said Facebook has been using it heavily.

But seeing how so many people had lost data using it, I will never use btrfs...


I don't think BTRFS has ever been considered stable.

I think they just said: "The on-disk data structure is stable" and lots of people misinterpreted that as "the whole thing is stable"

A stable on-disk data structure just means it's been frozen and can't be changed in non-backwards compatible ways. It says nothing about code quality, feature completeness or if the frozen data structure was any good.


The finalization of the on-disk data structure came soon after Btrfs was announced and happened before 2010. I meant that by 2010s when Bcachefs started development, Btrfs was considered a supported filesystem for "big name" server distros such as Oracle and SUSE.


Snapshots don't seem to be done yet.


Kent has admitted (many times) that snapshots are one of the more difficult features to add in a reliable and safe way, and will require significant work to do right, especially for what he wants to see them do (I assume "really damn fast and low overhead" is a major one, plus some other tricks he has up his sleeve.) So he has intentionally not tackled them yet, instead going after a slew of other features first. Reflink, full checksum, replication, caching, compression, native encryption, etc. All of that works today.

Snapshots are a huge feature for sure, but it's not like bcachefs is completely incapable without them.

There was a very recent update he gave in late December (2019) that mentioned he's actively chipping away at the roadblocks for snapshots.


They're being worked on ATM: Dec 29, 2019 "Just finished a major rework that gets us a step closer to snapshots: the btree code is incrementally being changed to handle extents like regular keys." https://www.patreon.com/posts/towards-32698961


That's exactly why I said it's probably the only one that will get there.


Heh, BTRFS deja vu. Been hearing about the ZFS alternative "not quite there, but catching up" for about as long as high-speed rail. I wonder which will arrive first :)


BTRFS is never going to become stable. Ever. Just take a quick dive into the codebase.

Bcachefs has never had an unrecoverable data error AFAIK, even though it isn't even considered stable enough to merge into the kernel. The bcachefs on-disk format won't be considered stable until he merges his code into mainline, though he doesn't feel he will need to adjust it further.

Features that currently work:

- full data checksumming

- compression

- multiple device support

- tiering/writeback caching

- RAID1/RAID10

All of these are stable, tested, and mostly bug free. Honestly, once the code gets mainlined you'll be able to start using it very quickly.

The main issue right now is performance, as it is about as slow as BTRFS, which isn't inspiring. However, the author has stated that he's going for correctness first, then he'll begin optimizing.


I know this isn't an option for everyone, but this is part of why I run FreeBSD instead of Linux for servers where I need ZFS.


This isn't why I started running FreeBSD, but it is also one of the reasons I continue to run FreeBSD.


Yes, I run Linux for business but keep a FreeBSD box personally so I'm used to it in case I need ZFS for business.


I agree that ZFS has a lot to offer. But the legal difficulties in merging ZFS support into the mainline kernal are understandable. It's a shame but I think he is making the right call.


Merging into the mainline kernel is not what the person he is replying to was even asking for. All they were asking is for Linux to stop putting APIs behind DRM that prevents non-GPL modules like ZFS from using them. That doesn't mean ZFS must be bundled with Linux.

I think everyone is in agreement that ZFS can't be included in the mainline kernel. The question is just if users should be able to install and use it themselves or not.


Thanks, I should have read more into this.

The follow up actually clears things up pretty well. https://www.realworldtech.com/forum/?threadid=189711&curpost...


Kernal? If you can merge zfs support into 8KB kernal then you are not a mere mortal, so no need to worry about any legal difficulties.


XFS on an LVM thin pool LV should give you a very robust FS, cheap CoW snapshots, and multi-device support. If you want, you can put the thin pool on RAID via LVM RAID underneath it.

For import/export, IIRC XFS has support for it (xfsdump/xfsrestore), and you can dump from LV snapshots to get atomicity.

For caching there is LVM cache, which again should be possible to combine with the thin pool & RAID. Or you can use it separately for a normal LV.

All this is functionality tested by years of production use.

For compression/deduplication, that is AFAIK work in progress upstream based on the open sourced VDO code.
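
For illustration, a minimal sketch of that stack (assuming a volume group named vg0 already exists; all names are made up):

    lvcreate --type thin-pool -L 500G -n pool0 vg0    # the thin pool
    lvcreate --thin -V 1T -n data vg0/pool0           # thin LV (can overcommit)
    mkfs.xfs /dev/vg0/data
    lvcreate --snapshot --name data_snap vg0/data     # cheap CoW snapshot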


Interesting combination of tools I have used independently but never as a replacement of my beloved ZFS.

Never made snapshots with LVM. Always used LVM as a way to carve up logical storage from a pool of physical devices but nothing more. I need to RTFM on how snapshotting would work there - could I restore just a few files from an hour ago while letting everything else be as they are?

With ZFS, I use RAM as a read cache (ARC) and an Optane disk as a sync write cache (SLOG). I wonder if LVM cache would let me do such a thing. Again, a pointer for more manual reading for me.

Compression is a nice to have for me at this moment. Good to know that it is being worked on at the LVM layer.


IIRC you can mount any of the snapshots & copy files from it without influencing the others & the thin LV itself. As for RAM caching, I'm not sure LVM would allow an LVM cache residing on a RAM disk PV, but isn't the regular transparent Linux page cache sufficient, actually?
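
Roughly like this, I think (names made up; thin snapshots skip activation by default, hence the -K, and XFS needs nouuid to mount a snapshot alongside its origin):

    lvchange -ay -K vg0/data_snap                      # activate the thin snapshot
    mount -o ro,nouuid /dev/vg0/data_snap /mnt/restore
    cp -a /mnt/restore/path/to/file /srv/data/path/to/file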

For some reading about LVM thin provisioning:

http://man7.org/linux/man-pages/man7/lvmthin.7.html

https://access.redhat.com/documentation/en-us/red_hat_enterp...


Call me when somebody like a major cloud provider has used this system to drive millions of hard drives. I'm not going to patch my data security together like that.

There is a difference between 'all these tools have been used in production' and 'this is an integrated tool that has been used for 15+ years in the biggest storage installations in the world'.


Yes! The problem with the LVM approach to replicating anything ZFS does is that you have to use a myriad of different tools. And then you have to pray that they all work correctly together; if one has a bug, you may lose all your data because of the data corruption that can emerge from it.


Honestly asking: how does Btrfs compare to ZFS?

There's also Lustre but it's a different beast altogether for a different scenario.


On the surface, btrfs is pretty close to zfs.

Once you actually use them, you discover all the ways that btrfs is a pain and zfs is a (minor) joy:

- snapshot management

- online scrub

- data integrity

- disk management

I lost data from perfectly healthy-appearing btrfs systems twice. I've never lost data on maintained zfs systems, and I now trust a lot more data to zfs than I ever have to btrfs.


At least disk management is far easier with btrfs. You can restripe at will while zfs has severe limitations around resizing, adding and removing devices.

Granted, at enterprise scale this hardly matters because you can just send-receive to rebuild pools if you have enough spares, but for consumer-grade deployments it's a non-negligible annoyance.


Restriping is a source of unsafety, though. A lot of ZFS's data safety comes from the fact that it doesn't support overwriting anything in place, so normal operation can't introduce unrecoverable corruption. In fact, all writes are copy-on-write into new blocks.


ZFS wanted to have that too (the mythical block pointer rewrite), but it never happened; instead they added clunky workarounds like indirection tables.


It was treated more like "ok, yet another person complaining about it - here's what you need to implement, and why you won't".

The indirection tables are survivable for fixing short term mistakes, though.


Actually, this matters a lot in many enterprises. Beancounters hate excess capacities, so there are never enough spares and everything is always almost full.

Maybe SV is different...


Since the plural of anecdote is data, I'll provide mine here. ZFS is the only file-system from which I've lost data on hardware that was functioning properly, though that does come with a caveat.

Twice btrfs ended up in a non-mountable situation, but both times it was due to a known issue and #btrfs on freenode was able to walk me through getting it working again.

With ZFS, I ended up with a non-mountable system, and the response in both #zfs and #zfsonlinux to me posting the error message was, "that sucks, hope you had backups." Since I had backups, and it was my laptop 2000 miles from home that was my only computing device, I didn't dig deeper to see if I could discover the problem. FWIW, I've been using ZFS on that same hardware for almost 2 years since with no issues.


Thanks for your answer and sorry for your data loss.

> I lost data from perfectly healthy-appearing btrfs systems twice.

I still consider btrfs as beta-level software. This is why I never looked into it very seriously and asked this question.

Looks like btrfs needs something like another five years to be considered serious at the scale where ZFS is just starting to warm up.


The one thing I can't understand about btrfs is the unknown answer to the question "How much disk space do I have left?". I don't get why that's a "this much, maybe" answer.


    # btrfs filesystem usage /
    Overall:
        Device size:         142.86GiB
        Device allocated:     48.05GiB
        Device unallocated:   94.81GiB
        Device missing:          0.00B
        Used:                 37.75GiB
        Free (estimated):    103.94GiB  (min: 103.94GiB)
        Data ratio:               1.00
        Metadata ratio:           1.00
        Global reserve:       82.20MiB  (used: 0.00B)


"Free (estimated)"


btrfs is such a mess that for a database or VM to be marginally stable, you have to disable the CoW featureset for those files with the +C attribute. It's nowhere near a serious solution.


Btrfs has eaten my data, and once that happens I will never, ever, ever, literally ever go back to that system. It's unacceptable to me that a system eats data, especially after multiple rounds of "it's stable now".

But in the end it always turns out that it's only not going to eat your data if you 'use' it correctly.

I used ZFS for far longer and had far fewer issues.


Stratis and VDO have a lot of promise, although it's still a little early. The approach that Stratis has taken is refreshing. It's very simple and reuses lots of already existing stuff so by the time it's released it will already be mature (since the underlying code has been running for many years).

Once a little more guidance comes out about how to properly use VDO and Stratis together, I'll move my personal stuff to it.


So besides the obvious btrfs answer, what about ceph as clustered storage with very fast connectivity?

There is also BeeGFS, I haven't used it but /r/datahoarders sometimes touts it.

Not for Linux, but I have been keeping an eye on Matt Dillon's DragonFly BSD, where he has been working on HAMMER2, which is very interesting.

I don't know much but bcachefs has been making more waves lately also.

I think the bottom line is that people need to have good backup in place regardless.


Does btrfs meet your requirements?


I've tried btrfs without much luck.

btrfs still has a write hole for RAID5/6 (the kind I primarily use) [0] and has since at least 2012.

For a filesystem to have a bug leading to data loss unpatched for over 8 years is just plain unacceptable.

I've also had issues even without RAID, particularly after power outages. Not minor issues but "your filesystem is gone now, sorry" issues.

[0]: https://btrfs.wiki.kernel.org/index.php/RAID56


It's not a bug, but an unimplemented feature. They never made any promise that raid5 is production-ready.

Pretty much all software RAID systems suffer from it unless they explicitly patch over it via journaling. Hardware RAID gets away with it if it has a battery backup; if it doesn't, it suffers from exactly the same problem.


... hence the desire to use ZFS, which skips trying to present a single coherent block device and performs parity at the file (chunk) level.


My home NAS runs btrfs in RAID 5. The key is to use software RAID / LVM to present a single block device to btrfs. That way you never use btrfs's screwed-up RAID 5/6 implementation.


If you use LVM/mdadm for RAID, it's not possible for btrfs to correct checksum mismatches (i.e. protect against bitrot).


That's a good point, though Synology (my brand of NAS) claims that they've developed analogous corruption checks operating at the LVM level, so you get the benefits of btrfs (including checksum checks and RAID scrubbing) without having to actually use its RAID implementation.

https://www.synology.com/en-global/knowledgebase/DSM/help/DS...


I wasn't actually able to find any real documentation on how Synology's SHR works.

Their recovery documentation [0] indicates that SHR is just plain mdadm + LVM and a couple of NAS recovery sites [1,2] indicate the same.

In the end I got a Reddit post [3] with a response from a Synology representative who says that the btrfs filesystem will request a read from a redundant copy from mdadm in order to correct checksum errors.

I wonder whether this is unique to Synology or whether the change has been upstreamed into the main Linux kernel.

[0]: https://www.synology.com/en-global/knowledgebase/DSM/tutoria...

[1]: https://support.reclaime.com/kb/article/8-synology-shr-raid/

[2]: http://www.nas-recovery.com/kb_hybrydraid.php

[3]: https://www.reddit.com/r/DataHoarder/comments/5yb13m/anyone_...


Why use RAID5/6? RAID10 is much safer because you drastically reduce the chance of a cascading resilvering failure. Yes, you get less capacity per drive, but drives are (relatively) cheap.

I thought I wanted RAID5, but after reading horror stories of drives failing when replacing a failed drive, I decided it just wasn't worth the risk.

I currently run RAID1, and when I need more space, I'll double my drives and set up RAID10. I don't need most of the features of ZFS, so BTRFS works for me.


I use RAID6 because it gives me highly efficient utilization of my available storage capacity while still giving me some degree of redundancy should a disk fail. My workload is also mostly sequential, so random read/write performance isn't too important to me.

If a disk fails and resilvering causes a cascading failure, I can restore from a backup.

I think you might be mistaking RAID for a backup, which is a mistake. RAID is very much not a backup or any kind of substitute for a backup. A backup ensures durability and integrity of your data by providing an independent fallback should your primary storage fail. RAID ensures availability of your data by keeping your storage online when up to N disks fail.

RAID won't protect you from an accidental "rm -Rf /", ransomware or other malware, bugs in your software or many other common causes of data loss.

I might consider RAID10 if I were running a business-critical server where availability was paramount, or where I needed decent random read/write performance but even so I'd still want a hot-failover and a comprehensively tested backup strategy.


btrfs is not at all reliable, so if you care about your files staying working files, it probably doesn't meet your requirements. It is like the MongoDB 0.1 of filesystems.


Seems pretty reliable these days. Are you commenting based upon personal experience? If so, when was it that you used btrfs?


When it comes to file systems, “pretty reliable” these days does not sound very good. Reliability has to be a fundamental requirement in the design of a file system. If not, it sounds like putting lipstick on a pig.

Red Hat throwing in the towel on supporting its development does not instill confidence either.

Nothing personally against Btrfs. Just an end user making a file system choice saying what I care about.


re Redhat deprecating btrfs:

> People are making a bigger deal of this than it is. Since I left Red Hat in 2012 there hasn't been another engineer to pick up the work, and it is _a lot_ of work.

https://news.ycombinator.com/item?id=14909843


I have a laptop running opensuse, with root on btrfs. Twice I have had to reinstall because it managed to corrupt the file system.


btrfs + dm-cache? throw in dm-raid if you want raid5.


Hardware RAID controllers can do most if not all of these things.


I've lost more data in hardware RAID than in ZFS but I have lost data in both.

Hardware RAID has very poor longevity. Vendor support and battery backup replacement collide in BIOS and host management badly.

Disclaimer: I work on Dell rackmounts, which means that rather than native SAS I get 'Dell's hack on SAS', which is a problem, and I know it's possible to 'downgrade' back to native.


Yeah we started ordering the ones with the supercap so we didn’t have to replace batteries anymore.

Somewhat recently I dealt with LSI and Dell cards. Longevity seemed just fine for a normal 3 year server lifecycle. The only time we had an issue is when the power went down in the data center. The power spike fried a few of the cards. Luckily we had spares.

Way way back I dealt with the Compaq/hp smartarrays. Those were awful. Also anything consumer grade is awful.


The problem with most of these is you have to bring the system down to do maintenance. You can do a scrub on zfs while it's up.


Most non-hobbyist RAID hardware does online-scrub just fine (not that I would recommend wasting money on such hw).

Btw, ZFS scrub is not only a RAID block check but also a partial fsck, so it's not really comparable.


We used the LSI 9286CV-8e (or dell equivalent) which was somewhere between $1000-$1500 back in the day. Worth it compared to babysitting any software RAID IMO.


Pay more for less safety and put all your data into the hands of the guy who wrote the firmware for that thing. I'm sure that software is well maintained open source code.


"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me." - Linus

I have a strong feeling Linus has never actually used ZFS.


Probably not, given that statement. There's a reason why nearly everything I deal with today in large enterprise uses ZFS.


I also think he took a reasonable licence issue and conflated it with personal opinion not backed by experience. Nobody who has actually run ZFS says it's just buzzwords.


The license issue is not actually so clear. The actual license is a good one. Oracle itself is the bigger problem.


Don't know.

Certainly about a decade ago the ZFS' chief architect, Jeff Bonwick, and Linus have met:

https://blogs.oracle.com/bonwick/casablanca-v2


his position is about legality, nothing else. I can see his point.


I once read this story about the problems of trying to support ZFS - was it in the Linux kernel, though? Can't remember. Sadly, I can't seem to dig it up right now, but the article walking readers through the various clashes in constraints between the different systems and implementations was bordering on the humorous.


I've never used it. What makes it so good?


He's not wrong. ext4 is actually maintained. This matters. ZFS hasn't kept up with SSDs. ZFS partitions are also almost impossible to resize, which is a huge deal in today's world of virtualized hardware.

Honestly Linus's attitude is refreshing. It's a sign that Linux hasn't yet become some stiff design-by-committee thing. One guy ranting still calls the shots. I love it. Protect this man at all costs.


> ZFS hasn't kept up with SSDs.

Pretty sure this is false. ZFS does support trim (FreeBSD had trim support for quite a while, but ZoL has it now as well), as well as supporting l2arc and zil/slog on ssd.
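
For reference, wiring an SSD in as SLOG or L2ARC is a one-liner each, and ZoL 0.8 added TRIM. A sketch; pool and device names are made up:

    zpool add tank log nvme0n1p1      # SLOG (sync write log, "ZIL device")
    zpool add tank cache nvme0n1p2    # L2ARC read cache
    zpool set autotrim=on tank        # or one-shot: zpool trim tank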

> ZFS partitions are also almost impossible to resize

You can grow zfs partitions just fine (and even online expand). You just can't shrink them.


> You just can't shrink them.

That's not even entirely true, though it requires shuffling around with multiple vdevs temporarily and doesn't presently support raidz. Also, vdev removal is primarily made to support an accidental "oops, I added a disk I shouldn't have" rather than removing a long-lived device -- there's no technical restriction against the latter case, though the redirect references could hamper performance.

The official stance has always been to send/receive to significantly change a pool's geometry where it isn't possible online.


yup, true enough. You can accomplish great things with a combination of zfs send/receive and time. ;)


I've just last week used btrfs shrink to upgrade to a newer Fedora after making a minimal backup. Very useful for my purposes... I don't plan to look at ZFS until it's in the mainline kernel. Having any Linux install media usable as a rescue disk is very handy.


> ZFS partitions are also almost impossible to resize

I'm not sure you've actually used ZFS very much, because any way I can read this, it is actually pretty straightforward and simple to resize partitions with ZFS pools and volumes within ZFS pools.

For example, if you mean that you have a root zpool on a device using only half the device, you just have to resize the partition and then turn on `autoexpand` for the pool.


We are talking about something resembling adding an extra disk to a RAID5. That can easily be done with mdadm RAID, and then you just need to resize LVM, or whatever you run on top of it. It cannot be done in ZFS, not in its RAID5/6 (raidz) mode.


You're confusing extending vdevs with extending pools and stripes.

It's kind of apples to oranges, really.

FreeNAS documentation[0] makes it pretty clear.

In ZFS, you cannot add devices to a vdev after it has been created -- however, you CAN add more vdevs to a pool.

So basically, your complaint is that ZFS wants to have stripes of vdevs and that instead of adding 1 drive to a 3 drive RAID5 to make a 4 drive RAID5, you have to add 3 drives to a RAIDZ1 for a 6 drive RAIDZ+0 that is equivalent to a RAID50 on a hardware controller.

Yes, it's more enterprisey, but it's not especially more difficult and the result is different and perhaps better depending on your use case.
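
In commands, the difference looks roughly like this (device names made up):

    zpool create tank raidz1 da0 da1 da2    # a pool with one 3-disk RAIDZ1 vdev
    # you can't grow that vdev to 4 disks, but you can stripe another vdev in:
    zpool add tank raidz1 da3 da4 da5       # now effectively a "RAID50"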


"ZFS hasn't kept up with SSDs."

What does that mean?


ZFS (or at least ZoL) doesn't scale well to NVMEs:

https://github.com/zfsonlinux/zfs/issues/8381


Probably something about TRIM.


That they've never used SSDs for a ZIL or a zpool, I would wager.


He is wrong. He's focused on performance; people use ZFS for its features, not its performance.


At work we used ZFS with snapshots for a container build machine, for performance reasons. We had some edge cases that made the Docker copy-on-write filesystem unsuitable.
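
Presumably something along these lines (a sketch; dataset names are made up) - clones are instant and share blocks with the base image:

    zfs snapshot tank/images/base@v1
    zfs clone tank/images/base@v1 tank/builds/job-1234   # per-build working copy
    # ... run the build in the clone ...
    zfs destroy tank/builds/job-1234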


zfsonlinux added support for TRIM last year. Are you referring to something else?


"last year" means it's in very few distributions at this time. Encryption is another feature that is technically supported, but just been added. When I built my NAS last year, I had to use dm-crypt because zfs didn't have it. Some features indeed lag pretty badly in zfs


Is the first party that is Oracle making any efforts to develop zfs further at this point? Is ZoL the primary development team at this time?


OpenZFS/ZFS on Linux diverged after version 28. They now use version 5000, and use feature flags instead of version numbers to allow different implementations to add their own improvements. Oracle has added some features since then as well, notably: encryption, large blocks, resilvering performance improvements, and device removal. All of these features have also been implemented in ZFS On Linux.

https://en.wikipedia.org/wiki/ZFS#Detailed_release_history

http://www.open-zfs.org/wiki/Feature_Flags


I don't blame Linus, but I use ZFS a lot.

I'll drop ZFS the moment I have an alternative with the same features:

- disk management with simple commands that can create raids in any modern configuration

- zero cost snapshots

- import/export (zfs send/recv)

- COW and other data integrity niceties

- compression, encryption, dedup, checksums
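
For reference, roughly what that feature list looks like day to day (a sketch; pool/dataset names are made up):

    zpool create tank mirror sda sdb mirror sdc sdd      # striped mirrors
    zfs set compression=lz4 tank
    zfs create -o encryption=on -o keyformat=passphrase tank/private
    zfs snapshot -r tank@nightly                         # zero-cost snapshot
    zfs send -R tank@nightly | zfs recv -d backup        # export/import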

I am very grateful to the OpenZFS community, and I think they deserve praises for their work. Saying the code is not maintained is quite unfair.


> Saying the code is not maintained is quite unfair

IMO it's best viewed as a legal positioning of professing ignorance such that Oracle's thugs don't go after him for "copying" ZFS features, knowingly developing software that will be mixed with CDDL code, etc.

It's similar to how it's not a good idea for an engineer to read patents.


So why drop ZFS then?


I don't want to drop ZFS for technical reasons, but I do share some of Linus's concerns about licensing.


He mentioned that he didn't think it was being maintained. It's more or less been forked, no?

Has Linus not seen the work that the OpenZFS folks are doing?

ZFS is amazing and I would sooner go to a BSD flavor with a fun set of userland utilities than give it up.


> He mentioned that he didn’t think it was being maintained.

This news would come as a surprise to the folks at LLNL who work on ZoL:

* https://github.com/zfsonlinux/zfs

* https://zfsonlinux.org/


Maybe meant not maintained by the first party?


ZoL is now the first-party for open sourced ZFS.


>Has Linus not seen the work that the OpenZFS folks are doing?

That's what he meant by Oracle licensing issues: the Java API infringement case against Google.


He also feels ZFS "was always more of a buzzword than anything else". Yikes.


Honestly, I wouldn't bash him for this comment. Not everyone runs a 10+ TB array at their home for storage and backup purposes.

ZFS doesn't primarily target single disks and small arrays anyway. :)


ZFS worked wonders for me on very small servers (appliances) with SSDs that were forced to operate in remote areas on unstable power supplies - where other file systems were dropping bytes and bricking them.


It's great on small disks. Using ZFS root on Solaris 11 in my day job, I can tell you it makes management a lot easier. Patching and rollbacks are like eating a nice dessert.


people probably will, in a few years.

rotational disks are getting cheaper and cheaper; 10TB disks in two years might cost as little as 2TB disks do today (I got a 2TB disk for like 50€ off Amazon).


> people probably will, in a few years.

Yes, but without the array, as you stated. We have 300+ 10TB disks at our datacenter today, and ZFS is relevant at this disk count, I/O and client load.

Running ZFS at small scale is like raising a cow at home for a bucket of raw milk. It's more of a fun curiosity than a production-level operation.

I'd run LVM or md or something similar at home instead of a full blown ZFS setup for practical reasons.


I feel ZFS is much better and easier than md or LVM - at least where it's properly supported (I have never tried ZoL).

CoW and cheap snapshots are game-changers, and checksums as well, though maybe not from a practicality and home-user standpoint. This holds just as well for PB-scale storage as for a 512 GB OS drive or a 2 GB thumb drive (not that I would use ZFS on a thumb drive - again because of the lack of proper support across different OSes).


Checksums are amazing for when you do have a problem, because a scrub will tell you what you lost. Knowing what's been damaged is practically more important than actually fixing it, and ZFS is great at this.

All the Linux alternatives' answers to this problem are always "is your data okay? Don't know! It'll be a surprise when you get there".
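
Concretely, the "what did I lose" part is just (a sketch; pool name assumed):

    zpool scrub tank
    zpool status -v tank   # lists any files with permanent (unrecoverable) errors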


I know, and personally that is very important for me.

But in practice, it is likely to be less than once in a decade problem - and you should have backups anyway.

So I can understand someone having different priorities. (not me though, data integrity is as important as it gets, I'd gladly pay performance/money for it)


What about performance? Is ZFS in the same ballpark as an equivalent (data-protection-wise) 'md' layout?


There are ways to improve the performance, but the copy-on-write architecture does come with performance taxes in my experience. The trade-off is a much richer experience than a simple ext4 partition, for example.


I think ZFS - or at least the set of features ZFS provides - is relevant at any size or disk count all the way down to a single disk in a laptop. I've previously run ZFS on single block devices, though nowadays all my personal machines use at least ZFS mirroring. Without redundancy it can't recover from damage on its own, but checksums and free snapshots are irreplaceable to me.

It doesn't have to be ZFS in particular, I'll gladly switch my Linux systems over once a proper alternative is in the kernel. But right now it's the only working, mature solution. Bcachefs isn't ready yet and BTRFS isn't trustworthy.


Yeah, if there were a similar GPL-blessed effort with most if not all of the main features of ZFS that was also robust and trustable (which likely takes years of production use), I would be all for it. Projects like Red Hat's Stratis might fill this gap. I'm not a ZFS zealot, I just love what it provides.


LVM/mdraid configuration is much more daunting than ZFS on BSD or Illumos, especially given the availability of things like FreeNAS.


> Running ZFS at small scale is raising a cow at home for a bucket of raw milk. It's more of a fun curiosity rather than a production level operation.

Well on one hand this is true, but on the other hand...

If you're running more than one disk at home, maybe you're some kind of enthusiast (homelabber?) and willing to put some effort into it. Under this scenario, the same amount of time spent learning ZFS yields better results vs LVM/mdadm.


The risk of bit rot is still a thing at home or in the data center. And the other niceties of ZFS, like snapshots and such, are a boon too. Instead of a few various layers you have one whole subsystem to do all of it - all of which you know well from using it in the data center. I just use it at home too.


OpenZFS would still be in danger of a potential Oracle lawsuit.


For what? It's under the CDDL.


API Infringement.


That's not even a thing (yet).

ZFS was freely relicensed under the CDDL by Sun. Oracle can do nothing to take back any of the rights granted under the terms of the licence retrospectively. They haven't got any grounds whatsoever to curtail anyone's use or modification of the ZFS code.


It's been a thing for a decade[1] now. If you don't have a Google-sized team of lawyers handy it's a concern. Fingers crossed that Oracle loses in the end.

[1] https://en.wikipedia.org/wiki/Google_v._Oracle_America


Yes, but it won't be an actual thing to worry about until there's a legal precedent set. Right now, without any conclusions from the trial, it's not a problem.


Isn't that beside the point? OpenZFS is still CDDL.


I think it is beside the point for the risks of merging anything into Linux. He's right on that topic, of course.

But it is a separate point he made about using zfs in general, and it's certainly not correct if you take one look at the activity in the zfsonlinux project on GitHub.


It's pretty clear that Linus simply doesn't have a clue about ZFS, and he just exposed himself as somebody who repeats stuff he read in some Linux forum or something.

There is no way, after any technical evaluation by himself he would come to those conclusions.


Remember kids: Oracle has no customers, only hostages!


Amen to that


Switching to FreeBSD now for my storage. I have an 8TB database setup. The primary DB runs on LVM(cache)/XFS, which gives very satisfying speed, but I really love my secondary mirror DB, whose storage is on ZFS. I do daily snapshots and daily incremental backups via send/recv to another ZFS location. No other FS I am aware of provides this functionality this easily. Linus seems to have never used ZFS. Although I can understand his issues with the licensing, he is ranting about ZFS, and that's a shame.
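
That daily cycle boils down to something like this (a sketch; dataset, host and snapshot names are made up):

    zfs snapshot tank/db@2020-01-09
    zfs send -i tank/db@2020-01-08 tank/db@2020-01-09 | \
        ssh backuphost zfs recv -u backup/db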


Alright, Linus, I'll make you a deal: I'll consider dropping ZFS when you ship a production-grade BTRFS (or reiserfs or anything else with the same features).


Is reiserfs still maintained? After Hans went to prison I didn't think there was much left beyond stagnation.


I think his parole hearing is coming up this year...


They really needed to rename it.


Hans is eligible for parole this month.


He still murdered a person. He is quite literally the worst sort of domestic abuser. The kind who kills their partner.


I'm no expert but it seems unlikely that he'd get parole on his first chance. I don't think that is common for murder.


rename to parolefs


bcachefs is aiming to replace btrfs.


Yep :) I'd forgotten about that one, but when it gets merged I will have to seriously consider it! Unfortunately, that's probably years out, so I'm stuck on ZFS for now. I also have some portability concerns (ZFS works on FreeBSD, NetBSD, illumos, and Linux, and this was a selling point for me), but I'll probably get over it or at least mostly switch to bcachefs when it goes mainline.


Do you know that FreeBSD is actually a usable modern OS these days? ;) I run a FreeBSD desktop with Nvidia drivers without any issues. OpenJDK works great, among other things. And it supports ZFS natively, and root on ZFS is the default installation option.


Unfortunately some popular software (like Docker) doesn't work (afaik?) on FreeBSD which might hold a lot of people back.


ZFS functionality dwarfs the minor issues Linus has with it in my opinion. I find it to be well maintained, and not just bug fixes but new features keep being added as well. If I couldn't use ZFS on linux anymore, I wouldn't hesitate to setup another system just so I could keep using ZFS.


Probably should add Java to that list. The sooner Oracle stops existing, the better.


OpenJDK is on pretty solid legal ground, no?


Haha, OpenJDK is wholly owned by Oracle. There is no separate legal entity. License wise it's GPL w/ CPE.


Plus the OpenJDK Community TCK Licensing Agreement.

No-one's seriously concerned that adopting OpenJDK could land them in legal trouble, as far as I know. I don't think a separate legal entity is always necessary.


Well, depends what you mean by solid legal ground then.

Nobody is / should be seriously concerned by using the open source OpenZFS modules with their Linux distro of choice.


> depends what you mean by solid legal ground then

I'm referring to having confidence that using OpenJDK, in the absence of any licensing agreement with Oracle, will not land your company in legal trouble.

I'm not seeing an ambiguity in my use of solid legal ground.

> Nobody is / should be seriously concerned by using the open source OpenZFS modules with their Linux distro of choice.

Linus isn't convinced that it's legally safe to do this, and neither are some people in this thread, myself included.

It may be true that even Oracle are unlikely to go after you for using ZFS from Linux, but if it's not safe enough for Linus, it's probably not safe enough for the legal departments of large companies.


No Java in the kernel!


The BPF has bytecode and a JIT. The JVM has a very good JIT. Let's add Java to the kernel and see!


Google and Microsoft too!


Microsoft and Google contribute more to open source than any other company.

Source: https://www.techrepublic.com/google-amp/article/microsoft-ma... https://techcrunch.com/2019/01/17/google-remains-the-top-ope...


"Oracle's litigious nature". What a beautifully short and concise phrase. I immediately had to write this down and stash it as an argument for the next time someone at work pushes to go for "Oracle $product" after having received a bottle of wine from their sales team.


Honestly, at this point, if I can't get ZFS in Linux I would move to FreeBSD whenever I need a big filesystem. How does the Linux® Binary Compatibility layer work on FreeBSD?


It implements the x86 and x86_64 Linux system call ABI. Linux ELF binaries get vectored to an alternate system call table implemented by the compatibility layer. There are some other components like an implementation of a Linux-compatible procfs. How well it works in practice really depends on how far off the beaten path you go. There are lots of non-essential pieces that are not implemented, but for example I know of people running Steam on FreeBSD.
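
For the curious, enabling it is something like the following on a recent FreeBSD release (the Linux userland package name varies by release, so treat these as an illustrative sketch rather than exact instructions):

  # load the 64-bit Linux ABI module and enable it at boot
  kldload linux64
  sysrc linux_enable="YES"

  # mount linprocfs/linsysfs and friends via the rc script
  service linux start

  # optionally install a Linux userland base (CentOS 7 based here)
  pkg install linux_base-c7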


I've run Oracle JRE and OpenJDK with it and both work OK, as do some BlackBerry SDK tools built for Linux. I'm sure there are some rough edges, and I don't know about performance, but once I mounted the appropriate filesystems, things were working, and that was good enough for me. I think you do have to pick between a current release of FreeBSD with 64-bit Linux binaries or an older release of FreeBSD with 32-bit Linux binaries, with no way to support both sizes on the same host; but I might be misremembering that.


The main thing I want is for OnlyOffice or Collabora to work on FreeBSD in some capacity and I haven't been able to do it (both have open issues that receive very little attention). I want to run my self hosted office solution on the same machine as the data, and I'd really rather avoid VMs.

So, I use Linux because Docker and BTRFS work just fine for my use case. I prefer FreeBSD, but unfortunately I'm unable to solve my problems easily with just FreeBSD, so I'm using something else.


Linus is correct in his arguments. As the lead of the Linux project, he shouldn't merge things that he feels aren't up to snuff from a license point of view.

That's why we have different software.

If you asked Theo to merge an encryption algorithm for example into OpenSSH and OpenBSD - he's going to have an opinion about it - and that's his thing.

Why would this be controversial at all?


Because people like ZFS and Linux so they want to combine the two.


Then they can go ahead and combine the two! ZFS on linux has a whole team of maintainers.

Linus can do what he wants in regards to his branch (which because he's the lead, becomes official Linux), but there's no reason any one else (or any distro) can't do the integration. That's how open source works!

Of course, whoever does the integration may incur Oracle's wrath. Tread at your own discretion. If those people like it so much that they will put up their own money when Oracle's lawyers come calling, that's completely up to them.

In my opinion, people who constantly clamour for such things against the technical judgment of open source maintainers are freeloaders. They can propose ideas, but just because the maintainer doesn't want to do it doesn't mean they can scream bloody murder. Just put up your own money and fork it and/or maintain your own fork, which is exactly what the ZFS on linux community is doing - which is the right thing.

Anyone else can work with the ZFS on linux maintainers to take a bit of the burden on, whether it's rebasing or updating docs on how the integration works, etc. It's a group effort.


It's too bad. ZFS is amazing and so ridiculously simple to use and manage if you can find the recent docs among all the old docs online.

Anyone using Stratis? Just noticed it recently went to 2.0. https://en.wikipedia.org/wiki/Stratis_(configuration_daemon) Curious to know how it handles device failures, removals, additions and so on.


Polite reminder that Kent Overstreet is still plugging away at a new copy-on-write FS for Linux called bcachefs. One day, I hope it'll replace ZFS for my uses.

I'm not involved with the project in any way, apart from sending him a few bucks a month on patreon. It's literally the only open source thing I sponsor; it seems like a really worthwhile effort especially considering Linus' advice here...


Perhaps if one wanted to use ZFS they should just use a kernel that supports it?

ZFS is certainly "nice to have" on desktop, but the main use case is going to be servers and NAS. You can use BSD there, it won't bite.


Or do use ZFS, just know that Oracle sucks and you have to jump through hoops because of it...

Also, while ZFS for me has been performant, that seems like a silly reason to decide to use it or not use it. I think ZFS pools and snapshots would be among the deciding factors to use it or not.

FWIW, as some other commenters have said, I'd rather drop Linux than drop ZFS. I'm actually only even running Linux on my home server right now because I decided to try Proxmox out on it months ago and it was soooo obscenely easy to install I haven't bothered to reset it yet (though I need to for various reasons; Proxmox itself being the first, ha).

Really all I care about for my host OS these days is the ability to do virtualization and GPU pass-through... Linux is an option but not the only one. Having a robust storage system where drives can fail, appear as one logical volume, are replicated, and have snapshots (including RAM), and where I literally don't have to worry about it -- that's really only available with ZFS.
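
The pool-level part of that is a few commands, something like this (device and dataset names made up for illustration):

  # create a mirrored pool from two whole disks
  zpool create tank mirror /dev/sda /dev/sdb

  # replace a failed disk; the pool keeps serving data while it resilvers
  zpool replace tank /dev/sda /dev/sdc
  zpool status tank

  # cheap point-in-time snapshot of a dataset
  zfs snapshot tank/vms@before-upgrade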


Oracle has not actually done anything yet. It's the lack of belief in the license.


Yet another user testimony: For simple volumes and snapshots at home I switched from ZFS to BTRFS because ZFS on Fedora was giving me too many issues. It simply wasn't integrated well enough into the system. Had nothing to do with the ZFS FS implementation, merely the packaging.

Either way, BTRFS works for everything I need it to do and it's native.


More technical version of the same discussion here: https://lore.kernel.org/lkml/CAB9dFdsZb-sZixeOzrt8F50h1pnUK2...


Typical Linus bullshit ...

"Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me."

Yeah, just use XFS or ext4 without ANY data consistency ... you could also use FAT32; it's a filesystem on a similar level.


Until I have a viable alternative that gives me snapshotting (so I can make consistent backups), that advice is worthless to me.


Yeah. There's no decent replacement for ZFS. I use ZFS + KVM + Sanoid + Borg(matic) + Borgbase. With the native encryption and TRIM support added to ZFS in 0.8 there isn't anything close in terms of ease of use.

Linus seems out of touch on this one IMHO.


XFS on an LVM thin-pool thin LV gives you fast CoW snapshots and is rock solid. Really, try it. :)
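
A minimal sketch of that setup, assuming an existing volume group and made-up names:

  # carve a thin pool out of volume group "vg0"
  lvcreate --type thin-pool -L 500G -n pool0 vg0

  # create a thin LV and put XFS on it
  lvcreate -V 200G --thinpool vg0/pool0 -n data vg0
  mkfs.xfs /dev/vg0/data

  # thin snapshots are CoW and need no preallocated space
  lvcreate -s -n data_snap vg0/data
  lvchange -ay -K vg0/data_snap   # thin snapshots skip activation by default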


I don't remember the details, but when I looked into switching to LVM snapshots, I ran into some sort of blocker.

My use-case is that I run Sandstorm, and want to be able to back it up while it's running. That means:

  - Ensure there aren't any existing snapshots
  - Take a snapshot
  - Mount the snapshot as a filesystem
  - Run tarsnap against that filesystem
  - Release the snapshot
I think the trouble I ran into was at the mount step.
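
For what it's worth, a sketch of that loop with a classic (non-thin) LVM snapshot and made-up VG/LV names; if the origin filesystem is XFS, the snapshot carries the same UUID as the origin, so mounting needs nouuid, which is a common blocker at exactly that step (ext4 doesn't need it):

  lvcreate -s -L 10G -n sandstorm_snap vg0/sandstorm

  # nouuid only needed for XFS; harmless to omit for ext4
  mount -o ro,nouuid /dev/vg0/sandstorm_snap /mnt/snap

  tarsnap -c -f "sandstorm-$(date +%F)" /mnt/snap

  umount /mnt/snap
  lvremove -y vg0/sandstorm_snap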


At the application level, snapshotting is not by itself a way to do consistent backups. A consistent backup is a backup with a planned or known state when restoring.


Sure it is. Quiesce your application, take a snapshot, then resume application. Then you can back up the snapshot. The alternative can be a lengthy downtime for your application.


I can't do anything about partial file writes in the general case, but it's close enough—and any ACID databases should be able to restore from such a snapshot.


LVM snapshots have been good enough for consistent backups for the last 20 years. Then there are also thin snapshots if you feel fancy.


Maybe off topic, but I'm impressed by the rest of the conversation that generated that message from Linus: there is a whole thread in which Linus explains in detail various locking mechanisms in the kernel, their pros and cons, etc.

We don't normally see this happen; situations in which the technical leads for the Apple or MS kernels answer questions and explain things at this level of detail.

I also think that is even more interesting than the original blog post that started the thread. Someone should harvest all these Linus comments and organize them into some kind of "lectures" library.


His reasons may not be perfect but he's right.

If you use a filesystem that isn't mainline and it breaks, it's all on you to figure it out and fix it. Having used experimental filesystems before and been burned, I would rather stick with what I know and won't change overnight.

I'll keep an eye open for new filesystems, of course, but if it's not mainline, then unless it's for personal hacking, no.

> I own several SBCs (single-board computers) that do not run mainline kernels. The company that makes them provides its own patched kernel, so if it breaks I'm up the creek, but I know who to bitch at.


I kind of wish somebody with money would take Oracle to court over ZFS and establish that it doesn't carry Oracle taint, so we can move on. Java would be good too, but that fight went to the wrong point of law. OpenSolaris... I feel meh about it, but perhaps it needs this too.

Money doesn't solve all problems, but money can solve legal problems; cf. the lawsuits companies like Newegg file to get rid of the IP leeches. (And Cloudflare?)


So what would be a viable ZFS alternative on a Linux based NAS I'm planning to build in a few months? I'm currently running Nas4Free on a giant rack sized thing I built using a Mini-ITX Atom board years ago, and ZFS works like a charm, but I also intend to move some day to the ARM architecture, which unfortunately the *BSD based NAS software doesn't support (yet). I'm tempted by this smaller hardware in particular: https://wiki.kobol.io/helios64/intro/ So far the only viable option would be Openmediavault, which supports ZFS only through external modules, which I wouldn't be entirely comfortable with. I'd only arrange disks as RAID1 pairs, however.


I use BTRFS and it works fine. Just don't use RAID5/6 and you should be good to go.
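
A mirrored setup is only a few commands; for example, with made-up device names:

  # mirror both data and metadata across two disks
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
  mount /dev/sdb /srv/data

  # or convert an existing single-disk filesystem after adding a second device
  btrfs device add /dev/sdc /srv/data
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /srv/data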

I've also heard success stories using ZFS on Linux, but I haven't bothered because I'd just rather use something that's in the kernel instead of something outside it.


Are there any lawyers who can verify the legal claims being made? Anyone can file a lawsuit at any time for whatever reason; that doesn't mean the lawsuit is valid. Sure, using ZFS on your home system or even in a small deployment is not a big deal, but where there's a company with money, lawsuits happen.

Given the wide availability and usage of ZFS, and the fact that this is no longer 20 years ago when companies would sue as if they were 19th-century tycoons, I would think there must be some sort of "you let ZFS be used this long without suing; you can't leave it open for so long and then sue once it's profitable" statute in American law. Again, I'm not a lawyer, and I know enough about law to know I don't know enough about law.


To be honest, given Oracle's history, I'm not sure I would even trust a lawyer's opinion, since it could still cost you a lot of money to defend against an Oracle lawsuit even if you win.


Don't go too far, people. Linus's criticism of ZFS is concise: buzzword & licensing.

Here, Linus is putting the emphasis on the license, not on whatever technical details of ZFS. He clearly doesn't use ZFS and is not even interested in the problem ZFS solves. He is only "interested" in his (and the community's) control over the ZFS source code.

So, his logic basically becomes this:

ZFS is not mainline-able, so veto it until Oracle changes its attitude - a simple old FOSS infestation tactic.

So, please, move on, people. The discussion is not even about file system...


Actually, what stops mainline integration is the community's belief in the strength of open source licensing.


mdadm + LVM + ext4 does everything I need and more: thin and thick provisioning, SSD caching, snapshots. If I were to use another file system it would be something like Ceph or GlusterFS.
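
The stacking looks roughly like this, with hypothetical device and VG names (SSD caching can be layered on afterwards via lvmcache(7)):

  # RAID1 mirror out of two disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

  # LVM on top of the md device
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0

  # thin pool + overprovisioned thin volume, then ext4
  lvcreate --type thin-pool -L 400G -n pool0 vg0
  lvcreate -V 1T --thinpool vg0/pool0 -n home vg0
  mkfs.ext4 /dev/vg0/home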


> and given Oracle's interface copyright suits (see Java)

This seems pretty ironic, given that the whole problem is Linux developers trying to claim and enforce that only other GPL code is allowed to use its APIs (which, IMHO, goes beyond both the intent and letter of the license). The issues aren't exactly the same, but Linux sure seems to be a lot closer to Oracle than to Google here.


Oracle doesn't own all the rights to OpenZFS; people are intentionally adding their own copyrights to ensure it stays under a copyleft license, so Oracle can't make it closed source by owning the rights (as they did with the old version).


The title is misleading, tbh. I know it's a direct quote from the article, but it takes it out of the context of "I don't want to support your third-party code".


Nicely written, Linus: very soft and considerate of others' feelings. Not so funny anymore, but overall it will make more people happy :)


Both my main servers at my house use ZFS, neither use Linux on bare metal. FreeNAS (multiple ZFS mirrors) and SmartOS running a ZFS mirror.


I hope bcachefs will get to usable state soon.


The situation rules out Linux for many potential applications and side projects I might undertake.


Reminds me of Facebook and React licensing scandal back in 2016/17.

You can never trust someone who so easily changes their licenses from private to public. Who knows, in the future they might change it back to private!?


Yes, don't use ZFS! Use FreeBSD and ZFS!


The Linux kernel broke user space by egregiously decreeing that kernel modules cannot use certain CPU features unless they are GPL. Derived work, my ass.


So is Linus back from his break now?


linus mouthing off on something he doesn't care to learn about: news at eleven.


Honestly, ext4 is fine for most use cases, even on SSDs. If you really need more performance, look at HAMMER; it's meant for high availability. At that point you shouldn't be running Linux anyway; even with RT_PREEMPT it's not going to be the most performant for those kinds of RTOS workloads.


HAMMER2 is now the default on DragonflyBSD. If I made a bunch of money during the boom and could spend my days doing open source (like Matt Dillon), porting HAMMER2 might be one of the projects I'd pick up.


That would be a neat project. It’s hard to set aside the time when you don’t have much in the time bank. I’ve been wanting to do a lot more research on kernel scheduling, writing my own alternative scheduler, and more research on RTOS design and real-time computing in general.


So my take on this is that the future is Ceph, and you would do better running single-node Ceph than ZFS or BTRFS.


[flagged]


No personal attacks on HN, please. Maybe you don't owe Linus better (though why not?), but you owe the community better if you want to post here.

Also, "typical autistic savant type" breaks the site guideline against calling names. Please don't do that.

https://news.ycombinator.com/newsguidelines.html

Edit: you've unfortunately been doing this repeatedly:

https://news.ycombinator.com/item?id=21889883

https://news.ycombinator.com/item?id=21089837

Can you please not? Making an account to be anonymous on HN is fine in principle, but people sometimes start breaking the rules after they do so, and that is not cool.


Oracle rears its head again lolz


I'm glad Linus is now acting as legal counsel for Linux. It's scary that he is implying he's making these decisions without the aid of counsel.


ZFS threatens the power of Linux and therefore Linus’ job. That’s the long and short of it. Mac and Windows have been able to maintain stable interfaces for binary kernel drivers for 20 years.


i wouldn't use ZFS either. my guess is 90% of ZFS users have never run failure scenarios and grappled with potential failure modes of ZFS, nor even know that you really need ECC RAM to run ZFS without fear of existential data corruption due to bit flips.

furthermore, the allure of ZFS means people aren't testing their disaster plans until it's too late, bc ZFS is "resilient".

lastly, data recovery is expensive as all hell if even possible. i am talking order of magnitude four figures for 100s of GBs and sketchy probabilities.

ZFS is the ultimate "pet" in the pets vs. cattle continuum. in a world where shoddy engineering and "break things fast" is the zeitgeist, i'm happy to use a classic dumb FS like ext4 and pathologically backing it up and testing said backups.

i would not risk any of my personal treasured data to ZFS due to inherent existential threats. i would implore ZFS users to evaluate and test their setups, and especially use ECC RAM - like, starting now - to protect their assets.


> you really need ECC RAM to run ZFS

This is FUD. ZFS does as well as, if not better than, the average file system, with its focus on integrity, online scrubs, etc. On the other hand, "use ECC RAM" is standard best practice for any mission-critical data; no file system magic is going to fix computer RAM lying to you 100% of the time. It's the standard recommendation for ZFS because ZFS is rarely deployed in environments that can tolerate data corruption.

> pathologically backing it up and testing said backups.

ZFS doesn't remove the need for backups, and no one seriously makes that argument. Snapshots + send/receive do make backups very easy to do in ZFS, though.
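
The online scrub mentioned above is a single command (pool name made up):

  # walks the pool, verifies every block's checksum, and repairs
  # from redundancy where a good copy exists
  zpool scrub tank
  zpool status -v tank   # shows scrub progress and any errors found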


I've detected broken memory chips thanks to BTRFS checksumming finding errors, luckily before it had a chance to corrupt any written data. So if anything, a properly checksummed filesystem makes non-ECC RAM less dangerous.
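
For anyone who wants to check their own filesystem, something along these lines (mount point made up) surfaces those errors:

  # re-read everything and verify checksums against the metadata
  btrfs scrub start /mnt/data
  btrfs scrub status /mnt/data

  # per-device error counters (read/write/corruption) accumulated so far
  btrfs device stats /mnt/data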


> ZFS is the ultimate "pet" in the pets vs. cattle continuum. in a world where shoddy engineering and "break things fast" is the zeitgeist,

Live storage is never 'cattle'; your running filesystem IS, in fact, a pet. Hard drives are the 'cattle', and that's exactly what ZFS treats as cattle.

ZFS was born out of long frustration with file systems and was systematically designed to protect against data corruption and bad hardware. It is literally the exact opposite of 'move fast and break things'.

Go and actually watch the videos where the designers show it for the first time. They speak very clearly about how and why they designed it.

> i would not risk any of my personal treasured data to ZFS due to inherent existential threats. i would implore ZFS users to evaluate and test their setups, and especially use ECC RAM - like, starting now - to protect their assets.

ZFS has always recommended ECC to its users. No filesystem can protect you from not having it.


> you really need ECC RAM to run ZFS without fear of existential data corruption due to bit flips

https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=...

http://www.open-zfs.org/wiki/User:Mahrens


The problem is that ZFS doesn't have an offline repair tool. A (granted, unlikely) bit flip in an important data structure that gets written to disk makes the whole fs unmountable, and that's it (I don't know whether it has a tool to rescue file data from an unmountable pool? Maybe we should ask Gandi...).

With e.g. ext4 you can get back to a mountable state pretty much guaranteed with e2fsck. You might lose a few files, or find them in lost+found, etc., but at least you have something.

The reason ZFS doesn't have an offline repair tool is pretty convincing. Once you have zettabytes (that's the marketing) of data, running that repair tool would take too long, so you have to do everything to prevent needing it in the first place: checksumming everything, storing everything redundantly, and using ECC RAM.


AFAIK it stores multiple copies of those important data structures though, so it should take more than a single bit flip.
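
Right, metadata automatically gets extra "ditto" copies. You can also raise the redundancy of data blocks per dataset, even on a single disk; a small example with a made-up dataset name (only affects newly written data):

  # keep two copies of every data block in this dataset
  zfs set copies=2 tank/important
  zfs get copies tank/important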


So better to use ext4 and let it silently corrupt your data?

ZFS does indeed catch memory errors. If you are running without ECC, most filesystems will happily write that corrupt data to disk. Unless the corruption is in the metadata, you will be none the wiser.


ZFS has seen me through 6 disk failures since I started using it on Nexenta about 10 years ago; zero data loss.

It's not a backup by itself, but it makes a fine backup target if it's located somewhere else, since it's both redundant (hard to lose data by accident) and snapshotted (hard to lose data by mistake) - it was my local CrashPlan target (alongside cloud) back when CrashPlan supported home users.


So, is Linus pretty much the same guy? I know he took some time off, the kernel team adopted a code of conduct, and he sent that introspective e-mail ... but now that he's back ... is it any different?


Because he didn't attack any individual, use derogatory language, or break the code of conduct?

He never agreed to roll over and agree with every technological persuasion, he agreed to be nicer to people. This was nice to people, but rude to a technology (ZFS), that seems consistent.


This post and a few related ones in the chain seem perfectly fine to me. Does something here seem offensive or harsh to you?


Looking at reddit.com/r/linusrants ...

Most of his worst rants were directed at maintainers who committed changes that resulted in bugs in the kernel.

The OP of this thread is from a user, so perhaps we need to wait until another big bug gets committed to see the results of his hiatus.

