
How real is the risk?

ZFS on Linux is loaded via DKMS. DKMS is basically the compile-from-source alternative to binary blob drivers. Is there a reason to believe that ZFS via DKMS is materially different, legally, from Nvidia drivers being loaded as binary modules?

Indeed it appears that Nvidia proprietary drivers these days go via the DKMS route. Are they also at risk?




The risk is in trying to keep compatible with the Linux Kernel, when not part of the Linux Kernel dev team.

If ZFS was a module within the Linux kernel source tree, Linux kernel devs would keep it in mind when making changes to the Linux kernel's APIs. They wouldn't make potentially breaking changes to that API, or replace APIs with ones that can't be made compatible with ZFS at all, without considering the effect it would have on ZFS. The Kernel devs have to take the whole tree into account, including all the modules.

But since ZoL is not part of the kernel, and is instead a separate, independent module that gets loaded into the kernel, the ZoL team has to play catch-up. Instead of kernel devs working with them to make sure things stay compatible, or giving them a heads-up when something needs to change for a new version, the ZoL team is on the outside, trying to make sure they don't get boxed out of the kernel API.

It can be a tenuous position to be in. For a more everyday example, look at the Firefox and Chromium browser updates. For Firefox Quantum they completely revamped their extensions system, making all existing extensions worthless: either the extension devs rewrote them for the new API, or the extension was just gone. Similarly, Chromium is changing the API for how extensions can intercept network requests, which would make uBlock unusable. If these features were officially part of the browser or official extensions, the browser devs would work with the extension devs to make sure things stayed working; instead, the extension developers need to keep up or watch their product die.

If the Linux Kernel completely changed its module API, and ZoL was caught flat footed and needed to make substantial rewrites, what would the project's backers do? Would they commit the dev time and resources needed to make those changes, or would they compare the benefits of ZFS against the competition -- solutions based on btrfs or LVM, both built into the kernel -- and decide to jump ship before committing more time? That's where the real danger lies. If the Linux Kernel makes a serious breaking change, ZoL's supporters might double down to fix it, or they might just cut their losses.

And while there is an official way to plug external filesystems into Linux, FUSE, it's extremely slow compared to native kernel modules. There was a userland FUSE implementation of ZFS, and there's a reason it's fallen out of favor. ZoL is significantly faster because it runs directly in the kernel, but it's playing with fire by not being part of the kernel tree itself.


>ZoL was caught flat footed

Which would essentially be impossible.

OpenZFS is designed with a shim layer between ZFS's internal, Solaris-ish API usage and the native OS's APIs. This allows the same ZFS code to run essentially unmodified on many *nixes.

In ZoL, this is called SPL (Solaris Porting Layer), and is one of the two kernel modules required to use ZoL. Linux does not export the proper APIs that ZFS requires to work here, and SPL fills that gap.
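A toy model of that shim idea, in case it helps picture it. This is a minimal sketch, not SPL's real code: the portable "core" only calls Solaris-style names (`kmem_alloc`, `mutex_enter`, `mutex_exit` are the kind of interface SPL provides), and a hypothetical `LinuxPortingLayer` class maps them onto native primitives. All class and function names here are illustrative assumptions.

```python
import threading
from typing import Protocol


class SolarisStyleAPI(Protocol):
    """The interface the portable ZFS-like core is written against."""

    def kmem_alloc(self, size: int) -> bytearray: ...
    def mutex_enter(self, lock: threading.Lock) -> None: ...
    def mutex_exit(self, lock: threading.Lock) -> None: ...


class LinuxPortingLayer:
    """Hypothetical SPL-style shim: implements the Solaris-style calls with 'native' primitives."""

    def __init__(self) -> None:
        self.bytes_allocated = 0

    def kmem_alloc(self, size: int) -> bytearray:
        # Stand-in for the native allocator a real shim would call.
        self.bytes_allocated += size
        return bytearray(size)

    def mutex_enter(self, lock: threading.Lock) -> None:
        lock.acquire()

    def mutex_exit(self, lock: threading.Lock) -> None:
        lock.release()


def portable_core_store(api: SolarisStyleAPI, lock: threading.Lock, payload: bytes) -> bytearray:
    """The 'unmodified' core: it never calls OS primitives directly, only the shim."""
    api.mutex_enter(lock)
    try:
        buf = api.kmem_alloc(len(payload))
        buf[:] = payload
        return buf
    finally:
        api.mutex_exit(lock)
```

Porting to another OS would mean writing another class with the same three methods; `portable_core_store` stays untouched, which is the point of the layer.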

ZFS modules do "break" often due to internal API changes, but the fix is usually shipped in ZFS stable before the kernel itself is shipped. Only people who follow ZoL development closely ever see the sausage being made.

It is highly unlikely that Linux can ever do anything that ends up with a situation that the SPL and ZFS kernel modules can't easily #ifdef their way out of. It wouldn't make sense for Linux to, either.
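Roughly what "#ifdef their way out of it" looks like, modeled at runtime in Python. A minimal sketch under assumptions: the build probes the target kernel for an optional interface (`fast_helper` and `HAVE_FAST_HELPER` are hypothetical names), then picks either the fast path or a portable fallback. When an export disappears, the module still builds; it just takes the slower path.

```python
from typing import Callable, Dict, FrozenSet


def scalar_checksum(data: bytes) -> int:
    """Portable fallback: works against any kernel, just slower in real life."""
    total = 0
    for byte in data:
        total = (total + byte) & 0xFFFF
    return total


def fast_checksum(data: bytes) -> int:
    """Stand-in for a path that needs an optional kernel export (e.g. SIMD helpers)."""
    return sum(data) & 0xFFFF


def probe_kernel(exported: FrozenSet[str]) -> Dict[str, bool]:
    """Configure-style probe: record which optional interfaces the target kernel offers."""
    # 'fast_helper' is a hypothetical symbol name used only for illustration.
    return {"HAVE_FAST_HELPER": "fast_helper" in exported}


def select_checksum(config: Dict[str, bool]) -> Callable[[bytes], int]:
    # Runtime analogue of:
    #   #ifdef HAVE_FAST_HELPER
    #       ... call the fast path ...
    #   #else
    #       ... call the portable fallback ...
    #   #endif
    return fast_checksum if config.get("HAVE_FAST_HELPER") else scalar_checksum
```

Both paths compute the same result, so losing the export costs speed rather than correctness.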


> ZFS modules do "break" often due to internal API changes, but the fix is usually shipped in ZFS stable before the kernel itself is shipped. Only people who follow ZoL development closely ever see the sausage being made.

Eh, it depends on the distribution. I use Fedora, and there have been quite a few times when a kernel update has resulted in ZoL becoming useless for a while until they push an updated version. The solution, of course, is just booting the earlier kernel until ZoL catches up.

In fact, this is happening right now. The 4.20 kernel was pushed to updates on Fedora (at least my system) on 1/23. The latest stable ZoL doesn't work with it. They're on RC3 for the next release, though, and that does. Hopefully it comes out soon.

In the meantime, I'm running the last 4.19 kernel. shrug


> I use Fedora, and there have been quite a few times when a kernel update has resulted in ZoL becoming useless for a while until they push an updated version.

Alternatively, you can use a variant of the distro, if one exists, that provides a more stable kernel and package set. In this case, that would be RHEL/CentOS.

For a workstation, that can be annoying, but as you said, just use an older kernel for a bit longer. Perhaps exclude the kernel from normal package updates, and have a cron job check specifically for new kernel versions and email you when any are available.
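A rough sketch of that cron job, assuming the held kernel comes from something like `uname -r` and the candidate list from your package manager (how you query it is left out). On Fedora the hold itself could be dnf's `exclude` option in dnf.conf, but check the docs for your setup. The version comparison is deliberately crude and all function names are made up for illustration.

```python
import re
from email.message import EmailMessage
from typing import List, Optional, Tuple


def version_key(release: str) -> Tuple[int, ...]:
    """Crude numeric key: '4.20.3-200.fc29' -> (4, 20, 3, 200, 29)."""
    return tuple(int(n) for n in re.findall(r"\d+", release))


def newer_kernels(installed: str, available: List[str]) -> List[str]:
    """Kernel releases in `available` that sort after the one we're holding at."""
    current = version_key(installed)
    return sorted((r for r in available if version_key(r) > current), key=version_key)


def build_alert(installed: str, available: List[str], to_addr: str) -> Optional[EmailMessage]:
    """Return an email to send if newer kernels are waiting, else None."""
    pending = newer_kernels(installed, available)
    if not pending:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"New kernel available (held at {installed})"
    msg["To"] = to_addr
    msg.set_content("Held back until ZoL supports them:\n" + "\n".join(pending))
    # Actually sending it is left out; with a local MTA something like
    # smtplib.SMTP("localhost").send_message(msg) would do.
    return msg
```

Run it daily from cron; once ZoL stable catches up, lift the hold and update as usual.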

For a server, I imagine it's rarely a problem, since servers should be running more stable distros anyway. There, 99% of the time the kernel is older and back-patched (it may be current right at a point release, but even then it's weeks past its upstream release), which should give ZoL a stable API.

The trade-offs there don't seem too onerous to me: a little hand configuration on a workstation (of which a person likely has only one or two) in exchange for smooth server support. Server support generally matters more, both because a server can be harder to fix when something goes wrong and because one person may look after anywhere from none to many of them.


Sure, it doesn't bother me a whole bunch; I just thought I'd point out that, at least on Fedora, ZoL occasionally lags behind the most recent kernel release.


How does RHEL help there? They ship kernel updates. Those kernel updates once broke the HP B120i proprietary driver (HP releases a new version of this driver for every minor RHEL release). I don't see how a fakeraid driver is fundamentally different from ZFS.


Kernel updates are back-patched. That means that between point releases (which generally happen every 6-12 months) the kernel version stays the same, and any bug fixes are ported into the older kernel that is shipped. Point releases may update the kernel version (I think?), but generally keep it the same too; they do, however, backport some features into the older kernel, not just bug/security fixes. You can see here[1] for RHEL versions and the kernels they ship with.

If a security fix breaks your ZoL integration, my guess is you're actually better off waiting for that to play out and resolve itself than expecting it to work. If a feature backport breaks it, that might be a little more annoying, but I imagine it will be fixed in short order, and you only have to worry about that once every 6-12 months (and it's well publicized).

> Those kernel updates once broke HP b120i proprietary driver (HP releases new version of this driver for every minor RHEL release).

If HP is releasing closed-source drivers for RHEL, I imagine they would want to get on the certified hardware list and test, or at a minimum seek access to the beta spins of the point releases (which I think is where it broke) so they can test before release. I'm not sure I blame Red Hat for HP trying to specifically support RHEL and failing to do so, given the systems I know they have in place to help companies in just that situation (because it helps RHEL users).

In any case, all I'm really noting is that between Fedora, which ships a new kernel version with every kernel update (AFAIK), and RHEL/CentOS, which the majority of the time ship largely the same kernel with only the specific changes needed, keeping ZoL working should be vastly easier on RHEL systems (and in fact on any OS that back-patches its kernels, which I believe includes SUSE and the LTS releases of Ubuntu).

1: https://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux#RHEL_...


It's not really the same kernel. They backport a lot of features with every minor release. They still call it 2.6.x or whatever, but it really is different. I know that RHEL has a subset of internal kernel APIs that they promise to keep stable within a major release, so if HP failed to stick to those APIs, it's their problem, but it can happen.
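A toy model of how that promise plays out, as I understand it: RHEL calls the stable subset its kABI whitelist, and symbol compatibility can be checked through per-symbol CRCs (the kernel's CONFIG_MODVERSIONS mechanism). This sketch is a simplification; the symbol names and the way a "point release" perturbs CRCs are hypothetical. A module that only uses whitelisted symbols survives a point release; one that also uses other internals may not.

```python
from typing import Dict, Set


def load_check(module_crcs: Dict[str, int], kernel_crcs: Dict[str, int]) -> Set[str]:
    """modversions-style check: symbols whose recorded CRC is missing or different."""
    return {sym for sym, crc in module_crcs.items() if kernel_crcs.get(sym) != crc}


def apply_minor_update(kernel_crcs: Dict[str, int], backported: Set[str],
                       kabi_whitelist: Set[str]) -> Dict[str, int]:
    """Simulate a point release: backports may change any symbol NOT on the whitelist."""
    updated = dict(kernel_crcs)
    for sym in backported - kabi_whitelist:
        if sym in updated:
            updated[sym] ^= 0xDEADBEEF  # signature changed -> different CRC
    return updated
```

Here a driver built against `{"stable_fn": ..., "internal_fn": ...}` would fail to load on `internal_fn` after an update that touched it, which is the "HP failed to stick to those APIs" case.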


On this tangent, other out of tree patchsets like OpenVZ have had similar issues where the kernel has massively changed between versions, and forward porting their changes is challenging at best, even with a massive userbase.


That allows ZoL to run on many different *nixes, which means that if Linux made a drastically breaking change, you could use it on another OS, sure.

And the Linux kernel has made numerous breaking changes to their APIs that ZoL has been able to work around. So it has happened, they've just been able to deal with it.

Even if a breaking change that ZoL can't work around is improbable, it is still possible. The Linux Kernel could majorly overhaul an API in a major version release, in a way that ZoL can't handle. Given ZoL's status as a separate kernel module not under a GPL license, it's entirely possible that no amount of yelling gets the Linux maintainers to change their minds. In fact, Wowfunhappy notes that a discussion along those lines is happening right now, based on function symbols being removed for an API required for Mac hardware support.

And sure, that compatibility layer means that users could switch to FreeBSD, or Solaris, etc, and keep using ZFS. If that happens, does Delphix change their target platform again to move to FreeBSD? Or do they come up with another solution, and stop supporting ZoL?


Except that just recently we've seen access to an API removed in a way that broke the ZoL build. While this one is at worst a performance regression, it does show that the mainline kernel is very much able to break ZoL with an API change.

The last paragraph is pretty telling too.

https://marc.info/?l=linux-kernel&m=154714516832389


Example of this happening with ZFS on Linux, right now: https://lore.kernel.org/lkml/20190111180617.2k5uundov6hf4m7h...


From Greg Kroah-Hartman at https://lore.kernel.org/lkml/20190110182413.GA6932@kroah.com

> My tolerance for ZFS is pretty non-existant. Sun explicitly did not want their code to work on Linux, so why would we do extra work to get their code to work properly?

Ouch. It hurts to see Greg still holding onto that old grudge.

Wariness around commercial Unix vendors may've made some sense in 2005 when ZFS was released, but not only does it not make sense 14 years later, but the company the community viewed with suspicion has since entirely ceased to exist.

I spent a long time trying to finagle btrfs because it was the blessed, in-tree copy-on-write filesystem. It was a nightmare. It doesn't take long for ZFS to prove itself the massively superior solution.

Canonical's adoption of ZFS is a welcome relief, along with SFLC guidance that there is no incompatibility with the CDDL and GPL.

We need to bring the rest of the crew along and stop reinventing the wheel here. Linux will be so much better off once it accepts ZFS.


The biggest wtf here is that only some exports are GPL-only. One has to wonder whether there are ulterior motives behind this cherry-picking.
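For anyone unfamiliar with the mechanism: kernel code exports a symbol either with `EXPORT_SYMBOL` (any module may link to it) or `EXPORT_SYMBOL_GPL` (only modules whose `MODULE_LICENSE` is GPL-compatible may). This is a simplified Python model of that rule, not the loader's actual code; the license list is modeled on `license_is_gpl_compatible()` in the kernel's include/linux/license.h, and the symbol names are hypothetical.

```python
from typing import Dict, Set

# License strings the kernel accepts as GPL-compatible (modeled on
# license_is_gpl_compatible() in include/linux/license.h).
GPL_COMPATIBLE = {
    "GPL", "GPL v2", "GPL and additional rights",
    "Dual BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL",
}


def unresolved_symbols(module_license: str, needed: Set[str],
                       exports: Dict[str, bool]) -> Set[str]:
    """Symbols the loader would refuse to link.

    `exports` maps symbol name -> True if it was exported with
    EXPORT_SYMBOL_GPL, False if with plain EXPORT_SYMBOL.
    """
    gpl_ok = module_license in GPL_COMPATIBLE
    return {
        sym for sym in needed
        if sym not in exports or (exports[sym] and not gpl_ok)
    }
```

So flipping a single export from plain to GPL-only is enough to cut off a CDDL module like ZFS, while GPL modules using the same symbol are unaffected.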


Seriously that's crazy.

It's why I've never been a fan of the GPL. It's restrictive instead of permissive.


I feel like there's an 800-pound gorilla in the room that people are either forgetting or intentionally ignoring. Ubuntu promoted ZFS to a first-class citizen over a year ago. They have plenty of devs involved in the Linux kernel. What makes you think they won't continue to advocate for ZoL, as well as proactively fix any integration issues they find while new kernels are being developed?


Ubuntu is my daily driver, but don't expect Ubuntu promoting something to mean they're sticking with it for the long haul. See Wayland[1], Unity[2], and NetworkManager (EDIT: apparently still supported with Netplan). I once backported a bug in Ubuntu's version of a torrent client (Deluge), and got to inform the software's lead developer that Ubuntu had actually stopped using his software by default in the next release in favor of Transmission.

I've been using Ubuntu since 8.04, ran a LoCo team and even did some package maintenance, and I've watched Ubuntu embrace tech wholeheartedly and then drop it like a rock 2 releases later, over and over again. Ubuntu moves fast and breaks things, and their promoting something should not signal to you that they'll promote it long term. Long Term Support does not mean long-term new development, and LTS support does not mean helping to implement support in new major kernel versions.

1. https://blog.ubuntu.com/2018/01/26/bionic-beaver-18-04-lts-t...

2. https://arstechnica.com/information-technology/2018/05/ubunt...


They haven't given up Wayland though, it's just not ready for primetime quite yet - especially not for an LTS. They stuck with Unity for almost eight years as well, that's pretty long haul. I'm unaware of them dumping NetworkManager?

I'd agree their support of ZFS seems to be in an odd state; Neil McGovern had to deal with that weirdness as DPL [1], which was very annoying since DKMS was a "good enough" solution.

[1] https://debconf16.debconf.org/talks/9/


Ah, I saw they had moved to Netplan, I hadn't realized NetworkManager supported a Netplan interface.


Canonical is notorious for not contributing to the Linux kernel, although they do contribute (1% of contributions from kernel 4.8 to 4.13 [1]). They do promote ZFS, but their power within the kernel developer community is minimal.

[1] https://go.pardot.com/l/6342/2017-10-24/3xr3f2/6342/188781/P...


For a non-pdf source of development stats:

https://lwn.net/Articles/775440/


> The risk is in trying to keep compatible with the Linux Kernel, when not part of the Linux Kernel dev team.

But... that was already the case.


Yeah, even if there _weren't_ (arguable, debatable) license incompatibilities, the ZFS team doesn't _want_ it to be just a Linux submodule; they want it to be an independent thing that can be used with other OSs too, no? So it's really just part of the challenge of doing _that_, regardless of legal license issues. No?


I don't see how ZoL being the canonical ZFS repository means ZFS is now "just a linux submodule". No, it's a Git repository that easily could (and will, no doubt) support building for multiple target systems.


Any idea why they don't maintain a stable kernel mode counterpart to FUSE? (FKEE?)


https://lkml.org/lkml/2018/8/3/621

> Yeah, HELL NO!

Guess what? You're wrong. YOU ARE MISSING THE #1 KERNEL RULE.

We do not regress, and we do not regress exactly because your are 100% wrong.

And the reason you state for your opinion is in fact exactly WHY you are wrong.

Your "good reasons" are pure and utter garbage.

The whole point of "we do not regress" is so that people can upgrade the kernel and never have to worry about it.

-Linus Torvalds


The Linux kernel has made breaking changes to internal APIs [1] and deprecated internal APIs [2] before.

Specifically, your example brings up the danger that ZoL is in. The Linux kernel does everything in their power to not break User Land APIs. ZoL isn't User Land, and it replaced the previously extremely slow User Land ZFS Linux implementation. It is a Linux kernel module, and as [2] notes, Linux kernel devs are absolutely allowed to make breaking changes to internal APIs. The cost of the increased speed of a Kernel module is the risk of internal API breaking changes. I highly doubt that Linus would scream anywhere near as much about an internal API change that broke a separately maintained linux module with non-GPL code.

1. https://stackoverflow.com/questions/24897511/what-is-the-rep...

2. https://lwn.net/Articles/769365/


Pretty sure that only applies to userland apps, not kernel modules.


That's about the ABI to user-land, not about internal-to-the-kernel interfaces.


My understanding is that Nvidia's non-open source drivers are a pre-compiled binary blob that uses DKMS to compile a wrapper whose only purpose is to re-export existing kernel functionality in a questionably legal attempt to sidestep the GPL requirement. So that's completely different from any module that's actually compiled from source via DKMS.


There's no questionable legality; an end user can do whatever they want with GPL software. As long as the modules aren't bundled with the kernel or on the same media, the GPL is perfectly happy with the arrangement.


End users aren't the ones accused of violating the GPL. The issue is the non-GPL binary Nvidia drivers linking against the GPL kernel without releasing the source code.


The post isn't talking about legal risk?


That's exactly what I'm asking. How real is the risk to Nvidia of using DKMS?

It seems like an exactly parallel situation to me.

(PS: further, I don't think you can separate out the legal risk from attitude / API risk, because of things like this, mentioned elsewhere in this conversation: https://marc.info/?l=linux-kernel&m=154722999728768&w=2 - specifically marking symbols GPL to make them non-available.)


The risk to Nvidia is much lower than to ZoL -- Nvidia has the resources to keep up with the kernel. The driver is maintained in-house by the company selling the cards, which can fund development directly. But just like ZoL, if the kernel makes more and more breaking changes, Nvidia would eventually make the business decision to cut the proprietary driver loose. However, that threshold is likely much higher for a profitable company than for a community-funded project.

But the real price of being an external module is very clear for Nvidia proprietary driver issues. If you look at distro and kernel bug reports, you'll see piles of reports for Nvidia driver users, which devs won't look into because of the tainted kernel. Some Linux software even blacklists Nvidia driver users from using graphics hardware acceleration for their apps, because it causes so many bugs. Nvidia is operating outside of the open source ecosystem, which means they don't benefit from that ecosystem like open source implementations do. When an Nvidia driver user runs into a bug, they're usually just told to shove off and not use the driver.
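"Tainted" is a concrete flag, not just an attitude: the kernel keeps a bitmask, readable from /proc/sys/kernel/tainted, recording things like proprietary or out-of-tree modules having been loaded. As far as I know, loading Nvidia's driver or the CDDL-licensed zfs module sets at least the P and O bits. A minimal decoder, covering only a subset of bits taken from the kernel's tainted-kernels documentation:

```python
from pathlib import Path
from typing import List

# A subset of taint bits (see the kernel's "Tainted kernels" admin-guide page).
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    9: ("W", "kernel issued a warning"),
    12: ("O", "externally-built (out-of-tree) module was loaded"),
    13: ("E", "unsigned module was loaded"),
}


def decode_taint(value: int) -> List[str]:
    """Translate the taint bitmask into 'LETTER: meaning' lines, lowest bit first."""
    return [
        f"{letter}: {desc}"
        for bit, (letter, desc) in sorted(TAINT_FLAGS.items())
        if value & (1 << bit)
    ]


def read_taint(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the current taint value on a running Linux system."""
    return int(Path(path).read_text().strip())
```

On a live box, `decode_taint(read_taint())` shows roughly what a maintainer sees before deciding whether to look at a bug report.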


It's more about compatibility.

I've had the DKMS modules break on what were supposed to be minor, API-compatible kernel upgrades on a major distro. The code then wasn't fixed for a month (our fix was to roll back kernel versions and just tag it there; it's since been removed entirely).

Soured me on using ZFS for Linux at all in production.



