Native ZFS for Linux

wnoise · on Sept 23, 2010

Sort of.

From the FAQ:

1.3 How do I mount the file system?

You can’t… at least not today. While we have ported the majority of the ZFS code to the Linux kernel that does not yet include the ZFS posix layer. The only interface currently available from user space is the ZVOL virtual block device.

dataguy · on Sept 23, 2010

... ZPL, so the Z POSIX Layer, is the really awesome thing. Virtual block device access is neat, though. I wonder when the first one starts building a high performance data base project based on virtual blocks in zfs for linux. Could be interesting.

ssmoot · on Sept 23, 2010

ZVOLs are block devices. So you could install another fs on top of one. Or more usefully export it as an iSCSI target.

ithkuil · on Sept 24, 2010

sorry for the stupid question, but what features I get with ZVOLS exported as iSCSI/AoE versus exporting a LVM device?

Freaky · on Sept 23, 2010

As I recall, the FreeBSD ZFS port had ZVOL integrated into GEOM (FreeBSD block device layer) about a week after the porting effort was started. ZPL took significantly longer.

binomial · on Sept 23, 2010

What I'd love is for the L2ARC system built on top of ZFS to be ported as well. That's a killer feature for web applications. Basically, it can extend your cache to include an SSD, but manage that all internally, so you just use your database as usual on top of ZFS and let it handle all the caching and moving less used data from SSD to disk and vice versa, and of course the usual filesystem caching in RAM as well.

Though you could get similar functionality by using membase at the expense of being limited to get/set operations (it's a K/V store).

kmavm · on Sept 23, 2010

See Facebook's FlashCache if you get tired of waiting: http://www.facebook.com/note.php?note_id=388112370932

wmf · on Sept 23, 2010

Or bcache: http://bcache.evilpiepirate.org/FrontPage

binomial · on Sept 23, 2010

Very nice, thanks for both.

ssmoot · on Sept 23, 2010

If it's as small as a single server, you probably don't need an SSD to cache access. If your db server is partitioned out, just put the whole fs on SSD. You can't partition the L2ARC, so you really don't want to mix web assets with database volumes. The assets will push your db out of the cache unless it's so big that you're massively overpowered, or traffic so low it didn't matter in the first place.

That's some hard won knowledge there. :-)

(You can choose to cache assets meta-data only, but that tuning has major downsides itself and while it may protect your db, it's also likely to render your caches very under-utilized. Basically it isn't a silver bullet and it's still important to think of volumes and work-loads in terms of what "spindles" they're on.

steve19 · on Sept 23, 2010

It has been successfully ported to FreeBSD and now there is a functional Linux ZFS kernel port too

So this is a port and therefor CDDL licesned and no able to be merged into the mainline Linux Kernel, right?

Is ZFS still a hot commodity with BTRfs just around the corner?

ssmoot · on Sept 23, 2010

I've heard about btrfs being "just around the corner" for a good while now. ZFS on Nexenta and even FreeBSD is pretty robust. It's a great storage platform. I wouldn't let waiting on a stable btrfs make my SAN/NAS decisions for me.

The other concern in that realm is that when you're centralizing storage concerns to lower costs, boost performance and increase reliability you don't want software issues or corruption to take down your entire business.

ZFS can be a painful enough learning curve when it comes to that environment. I wouldn't trust btrfs until it's been stable for a couple years there. And outside of that environment, there are plenty of good stable alternatives for the DAS space. While ZFS is nice there, and I'm sure btrfs would be as well, that's not the bread and butter for these systems.

Andys · on Sept 23, 2010

I believe btrfs comes with Ubuntu 10.10

alphabeat · on Sept 23, 2010

Can you provide a source for that? Sounds exciting.

spahl · on Sept 23, 2010

It was supposed to get into 10.10 for testing but according to the blueprint it was deferred (https://blueprints.launchpad.net/ubuntu/+spec/foundations-m-...).

rm-rf · on Sept 23, 2010

I still see btrfs as years behind zfs as measured strictly by maturity (not features), and therefore consider it to be at least a few years away from being usable in situations where an advanced file system really matters (mission critical databases, 30+TB file systems, etc.)

ZFS has been around 4-5 years, yet in the last few months we hit a severe data loss bug (ZIL corruption) and a service affecting ZFS cache performance bug (ARC cache maintenance routine with math error).

I'd assume that btrfs will have similar teething pains.

lmz · on Sept 23, 2010

It will be interesting to see what happens with btrfs when Oracle now has Solaris and ZFS.

nailer · on Sept 23, 2010

Chris Mason pasted about this: according to him, nothing (remember BtrFS is btree based, ZFS isn't).

lmz · on Sept 24, 2010

This post[1] maybe? I just have some difficulty believing Oracle would continue to fund btrfs when they have ZFS. Then again, the code is already out there, the developer can simply move to Red Hat / IBM / Canonical / wherever and continue working if funding is pulled.

[1]: http://www.linux-foundation.org/weblogs/amanda/2009/06/22/a-...

nailer · on Sept 24, 2010

Why? Asides from the BtrFS having btrees and some (acknowledged by the ZFS developers) better design decisions (see the btrfs wikipedia page), ZFS keeps Solaris customers locked into Solaris and away from Linux.

lmz · on Sept 24, 2010

Oracle owns Solaris now. Why would they not want to lock their customers into Solaris?

nailer · on Sept 24, 2010

Yes. We agree.

patrickgzill · on Sept 23, 2010

I would say yes, given my experience with ZFS of a number of years in production.

mhansen · on Sept 23, 2010

Please explain?

patrickgzill · on Sept 23, 2010

It makes a lot of the common sysadmin tasks a whole lot easier.

For instance, creating raid sets - by default it works with whole disks, which is exactly what you want 99% of the time (but you can specify a partition if you really want to).

Want to check whether there is a data corruption error?

"zpool scrub poolname" and it will go through all the disks on that pool, using the built-in ECC to verify that the data on the disks is OK. If not, it will try to fix it.

"zpool status -v" lists all the disks, which mirrors or RAID sets each is part of, and when the last scrub was performed, along with errors or ECC problems.

The one suggestion I would make for the btrfs guys is to really work on making the CLI commands simple and intuitive.

ssmoot · on Sept 23, 2010

Minor correction, Scrub just rewrites all data. And since ZFS is copy-on-write and compares new blocks to old, you'll find corruption or bad blocks that way.

Also, since ZFS doesn't dedupe (just don't use it) or compress at rest, a scrub is the practical way to apply changes to those settings to the existing blocks already on disk.

patrickgzill · on Sept 23, 2010

Good point, this post talks about their need to rewrite the scrub code to fix a bug and make it faster. http://blogs.sun.com/ahrens/entry/new_scrub_code

bensummers · on Sept 23, 2010

Would you like to be one of the first to discover a data-loss bug in brtfs? If not, you probably want the tried and trusted ZFS.

spahl · on Sept 23, 2010

I wouldn't really call this ZFS port "tried and trusted". :-D

bensummers · on Sept 23, 2010

I wouldn't use it on Linux! Obviously Solaris would be the 'best' place to use it.

But even after a port, the underlying FS will have many more years of production use than brtfs. You might find bugs in the interface between the kernel and ZFS, but hopefully they wouldn't mess with any of the core fs code.

sigil · on Sept 24, 2010

Fully agree there. FreeBSD has had ZFS for 3+ years, and I still see some serious issues once in a while. Beware all new filesystems.

wazoox · on Sept 23, 2010

Actually you can't even distribute binaries at all. Btrfs is quite far from being "around the corner". A solid, dependable filesystem must be in the wild for a couple of years to be called "production ready".

nailer · on Sept 23, 2010

RHEL6 is shipping btrfs as a 'tech preview' (like, say, SystemTap used to be). So yeah, maybe 2 years, maybe 18 months, but the point is: the time starts pretty close to now.

wazoox · on Sept 23, 2010

Given that I actually was hurt by bugs in filesystems as old and stable as xfs, I'd still wait a little more than that before using btrfs in a large scale. I'll build and set up about 1.2 PB of storage this year....

nivertech · on Sept 23, 2010

Can you implement shared filesystem on multiple EC2 istances with this ZFS implementation? Which Linux distro is best for it?

spahl · on Sept 23, 2010

ZFS is not a distributed filesystem. What you can do is export it over nfs for example. It also has replication features with send/receive but I don't think this implementation supports it.

Edit: I notice that a lot of people think ZFS is a cluster/distributed filesystem. I don't understand where they get this idea.

nivertech · on Sept 23, 2010

Shared filesystem is not the same as cluster/distributed filesystem. It's mere a management convenience and letting legacy code work without changes.

There is a problem on Amazon EC2, that you can mount EBS (Elastic Block Storage) only to one EC2 instance at a time.

I was told by several people (including some from Sun), that the only way to mount the same EBS instance to multiple EC2 instances is to use OpenSolaris and ZFS. I don't think this solution involves NFS.

Anyway now when ZFS almost ported to Linux, it would be great to check this issue again (requires ZPL).

carson · on Sept 23, 2010

There is no way of attaching EBS volumes to multiple instances, see the EBS API overview: http://docs.amazonwebservices.com/AWSEC2/2010-08-31/Develope...

What might work if you wanted to do this is to mount the EBS volume to an instance and then expose the volume using a network block device to multiple instances. Seems convoluted. Being able to attach EBS to multiple instances may be something Amazon will add if there is enough interest.

ssmoot · on Sept 23, 2010

So I'd guess EBS is nothing more than ISCSI targets then. ZFS won't do anything to make those usable on multiple machines at once.

ZFS does make it easy to mirror multiple LUNs for redundancy though, so if you have a timeout or permanent disconnect from the device you remain operational. In addition detaching the device and moving it to another machine is trivial with zfs export/import commands.

leif · on Sept 23, 2010

So...when can I use this to share data between a BSD and a Linux machine?

c00p3r · on Sept 23, 2010

If it isn't backed by RedHat it is useless. Redhat have his amazing triple develop/test/bugfix process Fedora-development->Fedora->RHEL. But even Fedora is quite stable.

regularfry · on Sept 23, 2010

You could just as easily say "wait until this is in Debian testing before going near it."

c00p3r · on Sept 24, 2010

It takes ages. ^_^

c00p3r · on Sept 23, 2010

to stupid down-voters: try to count what amount of testing a standalone package gets compared to the package which is a part of Fedora Project? In term of number of installations? community involved? Fedora's team and infrastructure?

wmf · on Sept 23, 2010

Doesn't Ubuntu have more users than Fedora? You really sound like you're trolling. Maybe you should have said "if it isn't backed by a major distro it's useless".

c00p3r · on Sept 23, 2010

More than Fedora + CentOS + RHEL?

Are web-hosting or shared-hosting providers using Ubuntu?

What do you thing Amazon or RackSpace are using?

moe · on Sept 23, 2010

http://www.google.com/insights/search/#q=%20Centos%2C%20RedH...

RPM based distros generally have a tough standing since Ubuntu began to eclipse everything else in terms of popularity.

c00p3r · on Sept 24, 2010

OK. I'll try to explain.

This search results showing that there are a lot of Linux home users who're using Ubuntu because it is most popular Desktop distro. Its popularity in a large extend due to money Canonical spent on promotion. (Remember free shipping of CD's and other actions)

But if you're considering a server segment, there are traditionally CentOS, RHEL or Fedora. Or even that stupid Unbreakable Linux. Why it is so? (Because it got that way!)

First, because most of really important key developers, like mr. Ulrich Drepper, works for RedHat or with Redhat. Redhat also actively supported by IBM, that's why RHEL is a such huge success.

Second, since RHEL4, when they made that stupid decision to maintain all the patches by themselves (RHEL4's kernel source rpm had up to 300 patches in it, but I'm not quite sure you know what it is all about.)

Now, they work with a mainstream kernel developers, so all their code got a most extensive testing possible. This is what Google's Linux engineers still cannot understand - the best testing and bug hunting is in primary linux source tree.

So, what about Ubuntu? Compared to Fedora it is outdated, very conservative, desktop-oriented distro for an average mediocre user. I don't want to go deeper and try to compare source packages of key components, such as kernel, glibc and related, but I'm pretty sure there are different set of patches. One can compare common packages like perl, python, erlang, gcc, php by themselves.

Conclusion? OK, 3-5 years ago using Ubuntu or Debian as a server instead of Fedora-derived distros was a sign of non-professionalism, if not a fanboyism or even ignorance. What changed today? Almost nothing. Ubuntu is following the same conservative policies. There are some so-called server editions of Ubuntu, but it is rather a marketing movement.

And, you might be surprised, but key kernel developers are using Fedora. =)

moe · on Sept 24, 2010

using Ubuntu or Debian as a server instead of Fedora-derived distros was a sign of non-professionalism

Funny. In my circles it's considered a sign of masochism to run a RPM based distro in this day & age. To each their own, let's keep the evangelism off HN.

c00p3r · on Sept 24, 2010

Please re-read the part about RHEL4 and kernel patches. Take into account that some people were using Oracle and Informix before any Unbreakable things.

moe · on Sept 24, 2010

How does your response relate to what I just said?

c00p3r · on Sept 24, 2010

Of course, I'm trolling.

I'm trolling because I was around when FreeBSD 2.0 was considered as too modern, when there were no RedHat, let alone Ubuntu. I survived migrations from libc5 to glibc, from LinuxThreads to NTPL, from 2.4 to 2.6 kernels. I made a specialized distros for corporate usage from RH's and then RHEL4's srpms long before CentOS was here. I did a migration of Informix-based products from SCO Open Server to Linux in 1999 as a solution for so-called problem 2000.

Now, when any newfag without any understanding of what depends of what and why it is so, can do yum install or apr-get install php* what should I do?

I'm still unable to comprehend what mc is or what nautilus is for. I'm still using only 5 programs - vi, emacs, mit-scheme, chromium and mplayer, and have an addiction for rebuilding anything with clang.

So, seems like I'm a troll. In this case yes, I'm trolling.

BrandonM · on Sept 23, 2010

"If it isn't backed by RedHat it is useless," is a pretty ridiculous statement. Taken to its logical conclusion, much of the software running the world's critical infrastructure is "useless." If you had said something like, "I will avoid this piece of software until it is backed by RedHat because [yadda yadda yadda]," I'm sure your comment would have been better received.

c00p3r · on Sept 24, 2010

much of the software running the world's critical infrastructure In case when it runs on top of Linux it uses RHEL or CentOS. Or even that ridiculous Unbreakable Linux.