I'm currently managing a Postgres cluster with a petabyte of data running on ZFS on Linux on AWS. Most of the issues we've come across have been a result of our own unfamiliarity with ZFS.
The first main issue was the arc_shrink_shift default being poor for machines with a large ARC. Our machines have ARC sizes of several hundred GB, so the default arc_shrink_shift meant evicting several GB from the ARC at a time. This was causing our machines to become unresponsive for several seconds at a time, pretty frequently.
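For anyone hitting the same thing, the tuning is roughly the following (the parameter lives under /sys/module/zfs/parameters on ZoL; the value 11 here is only illustrative, test against your own workload):

    # Check the current ARC shrink shift (default is 7, i.e. evict ~1/128 of the ARC per pass)
    cat /sys/module/zfs/parameters/zfs_arc_shrink_shift

    # Raise it so each shrink pass evicts a smaller slice of a very large ARC
    echo 11 > /sys/module/zfs/parameters/zfs_arc_shrink_shift

    # Persist across reboots
    echo "options zfs zfs_arc_shrink_shift=11" >> /etc/modprobe.d/zfs.conf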
The other main issue we encountered was when we tried to delete lots of data at once. We aren't sure why, but when we deleted a lot of data (~200GB from each machine, each of which contains several TB of data), our databases became unresponsive for an hour.
Other than these issues, ZFS has worked incredibly well. The built-in compression has saved us lots of $$$. It's just the unknown unknowns that have been getting us.
Agreed, ZFS has its caveats, but feature-wise and stability-wise ZFS is, to a large degree, what BTRFS should have been.
The licensing is incredibly unfortunate, though. (I don't care about the reasoning for the license; it's just bad that it isn't GPL-compatible and so can't be merged into the most prolific kernel in the world.)
Anyway, back to BTRFS-vs-ZFS. It seems abundantly clear that a filesystem is no longer a thing where you can just "throw an early idea out there" and hope that others will pick up the slack and fix all the bugs. There's just too much design (not just code) that goes into these things; it's not only about code any more.
My (small) bet right now as to the "next gen" FS on Linux is on bcachefs[1, 2]. It seems much sounder from a design perspective than BTRFS, plus it's built on the already-proven bcache, etc. (Read the page for details.)
According to Canonical, it _is_ GPL compatible. Either way, that shouldn't get in the way of the best file system in existence being used with the kernel of last resort.
Canonical ships ZoL binaries as of April 2016. They claim doing so doesn't violate the GPL since they are shipping it as a module rather than building it into the kernel.
No, they're supplied as kernel modules, packaged separately from the kernel. Before Ubuntu 15.10 you could still install it as a DKMS module (such that it compiled on the system it's being installed on). Now they just ship the pre-built .ko's, saving the user compilation time. There are still userland tools to interact with it (zpool, zfs, etc.).
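For reference, getting it going on 16.04 looks roughly like this (package name per Ubuntu's archive; the module itself ships alongside the kernel packages):

    # Userland tools (zpool, zfs, etc.)
    sudo apt install zfsutils-linux

    # Load the module and check that it's there
    sudo modprobe zfs
    zpool status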
> We aren't sure why, but when we deleted a lot of data (~200GB from each machine, each of which contains several TB of data), our databases became unresponsive for an hour.
There used to be an issue where users hitting their quota couldn't delete files, because on a copy-on-write filesystem deleting a file first requires writing new metadata. The trick was to find some reasonably large file and `echo 1 > large_file`, which truncates the file in place and frees up enough space that you can begin removing files - something like the sketch below. Maybe this kind of trick could help you guys.
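Concretely, something like this (paths are made up):

    # rm is failing with ENOSPC / "disk quota exceeded"
    ls -lS /tank/data | head -5      # find a reasonably large file
    echo 1 > /tank/data/big.log      # truncate it in place to free some space
    rm -r /tank/data/old-stuff       # deletions should now have room to proceed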
That said, it's inadvisable to run a database on a copy-on-write filesystem like ZFS or btrfs if you're keeping an eye on write performance.
Cf. PostgreSQL 9.0 High Performance by Gregory Smith (https://www.amazon.com/PostgreSQL-High-Performance-Gregory-S...)
> That said, it's inadvisable to run a database on a copy-on-write filesystem like ZFS or btrfs if you're keeping an eye on write performance.
Our writes are actually heavily CPU bound because of how we architected the system[0]. We recently made some changes that dramatically improved our write throughput, so AFAICT, we aren't going to need to focus much on write performance in the near future.
Could you elaborate more on your setup? What's in the ZFS pool that supports the performance of running a DB as well as a PB of data without breaking the bank?
It's not a single machine. We have a cluster of machines, each of which has several TB of data. The only parameter I clearly remember changing is recordsize=8k, since Postgres works with 8k pages.
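i.e. something along these lines (pool/dataset names are placeholders):

    # Match the ZFS record size to Postgres' 8k page size
    zfs create -o recordsize=8k tank/pgdata
    zfs set compression=lz4 tank/pgdata   # compression is what's been saving us money; lz4 is the usual pick

    # recordsize can also be changed later, but it only affects newly written blocks
    zfs get recordsize,compression tank/pgdata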
This is quite a bit easier to do with Btrfs, since there are installers that support it; Btrfs also has two neat features that ZFS lacks (both sketched below).
1. Reflink copies. Basically this is a file-level snapshot. The metadata is unique to the new file, but it initially shares extents with the original.
2. Seed device. The volume is set read-only and mounted; you add a 2nd device, remount rw, and delete the seed. This causes the seed's contents to be replicated to the 2nd device, but with a new volume UUID. A use case might be doing a minimally configured, generic installation and then using it as a seed for quickly creating multiple unique instances.
Another use case: don't delete the 1st volume. Each VM gets two devices: the read-only seed (shared) and a read-write 2nd device (unique). The rw device (the sprout) internally references the read-only seed, so you need only mount by the UUID of the sprout.
Seed-sprout is something like an overlay, or a volume-wide snapshot.
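A rough sketch of both (device and file names are made up):

    # 1. Reflink copy: new metadata, extents shared until either copy is modified
    cp --reflink=always base-image.qcow2 clone.qcow2

    # 2. Seed device: flag a filesystem as a seed, then sprout writable copies from it
    btrfstune -S 1 /dev/sdb                 # mark /dev/sdb as a read-only seed
    mount /dev/sdb /mnt/vol
    btrfs device add /dev/sdc /mnt/vol      # add a writable 2nd device (the sprout)
    mount -o remount,rw /mnt/vol
    # optionally replicate everything off the seed and drop it:
    # btrfs device delete /dev/sdb /mnt/vol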
I have an Ubuntu thumb-drive image I made that does both MBR and EFI boot with root on ZoL. I've used it to install ~10 computers now: boot, partition, attach the internal disk to rpool, detach the thumb drive, then install the bootloader - all without ever rebooting - and you have a fully working install after that. It is pretty slick.
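In outline, the attach/detach dance looks something like this (device names are whatever your hardware ends up with):

    # Booted from the thumb drive (rpool lives on /dev/sdb2 here); partition the internal disk
    sgdisk -n1:0:0 -t1:BF01 /dev/sda

    # Mirror the thumb drive's pool onto the internal disk and wait for the resilver
    zpool attach rpool /dev/sdb2 /dev/sda1
    zpool status rpool

    # Drop the thumb drive from the pool and install the bootloader on the internal disk
    zpool detach rpool /dev/sdb2
    grub-install /dev/sda && update-grub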
Ideally, ZFS has exclusive control over the storage. When it's virtualized, it doesn't and there may be various HBAs, raid controllers, etc., in between ZFS and the actual disks. These can (do) "get in the way" and you (can) lose one of the biggest features of ZFS (data integrity and error correction).
The keyword is definitely "ideally" :-) With a cloud provider like AWS, storage is always virtualized - so we've always got that working against us. I see ZFS in AWS more about flexibility than data integrity, although having said that ZFS should do just as well or better (certainly no worse) than EXT4, XFS, or BTRFS for reliability. The ability to add storage dynamically without having to move bits around is powerful.
And I've actually had random hardware failures (a couple of times now) on EBS storage on AWS, with random notification emails from Amazon and accompanying data loss.
Might as well treat your zpool like it's on real hardware and configure raidz accordingly. The cloud does have real, problematic hardware behind it, and it's important to remember that.
[edit] Especially if you can configure your block devices such that you know they're sitting on different physical hardware at the cloud data center, you will gain that benefit of ZFS.
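e.g. a raidz pool over several separately attached EBS volumes (device names are placeholders):

    # Four independently attached EBS volumes
    zpool create tank raidz /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
    zpool status tank    # checksums plus parity now cover a whole misbehaving volume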
For a little more detail about this: it's somewhat common for these controllers to lie about when things hit the disk. They'll cache the write and flush it out later, but if they lose power or a write gets corrupted, ZFS will have no way of discovering it until it's too late.
They are not, but they are better at swallowing the errors and not bothering you with such details. ZFS fails fast & early, while EXT4 will fail when you realize your Postgres DB is borked.
I guess it's possible that some type of disk command timing could cause unexpected lockups or slowdowns that you wouldn't get with a system that doesn't try to control the hardware to quite the same extent as ZFS, but my (cursory) understanding is that it's rare/hardware specific.
My personal take is that running ZFS on hardware that lies is no worse than running EXT4 on it. YMMV as I'm not a storage expert.
Search online for published papers related to "IRON File Systems." Some researchers injected errors into various parts of common file systems to see how well they recovered. I think ZFS was the best of the bunch, though that research is from a few years back and things may have improved elsewhere.
If the storage lies about syncs, the best you can hope for is replaying a consistent state somewhere in the past. Log structured filesystems with checksums would be a good bet here.
There are virtualization solutions that provide exclusive access to storage. Hyper-V surprisingly is one such solution and I've been running ZFS inside of Hyper-V instances for a couple of years now with no issue.
I haven't tested it with /boot in ZFS, but I have a laptop that has an EXT /boot partition and a LUKS partition with ZFS for everything else (/, /home, etc). It works just fine - and I've never noticed any performance issues with this setup - obviously LUKS has some overhead, but running ZFS on LUKS works as well as EXT4 or BTRFS on LUKS, with all the advantages of ZFS.
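The setup is roughly the usual LUKS-then-pool layering (device names are made up; the various root-on-ZFS HOWTOs cover the fiddly parts):

    # ext /boot on its own partition; everything else goes through LUKS
    cryptsetup luksFormat /dev/sda3
    cryptsetup luksOpen /dev/sda3 cryptroot

    # Build the pool on the unlocked mapper device
    zpool create -o ashift=12 rpool /dev/mapper/cryptroot
    zfs create -o mountpoint=/ rpool/root
    zfs create -o mountpoint=/home rpool/home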
There's nothing ZFS-specific about this. Your admins should be ensuring that you're not doing concurrent access to shared storage, or if you are, that you're using a lock-based cluster manager.
I've been waiting for Ubuntu's native installer to support simple ZFS-root installs for a while now. This document is basically the same process: bootstrap Linux on a supported FS, then use userspace tools to make a ZFS filesystem on a new block device, copy Linux there, adjust the boot system's pointers, and reboot.
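In very rough strokes (pool and device names are illustrative; the HOWTOs spell out the details):

    # From a running system installed on a conventional FS:
    zpool create -o ashift=12 rpool /dev/sdb1
    zfs create -o mountpoint=/mnt/newroot rpool/ROOT

    # Copy the installed system over (-x stays on the root filesystem, skipping /proc, /sys, etc.)
    rsync -aXHx / /mnt/newroot/

    # Then: fix /etc/fstab, chroot, grub-install, update-initramfs, reboot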
Considering how the current installer supports btrfs, it really doesn't seem like it should take too much effort. Someone, however, will have to put in that effort.
And maybe people using ZFS want proper volume/subvolume management as good as support for traditional partitions or LVM volumes? If so, it will probably take a while longer to land.
I've been waiting years now; it's not really a problem. When it's available I will install it on a test machine, make a copy of my live server's data there, and then test it for a year.
I can (I'm the author). The partitions are aligned to 2048-sector boundaries (2048 sectors x 512 bytes = 1 MiB, which is evenly divisible by 4096, so the partitions are 4K-aligned). But also note the first usable sector is 2048, not 0 - so the first partition, although we tell sgdisk to start at 0, actually runs from sector 2048 to 4095. I don't know the exact reason why the first usable sector is 2048 - I believe it has to do with legacy support and MBR compatibility - but I'm not sure of the details.
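You can poke at both of these directly with sgdisk (device name is a placeholder):

    # Print the partition table; note the "First usable sector" line
    sgdisk -p /dev/xvdf

    # Show the alignment boundary sgdisk is using (2048 sectors by default)
    sgdisk -D /dev/xvdf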
Yes. It always has. The "ZFS needs ECC RAM" meme comes from the fact that on many systems the (non-ECC) RAM is the weakest point in the data integrity path if you are running ZFS.
For an analogy, consider a world where car engines have a non-trivial chance of instantly exploding when involved in a crash. Then someone comes out with an engine that doesn't explode. People say "you should wear seatbelts if you use this non-exploding engine," but your car has no seatbelts; clearly you are still safer with the non-exploding engine, but all of a sudden, seatbelts are more likely to save your life than before.
I think the concern is more that if you encounter a bit-flip while checksumming a block on write, then ZFS will mark the data corrupt on read, and thus make good data unreadable.
A non-checksumming FS would not be vulnerable to this particular issue. On the other hand, would undetected corruption through bit rot be a worse problem? Almost certainly.
And considering how vanishingly unlikely such a scenario is, I do agree with your sentiment.
That said, I'm not sure if my understanding of the issue is complete and would welcome an explanation of the failure scenarios that [very] occasional RAM bit-flips expose ZFS to.
It all depends on the relative probability of those two scenarios occurring, though. And the problem with ZFS is that each time you are scrubbing you are essentially rolling the dice, so you are rolling them a lot more times.
Let's say that you have a 99.9% chance of the scrub running correctly on a big pool with non-ECC memory (a 0.1% chance of a bit-flip during the scrub). Any single scrub is extraordinarily likely to succeed, but if you run a scrub every day then, over the course of a year, your chance of your pool surviving falls to 0.999^365 = 69.4%.
Pick your favorite numbers here; 0.1% failure chance per scrub is probably way high. With five nines your yearly survival rate is 99.6%. But do remember that soft errors are fairly rare in modern servers mostly because they use ECC RAM; you can't look at data from ECC systems and assume you'll get comparable results using non-ECC RAM.
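If you want to play with the numbers yourself:

    # Yearly survival probability for a given per-scrub success rate, with daily scrubs
    python3 -c 'print(0.999**365, 0.99999**365)'
    # ~0.694 and ~0.996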
In general, if you scrub infrequently you are probably going to be OK. (But then why are you using ZFS instead of LVM?) If you live at high altitude, however - let's say in Denver - you are also facing significantly increased soft error rates. The extra atmosphere at sea level makes a real difference in shielding - something around a 5x reduction in strike events.
On the plus side - the SRAM and some parts of the processor do use ECC internally, which is good because fault rates increase with reduced feature size and increased number of transistors. The CPU is potentially the most sensitive part of the system per unit area, so it's very important to protect against errors there.
And on the other hand - disk corruption or failure probably outweigh those kinds of concerns in practice. But it's not like it's expensive to get a system with ECC. An Avoton C2550 runs like $250. So why take the risk anyway? Your data's worth an extra $100 in hardware.
Heck, you can run ECC RAM on the Athlon 5350 and the Asus AM1M-A motherboard. Boom, ECC mobo/CPU combo for under $100. It's just a little thin on SATA channels. It's a shame there's no "server" version of this board with dual NICs, IPMI, and an extra SATA controller tossed on there.
If your memory is error-free already, you'll be fine.
If it's imperfect, ZFS will occasionally calculate a checksum wrong and write this to all drives in the array. At some point in the future, like when you read the file, the checksum will fail on all drives and the whole file will be marked corrupt. This gets annoying fast.
Your memory is not error-free, but it might be close enough.
> We also allow overlay mount on /var. This is an obscure but important bit - when the system initially boots, it will log to /var/log before the /var ZFS filesystem is mounted.
Shouldn't, perhaps, but we definitely have to (and are running with systemd)! I have not made any effort to understand what is logging early - there's not much that should happen prior to ZFS mounting everything.
Yep, it's empty when the target was created/installed. It's something happening early at boot time, but I haven't diagnosed exactly what's going on. And now that I think about it, I've seen the same thing on local ZFS Debian installs as well - worth digging into when I have some time, likely a bug somewhere.
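For context, one way to allow that overlay mount (assuming a pool named rpool; depending on your ZoL version it's either the dataset property or the -O flag to zfs mount):

    # Let the dataset mount over a non-empty /var
    zfs set overlay=on rpool/var
    # or, at mount time:
    zfs mount -O rpool/var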
Anyone have thoughts on how to turn this into something more reproducible, like Packer? Love the idea, but I'd hate to rebuild AMIs by hand every X time for Y regions.
I'm working on a Packer builder type that will support this use case as we speak. The existing AWS `chroot` builder would likely be sufficient, but requires running from within AWS.
Seems like a perfectly logical pre-req. Easy enough to automate with, say, a Lambda function to spin up a pre-configured host instance to build the AMIs.
So... very exciting, looking forward to hearing more about when it "hits the streets"!
I'll be curious to hear what you come up with. You should be able to create the image file locally and then shove it into S3, and use AWS just for dd'ing it to EBS.
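Roughly (volume/instance IDs, devices, and hosts are all placeholders):

    # Build the raw image locally, then get it onto a scratch instance (directly or via S3)
    aws ec2 create-volume --size 10 --availability-zone us-east-1a
    aws ec2 attach-volume --volume-id vol-0123abcd --instance-id i-0123abcd --device /dev/xvdf
    scp zfs-root.img ec2-user@scratch-host:
    ssh ec2-user@scratch-host 'sudo dd if=zfs-root.img of=/dev/xvdf bs=1M'

    # Then snapshot the volume and register an AMI from the snapshot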
Ed White has been doing it in production on RHEL/CentOS for a number of years now. He's got a great write-up on a "roll your own" clustered ZFS-based SAN replacement here: https://github.com/ewwhite/zfs-ha/wiki
You can probably hit-up Ed via Github or Server Fault if you want to talk to him about it directly.
I have messed with ZFS on Linux on Ubuntu and I have to say that I would not yet trust it in production. It's not as bulletproof as it needs to be and it's still under heavy development. It's not even at version 1.0 yet.
We've actually been running it in production at Netflix for a few microservices for over a year (as Scott said, for a few workloads, but a long way from everywhere). I don't think we've made any kind of announcement, but "ZFS" has shown up in a number of Netflix presentations and slide decks: e.g., for Titus (container management). ZFS has worked well on Linux for us. I keep meaning to blog about it, but there have been so many things to share (BPF has kept me more busy). Glad Scott found the time to share the root ZFS stuff.
If I had to choose between a filesystem with silent and/or visible data corruption, up to pretty much eating itself and forcing you to restore an entire server, versus a filesystem you can trust but which could have a kernel deadlock/panic... I would choose the latter, and in fact did.
I have seen a few servers with ext4/mdraid over the last five years have serious corruption but have had to reset a ZoL server maybe twice.
I transitioned an md RAID1 from spinning disks to SSDs last week. After I removed the last spinning disk, one of the SSDs started returning garbage.
1/3 of reads are returning garbage and ext4 freaks out, of course. By then it's too late and the array is shot. I restore from backup.
This would have been a non-event with ZFS. I've got a few production ZoL arrays running and the only problems I've had have been around memory consumption and responsiveness under load. Data integrity has been perfect.
We're strongly considering using something else until this gets addressed. The problem is, we don't know what, because every other CoW implementation also has issues.
* dm-thinp: Slow, wastes disk space
* OverlayFS: No SELinux support
* aufs: Not in mainline or CentOS kernel; rename(2) not implemented correctly; slow writes
Have you had any issues to report? If so, how quickly were they fixed? Knowing what the typical time is to address these issues would help us make a more educated decision.
Yes, we've run into 2 or 3 ZFS bugs that I can think of that were resolved in a timely fashion (released within a few weeks if I recall) by Canonical working with Debian and zfsonlinux maintainers (and subsequently fixed in both Ubuntu and Debian - and upstream zfsonlinux for ones that were not debian-packaging related). Of course your mileage may vary, and it depends on the severity of the issue. Being prepared to provide detailed reproduction and debug information, and testing proposed fixes, will greatly help - but that can be a serious time commitment on your side (for us, it's worth it). Hope that helps!
ZFS is not in the mainline or CentOS kernel, so you are presumably willing to try stuff. I believe all the overlay/SELinux work is now upstream; it is supposed to ship in the next RHEL release.
1) I've seen users complaining about data loss in issues on GitHub.
2) Had the init script fail on upgrade and had to fix it by hand when upgrading Ubuntu. Probably a one-time issue.
We have been running ZFS on Linux in production since April 2015 on over 1500 instances in AWS EC2 with Ubuntu 14.04 and 16.04. Only one kernel panic observed so far, on a Jenkins/CI instance, but that was due to Jenkins doing magic on ZFS mounts, believing it was a Solaris ZFS mount.
In our opinion, when we made the switch, it was much more important to be able to trust the integrity of the data than to avoid any possible kernel panic.
Well, we (and by this I mean myself and my fantastic team) have been running it since 2015 as the main filesystem for a double-digit number of KVM hosts running a triple-digit number of virtual machines executing an interesting mix of workloads, ranging from lightweight (file servers for light sharing, web application hosts) to heavy I/O bound ones (databases, build farms) with fantastic results so far. All this on Debian Stable.
The setup process was a bit painful: some interesting delays with certain HW storage controllers caused udev to not make some HDD devices available under /dev before the ZFS scripts kicked in, and we have been bitten a couple of times by changes (or bugs) in the boot scripts. However, the gains provided by ZFS in terms of data integrity, backup, and virtual machine provisioning workflow were definitely worth it.
It's maturing rapidly and has proven to be very stable so far. We're not using it by default everywhere, at least not yet, and building out an AMI that uses ZFS for the rootfs is still a bit of a research project - but we have been using it to do RAID0 striping of ephemeral drives for a year or two on a number of workloads.
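The ephemeral-stripe case is about as simple as ZFS gets (device names vary by instance type):

    # Stripe the instance-store devices into one scratch pool (no redundancy, by design)
    zpool create -f scratch /dev/xvdb /dev/xvdc
    zpool status scratch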
The implementation might be lacking but the underlying FS should be more reliable. I'd still argue that ZFS should be deployed on FreeBSD or Solaris. There are plenty of ways to fire up a Linux environment from there.
I've been using ZFS on Ubuntu since ~2010 for a small set of machines, reading/writing 24/7 with different loads. It's worked great through quite a few drive replacements and various other hardware failures.
I'm perfectly willing to believe there may be some rare situations where ZFS on Linux will cause you a problem. But I bet they're rare enough that it'll have saved you a few times before it bites you.