ZFS Root Filesystem on AWS (scotte.org)
174 points by lscotte on Dec 29, 2016 | 106 comments



I'm currently managing a Postgres cluster with a petabyte of data running on ZFS on Linux on AWS. Most of the issues we've come across have come from us not knowing ZFS well.

The first main issue was the arc_shrink_shift default being poor for machines with a large ARC. Our machines have ARCs of several hundred GB, so the default arc_shrink_shift was evicting several GB at a time. This was causing our machines to become unresponsive for several seconds at a time, pretty frequently.
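
For anyone hitting the same thing, a rough sketch of the tuning (the value is illustrative, not a recommendation) - zfs_arc_shrink_shift controls what fraction of the ARC (2^-shift) a single reclaim pass can evict, so raising it makes each eviction smaller:

  # persist across reboots
  echo "options zfs zfs_arc_shrink_shift=11" >> /etc/modprobe.d/zfs.conf
  # or change it at runtime
  echo 11 > /sys/module/zfs/parameters/zfs_arc_shrink_shift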

The other main issue we encountered was when we tried to delete lots of data at once. We aren't sure why, but when we tried to delete a lot of data (~200GB from each machine, which each contain several TB of data), our databases became unresponsive for an hour.

Other than these issues, ZFS has worked incredibly well. The built-in compression has saved us lots of $$$. It's just the unknown unknowns that have been getting us.
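
If anyone wants to try the compression, it's a one-liner (dataset name hypothetical):

  zfs set compression=lz4 tank/data
  zfs get compressratio tank/data   # shows how much you're actually saving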


Agreed, ZFS has its caveats, but feature-wise and stability-wise ZFS is, to a large degree, what BTRFS should have been.

The licensing is incredibly unfortunate, though. (I don't care about the reasoning for the license; it's just bad that it isn't GPL-compatible and thus can't ship with the most prolific kernel in the world.)

Anyway, back to BTRFS-vs-ZFS. It seems abundantly clear that a filesystem is no longer a thing where you can just "throw an early idea out there" and hope that others will pick up the slack and fix all the bugs. There's just too much design (not code) that goes into these things; it's not just about code any more.

My (small) bet right now as to the "next gen" FS on Linux is on bcachefs[1, 2]. It sounds much sounder from a design perspective than BTRFS, plus it's built on the already-proven bcache, etc. etc. (Read the page for details.)

[1] https://www.patreon.com/bcachefs [2] https://bcache.evilpiepirate.org/Bcachefs/


According to Canonical, it _is_ GPL compatible. Either way, that shouldn't get in the way of the best file system in existence being used with the kernel of last resort.


The CDDL is incompatible with the GPL. The GPL, however, is not incompatible with the CDDL.

This means Linux copyright owners could sue ZoL binary distributors, but Oracle could not.

However, no one is shipping ZoL binaries, only the source code. The code itself is 100% conflict-free.


Canonical ships ZoL binaries as of April 2016. They claim doing so doesn't violate the GPL since they ship it as a module rather than building it into the kernel.


So you're saying it's only going to be available in user space and never in the kernel? Why doesn't Oracle just relicense it?


No, they're supplied as kernel modules, packaged separately from the kernel. Before Ubuntu 15.10 you could still install it as a DKMS module (such that it compiled on the system it was being installed on). Now they just ship the pre-built .ko's, saving the user compilation time. There are still userland tools to interact with it: zpool, zfs, etc.
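
Roughly, on Ubuntu 16.04 (package names from memory - check your release):

  apt install zfsutils-linux   # userland tools, with pre-built modules shipped alongside the kernel
  apt install zfs-dkms         # the older DKMS route, which builds the module locally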


>We aren't sure why, but when we tried to delete a lot of data (~200GB from each machine, which each contain several TB of data), our databases became unresponsive for an hour.

There used to be an issue where users hitting their quota couldn't delete files, since ZFS is copy-on-write and deleting a file means first writing new metadata. The trick was to find some reasonably large file and `echo 1 > large_file`, which truncates the file and frees up enough space that you can begin removing files. Maybe this kind of trick could help you guys.

That said, it's inadvisable to run a database on a btree file system like ZFS or btrfs if you're keeping an eye on the write performance. cf Postgres 9.0 High Performance by Gregory Smith (https://www.amazon.com/PostgreSQL-High-Performance-Gregory-S...)

and

https://blog.pgaddict.com/posts/postgresql-performance-on-ex...


> That said, it's inadvisable to run a database on a btree file system like ZFS or btrfs if you're keeping an eye on the write performance.

Our writes are actually heavily CPU bound because of how we architected the system[0]. We recently made some changes that dramatically improved our write throughput, so AFAICT, we aren't going to need to focus much on write performance in the near future.

[0] http://blog.heapanalytics.com/running-10-million-postgresql-...



Could you elaborate more on your setup? What's in the ZFS pool that supports the performance of running a DB as well as a PB of data without breaking the bank?


It's not a single machine. We have a cluster of machines, each of which has several TB of data. The only parameter I clearly remember changing is recordsize=8k, since Postgres works with 8k pages.
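
For reference, that's just the following (dataset name hypothetical; recordsize only affects data written after it's set, so set it before loading):

  zfs create -o recordsize=8k tank/pgdata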


This is quite a bit easier to do with Btrfs, since there are installers that support it; Btrfs also has two neat features lacking in ZFS.

1. Reflink copies. Basically this is a file-level snapshot: the metadata is unique to the new file, but it (initially) shares extents with the original. (Both features are sketched below.)

2. Seed device. The volume is marked read-only, mounted, a 2nd device is added, it's remounted rw, and the seed is deleted. This causes the seed's contents to be replicated to the 2nd device, but with a new volume UUID. A use case might be to do an installation minimally configured so as to be generic, and then use it as a seed for quickly creating multiple unique instances.

Another use case: don't delete the 1st volume. Each VM has two devices: the read only seed (shared), a read write 2nd device (unique). The rw device (sprout) internally references the read-only seed, so you need only mount by the UUID of the sprout.

Seed-sprout is something like an overlay, or a volume-wide snapshot.
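
Rough sketches of both (device names hypothetical):

  # 1. reflink copy: new metadata, initially shared extents
  cp --reflink=always vm.img vm-clone.img

  # 2. seed device
  btrfstune -S 1 /dev/sdb            # mark the volume read-only (a seed)
  mount /dev/sdb /mnt
  btrfs device add /dev/sdc /mnt     # add the sprout
  mount -o remount,rw /mnt
  btrfs device delete /dev/sdb /mnt  # replicate to the sprout, drop the seed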


I have an Ubuntu image for a thumb drive that I made that does both MBR and EFI boot with root on ZoL. I've used it to install ~10 computers now: boot, partition, attach the internal disk to rpool, detach the thumb drive, then install the bootloader - all without ever rebooting - for a fully working install. It is pretty slick.
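
The core of it is a mirror attach/detach dance, something like (device paths hypothetical):

  zpool attach rpool <usb-disk-part> <internal-disk-part>
  zpool status rpool            # wait for the resilver to complete
  zpool detach rpool <usb-disk-part>
  grub-install /dev/<internal-disk>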


I thought you shouldn't be running ZFS inside a virtualized environment?


Actually, with PCI pass-through it's quite safe to run ZFS inside a virtual machine. Take a look at this great post: http://www.freenas.org/blog/yes-you-can-virtualize-freenas/


In an ideal perfect world, you wouldn't. In reality, however, plenty of people do and, personally, I've not experienced any issues with it.


Why?


Ideally, ZFS has exclusive control over the storage. When it's virtualized, it doesn't and there may be various HBAs, raid controllers, etc., in between ZFS and the actual disks. These can (do) "get in the way" and you (can) lose one of the biggest features of ZFS (data integrity and error correction).


The keyword is definitely "ideally" :-) With a cloud provider like AWS, storage is always virtualized - so we've always got that working against us. I see ZFS in AWS more about flexibility than data integrity, although having said that ZFS should do just as well or better (certainly no worse) than EXT4, XFS, or BTRFS for reliability. The ability to add storage dynamically without having to move bits around is powerful.
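
For example (device name hypothetical):

  zpool add tank /dev/xvdf   # new EBS volume; capacity usable immediately, existing data isn't rebalanced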


And I've actually had (a couple times now) random hardware failures, with random notification emails from Amazon, on EBS storage on AWS, with accompanying data loss.

Might as well treat your zpool like it's on real hardware and configure raidz accordingly. The cloud does have real, problematic hardware behind it, and it's important to remember that.

[edit] Especially if you can configure your block devices such that you know they're sitting on different physical hardware at the cloud data center, you will gain that benefit of ZFS.
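
A sketch of such a raidz layout (assuming four EBS volumes attached as xvdf-xvdi):

  zpool create tank raidz /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi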


For a little more detail about this: it's somewhat common for these controllers to lie about when things hit the disk. They'll cache a write and flush it out later, but if they lose power or a write gets corrupted, ZFS will have no way of discovering it until it's too late.


Are other file systems better off with recovery when interacting with virtual disks and host power failure?


They are not, but they are better at swallowing the errors and not bothering you with such details. ZFS fails fast & early, while EXT4 will fail when you realize your Postgres DB is borked.

I guess it's possible that some type of disk command timing could cause unexpected lockups or slowdowns that you wouldn't get with a system that doesn't try to control the hardware to quite the same extent as ZFS, but my (cursory) understanding is that it's rare/hardware specific.

My personal take is that running ZFS on hardware that lies is no worse than running EXT4 on it. YMMV as I'm not a storage expert.


Search online for published papers related to "IRON File Systems." Some researchers injected errors into various parts of common file systems to see how well they recovered. I think ZFS was the best of the bunch, though that research is from a few years back and things may have improved elsewhere.


The IRON paper doesn't mention ZFS; I found some citing papers that focus on ZFS but have no comparisons.


Yes you are right. I believe it was a different paper by the same research group... http://pages.cs.wisc.edu/~kadav/zfs/zfsrel.pdf


If the storage lies about syncs, the best you can hope for is replaying a consistent state somewhere in the past. Log structured filesystems with checksums would be a good bet here.


The short answer is no.


Short answer "no" with a longer answer of "ZFS is sometimes better than most other options"


There are virtualization solutions that provide exclusive access to storage. Hyper-V surprisingly is one such solution and I've been running ZFS inside of Hyper-V instances for a couple of years now with no issue.


Pass-through device virtualization of various types provides this facility.


In a way it's a good thing ZFS is getting tried on AWS; it might trigger some bugs.


"there may be various HBAs, raid controllers, etc., in between ZFS and the actual disks."

Or in the case of EBS - NFS and a network stack. EBS is "Emulated Block Storage" despite what AWS may like you to think.


Source on EBS being NFS?


Does this same idea apply when running ZFS on top of the Linux LUKS mapper (whatever the proper terminology is)?


I haven't tested it with /boot in ZFS, but I have a laptop that has an EXT /boot partition and a LUKS partition with ZFS for everything else (/, /home, etc). It works just fine - and I've never noticed any performance issues with this setup - obviously LUKS has some overhead, but running ZFS on LUKS works as well as EXT4 or BTRFS on LUKS, with all the advantages of ZFS.
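
For anyone curious, the setup is just ZFS on top of the opened mapper device (device names hypothetical):

  cryptsetup open /dev/sda2 cryptroot
  zpool create -o ashift=12 rpool /dev/mapper/cryptroot   # ashift=12 for 4k sectors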


There's nothing ZFS-specific about this. Your admins should be ensuring that you're not doing concurrent access to shared storage, or if you are, that you're using a lock-based cluster manager.


Looks like it's not an issue. I was just reading the warnings a year or so ago, specifically relating to FreeNAS and how it would corrupt your data.


I've always heard the same. Is there a filesystem that _is_ recommended for virtual envs?


I've been waiting for Ubuntu to finish the native installer's support for simple ZFS-root installs for a while now. This document is basically the same process: bootstrap Linux on a supported FS, then use user-space tools to make a ZFS fs on a new block device, copy Linux there, adjust the boot system's pointers, and reboot.


I've used this guide and it works fine:

https://github.com/zfsonlinux/zfs/wiki/Ubuntu-16.04-Root-on-...

It, however, advises using blockid, not device-id, for mounts.

Any idea if this doesn't apply to AMIs?


My point is that I want the installer itself to lay down the original disk FS as ZFS.


Considering how the current installer supports btrfs, it really doesn't seem like it should take too much effort. Someone, however, will have to put in that effort.

And maybe people using ZFS want proper volume/subvolume management, as good as the support for traditional partitions or LVM volumes? If so, it will probably take a while longer to land.


Right; Canonical is expected to do this work, as they already added ZFS as a supported filesystem to Ubuntu.


In that case: Prepare to wait. Canonical has delayed lots of things lately in their eagerness to launch Unity 8 ;)


I've been waiting years now; not really a problem. When it's available I will install it on a test machine, make a copy of my live server's data there, and then test it for a year.


Try out Proxmox (Debian-based). It works quite well for root ZFS.


I don't want a derivative - I just want pure, plain Debian or Ubuntu.


Can someone explain why the author is aligning to a 4096-sector (2M) boundary, while most tools (gdisk et al.) default to 2048 sectors?


I can (I'm the author). The partitions are aligned to 2048 sectors (4096 is evenly divisible by 2048). But also note the first usable sector is 2048, not 0 - so the first partition, although we tell sgdisk to start at 0, actually is from sector 2048-4095. I don't know the exact reason why the first usable sector is 2048 - I believe it has to do with legacy support and MBR compatibility - but I'm not sure of the details.
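
To illustrate with a hypothetical layout (not the article's exact commands):

  # "start at 0" tells sgdisk to pick the first usable sector, i.e. 2048
  sgdisk -n 1:0:+512M -t 1:EF00 /dev/xvda   # EFI system partition, starts at sector 2048
  sgdisk -n 2:0:0     -t 2:BF01 /dev/xvda   # rest of the disk for ZFS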


I was intrigued by the exact reason for the first usable sector and found this link, which might be an interesting read:

http://jdebp.eu./FGA/disc-partition-alignment.html


Is it ok to run ZFS without ECC RAM?


Yes, it always has been. The "ZFS needs ECC RAM" meme comes from the fact that on many systems the (non-ECC) RAM is the weakest point in the data integrity path if you are running ZFS.

For an analogy, consider a world where car engines have a non-trivial chance to instantly explode when involved in a crash. Then someone comes out with an engine that doesn't explode. People say "you should wear seatbelts if you use this non-exploding engine," but your car has no seatbelts; clearly you are still safer with the non-exploding engine, but all of a sudden, seatbelts are more likely to save your life than before.


I think the concern is more that if you encounter a bit-flip while checksumming a block on write, then ZFS will mark the data corrupt on read, and thus make good data unreadable.

A non-checksumming FS would not be vulnerable to this particular issue. On the other hand, would undetected corruption through bit rot be a worse problem? Almost certainly.

And considering how vanishingly unlikely such a scenario is, I do agree with your sentiment.

That said, I'm not sure if my understanding of the issue is complete and would welcome an explanation of the failure scenarios that [very] occasional RAM bit-flips expose ZFS to.


It all depends on the relative probability of those two scenarios occurring, though. And the problem with ZFS is that each time you are scrubbing you are essentially rolling the dice, so you are rolling them a lot more times.

Let's say that you have a 99.9% chance of a scrub running correctly on a big pool with non-ECC memory (0.1% chance of a bit-flip during the scrub). Any single scrub is extraordinarily likely to succeed, but if you run a scrub every day, then over the course of a year your chance of your pool surviving falls to 0.999^365 = 69.4%.

Pick your favorite numbers here; 0.1% failure chance per scrub is probably way high. With five nines your yearly survival rate is 99.6%. But do remember that soft errors are fairly rare in modern servers mostly because they use ECC RAM; you can't look at data with ECC and assume you'll get comparable results by using non-ECC RAM.

In general, if you scrub infrequently you are probably going to be OK (but then why are you using ZFS instead of LVM?). If you live at high altitude, however - let's say in Denver - you are also facing significantly increased soft error rates. The extra atmosphere at sea level does make a strong difference in shielding, something around a 5x reduction in strike events.

On the plus side - the SRAM and some parts of the processor do use ECC internally, which is good because fault rates increase with reduced feature size and increased number of transistors. The CPU is potentially the most sensitive part of the system per unit area, so it's very important to protect against errors there.

And on the other hand - disk corruption or failure probably outweigh those kinds of concerns in practice. But it's not like it's expensive to get a system with ECC. An Avoton C2550 runs like $250. So why take the risk anyway? Your data's worth an extra $100 in hardware.

Heck, you can run ECC RAM on the Athlon 5350 and the Asus AM1M-A motherboard. Boom, ECC mobo/CPU combo for under $100. It's just a little thin on SATA channels. It's a shame there's no "server" version of this board with dual NICs, IPMI, and an extra SATA controller tossed on there.


A bit flip on a pool without redundancy will result in an unrecoverable error. There are no recovery tools for ZFS.

Since ZFS caches aggressively in RAM it's conceivable that faulty RAM may write corrupt data to disk.


Except...not wearing a seatbelt might cause a flat tire for the old car, and might cause an engine explosion for the new car.


Yes - ECC is recommended, but not required. And on AWS the underlying hardware has ECC RAM anyway. Good question though.


Depends who you ask.

Can you? Sure.

Is it smart to possibly introduce errors into the data path in an otherwise fully checksummed path? Not so much.


If your memory is error-free already, you'll be fine.

If it's imperfect, ZFS will occasionally calculate a checksum wrong and write this to all drives in the array. At some point in the future, like when you read the file, the checksum will fail on all drives and the whole file will be marked corrupt. This gets annoying fast.

Your memory is not error-free, but it might be close enough.


> We also allow overlay mount on /var. This is an obscure but important bit - when the system initially boots, it will log to /var/log before the /var ZFS filesystem is mounted.

You shouldn't need to in a systemd-based distro: journald logs to /run until /var comes up, and then flushes across. See https://www.freedesktop.org/software/systemd/man/journald.co...
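
The relevant knob from journald.conf (this is the default):

  [Journal]
  Storage=auto

With Storage=auto, journald keeps logs in /run/log/journal and only persists them to /var/log/journal if that directory already exists.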


Shouldn't, perhaps, but we definitely have to (and are running with systemd)! I have not made any effort to understand what is logging early - there's not much that should happen prior to ZFS mounting everything.


Did you (accidentally) create the directory /var/log/journal in the rootfs? Make sure your /var is actually empty in the rootfs image.


Yep, it's empty when the target was created/installed. It's something happening early at boot time, but I haven't diagnosed exactly what's going on. And now that I think about it, I've seen the same thing on local ZFS Debian installs as well - worth digging into when I have some time, likely a bug somewhere.


For less production-oriented instructions on Ubuntu 16.04, here is a simple guide: http://www.howtogeek.com/272220/how-to-install-and-use-zfs-o...


Anyone have thoughts on how to turn this into something more reproducible, like Packer? Love the idea, but would hate to rebuild AMIs by hand every X time for Y regions.


I definitely have thoughts about automating the process :-) - we certainly plan on getting the whole process into an automated, repeatable build.


(Disclaimer - I work at HashiCorp)

I'm working on a Packer builder type that will support this use case as we speak. The existing AWS `chroot` builder would likely be sufficient, but requires running from within AWS.
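
For reference, a minimal sketch of what an amazon-chroot builder config looks like (the AMI ID and name here are placeholders):

  {
    "builders": [{
      "type": "amazon-chroot",
      "source_ami": "ami-xxxxxxxx",
      "ami_name": "zfs-root-{{timestamp}}"
    }]
  }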


Seems like a perfectly logical pre-req. Easy enough to automate with, say, a Lambda function to spin up a pre-configured host instance to build the AMIs.

So... very exciting, looking forward to hearing more about when it "hits the streets"!


I'll be curious to hear what you come up with. You should be able to create the image file locally and then shove it into S3, and use AWS just for dd'ing it to EBS.


::claps madly with delight::


Anybody tried ZFS on CentOS or other RHEL-like distros?


Ed White has been doing it in production on RHEL/CentOS for a number of years now. He's got a great write-up on a "roll your own" clustered ZFS-based SAN replacement here: https://github.com/ewwhite/zfs-ha/wiki

You can probably hit up Ed via GitHub or Server Fault if you want to talk to him about it directly.


We run Kafka (and I believe Cassandra) on top of CentOS/ZFS in AWS and haven't had any issues so far. (About a TB of data moving through every 3 days.)



Thanks! I'd read that before, was more looking for anecdata from someone who'd tried it, especially in production.


As a follow up question to yours, how about Amazon Linux?


I have messed with ZFS on Linux on Ubuntu and I have to say that I would not yet trust it in production. It's not as bulletproof as it needs to be, and it's still under heavy development - not even at version 1.0 yet.


We've actually been running it in production at Netflix for a few microservices for over a year (as Scott said, for a few workloads, but a long way from everywhere). I don't think we've made any kind of announcement, but "ZFS" has shown up in a number of Netflix presentations and slide decks: eg, for Titus (container management). ZFS has worked well on Linux for us. I keep meaning to blog about it, but there's been so many things to share (BPF has kept me more busy). Glad Scott found the time to share the root ZFS stuff.


If I had to choose between a filesystem with silent and/or visible data corruption, up to pretty much eating itself and having to restore an entire server, versus a filesystem you can trust but that could have a kernel deadlock/panic... I would choose the latter, and in fact did.

I have seen a few servers with ext4/mdraid over the last five years have serious corruption but have had to reset a ZoL server maybe twice.


Story time.

I transitioned an md RAID1 from spinning disks to SSDs last week. After I removed the last spinning disk, one of the SSDs started returning garbage.

1/3 of reads are returning garbage and ext4 freaks out, of course. It's too late and the array is shot. I restore from backup.

This would have been a non-event with ZFS. I've got a few production ZoL arrays running and the only problems I've had have been around memory consumption and responsiveness under load. Data integrity has been perfect.


I've seen the same type of thing with respect to memory and load.


Do you have any specific reasons not to trust ZoL?

ZFS-on-Linux devs say it's ready for production[1].

Lawrence Livermore laboratory stores petabytes of data using ZoL[2].

If we're sharing anecdotes, ZoL has served me fantastically for several years.

[1] https://clusterhq.com/2014/09/11/state-zfs-on-linux/ [2] http://computation.llnl.gov/newsroom/livermores-zfs-linux-po...


We have encountered a reproducible panic and deadlocks when a containerized process gets terminated by the kernel for exceeding its memory limit:

https://github.com/zfsonlinux/zfs/issues/5535

We're strongly considering using something else until this gets addressed. The problem is, we don't know what, because every other CoW implementation also has issues.

* dm-thinp: Slow, wastes disk space

* OverlayFS: No SELinux support

* aufs: Not in mainline or CentOS kernel; rename(2) not implemented correctly; slow writes


The issue you link to was opened a day ago.

If that were me, I'd see how quickly it was fixed before strongly considering something else.


Have you had any issues to report? If so, how quickly were they fixed? Knowing what the typical time is to address these issues would help us make a more educated decision.


Yes, we've run into 2 or 3 ZFS bugs that I can think of that were resolved in a timely fashion (released within a few weeks if I recall) by Canonical working with Debian and zfsonlinux maintainers (and subsequently fixed in both Ubuntu and Debian - and upstream zfsonlinux for ones that were not debian-packaging related). Of course your mileage may vary, and it depends on the severity of the issue. Being prepared to provide detailed reproduction and debug information, and testing proposed fixes, will greatly help - but that can be a serious time commitment on your side (for us, it's worth it). Hope that helps!


ZFS is not in the mainline or CentOS kernel either, so you are presumably willing to try stuff. I believe all the overlay/SELinux work is now upstream; it is supposed to ship in the next RHEL release.


I look forward to that.


My reasons are as follows:

1) Seen users complaining about data loss in issues on GitHub.

2) Had the init script fail on upgrade and had to fix it by hand when upgrading Ubuntu. Probably a one-time issue.

Need a bit more reliability from a file system.


I thought "ZoL" was a pun with ZFS and LOL to tell how not ready it is for production ^^


ZoL is an acronym for ZFSonLinux.


We have been running ZFS on Linux in production since April 2015 on over 1500 instances in AWS EC2 with Ubuntu 14.04 and 16.04. Only one kernel panic observed so far, on a Jenkins/CI instance, but that was due to Jenkins doing magic on ZFS mounts, believing it was a Solaris ZFS mount.

In our opinion, when we made the switch, it was much more important to trust the integrity of the data, than any possible kernel panic.


Well, we (and by this I mean myself and my fantastic team) have been running it since 2015 as the main filesystem for a double-digit number of KVM hosts running a triple-digit number of virtual machines executing an interesting mix of workloads, ranging from lightweight (file servers for light sharing, web application hosts) to heavy I/O bound ones (databases, build farms) with fantastic results so far. All this on Debian Stable.

The setup process was a bit painful given some interesting delays when using some HW storage controllers that caused udev to not make some HDD devices available under /dev before the ZFS scripts kicked in and we have been bitten a couple times by changes (or bugs) in the boot scripts, however the gains provided by ZFS in terms of data integrity, backup, and virtual machine provisioning workflow were definitely worth it.


It's maturing rapidly and has proven to be very stable so far. We're not using it by default everywhere, at least not yet, and building out an AMI that uses ZFS for the rootfs is still a bit of a research project - but we have been using it to do RAID0 striping of ephemeral drives for a year or two on a number of workloads.
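
For the ephemeral striping, that's just listing the devices with no redundancy keyword (device names hypothetical):

  zpool create ephemeral /dev/xvdb /dev/xvdc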


It's bullet proof on Solaris and FreeBSD.


Which doesn't say anything about its state on Linux.


The implementation might be lacking but the underlying FS should be more reliable. I'd still argue that ZFS should be deployed on FreeBSD or Solaris. There are plenty of ways to fire up a Linux environment from there.


You didn't get the hint. He's saying you should be using Solaris or FreeBSD instead of Linux.


Depends on what you're worried about. Operationally speaking I agree, it's not plug and play.

But it's at a point where it safely stores your data correctly. Perhaps some init scripts fail on boot to import your pool/etc. but the data is there.

We do run it production, but we also have in-house tooling built around it.


I've been using ZFS on Ubuntu since ~2010 for a small set of machines, reading/writing 24/7 with different loads. It's worked great through quite a few drive replacements and various other hardware failures.

I'm perfectly willing to believe there may be some rare situations where ZFS on Linux will cause you a problem, but I bet they're rare enough that it'll have saved you a few times before it bites you.


Do you trust btrfs? SUSE has had it as the default since 2014...


> The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes. [0]

[0]: https://btrfs.wiki.kernel.org/index.php/RAID56


Important to note that this refers only to RAID 5 and 6.


My newly built (Ubuntu 16.04 LTS) workstation is using ZFS exclusively. I'm keeping my fingers crossed.



