ZFS Is the Best Filesystem For Now (fosketts.net)
311 points by ingve on July 12, 2017 | 269 comments



> ZFS never really adapted to today’s world of widely-available flash storage: Although flash can be used to support the ZIL and L2ARC caches, these are of dubious value in a system with sufficient RAM, and ZFS has no true hybrid storage capability.

How is L2ARC not "true hybrid"?

> And no one is talking about NVMe even though it’s everywhere in performance PC’s.

Why should a filesystem care about NVMe? It's a different layer. ZFS generally doesn't care if it's IDE, SATA, NVMe or a microSD card.

> can be a pain to use (except in FreeBSD, Solaris, and purpose-built appliances)

I think it's just a package install away on many Linux distros? Also installable on macOS — I had a ZFS USB disk I shared between Mac and FreeBSD.

Also it's interesting that these two sentences appear in the same article:

> best level of data protection in a small office/home office (SOHO) environment.

> It’s laughable that the ZFS documentation obsesses over a few GB of SLC flash when multi-TB 3D NAND drives are on the market

Who has enough money to get a multi-TB SSD for SOHO?!


> How is L2ARC not "true hybrid"?

It doesn't persist across import/export or reboot.

It is demand-filled.

Not all data in the main vdevs are eligible for l2arc.

There is memory overhead for l2arc buffers.

There is CPU overhead in processing l2arc headers.

Once a buffer is in l2arc it stays in l2arc until the underlying data is overwritten or destroyed, or until the l2arc has filled up and the buffer is replaced with fresher data.

A true hybrid in the zfs context would let one pin a dataset or zvol onto a particular vdev, or pin only the (zfs) metadata (or a subset thereof) of a dataset, zvol or pool to a particular vdev.

OpenZFS will eventually get both persistence and this form of true hybrid.

Unfortunately automatic migration by zfs of hot data to low-latency vdevs and cool data from low-latency vdevs is not really possible without solving the infamous block-pointer-rewrite problem.


> Why should a filesystem care about NVMe?

Because at some point the filesystem becomes a bottleneck. ZFS was designed with the assumption that CPUs would be way faster than storage. When you get speeds over 10GB/sec [0], you are going to spend a lot of time checksumming all that data.

[0] http://www.seagate.com/ca/en/about-seagate/news/seagate-demo...


Fletcher checksums are very cheap and not a bottleneck.

https://github.com/zfsonlinux/zfs/issues/4789#issuecomment-2...


Maybe I'm reading those benchmarks wrong, but they appear to max out well under 10GB/s. This would mean you'd be CPU bound on your checksums alone with one of those Seagate cards.


That is a Xeon Phi; a modern Xeon with 10+ cores should do tens of GB/s.

https://cloud.githubusercontent.com/assets/472018/13333262/f...

https://github.com/zfsonlinux/zfs/pull/4330


Hmm, I for one don't want to tie up ten cores with IO.


If you aren't checksumming your data, is it truly your data?


Who checksums the checksummers? ;)


You can't handle the checksum


Mr Edward Checksum Checker.

Update: At least 1 person didn't get this joke :D


> Hmm, I for one don't want to tie up ten cores with IO.

What else are you going to use the two CPUs in a filer for?


On a single thread.

ZFS is well pipelined for multicore throughput.


Although not recommended, you can turn checksums off. The granularity is at the dataset level, too, so you have a lot of flexibility.
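
For what it's worth, that's a one-liner per dataset; a minimal sketch, assuming a pool named tank with a throwaway dataset:

  zfs set checksum=off tank/scratch   # disable checksums for this dataset only
  zfs get -r checksum tank            # verify what every dataset has inherited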


What could ZFS do differently to solve that problem while maintaining data integrity?


Just one idea: offload checksum calculation to a DMA engine. Linux already has a generic DMA engine facility in the kernel, backed by e.g. I/OAT on some Intel hardware.


Assuming this is the same thing as hardware-assisted checksums, both the ZFS on Linux maintainer and Intel have said that they are working on it at various cons last year.


Hybrid storage combines flash and disk for better performance and lower cost. Most hybrid storage is tiered, meaning that data can "live" on either flash or disk, but there are other approaches that look a lot more like a cache (see Nimble Storage, for example). True hybrid storage would work like Apple Fusion Drive or hybrid Storage Spaces Direct - a single pool with SSD and HDD where data can reside wherever is best.

L2ARC is only ever a cache and although it can be on SSD, ZFS generally won't use much more than a few tens of GB. ZIL isn't even a cache and is only used for synchronous writes. Check any ZFS tuning guide and the gist will be "just buy more RAM or create an all-SSD pool" rather than trying to wedge an SSD into L2ARC or ZIL.
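
For reference, "wedging an SSD into L2ARC" is mechanically trivial; whether it actually helps is the real question. A sketch with a made-up device name:

  zpool add tank cache /dev/nvme0n1   # attach an SSD/NVMe device as L2ARC for pool "tank"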


L2ARC and the ZIL can be great in many storage situations, but not all.

We have a TrueNAS appliance with a 480GB L2ARC and a small 120GB ZIL (never fully used) that backs image/document storage for our ECM suite. Our metadata usage on the filesystem is astronomical due to having billions of small (<16KB) files. The L2ARC may not do a whole lot for the actual data (since most of it isn't hit frequently enough to be eligible for the ARC/L2ARC), but it's instrumental in maintaining performance with such massive amounts of filesystem metadata.


A write-focused SLOG device can massively improve your write IOPS though, and is massively cheaper than an "all-SSD" pool while approaching its normal-use performance. If you have a sustained heavy I/O load, sure - going full SSD will be your best bet, but for most use cases you only need peak IOPS performance in short bursts.


How reliable are hybrid storage drives? I feel like the last one I saw failed pretty shockingly quickly but not sure (it wasn't mine)...


I'm sorry, I was referring to hybrid storage as a technology category, not the hybrid disk drives like Seagate Momentus XT. You're right that a 2-drive hybrid (Fusion Drive) will mathematically be less reliable than a non-hybrid one and that those "hybrid disks" haven't lived up to the hype.

Pretty much every enterprise storage solution designed today is hybrid (SSD plus HDD) or all-flash and includes lots of advanced availability features.


> "Pretty much every enterprise storage solution designed today is hybrid"

By this do you mean simply that there is both SSD and disk storage around, or do you mean that the storage system transparently chooses where to store particular things without the apps having to care?

Because the latter thing is not my experience.


If you buy an enterprise storage frame there are a variety of mechanisms where as a storage consumer you just see a LUN, but in the backend individual extents/blocks/volumes are tiered in different performing areas.

These systems will transparently "promote" hot blocks to flash or faster spinning disk, etc., and demote cold blocks based on usage patterns and policy, without the app being aware.

There are a lot of different ways to do it, ranging from tiering within the array to a storage virtualization solution that can tier data across different storage platforms. In one case I worked on a project where that virtualization tech was used to consolidate 10 data centers to 1, pretty much transparent to the end users and mostly transparent to the folks running apps. (Exceptions were mostly apps that built their own storage HA.)


Ohh okay, gotcha.


I had a ridiculously bad experience with the Seagate Momentus XT drives. Never again..


> How is L2ARC not "true hybrid"?

L2ARC in my understanding is only for reads, whereas ZIL is the write ahead log. Ideally, ZFS would "combine" the two into a MRU "write through cache" such that data is written to the SSD first, then asynchronously written to the disk after (ZIL does this already) but then, when the data is read back, it's read back from the SSD.


This makes sense; if you were building a service that served bytes off of disk, with an in-memory LRU cache, you wouldn't have two separate pools of memory, with one for writes and one for reads.

Did the ZIL and L2ARC concepts come up before SSD was widely available? Especially the ZIL seems very much optimized for crazy enterprise 15k rpm spinning rust. Memory and SSD access characteristics are so different from spinning disks; I don't know why ZFS separates ZIL and L2ARC.


Because the ZIL is a journal log, not a cache. It is intended to increase data security without sacrificing too much performance. Many people also confuse ZIL and SLOG devices though...

By default, the ZIL is written to the same disks where the data will be stored, but an external device (aka SLOG) can be added. From that point on, your write IOPS will be limited by this SLOG device, so normally you add a more expensive, fast disk as the SLOG device to increase your write IOPS.
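
Concretely, attaching a dedicated SLOG (and removing it again later) looks roughly like this; device names are made up:

  zpool add tank log mirror /dev/sdx /dev/sdy   # mirrored SLOG for sync writes
  zpool remove tank mirror-1                    # log vdevs can be removed again (use the name zpool status shows)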


> Did the ZIL and L2ARC concepts come up before SSD was widely available?

Yes. After they realized that their initial claims about not needing such things was bullshit (which some of us had told them at the time) but before SSDs became common.


> I think it's just a package install away on many Linux distros? Also installable on macOS — I had a ZFS USB disk I shared between Mac and FreeBSD.

It's easy to install if you want to use it as an additional filesystem. But if you want to install e.g. RHEL on root ZFS, it's quite an adventure. Even Ubuntu with first-class support for ZFS does not support ZFS on root out of the box. Actually I don't understand it. Of all the features, snapshots look like the killer feature for Linux distributions. Make a snapshot before an upgrade, allow an easy rollback if the upgrade goes wrong. Something like Windows restore points, but much more reliable.
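
The "restore point" workflow really is that small; a sketch assuming a root dataset named rpool/ROOT/ubuntu:

  zfs snapshot rpool/ROOT/ubuntu@pre-upgrade   # take the restore point
  # ...run the upgrade; if it goes wrong:
  zfs rollback rpool/ROOT/ubuntu@pre-upgrade   # rolling back the live root is usually done from a rescue/boot environment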


ZFS root absolutely is supported on Ubuntu and Debian. Every Debian system I've built in the past year or so has been 100% ZFS (around 6 physical systems, but I've also built AWS AMIs this way). You have to install via debootstrap, but it's definitely a working and supported configuration. Installer support would sure be welcome though, it's not trivial!


And that's exactly the issue. You can't just use Ubiquity to set it up, or even the terminal; you have to go down a long, complicated, badly documented path.


If you use proxmox (Debian), you do not need a terminal. And you get all the nice proxmox features. But I agree, standalone setup is no fun.


Actually, I do use the terminal to install it. Boot off a liveusb, switch vty, and go.


I agree, snapshots are amazing for / on a linux machine. Makes backups actually work right, and be very inexpensive (computationally).

But, the tooling is still rather new and untested. At work we are still stamping out upstream bugs that really shouldn't exist, but the ecosystem is certainly getting better. It's only been a couple years since the ZFS on Linux project has gotten remotely any uptake by the distro folks, and even now that interest is rather tepid due to the licensing issues.

If ZFS was GPL I think it would have been the default filesystem on Linux for quite some time now.


> if you want to use it as an additional filesystem

Which is totally fine for a NAS! But yeah, I like my ZFS root on all the FreeBSD installs :)

> snapshots looks like killer feature for Linux distributions

True. FreeBSD and of course Solaris/illumos had boot environments for a long time, it's an excellent feature.


Since Btrfs also has snapshots, it seems perfectly adequate for a root FS. It hasn't seemed important to me to run a ZFS root, even though I have all my user data in ZFS.


btrfs snapshots have a lot of gotchas. Quoting myself:

https://news.ycombinator.com/item?id=14724820


If you have a mandatory flush to disk anytime there's a snapshot, they suddenly become a lot more expensive. They are still atomic in that the operation either completely happened or didn't happen at all if a crash occurs after the command returns, i.e. you don't end up with a partial or corrupt snapshot after you reboot.

As for the need for a snapshot to be on disk when sending: what you cite says you get an obscure "stale NFS file handle" error, not that you get missing files. If you're silently getting missing files on a send/receive, that's a bug and should be reported.

I use snapshots quite a bit both for data, backup/replication with send/receive, and for root fs, and haven't had problems with it. The known problem with Btrfs snapshots is that they are deceptively cheap to create, but become expensive later on to delete due to back reference searching, freeing extents (or not if they're still held by other snapshots) and updating metadata. Dozens to small hundreds aren't normally a problem, in my use case I don't notice performance problems.


Good to know. I've used them only very rarely.


RHEL has a repository run by the ZFS on Linux team, and if you know ZFS it's not much harder than partitioning your disks. The big issue is a lack of documentation on using GRUB with ZFS. It's seemingly nonexistent, leading most guides to say /boot needs a separate partition.


> /boot needs a separate partition

This is true with any Linux-supported FS on most modern PCs since UEFI can only read VFAT partitions.


The ESP does not need to contain /boot; it only needs to contain the EFI binaries. And the UEFI specification is for an independent filesystem based on FAT; VFAT is still patented. In practice, any FAT filesystem usually works.


You actually don't need a separate /boot - recent GRUB versions know what to do. You do need a small EF02 partition (I make mine 4096 sectors), and then the rest of the disk as a BF02 partition. I wrote this up for an AWS AMI, but I use the exact same setup on my Debian desktops and servers - https://www.scotte.org/2016/12/ZFS-root-filesystem-on-AWS
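
In sgdisk terms that layout is roughly the following (disk name is hypothetical, type codes and sizes as described above):

  sgdisk -n1:0:+2M -t1:EF02 /dev/sda   # ~4096-sector BIOS boot partition for GRUB
  sgdisk -n2:0:0   -t2:BF02 /dev/sda   # rest of the disk for the ZFS pool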


/boot and the EFI System Partition aren't necessarily the same thing


> Who has enough money to get a multi-TB SSD for SOHO?!

https://www.amazon.com/Crucial-MX300-Internal-Solid-State/dp...

I was contemplating a build with 2 of these in a RAID 1 configuration for my next homelab server.

Personally I run a gaming (windows) desktop at home and an always-on UPS-backed homelab server that handles minor ops tasks (mostly backing up side projects and some ETL) + provides dev VMs.

My home office "budget" is ~$2k/year. My gaming/work desktop is ~4 years old and represents ~$2.5k of that budget. Monitors/peripherals/desk/chair generally eat another $2k and are of a similar age. I generally spend ~$2.5k on the dev server. I then usually toss ~$1k into a laptop.

I could easily see someone who purely works from home (rather than 1-2 days a week) operating with a larger budget and genuinely needing a ZFS setup of 2TB SSDs.

Realistically, $2-3k/year is 2-5% of the sort of salaries we see on HN given we make a living at this sort of thing... it isn't surprising to me that people would spend that kind of money.

https://hurdlr.com/blog/software-web-developer-tax-deduction...

Keep in mind "business equipment" certainly qualifies for such a dev server so you won't be paying taxes on it if you itemize as well.


> Keep in mind "business equipment" certainly qualifies for such a dev server so you won't be paying taxes on it if you itemize as well.

Even factoring in my homelab spend, I don't get more from itemizing than from taking the standard deduction as a married individual making $85K, and I spent close to $2,000 on it last year.


> How is L2ARC not "true hybrid"?

I think what that may be referring to is that the ARC is in-RAM and obviously cleared on a reboot, so as a result L2ARC on an SSD is also not persistent. After a reboot, you have to allow the ARC to fill up, then as it evicts data from the L1ARC it's pushed to L2ARC. Until that happens the SSD is not used at all.


ARC is not necessarily cleared on reboot (if doing a "fast" reboot, that is, a kernel reload); that's platform and implementation-dependent. Look up "Persistent L2ARC".


Perhaps "true hybrid" == RAM->SSD->HDD->tape, that is, warm-through-cold storage?

Whatever.

What would be really nice is raw SSD/storage access so that ZFS (or other FS) could manage all the wear leveling and bad block mappings.


Wear leveling is an implementation detail of the medium that can and will change. It doesn't seem right to put that in the filesystem itself. If anything, I would think it should go into some 'generic <specific flash technology here>' device driver.


> It doesn't seem right to put that in the filesystem itself

It absolutely belongs in the filesystem. When you're doing RAID of any type across the devices, you need that layer to manage the underlying media. A single device view will never appropriately manage wear leveling and garbage collection.

There's a reason companies like NetApp have been working with drive vendors to have more control over the underlying media:

http://www.samsung.com/us/labs/pdfs/2016-08-fms-multi-stream...


It's a difficult problem to solve. If we do put it in the filesystem, we need some way for the filesystem to know how the NAND behaves. Otherwise the lowest common denominator dictates and nobody gains anything.

With this in mind, I don't see why the disk cannot handle this in the firmware. As long as there is enough free NAND on the drive, it can manage wear leveling and GC just fine, assuming that it gets TRIM commands.


You've just described why we have storage appliances for high performance and enterprise workloads, and why the drive to standardize always swings back around to customized software and hardware.


It belongs in the filesystem in exactly the same way that volume management belongs in the filesystem.

ZFS includes volume management, _naturally_. I say naturally, but that was a radical idea 15 years ago. Even now it's not universally accepted, but it's quite correct!


I'll add that ZFS is also quite concerned with write patterns IIRC, and thus wear leveling.

The author of the article seems to assume we should all be trusting SSD or "hybrid storage" firmware to properly handle this sort of thing for us like nice black boxes.

I think this was one of the major problems ZFS was designed to solve. To make storage hardware more simple the idea was to move a lot of this logic into the OS (especially as large RAM sizes got cheaper). It's why having a RAID controller sitting under your vdevs is advised against.


Whatever is exactly my reaction :)

You get raw NAND access on like, home routers. OpenWrt/LEDE uses JFFS2 on that.


Change Tape to Cloud and I think you're on to something.


A 2TB SSD costs around 400-500USD. That's not exactly out of the realm of possibility for a small office.


I've got 6x 4TB WD RED in a ZFS RAID10 (3x2 mirrors). I get about 500Mbytes/s read and write, on average. You start getting into 10GbE territory pretty easily with even consumer drives and ZFS. With SSDs, you'd quickly need trunked 10GbE if you want to fully saturate your network - in addition to that consider the client requirements - e.g. each workstation with 10GbE or 1GbE. You'd have to have some pretty decent requirements to necessitate a ZFS RAID10 with [NVMe] SSDs.


The benefit of SSD is not the raw sequential transfer. That is easy to max out with spinning platters. You want SSD for random access, it's significantly better at that.


The ARC (and L2ARC on an SSD) give you the best possible IOPs for read operations given the limits of the underlying hardware. Asynchronous write operations are cached to memory before being written to disk, often sequentially. For synchronous writes, you should use a ZIL on a mirrored SSD.


Sure, but it's a rather high price for just faster storage.

And like… 2 TB of cache is a bit high for SOHO NAS, and if you go full SSD for storage, you'd want two of them for a mirror and that's 1000 USD already…


And we spend what? $200 upgrading from an i3 to an i7 for like... 30% faster CPU performance?

Upgrading $100 hard drive to $400 SSD results in like, 500% improvements in storage speed. If you have any storage-related task... such as video editing, handling of large datasets and whatnot... the SSD will have a far bigger impact on your productivity than any CPU upgrade.


Probably better to spend $150 on SSD cache and $150 on RAM cache than to spend $300 on RAM and $0 on SSD, though. Or the other extreme.


A 2TB drive for 400 or 500 USD? The cheapest OK ones (Samsung EVO) I can find are 700 USD, and then we're not really talking about drives you might actually consider putting in a storage pool used to store critical data.

And if you want 6 or 7 of those puppies, in a non-diy server through a vendor which offers support, 10GE and with enough ECC and CPU to handle that IO load/throughput, you're in for a treat :)


From a protocol level, NVMe is actually designed for SSDs - just look at the amount of queues it has https://en.wikipedia.org/wiki/NVM_Express#cite_ref-ahci-nvme.... To really take advantage of this, you'd need your filesystem to be designed for many independent IO streams. ZFS metaslabs probably help, but I'm sure there is more you can do in this area.

On the other hand, I don't think many people would be hitting any limits where this matters.


Both the L2ARC and the ZIL can be put on separate vdevs (IIRC), to use faster or more reliable SSDs. L2ARC typically wants a pair of striped fast SSD vdevs while the ZIL should be on a mirror vdev of higher-reliability (SLC-like) SSDs.

Also, be sure to have 8+ GiB system RAM available at all times or performance is gonna suck.


> I think it's just a package install away on many Linux distros? Also installable on macOS — I had a ZFS USB disk I shared between Mac and FreeBSD.

Having your root on a filesystem that is provided with your kernel is not an ideal situation; update issues make your system basically unbootable.


not* provided?


> How is L2ARC not "true hybrid"?

As far as I'm aware, it's not persistent. Reboot, and your cache of recently accessed files is gone.


I've been disappointed in Linux filesystems and Intel hardware lately. There's little integrity checking in ext4, and btrfs is still having growing pains. A recent search for a svelte laptop with ECC memory yielded nothing. Sheesh, wasn't this stuff invented like 30+ years ago?

I understand Intel is segmenting reliability into higher-priced business gear, but as a developer that depends on this stuff for their livelihood the current status quo is not acceptable.

Linux should have better options since profit margins are not an impediment.


I have to agree, segmenting on ECC feels really outmoded these days. With Intel you basically need a Xeon even before you can start thinking about ECC, and that does limit your options somewhat. Luckily at least these days Intel is making mobile Xeons, so you can get ECC laptops from most major manufacturers (I checked Dell, HP and Fujitsu, and according to sibling comment Lenovo too).


It's the laptop manufacturers that are to blame. Check out the specs on this i3 with ecc support. http://ark.intel.com/products/90734/Intel-Core-i3-6100T-Proc...

Granted, that's still not a laptop CPU. But it's not a Xeon either.


The closest you will get that I know of is a Lenovo P50/P51. They have builds with Xeons with ECC.

But I remember when ECC memory dropped out of favor. I wish it had never happened.


All the new AMD chips have ECC by default, which is exciting.

We will see how their mobile chips stack up.


Haven't all "big" AMD CPUs supported ECC to some degree for as long as the memory controller has been on the CPU? The real problem on that side has been motherboard support, which has been afaik pretty much nonexistent.


All of the FX series CPUs supported ECC memory (albeit the unbuffered kind). As you said, finding a compatible motherboard was a pain. All the ones I found used the massively power hungry 990FX-A chipset. But I did get one, and my home NAS uses an FX-8350. Kind of scary getting an automated email every week or so about a bit flip that would have otherwise been silent corruption on my home desktop.


A bit flip every week is way higher than it should be.


Here is a decent build that I recently did for my home server: https://forums.freenas.org/index.php?threads/any-help-with-m...

It uses the Intel G4560 which supports ECC, and is very inexpensive. It's low power too.


Dell Precision 5520 with a Xeon supports ECC memory (up to 64G DDR4) and it's the same size as an XPS.


Ironically, that is the laptop I ended up getting, but because I didn't want a hot, power-sucking laptop I got a lower-speed i5 instead.

Not to mention that no one could tell me if you could install ECC memory or if it would be used by the firmware if it even worked at all.


Are you sure about the 64GB? As far as I see it only has two slots, and I can only configure it with 32 GB.


I'm running one right now with 64G of DDR4 ECC.

Not low power though, 1.3v each 32G DIMM.


The CPU supports it but I still don't see the option to actually configure it with ECC memory (which has been keeping me from buying one for quite a while now).


I have one with ECC memory that's running my Arch Linux install.

I had to buy the memory myself but I was looking for "better" memory anyway.


So you bought it with 8GB non-ECC, plugged in 2* 32GB ECC Dimms and it works?

I can't find those DIMMs, 16GB DIMMs are the biggest I can find.


Yep, I think I'm using memory that is not commercially available although I didn't know it before.

I can't find them online.

For reference they're Samsung branded, I can take a photo if you like; you can see the "width" of the channel in linux which tells you if you're using ECC or not.


Wow, where did you get them? Is it an engineering sample?


Are you absolutely positive that ECC is functional? What memory did you opt for? Does this void warranty?


Memory upgrades on this line do not void the warranty; however it seems my exact memory is not commercially available (yet).

I can see the full "width" in linux, which indicates that the OS can see it.


Start from here. www.dell.com/developers


Holy shit, Ubuntu is $101 cheaper than Windows! It's finally possible to escape from the "Windows is cheaper than Linux because of crapware" quagmire.


Not sure what you're getting at. Security vulnerability?


Sorry, wrong link. www.dell.com/developers. Updated the parent too.


Still don't see ECC there on the 5520.


Search for ECC on that page. Some of them do. It is possible that their SKUs change with time. Currently, 3520, 7510 and 7710 show ECC support. Dell's site is quite terrible for navigation. You may have to spend some time to find the right model.


The point was to have the compact XPS-like casing. All those other ones are bulky. So I'd have to go with replacing the RAM myself... I really don't get why this configuration option is missing here.


It's not available for that model, but the 3520, 7520, 7720, 7510 and 7710 all offer ECC RAM as an option.


See my reply to the other comment pointing this out.


ZFS, at least on Solaris, has an issue with many readers of the same file, blocking after ~31 simultaneous readers (even when there are NO writers). Ran into this with a third-party library which reads a large TTF to produce business PDF documents. The hundreds of reporting processes all slowed to a crawl when accessing the 20MB Chinese TTF for reporting because ZFS was blocking.

I can't change the code since it is third party. The only way I saw to easily fix it was, on system startup, to copy the fonts under a new subdir in /tmp (so in tmpfs, i.e. RAM, no ZFS at all there) and then softlink the dir the product was expecting to the new dir off of /tmp, eliminating the ZFS high-volume multiple-reader bottleneck.
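
The startup workaround amounts to something like this (paths are hypothetical):

  mkdir -p /tmp/fonts
  cp /opt/app/fonts/*.ttf /tmp/fonts/   # copy the TTFs into tmpfs (RAM)
  mv /opt/app/fonts /opt/app/fonts.orig
  ln -s /tmp/fonts /opt/app/fonts       # the app now reads the fonts from tmpfs, bypassing ZFS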

Never had this problem with the latest EXT filesystems on my volume groups on my Linux VMs with the same 3rd party library and same volume of throughput.


If you recreate the bug on Linux and have something like a simple reproducer, I guess the ZoL devs would be more than happy to fix it or at least understand it:

https://github.com/zfsonlinux/zfs

From reading the pull requests and issues on that repo I've got the impression that the next release, 0.7.0, will be quite a step forward, and there seems to be quite sophisticated work going on to tackle performance issues.


I suggest getting on to the OpenZFS people and seeing whether this is still a problem in illumos; and fixing it if so.

* http://www.open-zfs.org/wiki/Main_Page

* https://github.com/openzfs/openzfs

* https://wiki.illumos.org/display/illumos/ZFS


DragonFlyBSD's HAMMER [0] is another viable alternative.

Unfortunately the next generation HAMMER2 [1] filesystem's development is moving forward very slowly [2].

Nevertheless, kudos to Matt for his great work.

[0] https://www.dragonflybsd.org/hammer/

[1] https://gitweb.dragonflybsd.org/dragonfly.git/blob_plain/HEA...

[2] https://gitweb.dragonflybsd.org/dragonfly.git/history/HEAD:/...


I had a look at HAMMER as it seemed it might meet my requirements [0] but I couldn't figure out whether it supports replication or erasure coding. Don't suppose anyone here knows?

[0]: https://news.ycombinator.com/item?id=14756787


HAMMER does support replication but it doesn't have erasure coding.

If you are using a recent Linux Kernel I can suggest you to use dm-integrity [0] (optionally with dm-crypt) with your favorite filesystem. It's not erasure coding but it can help detecting silent data corruption on the disk.

[0] https://old.lwn.net/Articles/721738/


Hmm, it does seem like it should be possible to put mdadm on top of dm-integrity devices, I might try that out.


The article does not mention bcachefs as a future alternative: http://bcachefs.org/


HN: Please consider subscribing to make this alternative real.

https://www.patreon.com/bcachefs


I’m certainly an excited audience for modern alternative file system options. A quick look at the homepage of bcachefs reads:

“Not quite finished - it's safe to enable, but there's some work left related to copy GC before we can enable free space accounting based on compressed size: right now, enabling compression won't actually let you store any more data in your filesystem than if the data was uncompressed.”

What is the point of having compression enabled if you can’t store more data than you could if it was uncompressed? Shouldn’t they just say "the compression mechanism works but isn't useful yet; as of now it is just extra overhead"?


Compression can give you a net performance improvement on slow disks if the CPU time required to compress is less than the time saved by writing less data to the disk.


I had never heard of bcachefs. Thanks for the pointer!


I see they use another approach to file systems. They sound very confident. A bit too confident for my taste. Are they experienced enough in building file systems that they won't run into huge problems when they actually get into the dirty details?


The developer of bcachefs already has the experience of creating bcache, a general-purpose block-level caching system that is optimized for the characteristics of flash-based SSDs as the cache devices. The original bcache essentially serves as the proof of concept and practice implementation for all the lower-level portions of bcachefs (not that it was planned that way; they just realized while working on bcache they had implemented most of a filesystem).


We can all hope...


(Near) zero-cost snapshots and filesystem-based incremental backups are amazing. Just today I was saved by my auto snapshots [1]. Apparently I didn't `git add` a file to my feature branch and without the snapshot I wouldn't have been able to recover it after some extensive resetting and cleaning before I switched back to the feature branch. It's really comforting to have this easy to access [2] safety net available at all times.

Now that Ubuntu has ZFS built in by default, I'm seriously considering switching back, and since I too have been burned by Btrfs, I guess I'll stay with ZFS for quite some time. Still, the criticism of the blog post is fair, e.g. I was only able to get the RAM usage under control after I set hard lower and upper limits for the ARC as kernel boot parameters (`zfs.zfs_arc_max=1073741824 zfs.zfs_arc_min=536870912`).

[1] https://github.com/zfsonlinux/zfs-auto-snapshot

[2] The coolest feature is the virtual auto mount where you can access the snapshots via the magical `.zfs` directory at the root of your filesystem.
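
The recovery itself was just a copy out of that directory; a sketch with placeholder snapshot and file names (the .zfs directory lives at the root of each ZFS filesystem):

  ls /home/.zfs/snapshot/                                        # list the available auto snapshots
  cp /home/.zfs/snapshot/SNAPNAME/me/project/forgotten.c /home/me/project/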


"(Near) zero-cost snapshots and filesystem-based incremental backups are amazing."

This.

We[1] offer ZFS filesystems in the cloud[2] and one of the nicest things to explain to customers is that they don't have to think about "incrementals" or "versions" or retention in any way. They can just do a "dumb rsync" to us (mirror) and our ZFS snapshots, on their schedule, will do the rest.

In the event of a restore, the customer just browses right into "5 days ago"[3] and sees their entire offsite filesystem as it existed 5 days ago.

[1] rsync.net

[2] http://www.rsync.net/products/platform.html

[3] rsync.net accounts have a .zfs directory


  "Once you build a ZFS volume, it’s pretty much fixed for life."
The ease of growing/shrinking existing volumes and adding/removing storage is why I made the decision to go with btrfs when I rebuilt my home file server.


I hope you managed to use the fixed BTRFS, unlike me: I built a home NAS with tens of terabytes on top of BTRFS and then a year later learned that the version I had was broken and the fix is not applicable to a running cluster... So I have to buy another batch of HDDs, install a fixed BTRFS (who knows what other unknown issue awaits me in the darkness?) and copy everything over...


This advice isn't actually totally correct.

If you build a volume, you can add more vdevs to it.

My suggestion is to avoid RAIDZx and go with mirrors. To add more storage, just add another pair of drives and add it to your pool.

Here is what my main `tank` looks like:

  pool: tank
  state: ONLINE
  scan: scrub repaired 0B in 4h59m with 0 errors on Tue Jul  4 12:47:02 2017
  config:

  NAME                            STATE     READ WRITE CKSUM
  tank                            ONLINE       0     0     0
    mirror-0                      ONLINE       0     0     0
      wwn-0x50014ee20eba1695      ONLINE       0     0     0
      wwn-0x50014ee20eba337e      ONLINE       0     0     0
    mirror-1                      ONLINE       0     0     0
      wwn-0x50014ee2640efb3a      ONLINE       0     0     0
      wwn-0x50014ee2b964ef3f      ONLINE       0     0     0
    mirror-2                      ONLINE       0     0     0
      wwn-0x50014ee2b964f2d5      ONLINE       0     0     0
      wwn-0x50014ee2b964f4f4      ONLINE       0     0     0
  cache
    wwn-0x5002538d704e9ff1-part4  ONLINE       0     0     0

  errors: No known data errors

You can simply add another mirrored pair and get more storage. Or, you can update an existing mirror by adding new drives, then removing the old ones. Mirroring has better performance and IOPS than striped RAID too. If you feel that mirrors might be less reliable than RAIDZx, I'd suggest you read up about it - generally mirrors are more reliable (and you could mirror 3 drives if you liked, getting more throughput and IOPS - or have a hot spare).
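
Growing that pool by another mirrored pair is a single command, and swapping a mirror's disks for bigger ones is just attach/detach; the device ids below are placeholders:

  zpool add tank mirror /dev/disk/by-id/wwn-NEW1 /dev/disk/by-id/wwn-NEW2   # new mirror vdev
  zpool attach tank wwn-0x50014ee20eba1695 /dev/disk/by-id/wwn-BIG1         # grow mirror-0 in place
  # (then detach the old, smaller disk once the resilver finishes)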


As you add mirrored pairs your data isn't evenly distributed across all stripes is it?


The simple answer to your question is no.

The more complex answer is that over time, as you write more data, ZFS will re-balance the distribution of data over the available vdevs.


I think the concern is that if you have two disks die it is not a big deal, unless those two disks happen to be in the same pair[1], in which case most likely all data is lost.

[1] If you're not careful, this is actually very likely to happen. If you purchase 2 disks at the same time, and they are under the same usage patterns (which they will be when working in a pair) and at the same temperature, then they are very likely to die at the same time.


Btrfs is a mess too, especially with its metadata system that you have to manually resize constantly, data-loss bugs and fsck problems. I've had nothing but problems with it and will never use it on a production system.

LVM on the other hand makes it easy to do snapshotting, RAID, volume resizing, adding more drives, etc - at the cost of some performance in some cases. For the most part, it's the best trade off available if expandability is a requirement and you can't get away with simple mdraid.


btrfs has all those features as well and my original setup was mdadm raid 5 + LVM + ext4.

Regardless, I really haven't had any operational problems with btrfs for my 3 TB or so of data and when it has managed to get wedged because it couldn't allocate more space to the metadata pool during the initial import of data from my old array I fixed it with a simple rebalance command.

A cron job set up to run a minor rebalance weekly helps ensure you never run into that situation in practice, and I've not lost any data, so for now I'm comfortable using btrfs as my primary filesystem.

I am a little concerned about the longevity of btrfs in general though because it hasn't been receiving a lot of development work lately.


>its metadata system that you have to manually resize constantly

?

Doesn't ring a bell.

A lot of problems with Btrfs sound like hardware problems of one sort or another. If it's a legit bug, the only way such things get fixed is to report it to the developers <linux-btrfs@vger.kernel.org> with complete logs and system information. Did you?

The fsck is definitely hit or miss, but my perspective is the emphasis is on fixing bugs that obviate file system problems in the first place. The reality is an offline fsck for large file systems is just not scalable, so the best bang for the buck is bug squashing.

And there's a tons of that happening.

For the initial 4.12 pull (now done, and probably had a few dozen changes during rc's) 40 files changed, 1629 insertions(+), 834 deletions(-) https://lkml.org/lkml/2017/5/9/510

For the initial 4.13 pull 47 files changed, 1707 insertions(+), 1400 deletions(-) https://lkml.org/lkml/2017/7/4/436


He may be talking about the need to rebalance ... I run a btrfs balance every month to make sure I don't run out of space again.

It happened to me while on vacation -- I only had satellite internet and had to fix it over ssh at single characters per second.

It might have to do with a bug I saw mentioned on the list that referred to the free space map getting corrupted (but easily fixed with a -oremount,clear_cache). So I also run that monthly as well.
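
For reference, that monthly maintenance boils down to a tiny cron script along these lines (mountpoint and usage thresholds are just examples):

  #!/bin/sh
  # e.g. /etc/cron.monthly/btrfs-maint
  btrfs balance start -dusage=50 -musage=50 /srv/data   # rebalance only chunks that are <50% used
  mount -o remount,clear_cache /srv/data                # clear and rebuild the free space cache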

I've been too afraid to remove the cron job even though I moved my CentOS7 to the "mainline" kernel rpms from http://elrepo.org

Also, I only trust RAID1 + Crashplan.


I'm also using Btrfs for my home file server, but Btrfs's current RAID5/6 status leaves a lot to be desired. I'm running Btrfs on top of mdadm right now because of that.


Should be "mostly" resolved from kernel 4.12 : https://www.phoronix.com/scan.php?page=news_item&px=Linux-4....


It used to be "usable for most purposes" at some point already, before those scary data corruption bugs were discovered.

https://btrfs.wiki.kernel.org/index.php?title=RAID56&diff=30...


My last file server used mdadm raid 5 across 4 drives and I just couldn't take the write performance loss anymore so I switched to btrfs raid1(ish) block duplication.


This is only difficult with ZFS if you care about performance. If you are a typical home file server user, you can add vdevs in a rather ad-hoc fashion as they fill and it works pretty well.

There is a huge performance penalty of course, as the majority of new data will reside on the newest vdevs - but it generally works.

I do agree this is one of the largest drawbacks of ZFS, but very few filesystems get it right.


Yeah, as mentioned in the post, you can indeed just add a vdev to the pool whenever you want. But the risk, apart from performance, is data protection. If you add one disk, your whole pool is at risk. So of course you have to add two or more disks as a mirror or RAIDZ, but lots of home users don't want to do this. They want that Drobo thing where they throw another log on the fire and everything gets rebalanced. It's sad that ZFS can't do this, and that it can't "upgrade" RAID-Z1 to RAID-Z2. It's hard to do, but would be so much nicer.


> you can add vdevs in a rather ad-hoc fashion as they fill and it works pretty well

As long as they're expensive mirror vdevs, right? My impression is that home users want the efficiency of RAID-6 and they want incremental expansion (regardless of whether this combination is "good for them"). ZFS can't do that.


You should always use mirrors! http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-... "don’t be greedy. 50% storage efficiency is plenty". Mirrors perform better, perform MUCH better when degraded, rebuild MUCH faster.


Mirrors are also less safe. [0]

For a 6x8TB array and assuming an (optimistic) 10^-16 URE, you get a 3.5% failure rate for a RAID5 array, 0.7% for a RAID10 array and a 1.06e-08% failure rate for RAID6.

Why be greedy for all that performance? Most home-grade NAS or even some business-grade NAS isn't used for performance-sensitive operations, more like Word documents and family pictures, stuff you don't want to lose.

I'd rather take safety over performance here.

[0]: https://redd.it/6i4n4f


RAID is for convenience and performance. It is not and cannot replace backup. In any case, if you want safety, RAID1 is the way forward, and not RAID6. An 8-drive RAID6 with 8TB WD Red NAS drives (URE <1 in 10^14) is virtually guaranteed to have at least one read error during a rebuild (if the URE rate is true, which I believe it is not).

Regardless, the determining factor here is how much data do you need to read in the case of a failure to rebuild the array. RAID1 wins every single time because you cannot read less than the single drive you need to replace.
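
A quick back-of-the-envelope check of the "virtually guaranteed" claim, taking the quoted 10^-14 per-bit-read spec at face value (which, as noted, is probably pessimistic):

  # rebuilding an 8-drive RAID6 after one failure reads the 7 surviving 8TB drives
  python3 -c "from math import exp; bits = 7 * 8e12 * 8; print(1 - exp(-bits * 1e-14))"
  # ~0.99, i.e. at least one URE during the rebuild is all but certain at that spec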


The URE rate is most likely much lower.

However, the chance of failure for a RAID6 of 100x10TB disks is less than 0.482% after 1'000'000 rebuilds.

RAID1 is space-inefficient, a 100x10TB RAID array might never fail but it has only 10TB of storage space.

A RAID10 Array has a 14% failure chance for just 4x2TB disks using 10^14 failure rates.

RAID1 and RAID10 are definitely not the way forward, it is less secure, something that should be immediately apparent if you read the link in my previous comment.

A 10-disk RAID6 with 10TB disks is more reliable than a RAID10 by multiple orders of magnitude and more space efficient than a simple RAID1.


Your math is completely off. A 100x10TB RAID6 with a failed disk needs to read 990TB of data to rebuild. With a URE of 1 in 10^14 you will see 79.2 URE events on average during a single rebuild if the URE rate is correct (again, I don't believe it is) - this is the reason no serious engineer recommends a RAID6 for large arrays.

In the case of a RAID1, no one uses 100 mirrored drives. You use RAID10, and in the case of a failed disk, you must read 10TB to recover. With the same URE, we'd see on average 8 UREs for every 10 rebuilds, or around 2 orders of magnitude less failure rate compared to the RAID6 example.


Your logic is sadly incorrect.

During a RAID6 rebuild, a URE is non-critical, as the array can recover the data with one lost disk and a URE on any other disk during the stripe rebuild.

The only critical error would be a URE on two disks on the same stripe; 80 UREs during a 990TB rebuild have an amazingly low chance of hitting the same stripe on two separate disks.

In the case of the RAID10, you get 8 UREs over 10 rebuilds, which aren't recoverable unless you mirror across 3 disks. So you'll corrupt data.

edit: URE of 10^14 is what most vendors specify for consumer harddrives, 10^16 is closer to what people encounter in the real world but 10^14 is considered the worst case URE rate.


Good point about the URE on a RAID6, but that still doesn't make it superior. The strain of a rebuild has been known to kill many arrays, both RAID5 and RAID6.

A URE does not have to corrupt data if you use a proper filesystem with checksumming such as ZFS.

When a disk fails, a RAID10 is simply in a far better position, as it only has to read a single disk, and it doesn't have any complicated striping to worry about. Just clone a disk.


> A URE does not have to corrupt data if you use a proper filesystem with checksumming such as ZFS.

No, but afaik there is no way to recover data once ZFS has declared it corrupted (i.e., no parity).

> The strain of a rebuild has been known to kill many arrays, both RAID5 and RAID6.

I haven't actually encountered that yet. Despite that, a RAID6 can lose a disk, so as long as you don't encounter further UREs after losing another disk, it's fine.

If you're worried about that, go for RAIDZ3 or equivalent. With something like SnapRAID you can even have the equivalent of a RAIDZ6, losing 6 disks without losing data. The chances of that happening are relatively low.

> When a disk fails, a RAID10 is simply in a far better position as it only has to read a single disk

A RAID 10 is in no position to recover from URE's once a disk has failed unless you reduce your space efficiency to 33%.

I personally favor not corrupting data over rebuild speeds.

Striping might be complicated but that doesn't make it worse.

It might be acceptable to lose a music file, but once the family image collection gets corrupted or even lost on ZFS because a disk in a RAID1 encountered a URE, it's personal.

I'd rather live with the thought that even if a disk has a URE, the others can cover for it. Even during a rebuild.


While it's true you can't expand a vdev, you can always add another vdev to a pool at any redundancy level you desire to expand capacity. For example, you could add a trio of drives as a raidz vdev to a pool with an existing mirror vdev (`zpool add mypool raidz dev1 dev2 dev3`). However, the drawback is that the expanded pool won't be "balanced" unless its datasets are rewritten.


Home users would rather just add a single drive instead of having to worry about adding entire RAID setups.


Exactly that. I can essentially add/remove drives on a whim (for any reasonable definition of "on a whim") using btrfs and run a rebalance command afterward and I'm done.

That, of course, is really not a concern for enterprise use cases.


What I'd like to know is, can this issue be resolved in a future version of ZFS or is it too ingrained into the design of ZFS?


For years the fix ("block pointer rewrite" feature) was promised as coming eventually, but that effort was abandoned. BTRFS will reach ZFS levels of stability before ZFS reaches BTRFS levels of flexibility.


It would be expensive -- you'd have to "rebalance" the data by copying it to the new volume / away from the old volume, which would take hours or days.


People who want this feature are mostly frugal. They want it so they can upgrade by just buying a single new disk. That is a rare enough use case that the performance penalty would be worth it.


Frugal enough to shut off their computer when they're not using it (to "save power"), so the rebalancing won't have time to complete in the background?


Isn't the fact that Drobo is still around and doing well a testament to people being fine with this tradeoff?


Correct me if I'm wrong, but I can't add vdevs to a RAIDZ (RAID5/6ish) system in ZFS, which are probably the most common configurations for home NAS systems.

I'd love to run ZFS, but I can't because I need to be able to add more drives as I buy them.


I've been reading more into it and it seems like back in 2008 they came up with an algorithm to do this: https://blogs.oracle.com/ahl/expand-o-matic-raid-z

But it doesn't look like there's been any movement on an implementation and it seems like it's high effort and mostly home users who want this, not enterprises who might be willing to pay for.

Ah well, guess I'll stick with mdraid for now.


Correct, the number of devices in a raidz1/2/3 vdev cannot change. However, you can replace all of the drives one at a time with higher-capacity drives and most ZFS installs will grow your vdev.
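
The replace-in-place upgrade is roughly the following (device names are placeholders):

  zpool set autoexpand=on tank                 # let the vdev grow once every member is bigger
  zpool replace tank old-disk new-bigger-disk
  # wait for the resilver to finish (zpool status), then repeat for each remaining drive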

You could also add another raidz vdev if you have the space in the server, eventually they should level out depending on workload.

"Best" solution, read most expensive, would be to copy the data over to a new server you've configured for the new load/capacity.


Adding 3 drives at a time is expensive and a non-option for a home NAS. As is "just copy it to a new server".

Both have high investment costs for something a plain mdRAID, snapRAID or LVM RAID can achieve far simpler with better results.


I add two at a time without much problem, mind you I have a rather extreme setup with an external 12-bay enclosure but I feel most people building a NAS instead of buying an off-the-shelf system from Synology/Drobo aren't looking to cut costs or corners.

A 4TB Seagate IronWolf drive costs $129 off Amazon, buying two to add as a new mirrored vdev to my TrueNAS box isn't outrageous.


You're then limited to only mirroring; you can't do RAID6 or similar on a massive array where you're only giving up 2 of 10 drives' worth of space but can survive a 2-disk failure. Seems like a lot of wasted disks when the only thing I'm dependent on the NAS for is not losing my TV collection. Right now I know that if I go out and buy 2 more 4TB disks I get 8TB more in my NAS, at the price of an only slightly increased risk of having more than 2 drives fail at once. That's probably my favorite feature of RAID6 for a home NAS.

I actually went custom because those off-the-shelf boxes are either very expensive or have weak CPUs so can't be used for video transcoding very well - it's cheaper to build it yourself, much cheaper if you already have old hardware to dedicate to the task.


Different use cases I suppose; my FreeNAS box stores my video collection and all the usual stuff, but I've also got all the VMs that run my home network stored on it - performance + resilver times are a lot more important to me than storage efficiency.


And your network-running VMs have high disk IO requirements?


All of my virtual disks are hosted off my FreeNAS, I've got a direct 10GbE link between it and my oVirt host - raw throughput isn't so much the issue most of the time as IOPS are, I've got a local gitlab instance, OpenShift, some PostgreSQL databases, etc. and they like to hammer the crap out of my storage when in use.

Having an L2ARC helps out quite a bit, but only having 32GB of memory and wanting to keep most of it for the L1ARC means I still hit my spinning disks regularly (and mirrored vdev's help read IOPS tremendously in this case).


Reducing the cost of a custom system isn't irrational.

My own personal budget is very limited, buying two IronWolf HDDs in this case is easily a good chunk of my monthly income. If I can build a NAS that can expand with single drives as needed, it's more cost effective for me.

And I imagine a lot of others have the same problem.

In the end, buying 2 drives when a single drive could have solved the problem equally well (expanding your space by 4TB) is wasting money. Period.


This might be somewhat off topic but I'm desperate. I've been looking for a way to store files:

- Using parity rather than mirroring. I'm happy to deal with some loss of IOPS in exchange for extra usable storage.

- That deals with bitrot.

- That I can migrate to without somehow moving all of my files somewhere first (i.e. supports addition/removal of disks).

- Is stable (doesn't frequently crash or lose data)

- Is free or has transparent pricing (not "Contact Sales").

- Ideally, supports arbitrary stripe width (i.e. 2 blocks data + 1 block parity on a 6 disk array)

Unfortunately it doesn't appear that a solution for this exists:

- ZFS doesn't support addition of disks unless you're happy to put a RAID0 on top of your RAID5/6 and it doesn't support removal of disks at all when parity is involved. It is possible to migrate by putting giant sparse files on the existing storage, filling the filesystem, removing a sparse file, removing a disk from the original FS and "replacing" the sparse file with the actual disk but this is somewhat risky.

- BTRFS has critical bugs and has been unstable even with my RAID1 filesystem.

- Ceph mostly works but I always seem to run into bugs that nobody else sees.

- I couldn't even figure out how to get GlusterFS to create a volume.

- MDADM/hardware RAID don't deal with bitrot.

- Minio has hard coded N/2 data N/2 parity erasure coding, which destroys IOPS and drastically reduces capacity in exchange for an obscene level of resiliency I don't need.

- FlexRAID either isn't realtime or doesn't deal with bitrot depending which version you choose.

- Windows storage spaces are slow as a dog (4 disks = 25MB/s write).

- QuoByte, the successor to XtreemFS has erasure coding but has "Contact Us" pricing and trial.

- Openstack Swift is complex as hell.

- BcacheFS seems extremely promising but it's still in development and EC isn't available yet.

I'm currently down to fixing bugs in Ceph, modifying Minio, evaluating Tahoe-LAFS and EMC ScaleIO or building my own solution.


You can probably achieve what you're looking for by stacking a few filesystems. For example, you could create a separate ZFS pool/vdev with a single full-disk zvol on each disk. Then use mdadm to create a RAID array of the zvols. Then put ext4 (or whatever) on the mdadm array.
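
A rough sketch of that stacking, with made-up disk names and zvol sizes (ZFS provides the checksumming, mdadm the parity, ext4 the filesystem):

  for d in sda sdb sdc sdd; do
      zpool create "p_$d" "/dev/$d"      # one single-disk pool per drive
      zfs create -V 3500G "p_$d/vol"     # one (almost) full-disk zvol per pool
  done
  mdadm --create /dev/md0 --level=6 --raid-devices=4 \
      /dev/zvol/p_sda/vol /dev/zvol/p_sdb/vol /dev/zvol/p_sdc/vol /dev/zvol/p_sdd/vol
  mkfs.ext4 /dev/md0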

I've done something similar for the purpose of getting FDE with ZFS in linux. It can be a little finicky, but it's definitely workable.

One ZFS-specific caveat (which may conflict with your desire to get high storage efficiency): you may need to prevent your ZFS pools from filling up too much [1]. You can either enable discard/TRIM on the whole stack, so the top-level FS (e.g. ext4) can let ZFS know when a block is actually free, or alternatively just limit your zvols to 85% (for example) of their respective pools. The latter is my preference, because there was originally a bug with discard in ZFS and it's not immediately clear if it's totally fixed (although my fstrim tests seemed to work out fine).

[1] https://www.reddit.com/r/zfs/comments/3vtur4/what_exactly_ha...


Hmm, that actually sounds workable. I could even format the mdadm device as ZFS too if I really wanted. I am somewhat worried about performance, have you had any issues with that?


I haven't had any issues with performance, but then again my requirement was just "reasonable performance".

I ran some quick benchmarks (data below). Obviously this is far from rigorous, but maybe it'll be useful. In previous tests I found that volblocksize=128K was optimal for my stack -- which is why the last benchmarks use that setting.

Every additional ZFS filesystem in the stack may reduce storage efficiency (minimum free space requirements [1]; metadata & checksum overhead [2][3]) -- that's why I used ext4 as the top layer instead of another ZFS.

[1] (as mentioned before) https://www.reddit.com/r/zfs/comments/3vtur4/what_exactly_ha...

[2] https://news.ycombinator.com/item?id=14756360

[3] https://forums.freenas.org/index.php?threads/what-is-the-exa...

  Test setup:
   debian stable
   kernel 4.9.0-3-amd64
   zfs 0.6.5.9-5
   ZFS "pool": mirror with 2x 7200rpm drives

  Benchmark command:
   for i in `seq 1 10`; do sync; dd if=/dev/zero of=DEST bs=1M count=1024 conv=fdatasync; done

  zfs mirror -> dataset
   Data (MB/s): 125,115,104,135,148,170,135,151,118,119
   Mean (MB/s): 132.0
   Std.dev.: 19.9

  zfs mirror -> zvol (volblocksize=8K [default])
   Data (MB/s): 150,115,127,125,122,118,105,118,124,128
   Mean (MB/s): 123.2
   Std.dev.: 11.6

  zfs mirror -> zvol (volblocksize=128K)
   Data (MB/s): 68.5,112,115,114,94.3,85.1,83.1,98.4,120,108
   Mean (MB/s): 99.8
   Std.dev.: 16.9

  zfs mirror -> zvol (volblocksize=128K) -> luks -> ext4  (my stack)
   Data (MB/s): 130,94.4,109,139,138,125,94.9,124,134,133
   Mean (MB/s): 122.1
   Std.dev.: 16.8
edit: formatting


Can you please elaborate on the advantages of such a configuration?


The advantage is that you can get all the features you want, even though they aren't all available in one filesystem.

In my case, I wanted a reliable filesystem, RAID1/mirror support, block-level checksumming, and full-disk encryption. No filesystem provides these on linux right now. My solution was therefore to use ZFS to provide a mirrored, checksummed, reliable volume -- onto which I put a standard LUKS-encrypted ext4 filesystem. In the past I had tried the opposite (LUKS on the bare drives, then a ZFS mirror of the 2 decrypted volumes), but it was kinda annoying to manage, and I don't really need the other ZFS features (like snapshotting).

In the grandparent post, the requirement was the ability to expand the RAID volume without rebuilding (something that ZFS doesn't offer), plus checksumming and reliability (which ZFS does offer). So one option would be to use mdadm to manage the RAID array, and then put ZFS on the resultant volume in order to get checksumming.
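
That direction might look roughly like this (a sketch with placeholder devices; note that a single-device pool on top of md can detect corruption via checksums, but can only self-heal data if you set copies=2, since ZFS itself has no second copy to repair from):

  # mdadm owns the RAID layer, so it can be grown later
  mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[abcd]
  zpool create tank /dev/md0

  # later: add a disk, grow the array, then let the pool use the new space
  mdadm --add /dev/md0 /dev/sde
  mdadm --grow /dev/md0 --raid-devices=5
  zpool online -e tank /dev/md0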

The disadvantages are: extra complexity; extra overhead; more potential points of failure; more management hassle; etc. As soon as encryption for ZFSonlinux is stable, I'll be very happy to drop this filesystem stacking in favor of that!


There kinda isn't such a thing at this point. I sure wish there was! And this is the point behind the article: Wouldn't it be awesome if ZFS had continued developing as the super filesystem it looked like last decade? It would do all this and more! Frankly if ZFS supported re-balancing ("re-RAID-ing") and hybrid pools, it would be pretty much everything we all need. I do hope ZFS or Btrfs or something gets there.


It's so close though. BTRFS and Ceph both have what I want, they're just unstable. The BcacheFS dev has told me he's close but not quite there yet.

Hopefully one of the other solutions will work, otherwise I'll just have to build it.


bcachefs really does seem as if it will be our holy grail, we have very similar needs.

They have a patreon if you wish to contribute that way.


I already contribute and help out with testing and the occasional docs on IRC :)


XFS has gained (some) checksumming and CoW support.

At this rate XFS will end up evolving to add all the features btrfs promised before the latter makes them stable.


> ZFS doesn't support addition of disks unless you're happy to put a RAID0 on top of your RAID5/6 and it doesn't support removal of disks at all when parity is involved. It is possible to migrate by putting giant sparse files on the existing storage, filling the filesystem, removing a sparse file, removing a disk from the original FS and "replacing" the sparse file with the actual disk but this is somewhat risky.

It may be "RAID0 on top of your RAID5/6" but it's all integrated and works smoothly. What I do is have two raidz2 vdevs of 4 disks each, one twice the size of the other, and alternate which one I upgrade (i.e. I started with 4x250gb disks and 4x500gb disks, a few years later replaced the 250gb ones with 1tb ones, then the 500gb ones with 2tb ones, and most recently the 1tb ones with 4tb ones). But yeah I am now stuck with at least 8 disks and if I wanted to migrate off them I'd have to do so all in one go.


> It may be "RAID0 on top of your RAID5/6" but it's all integrated and works smoothly.

I know that it does but I don't like the way you lose your entire pool if a single vdev fails. I'd have less of a problem with it if a pool could distribute data such that a vdev failure results in the loss of only what was on that vdev.


Shrug. I figure losing a random half of my files is pretty much as bad as losing all of them (if I wanted to make an explicit split into two halves I could use two separate pools).


Re: contact sales, I totally hear what you're saying. It's not always a bad idea to talk to a sales person, and a good one can steer you towards the options you'd actually want. But yeah, I generally don't want to talk to a salesperson either.


Yes, but would it kill them to just put pricing there? Nowadays you can get the list price of sending a rocket into space [1], but not how much some software is going to cost you.

[1] http://www.spacex.com/about/capabilities in case you need one


Have you seen Quantcast File System? http://quantcast.github.io/qfs/


I hadn't and it looks good but it'll only allow 3 recovery stripes or none, which isn't optimal for me.


Have you considered snapraid? It's offline though, but seems to tick all your boxes otherwise.


Yes, I've looked into SnapRAID but I'm not a fan of offline and IIRC I didn't have enough memory to run it last time I tried it out.


For Storage Spaces, try fixed size; it's faster, and you can enlarge the volumes later.


I was using fixed size, it didn't help. Read speeds were okay but write always capped out at 25MB/s.


Illumos has a way to expand pools, FYI. IDK if that's in OpenZFS yet.

It works thusly: ZFS creates a vdev inside the new larger vdev, then moves all the data from the old vdev to the new vdev, then when all these moves are done the nested vdevs are enlarged.

What should originally have happened is this: ZFS should have been closer to a pure CAS FS. I.e., physical block addresses should never have been part of the ZFS Merkle hash tree, thus allowing physical addresses to change without having to rewrite every block from the root down.

Now, the question then becomes "how do you get the physical address of a block given just its hash?". And the answer is simple: you store the physical addresses near the logical (CAS) block pointers, and you scribble over those if you move a block. To move a block you'd first write a new copy at the new location, then overwrite the previous "cached" address. This would require some machinery to recover from failures to overwrite cached addresses: a table of in-progress moves, and even a forwarding entry format to write into the moved block's old location. A forwarding entry format would have a checksum, naturally, and would link back into the in-progress-move / move-history table.

During a move (e.g., after a crash during a move) one can recover in several ways: you can go use the in-progress-moves table as journal to replay, or you can simply deref block addresses as usual and on checksum mismatch check if you read a forwarding entry or else check the in-progress-moves table.

For example, an indirect block should be not an array of zfs_blkptr_t but two arrays, one of logical block pointers (just a checksum and misc metadata), and one of physical locations corresponding to blocks referenced by the first array entries. When computing the checksum of an indirect block, only the array of logical block pointers would be checksummed, thus the Merkle hash tree would never bind physical addresses. The same would apply to znodes, since they contain some block pointers, which would then have three parts: non-blockpointer metadata, an array of logical block pointers, and an array of physical block pointers.

The main issue with such a design now is that it's much too hard to retrofit it into ZFS. It would have to be a new filesystem.


> Illumos has a way to expand pools ... ZFS creates a vdev inside the new larger vdev

Huh?

> IDK if that's in OpenZFS yet.

The openzfs tree (on github) is virtually identical to illumos-gate (on github).

> physical block addresses should never have been part of the ZFS Merkle hash tree, thus allowing physical addresses to change without having to rewrite every block from the root down.

mahrens deals with this (and block pointer rewriting) here:

https://www.youtube.com/watch?v=G2vIdPmsnTI#t=44m53s

Even with SSDs, IOPS are precious. On rotating media, burning track-to-track seeks on reading and updating a large hash table is a bad plan (cf. the deduplication table).


Btrfs might just become “the ZFS of Linux” but development has faltered lately, with a scary data loss bug derailing RAID 5 and 6 last year and not much heard since.

It was not a data loss bug per se. It was Btrfs corrupting parity during a scrub when encountering already-corrupted (non-Btrfs-caused) data: a data strip is corrupt somehow, a scrub is started, Btrfs detects the corrupt data and fixes it through reconstruction with good parity, but then sometimes computes a new, wrong parity strip and writes it to disk. It's a bad bug, but you're still definitely better off than you were with corrupt data. Also, this bug is fixed in kernel 4.12.

https://lkml.org/lkml/2017/5/9/510

Update, minor quibbles:

> lacking in Btrfs is support for flash

Btrfs has such support and optimizations for flash. The gotcha, though, if you keep up with Btrfs development, is that there have been changes in FTL behavior, and it's an open question whether these optimizations are effective for today's flash, including NVMe. As for hybrid storage, that's the realm of bcache and dm-cache (managed by LVM), which should work with Btrfs as with any other Linux file system.

> ReFS uses B+ trees (similar to Btrfs)

XFS uses B+ trees; Btrfs uses B-trees.


The thing I'm struggling with is 4K sector support. It's horribly inefficient with ZFS. RAIDZ2 wastes a ton of space when the pool is made with ashift=12. And everybody knows 512e on AF disks is horribly slow... so ZFS is either very slow or wastes 10% of total space. Or both (ZVOL :D)

According to some bug reports, nobody has touched this since 2011...


Can you elaborate on how ZFS wastes 10% of total space?

I recently set up a ZFS volume using 12x4TB drives using RAID-Z2, so I expected 40TB of usable space, or ~36.3TiB. However, I only see 32TiB of usable space on the volume. I always wondered why that was so, never figured it out..


There's a ton of sites if you google ashift=12; http://louwrentius.com/zfs-performance-and-capacity-impact-o... or https://github.com/zfsonlinux/zfs/issues/548 for instance.

Basically, ashift=12 increases the minimum ZFS block size to 4K. Metadata uses full blocks that would be 512 bytes with ashift=9 but are now 4K (due to ashift=12), so each block that is not filled entirely wastes at least 3.5K more than it would with normal 512-byte blocks.
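
If you want to see what your pool is actually using, a quick sketch (pool name is a placeholder):

  # ashift per vdev: 9 means 512-byte sectors, 12 means 4K sectors
  zdb -C tank | grep ashift

  # compare raw pool capacity/allocation with what the dataset layer reports
  zpool list tank
  zfs list -o name,used,avail,refer -r tank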


For new data, you can get back most of that lost space by setting the recordsize to 1M. You won't necessarily see the improvement in your df and zfs list commands, as they assume smaller max recordsize, but your overhead should drop significantly.
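
For example (a sketch; the dataset name is a placeholder, and the larger recordsize only applies to data written after the change):

  # recordsize above 128K requires the large_blocks pool feature
  zfs set recordsize=1M tank/data
  zfs get recordsize tank/data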


I didn't even consider this. Thanks for the explanation! Makes tons of sense!

Incidentally, many modern filesystems (including NTFS) store very small files in the FAT rather than taking up a whole block for this very reason!


cringes at "in the FAT" instead of "MFT"....

ZFS has this same feature, however, as long as feature@embedded_data=enabled ;)


I guess I'm an old storage guy. I call it the FAT on everything! :-)


Yeah, different filesystems call things differently: FAT, MFT, inodes.

Best is to just call it metadata :)


Thanks!


He is talking about "best level of data protection in a small office/home office (SOHO) environment".

Trying to do this with FS features is misguided.

You need to have backups, and have regular practice in restoring from backups.

Some organizations need fancy filesystems in addition to backups, because they want to have high availability that will bridge storage failures. But that has a high cost in complexity, you should only consider it if you have IT/sysadmin staff and the risk management says it's worth the investment in cognitive opportunity cost, IT infrastructure complexity and time spent.


Negative, you won't know you _need_ to use the backups without these FS features, and by the time you finally do, you could have rotated through them.


The article didn't mention backups at all. If a SOHO environment can afford only either backups or a ZFS storage system, choosing backups leaves much less residual risk on the table.

Yes, there is still a risk that corrupted data may end up in backups, but that's true even with ZFS. Ideally you want end-to-end integrity checking and verification, that means application layer and should also be done for backups. But like with all risk management, there are diminishing returns...


That is the most contrived nonsense I've heard in a long time. You can't not afford to use ZFS; it works fine on a single disk, and at least you'd know your data had mutated.


Does your strong disagreement mean you think for most SOHO environments it's better to have ZFS without backups than backups without ZFS?


Knowing your data has rotted doesn't bring it back.


But it does allow you to treat it with suspicion and human judgement. If it's an album, you go re-rip or download it. If it's medical data, you don't use it.


And if it's that one holiday album of images, it's down the toilet forever.


The filesystem as basic infrastructure has to be robust and fuss free. The complex stuff is going to be built on top of that.

After years of btrfs I realized that while all the features around snapshotting, send/receive, etc. are great, the cost in performance and other issues is too high.

And using plain old ext4 is more often than not the best compromise, so you can just forget about the FS and focus on higher layers.


I came to say similar things. After having had problems with ZFS (on Solaris) and btrfs, but never on ext4 or other "simple" filesystems, I am wondering whether I need and want all that complexity in my filesystem. And no, I don't.

Snapshots are nice, sure. But I'd rather do that on top of my filesystem (they're called backups) and leave the filesystem lean, simple, fast AND reliable. Every filesystem may have its own problems, but I feel that the "attack surface" of my filesystem should be as small as possible. Would I rather have a bug in my filesystem's snapshot implementation, or in my backup tool?

IMO, the FS should be rockstable and lean and not "cool and fancy". I can do fancy on top of rockstable.


The nice thing about btrfs vs. ZFS is that you can just use it like a normal filesystem, ignoring all the advanced features, and still get the benefit of checksumming (plus duplicated metadata by default on spinning disks) and compression.


The problem with btrfs and CoW in general is poor performance for databases and overall slower performance than ext4, in some cases significantly so. ZFS has high memory requirements.

If your use case mainly revolves around the benefits of snapshots then it definitely makes sense.


How do you do snapshots on ext4? Don't large datasets take forever?


I don't think ext4 has native snapshots. You can use lvm-thin for snapshots. Note that lvm-thin is different from regular lvm snapshots, which are known to be inefficient. RHEL has decent support for this feature, and will even let you set it up in the GUI during installation.
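
A minimal sketch of lvm-thin, assuming an existing volume group "vg0" (all names and sizes are made up):

  # create a thin pool inside the VG, then a thin volume inside the pool
  lvcreate --type thin-pool -L 100G -n thinpool vg0
  lvcreate --thin -V 50G -n data vg0/thinpool
  mkfs.ext4 /dev/vg0/data

  # thin snapshots are copy-on-write and need no preallocated size
  lvcreate --snapshot --name data-snap vg0/data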


That snapshots the entire volume, though, right? I don't want to waste time on snapshotting my movies when all I want is my family photos...

It seems that the GP's point that the fs should just store files is rather debunked, as snapshots are an extremely useful feature for a filesystem.


Yes, snapshots are for volumes. There's also a yum plugin which will do before-and-after snapshots. RHEL and SUSE have this feature, which works with lvm-thin as well as btrfs.


lvm-thin still snapshots the entire volume, but in a more copy-on-write-like way.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...


On my current laptop, I'm seeing a 20% reduction in disk usage relative to the filesystem size because of ZFS's built-in compression.


20%? That's weak :P I have 3.32x refcompressratio on my /home partition in my dev VM (using gzip-7 here).


You might look into using xz (LZMA) or zstd in place of gzip. Gzip offers pretty poor compression-per-CPU-time performance compared to these newer options.

https://clearlinux.org/blogs/linux-os-data-compression-optio...


ZFS doesn't support either yet: http://open-zfs.org/wiki/Performance_tuning

Though it looks like zstd is coming, which is exciting: https://reviews.freebsd.org/D11124


ZFS doesn't support either of those yet, unfortunately. I'd love for zstd to be available given its benefits in speed and compression ratio.


https://reviews.freebsd.org/D11124 :)

Or use lz4 while you wait.
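
For example (a sketch; the dataset name is a placeholder):

  zfs set compression=lz4 tank/home
  # see how much you're actually saving
  zfs get compression,compressratio,refcompressratio tank/home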


Another future alternative, TFS: https://github.com/redox-os/tfs


> Many remain skeptical of deduplication, which hogs expensive RAM in the best-case scenario. And I do mean expensive: Pretty much every ZFS FAQ flatly declares that ECC RAM is a must-have and 8 GB is the bare minimum. In my own experience with FreeNAS, 32 GB is a nice amount for an active small ZFS server, and this costs $200-$300 even at today’s prices.

I use nas4free with much less ram…


The massive amounts of RAM recommendation is if you need to do deduplication. Are you doing that? If not then you don't need a lot of RAM.

Between the low cost of storage, and alternative solutions for deduplicating data I personally don't use the built-in deduplication functionality of ZFS for my zpools. Might come down to what sorts of data you are storing, though.
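
If you're curious whether dedup would even pay off, `zdb -S` simulates building the dedup table for an existing pool and prints a histogram with the expected dedup ratio, without changing anything (pool name is a placeholder):

  zdb -S tank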


Yup. This. Now that I have 32 GB of RAM in my FreeNAS box I decided I really didn't need dedupe after all. I just don't have that much duplicated data, and I've got 60 TB of HDDs in the box. So I use the RAM for VMs instead!


Yeah, dedup is mostly useful for massively multi-user setups: mail, file sharing, etc. For small setups you can do a `fdupes` pass once a day and fix it yourself, I suppose.


I remember that time I turned on dedup on our 45drive box with 32GB of ram and 140TB~ of data..


...and nothing happened because it does not retroactively dedup existing data?

Over time it would become horrendously slow, I agree, since you have too little RAM by a factor of at least 8.


And this is why ixSystems disables access to the dedup switch on their TrueNAS systems without contacting support to enable it.

I've got 30TB of small files stored on our TrueNAS system at work, there's no benefit to enabling it but if someone decided to toggle it we'd quickly learn there isn't enough RAM in the world to handle billions of 1-16KB files....


Does anybody use ZFS as a replacement for database backup/restore in a test environment? I'm not sure, but it seems possible to use ZFS snapshots to quickly restore a previous database state. Note: it's just a question, I'm not advising anyone to try this.


Well, not in a test environment, but for production updates of some NoSQL stuff, sure. Snapshot the datasets, clone them over into a new rw-dataset, run the upgrade on the clone. Upgrade went wrong and corrupted your files? Destroy the clone and make a new one, then run it again (after fixing whatever caused the corruption obviously).

Want to test the upgrade in your production environment beforehand? Well, make a clone a couple days early.

And once all works out, a few days later you promote the clone and destroy the old datasets. Need a rollback? Well, just start the old application instance on the old datasets, nothing touched them.

Doesn't work for all database types, especially if you have no possibility to replay new data into the rollback. But if your system allows it, it is really comfortable.
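
A rough sketch of that workflow with made-up dataset names (not the poster's actual setup):

  # before the upgrade: snapshot, then clone into a writable dataset
  zfs snapshot tank/db@pre-upgrade
  zfs clone tank/db@pre-upgrade tank/db-upgrade

  # run the upgrade against the clone; if it corrupts things, throw it away
  zfs destroy tank/db-upgrade

  # if it works out, promote the clone so it no longer depends on the
  # origin snapshot, then retire the old dataset once you're confident
  zfs promote tank/db-upgrade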


Thanks, very interesting.


File system snapshots of databases are not necessarily consistent, and can not always be restored like that.


Atomic snapshots like ZFS's are always consistent for Postgres. I guess other databases with a similar write-ahead log can be snapshotted as well?


> Atomic snapshots like ZFS's are always consistent for Postgres.

As long as you make sure to only use one filesystem, i.e. you don't place pg_xlog or some tablespaces on a different filesystem. You can get very weird corruption in such cases :)


With ZFS you can also do atomic snapshots of multiple filesystems. https://serverfault.com/questions/608223/is-zfs-snapshot-r-o...
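
For example (a sketch; dataset names are placeholders), a recursive snapshot of a dataset and all its descendants is taken as a single atomic transaction:

  zfs snapshot -r tank/pgdata@before-backup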


Interesting. I played with ZFS years ago on OpenSolaris; I don't remember atomic snapshots being there. I definitely need to dive into ZFS again.


But if we "stop" this database/schema and make a ZFS snapshot (assuming the "partition" holds only this database's files), it may work. It could be useful for testing purposes, I guess.


A logical issue that I have with the existence of such filesystems as ZFS and BTRFS is that the problem of "bit rot" should be addressed at a lower abstraction level - hardware or the driver - rather than at the level that should be primarily responsible for user-visible organization of files, directories, etc.


How?

Bitrot occurs because the lower-level hardware fails. When you put a hard drive into storage for, say, 5 years, the bits may change. Even if a hard drive remains in constant use for 5 years, if the files or directories aren't checked and double-checked regularly, the error-correction codes may fail over time.

It's a fundamentally different problem from hard drives that are being used constantly, as, say, swap.

Hard Drives typically include Hamming codes or ECC bits to address typical corruption issues.

-------------

The fundamental principle at hand here is as follows: to ensure integrity of files, you need to regularly check file data. Only the Filesystem would know which files were recently checked.


Couldn't the drive's firmware or the driver do the same just as well (except on physical records instead of files)?


First off, real hard drives have "SMART" data that detects (and automatically corrects) simple errors. So remember, hard drives ALREADY have a large degree of error correction built in. It's just not enough for serious data-storage purposes.

The "Bit-Rot" scenario is particularly harmful to RAID5 (Minimum 3-hard drives. Two contain data, one contains "parity" that can fix any errors on the other hard drives. Then the parity is structured to be striped equally across the three drives). Modern RAID drivers can do this rather easily.

The problem with "bit rot" is that a RAID5 array will not rebuild itself until it detects an error. If you're reading files along and all of a sudden the hard drive detects an error, no problem (in the typical case): just rebuild the data from the parity.

However, "Bit Rot" means that the parity bits (on the 3rd backup hard drive) have ALSO rotted away.

----------

The only way to fix this "bit rot" error is to regularly read through your data and check for it. No hard drive is going to silently spin and hamper the performance of the system for self-verification purposes... but a filesystem / operating system can schedule these "scrubs" to occur during periods of low I/O.

Which is how ZFS and Windows' ReFS work. When your computer is idle, the OS checks for bitrot. When the computer starts to work again, it pauses the "low priority" bitrot checks and serves the data.

-------

ZFS doesn't quite work like Windows' ReFS: ZFS simply checks for bit rot whenever a file is accessed. Every time. There is also a "zpool scrub" command (which you can put into a cron job) to read every block and therefore check for bitrot.
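
A minimal sketch of scheduling a monthly scrub via cron (pool name is a placeholder; a running scrub can be stopped with `zpool scrub -s`):

  # /etc/cron.d/zfs-scrub -- scrub "tank" at 03:00 on the 1st of every month
  0 3 1 * * root /sbin/zpool scrub tank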


"enterprise" storage arrays do this for you.

You have a glob of storage that is presented as a block device, but underneath is actually a real, no-fooling filesystem with FEC, snapshots, and all sorts of other goodies.


You must not read HW errata or device drivers.

It's been posited that the main push for ZFS was perceived bugs in UFS that happened to be LSI firmware bugs once they had the capabilities of ZFS to detect them.


You can get a disaggregated version of what ZFS does - you can have filesystem -> LVM -> MD -> devices, where the MD layer does the checksumming at a sequence-of-bits level. It tends to work less well than ZFS though; snapshots are a lot more efficient if the low level knows which bits are which files at the high level, scrubbing can be done more efficiently if you know which files are in use and which aren't, if the parity checking wants to use idle bandwidth then it needs to know about user-requested access. Perhaps most importantly, data writes need to be async for performance, but filesystem metadata needs to be written in a way that will maintain the filesystem's invariants, so the question of when bits hit the physical disk ends up being intimately entangled with the filesystem's internals. It's really difficult to express an interface between the filesystem layer and the parity layer that lets it do the right thing.


If you have a layer that knows there's another copy of the data on a different disk, that layer should be checking for bit-rot even if the individual drives have their own error correction.


I have to wonder what's going to happen once those storage-level, random-access non-volatile memory technologies finally make it out of R&D and into the market.

I mean, as it is now it seems like we have a hard enough time dealing with comparatively simple hybrid memory systems.


I am really excited for bcachefs. It is also the only FS that has support for ChaCha20-Poly1305 encryption.


I wonder if a non-hardware-accelerated encryption algorithm is good enough for an FS that also has checksums. The CPU is already busy with checksumming, so doesn't this considerably slow down writes?


Both Chacha20 and Poly1305 are optimized (by design) for running on general purpose CPUs. AES-GCM using AES-NI instructions is still faster, but not that much [0].

[0] https://community.qualys.com/thread/16005


Somewhat counterintuitively, Chacha-Poly is as fast as AES-NI. You could probably go even faster using QuickAssist but most people don't have that.


Depends on your CPU. Desktop/server CPUs are plenty fast enough, but a home NAS might slow down.

