ZFS Is the Best Filesystem For Now (fosketts.net)
311 points by ingve on July 12, 2017 | 269 comments



> ZFS never really adapted to today’s world of widely-available flash storage: Although flash can be used to support the ZIL and L2ARC caches, these are of dubious value in a system with sufficient RAM, and ZFS has no true hybrid storage capability.

How is L2ARC not "true hybrid"?

> And no one is talking about NVMe even though it’s everywhere in performance PC’s.

Why should a filesystem care about NVMe? It's a different layer. ZFS generally doesn't care if it's IDE, SATA, NVMe or a microSD card.

> can be a pain to use (except in FreeBSD, Solaris, and purpose-built appliances)

I think it's just a package install away on many Linux distros? Also installable on macOS — I had a ZFS USB disk I shared between Mac and FreeBSD.

Also it's interesting that these two sentences appear in the same article:

> best level of data protection in a small office/home office (SOHO) environment.

> It’s laughable that the ZFS documentation obsesses over a few GB of SLC flash when multi-TB 3D NAND drives are on the market

Who has enough money to get a multi-TB SSD for SOHO?!


> How is L2ARC not "true hybrid"?

It doesn't persist across import/export or reboot.

It is demand-filled.

Not all data in the main vdevs are eligible for l2arc.

There is memory overhead for l2arc buffers.

There is CPU overhead in processing l2arc headers.

Once a buffer is in l2arc it stays in l2arc until the underlying data is overwritten or destroyed, or until the l2arc has filled up and the buffer is replaced with fresher data.

A true hybrid in the zfs context would let one pin a dataset or zvol onto a particular vdev, or pin only the (zfs) metadata (or a subset thereof) of a dataset, zvol or pool to a particular vdev.

OpenZFS will eventually get both persistence and this form of true hybrid.

Unfortunately automatic migration by zfs of hot data to low-latency vdevs and cool data from low-latency vdevs is not really possible without solving the infamous block-pointer-rewrite problem.


> Why should a filesystem care about NVMe?

Because at some point the filesystem becomes a bottleneck. ZFS was designed with the assumption that CPUs would be way faster than storage. When you get speeds over 10GB/sec [0], you are going to spend a lot of time checksumming all that data.

[0] http://www.seagate.com/ca/en/about-seagate/news/seagate-demo...


Fletcher checksums are very cheap and not a bottleneck.

https://github.com/zfsonlinux/zfs/issues/4789#issuecomment-2...


Maybe I'm reading those benchmarks wrong, but they appear to max out well under 10GB/s. This would mean you'd be CPU bound on your checksums alone with one of those Seagate cards.


That is a Xeon Phi; a modern Xeon with 10+ cores should do tens of GB/s.

https://cloud.githubusercontent.com/assets/472018/13333262/f...

https://github.com/zfsonlinux/zfs/pull/4330


Hmm, I for one don't want to tie up ten cores with IO.


If you aren't checksumming your data, is it truly your data?


Who checksums the checksummers? ;)


You can't handle the checksum


Mr Edward Checksum Checker.

Update: At least 1 person didn't get this joke :D


> Hmm, I for one don't want to tie up ten cores with IO.

What else are you going to use the two CPUs in a filer for?


On a single thread.

ZFS is well pipelined for multicore throughput.


Although not recommended, you can turn checksums off. The granularity is at the dataset level, too, so you have a lot of flexibility.
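
For what it's worth, that's a one-liner per dataset; a minimal sketch, assuming a pool named tank with a throwaway dataset:

  zfs set checksum=off tank/scratch   # disable checksums for this dataset only
  zfs get -r checksum tank            # verify what every dataset has inherited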


What could ZFS do differently to solve that problem while maintaining data integrity?


Just one idea: offload checksum calculation to a DMA engine. Linux already has a generic DMA engine facility in the kernel, backed by e.g. I/OAT on some Intel hardware.


Assuming this is the same thing as hardware-assisted checksums, both the ZFS on Linux maintainer and Intel have said that they are working on it at various cons last year.


Hybrid storage combines flash and disk for better performance and lower cost. Most hybrid storage is tiered, meaning that data can "live" on either flash or disk, but there are other approaches that look a lot more like a cache (see Nimble Storage, for example). True hybrid storage would work like Apple Fusion Drive or hybrid Storage Spaces Direct - a single pool with SSD and HDD where data can reside wherever is best.

L2ARC is only ever a cache and although it can be on SSD, ZFS generally won't use much more than a few tens of GB. ZIL isn't even a cache and is only used for synchronous writes. Check any ZFS tuning guide and the gist will be "just buy more RAM or create an all-SSD pool" rather than trying to wedge an SSD into L2ARC or ZIL.
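
For reference, "wedging an SSD into L2ARC" is mechanically trivial; whether it actually helps is the real question. A sketch with a made-up device name:

  zpool add tank cache /dev/nvme0n1   # attach an SSD/NVMe device as L2ARC for pool "tank"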


L2ARC and the ZIL can be great in many storage situations, but not all.

We have a TrueNAS appliance with a 480GB L2ARC and a small 120GB ZIL (never fully used) that backs image/document storage for our ECM suite. Our metadata usage on the filesystem is astronomical due to having billions of small (<16KB) files. The L2ARC may not do a whole lot for the actual data (since most of it isn't hit frequently enough to be eligible for the ARC/L2ARC), but it's instrumental in maintaining performance with such massive amounts of filesystem metadata.


A write-focused SLOG device can massively improve your write IOPS though, and is massively cheaper than an "all-SSD" pool while approaching its normal-use performance. If you have a sustained heavy I/O load, sure - going full SSD will be your best bet, but for most use cases you only need peak IOPS performance in short bursts.


How reliable are hybrid storage drives? I feel like the last one I saw failed pretty shockingly quickly but not sure (it wasn't mine)...


I'm sorry, I was referring to hybrid storage as a technology category, not the hybrid disk drives like Seagate Momentus XT. You're right that a 2-drive hybrid (Fusion Drive) will mathematically be less reliable than a non-hybrid one and that those "hybrid disks" haven't lived up to the hype.

Pretty much every enterprise storage solution designed today is hybrid (SSD plus HDD) or all-flash and includes lots of advanced availability features.


> "Pretty much every enterprise storage solution designed today is hybrid"

By this do you mean simply that there is both SSD and disk storage around, or do you mean that the storage system transparently chooses where to store particular things without the apps having to care?

Because the latter thing is not my experience.


If you buy an enterprise storage frame there are a variety of mechanisms where as a storage consumer you just see a LUN, but in the backend individual extents/blocks/volumes are tiered in different performing areas.

These systems will transparently "promote" hot blocks to flash or faster spinning disk, etc., and demote cold blocks based on usage patterns and policy, without the app being aware.

There are a lot of different ways to do it, ranging from tiering within the array to a storage virtualization solution that can tier data across different storage platforms. In one case I worked on a project where that virtualization tech was used to consolidate 10 data centers to 1, pretty much transparent to the end users and mostly transparent to the folks running apps. (Exceptions were mostly apps that built their own storage HA.)


Ohh okay, gotcha.


I had a ridiculously bad experience with the Seagate Momentus XT drives. Never again..


> How is L2ARC not "true hybrid"?

L2ARC in my understanding is only for reads, whereas ZIL is the write ahead log. Ideally, ZFS would "combine" the two into a MRU "write through cache" such that data is written to the SSD first, then asynchronously written to the disk after (ZIL does this already) but then, when the data is read back, it's read back from the SSD.


This makes sense; if you were building a service that served bytes off of disk, with an in-memory LRU cache, you wouldn't have two separate pools of memory, with one for writes and one for reads.

Did the ZIL and L2ARC concepts come up before SSD was widely available? Especially the ZIL seems very much optimized for crazy enterprise 15k rpm spinning rust. Memory and SSD access characteristics are so different from spinning disks; I don't know why ZFS separates ZIL and L2ARC.


Because the ZIL is a journal log, not a cache. It is intended to increase data security without sacrificing too much performance. Many people also confuse ZIL and SLOG devices though...

By default, the ZIL is written to the same disks where the data will be stored, but an external device (aka SLOG) can be added. From that point on, your write IOPS will be limited by this SLOG device, so normally you add a more expensive, fast disk as the SLOG device to increase your write IOPS.
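
Concretely, attaching a dedicated SLOG (and removing it again later) looks roughly like this; device names are made up:

  zpool add tank log mirror /dev/sdx /dev/sdy   # mirrored SLOG for sync writes
  zpool remove tank mirror-1                    # log vdevs can be removed again (use the name zpool status shows)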


> Did the ZIL and L2ARC concepts come up before SSD was widely available?

Yes. After they realized that their initial claims about not needing such things was bullshit (which some of us had told them at the time) but before SSDs became common.


> I think it's just a package install away on many Linux distros? Also installable on macOS — I had a ZFS USB disk I shared between Mac and FreeBSD.

It's easy to install if you want to use it as an additional filesystem. But if you want to install e.g. RHEL on root ZFS, it's quite an adventure. Even Ubuntu with first-class support for ZFS does not support ZFS on root out of the box. Actually I don't understand it. Of all the features, snapshots look like the killer feature for Linux distributions. Make a snapshot before an upgrade, allow an easy rollback if the upgrade goes wrong. Something like Windows restore points, but much more reliable.
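
The "restore point" workflow really is that small; a sketch assuming a root dataset named rpool/ROOT/ubuntu:

  zfs snapshot rpool/ROOT/ubuntu@pre-upgrade   # take the restore point
  # ...run the upgrade; if it goes wrong:
  zfs rollback rpool/ROOT/ubuntu@pre-upgrade   # rolling back the live root is usually done from a rescue/boot environment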


ZFS root absolutely is supported on Ubuntu and Debian. Every Debian system I've built in the past year or so has been 100% ZFS (around 6 physical systems, but I've also built AWS AMIs this way). You have to install via debootstrap, but it's definitely a working and supported configuration. Installer support would sure be welcome though, it's not trivial!


And that's exactly the issue. You can't just use Ubiquity to set it up, or even the terminal; you have to go down a long, complicated, badly documented path.


If you use proxmox (Debian), you do not need a terminal. And you get all the nice proxmox features. But I agree, standalone setup is no fun.


Actually, I do use the terminal to install it. Boot off a liveusb, switch vty, and go.


I agree, snapshots are amazing for / on a linux machine. Makes backups actually work right, and be very inexpensive (computationally).

But, the tooling is still rather new and untested. At work we are still stamping out upstream bugs that really shouldn't exist, but the ecosystem is certainly getting better. It's only been a couple years since the ZFS on Linux project has gotten remotely any uptake by the distro folks, and even now that interest is rather tepid due to the licensing issues.

If ZFS was GPL I think it would have been the default filesystem on Linux for quite some time now.


> if you want to use it as an additional filesystem

Which is totally fine for a NAS! But yeah, I like my ZFS root on all the FreeBSD installs :)

> snapshots looks like killer feature for Linux distributions

True. FreeBSD and of course Solaris/illumos had boot environments for a long time, it's an excellent feature.


Since Btrfs also has snapshots, it seems perfectly adequate for a root FS. It hasn't seemed important to me to run a ZFS root, even though I have all my user data in ZFS.


btrfs snapshots have a lot of gotchas. Quoting myself:

https://news.ycombinator.com/item?id=14724820


If you have a mandatory flush to disk anytime there's a snapshot, they suddenly become a lot more expensive. They are still atomic in that the operation either completely happened or didn't happen at all if a crash occurs after the command returns, i.e. you don't end up with a partial or corrupt snapshot after you reboot.

As for the need for a snapshot to be on disk when sending: what you cite says you get an obscure "stale NFS file handle" error, not that you get missing files. If you're silently getting missing files on a send/receive, that's a bug and should be reported.

I use snapshots quite a bit both for data, backup/replication with send/receive, and for root fs, and haven't had problems with it. The known problem with Btrfs snapshots is that they are deceptively cheap to create, but become expensive later on to delete due to back reference searching, freeing extents (or not if they're still held by other snapshots) and updating metadata. Dozens to small hundreds aren't normally a problem, in my use case I don't notice performance problems.


Good to know. I've used them only very rarely.


RHEL has a repository run by the ZFS on Linux team, and if you know ZFS it's not much harder than partitioning your disks. The big issue is a lack of documentation on using GRUB with ZFS. It's seemingly nonexistent, leading most guides to say /boot needs a separate partition.


> /boot needs a separate partition

This is true with any Linux-supported FS on most modern PCs since UEFI can only read VFAT partitions.


The ESP does not need to contain /boot; it only needs to contain the EFI binaries. And the UEFI specification is for an independent filesystem based on FAT; VFAT is still patented. In practice, any FAT filesystem usually works.


You actually don't need a separate /boot - recent GRUB versions know what to do. You do need a small EF02 partition (I make mine 4096 sectors), and then the rest of the disk as a BF02 partition. I wrote this up for an AWS AMI, but I use the exact same setup on my Debian desktops and servers - https://www.scotte.org/2016/12/ZFS-root-filesystem-on-AWS
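
In sgdisk terms that layout is roughly the following (disk name is hypothetical, type codes and sizes as described above):

  sgdisk -n1:0:+2M -t1:EF02 /dev/sda   # ~4096-sector BIOS boot partition for GRUB
  sgdisk -n2:0:0   -t2:BF02 /dev/sda   # rest of the disk for the ZFS pool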


/boot and the EFI System Partition aren't necessarily the same thing


> Who has enough money to get a multi-TB SSD for SOHO?!

https://www.amazon.com/Crucial-MX300-Internal-Solid-State/dp...

I was contemplating a build with 2 of these in a RAID 1 configuration for my next homelab server.

Personally I run a gaming (windows) desktop at home and an always-on UPS-backed homelab server that handles minor ops tasks (mostly backing up side projects and some ETL) + provides dev VMs.

My home office "budget" is ~$2k/year. My gaming/work desktop is ~4 years old and represents ~$2.5k of that budget. Monitors/peripherals/desk/chair generally eat another $2k and are of a similar age. I generally spend ~$2.5k on the dev server. I then usually toss ~$1k into a laptop.

I could easily see someone who purely works from home (rather than 1-2 days a week) operating with a larger budget and genuinely needing a ZFS setup of 2TB SSDs.

Realistically, $2-3k/year is 2-5% of the sort of salaries we see on HN given we make a living at this sort of thing... it isn't surprising to me that people would spend that kind of money.

https://hurdlr.com/blog/software-web-developer-tax-deduction...

Keep in mind "business equipment" certainly qualifies for such a dev server so you won't be paying taxes on it if you itemize as well.


> Keep in mind "business equipment" certainly qualifies for such a dev server so you won't be paying taxes on it if you itemize as well.

Even factoring in my homelab spend, I don't get more from itemizing than from taking the standard deduction as a married individual making $85K, and I spent close to $2,000 on it last year.


> How is L2ARC not "true hybrid"?

I think what that may be referring to is that the ARC is in-RAM and obviously cleared on a reboot, so as a result L2ARC on an SSD is also not persistent. After a reboot, you have to allow the ARC to fill up, then as it evicts data from the L1ARC it's pushed to L2ARC. Until that happens the SSD is not used at all.


ARC is not necessarily cleared on reboot (if doing a "fast" reboot, that is, a kernel reload); that's platform and implementation-dependent. Look up "Persistent L2ARC".


Perhaps "true hybrid" == RAM->SSD->HDD->tape, that is, warm-through-cold storage?

Whatever.

What would be really nice is raw SSD/storage access so that ZFS (or other FS) could manage all the wear leveling and bad block mappings.


Wear leveling is an implementation detail of the medium that can and will change. It doesn't seem right to put that in the filesystem itself. If anything, I would think it should go into some 'generic <specific flash technology here>' device driver.


> It doesn't seem right to put that in the filesystem itself

It absolutely belongs in the filesystem. When you're doing RAID of any type across the devices, you need that layer to manage the underlying media. A single device view will never appropriately manage wear leveling and garbage collection.

There's a reason companies like NetApp have been working with drive vendors to have more control over the underlying media:

http://www.samsung.com/us/labs/pdfs/2016-08-fms-multi-stream...


It's a difficult problem to solve. If we do put it in the filesystem, we need some way for the filesystem to know how the NAND behaves. Otherwise the lowest common denominator dictates and nobody gains anything.

With this in mind, I don't see why the disk cannot handle this in the firmware. As long as there is enough free NAND on the drive, it can manage wear leveling and GC just fine, assuming that it gets TRIM commands.


You've just described why we have storage appliances for high performance and enterprise workloads, and why the drive to standardize always swings back around to customized software and hardware.


It belongs in the filesystem in exactly the same way that volume management belongs in the filesystem.

ZFS includes volume management, _naturally_. I say naturally, but that was a radical idea 15 years ago. Even now it's not universally accepted, but it's quite correct!


I'll add that ZFS is also quite concerned with write patterns IIRC, and thus wear leveling.

The author of the article seems to assume we should all be trusting SSD or "hybrid storage" firmware to properly handle this sort of thing for us like nice black boxes.

I think this was one of the major problems ZFS was designed to solve. To make storage hardware more simple the idea was to move a lot of this logic into the OS (especially as large RAM sizes got cheaper). It's why having a RAID controller sitting under your vdevs is advised against.


Whatever is exactly my reaction :)

You get raw NAND access on like, home routers. OpenWrt/LEDE uses JFFS2 on that.


Change Tape to Cloud and I think you're on to something.


A 2TB SSD costs around 400-500USD. That's not exactly out of the realm of possibility for a small office.


I've got 6x 4TB WD RED in a ZFS RAID10 (3x2 mirrors). I get about 500Mbytes/s read and write, on average. You start getting into 10GbE territory pretty easily with even consumer drives and ZFS. With SSDs, you'd quickly need trunked 10GbE if you want to fully saturate your network - in addition to that consider the client requirements - e.g. each workstation with 10GbE or 1GbE. You'd have to have some pretty decent requirements to necessitate a ZFS RAID10 with [NVMe] SSDs.


The benefit of SSD is not the raw sequential transfer. That is easy to max out with spinning platters. You want SSD for random access, it's significantly better at that.


The ARC (and L2ARC on an SSD) give you the best possible IOPs for read operations given the limits of the underlying hardware. Asynchronous write operations are cached to memory before being written to disk, often sequentially. For synchronous writes, you should use a ZIL on a mirrored SSD.


Sure, but it's a rather high price for just faster storage.

And like… 2 TB of cache is a bit high for SOHO NAS, and if you go full SSD for storage, you'd want two of them for a mirror and that's 1000 USD already…


And we spend what? $200 upgrading from an i3 to an i7 for like... 30% faster CPU performance?

Upgrading $100 hard drive to $400 SSD results in like, 500% improvements in storage speed. If you have any storage-related task... such as video editing, handling of large datasets and whatnot... the SSD will have a far bigger impact on your productivity than any CPU upgrade.


Probably better to spend $150 on SSD cache and $150 on RAM cache than to spend $300 on RAM and $0 on SSD, though. Or the other extreme.


A 2TB drive for 400 or 500 USD? The cheapest OK ones (Samsung EVO) I can find are 700 USD, and then we're not really talking about drives you might actually consider putting in a storage pool used to store critical data.

And if you want 6 or 7 of those puppies, in a non-diy server through a vendor which offers support, 10GE and with enough ECC and CPU to handle that IO load/throughput, you're in for a treat :)


From a protocol level, NVMe is actually designed for SSDs - just look at the amount of queues it has https://en.wikipedia.org/wiki/NVM_Express#cite_ref-ahci-nvme.... To really take advantage of this, you'd need your filesystem to be designed for many independent IO streams. ZFS metaslabs probably help, but I'm sure there is more you can do in this area.

On the other hand, I don't think many people would be hitting any limits where this matters.


Both the L2ARC and the ZIL can be put on separate vdevs (IIRC), to use faster or more reliable SSDs. L2ARC typically wants a pair of striped fast SSD vdevs while the ZIL should be on a mirror vdev of higher-reliability (SLC-like) SSDs.

Also, be sure to have 8+ GiB system RAM available at all times or performance is gonna suck.


> I think it's just a package install away on many Linux distros? Also installable on macOS — I had a ZFS USB disk I shared between Mac and FreeBSD.

Having your root on a filesystem that is provided with your kernel is not an ideal situation; update issues make your system basically unbootable.


not* provided?


> How is L2ARC not "true hybrid"?

As far as I'm aware, it's not persistent. Reboot, and your cache of recently accessed files is gone.


I've been disappointed in Linux filesystems and Intel hardware lately. There's little integrity checking in ext4, and btrfs is still having growing pains. A recent search for a svelte laptop with ECC memory yielded nothing. Sheesh, wasn't this stuff invented like 30+ years ago?

I understand Intel is segmenting reliability into higher-priced business gear, but as a developer that depends on this stuff for their livelihood the current status quo is not acceptable.

Linux should have better options since profit margins are not an impediment.


I have to agree, segmenting on ECC feels really outmoded these days. With Intel you basically need a Xeon even before you can start thinking about ECC, and that does limit your options somewhat. Luckily at least these days Intel is making mobile Xeons, so you can get ECC laptops from most major manufacturers (I checked Dell, HP and Fujitsu, and according to sibling comment Lenovo too).


It's the laptop manufacturers that are to blame. Check out the specs on this i3 with ecc support. http://ark.intel.com/products/90734/Intel-Core-i3-6100T-Proc...

Granted, that's still not a laptop CPU. But it's not a Xeon either.


The closest you will get that I know of is a Lenovo P50/P51. They have builds with Xeons with ECC.

But I remember when ECC memory dropped out of favor. I wish it had never happened.


All the new AMD chips have ECC by default, which is exciting.

We will see how their mobile chips stack up.


Haven't all "big" AMD CPUs supported ECC to some degree for as long as the memory controller has been on the CPU? The real problem on that side has been motherboard support, which has been afaik pretty much nonexistent.


All of the FX series CPUs supported ECC memory (albeit the unbuffered kind). As you said, finding a compatible motherboard was a pain. All the ones I found used the massively power hungry 990FX-A chipset. But I did get one, and my home NAS uses an FX-8350. Kind of scary getting an automated email every week or so about a bit flip that would have otherwise been silent corruption on my home desktop.


A bit flip every week is way higher than it should be.


Here is a decent build that I recently did for my home server: https://forums.freenas.org/index.php?threads/any-help-with-m...

It uses the Intel G4560 which supports ECC, and is very inexpensive. It's low power too.


Dell Precision 5520 with a Xeon supports ECC memory (up to 64G DDR4) and it's the same size as an XPS.


Ironically, that is the laptop I ended up getting, but because I didn't want a hot, power-sucking laptop I got a lower-speed i5 instead.

Not to mention that no one could tell me if you could install ECC memory or if it would be used by the firmware if it even worked at all.


Are you sure about the 64GB? As far as I see it only has two slots, and I can only configure it with 32 GB.


I'm running one right now with 64G of DDR4 ECC.

Not low power though, 1.3v each 32G DIMM.


The CPU supports it but I still don't see the option to actually configure it with ECC memory (which has been keeping me from buying one for quite a while now).


I have one with ECC memory that's running my Arch Linux install.

I had to buy the memory myself but I was looking for "better" memory anyway.


So you bought it with 8GB non-ECC, plugged in 2* 32GB ECC Dimms and it works?

I can't find those DIMMs, 16GB DIMMs are the biggest I can find.


Yep, I think I'm using memory that is not commercially available although I didn't know it before.

I can't find them online.

For reference they're Samsung branded, I can take a photo if you like; you can see the "width" of the channel in linux which tells you if you're using ECC or not.


Wow, where did you get them? Is it an engineering sample?


Are you absolutely positive that ECC is functional? What memory did you opt for? Does this void warranty?


Memory upgrades on this line do not void the warranty; however it seems my exact memory is not commercially available (yet).

I can see the full "width" in linux, which indicates that the OS can see it.


Start from here. www.dell.com/developers


Holy shit, Ubuntu is $101 cheaper than Windows! It's finally possible to escape from the "Windows is cheaper than Linux because of crapware" quagmire.


Not sure what you're getting at. Security vulnerability?


Sorry, wrong link. www.dell.com/developers. Updated the parent too.


Still don't see ECC there on the 5520.


Search for ECC on that page. Some of them do. It is possible that their SKUs change with time. Currently, 3520, 7510 and 7710 show ECC support. Dell's site is quite terrible for navigation. You may have to spend some time to find the right model.


The point was to have the compact XPS-like casing. All those other ones are bulky. So I'd have to go with replacing the RAM myself... I really don't get why this configuration option is missing here.


It's not available for that model, but the 3520, 7520, 7720, 7510 and 7710 all offer ECC RAM as an option.


See my reply to the other comment pointing this out.


ZFS, at least on Solaris, has an issue with many readers of the same file, blocking after ~31 simultaneous readers (even when there are NO writers). Ran into this with a third-party library which reads a large TTF to produce business PDF documents. The hundreds of reporting processes all slowed to a crawl when accessing the 20MB Chinese TTF for reporting because ZFS was blocking.

I can't change the code since it is third party. The only way I saw to easily fix it was, on system startup, to copy the fonts under a new subdir in /tmp (so in tmpfs, i.e. RAM, no ZFS at all there) and then softlink the dir the product was expecting to the new dir off of /tmp, eliminating the ZFS high-volume multiple-reader bottleneck.
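
The startup workaround amounts to something like this (paths are hypothetical):

  mkdir -p /tmp/fonts
  cp /opt/app/fonts/*.ttf /tmp/fonts/   # copy the TTFs into tmpfs (RAM)
  mv /opt/app/fonts /opt/app/fonts.orig
  ln -s /tmp/fonts /opt/app/fonts       # the app now reads the fonts from tmpfs, bypassing ZFS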

Never had this problem with the latest EXT filesystems on my volume groups on my Linux VMs with the same 3rd party library and same volume of throughput.


If you recreate the bug on Linux and have something like a simple reproducer, I guess the ZoL devs would be more than happy to fix it or at least understand it:

https://github.com/zfsonlinux/zfs

From reading the pull requests and issues on that repo I've got the impression that the next release, 0.7.0, will be quite a step forward, and there seems to be quite sophisticated work going on to tackle performance issues.


I suggest getting on to the OpenZFS people and seeing whether this is still a problem in illumos; and fixing it if so.

* http://www.open-zfs.org/wiki/Main_Page

* https://github.com/openzfs/openzfs

* https://wiki.illumos.org/display/illumos/ZFS


DragonFlyBSD's HAMMER [0] is another viable alternative.

Unfortunately the next generation HAMMER2 [1] filesystem's development is moving forward very slowly [2].

Nevertheless, kudos to Matt for his great work.

[0] https://www.dragonflybsd.org/hammer/

[1] https://gitweb.dragonflybsd.org/dragonfly.git/blob_plain/HEA...

[2] https://gitweb.dragonflybsd.org/dragonfly.git/history/HEAD:/...


I had a look at HAMMER as it seemed it might meet my requirements [0] but I couldn't figure out whether it supports replication or erasure coding. Don't suppose anyone here knows?

[0]: https://news.ycombinator.com/item?id=14756787


HAMMER does support replication but it doesn't have erasure coding.

If you are using a recent Linux Kernel I can suggest you to use dm-integrity [0] (optionally with dm-crypt) with your favorite filesystem. It's not erasure coding but it can help detecting silent data corruption on the disk.

[0] https://old.lwn.net/Articles/721738/


Hmm, it does seem like it should be possible to put mdadm on top of dm-integrity devices, I might try that out.


The article does not mention bcachefs as a future alternative: http://bcachefs.org/


HN: Please consider subscribing to make this alternative real.

https://www.patreon.com/bcachefs


I’m certainly an excited audience for modern alternative file system options. A quick look at the homepage of bcachefs reads:

“Not quite finished - it's safe to enable, but there's some work left related to copy GC before we can enable free space accounting based on compressed size: right now, enabling compression won't actually let you store any more data in your filesystem than if the data was uncompressed.”

What is the point of having compression enabled if you can’t store more data than you could if it was uncompressed? Shouldn’t they just say "the compression mechanism works but isn't useful yet; as of now it is just extra overhead"?


Compression can give you a net performance improvement on slow disks if the CPU time required to compress is less than the time saved by writing less data to the disk.


I had never heard of bcachefs. Thanks for the pointer!


I see they use another approach to file systems. They sound very confident. A bit too confident for my taste. Are they experienced enough in building file systems that they won't run into huge problems when they actually get into the dirty details?


The developer of bcachefs already has the experience of creating bcache, a general-purpose block-level caching system that is optimized for the characteristics of flash-based SSDs as the cache devices. The original bcache essentially serves as the proof of concept and practice implementation for all the lower-level portions of bcachefs (not that it was planned that way; they just realized while working on bcache they had implemented most of a filesystem).


We can all hope...


(Near) zero-cost snapshots and filesystem-based incremental backups are amazing. Just today I was saved by my auto snapshots [1]. Apparently I didn't `git add` a file to my feature branch and without the snapshot I wouldn't have been able to recover it after some extensive resetting and cleaning before I switched back to the feature branch. It's really comforting to have this easy to access [2] safety net available at all times.

Now that Ubuntu has ZFS built in by default, I'm seriously considering switching back, and since I too have been burned by Btrfs, I guess I'll stay with ZFS for quite some time. Still, the criticism of the blog post is fair, e.g. I was only able to get the RAM usage under control after I set hard lower and upper limits for the ARC as kernel boot parameters (`zfs.zfs_arc_max=1073741824 zfs.zfs_arc_min=536870912`).

[1] https://github.com/zfsonlinux/zfs-auto-snapshot

[2] The coolest feature is the virtual auto mount where you can access the snapshots via the magical `.zfs` directory at the root of your filesystem.
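
The recovery itself was just a copy out of that directory; a sketch with placeholder snapshot and file names (the .zfs directory lives at the root of each ZFS filesystem):

  ls /home/.zfs/snapshot/                                        # list the available auto snapshots
  cp /home/.zfs/snapshot/SNAPNAME/me/project/forgotten.c /home/me/project/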


"(Near) zero-cost snapshots and filesystem-based incremental backups are amazing."

This.

We[1] offer ZFS filesystems in the cloud[2] and one of the nicest things to explain to customers is that they don't have to think about "incrementals" or "versions" or retention in any way. They can just do a "dumb rsync" to us (mirror) and our ZFS snapshots, on their schedule, will do the rest.

In the event of a restore, the customer just browses right into "5 days ago"[3] and sees their entire offsite filesystem as it existed 5 days ago.

[1] rsync.net

[2] http://www.rsync.net/products/platform.html

[3] rsync.net accounts have a .zfs directory


  "Once you build a ZFS volume, it’s pretty much fixed for life."
The ease of growing/shrinking existing volumes and adding/removing storage is why I made the decision to go with btrfs when I rebuilt my home file server.


I hope you managed to use the fixed BTRFS, unlike me: I built a home NAS with tens of terabytes on top of BTRFS and then a year later learned that the version I had was broken and the fix is not applicable to a running cluster... So I have to buy another batch of HDDs, install a fixed BTRFS (who knows what other unknown issue awaits me in the darkness?) and copy everything over...


This advice isn't actually totally correct.

If you build a volume, you can add more vdevs to it.

My suggestion is to avoid RAIDZx and go with mirrors. To add more storage, just add another pair of drives and add it to your pool.

Here is what my main `tank` looks like:

  pool: tank
  state: ONLINE
  scan: scrub repaired 0B in 4h59m with 0 errors on Tue Jul  4 12:47:02 2017
  config:

  NAME                            STATE     READ WRITE CKSUM
  tank                            ONLINE       0     0     0
    mirror-0                      ONLINE       0     0     0
      wwn-0x50014ee20eba1695      ONLINE       0     0     0
      wwn-0x50014ee20eba337e      ONLINE       0     0     0
    mirror-1                      ONLINE       0     0     0
      wwn-0x50014ee2640efb3a      ONLINE       0     0     0
      wwn-0x50014ee2b964ef3f      ONLINE       0     0     0
    mirror-2                      ONLINE       0     0     0
      wwn-0x50014ee2b964f2d5      ONLINE       0     0     0
      wwn-0x50014ee2b964f4f4      ONLINE       0     0     0
  cache
    wwn-0x5002538d704e9ff1-part4  ONLINE       0     0     0

  errors: No known data errors

You can simply add another mirrored pair and get more storage. Or, you can update an existing mirror by adding new drives, then removing the old ones. Mirroring has better performance and IOPS than striped RAID too. If you feel that mirrors might be less reliable than RAIDZx, I'd suggest you read up about it - generally mirrors are more reliable (and you could mirror 3 drives if you liked, getting more throughput and IOPS - or have a hot spare).
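
Growing that pool by another mirrored pair is a single command, and swapping a mirror's disks for bigger ones is just attach/detach; the device ids below are placeholders:

  zpool add tank mirror /dev/disk/by-id/wwn-NEW1 /dev/disk/by-id/wwn-NEW2   # new mirror vdev
  zpool attach tank wwn-0x50014ee20eba1695 /dev/disk/by-id/wwn-BIG1         # grow mirror-0 in place
  # (then detach the old, smaller disk once the resilver finishes)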


As you add mirrored pairs your data isn't evenly distributed across all stripes is it?


The simple answer to your question is no.

The more complex answer is that over time, as you write more data, ZFS will re-balance the distribution of data over the available vdevs.


I think the concern is that if you have two disks die it is not a big deal, unless those two disks happen to be in the same pair[1], in which case most likely all data is lost.

[1] If you're not careful, this is actually very likely to happen. If you purchase 2 disks at the same time, and they are under the same usage patterns (which they will be when working in a pair) and at the same temperature, then they are very likely to die at the same time.


Btrfs is a mess too, especially with its metadata system that you have to manually resize constantly, data-loss bugs and fsck problems. I've had nothing but problems with it and will never use it on a production system.

LVM on the other hand makes it easy to do snapshotting, RAID, volume resizing, adding more drives, etc - at the cost of some performance in some cases. For the most part, it's the best trade off available if expandability is a requirement and you can't get away with simple mdraid.


btrfs has all those features as well and my original setup was mdadm raid 5 + LVM + ext4.

Regardless, I really haven't had any operational problems with btrfs for my 3 TB or so of data and when it has managed to get wedged because it couldn't allocate more space to the metadata pool during the initial import of data from my old array I fixed it with a simple rebalance command.

A cron job set up to run a minor rebalance weekly helps ensure you never run into that situation in practice, and I've not lost any data, so for now I'm comfortable using btrfs as my primary filesystem.

I am a little concerned about the longevity of btrfs in general though because it hasn't been receiving a lot of development work lately.


>its metadata system that you have to manually resize constantly

?

Doesn't ring a bell.

A lot of problems with Btrfs sound like hardware problems of one sort or another. If it's a legit bug, the only way such things get fixed is to report it to the developers <linux-btrfs@vger.kernel.org> with complete logs and system information. Did you?

The fsck is definitely hit or miss, but my perspective is the emphasis is on fixing bugs that obviate file system problems in the first place. The reality is an offline fsck for large file systems is just not scalable, so the best bang for the buck is bug squashing.

And there's a tons of that happening.

For the initial 4.12 pull (now done, and probably had a few dozen changes during rc's) 40 files changed, 1629 insertions(+), 834 deletions(-) https://lkml.org/lkml/2017/5/9/510

For the initial 4.13 pull 47 files changed, 1707 insertions(+), 1400 deletions(-) https://lkml.org/lkml/2017/7/4/436


He may be talking about the need to rebalance ... I run a btrfs balance every month to make sure I don't run out of space again.

It happened to me while on vacation -- I only had satellite internet and had to fix it over ssh at single characters per second.

It might have to do with a bug I saw mentioned on the list that referred to the free space map getting corrupted (but easily fixed with a -oremount,clear_cache). So I also run that monthly as well.
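
For reference, that monthly maintenance boils down to a tiny cron script along these lines (mountpoint and usage thresholds are just examples):

  #!/bin/sh
  # e.g. /etc/cron.monthly/btrfs-maint
  btrfs balance start -dusage=50 -musage=50 /srv/data   # rebalance only chunks that are <50% used
  mount -o remount,clear_cache /srv/data                # clear and rebuild the free space cache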

I've been too afraid to remove the cron job even though I moved my CentOS7 to the "mainline" kernel rpms from http://elrepo.org

Also, I only trust RAID1 + Crashplan.


I'm also using Btrfs for my home file server, but Btrfs's current RAID5/6 status leaves a lot to be desired. I'm running Btrfs on top of mdadm right now because of that.


Should be "mostly" resolved from kernel 4.12 : https://www.phoronix.com/scan.php?page=news_item&px=Linux-4....


It used to be "usable for most purposes" at some point already, before those scary data corruption bugs were discovered.

https://btrfs.wiki.kernel.org/index.php?title=RAID56&diff=30...


My last file server used mdadm raid 5 across 4 drives and I just couldn't take the write performance loss anymore so I switched to btrfs raid1(ish) block duplication.


This is only difficult with ZFS if you care about performance. If you are a typical home file server user, you can add vdevs in a rather ad-hoc fashion as they fill and it works pretty well.

There is a huge performance penalty of course, as the majority of new data will reside on the newest vdevs - but it generally works.

I do agree this is one of the largest drawbacks of ZFS, but very few filesystems get it right.


Yeah, as mentioned in the post, you can indeed just add a vdev to the pool whenever you want. But the risk, apart from performance, is data protection. If you add one disk, your whole pool is at risk. So of course you have to add two or more disks as a mirror or RAIDZ, but lots of home users don't want to do this. They want that Drobo thing where they throw another log on the fire and everything gets rebalanced. It's sad that ZFS can't do this, and that it can't "upgrade" RAID-Z1 to RAID-Z2. It's hard to do, but would be so much nicer.


> you can add vdevs in a rather ad-hoc fashion as they fill and it works pretty well

As long as they're expensive mirror vdevs, right? My impression is that home users want the efficiency of RAID-6 and they want incremental expansion (regardless of whether this combination is "good for them"). ZFS can't do that.


You should always use mirrors! http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-... "don’t be greedy. 50% storage efficiency is plenty". Mirrors perform better, perform MUCH better when degraded, rebuild MUCH faster.


Mirrors are also less safe. [0]

For a 6x8TB array and assuming an (optimistic) 10^-16 URE, you get a 3.5% failure rate for a RAID5 array, 0.7% for a RAID10 array and a 1.06e-08% failure rate for RAID6.

Why be greedy for all that performance? Most home-grade NAS or even some business-grade NAS isn't used for performance-sensitive operations, more like Word documents and family pictures, stuff you don't want to lose.

I'd rather take safety over performance here.

[0]: https://redd.it/6i4n4f


RAID is for convenience and performance. It is not and cannot replace backup. In any case, if you want safety, RAID1 is the way forward, and not RAID6. An 8-drive RAID6 with 8TB WD Red NAS drives (URE <1 in 10^14) is virtually guaranteed to have at least one read error during a rebuild (if the URE rate is true, which I believe it is not).

Regardless, the determining factor here is how much data do you need to read in the case of a failure to rebuild the array. RAID1 wins every single time because you cannot read less than the single drive you need to replace.
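
A quick back-of-the-envelope check of the "virtually guaranteed" claim, taking the quoted 10^-14 per-bit-read spec at face value (which, as noted, is probably pessimistic):

  # rebuilding an 8-drive RAID6 after one failure reads the 7 surviving 8TB drives
  python3 -c "from math import exp; bits = 7 * 8e12 * 8; print(1 - exp(-bits * 1e-14))"
  # ~0.99, i.e. at least one URE during the rebuild is all but certain at that spec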


The URE rate is most likely much lower.

However, the chance of failure for a RAID6 of 100x10TB disks is less than 0.482% after 1'000'000 rebuilds.

RAID1 is space-inefficient, a 100x10TB RAID array might never fail but it has only 10TB of storage space.

A RAID10 Array has a 14% failure chance for just 4x2TB disks using 10^14 failure rates.

RAID1 and RAID10 are definitely not the way forward, it is less secure, something that should be immediately apparent if you read the link in my previous comment.

A 10-disk RAID6 with 10TB disks is more reliable than a RAID10 by multiple orders of magnitude and more space efficient than a simple RAID1.


Your math is completely off. A 100x10TB RAID6 with a failed disk needs to read 990TB of data to rebuild. With a URE of 1 in 10^14 you will see 79.2 URE events on average during a single rebuild if the URE rate is correct (again, I don't believe it is) - this is the reason no serious engineer recommends a RAID6 for large arrays.

In the case of a RAID1, no one uses 100 mirrored drives. You use RAID10, and in the case of a failed disk, you must read 10TB to recover. With the same URE, we'd see on average 8 UREs for every 10 rebuilds, or around 2 orders of magnitude less failure rate compared to the RAID6 example.


Your logic is sadly incorrect.

During a RAID6 rebuild, a URE is non-critical, as the array can recover the data with one lost disk and a URE on any other disk during the stripe rebuild.

The only critical error would be a URE on two disks on the same stripe; 80 UREs during a 990TB rebuild have an amazingly low chance of hitting the same stripe on two separate disks.

In the case of the RAID10, you get 8 UREs over 10 rebuilds, which aren't recoverable unless you mirror across 3 disks. So you'll corrupt data.

edit: URE of 10^14 is what most vendors specify for consumer harddrives, 10^16 is closer to what people encounter in the real world but 10^14 is considered the worst case URE rate.


Good point about the URE on a RAID6, but that still doesn't make it superior. The strain of a rebuild has been known to kill many arrays, both RAID5 and RAID6.

A URE does not have to corrupt data if you use a proper filesystem with checksumming such as ZFS.

When a disk fails, a RAID10 is simply in a far better position, as it only has to read a single disk, and it doesn't have any complicated striping to worry about. Just clone a disk.


> A URE does not have to corrupt data if you use a proper filesystem with checksumming such as ZFS.

No, but afaik there is no way to recover data once ZFS has declared it corrupted (i.e., no parity).

> The strain of a rebuild has been known to kill many arrays, both RAID5 and RAID6.

I haven't actually encountered that yet. Despite that, a RAID6 can lose a disk, so as long as you don't encounter further UREs after losing another disk, it's fine.

If you're worried about that, go for RAIDZ3 or equivalent. With something like SnapRAID you can even have the equivalent of a RAIDZ6, losing 6 disks without losing data. The chances of that happening are relatively low.

> When a disk fails, a RAID10 is simply in a far better position as it only has to read a single disk

A RAID 10 is in no position to recover from URE's once a disk has failed unless you reduce your space efficiency to 33%.

I personally favor not corrupting data over rebuild speeds.

Striping might be complicated but that doesn't make it worse.

It might be acceptable to lose a music file, but once the family image collection gets corrupted or even lost on ZFS because a disk in a RAID1 encountered a URE, it's personal.

I'd rather live with the thought that even if a disk has a URE, the others can cover for it. Even during a rebuild.


While it's true you can't expand a vdev, you can always add another vdev to a pool at any redundancy level you desire to expand capacity. For example, you could add a trio of drives as a raidz vdev to a pool with an existing mirror vdev (`zpool add mypool raidz dev1 dev2 dev3`). However, the drawback is that the expanded pool won't be "balanced" unless its datasets are rewritten.


Home users would rather just add a single drive instead of having to worry about adding entire RAID setups.


Exactly that. I can essentially add/remove drives on a whim (for any reasonable definition of "on a whim") using btrfs and run a rebalance command afterward and I'm done.

That, of course, is really not a concern for enterprise use cases.


What I'd like to know is, can this issue be resolved in a future version of ZFS or is it too ingrained into the design of ZFS?


For years the fix ("block pointer rewrite" feature) was promised as coming eventually, but that effort was abandoned. BTRFS will reach ZFS levels of stability before ZFS reaches BTRFS levels of flexibility.


It would be expensive -- you'd have to "rebalance" the data by copying it to the new volume / away from the old volume, which would take hours or days.


People who want this feature are mostly frugal. They want it so they can upgrade by just buying a single new disk. That is a rare enough use case that the performance penalty would be worth it.


Frugal enough to shut off their computer when they're not using it (to "save power"), so the rebalancing won't have time to complete in the background?


Isn't the fact that Drobo is still around and doing well a testament to people being fine with this tradeoff?


Correct me if I'm wrong, but I can't add vdevs to a RAIDZ (RAID5/6ish) system in ZFS, which are probably the most common configurations for home NAS systems.

I'd love to run ZFS, but I can't because I need to be able to add more drives as I buy them.


I've been reading more into it and it seems like back in 2008 they came up with an algorithm to do this: https://blogs.oracle.com/ahl/expand-o-matic-raid-z

But it doesn't look like there's been any movement on an implementation and it seems like it's high effort and mostly home users who want this, not enterprises who might be willing to pay for.

Ah well, guess I'll stick with mdraid for now.


Correct, the number of devices in a raidz1/2/3 vdev cannot change. However, you can replace all of the drives one at a time with higher-capacity drives and most ZFS installs will grow your vdev.
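
The replace-in-place upgrade is roughly the following (device names are placeholders):

  zpool set autoexpand=on tank                 # let the vdev grow once every member is bigger
  zpool replace tank old-disk new-bigger-disk
  # wait for the resilver to finish (zpool status), then repeat for each remaining drive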

You could also add another raidz vdev if you have the space in the server, eventually they should level out depending on workload.

"Best" solution, read most expensive, would be to copy the data over to a new server you've configured for the new load/capacity.


Adding 3 drives at a time is expensive and a non-option for a home NAS. As is "just copy it to a new server".

Both have high investment costs for something a plain mdRAID, snapRAID or LVM RAID can achieve far simpler with better results.


I add two at a time without much problem, mind you I have a rather extreme setup with an external 12-bay enclosure but I feel most people building a NAS instead of buying an off-the-shelf system from Synology/Drobo aren't looking to cut costs or corners.

A 4TB Seagate IronWolf drive costs $129 off Amazon, buying two to add as a new mirrored vdev to my TrueNAS box isn't outrageous.


You're then limited to only mirroring; you can't do RAID6 or similar on a massive array where you're only giving up 2 of 10 drives' worth of space but can survive a 2-disk failure. Seems like a lot of wasted disks when the only thing I'm dependent on the NAS for is not losing my TV collection. Right now I know that if I go out and buy 2 more 4TB disks I get 8TB more in my NAS, at the price of an only slightly increased risk of having more than 2 drives fail at once. That's probably my favorite feature of RAID6 for a home NAS.

I actually went custom because those off-the-shelf boxes are either very expensive or have weak CPUs so can't be used for video transcoding very well - it's cheaper to build it yourself, much cheaper if you already have old hardware to dedicate to the task.


Different use cases I suppose; my FreeNAS box stores my video collection and all the usual stuff, but I've also got all the VMs that run my home network stored on it - performance + resilver times are a lot more important to me than storage efficiency.


And your network-running VMs have high disk IO requirements?


All of my virtual disks are hosted off my FreeNAS, I've got a direct 10GbE link between it and my oVirt host - raw throughput isn't so much the issue most of the time as IOPS are, I've got a local gitlab instance, OpenShift, some PostgreSQL databases, etc. and they like to hammer the crap out of my storage when in use.

Having an L2ARC helps out quite a bit, but only having 32GB of memory and wanting to keep most of it for the L1ARC means I still hit my spinning disks regularly (and mirrored vdev's help read IOPS tremendously in this case).


Reducing the cost of a custom system isn't irrational.

My own personal budget is very limited, buying two IronWolf HDDs in this case is easily a good chunk of my monthly income. If I can build a NAS that can expand with single drives as needed, it's more cost effective for me.

And I imagine a lot of others have the same problem.

In the end, buying 2 drives when a single drive could have solved the problem equally well (expanding your space by 4TB) is wasting money. Period.


This might be somewhat off topic but I'm desperate. I've been looking for a way to store files:

- Using parity rather than mirroring. I'm happy to deal with some loss of IOPS in exchange for extra usable storage.

- That deals with bitrot.

- That I can migrate to without somehow moving all of my files somewhere first (i.e. supports addition/removal of disks).

- Is stable (doesn't frequently crash or lose data)

- Is free or has transparent pricing (not "Contact Sales").

- Ideally, supports arbitrary stripe width (i.e. 2 blocks data + 1 block parity on a 6 disk array)

Unfortunately it doesn't appear that a solution for this exists:

- ZFS doesn't support addition of disks unless you're happy to put a RAID0 on top of your RAID5/6 and it doesn't support removal of disks at all when parity is involved. It is possible to migrate by putting giant sparse files on the existing storage, filling the filesystem, removing a sparse file, removing a disk from the original FS and "replacing" the sparse file with the actual disk but this is somewhat risky.

- BTRFS has critical bugs and has been unstable even with my RAID1 filesystem.

- Ceph mostly works but I always seem to run into bugs that nobody else sees.

- I couldn't even figure out how to get GlusterFS to create a volume.

- MDADM/hardware RAID don't deal with bitrot.

- Minio has hard coded N/2 data N/2 parity erasure coding, which destroys IOPS and drastically reduces capacity in exchange for an obscene level of resiliency I don't need.

- FlexRAID either isn't realtime or doesn't deal with bitrot depending which version you choose.

- Windows storage spaces are slow as a dog (4 disks = 25MB/s write).

- QuoByte, the successor to XtreemFS has erasure coding but has "Contact Us" pricing and trial.

- Openstack Swift is complex as hell.

- BcacheFS seems extremely promising but it's still in development and EC isn't available yet.

I'm currently down to fixing bugs in Ceph, modifying Minio, evaluating Tahoe-LAFS and EMC ScaleIO or building my own solution.


You can probably achieve what you're looking for by stacking a few filesystems. For example, you could create a separate ZFS pool/vdev with a single full-disk zvol on each disk. Then use mdadm to create a RAID array of the zvols. Then put ext4 (or whatever) on the mdadm array.
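
A rough sketch of that stacking, with made-up disk names and zvol sizes (ZFS provides the checksumming, mdadm the parity, ext4 the filesystem):

  for d in sda sdb sdc sdd; do
      zpool create "p_$d" "/dev/$d"      # one single-disk pool per drive
      zfs create -V 3500G "p_$d/vol"     # one (almost) full-disk zvol per pool
  done
  mdadm --create /dev/md0 --level=6 --raid-devices=4 \
      /dev/zvol/p_sda/vol /dev/zvol/p_sdb/vol /dev/zvol/p_sdc/vol /dev/zvol/p_sdd/vol
  mkfs.ext4 /dev/md0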

I've done something similar for the purpose of getting FDE with ZFS in linux. It can be a little finicky, but it's definitely workable.

One ZFS-specific caveat (which may conflict with your desire to get high storage efficiency): you may need to prevent your ZFS pools from filling up too much [1]. You can either enable discard/TRIM on the whole stack, so the top-level FS (e.g. ext4) can let ZFS know when a block is actually free, or alternatively just limit your zvols to 85% (for example) of their respective pools. The latter is my preference, because there was originally a bug with discard in ZFS and it's not immediately clear if it's totally fixed (although my fstrim tests seemed to work out fine).

[1] https://www.reddit.com/r/zfs/comments/3vtur4/what_exactly_ha...


Hmm, that actually sounds workable. I could even format the mdadm device as ZFS too if I really wanted. I am somewhat worried about performance, have you had any issues with that?


I haven't had any issues with performance, but then again my requirement was just "reasonable performance".

I ran some quick benchmarks (data below). Obviously this is far from rigorous, but maybe it'll be useful. In previous tests I found that volblocksize=128K was optimal for my stack -- which is why the last benchmarks use that setting.

Every additional ZFS filesystem in the stack may reduce storage efficiency (minimum free space requirements [1]; metadata & checksum overhead [2][3]) -- that's why I used ext4 as the top layer instead of another ZFS.

[1] (as mentioned before) https://www.reddit.com/r/zfs/comments/3vtur4/what_exactly_ha...

[2] https://news.ycombinator.com/item?id=14756360

[3] https://forums.freenas.org/index.php?threads/what-is-the-exa...

  Test setup:
   debian stable
   kernel 4.9.0-3-amd64
   zfs 0.6.5.9-5
   ZFS "pool": mirror with 2x 7200rpm drives

  Benchmark command:
   for i in `seq 1 10`; do sync; dd if=/dev/zero of=DEST bs=1M count=1024 conv=fdatasync; done

  zfs mirror -> dataset
   Data (MB/s): 125,115,104,135,148,170,135,151,118,119
   Mean (MB/s): 132.0
   Std.dev.: 19.9

  zfs mirror -> zvol (volblocksize=8K [default])
   Data (MB/s): 150,115,127,125,122,118,105,118,124,128
   Mean (MB/s): 123.2
   Std.dev.: 11.6

  zfs mirror -> zvol (volblocksize=128K)
   Data (MB/s): 68.5,112,115,114,94.3,85.1,83.1,98.4,120,108
   Mean (MB/s): 99.8
   Std.dev.: 16.9

  zfs mirror -> zvol (volblocksize=128K) -> luks -> ext4  (my stack)
   Data (MB/s): 130,94.4,109,139,138,125,94.9,124,134,133
   Mean (MB/s): 122.1
   Std.dev.: 16.8
edit: formatting


Can you please elaborate on the advantages of such a configuration?


The advantage is that you can get all the features you want, even though they aren't all available in one filesystem.

In my case, I wanted a reliable filesystem, RAID1/mirror support, block-level checksumming, and full-disk encryption. No filesystem provides these on linux right now. My solution was therefore to use ZFS to provide a mirrored, checksummed, reliable volume -- onto which I put a standard LUKS-encrypted ext4 filesystem. In the past I had tried the opposite (LUKS on the bare drives, then a ZFS mirror of the 2 decrypted volumes), but it was kinda annoying to manage, and I don't really need the other ZFS features (like snapshotting).

In the grandparent post, the requirement was the ability to expand the RAID volume without rebuilding (something that ZFS doesn't offer), plus checksumming and reliability (which ZFS does offer). So one option would be to use mdadm to manage the RAID array, and then put ZFS on the resultant volume in order to get checksumming.
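
That direction might look roughly like this (a sketch with placeholder devices; note that a single-device pool on top of md can detect corruption via checksums, but can only self-heal data if you set copies=2, since ZFS itself has no second copy to repair from):

  # mdadm owns the RAID layer, so it can be grown later
  mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[abcd]
  zpool create tank /dev/md0

  # later: add a disk, grow the array, then let the pool use the new space
  mdadm --add /dev/md0 /dev/sde
  mdadm --grow /dev/md0 --raid-devices=5
  zpool online -e tank /dev/md0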

The disadvantages are: extra complexity; extra overhead; more potential points of failure; more management hassle; etc. As soon as encryption for ZFSonlinux is stable, I'll be very happy to drop this filesystem stacking in favor of that!


There kinda isn't such a thing at this point. I sure wish there was! And this is the point behind the article: Wouldn't it be awesome if ZFS had continued developing as the super filesystem it looked like last decade? It would do all this and more! Frankly if ZFS supported re-balancing ("re-RAID-ing") and hybrid pools, it would be pretty much everything we all need. I do hope ZFS or Btrfs or something gets there.


It's so close though. BTRFS and Ceph both have what I want, they're just unstable. The BcacheFS dev has told me he's close but not quite there yet.

Hopefully one of the other solutions will work, otherwise I'll just have to build it.


bcachefs really does seem as if it will be our holy grail, we have very similar needs.

They have a patreon if you wish to contribute that way.


I already contribute and help out with testing and the occasional docs on IRC :)


XFS has gained (some) checksumming and CoW support.

At this rate XFS will end up evolving to add all the features btrfs promised before the latter makes them stable.


> ZFS doesn't support addition of disks unless you're happy to put a RAID0 on top of your RAID5/6 and it doesn't support removal of disks at all when parity is involved. It is possible to migrate by putting giant sparse files on the existing storage, filling the filesystem, removing a sparse file, removing a disk from the original FS and "replacing" the sparse file with the actual disk but this is somewhat risky.

It may be "RAID0 on top of your RAID5/6" but it's all integrated and works smoothly. What I do is have two raidz2 vdevs of 4 disks each, one twice the size of the other, and alternate which one I upgrade (i.e. I started with 4x250gb disks and 4x500gb disks, a few years later replaced the 250gb ones with 1tb ones, then the 500gb ones with 2tb ones, and most recently the 1tb ones with 4tb ones). But yeah I am now stuck with at least 8 disks and if I wanted to migrate off them I'd have to do so all in one go.


> It may be "RAID0 on top of your RAID5/6" but it's all integrated and works smoothly.

I know that it does but I don't like the way you lose your entire pool if a single vdev fails. I'd have less of a problem with it if a pool could distribute data such that a vdev failure results in the loss of only what was on that vdev.


Shrug. I figure losing a random half of my files is pretty much as bad as losing all of them (if I wanted to make an explicit split into two halves I could use two separate pools).


Re: contact sales, I totally hear what you're saying. It's not always a bad idea to talk to a sales person, and a good one can steer you towards the options you'd actually want. But yeah, I generally don't want to talk to a salesperson either.


Yes, but would it kill them to just put pricing there? Nowadays you can get the list price of sending a rocket into space [1], but not how much some software is going to cost you.

[1] http://www.spacex.com/about/capabilities in case you need one


Have you seen Quantcast File System? http://quantcast.github.io/qfs/


I hadn't and it looks good but it'll only allow 3 recovery stripes or none, which isn't optimal for me.


Have you considered snapraid? It's offline though, but seems to tick all your boxes otherwise.


Yes, I've looked into SnapRAID but I'm not a fan of offline and IIRC I didn't have enough memory to run it last time I tried it out.


For Storage Spaces, try fixed size; it's faster, and you can enlarge the volumes later.


I was using fixed size, it didn't help. Read speeds were okay but write always capped out at 25MB/s.


Illumos has a way to expand pools, FYI. IDK if that's in OpenZFS yet.

It works thusly: ZFS creates a vdev inside the new larger vdev, then moves all the data from the old vdev to the new vdev, then when all these moves are done the nested vdevs are enlarged.

What should originally have happened is this: ZFS should have been closer to a pure CAS FS. I.e., physical block addresses should never have been part of the ZFS Merkle hash tree, thus allowing physical addresses to change without having to rewrite every block from the root down.

Now, the question then becomes "how do you get the physical address of a block given just its hash?". And the answer is simple: you store the physical addresses near the logical (CAS) block pointers, and you scribble over those if you move a block. To move a block you'd first write a new copy at the new location, then overwrite the previous "cached" address. This would require some machinery to recover from failures to overwrite cached addresses: a table of in-progress moves, and even a forwarding entry format to write into the moved block's old location. A forwarding entry format would have a checksum, naturally, and would link back into the in-progress-move / move-history table.

During a move (e.g., after a crash during a move) one can recover in several ways: you can go use the in-progress-moves table as journal to replay, or you can simply deref block addresses as usual and on checksum mismatch check if you read a forwarding entry or else check the in-progress-moves table.

For example, an indirect block should be not an array of zfs_blkptr_t but two arrays, one of logical block pointers (just a checksum and misc metadata), and one of physical locations corresponding to blocks referenced by the first array entries. When computing the checksum of an indirect block, only the array of logical block pointers would be checksummed, thus the Merkle hash tree would never bind physical addresses. The same would apply to znodes, since they contain some block pointers, which would then have three parts: non-blockpointer metadata, an array of logical block pointers, and an array of physical block pointers.

The main issue with such a design now is that it's much too hard to retrofit it into ZFS. It would have to be a new filesystem.


> Illumos has a way to expand pools ... ZFS creates a vdev inside the new larger vdev

Huh?

> IDK if that's in OpenZFS yet.

The openzfs tree (on github) is virtually identical to illumos-gate (on github).

> physical block addresses should never have been part of the ZFS Merkle hash tree, thus allowing physical addresses to change without having to rewrite every block from the root down.

mahrens deals with this (and block pointer rewriting) here:

https://www.youtube.com/watch?v=G2vIdPmsnTI#t=44m53s

Even with SSDs, IOPS are precious. On rotating media, burning track-to-track seeks on reading and updating a large hash table is a bad plan (cf. the deduplication table).


Btrfs might just become “the ZFS of Linux” but development has faltered lately, with a scary data loss bug derailing RAID 5 and 6 last year and not much heard since.

It was not a data loss bug per se. It was Btrfs corrupting parity during a scrub when encountering already-corrupted (non-Btrfs-caused) data: a data strip is corrupt somehow, a scrub is started, Btrfs detects the corrupt data and fixes it through reconstruction with good parity, but then sometimes computes a new, wrong parity strip and writes it to disk. It's a bad bug, but you're still definitely better off than you were with corrupt data. Also, this bug is fixed in kernel 4.12.

https://lkml.org/lkml/2017/5/9/510

Update, minor quibbles:

> lacking in Btrfs is support for flash

Btrfs has such support and optimizations for flash. The gotcha, though, if you keep up with Btrfs development, is that there have been changes in FTL behavior, and it's an open question whether these optimizations are effective for today's flash, including NVMe. As for hybrid storage, that's the realm of bcache and dm-cache (managed by LVM), which should work with Btrfs as with any other Linux file system.

> ReFS uses B+ trees (similar to Btrfs)

XFS uses B+ trees; Btrfs uses B-trees.


The thing I'm struggling with is 4K sector support. It's horribly inefficient with ZFS. RAIDZ2 wastes a ton of space when the pool is made with ashift=12. And everybody knows 512e on AF disks is horribly slow... so ZFS is either very slow or wastes 10% of total space. Or both (ZVOL :D)

According to some bug reports, nobody has touched this since 2011...


Can you elaborate on how ZFS wastes 10% of total space?

I recently set up a ZFS volume using 12x4TB drives using RAID-Z2, so I expected 40TB of usable space, or ~36.3TiB. However, I only see 32TiB of usable space on the volume. I always wondered why that was so, never figured it out..


There's a ton of sites if you google ashift=12; http://louwrentius.com/zfs-performance-and-capacity-impact-o... or https://github.com/zfsonlinux/zfs/issues/548 for instance.

Basically, ashift=12 increases the minimum ZFS block size to 4K. Metadata uses full blocks that would be 512 bytes with ashift=9 but are now 4K (due to ashift=12), so each block that is not filled entirely wastes at least 3.5K more than it would with normal 512-byte blocks.
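
If you want to see what your pool is actually using, a quick sketch (pool name is a placeholder):

  # ashift per vdev: 9 means 512-byte sectors, 12 means 4K sectors
  zdb -C tank | grep ashift

  # compare raw pool capacity/allocation with what the dataset layer reports
  zpool list tank
  zfs list -o name,used,avail,refer -r tank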


For new data, you can get back most of that lost space by setting the recordsize to 1M. You won't necessarily see the improvement in your df and zfs list commands, as they assume smaller max recordsize, but your overhead should drop significantly.
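
For example (a sketch; the dataset name is a placeholder, and the larger recordsize only applies to data written after the change):

  # recordsize above 128K requires the large_blocks pool feature
  zfs set recordsize=1M tank/data
  zfs get recordsize tank/data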


I didn't even consider this. Thanks for the explanation! Makes tons of sense!

Incidentally, many modern filesystems (including NTFS) store very small files in the FAT rather than taking up a whole block for this very reason!


cringes at "in the FAT" instead of "MFT"....

ZFS has this same feature, however, as long as feature@embedded_data=enabled ;)


I guess I'm an old storage guy. I call it the FAT on everything! :-)


Yeah, different filesystems call things differently: FAT, MFT, inodes.

Best is to just call it metadata :)


Thanks!


He is talking about "best level of data protection in a small office/home office (SOHO) environment".

Trying to do this with FS features is misguided.

You need to have backups, and have regular practice in restoring from backups.

Some organizations need fancy filesystems in addition to backups, because they want to have high availability that will bridge storage failures. But that has a high cost in complexity, you should only consider it if you have IT/sysadmin staff and the risk management says it's worth the investment in cognitive opportunity cost, IT infrastructure complexity and time spent.


Negative, you won't know you _need_ to use the backups without these FS features, and by the time you finally do, you could have rotated through them.


The article didn't mention backups at all. If a SOHO environment can afford only either backups or a ZFS storage system, choosing backups leaves much less residual risk on the table.

Yes, there is still a risk that corrupted data may end up in backups, but that's true even with ZFS. Ideally you want end-to-end integrity checking and verification, that means application layer and should also be done for backups. But like with all risk management, there are diminishing returns...


That is the most contrived nonsense I've heard in a long time. You can't not afford to use ZFS; it works fine on a single disk, and at least you'd know your data had mutated.


Does your strong disagreement mean you think for most SOHO environments it's better to have ZFS without backups than backups without ZFS?


Knowing your data has rotted doesn't bring it back.


But it does allow you to treat it with suspicion and human judgement. If it's an album, you go re-rip or download it. If it's medical data, you don't use it.


And if it's that one holiday album of images, it's down the toilet forever.


The filesystem as basic infrastructure has to be robust and fuss free. The complex stuff is going to be built on top of that.

After years of btrfs I realized that while all the features around snapshotting, send/receive, etc. are great, the cost in performance and other issues is too high.

And using plain old ext4 is more often than not the best compromise, so you can just forget about the FS and focus on higher layers.


I came to say similar things. After having had problems with ZFS (on Solaris) and btrfs, but never on ext4 or other "simple" filesystems, I am wondering whether I need and want all that complexity in my filesystem. And no, I don't.

Snapshots are nice, sure. But I'd rather do that on top of my filesystem (they're called backups) and leave the filesystem lean, simple, fast AND reliable. Every filesystem may have its own problems, but I feel that the "attack surface" of my filesystem should be as small as possible. Would I rather have a bug in my filesystem's snapshot implementation, or in my backup tool?

IMO, the FS should be rockstable and lean and not "cool and fancy". I can do fancy on top of rockstable.


The nice thing about btrfs vs. ZFS is that you can just use it like a normal filesystem, ignoring all the advanced features, and still get the benefit of checksumming (plus duplicated metadata by default on spinning disks) and compression.


The problem with btrfs and CoW in general is poor performance for databases and overall slower performance than ext4, in some cases significantly so. ZFS has high memory requirements.

If your use case mainly revolves around the benefits of snapshots then it definitely makes sense.


How do you do snapshots on ext4? Don't large datasets take forever?


I don't think ext4 has native snapshots. You can use lvm-thin for snapshots. Note that lvm-thin is different from regular lvm snapshots, which are known to be inefficient. RHEL has decent support for this feature, and will even let you set it up in the GUI during installation.
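
A minimal sketch of lvm-thin, assuming an existing volume group "vg0" (all names and sizes are made up):

  # create a thin pool inside the VG, then a thin volume inside the pool
  lvcreate --type thin-pool -L 100G -n thinpool vg0
  lvcreate --thin -V 50G -n data vg0/thinpool
  mkfs.ext4 /dev/vg0/data

  # thin snapshots are copy-on-write and need no preallocated size
  lvcreate --snapshot --name data-snap vg0/data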


That snapshots the entire volume, though, right? I don't want to waste time on snapshotting my movies when all I want is my family photos...

It seems that the GP's point that the fs should just store files is rather debunked, as snapshots are an extremely useful feature for a filesystem.


Yes, snapshots are for volumes. There's also a yum plugin which will do before-and-after snapshots. RHEL and SUSE have this feature, which works with lvm-thin as well as btrfs.


lvm-thin still snapshots the entire volume, but in a more copy-on-write-like way.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...


On my current laptop, I'm seeing a 20% reduction in disk usage relative to the filesystem size because of ZFS's built-in compression.


20%? That's weak :P I have 3.32x refcompressratio on my /home partition in my dev VM (using gzip-7 here).


You might look into using xz (LZMA) or zstd in place of gzip. Gzip offers pretty poor compression-per-CPU-time performance compared to these newer options.

https://clearlinux.org/blogs/linux-os-data-compression-optio...


ZFS doesn't support either yet: http://open-zfs.org/wiki/Performance_tuning

Though it looks like zstd is coming, which is exciting: https://reviews.freebsd.org/D11124


ZFS doesn't support either of those yet, unfortunately. I'd love for zstd to be available given its benefits in speed and compression ratio.


https://reviews.freebsd.org/D11124 :)

Or use lz4 while you wait.
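
For example (a sketch; the dataset name is a placeholder):

  zfs set compression=lz4 tank/home
  # see how much you're actually saving
  zfs get compression,compressratio,refcompressratio tank/home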


Another future alternative, TFS: https://github.com/redox-os/tfs


> Many remain skeptical of deduplication, which hogs expensive RAM in the best-case scenario. And I do mean expensive: Pretty much every ZFS FAQ flatly declares that ECC RAM is a must-have and 8 GB is the bare minimum. In my own experience with FreeNAS, 32 GB is a nice amount for an active small ZFS server, and this costs $200-$300 even at today’s prices.

I use nas4free with much less ram…


The massive amounts of RAM recommendation is if you need to do deduplication. Are you doing that? If not then you don't need a lot of RAM.

Between the low cost of storage, and alternative solutions for deduplicating data I personally don't use the built-in deduplication functionality of ZFS for my zpools. Might come down to what sorts of data you are storing, though.
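
If you're curious whether dedup would even pay off, `zdb -S` simulates building the dedup table for an existing pool and prints a histogram with the expected dedup ratio, without changing anything (pool name is a placeholder):

  zdb -S tank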


Yup. This. Now that I have 32 GB of RAM in my FreeNAS box I decided I really didn't need dedupe after all. I just don't have that much duplicated data, and I've got 60 TB of HDDs in the box. So I use the RAM for VMs instead!


Yeah, dedup is mostly useful for massively multi-user setups: mail, file sharing, etc. For small setups you can do a `fdupes` pass once a day and fix it yourself, I suppose.


I remember that time I turned on dedup on our 45drive box with 32GB of ram and 140TB~ of data..


...and nothing happened because it does not retroactively dedup existing data?

Over time it would become horrendously slow, I agree, since you have too little RAM by a factor of at least 8.


And this is why ixSystems disables access to the dedup switch on their TrueNAS systems without contacting support to enable it.

I've got 30TB of small files stored on our TrueNAS system at work, there's no benefit to enabling it but if someone decided to toggle it we'd quickly learn there isn't enough RAM in the world to handle billions of 1-16KB files....


Does anybody use ZFS as a replacement for database backup/restore in a test environment? I'm not sure, but it seems possible to use ZFS snapshots to quickly restore a previous database state. Note: it's just a question, I'm not advising anyone to try this.


Well, not in a test environment, but for production updates of some NoSQL stuff, sure. Snapshot the datasets, clone them over into a new rw-dataset, run the upgrade on the clone. Upgrade went wrong and corrupted your files? Destroy the clone and make a new one, then run it again (after fixing whatever caused the corruption obviously).

Want to test the upgrade in your production environment beforehand? Well, make a clone a couple days early.

And once all works out, a few days later you promote the clone and destroy the old datasets. Need a rollback? Well, just start the old application instance on the old datasets, nothing touched them.

Doesn't work for all database types, especially if you have no possibility to replay new data into the rollback. But if your system allows it, it is really comfortable.
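
A rough sketch of that workflow with made-up dataset names (not the poster's actual setup):

  # before the upgrade: snapshot, then clone into a writable dataset
  zfs snapshot tank/db@pre-upgrade
  zfs clone tank/db@pre-upgrade tank/db-upgrade

  # run the upgrade against the clone; if it corrupts things, throw it away
  zfs destroy tank/db-upgrade

  # if it works out, promote the clone so it no longer depends on the
  # origin snapshot, then retire the old dataset once you're confident
  zfs promote tank/db-upgrade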


Thanks, very interesting.


File system snapshots of databases are not necessarily consistent, and can not always be restored like that.


Atomic snapshots like ZFS's are always consistent for Postgres. I guess other databases with a similar write-ahead log can be snapshotted as well?


> Atomic snapshots like ZFS's are always consistent for Postgres.

As long as you make sure to only use one filesystem, i.e. you don't place pg_xlog or some tablespaces on a different filesystem. You can get very weird corruption in such cases :)


With ZFS you can also do atomic snapshots of multiple filesystems. https://serverfault.com/questions/608223/is-zfs-snapshot-r-o...
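
For example (a sketch; dataset names are placeholders), a recursive snapshot of a dataset and all its descendants is taken as a single atomic transaction:

  zfs snapshot -r tank/pgdata@before-backup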


Interesting. I played with ZFS years ago on OpenSolaris; I don't remember atomic snapshots being there. I definitely need to dive into ZFS again.


But if we "stop" this database/schema and make a ZFS snapshot (assuming the "partition" holds only this database's files), it may work. It could be useful for testing purposes, I guess.


A logical issue that I have with the existence of such filesystems as ZFS and BTRFS is that the problem of "bit rot" should be addressed at a lower abstraction level - hardware or the driver - rather than at the level that should be primarily responsible for user-visible organization of files, directories, etc.


How?

Bitrot occurs because the lower-level hardware fails. When you put a hard drive into storage for, say, 5 years, the bits may change. Even if a hard drive remains in constant use for 5 years, if the files or directories aren't checked and double-checked regularly, the error-correction codes may fail over time.

It's a fundamentally different problem from hard drives that are being used constantly, as, say, swap.

Hard Drives typically include Hamming codes or ECC bits to address typical corruption issues.

-------------

The fundamental principle at hand here is as follows: to ensure integrity of files, you need to regularly check file data. Only the Filesystem would know which files were recently checked.


Couldn't the drive's firmware or the driver do the same just as well (except on physical records instead of files)?


First off, real hard drives have "SMART" data that detects (and automatically corrects) simple errors. So remember, hard drives ALREADY have a large degree of error correction built in. It's just not enough for serious data-storage purposes.

The "Bit-Rot" scenario is particularly harmful to RAID5 (Minimum 3-hard drives. Two contain data, one contains "parity" that can fix any errors on the other hard drives. Then the parity is structured to be striped equally across the three drives). Modern RAID drivers can do this rather easily.

The problem with "bit rot" is that a RAID5 array will not rebuild itself until it detects an error. If you're reading files along and all of a sudden the hard drive detects an error, no problem (in the typical case): just rebuild the data from the parity.

However, "Bit Rot" means that the parity bits (on the 3rd backup hard drive) have ALSO rotted away.

----------

The only way to fix this "bit rot" error is to regularly read through your data and check for it. No hard drive is going to silently spin and hamper the performance of the system for self-verification purposes... but a filesystem / operating system can schedule these "scrubs" to occur during periods of low I/O.

Which is how ZFS and Windows' ReFS work. When your computer is idle, the OS checks for bitrot. When the computer starts to work again, it pauses the "low priority" bitrot checks and serves the data.

-------

ZFS doesn't quite work like Windows' ReFS: ZFS simply checks for bit rot whenever a file is accessed. Every time. There is also a "zpool scrub" command (which you can put into a cron job) to read every block and therefore check for bitrot.
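
A minimal sketch of scheduling a monthly scrub via cron (pool name is a placeholder; a running scrub can be stopped with `zpool scrub -s`):

  # /etc/cron.d/zfs-scrub -- scrub "tank" at 03:00 on the 1st of every month
  0 3 1 * * root /sbin/zpool scrub tank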


"enterprise" storage arrays do this for you.

You have a glob of storage that is presented as a block device, but underneath is actually a real, no-fooling filesystem with FEC, snapshots, and all sorts of other goodies.


You must not read HW errata or device drivers.

It's been posited that the main push for ZFS was perceived bugs in UFS that happened to be LSI firmware bugs once they had the capabilities of ZFS to detect them.


You can get a disaggregated version of what ZFS does - you can have filesystem -> LVM -> MD -> devices, where the MD layer does the checksumming at a sequence-of-bits level. It tends to work less well than ZFS though; snapshots are a lot more efficient if the low level knows which bits are which files at the high level, scrubbing can be done more efficiently if you know which files are in use and which aren't, if the parity checking wants to use idle bandwidth then it needs to know about user-requested access. Perhaps most importantly, data writes need to be async for performance, but filesystem metadata needs to be written in a way that will maintain the filesystem's invariants, so the question of when bits hit the physical disk ends up being intimately entangled with the filesystem's internals. It's really difficult to express an interface between the filesystem layer and the parity layer that lets it do the right thing.


If you have a layer that knows there's another copy of the data on a different disk, that layer should be checking for bit-rot even if the individual drives have their own error correction.


I have to wonder what's going to happen once those storage-level, random-access non-volatile memory technologies finally make it out of R&D and into the market.

I mean, as it is now it seems like we have a hard enough time dealing with comparatively simple hybrid memory systems.


I am really excited for bcachefs. It is also the only FS that has support for ChaCha20-Poly1305 encryption.


I wonder if a non-hardware-accelerated encryption algorithm is good enough for an FS that also has checksums. The CPU is already busy with checksumming, so doesn't this considerably slow down writes?


Both Chacha20 and Poly1305 are optimized (by design) for running on general purpose CPUs. AES-GCM using AES-NI instructions is still faster, but not that much [0].

[0] https://community.qualys.com/thread/16005


Somewhat counterintuitively, Chacha-Poly is as fast as AES-NI. You could probably go even faster using QuickAssist but most people don't have that.


Depends on your CPU. Desktop/server CPUs are plenty fast enough, but a home NAS might slow down.

