Examining Btrfs (arstechnica.com)
102 points by mrlonglong on Sept 24, 2021 | 107 comments



>Moving beyond the question of individual disk reliability, btrfs-raid1 can only tolerate a single disk failure, no matter how large the total array is. The remaining copies of the blocks that were on a lost disk are distributed throughout the entire array—so losing any second disk loses you the array along with it. (This is in contrast to RAID10 arrays, which can survive any number of disk failures as long as no two are from the same mirror pair.)

This feels a little disingenuous to me. This is a distinctly RAID-1 problem. This doesn't feel like a BTRFS problem. Everyone knows RAID-1 = 1 drive loss.

>I don't have anything specific to say about btrfs-raid0, other than the fact that it's raid zero. Any failure of any disk loses all data on the array. This is not a storage system, it's a virtual woodchipper. Avoid.

If the previous paragraph felt disingenuous, this one feels downright malicious. Not only does BTRFS RAID-0 have no differences from normal RAID-0, but the blanket recommendation to avoid it makes no sense. There are many use cases for RAID-0, but the author is acting like this is some kind of problem with BTRFS.


> This feels a little disingenuous to me. This is a distinctly RAID-1 problem. This doesn't feel like a BTRFS problem. Everyone knows RAID-1 = 1 drive loss.

The point he's making is that classic RAID-1 is pairs of disks, of which you can lose one of the two, whereas btrfs-raid1's (admittedly very neat) feature to bolt multiple disks together means you can end up with a situation where you can only lose one of N drives, which is significantly riskier.

> There are many use cases for RAID-0, but the author is acting like this is some kind of problem with BTRFS.

I read the "anything specific" as them being clear that it's not remotely btrfs specific and is, indeed, exactly a limitation of RAID-0.

Plus over the years I've definitely run into a lot more cases where people have misused RAID-0 and ended up with an unexpected virtual woodchipper than cases where they were actually right to use it (even though those do absolutely exist), which does have a tendency to colour one's experience. I think "Avoid" is the safest advice to give in a general-audience article; people who will genuinely get advantages out of RAID-0 in spite of the failure modes are likely to be able to figure that out themselves anyway.


In BTRFS, RAID1 simply means 2 copies on different devices. There is RAID1C3 and RAID1C4 if you want more redundancy/copies.

This whole critique boils down to poorly chosen naming and bad documentation. Both VERY valid critiques. But feature-wise, BTRFS delivers „classic“ RAID1 and more/better. It’s just hard to find and easy to misread.


The critique is more about the fact that it'll stripe across any two devices in the pool.

So in an 8+4+2+2 setup, you could run into serious trouble after only losing the two 2TB drives (which are likely the oldest and flakiest two in a bodged-together-from-leftovers style array).

I do agree that this is still strictly better than not being able to do it at all so long as you understand the risks, but being able to do RAID1 over 8+(4+2+2) would be even nicer for a bunch of uses and I've yet to figure out how to do that.


The btrfs allocation policy is to put data on the device with the most free space, after respecting requirements for redundancy. So if your drives are sized 8+4+2+2 and you're using the RAID1 profile, it will in fact operate as 8+(4+2+2), using only the 8 and 4TB devices until the 4TB device is half full, at which point it starts using the 2TB devices.

You can still end up in the situation where there's data that is duplicated across the two 2TB devices, if those are the two you started with and you added the larger devices later. But that can be fixed by doing a rebalance operation at any time after adding more devices. That's usually a good idea, though sometimes you might want to avoid a full rebalance so as not to put too much IO load on a single disproportionately large device.

(If you want a hard guarantee that no data will ever be mirrored across the two 2TB drives, use dm/md to concatenate them into a 4TB block device, and add that to the btrfs array.)
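
For illustration, the md-based concatenation trick looks roughly like this; device names and mount point are made up:

    # Concatenate the two 2TB drives into one 4TB linear md device
    mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/sdc /dev/sdd
    # Add it to the mounted btrfs array as a single member
    btrfs device add /dev/md0 /mnt
    # Optionally rebalance so existing chunks get spread onto the new member
    btrfs balance start /mnt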

I have a btrfs array currently consisting of a mix of 1TB, 2TB and 4TB devices; 10 drives with a nominal total capacity of 26TB. This is using the RAID1 profile for data, and has been using the RAID1c3 profile for metadata since that feature became available. The number of drives in this array has fluctuated up and down over the years and it has survived drive failures and the accidental removal of the wrong drive from the hot-swap bay, and hasn't lost data in that time. But the current allocation is a bit uneven, because I don't always rebalance after adding or removing devices.


I would argue btrfs does not deliver "classic" RAID1. Last I checked if you lose 1 disk in a 2 disk btrfs RAID1 setup and then reboot, it will be unable to mount the filesystem until you change the mount options so that it is set to 'degraded' mode. This is very different from other RAID1 setups (e.g. hardware raid controllers, mdadm, openzfs), and a big problem if you don't have lights out management or fast physical access to the machine.


I would argue that the 'degraded' stuff is a valid but different critique - and in fact is covered in a completely separate part of the article at some length.


I think we are in agreement. I was responding to the comment that stated "BTRFS delivers „classic“ RAID1 and more/better.", which is what I am disagreeing with. Requiring that mount options be changed whenever there is a drive failure (despite having sufficient redundancy) is definitely an anti-feature in my book.


There’s no need for changing mount options. If you want to allow mounting of degraded arrays, just put the degraded option there from the start.


That sounds to me like it was originally set up that way early in development because they wanted people to give immediate manual attention to a system before booting it in that state.

If btrfs is mature enough that it's "safe" to boot missing a disk now I think either the defaults or the documentation probably want changing to make that clearer.

Like, I get "oh just add this option" as a response but in this case the fact distros don't default to adding it and the docs don't say "sure, do that" somewhere prominent mean I'm allowed to be a bit worried about how safe it actually is.


Whether to include the degraded option by default is a policy choice that's far beyond the purview of filesystem developers, and not something that most distros can give a clear answer to, either. It boils down to a question of the end user's use cases and risk tolerance. But it seems pretty reasonable to state that a loss of redundancy should either be handled by the user, or by a piece of software sitting between the user and the filesystem itself and acting in accordance with the user's preferences. Silently continuing to operate but with less safety than the user originally requested is the kind of dangerous that should be an opt-in feature, not a default.

Moving the decision into the filesystem itself only makes sense if the filesystem is equipped to enact mitigating actions such as claiming a hot spare as the replacement device, notifying the user/sysadmin through whatever logging/reporting mechanism is actually monitored by a human, signalling applications like load balancers to stop relying on this particular machine if a healthy alternative is available, etc.

(There's also an implementation detail that can trip up users who are trying to live dangerously: you're not supposed to mount a degraded btrfs array as writable until you're prepared to fix the problem making it degraded—such as by providing the devices needed to restore redundancy, or converting it to not be a redundant array anymore.)
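
For reference, a recovery sequence after losing a device looks roughly like this (device names are made up):

    # Mount from a surviving member in degraded mode
    mount -o degraded /dev/sdb /mnt
    # Note the devid of the missing device
    btrfs filesystem show /mnt
    # Either replace the dead device with a new one, using that devid...
    btrfs replace start <missing-devid> /dev/sde /mnt
    # ...or drop the lost device and shrink the array instead
    btrfs device remove missing /mnt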


Only in small arrays. In larger ZFS RAID-1 systems you divide the disks into groups of drives - 5-7 is typical. If you have 10 drives, any one can fail without issues, but even with that one failure, a second can fail without losing data 50% of the time. This puts the odds more in your favor. If you are a typical single person, 5 disks is enough and so this doesn't matter at all. However, if you are a larger organization with several hundred disks, odds are you have more than one failed at a time on a regular basis, but the odds that more than one failed drive is from the same group are still low enough to not worry too much.

I personally have zfs-z2 with 6 disks in my NAS. Thus any 2 drives can fail. If I want to expand my NAS I would need 6 more disks, but then not only could any 2 fail, many combinations of up to 4 can fail without issues.

If you have a small number of drives this doesn't matter, but BTRFS-raid1 is not acceptable for large number of disks.


>This feels a little disingenuous to me. This is a distinctly RAID-1 problem. This doesn't feel like a BTRFS problem. Everyone knows RAID-1 = 1 drive loss.

That's simply not true. I can do a RAID-1 (mirror) with 3 drives on ZFS or mdadm and lose 2 drives. I can do it with 10 drives and lose 9.

https://www.thegeekdiary.com/how-to-add-a-3rd-disk-to-create...
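
For the curious, a three-way mirror (any two of the three drives can fail) is a one-liner on either stack; device names here are made up:

    # md three-way mirror
    mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
    # ZFS three-way mirror
    zpool create tank mirror /dev/sda /dev/sdb /dev/sdc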

>There are many use cases for RAID-0, but the author is acting like this is some kind of problem with BTRFS.

Sure, but for a novice the default advice needs to be: don't do that. And he's writing for folks with all sorts of levels of technical knowledge. Someone who knows nothing about RAID reading the article should be scared away unless they have a very, very good reason for doing RAID-0.


> I can do a RAID-1 (mirror) with 3 drives on ZFS or mdadm and lose 2 drives.

You can do this with BTRFS also; it's called RAID1C3 (for three copies). RAID1C4 is also an option with larger arrays.
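
As a rough sketch (device names and mount point made up; needs at least three devices and kernel/btrfs-progs 5.5 or newer), three copies look like this:

    # New filesystem with three copies of data and metadata
    mkfs.btrfs -d raid1c3 -m raid1c3 /dev/sdb /dev/sdc /dev/sdd
    # Or convert an existing mounted array in place
    btrfs balance start -dconvert=raid1c3 -mconvert=raid1c3 /mnt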

The issue is that if you have, say, four drives and standard RAID-1 (two copies) it doesn't divide the drives into mirrored pairs where you could lose one drive from each pair without any ill effect (like RAID-10), but rather just ensures that each piece of data is stored on any two different drives. With 4-drive RAID-10 if you lose two drives you have a 50% chance of losing the whole array, since both failing drives could be from the same pair. With BTRFS RAID1 in a four-drive array you have a high probability of losing both copies of some of the data no matter which two drives fail.

RAID-6 is strictly better than either, though, since you can lose any two drives and still access all your data.


One thing I always found curious is that Synology supports Btrfs.

Granted, they are not exactly what I'd call a company with a rock-solid track record, but they are still a rather large premium manufacturer in the NAS space.

It seems they have either managed to contain Btrfs' complexity and prevent users from doing the dangerous things, or their own hybrid raid was even worse, or they are just very risk friendly.


The most annoying issue I have with my Synology is that when I accidentally pulled out the power cord, it seems to have forgotten its configuration. Only Synology support knows the commands required to get it all up and running; they sorted it remotely, BUT I rebooted it and it promptly forgot its settings. They sorted it AGAIN, but said the only way to fix it permanently is to reformat and rebuild.

If it can be restored to normal functionality with a few commands, surely they should be able to save the configuration for subsequent reboots?

No data was lost, btw.


Synology's implementation of BTRFS is rock solid, in my experience. This guy [1] tried to break it and couldn't.

[1] https://daltondur.st/syno_btrfs_1/


I’ve been incredibly happy with my Synology products FWIW. BTRFS/SHR have been rock solid.


The hardware is very nice indeed, and the software is smooth as well.

What has irritated me in the past was that they sometimes took very long to fix vulnerabilities [1].

[1] https://www.cvedetails.com/vendor/11138/Synology.html


What is getting me is the EOL bits they have done with some of their HW. It is a NAS. I can see EOL on parts of the stuff they add. But the fileshare bits? That is really the only reason I bought the thing. Does anyone know any alt distros to replace theirs?


I think Synology runs BTRFS on top of LVM, so that may mitigate some of the issues.


Synology SHR is btrfs (or ext4) on top of LVM and MD. MD is used for redundancy. LVM is used to aggregate multiple MD arrays into a volume group and to allow creating one or more volumes from that volume group.


The ONLY time I've actually lost data, was with BTRFS, and only using it as a basic single-volume filesystem, no RAID, snapshots, or anything complex.

BTRFS does not seem to recover well from media errors. I've had plenty of those, and always been able to recover at least partial data (minus the bad media) with other FS's.


I've also recently had bad experiences where hard power-offs of a fleet of systems with a btrfs filesystem repeatedly caused corruption (resulting in the btrfs filesystem no longer being mountable).

Some testing seems to indicate even ext4 is more reliable in the face of hard power cycles.


And the "rescue" commands tell you not to use them unless you've been blessed by someone on a mailing list, and that they can actually cause MORE damage.

I was using BTRFS on my workstation and all of a sudden some garbage got written to the extent-tree, and it was toast. Never using BTRFS again. I had backups and was able to recover 99.99%, but the rescue tools were garbage. I finally killed it after it had spent a month, from July 4 until August 4, doing apparently nothing. And this was this year. This was a normal desktop configuration, single drive. I wasn't using any of the real features of the FS other than COW.


I've had funny experiences with disappearing files on ext4 - reading a file while there are transient SATA errors (on a flaky SATA link) would cause the file to be removed.

To be fair, I think I've lost more data on btrfs.


How long ago was this?

My experience about a year ago was the opposite: btrfs held up really well on a failing drive and I managed to copy the filesystem with 'btrfs send' without suffering any data loss or corruption as far as I've been able to tell. I also had a btrfs filesystem on an external drive with a loose USB plug that caused it to disconnect once in a while, and didn't have any corruptions or loss on that one either.


Btrfs is flexible and works well for self-hosted home storage use cases, especially running cheap hardware & HDDs with mixed RPM + capacity specs (many consider it a major advantage over OpenZFS on Linux which inherits ZFS pros and cons). As far as data corruption is concerned, avoid using raid{5,6} profile (that governs how a chunk is replicated within or across a member device) for data and you should be fine.
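
If you're unsure what an existing filesystem is using, something like this shows the profiles and moves data off raid5/6 (mount point made up):

    # Show which profiles data and metadata currently use
    btrfs filesystem df /mnt
    # Rewrite data as raid1 and metadata as raid1c3
    btrfs balance start -dconvert=raid1 -mconvert=raid1c3 /mnt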

Running OpenZFS on Linux also requires dealing with DKM(ES)S, Ubuntu being an exception (hmm)...

NOTE: Self-hosted podcast show episode 25 contains excellent coverage (ZFS vs. Btrfs) by Jonathan Panozzo from Unraid.


I've used ZFS on FreeBSD for many years and it's absolutely rock solid. There was even a case where an IPMI had been hacked and the attacker reformatted the disks for Windows 10 and I was able to recover most of the data. The attack was interrupted before they could copy data to the disk and I had left a small amount of space unallocated at the end of the disk.

Lessons learned:

- ZFS is very robust for handling partial corruption

- Always leave unallocated space (~1GB) before and after critical data partitions. Accidental/malicious reformatting will likely use the entire disk, so the new FS superblocks are less likely to overwrite all of the original ones.

- Always have an offline backup of your partition table in case of accidental/malicious repartitioning

The one thing that still makes me nervous is large stripes of mirrors where a multi-disk failure in one mirror can take out the entire stripe. For that reason I stick with simple 2 disk (or 3 disk for critical data) mirrors and avoid combining them in stripes.


Yes, canonical was brave enough to assume they don't need DKMS for zfs.ko ... [1]

From my experience, Ubuntu's ZFS support is pretty solid, even for root disks. There were some limitations in the past, however - for example, you might not want to use ZFS on a ssd root because of its write amplification.

By the way, there's a handy PPA with newer versions [2] - but this does use DKMS.

[1] https://ubuntu.com/blog/zfs-licensing-and-linux

[2] https://launchpad.net/~jonathonf/+archive/ubuntu/zfs


I use ZFS on my main computer, but it sends its backups to a tiny computer with two laptop USB drives plugged in, running Btrfs RAID-0. (The second drive was added when the first one filled up.) I did it that way because it's the second copy of the data, and I'll only need to call on it if my main computer dies. I have a sentry program running on the tiny computer that regularly checks that all the data is present, correct, and readable. The likelihood that it breaks at the same time as my main computer is slim.

I wouldn't trust Btrfs for much more than that.

I specified ZFS for some of work's very large servers (>100 drives). They wanted to do hardware RAID and XFS instead, and I can just imagine the nightmares that would have involved.


What might be neat for your case is that you can run btrfs scrub even on raid0 or a single drive. The scrub command reads the whole filesystem and verifies checksums, so you can be sure that the data is readable and correct.
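
Roughly (mount point made up):

    # Read everything and verify checksums; -B stays in the foreground
    btrfs scrub start -B /mnt
    # Or check on a background scrub later
    btrfs scrub status /mnt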

Disclaimer: I only tried it on a single disk.


Can someone recommend a good overview of modern filesystems that covers performance and licensing issues? I liked this article but was left with the question "so what else do you use?"

It's been awhile (too long really) since I really tried to read up on this area. Last time I did I concluded that it was sort of a mess but lots of things were still in progress.


Use ZFS if you can. However for legal reasons it is hard to use on linux, which pushes you to FreeBSD (I'm not sure about the others). FreeBSD can replace linux, but you will find a number of places where the software you use assumes linux and so things don't work right.

If you are doing NAS all on your own, then don't hesitate to run FreeBSD. However, if you want a not-do-it-yourself NAS it is a little harder. The most popular/polished commercial NAS is Synology, based on BTRFS. I personally use TrueNAS, which is a great file server, but Synology has some nice integration with photo and media applications and so I'm thinking about switching.


Too late to edit, but I just discovered that Synology isn't using BTRFS for redundancy; they are somehow using the Linux device mapper in a way that I don't understand. Since I don't understand what they are doing I don't know how it compares to ZFS.


Happy user of btrfs-raid1 for more than ten years here, for what it’s worth.


Seconded, though I think my array is perhaps only... 8ish years old?


I understand the audience of the ars article is people running larger setups with redundant disk arrays, managing lots of data reliably.

But when it comes to what you use in a laptop or desktop, with maybe one more slot for a ssd/nvme or an external disk, btrfs snapshots with incremental send+receive to the second disk or over the network is a good argument in favor of giving it a try.
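
A minimal sketch of that workflow, with made-up paths and assuming the previous snapshot already exists on both sides:

    # Read-only snapshot of the subvolume
    btrfs subvolume snapshot -r /home /home/.snapshots/today
    # Incremental send relative to yesterday's snapshot, received on the backup disk
    btrfs send -p /home/.snapshots/yesterday /home/.snapshots/today | btrfs receive /mnt/backup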


Do one thing and do it well.

Do too much and you may end up like Btrfs.


What about ZFS?


I have to second this - the rampant layering violation that is ZFS has in my experience been extremely robust.

Maybe it's simply doing the "one coherent tool to manage large chunks of data and disks" thing well.


"Do only one thing and do it well" is highly dependent on how you define "thing", yes.

Is awk bad because you can't compose the internals of its built-in functions? Are vim and emacs bad because they have multiple modes of operation? Is bash violating the unix spirit by doing both the job of an interactive shell and a C API extensible scripting engine?

I don't think so.

ZFS, BTRFS, and advanced ext4 features like encryption are more arguments that layered block devices may have been a poor choice of abstraction.


Suspect as well --- for the same reason.

System level data handling functionality is critical and needs to be rock solid. Complexity makes this more difficult to achieve. Not to say it's impossible, just increasingly more difficult.


I disagree. I think that the traditional split between block devices and file systems is a premature optimization that ZFS fixes.

And in practice ZFS is extraordinarily solid.


> And in practice ZFS is extraordinarily solid.

As long as you never ever run into an edge case it's great, but then again the same goes for btrfs :)


Did you have anything specific in mind? Because otherwise that statement could apply to anything.


ZFS is a more mature and (maybe?) more performant solution to... the same wrong problem. The modern world doesn't want this stuff from its filesystems. It just doesn't.

Modern flash devices are extremely reliable and outrageously fast, and they pervasively sit on high-bandwidth internet connections which provide reliable backup for virtually every application imaginable. RAID and checksumming[1] just aren't used at the level of small/personal devices anywhere anymore, and really they never will be.

And as you scale higher, single filesystems just don't cut it anyway. Cloud storage paradigms work on much larger bits, and worry about things like multi-site failover and atomicity. ZFS/btrfs has nothing to offer someone working on an S3-like service, or a globalized database, either.

I mean, it's fine. Use it if you like it and you have an esoteric application that can leverage it (lots of hackers' personal networks tend to have toys like that). For that matter btrfs is fine too; my main data server at home is scripted around btrfs snapshots because they're easy.

But really... fancy filesystems are just software in need of a home. Run ext4 or whatever and don't sweat it.

[1] Non-encryption checksumming anyway. dm-verity doesn't work at the filesystem layer, nor should it.


Although I completely agree with your observations - sitting in a weird spot between very large and very small data, there are use cases where ZFS is extremely useful.

I'm regularly working with datasets in the single to two digit TB range. They need to sit on disk because the university doesn't have 10GE.

If I ran a regular RAID and encountered a disk or data error, I would have to re-create the data from the tape archive or, worse, lose it; and getting a replacement disk would surely take weeks.

So instead, I've been running raidz pools for ~10 years and have never skipped a beat.

I agree that this is not very typical, but it suits my needs perfectly.


I just do software dev, but cp with reflink is absolutely killer, and the only reason I don't just go with ext4. Unless I'm crazy, ext4 doesn't seem to have any simple copy-on-write feature I can use to copy/paste an entire project directory instantly. I sometimes do that so I can have two instances of the same app running at the same time, on different branches. I've also used reflink in automated tests, testing a tool that relies on and changes a project's file structure. I can just duplicate an entire dummy project, operate on it, then throw it out and it's fast since it doesn't need to copy bit-by-bit.
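
For anyone who hasn't tried it, the whole trick is one flag (paths made up); on ext4 the same command fails because there's no clone/reflink support:

    # Instant copy-on-write duplicate of a project tree on btrfs (or XFS)
    cp --reflink=always -r project project-branch2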


Isn't Lustre designed specifically for ZFS as its backing filesystem?


No, Lustre shipped on ext4 for years and years. Not an expert, but IIRC there aren't any core/major/whatever features in ZFS being leveraged in an irreplaceable way.

Again, ZFS works fine, it's there (so is btrfs). Use it if it does something you like. It's just not really "worth" the level of complexity involved, and the window has long since closed for fancy filesystems to change any major paradigms of computation. This is dinosaur technology, basically.


> ZFS/btrfs has nothing to offer someone working on an S3-like service, or a globalized database, either.

Why would that be the case? Let's say I want to set up a Vitess cluster (planetscale = globalized database). I can easily delegate the compression logic to zfs, which works way better than innodb table compression or transparent page compression. I can rely on zfs atomic write to safely disable innodb doublewrite, which has been the number one bottleneck to scale writes until it was enhanced recently.


What about it? It does big storage really, really well.


I would definitely look again at btrfs once it gains raid5/6 capability.

Currently happily using ZFS. The only issue is not being able to use latest kernel releases as the ZFS folks usually don't release updates for a while, but overall it doesn't really bother me much nowadays.


I guess it's unfortunate for btrfs that it has so much "raid" functionality. Otherwise it could be known for its useful other functionality - snapshots, subvolumes, checksums, compression!


TL;DR:

Btrfs is usable in mirror mode.

Raid is still unusable (as in: you will lose data)

Usage and tooling are unintuitive and dangerous.

I have to say, it doesn't sound so bad when you know what you're doing. The example from the article - the fact that you need to re-silver and re-balance a mirror after a failure - yes that is dangerous. But then again, when you're starting with zero ZFS knowledge, you also need to learn the particular way it's designed.

And let's remember that some mdadm raids don't auto-scrub periodically. So the essence is: Storage is still hard, it's still necessary to think before you manage disks.
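
For what it's worth, kicking off a manual md scrub is simple enough to put in a cron job (array name made up):

    # Ask md to verify and repair mirrors/parity in the background
    echo check > /sys/block/md0/md/sync_action
    # Watch progress
    cat /proc/mdstat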


> But then again, when you're starting with zero ZFS knowledge, you also need to learn the particular way it's designed.

Except… there's nothing to really learn. All these weird edge cases that can kill your BTRFS simply don't exist in ZFS. The tooling is also much more polished and in the few cases that something does require user attention, it'll tell you in easily understandable terms, so as a new user the only command you really need to learn is `zpool status`, which you now did.


Sure, I give you that ZFS' tooling is much, much more polished.

But of course there are peculiarities that have 'bitten' people (softly) in the past:

- dedup has hardly any use case, novices get enticed to try something that doesn't do what they think it does

- ZIL and SLOG sound like "a simple cache" when in reality they are much more complex

- You can add single devices to a (redundant) pool, at which point you've lost redundancy, perhaps without knowing.

- No defrag

- No re-balance

- No growing pools

- ARC still surprises people (also, I've had kernel crashes under heavy load when the ARC didn't free ram fast enough ... edge case maybe, but still)

That said, of course, ZFS is awesome and I encourage everybody to give it a try.


Some points I think bear some clarification wrt current zfs state:

- current versions of zfs (ie: openzfs) error (unless forced) when adding non-redundant devices to a redundant pool, and have device removal to handle the case where someone bypasses the error.

- pools can grow in zfs (ie: the things one allocates storage out of). what is probably being referenced here is that one can't reshape a zraid vdev to add more disks to it (basically the reshape operation in classic raid setups). This does make expanding in some cases less flexible than old-style raid setups.

Also, I really have to agree that ARC behavior is still not great under load, and I don't think that's that much of an edge case.


- No copy-on-write file copies.

- Very limited ability to make snapshots writable.


> Except… there's nothing to really learn.

Would you mind explaining to us how to add a disk to an array with ZFS? Then the same question, but with a disk which is not the same size as the rest of the array.


    zpool add tank mirror /dev/sdc1 /dev/sdd1
But I presume you're referring to adding disks to an existing raidz. Such an operation was never really considered by ZFS, as it was made for business environments where they buy disks in quantity. So they buy enough disks for a full redundancy set (another mirror or raidz), and append onto the existing array.

From the perspective of an individual user it's definitely a suboptimal part of ZFS. But standing in the shadow of business usage is the price to pay for it having been developed in the first place.

As an individual user you can play games with sparse files and failed members as a sibling comment says (and dragging some external drives into service can help, as well as restoring from your backups). But really, just try to plan for the space you want and buy at once. Disks are large enough these days that you should aim to not be tinkering with your storage.

Back in the day I had something like an 8 x 250GB disk external USB drive array for my FLAC collection, with each disk partitioned into 10 slices, raid6'd, and then LVM'd back together. This gave me the wiggle room to take down a single slice at a time for reshaping, and then I could even online grow the filesystem thanks to reiserfs. No backups of course. I did the online expanding thing once or twice and it was neat, but ultimately the whole thing was replaced by a simple 2TB mirror after several years.

There's no reason you couldn't do something similar with ZFS besides performance. But do you really want to?


You're teasing, of course, but it shows the real problems.

To take you up on that with a dangerous example that reinforces the point:

To add a disk to an array whose used space is less than the capacity of a single disk, you can (DON'T DO THIS, IT'S DANGEROUS)

(1) Remove a disk from the array (it's degraded now)
(2) Reformat the single disk, put all data on there
(3) Create a sparse file with the size of one disk
(4) Create a new array with N disks, in place of the temporary disk use the sparse file as a device
(5) Remove the sparse file (the new array is now degraded)
(6) Bring the new array up, copy all data from the temp disk onto it
(7) Replace the sparse file with the temp disk
(8) Copy files around to make them re-balance

Congratulations, you've now grown your array, and put your data seriously at risk in the process! :)


How do you do it on another FS?


mdadm --add /dev/md1 DEVICE

mdadm --grow /dev/md1 --raid-devices=N


My friend did that, lost all his data, now uses ZFS


You don't. There is NO real world use case to do that. It seems like they exist, but every time you dig into the details you discover that there is a horrible edge case that makes it a stupid idea.

What you do is buy not one drive, but a set of drives, and add a second array to the system. Then RAID-0 between the two arrays.


There is NO real world use case to buy fewer than six to eight drives at a time? (assuming I'm using raidz2 or so)

Really?


The smallest number you can safely go is 3 at a time - put them in a mirror configuration. This is expensive for the amount of storage you get, so few people are willing to spend that much. Once you want to be cost effective, 6-8 is the sweet spot - enough drives for a good amount of usable space, but not so many that multiple drive failures are likely to compromise the array. Nobody is stopping you from other configurations, but 6-8 drives in your array is about the best compromise between performance, cost effectiveness and total storage.


> The smallest number you can safely go is 3 at a time - put them in a mirror configuration. This is expensive for the amount of storage you get, so few people are willing to spend that much.

You're talking in circles. You're assuming the limitations of ZFS, then providing advice based on those limitations, then declaring that there's no advisable use case for a filesystem that's free of those limitations.

On btrfs, you can start with two or three drives, then expand your array one drive at a time up to 8+ without lowering your redundancy at any point, and each time you add a drive the usable capacity of the array really does increase.


It is not safe to have more than 8 drives in an array. I'm not willing to trust that I will always stay that small.


> It is not safe to have more than 8 drives in an array.

What are you basing this claim on? Did you mean to qualify that statement to only apply to certain RAID modes, or are you again just repeating advice specific to ZFS as if it applies to all storage systems?


All RAID. Your odds of a double drive failure are too high.

Note, good raid6 should have x and y stripes with no more than 8 drives in any stripe. I have no idea what they do, so maybe they can restripe somehow on one drive. Hiding the implementation detail is good if they do this right. But I don't know how to tell if any do it right.


> All raid. Your odds of a dual fail failure are too high.

These two sentences are in conflict. There are RAID modes where the failure of two drives is recoverable. You probably know this, since you mention one of them in your next sentence. But maybe you don't realize that RAID 6 isn't the only way for an array to survive the failure of two drives, especially if the failures aren't simultaneous. You should read up on some storage systems beyond traditional hardware RAID and ZFS; there are quite a few interesting ideas out there you should open your mind to. Btrfs would be a great place to start your investigation.


If 8 drives is safe with dual parity, then significantly more than 8 drives would be safe with triple parity, right?


I did some calculations, with an estimated 3% annual failure rate per drive.

The numbers shift a bit depending on how much you expect drive failures to correlate and how long you take to replace them.

So I analyzed two example cases, using pretty good math based on a binomial distribution.

The first case is a pessimistic one, where after a drive fails we assume that any other possible failures from the next year will hit before we finish rebuilding. In this case, 8 drives with double parity have a 1/740 chance of array failure each year. If we switch to triple parity, we can have 16 drives and a slight improvement in safety.

The second case is a more optimistic one, where after a drive fails we assume that any other possible failures from the next week will hit before we finish rebuilding. In this case, 8 drives with double parity have a 1 / 93 million chance of array failure each week. If we switch to triple parity, we can have 40 drives and a slight improvement in safety.

I would have done "what if all the drive failures happened at once" but that gives >10% odds of an 8 drive array failing, which seems unreasonably harsh.


I want to work up to 8 drives in a vdev as my storage needs grow, because that's cost-effective. But I don't want to spend 8 drives of money up front, because it's a lot and because drive prices drop over time. Am I not a real use case?


For the first 8 drives, but after that no. BRTFS-raid2 seems to be good for up to 8 drives if that is your use case. But there is no easy upgrade path to that 9th drive. You should not be running less than 3 drives for any data you care about.

Fortunately, most people don't have data they care about. Their photos and documents are on cloud providers like google/dropbox/facebook (and dozens of others I don't even know of). If disaster happens they might lose a couple days' work. There is a lot to not like about cloud providers, but they are probably better than what the average person does - no backups at all. Note that the RAID system I've described here is for data integrity, but there are still house-burning-down cases where you lose it all.


> BRTFS-raid2 seems to be good for up to 8 drives if that is your use case. But there is no easy upgrade path to that 9th drive. You should not be running less than 3 drives for any data you care about.

What the hell? Can you please clarify what you mean by "BRTFS-raid2", since that's not a term that has previously existed and BTRFS does not have any functionality remotely similar to the long-obsolete traditional RAID2, and none of the RAID functionality that BTRFS does have has any limitations or safety thresholds for going beyond 8 drives, especially not the RAID1 mode (which is the most plausible thing your raid2 could have been a typo of).

If you have an 8-drive btrfs RAID array, going to 9 drives is as simple as `btrfs device add`, optionally followed by a `btrfs balance start`. Neither of those will reduce the redundancy of any data stored in the array. And I can't see any reason why you're suddenly talking about 3 drives right after talking about 8 or 9 drives.
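
Concretely, the whole 8-to-9-drive expansion is just (device name and mount point made up):

    btrfs device add /dev/sdi /mnt
    # Optional: spread existing chunks onto the new disk
    btrfs balance start /mnt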


To be clear, I'm assuming by "array" you mean "vdev" not the overall pool, since adding new disks to a pool has always been the normal way to grow a pool. A PR was opened by Matthew Ahrens (one of the OpenZFS founding devs) back in June for RAID-Z expansion, if you're actually curious. It almost certainly won't hit any stable major releases until next year while everyone kicks the tires, though it should be noted that iXsystems has a history of pulling new ZFS features into TrueNAS early (they might be more cautious and ultra careful about this one though). But working code is there; it just adds another capability to the existing [zpool attach] command.

FWIW I honestly don't think growing an array vs adding to a pool is actually that interesting given the modern state of drives and how capacity growth has lined up with data growth across the spectrum. At the consumer/soho/small business scale growth curves for storage blew past data needs a while ago, so by the time somebody needs more storage the drives are old and it makes more sense to just replace them with much bigger ones (which ZFS supports fine, replace all the drives in an array one by one then it'll expand), or small drives will be so much cheaper that just adding another set to the pool makes sense. Much bigger than that and people aren't going to be doing array expansion ever anyway. Still, cool to see that one checked off the list, and if the functionality ultimately enables generalization to RAID-Z type conversion or better rebalancing that'd be very interesting.
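
For anyone who hasn't done the replace-in-place dance, it's roughly (pool and device names made up):

    zpool set autoexpand=on tank
    # Swap each member for a bigger disk, waiting for the resilver to finish each time
    zpool replace tank /dev/sda /dev/sdx
    zpool status tank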

>Then same question but with a disk which is not the same size that the rest of the array.

Don't do that, simple as that. There isn't any great way to handle that with full performance and redundancy, and any possible value seems way, way too niche to be worth bothering with. Although again, if you want to add uneven vdevs (single or multi-disk) to a pool you can always do that.

Edit: One other practical consideration I've always had regarding expansion, adding disks and so on, is that the troublesome brute force method to me is an excuse to promote good habits and practices most of us are very lazy about in general. Having to tear down a pool and restore to get best performance and flexibility out of adding a bunch of disks is certainly a pain. But we should be able to do this at any time anyway, we should be semi-regularly testing that it works, and we do all do that testing right? Right? OK probably not, plenty of people aren't in the habit which is why automation is so important. But even if ZFS gained full perfect flexibility, I still think I wouldn't use it in general because I think the exercise of making sure that yes I can do a full restore and everything is working at least once every few years is so valuable.


How to kill your array - reboot your computer at a bad time. Who'd have thought raid1 could be so unreliable.

https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_volume...

I did end up having to copy the array out because of this. Copied it into a zfs array and am glad to be out of the land of gotchas.


Well, at least there's a chance. Now, ReiserFS 4...


His sentence will be served as of sometime in 2023.


The development might resume in 2023


Yet another knock-off of Lustre. Lustre is not perfect, and I'm not suggesting it has a place in the mainline kernel, but it sounds like neither does BTRFS.

When you get to large scale (multi-petabyte) filesystems there are not many options, it's GPFS, Lustre, or roll your own. Lustre development has been indirectly sponsored by the DoE for a long time now. And if Lustre didn't exist IBM would charge even more for GPFS. But at least Lustre is free. But it's not a walk in the park to use.


Uh, BTRFS and Lustre can't easily be compared. The former is exclusively a single-host fs, the latter is a cluster-only fs. The former focuses (more or less successfully) on data resilience and availability, the latter on performance.

There are quite a few large scale distributed file systems. XtreemFS, Hadoop and Ceph come to mind. There are others, e.g. OrangeFS (successor of PVFS), PanFS, etc. .


Btrfs only exists as Oracle's "replacement" for ZFS, to hedge their bets in case they couldn't buy Sun. Oracle bought Sun in 2010; that has been 12 years.

They merged the Btrfs and ZFS teams, they then worked on ZFS entirely with no Btrfs work, and then basically reassigned, fired, or otherwise dissolved the ZFS team when it became clear that OpenZFS (Sun's chosen steward for ZFS) could not be swayed.

Btrfs has had no real work done on it since then, and anyone that is actually serious about data storage has moved to ZFS, or has left POSIX filesystems entirely.


>Btrfs has had no real work done on it since then

changelog seems to refute this point entirely.

https://btrfs.wiki.kernel.org/index.php/Changelog

>anyone that is actually serious about data storage has moved to ZFS

sometimes data storage doesn't have to scale, and that's okay. there are still plenty of use cases for lvm/xfs/mdadm in environments that don't need Google levels of service. I'm "serious" about data storage and run raid10 btrfs.

>or has left POSIX filesystems entirely.

oh get off the cross. redhat intentionally threw backing into XFS to make it functional with Ceph. just because BTRFS has issues with RAID5 (arguably the worst RAID) doesn't mean the whole of POSIX storage has been relegated to the ash heap.


> redhat intentionally threw backing into XFS to make it functional with Ceph.

Can you elaborate on that? Is this about the pre-bluestore backing of Ceph or... something else?


This is one of the most laughably ignorant posts I've ever seen on the internet.

> Btrfs has had no real work done on it since then, and anyone that is actually serious about data storage has moved to ZFS, or has left POSIX filesystems entirely

Btrfs is the most actively developed filesystem in the kernel, both by LOC changed and the number of contributors. Facebook publicly stated that they run BTRFS on every webserver years ago.

ZFS is an antiquated joke. It is wed to the memory allocator in Solaris, and requires that you permanently reserve 1GB of RAM per 1TB of disk space, which is an impractical constraint in the modern world. ZFS is also full of cute little wasteful "innovations" like 128-bit file size support.

I don't know what planet you're from, but it's not mine :)


You’re not correct on a number of fronts here.

ZFS is not tied to the memory allocator in Solaris, I’m very unclear how you could have made that mistake.

128 bits might feel wasteful, but you're likely never going to have the problems FAT32 has had - it's designed for the next 100 years - which has drawbacks but prevents the kinds of ugly hacks that NTFS has gone through to maintain forward compatibility (rolling on-disk upgrades).

And yes, the 1GB per TB is for the in-memory block map, which is a requirement for deduplication on the default block size.


> ZFS is not tied to the memory allocator in Solaris, I’m very unclear how you could have made that mistake

> And yes, the 1G per TB is for the in memory block map, which is a requirement for deduplucation on the default block size.

You actually answered your own question: it's antiquated because of the assumption it is possible to get contiguous RAM for that mapping. On modern Linux, that is only possible if you reserve it pre-boot.


> it's antiquated because of the assumption it is possible to get contiguous RAM for that mapping.

source on that? (and even if that's the case, wouldn't that be an implementation detail? E.g. implementations can spill it to L2ARC, but somehow not to different memory locations?)


My memory is that this mapping does have to be contiguous, but I'll dig it up in the code and link it here either way.


still planning on linking that code?


The Solaris memory allocator is not contiguous either though


This is a whole rabbit hole... but essentially, Linux uses a naive buddy allocator for physical RAM, and Solaris was more interesting.


Please correct me if I'm wrong, but the "1GB ram per TB" is for the de-duplication table.

ZFS' memory cache (ARC) by default uses up to 50% of system ram and frees this under pressure. You can also set it lower.
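
For completeness, capping the ARC is just a module parameter (the 4GiB value here is made up):

    # At runtime
    echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
    # Or persistently, in /etc/modprobe.d/zfs.conf:
    #   options zfs zfs_arc_max=4294967296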


Yes, this is what I'm talking about. With respect to ARC, it gets along very poorly with the native page cache and memory allocator.

I would add that BTRFS has no such requirements for deduplication.


That’s because the btrfs deduplicator is quite a different animal. You could just as easily run fdupes or something similar. Meanwhile ZFS is doing something much more ambitious.

In any case, I don’t think that either of them are worth using in most cases. Deduplication is just too expensive to be worth it.


> You could just as easily run fdupes or something similar.

If you only deduplicate whole files and soft links or hard links are suitable for your situation.


That’s not the aspect of the deduplication process that I was referring to.

Btrfs only supports offline deduplication, where you periodically run a process that searches for duplicate blocks and combines them.

Meanwhile ZFS supports online deduplication, where every block written is checked against an index to see if it is a duplicate of some existing block.

That index is what takes up the memory that jcalvinowens complained about. When deduplication is turned on, ZFS keeps that whole index in memory all of the time. Meanwhile btrfs only needs that index to be loaded while the deduplicator is running, so it doesn’t show up most of the time.
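
Concretely, the two models look something like this (paths and dataset names made up; duperemove is one of several offline dedupers for btrfs):

    # btrfs: offline, run whenever you feel like it
    duperemove -dr /mnt/data
    # ZFS: online, every future write is checked against the in-memory DDT
    zfs set dedup=on tank/data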

Personally I don’t think that either is really worth the cost; the benefit is so small that the cost isn’t worth paying.


I know the ZFS deduplicator accomplishes more, I'm just pointing out that that if you want offline deduplication on ZFS, you can't have it.


Look you can’t start your post with

> This is one of the most laughably ignorant posts I've ever seen on the internet.

And say

> requires that you permanently reserve 1GB of RAM per 1TB of disk

If you want people to take you seriously. There are ways to debate on HN, this isn’t it.


What are they changing all the LOCs for if not to fix the existing massive data loss bugs?


> Btrfs has had no real work done on it since then, and anyone that is actually serious about data storage has moved to ZFS, or has left POSIX filesystems entirely.

I'm not very familiar with this space, but if this were true why would Fedora use Btrfs by default?

Also on Fedora wiki [0] -- Btrfs is a mature, well-understood, and battle-tested file system, used on both desktop/container and server/cloud use-cases -- which seems to contradict what you said.

[0] https://fedoraproject.org/wiki/Changes/BtrfsByDefault


Apparently it is used at Facebook a lot too.

https://lwn.net/Articles/824855/



