New tricks for XFS: support for subvolumes (lwn.net)
97 points by diegocg on March 1, 2018 | 70 comments



I think the biggest missing feature for home/casual use in XFS right now is shrinking. Currently, it's impossible to reduce an XFS filesystem (partition) in size, so you have to commit to your disk layout once you've set it up. Whether you want to install another OS side-by-side, grow a swap partition, or experiment with or gradually migrate to another filesystem, none of these is currently possible without adding physical storage or resorting to loop devices.

The same applies to ZFS (it's not possible to shrink a ZFS pool), which is why I'm currently using btrfs (with all its pain points) on my machines.


From operations experience, the risk of data loss with btrfs far outweighs the lack of volume shrinking on either XFS or ZFS.


I'd really like more data on that. My impression is that the biggest problem btrfs suffers from nowadays is severely lacking communication. Even the official wiki isn't very up to date, and a lot of horror stories from years ago are still circulating.

The situation has been quite reliable for some time now (single disk, raid{0,1,10}). Moreover, the feature set of btrfs is really wide (on par with ZFS), with a lot of flexibility: you can mix and match disks of different capacities, shrink and expand pools, and change their redundancy level through balance filters, and everything can be done online...
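As a rough sketch of that flexibility (the device names and mount point here are made up for illustration):

    # add a second disk of a different size to an existing btrfs filesystem
    btrfs device add /dev/sdb /mnt/data

    # convert data and metadata to RAID1 across the two disks, online
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data

    # shrink the filesystem by 10 GiB, also online
    btrfs filesystem resize -10g /mnt/data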


Even if it doesn't break your filesystem anymore: performance is subpar in all aspects except sequential read/write, even compared to ZFS. A high snapshot count degrades performance, and there are lots of gotchas you discover after using it for a while. RAID1 is not RAID1 - the parity of the reading process's PID decides which disk is read from... scrub impacts I/O massively. Tooling and documentation are not exactly great... lots of quirky hacks to make up for design errors, IMHO.

If it works for you, fine. It also appears to be getting better, but I won't touch it anymore if I can avoid it.


Fortunately, btrfs is really good at backups.

Well, so is ZFS, but in some situations, like choosing what filesystem to use on a rented dedicated server, any choice that could lead to a situation requiring physical hardware configuration changes is a non-starter.

ZFS licensing is an issue too, as it means that you can't just boot into any ol' Linux live CD (or remotely boot into a rescue environment) to fix the system or salvage the data on it.


> ZFS licensing is an issue too, as it means that you can't just boot into any ol' Linux live CD (or remotely boot into a rescue environment) to fix the system or salvage the data on it.

Sure you can. I've done just this with three different live CDs: ArchLinux, FreeBSD and OpenSolaris. I'm fairly sure I've also used ZFS on the Ubuntu Desktop live CD, but that was just for playing around rather than rescuing a degraded system.


FreeBSD and OpenSolaris probably aren't very useful when trying to rescue a Linux system, especially if you need to chroot and run things from there. (My need so far was to rescue a non-booting system, because a zfs package upgrade went wrong and didn't update spl as well. Re-running dracut would be somewhat problematic from either of those two systems.)

The Ubuntu desktop live CD doesn't contain ZFS; you have to install it from apt.

However, if you have a ZFS system, I see no problem with keeping a USB stick with a minimal installation of your distro of choice, together with ZFS support. I'm glad I've had one around since the ZFS install.


I think you're nitpicking a little, to be honest. None of those problems are hard to work around:

> FreeBSD and OpenSolaris probably aren't very useful when trying to rescue a Linux system, especially if you need to chroot and run things from there. (My need so far was to rescue a non-booting system, because a zfs package upgrade went wrong and didn't update spl as well. Re-running dracut would be somewhat problematic from either of those two systems.)

I do see your point, but it really depends on the problem, as not all recoveries require chroot / package management access. I've rescued Solaris (not OpenSolaris) with an OpenSuse live CD back when a cavalier op chmodded /etc. I've rescued OpenSolaris with a FreeBSD CD back when a faulty RAID controller borked the file system. As for ArchLinux ISOs, I've used them to rescue more systems than I can count. But as you said, some problems do just require booting an instance of the host OS by some means.

> The Ubuntu desktop live CD doesn't contain ZFS; you have to install it from apt.

It took me all of about 10 minutes to bake the ZFS driver into the ISO. It's not hard compared to the other technical challenges you've discussed. Though if that's too much effort, I think you can also just apt it from the live CD and manually modprobe it into the running kernel.
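Something like this, as a rough sketch from a live session with network access (package names are Ubuntu 16.04-era, and the pool name "rpool" is just an example):

    # from the booted Ubuntu desktop live session
    sudo apt update
    sudo apt install zfsutils-linux        # userspace tools
    # on some images you may also need the matching linux-image-extra package for zfs.ko
    sudo modprobe zfs

    # import the pool under /mnt and poke around
    sudo zpool import -f -R /mnt rpool
    zfs list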

> However, if you have a ZFS system, I see no problem with keeping a USB stick with a minimal installation of your distro of choice, together with ZFS support. I'm glad I've had one around since the ZFS install.

Indeed. My preferred method is having rescue disks available over PXE boot. Before that I was forever hunting down my recovery disks or spare USB keys / CD-Rs. Not to mention the pain involved if the system I was trying to recover was my main workstation (i.e. the hardware I'd normally use to download and burn CDs on).


> None of those problems are hard to workaround:

Sure, there are very few problems that cannot be solved by throwing some time and sweat at them. However, when I do need to solve something, I prefer not to be sidetracked by sub-problems. Smooth sailing and all that.

It's much simpler to pull a USB key from the drawer, or PXE boot as you mentioned, and get on with fixing the damaged system than to start downloading and preparing a live distro somewhere.


Again, you're overstating things. If it genuinely takes you more than a couple of minutes to run apt and modprobe then I really think you shouldn't be allowed anywhere near a degraded system to begin with. These aren't "sub-problems" - they're the absolute basics of system administration.


It takes a bit more than a couple of minutes to download the installer, install it somewhere (the live CD doesn't have a persistent /), install ZFS there, and only then get on with whatever you were doing.

Compared to grabbing standard media you already have somewhere, it will take at least 15 minutes extra.

Knowing the basics of system administration doesn't mean you should waste your time, especially on something you can do without.


> It takes a bit more than a couple of minutes to download the installer, install it somewhere (the live CD doesn't have a persistent /), install ZFS there, and only then get on with whatever you were doing.

You don't need a persistent root. I'd already addressed that point. Just run modprobe and you're done.

> Compared to grabbing standard media you already have somewhere, it will take at least 15 minutes extra.

Bullshit. I've done exactly what I described and it did not take me 15 minutes. Furthermore, all you're doing is pre-emptively doing the work before your outage, and you could do the same with the ISO (if you really wanted to compare apples with apples).

> Knowing the basics of system administration doesn't mean you should waste your time, especially on something you can do without.

The whole point of this tangent was about when one needs a live CD, not about whether creating a live CD is worthwhile when you already have a USB key. That new argument you've invented is stupid because the answer is quite clearly "use the USB key if it's already in your drawer." But what happens if you have a ZFS volume on a system and you don't already have recovery media? (i.e. the original question) Well, in that case you can use any of the methods I described. Or, of course, you can create a USB key too. But that will take just as long as the methods I described anyway (you still have to download the OS image and the ZFS drivers and write them all to your storage medium, so all you're really doing is swapping one chunk of plastic for another chunk of plastic).


> You don't need a persistent root. I'd already addressed that point. Just run modprobe and you're done.

That assumes too much. For example, that you have a network connection while booted from the live media. You may not have one; then you cannot run apt/yum and you need persistent media that you prepared somewhere else. (Happened to me).

> Bullshit.

Surely. Or you have extra speedy USB keys. Just installing a minimal distro on USB takes a good chunk of that time.

> The whole point of this tangent was about when one needs a live CD.

When you are doing something non-standard - and installing ZFS on Linux is pretty nonstandard - you know in advance that the normal live media won't work. It's prudent to have something prepared for if/when an SHTF event occurs.

Specifically with regard to filesystems: when you are installing with a root filesystem the distro doesn't provide, you need to make such media anyway, just to do the install in the first place. So instead of throwing it away, just label it and put it in the drawer. (When you are not installing on a non-distro filesystem root, you don't need support for that filesystem in the live media at all; the standard one will do for getting the system to boot.)


> That assumes too much

You've been assuming a crap load of stuff as well when it suits your argument. Like having a pre-prepared USB key to begin with.

> For example, that you have a network connection while booted from the live media. You may not have one; then you cannot run apt/yum and you need persistent media that you prepared somewhere else. (Happened to me).

Indeed. You might also not have a CD drive on the host (happened to me), or any blank CD-Rs, or a CD burner on your workstation. Or the internet connection might not work on your workstation either. But most of those arguments can be made about creating a USB key as well, so your point is moot. In fact my latest workstation (MacBook Pro) only has USB-C, so I couldn't use my USB keys when I went to install Linux on it.

My point is, if you're looking for ways to nitpick, there are plenty for your examples as well. In fact there will be a thousand different exceptions for any solution you could dream up. Such is the nature of working in IT.

> Just installing a minimal distro on USB takes a good chunk of that time.

Arguably yes, but that also takes longer, and your original point was about getting stuff done as quickly as possible. So you're now contradicting yourself.

> When you are doing something non-standard - and installing ZFS on Linux is pretty nonstandard - you know in advance that the normal live media won't work.

Except the whole point of this tangent is me demonstrating where it does work.

> It's prudent to have something prepared for if/when an SHTF event occurs.

Now you're arguing a different point from the one I was discussing. I'm not going to disagree with you there (since I've already mentioned that I run a PXE server for situations like these), but that wasn't the topic we were discussing.

I seriously just think you're now just arguing for the sake of winning an internet argument. I'm not going to argue that a CD is better than USB, because it's pretty obvious that isn't the case. But that wasn't the point I was discussing. So for the benefit of my own sanity, can we please get back on topic: you can use live CDs to repair a degraded system running ZFS. Sure, there will be occasions when you cannot; but that's the case when doing anything in IT (and thus why us sysadmins get to command such a good wage). But generally you can. And I literally have. Many times, in fact. So enough with the dumb "death by a thousand paper cuts" and goalpost-moving arguments, please.


> You've been assuming a crap load of stuff as well when it suits your argument. Like having a pre-prepared USB key to begin with.

You are still conveniently ignoring what I said: if you want to install a system with a ZFS root, you have to make such media. That's also the reason why I have it. I just didn't throw it away after the installation.

> Except the whole point of this tangent is me demonstrating where it does work.

Yes, if everything is aligned right, it can work.

> I seriously just think you're now just arguing for the sake of winning an internet argument.

You are free to think whatever you want.

> you can use live CDs to repair a degraded system running ZFS.

Yes, under certain conditions. How they apply in your environment is up to you to assess.

> Sure, there will be occasions when you cannot; but that's the case when doing anything in IT (and thus why us sysadmins get to command such a good wage). But generally you can. And I literally have. Many times, in fact. So enough with the dumb "death by a thousand paper cuts" and goalpost-moving arguments, please.

It's not goalpost moving, it's what happens. Having a live CD that supports your configuration is better than not having one. Being able to download a ready-made one is better than having to make it. Etc.

So when I can choose between a FreeBSD or OpenSolaris ISO and a native system that fully supports whatever I need (that was the original issue, remember?), of course I will choose the latter, or at least having the latter available is preferable.


> You are still conveniently ignoring what I said: if you want to install a system with a ZFS root, you have to make such media. That's also the reason why I have it. I just didn't throw it away after the installation.

I'm not ignoring it; I've repeatedly addressed it and pointed out how it's not true (the Ubuntu Desktop example). Want a few more examples? When I installed ArchLinux with a ZFS root I didn't use a custom ISO (read their ZFS wiki if you don't believe me). I also didn't create a custom Ubuntu Server ISO when I installed that with a ZFS root. Both were installed from CD - the vanilla CDs available on their respective websites.

Also, even if you did install from a USB key, what's to say you don't then lose said key afterwards? I'm forever losing them.

The point is that whichever argument you make will be full of more exceptions than you can count. So nitpicking one over the other, like you are, is an utterly pointless exercise and a distraction from the original point I was making.

> So when I can choose between a FreeBSD or OpenSolaris ISO and a native system that fully supports whatever I need (that was the original issue, remember?)

No, that wasn't the original issue. The original issue was whether there are any live CDs that can be used to rescue a degraded ZFS system - which I've demonstrated there are.

However, I do agree with you that running ZFS on Linux is a little pointless when FreeBSD and the OpenSolaris forks are all solid platforms with unencumbered native ZFS support. Though installing a ZFS root on FreeBSD was just as painful as doing so on ArchLinux (at least that was the case a few versions ago - things might have improved since, but thankfully FreeBSD never really needs rebuilds, so I've not had to revisit that particular pain point).


> Fortunately, btrfs is really good at backups.

For backing up to another file system? How so? Is there something like "zfs send" or even "xfsdump"?

Honest question - I haven't looked that closely at btrfs.


Yes, "btrfs send" and "receive" exist for that purpose.


Mind that it doesn't work quite as well as ZFS's: btrfs send/receive doesn't produce a file system identical to the source. ctimes aren't preserved, nor are some attributes, and depending on mount options, ACLs and xattrs can vanish too.


Is there really any reason to believe that Btrfs has more risk of data loss than XFS or ZFS, at least on simple single or mirrored drives?

Honest question - my home box runs a Btrfs mirror on openSUSE.


The two most obvious reasons to assume so are that btrfs is, by comparison, relatively new and quite complex. Not that ZFS isn't complex, but I'd rather trust ZFS just because of its age.

That being said, unless I really need those specific features, I go for ext4 whenever possible, as it has to be the most battle-tested one, at least when it comes to *nix. It also seems that fsck.ext4 has almost magical powers sometimes, though that shouldn't stop you from making backups, obviously.

Related: http://events.linuxfoundation.org/sites/events/files/slides/...


> I think the biggest missing feature for home/casual use in XFS right now is shrinking. .... The same applies to ZFS (it's not possible to shrink a ZFS pool)

I have to admit that recently I've only had the exact opposite use-case: Wanting to expand volumes.

Except for authoring install media or images, where you want to reduce the final file size... what common use cases are there for volume reduction?


Since XFS works with multiple allocation groups spread across the volume, shrinking would require migrating data out of the area you want to reclaim, which would mean a pretty significant reshuffle of the volume contents.

It's easier to just back up and refill with a single set of reads and a single set of writes than to suffer the huge overhead of all that random metadata updating and seeking.
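That workaround, roughly (a sketch only; the device, mount point and dump location are made up, and you'd shrink the underlying partition between the dump and the mkfs):

    # dump the existing filesystem to a file (or tape, or a pipe)
    xfsdump -L data -M backup -f /backup/data.xfsdump /srv/data

    # recreate the filesystem on the (now smaller) partition
    umount /srv/data
    mkfs.xfs -f /dev/sdb1
    mount /dev/sdb1 /srv/data

    # restore the contents into the fresh filesystem
    xfsrestore -f /backup/data.xfsdump /srv/data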


Doesn't LVM solve this? I know this might not work as well in some cases with multi-disk arrays, but I'm not sure I'd call that "casual use".


You mean, by overcommitting the filesystem size, and using discard to free up underlying space?

Possibly... how well does that work in practice? What happens when LVM runs out of space due to some runaway disk-filling process; is the filesystem ready to handle out-of-space errors coming from LVM in all situations?
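For reference, the thin-provisioning variant looks roughly like this (a sketch; the volume group and LV names are made up):

    # carve a thin pool out of an existing volume group
    lvcreate --type thin-pool -L 400G -n pool0 vg0

    # create a thin volume whose virtual size can exceed the pool
    lvcreate --thin -V 1T -n data vg0/pool0

    mkfs.xfs /dev/vg0/data
    mount -o discard /dev/vg0/data /srv/data    # or run fstrim periodically instead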


No, the biggest missing feature is native compression.


There are three filesystems that can do transparent compression: NTFS, ZFS and btrfs. One is for Windows, another isn't available for RedHat/CentOS, and the third is getting sunset. XFS could really benefit from compression support.


ZFS is available in RHEL/CentOS. Not as a first-party option, but thanks to the stable kABI that Red Hat provides, the ZFS on Linux project ships it as a prebuilt binary kernel module, ready to go, without having to fumble around with DKMS and compilers.
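From memory, the flow on CentOS 7 is roughly the following once the ZoL zfs-release repo package is installed (the repo ids are as I recall them from their zfs.repo file, so double-check against the ZoL documentation):

    # switch from the default DKMS repo to the prebuilt kABI-tracking kmod repo
    yum-config-manager --disable zfs
    yum-config-manager --enable zfs-kmod

    # install the prebuilt module plus userspace tools, then load it
    yum install zfs
    modprobe zfs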

Btrfs is being deprecated by RHEL only, because they have XFS expertise but not Btrfs expertise, and there's no need to split limited resources. Other distributions will continue with its development; SUSE doesn't intend to stop.


Btrfs isn't getting sunset, though.

And ZFS can be installed on any Linux distribution.


GP probably meant in RHEL/CentOS. It was deprecated in 7.4 and will be removed in 7.5 (it already is in the beta).

Yeah, I spent one afternoon migrating a CentOS+btrfs machine to XFS.


Which file types are there that benefit from compression but aren't already compressed as part of their format?


The general mix has some potential for compression.

Looking at one machine available:

* / : compression ratio 2.08

* /var/cache : 2.15

* /var/tmp : 1.00

* /home : 1.02

* /srv : 1.10 (few web apps, pgsql instance for these web apps, svn repo, samba shares, local prometheus store)

If your server uses spinning rust, you can also increase effective I/O throughput by using compression: you are trading CPU time for I/O bandwidth. Depending on your workload, it may be a sensible trade-off.


I was thinking more in terms of application data, thank you for giving me this perspective.

But I'm still wondering: does the amount of this data really warrant compression? I mean, the smallest sensible size for an SSD is around 100GB, and it has performance to spare.


In this case, the svn repos and samba shares are way over 100GB, and even with classic HDDs the machine can saturate the network while basically idling. No need for SSDs; classic drives were more cost effective.

The compression was just something nice to have. LZ4 in the filesystem is basically free.


Logfiles. ZFS with lz4 has given me compression ratios of over 100× with /var/log, due to the huge amount of repetition.

My first test of PostgreSQL on ZFS was also quite instructive. lz4 again achieved respectable compression (>10× for my datasets) and improved throughput several fold (with no other tuning!).
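For anyone curious, turning this on is a one-liner per dataset (the pool/dataset names below are made up):

    # compress data written to this dataset from now on with lz4
    zfs set compression=lz4 tank/var/log

    # see how well it is doing
    zfs get compressratio tank/var/log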


I've been having some fun with a revived SGI Octane running IRIX, and it occurred to me, quite out of the blue, that XFS development essentially ceased before SSDs were ever even contemplated. For a few moments, I pondered the apparent profundity of this realisation, and then I moved on, cursing the lack of package management while trying to get something to work.


I worked down the hall from the XFS team at SGI in 2005. Development had definitely not ceased, and their work did not seem heavily related to Octane systems. They were focused on XFS for Suse and RedHat Linux for enterprise and supercomputing.


Ah, you're right up to a point. Around 2014 there was a great flurry of work, which improved its metadata speed by more than 50x, turning it from a great streaming file system into a good all-rounder.


It was a bit before that, IIRC. There's an LWN article summarizing the changes at [1], which are mostly delaylog, if we are referring to the same thing. It was default-enabled in 2011, on Linux 3.3.

... I’ve happened to spend much of the last three weeks learning about XFS.

[1] https://lwn.net/Articles/476263/


Aha, good spot. My boss sent me the video of that presentation.


Ah, good to know; I had given up on XFS before then because the metadata speed made certain SVN operations painful.


pkgsrc is still set up to work on IRIX, although few have used it recently.


IRIX, like all SVR4 systems, has a perfectly adequate package manager.

In the SGI case it is called "inst"


I believe there was a "software manager" or some such.


By the way, I'm using XFS only because it allows for more than 64k hardlinks per file. Strangely, this isn't possible with ext4 out of the box.
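If you want to see the limit for yourself, a quick throwaway experiment on an ext4 mount (file names made up) stops at roughly 65000 links, while the same loop on XFS keeps going:

    touch target
    for i in $(seq 1 70000); do
        ln target "link_$i" || { echo "link creation failed at $i"; break; }
    done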


Just curious, but what use case do you have for such a voluminous quantity of hardlinks?


Basically copying files without actually copying them. For example, when making versioned backups, or when copying a large number of files from production to various development environments.

Something I wish there was a better solution for. Perhaps there is a filesystem out there that supports CoW in a way that fits this use case (XFS perhaps, even), but I haven't looked into it.

The disadvantage of using hardlinks is that you can't hardlink between users (the owner is a property of the file, not of the link to the file), and there's always the danger that a write takes place through one of the links. IMHO, that should really be solved at the filesystem level with a CoW scheme.
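For reference, the usual hardlink-farm setup looks something like this (a sketch; the paths are made up): rsync's --link-dest hardlinks unchanged files against the previous run, and cp -al makes a hardlinked "copy" of a whole tree.

    # versioned backups: files unchanged since 2018-02-28 become hardlinks, changed ones are copied
    rsync -a --link-dest=/backups/2018-02-28 /srv/production/ /backups/2018-03-01/

    # a quick hardlinked "copy" of a tree on the same filesystem
    cp -al /srv/production /srv/dev-copy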


> Basically copying files without actually copying them. For example, when making versioned backups, or when copying a large number of files from production to various development environments.

To me it sounds like you want simple snapshots and backups without the redundancy at the storage level.

So why not use a filesystem which supports that natively, like ZFS or Btrfs?


Do they handle 64k snapshots?


That's one backup every hour for 7+ years. I'm really curious about the use case.

Anyway, theoretically[0]:

> As you might know btrfs treats subvolumes as filesystems and hence the number of snapshots is indeed limited: namely by the size of files. According to the btrfs wiki the maximum filesize that can be reached is 2^64 byte == 16 EiB

But it seems that in practice you hit the mud at ~100 snapshots [1]; be sure to read the reply to that mail though, as it will depend on the use case and it might turn out to be fine well beyond that.

[0]: https://unix.stackexchange.com/questions/140360/practical-li...

[1]: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...


Not sure. I have ZFS as a Docker storage backend and it's yet to crash*

*not the best data point I realise....


> Perhaps there is a filesystem out there that supports CoW in a way that fits this usecase

I'd recommend zfs snapshots - or for something a bit different, maybe nilfs2:

http://www.linux-mag.com/id/7345/


Does btrfs, with cp --reflink, not meet your use case?


Yes, thanks, I found that, but I'm not sure if I'm ready for btrfs yet. I have to look into it, and see if btrfs is sufficiently stable, etc.


reflinks should work on suitably recent XFS as well.
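A sketch of what that looks like (device and file names made up; on the XFS of that era, reflink support has to be enabled at mkfs time):

    # create an XFS filesystem with reflink support and mount it
    mkfs.xfs -m reflink=1 /dev/sdc1
    mount /dev/sdc1 /srv/data

    # "copy" a big file by sharing its blocks; blocks are only duplicated once one side is modified
    cp --reflink=always /srv/data/prod.db /srv/data/dev.db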


For versioned backup you should use reflink rather than hardlink.

A hardlink is not CoW. Modify the file and all your backups will change too.


git clone uses hardlinks to reuse object storage if you clone a local repository. I use it when I need another working copy; that way you save some space, and time as well, since you don't need to pull it all from the network.

edit: Local in this case means cloning within one filesystem. Hardlinks cannot span filesystem boundaries, of course.
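Concretely (paths made up):

    # objects under .git/objects are hardlinked rather than copied, because the source is a local path
    git clone /home/me/src/project /home/me/src/project-release

    # pass --no-hardlinks if you want fully independent copies of the objects
    git clone --no-hardlinks /home/me/src/project /tmp/project-copy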


Just out of curiosity, what is the use case for having multiple local clones? Is it to test run pruning or manipulate the reflog? Or is it for developing on git itself?


I typically have some development ongoing on one clone and use the other to do merging for releases and so on.


Why? Isn't the whole idea of using git that switching branches is just as simple/fast as doing a "cd"?

You're probably not doing your ongoing work in the branch you are merging for release anyway; if you did, you'd have to solve any conflicts all over again when you switch clones.


> When a CoW filesystem writes to a block of data or metadata, it first makes a copy of it

Is this really a precise description? I was sure that (for data) the actual copy only happens in the case of a block with multiple references. If there was a single use of a block, I expected it to be modified in place in most real-world filesystems. The summary on Wikipedia seems to confirm that.

Am I missing something, or is that just unfortunate wording?

What they describe, with the write of the data followed by the write of the indexes as new elements, sounds more like a log-structured filesystem.


No, the copy is generally done every time. In-place writes cause all kinds of problems with atomicity, ordering, torn writes, etc. Also, for ZFS/btrfs/bcachefs-style filesystems, the pointer contains a checksum of the block, so it needs to be updated on every write anyway.


Always copying is what ZFS does as far as I know - so that the final write on raidz can be a pointer swap at the top of the tree.


CoW stands for "Copy on Write"; a system that updates blocks in-place is not CoW.


There are lots of things called copy-on-write where CoW really means copy-on-deduplication, otherwise update-in-place. Like the qcow2 disk image format. Or the Cow type in Rust.


It's an implementation detail that can and does go either way in many real world systems, depending on the technologies involved.


tl;dr: XFS will support using filesystem images as if they were directories, kind of like an "internal loop" mount, which will allow (with the help of copy-on-write data, which recent versions support) having subvolumes/snapshots.
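For comparison, the clumsy way to get something similar today is a plain loop mount of an image file (paths made up); the proposal is for XFS to handle this natively, without a separate loop device:

    # create a 1 GiB image file, format it as XFS, and mount it through a loop device
    truncate -s 1G /srv/images/subvol.img
    mkfs.xfs /srv/images/subvol.img
    mount -o loop /srv/images/subvol.img /srv/subvol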


It might just be how you present it, but to me that sounds like using multiple layers of hacks to implement use-cases other filesystems were carefully designed to support in the first place, and that sounds extremely unreliable and brittle.

I'll stick to ZFS :)


Yeah, these subvolumes are going to have scalability issues compared with cleanly designed subvolumes such as ZFS's. But I wouldn't describe them as a hack - it's a rather interesting feature that no other filesystem has explored before. I would describe it as "loop devices done well". I don't think reliability will be a problem; to the upper layer these embedded filesystems are in fact just files.


Nice, can't wait for the next version of XFS from Red Hat.





