Here's a funny story. At one point bcache development was funded by a startup (which I won't name here). They were using it as the local storage layer of a distributed storage product. I worked there for a year in 2014.
Apparently they were not aware of the fact that bcache was a) GPLd code, or b) developed before the company existed, first as a hobby project and then at Google. After a couple of years, they noticed that Kent was in fact posting the bcache source code on his personal web site. At this point they fired him and threatened to sue. I quit the company then (along with a number of other people, for mostly unrelated reasons, such as the fact that the CTO was a notorious brogrammer). Kent got a litigator and when it was made very clear to them that they had no case, they backed down, but not before wasting a ton of money.
As far as I know, they're still actively violating the GPL by shipping a product containing modified kernel code in it without releasing the source, nor do they acknowledge that they did not develop the key component of their product.
The "commercial" version had a rather broken and messy snapshots implementation and had diverged a bit from the open source bcachefs at that point, mostly because snapshots were poorly implemented. It's also kind of funny because after we left the company we still knew of some tricky data corruption bugs, and it's likely they're still there in the "commercial" version, because backporting the latest fixes would be non-trivial and I don't think their testing or development methodology would have caught them.
Anyway, I gave up on startups and enterprise storage after this, but Kent is still developing bcachefs on his own time and money, so if you use it please consider donating some money to support its development.
His Patreon page[1] shows he only receives $762 in donations a month, less than a third of what he needs to keep from eating into his personal savings.
Sad given how much a modern filesystem would help Linux : (
I've been using btrfs since about 2011, and since about 2014 I've stopped using ext4 / xfs / zfs everywhere.
From 2012-2014 it was mostly breakage every other month. From 2014-2016, it was semi-annual issues.
For the last ~18 months I have had ~30 machines running btrfs with no issues, some servers, some personal computers. The release notes are boring, the bugs are boring, and to me it's definitely in a state where I would strongly consider trusting it with any workload.
I worry that btrfs is just going to remain doomed by reputation: it wasn't stable half a decade ago, so people assume it can't possibly be stable now. But so much work has been put into it to get it where it is, and in my experience it is pretty damn mature now. All I want to see is another year and a half of perfect stability before I would start arguing to drop zfs entirely.
Are you running BTRFS with its built in RAID? That's been the biggest blocker for me. There have been numerous RAID bugs that have caused data-loss and I believe at least one of them is still unpatched.
My main issue isn't actually a single thing that's wrong -- it's the completely and utterly haphazard way many of the features in btrfs have been "designed"[1]. Some of the problems they've had seem, to me at least, to stem from a fundamental lack of a coherent design. That does not bode well for stability, even 10(?) years after its first version.
bcachefs seems to have a much more coherent design.
[1] "Oh, yeah, I don't know how to handle this code path yet, let's stick a BUG_ON in there! I'm sure we'll figure something out later."
As far as I know they consider RAID0, 1, and 10 to be stable. Last time I used it, rebuilds were substantially slower than ZFS or mdraid. Rebuild performance seems to be one of a few issues that BTRFS has had trouble solving. RAID 5 and 6 were declared stable last year, only to have that retracted when some fatal flaw was discovered that would apparently cause data loss if you needed to rebuild.
It's mostly true but the issues as far as I know are:
- RAID 1 with more than 2 disks is not what you think it is: the data will be mirrored but only once, no matter how many disks you have (meaning if you have a mirror with 3 disks, you only have 2 copies of your data). Because in BTRFS lingo, RAID 1 means '2 copies of the data' https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_dif... which is not what people expect from RAID 1 with more than 2 disks
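You can see this for yourself on a throwaway filesystem. A quick sketch (device names and mountpoint are just placeholders):

    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
    mount /dev/sdb /mnt
    btrfs filesystem df /mnt   # reports "Data, RAID1": still only 2 copies of each chunk, despite the 3 devices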
I've been using btrfs since about 2014 on CentOS7. I only use it with mirroring+compression. No snapshots. I mostly went with it for data scrubbing and compression.
My experience has been mixed, but I haven't had any data loss. There was a bug for a while regarding free space, so occasionally the system would seem to be full when it wasn't... and it was a real pain to correct.
I now have a cron job that does a monthly btrfs balance along with a mount -oremount,clear_cache. I also run the latest kernels from http://elrepo.org instead of the CentOS7 kernels so that I get the latest patches.
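For anyone who wants to copy this, the whole job is only a couple of lines. A rough sketch; the 75% usage threshold and the file name are arbitrary choices on my part:

    #!/bin/sh
    # e.g. /etc/cron.monthly/btrfs-maint
    # a filtered balance only rewrites chunks that are no more than 75% full,
    # so it finishes much faster than a full balance
    btrfs balance start -dusage=75 -musage=75 /
    # clear_cache throws away the free space cache so it gets rebuilt
    mount -o remount,clear_cache /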
ZFS is not GPL? I dunno, I like feeling safe when I do kernel upgrades knowing even if for whatever reason my ZFS module doesn't compile/work under the new kernel, I won't be left without a root FS. I have been running BTRFS for 5 years with very little issues, and enjoy not having to compile a new dkms module with every kernel.
ZFS with DKMS is a disaster, at least in my experience. Honestly, I can't recommend ZoL unless you're running a distro with relatively stable kernel releases that don't change substantially or that happens to be supported by ZoL with binary packages. ZoL on Arch was... trying at times. It worked great, but my paranoia meant that I ended up adding the kernel to IgnorePkg to force manual kernel updates (mostly for my own memory). But then, it also meant having to build all of the ZFS packages (including SPL) tied to that specific version. This usually meant waiting until the AUR packages were updated as I figured that indicated someone else must have tested ZFS on that specific kernel version.
I remember thinking DKMS might solve the problem, but I ended up having to use recovery media just to get an environment to reinstall an older kernel and let DKMS do its thing after a botched update started provoking panics. I suspect a version mismatch based on the errors but never investigated it beyond fixing the problem and moving to the prebuilt modules. Things may have changed, but the Arch ZFS+DKMS packages were a bit flaky and required some manual modification just to boot (should've taken this as a warning!).
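For reference, the recovery dance once you're booted into a working kernel is roughly this (versions here are only examples, and spl has to be built before zfs):

    dkms status                                  # see which module/kernel combinations are built
    dkms install spl/0.6.5.9 -k "$(uname -r)"
    dkms install zfs/0.6.5.9 -k "$(uname -r)"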
Granted, it was my fault entirely for being a bit too enthusiastic with ZFS on Arch. To be honest, if I were to use it again, it would be on FreeBSD. Not Linux. I recognize it's fine for other people, but in my use case it wasn't.
It is more like Arch Linux is a disaster. Upgrading the kernel package replaces the current one! Come on, any distribution worth its salt installs new versions alongside the old ones and lets you select any of them at the boot screen. This is a ridiculous packaging policy regardless of ZFS or any other DKMS modules.
And I acknowledged that using it on a different distro would be more advisable, although I still stand by my claim that FreeBSD is far more appropriate for ZFS.
I will, however, agree that having no fallback to the prior kernel version is a problem. In practice, it's never caused me much trouble except when I do something stupid like using ZFS from the AUR. initrd generation has historically seemed to be more problematic under Arch, but I'd argue that's mostly fixed with install hooks.
In all honesty, it was probably more the fault of the zfs-dkms packages than it was either the kernel packaging policy or ZoL+DKMS itself (for reasons I elaborated on in my original post).
But, that's also what you get when you use packages from the AUR or using a distro like Arch for something that really only benefits from a wider installation base (like Ubuntu does, for instance).
I know you acknowledged that using it on a different distro would be more advisable, I just wanted to vent about more broad issue of their packaging policy. Sorry if it wasn't clear.
I do agree. There are circumstances where Arch's packaging is brain dead (they only recently, within the last 2 years or so, started validating packages against signatures!). I use it for a number of applications, and as my desktop OS among others. However, I'll freely admit at least part of my choice is perhaps the fault of masochistic tendencies. After all, I migrated to Arch from Gentoo, and I used Gentoo for years! :)
In all honesty, I've been bit more by the initrd and mkinitcpio's failings than the lack of a fallback kernel. That's mostly fixed with packaging hooks that essentially guarantee it will run, but it's still a problem with the ZFS packages and may require running it manually (which is annoying). However, that wasn't always the case, and sometimes the generated initrd would be missing something important. You can imagine what happened next.
Indeed. Arch is good for a few things, but sometimes stability isn't one of them when it comes to unsupported packages. ;)
I'm not sure I'd be brave enough to run ZoL again, but given Ubuntu's FAR wider install base and availability of binary packages, it's the better option if you have to choose.
My personal preference would be to stick with ZFS on FreeBSD. Performance is probably better.
I'm actually surprised by this, but I'd wager that you also didn't use the DKMS AUR packages. I also suspect you wait for the zfs-linux (etc) packages to match the kernel version before updating. Or you manually bump the kernel version and build it, hoping for the best (edgy!).
I considered it, but I have some problems with the Arch LTS release cycle. If I were to choose an LTS kernel, why not just dump Arch and go with an Ubuntu LTS, which has better long term support?
The other problem is that at the time, the ZFS packages for LTS were pinned at a version that had a known issue with arc_reclaim encountering a deadlock essentially causing the file system to become unresponsive after a substantial transfer (think rsync).
Now, obviously, it wouldn't be that difficult to modify the PKGBUILD to pull a newer version of ZFS, but there's a point in time where the maintenance required to update starts to outweigh whatever benefit you can glean from the LTS kernel.
That's not the case now since the LTS packages appear to be at v0.6.5.9, which has the fixes, but I don't remember this being true about a year ago.
BTRFS lets you make CoW copies of files. You can even retroactively merge the blocks that store identical files. BTRFS also makes it not a giant pain to remove a file from snapshots.
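Concretely, the CoW copy is a one-liner, and the retroactive merge needs a userspace tool; duperemove is one that drives the btrfs dedup ioctl. Paths here are made up:

    cp --reflink=always disk.img disk-copy.img   # CoW copy: shares all blocks until one side is modified
    duperemove -dr /data                         # scan recursively and merge duplicate extents in place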
ZFS does seem to work better overall, but I wouldn't call either filesystem great at this point in time.
It's a little black magic -- didn't have time to completely research (this is a home NAS system.)
I believe there was a bug in the free space cache. This could cause the system to think it didn't have free blocks... you'd have to add another device to create more space in order to rebalance and fix it.
Eventually I saw a bug fix report about a corruption in the cache... I never investigated to see if my current kernel has the fix.
I'm using BTRFS on many VMs running Ubuntu servers, and I find that expanding the virtual hard-disk, on Proxmox, without stooping the VM it's far trivial.
I think you missed a key word there so the meaning is lost. Is it "far from trivial" or "far more trivial"? You also surely meant stopping, not stooping.
I purposefully ran btrfs on a malfunctioning drive for over a year (kernel 3.12, only metadata dup), much more reliable than ext4 which would lock up the entire filesystem on read/write failure and often go read-only, with btrfs the only visible signs of malfunction were dmesg and the scrub log.
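For anyone who hasn't used it, the scrub side of that is just two commands (the mountpoint here is my root filesystem):

    btrfs scrub start /    # read and verify every block in the background
    btrfs scrub status /   # running totals, including read and checksum errors found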
Also been using it as / since 2012 with no issues.
It is OK for a single-disk FS. It is no ZFS though, which is its largest problem; people keep marketing it as "Linux's answer to ZFS".
No, it is not.
Maybe some day, but today it is no ZFS. I love zfs..
Further, the ZFS utilities are far easier to use and understand. The zfs and zpool commands are well documented and intuitive. The btrfs utilities are not, IMO.
I am fine with using btrfs as a replacement for ext on my OS drive, but for my large multi-disk data arrays it's ZoL all the way.
Likewise, I've run it on hundreds of machines over the past three years without issue. I do continue to use EXT4 for database hosts as it far outperforms BTRFS with PostgreSQL from what I've seen.
How is the speed now? I never had data issues but I definitely had speed issues. Btrfs was really slow compared to ext4/xfs back when I tried it. And I mean orders of magnitude slow. I had an application that did a lot of disk access, and switching off of btrfs brought the runtime down from a week to just hours. I want to like btrfs, but after that I just can't trust it for high disk load situations.
I had a bad experience with it around 2014-15 on Ubuntu. 3 different laptops in my house suddenly stopped working (didn't boot at all) within the span of a year; 1 laptop hit the problem multiple times. In all cases, I had to format the root partition and reinstall Linux. All 3 had btrfs for / in common.
I moved back to EXT4 and it never happened again since then.
>Snapshot implementation has been started, but snapshots are by far the most complex of the remaining features to implement
Snapshots are the #1 feature of COW filesystems. I've been using them for a bit in btrfs and this feature is game-changing (and no, it hasn't eaten my data yet).
OpenSuSE uses btrfs by default and relies on it for one of its killer features. Before a system change, such as installing updates or changing service configuration, is made using YaST, snapper takes a snapshot of the root filesystem. If something breaks, just roll back to the previous working state.
I'd say that for the OpenSuSE folks, btrfs falls squarely into the "good enough" category.
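The rollback itself is pleasantly boring, something like this (the snapshot number is of course just an example):

    snapper list          # show the pre/post snapshots YaST and zypper have been taking
    snapper rollback 42   # roll the root filesystem back to snapshot 42; takes effect on reboot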
I'm a big opensuse fan, but none of my opensuse machines are running btrfs. Although, besides the stability (which can't really be much worse than ext4/xfs, which is what I apparently chose on those two machines), I think what drove me nuts about opensuse's use of btrfs was all the subvolumes. They're cool, but just another thing for me to deal with, and my computing philosophy for the past few years can be summarized as "KISS, unless it's really hurting".
IIRC, doesn't BTRFS also allow you to do cool things like change RAID levels dynamically? (E.g. You can be running a 2-disk RAID-1 array, pop in another disk and tell BTRFS to make it a RAID-5 array instead, then a year later pop in 2 more disks and switch to RAID6, all with no downtime.) I imagine that wouldn't be possible if you were doing RAID at a different layer.
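(If I remember correctly, the conversion is just a rebalance with convert filters, roughly like this; the device and mountpoint are made up:

    btrfs device add /dev/sdd /mnt
    btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt   # rewrite data as RAID5, keep metadata RAID1

and the balance runs while the filesystem stays mounted and in use.)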
Linux's built-in software RAID (implemented at the block level) has supported online RAID level changes for many years. Check out the "grow mode" section of the mdadm(8) man page sometime; it goes into great detail about which operations are supported.
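A sketch of the kind of reshape it describes, e.g. turning a 2-disk RAID1 into a 3-disk RAID5 with the array online (device names are examples):

    mdadm /dev/md0 --add /dev/sdd              # add the new disk as a spare first
    mdadm --grow /dev/md0 --level=5            # convert the mirror into a 2-disk RAID5
    mdadm --grow /dev/md0 --raid-devices=3     # then reshape onto the third disk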
Yeah, shout out to mdadm and the md driver. I've been using it for years and years and it's rock solid. Being able to grow arrays (online) and convert plain drives into mirror sets is great. I feel that md and lvm are underappreciated by a large number of Linux users…
Well, there are other reasons; you want to write code that operates on the data, and neither the code nor the data fits on a single machine - you have to target an abstraction which spans machines. Block storage is too low-level an abstraction.
That isn't to say that using high performance block storage isn't still a win even when the redundancy is multiplied at a higher level. The higher level redundancy is also about colocating more data with the code - i.e. it's not just redundant for integrity, but to increase the probability it's close to the code.
Of course. Most production monoliths are deployed on networked block storage - aka SAN - and NUMA is already structurally distributed memory, even on a single box. But it's not the right paradigm to scale well, no more than chatty RPC that pretends the network doesn't exist is the right way to design a distributed system.
I think the high profile ones (Google, Amazon etc) use relatively dumb OS drivers and do the fancy distributed FS abstraction stuff in userspace. Certainly stuff like Ceph and Gluster don't have very good reputations and are mostly sold to relatively clueless "enterprise" customers.
According to Kent on Reddit, bcachefs technically had online defrag since it was just bcache -- via the copying garbage collector for reclaiming space. So bcachefs will simply inherit that feature by design, which is great.
Apparently copygc is off right now because reasons, though (I'm going to assume it's almost certainly the related extent/compression issue that's holding this up from being enabled, which you can see referenced on the home page, at the bottom).
I should have been more clear - copygc is off by default in upstream bcache, it's on in bcachefs (and required, in order to guarantee a capacity when doing random writes)
I think ZFS is the only commercially viable open source CoW total storage management option. These new Linux filesystems are way too late to the party, and it will take a decade for them to reach maturity even once they hit basic 1.0 feature parity.
In parallel I see XFS as the long term evolution for Linux file systems. It will continue to scale slightly up from where it sits today and address fail in place, flash, metadata checksums, snapshots etc where total storage management is done by overlays like HDFS, object stores, etc.
I think ZFS is fantastic for businesses but there are a couple places where it falls short compared to bcachefs for me:
- For non-business users who want a RAID, ZFS is too inflexible. You can't add or remove disks to a RAIDZ vdev. If you want the space efficiency of RAIDZ, you have to expand your array in units of entire vdevs. If you want replicas, you have to expand in at least pairs of disks. BTRFS and bcachefs both allow you to replicate more flexibly and reshape your array.
- ZFS doesn't work particularly well with SSDs as caches. ZIL and L2ARC are nice but they're not as nice as a full bcache-style tiering setup. bcachefs tiers let you do crazy things like a 4-tier storage setup with Nearline HDD -> 15k SAS HDD -> SATA SSD -> NVMe SSD.
- ZFS is pretty complex to manage in general and major features like ZIL and L2ARC are arcanely documented. So far, bcachefs is pretty straightforward to use.
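To be fair, bolting the SSDs onto an existing pool is at least short to type; it's knowing when a SLOG or L2ARC actually helps that is poorly documented. Pool and device names below are made up:

    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1   # dedicated (mirrored) ZIL device, a.k.a. SLOG
    zpool add tank cache /dev/sdb                          # L2ARC read cache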
While I really like these sort of file systems, I'm not holding my breath.
This isn't a simple filesystem project, but plays in the next-gen space ZFS opened up.
There will be a lot to do, especially IO scheduling, RAID safety with shitty drive firmwares, consistency guarantees with fsync/partial flushes etc.
I'm pessimistic about it being mainlined in the near future; the core team will be wary of a second btrfs.
What I would like to see is a APFS/exFAT crossover with COW and data checksums without all the volume mgmt with ports for all possible operating systems so everyone can use it for their SDcards, usb-sticks and external drives without making tradeoffs and using fuse.
> with COW and data checksums without all the volume mgmt
The fact that the raidz volume is not an opaque block device allows ZFS to be aware of data corruption when comparing checksums and self heal if the data can be re-constructed from the array.
I'm not saying any attempt at a new filesystem should have to bundle the two layers together, but they should allow for communication between the abstractions.
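That communication is what makes a scrub on raidz useful: it doesn't just detect the bad copy, it rewrites it from redundancy. Roughly (the pool name is an example):

    zpool scrub tank        # read everything, verify checksums, repair from redundancy where possible
    zpool status -v tank    # per-device read/write/checksum error counters, plus any unrecoverable files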
> What I would like to see is a APFS/exFAT crossover with COW and data checksums without all the volume mgmt with ports for all possible operating systems so everyone can use it for their SDcards, usb-sticks and external drives without making tradeoffs and using fuse.
+1. Filesystems without bit-rot protection on flash drives are going to become at least as big a problem as optical disc rot.
What's the problem with fuse? It allows sharing code between Linux, OS X, (Free)BSD and even Windows (via dokan).
Yes, it will not offer you the same performance as an in-kernel driver (due to context switches), but given that CPU power always increases, no big problem there.
> Yes, it will not offer you the same performance as an in-kernel driver (due to context switches), but given that CPU power always increases, no big problem there.
This might be the case if you're running something incredibly easy on I/O like large sequential read/writes, but if you do anything at all challenging on I/O like opening desktop applications (Photoshop, lots of random reads), editing or viewing high bitrate video (very high throughput) or god forbid running a database, this is a huge problem.
2. Support varies between OSes. For example OpenBSD's FUSE does not have the default_permissions/allow_other flags, which makes for example encfs (and any other virtual filesystems that are backed by multiple files) a pain to use since OpenBSD 6.0 removed user mounting.
I'm more optimistic. What you may or may not know is that bcachefs is a tweak on bcache, which has already been mainlined and is pretty stable (I've personally been running bcache for a couple years on my home linux machine).
The point is a lot of the things you bring up are already covered by bcache. Bcachefs "just" adds a filesystem layer on the bcache tree structure.
If you think we need an alternate effort and/or competition to build an advanced, native filesystem for Linux (I do), please consider a subscription on Patreon (https://www.patreon.com/bcachefs). Kent has a long history of shipping sophisticated, high-quality code.
Plus he is willing to help out when you need to nail down a bug, as I recently discovered with bcache. My first Linux kernel patch might be a fix of a deadlock in bcache :-)
Chris Mason and the btrfs team are clearly talented. But the initial excitement of btrfs has sadly dissipated and its promise as the next generation Linux fs remains unrealised. It now feels a bit jaded and the momentum is spent.
I suspect many have lost patience with the promise of COW and unfortunately for bcachefs this history will cast a shadow on its development and potential.
Database performance remains problematic on COW, and while things like snapshots and ad hoc disk and volume management are interesting, even exciting, one soon realises that unless one has a pressing need they are just nice to have. Eventually boring ext4 ticks all the boxes and one may as well forget about the fs and focus elsewhere.
I don't think COW in general is a big issue for databases. You can get pretty good performance out of ZFS (very stable and consistent behavior), for example. The COW is not free, of course, but you get interesting features in return, and if you need them (e.g. snapshots), it's usually much better than LVM + a non-COW filesystem.
The fact that some COW filesystems perform poorly does not mean all COW filesystems do.
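For what it's worth, the usual database tuning on ZFS is little more than matching the recordsize to the database page size and leaning on snapshots for quick rollbacks. A sketch with made-up dataset names (8k matches PostgreSQL's page size):

    zfs create -o recordsize=8k tank/pgdata
    zfs snapshot tank/pgdata@pre-migration
    zfs rollback tank/pgdata@pre-migration   # near-instant rollback if the migration goes sideways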
This takes me back. 9 years ago I was playing around with ZFS COW and OS X sparse bundle containers to host disk images for multiple "versions" (exploiting CoW) of the same VM image. I wrote up an article on what I was doing [1]. Never persevered though, as it was a bit too fragile (at that time ZFS on OS X was not at all ready for prime time).
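The ZFS half of it was really just snapshot + clone, something like this (dataset names invented for illustration):

    zfs snapshot tank/vms/base@gold
    zfs clone tank/vms/base@gold tank/vms/scratch1   # instant CoW copy; only divergent blocks use new space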
Funny, but every so often I wonder what it might be like in a parallel world where Apple bought Sun instead of Oracle.
I'm looking forward to BcacheFS. ZFS on Linux is great when it works well, but it's an absolute pain when it breaks. Not only does it taint the kernel, but it doesn't really mesh very well in the kernel due to the usage of the SPL -- a layer used to convert Linux APIs to Solaris Kernel APIs. In addition, ZFS doesn't use as much native Linux memory management as I'd like, instead it manages its own pool of memory. This makes troubleshooting more difficult. This mechanism is further aggravated with the use of the kmem cgroup.
For example, if you have a dirty page in a cgroup, and the cgroup OOMs, the kernel will trigger writes. If any of these writes require memory allocations, they'll probably fail since the current cgroup is OOM. ZFS subsequently gets stuck in an infinite loop, and locks up. See: https://github.com/zfsonlinux/zfs/issues/5535
I understand that a lot of the ZFS work comes from LLNL & government funding. I'm not blaming them, as it works for their use case of machines that are running dedicated, controlled workloads.
We're experimenting with Btrfs, and we'll see how it goes.
You probably shouldn't. It's ready for adventurous testers, and is pretty stable, but unless you're willing to report bugs or hack on it, you should probably stay away.
There are reasons to still want it, despite its newness; for example, the latest updates bring huge improvements in metadata efficiency (low metadata overhead -> more metadata in the cache -> larger working set). Someone on the IRC channel reported it's somewhere around 20x faster than most filesystems when it comes to "iterate millions of files recursively", blowing everything else out of the water. (This seems somewhat synthetic, and I'd say it mostly is -- but OTOH, "tons of files in a directory" being really slow is life, and has bitten me multiple times in a prior job). In general, improved metadata efficiency helps everywhere, though. For example, if you're doing backups on a really big filesystem recursively, you'll have to traverse the metadata inodes a lot to get e.g. last modified time. bcachefs will likely do awesome here in terms of performance.
Another unique feature I recall is that it has very, very good tail latency -- bcachefs almost never blocks on I/O unnecessarily, so you don't get random 'lag spikes' when things like the page cache get flushed out (which may halt some other I/O ops). This makes the system feel much more consistent in general.
There's lots of good info in the architecture document and Patreon posts from Kent:
I spent the last week testing ext4/btrfs/zfs on Linux and I found that zfs is rather slow and btrfs has improved its performance a lot in recent years (I should refine the script a bit, upload some graphs and make a post).
I like that we are seeing competition in this space. I think it's good for business.
I do however see some big red flags in the linked page:
> Starting from there, bcachefs development has prioritized incremental development, and keeping things stable, and aggressively fixing design issues as they are found
From what the developer has stated on reddit, it's more like he wants to aggressively make changes on the filesystem right now, before any attempt at mainlining into the kernel, to not end up like btrfs, which in his view, was mainlined prematurely.
You know, honestly, I've always wondered why prisons don't have a bunch of computers for retraining. I mean, if there's a chance of conspiracy a specific inmate shouldn't have access, but your run-of-the-mill street thug would probably really benefit from learning Linux system administration, web site building, coding... It would probably be a lot easier to find gainful employment in a high-demand field and would help break the cycle. If they were allowed to work, the people with longer sentences could help break the cycle for their dependants as well.
TRIM isn't super important given bcache's write pattern (sequential writes to large aligned blocks). It doesn't do random in-place overwrites of small blocks.
bcache originally supported discard/TRIM commands (toggled by mount options), but taking a quick look, it might have been removed in bcachefs in the course of development.
I imagine ultimately TRIM will be supported, though (I don't see a reason why it wouldn't be, and considering Kent is focused on hammering out the design I imagine it'll inevitably fit in well).