Here's a funny story. At one point bcache development was funded by a startup (which I won't name here). They were using it as the local storage layer of a distributed storage product. I worked there for a year in 2014.
Apparently they were not aware of the fact that bcache was a) GPLd code, or b) developed before the company existed, first as a hobby project and then at Google. After a couple of years, they noticed that Kent was in fact posting the bcache source code on his personal web site. At this point they fired him and threatened to sue. I quit the company then (along with a number of other people, for mostly unrelated reasons, such as the fact that the CTO was a notorious brogrammer). Kent got a litigator and when it was made very clear to them that they had no case, they backed down, but not before wasting a ton of money.
As far as I know, they're still actively violating the GPL by shipping a product containing modified kernel code in it without releasing the source, nor do they acknowledge that they did not develop the key component of their product.
The "commercial" version had a rather broken and messy snapshots implementation and had diverged a bit from the open source bcachefs at that point, mostly because snapshots were poorly implemented. It's also kind of funny because after we left the company we still knew of some tricky data corruption bugs, and it's likely they're still there in the "commercial" version, because backporting the latest fixes would be non-trivial and I don't think their testing or development methodology would have caught them.
Anyway, I gave up on startups and enterprise storage after this, but Kent is still developing bcachefs on his own time and money, so if you use it please consider donating some money to support its development.
His Patreon page[1] shows he only receives $762 in donations a month, less than a third of what he needs to keep from eating into his personal savings.
Sad given how much a modern filesystem would help Linux : (
I've been using btrfs since about 2011, and since about 2014 I've stopped using ext4 / xfs / zfs everywhere.
From 2012-2014 it was mostly breakage every other month. From 2014-2016, it was semi-annual issues.
For the last ~18 months I have had ~30 machines running btrfs with no issues, some servers, some personal computers. The release notes are boring, the bugs are boring, and to me it's definitely in a state where I would strongly consider trusting it with any workload.
I worry that btrfs is just going to remain doomed by reputation: it wasn't stable half a decade ago, so people assume it can't possibly be stable now. But so much work has been put into it to get it where it is, and in my experience it is pretty damn mature now. All I want to see is another year and a half of perfect stability before I would start arguing to drop zfs entirely.
Are you running BTRFS with its built in RAID? That's been the biggest blocker for me. There have been numerous RAID bugs that have caused data-loss and I believe at least one of them is still unpatched.
My main issue isn't actually a single thing that's wrong -- it's the completely and utterly haphazard way many of the features in btrfs have been "designed"[1]. Some of the problems they've had seem, to me at least, to stem from a fundamental lack of a coherent design. That does not bode well for stability, even 10(?) years after its first version.
bcachefs seems to have a much more coherent design.
[1] "Oh, yeah, I don't know how to handle this code path yet, let's stick a BUG_ON in there! I'm sure we'll figure something out later."
As far as I know they consider RAID0, 1, and 10 to be stable. Last time I used it, rebuilds were substantially slower than ZFS or mdraid. Rebuild performance seems to be one of a few issues that BTRFS has had trouble solving. RAID 5 and 6 were declared stable last year, only to have that retracted when some fatal flaw was discovered that would apparently cause data loss if you needed to rebuild.
It's mostly true but the issues as far as I know are:
- RAID 1 with more than 2 disks is not what you think it is: the data will be mirrored but only once, no matter how many disks you have (meaning if you have a mirror with 3 disks, you only have 2 copies of your data). Because in BTRFS lingo, RAID 1 means '2 copies of the data' https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_dif... which is not what people expect from RAID 1 with more than 2 disks
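You can see this for yourself on a throwaway filesystem. A quick sketch (device names and mountpoint are just placeholders):

    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
    mount /dev/sdb /mnt
    btrfs filesystem df /mnt   # reports "Data, RAID1": still only 2 copies of each chunk, despite the 3 devices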
I've been using btrfs since about 2014 on CentOS7. I only use it with mirroring+compression. No snapshots. I mostly went with it for data scrubbing and compression.
My experience has been mixed, but I haven't had any data loss. There was a bug for a while regarding free space, so occasionally the system would seem to be full when it wasn't... and it was a real pain to correct.
I now have a cron job that does a monthly btrfs balance along with a mount -oremount,clear_cache. I also run the latest kernels from http://elrepo.org instead of the CentOS7 kernels so that I get the latest patches.
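For anyone who wants to copy this, the whole job is only a couple of lines. A rough sketch; the 75% usage threshold and the file name are arbitrary choices on my part:

    #!/bin/sh
    # e.g. /etc/cron.monthly/btrfs-maint
    # a filtered balance only rewrites chunks that are no more than 75% full,
    # so it finishes much faster than a full balance
    btrfs balance start -dusage=75 -musage=75 /
    # clear_cache throws away the free space cache so it gets rebuilt
    mount -o remount,clear_cache /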
ZFS is not GPL? I dunno, I like feeling safe when I do kernel upgrades knowing even if for whatever reason my ZFS module doesn't compile/work under the new kernel, I won't be left without a root FS. I have been running BTRFS for 5 years with very little issues, and enjoy not having to compile a new dkms module with every kernel.
ZFS with DKMS is a disaster, at least in my experience. Honestly, I can't recommend ZoL unless you're running a distro with relatively stable kernel releases that don't change substantially or that happens to be supported by ZoL with binary packages. ZoL on Arch was... trying at times. It worked great, but my paranoia meant that I ended up adding the kernel to IgnorePkg to force manual kernel updates (mostly for my own memory). But then, it also meant having to build all of the ZFS packages (including SPL) tied to that specific version. This usually meant waiting until the AUR packages were updated as I figured that indicated someone else must have tested ZFS on that specific kernel version.
I remember thinking DKMS might solve the problem, but I ended up having to use recovery media just to get an environment to reinstall an older kernel and let DKMS do its thing after a botched update started provoking panics. I suspect a version mismatch based on the errors but never investigated it beyond fixing the problem and moving to the prebuilt modules. Things may have changed, but the Arch ZFS+DKMS packages were a bit flaky and required some manual modification just to boot (should've taken this as a warning!).
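For reference, the recovery dance once you're booted into a working kernel is roughly this (versions here are only examples, and spl has to be built before zfs):

    dkms status                                  # see which module/kernel combinations are built
    dkms install spl/0.6.5.9 -k "$(uname -r)"
    dkms install zfs/0.6.5.9 -k "$(uname -r)"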
Granted, it was my fault entirely for being a bit too enthusiastic with ZFS on Arch. To be honest, if I were to use it again, it would be on FreeBSD. Not Linux. I recognize it's fine for other people, but in my use case it wasn't.
It is more like Arch Linux is a disaster. Upgrading the kernel package replaces the current one! Come on, any distribution worth its salt installs new versions alongside the old ones and lets you select any of them at the boot screen. This is a ridiculous packaging policy regardless of ZFS or any other DKMS modules.
And I acknowledged that using it on a different distro would be more advisable, although I still stand by my claim that FreeBSD is far more appropriate for ZFS.
I will, however, agree that having no fallback to the prior kernel version is a problem. In practice, it's never caused me much trouble except when I do something stupid like using ZFS from the AUR. initrd generation has historically seemed to be more problematic under Arch, but I'd argue that's mostly fixed with install hooks.
In all honesty, it was probably more the fault of the zfs-dkms packages than it was either the kernel packaging policy or ZoL+DKMS itself (for reasons I elaborated on in my original post).
But, that's also what you get when you use packages from the AUR or using a distro like Arch for something that really only benefits from a wider installation base (like Ubuntu does, for instance).
I know you acknowledged that using it on a different distro would be more advisable, I just wanted to vent about more broad issue of their packaging policy. Sorry if it wasn't clear.
I do agree. There are circumstances where Arch's packaging is brain dead (they only recently, within the last 2 years or so, started validating packages against signatures!). I use it for a number of applications, and as my desktop OS among others. However, I'll freely admit at least part of my choice is perhaps the fault of masochistic tendencies. After all, I migrated to Arch from Gentoo, and I used Gentoo for years! :)
In all honesty, I've been bit more by the initrd and mkinitcpio's failings than the lack of a fallback kernel. That's mostly fixed with packaging hooks that essentially guarantee it will run, but it's still a problem with the ZFS packages and may require running it manually (which is annoying). However, that wasn't always the case, and sometimes the generated initrd would be missing something important. You can imagine what happened next.
Indeed. Arch is good for a few things, but sometimes stability isn't one of them when it comes to unsupported packages. ;)
I'm not sure I'd be brave enough to run ZoL again, but given Ubuntu's FAR wider install base and availability of binary packages, it's the better option if you have to choose.
My personal preference would be to stick with ZFS on FreeBSD. Performance is probably better.
I'm actually surprised by this, but I'd wager that you also didn't use the DKMS AUR packages. I also suspect you wait for the zfs-linux (etc) packages to match the kernel version before updating. Or you manually bump the kernel version and build it, hoping for the best (edgy!).
I considered it, but I have some problems with the Arch LTS release cycle. If I were to choose an LTS kernel, why not just dump Arch and go with an Ubuntu LTS, which has better long term support?
The other problem is that at the time, the ZFS packages for LTS were pinned at a version that had a known issue with arc_reclaim encountering a deadlock essentially causing the file system to become unresponsive after a substantial transfer (think rsync).
Now, obviously, it wouldn't be that difficult to modify the PKGBUILD to pull a newer version of ZFS, but there's a point in time where the maintenance required to update starts to outweigh whatever benefit you can glean from the LTS kernel.
That's not the case now since the LTS packages appear to be at v0.6.5.9, which has the fixes, but I don't remember this being true about a year ago.
BTRFS lets you make CoW copies of files. You can even retroactively merge the blocks that store identical files. BTRFS also makes it not a giant pain to remove a file from snapshots.
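Concretely, the CoW copy is a one-liner, and the retroactive merge needs a userspace tool; duperemove is one that drives the btrfs dedup ioctl. Paths here are made up:

    cp --reflink=always disk.img disk-copy.img   # CoW copy: shares all blocks until one side is modified
    duperemove -dr /data                         # scan recursively and merge duplicate extents in place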
ZFS does seem to work better overall, but I wouldn't call either filesystem great at this point in time.
It's a little black magic -- didn't have time to completely research (this is a home NAS system.)
I believe there was a bug in the free space cache. This could cause the system to think it didn't have free blocks... you'd have to add another device to create more space in order to rebalance and fix it.
Eventually I saw a bug fix report about a corruption in the cache... I never investigated to see if my current kernel has the fix.
I'm using BTRFS on many VMs running Ubuntu servers, and I find that expanding the virtual hard-disk, on Proxmox, without stooping the VM it's far trivial.
I think you missed a key word there so the meaning is lost. Is it "far from trivial" or "far more trivial"? You also surely meant stopping, not stooping.
I purposefully ran btrfs on a malfunctioning drive for over a year (kernel 3.12, only metadata dup), much more reliable than ext4 which would lock up the entire filesystem on read/write failure and often go read-only, with btrfs the only visible signs of malfunction were dmesg and the scrub log.
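For anyone who hasn't used it, the scrub side of that is just two commands (the mountpoint here is my root filesystem):

    btrfs scrub start /    # read and verify every block in the background
    btrfs scrub status /   # running totals, including read and checksum errors found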
Also been using it as / since 2012 with no issues.
It is OK for a single-disk FS. It is no ZFS though, which is its largest problem; people keep marketing it as "Linux's answer to ZFS".
No, it is not.
Maybe some day, but today it is no ZFS. I love zfs..
Further, the ZFS utilities are far easier to use and understand. The zfs and zpool commands are well documented and intuitive. The btrfs utilities are not, IMO.
I am fine with using btrfs as a replacement for ext on my OS drive, but for my large multi-disk data arrays it's ZoL all the way.
Likewise, I've run it on hundreds of machines over the past three years without issue. I do continue to use EXT4 for database hosts as it far outperforms BTRFS with PostgreSQL from what I've seen.
How is the speed now? I never had data issues but I definitely had speed issues. Btrfs was really slow compared to ext4/xfs back when I tried it. And I mean orders of magnitude slow. I had an application that did a lot of disk access, and switching off of btrfs brought the runtime down from a week to just hours. I want to like btrfs, but after that I just can't trust it for high disk load situations.
I had a bad experience with it around 2014-15 on Ubuntu. 3 different laptops in my house suddenly stopped working (didn't boot at all) within the span of a year; 1 laptop hit the problem multiple times. In all cases, I had to format the root partition and reinstall Linux. All 3 had btrfs for / in common.
I moved back to EXT4 and it never happened again since then.
>Snapshot implementation has been started, but snapshots are by far the most complex of the remaining features to implement
Snapshots are the #1 feature of COW filesystems. I've been using them for a bit in btrfs and this feature is game-changing (and no, it hasn't eaten my data yet).
OpenSuSE uses btrfs by default and relies on it for one of its killer features. Before a system change, such as installing updates or changing service configuration, is made using YaST, snapper takes a snapshot of the root filesystem. If something breaks, just roll back to the previous working state.
I'd say that for the OpenSuSE folks, btrfs falls squarely into the "good enough" category.
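The rollback itself is pleasantly boring, something like this (the snapshot number is of course just an example):

    snapper list          # show the pre/post snapshots YaST and zypper have been taking
    snapper rollback 42   # roll the root filesystem back to snapshot 42; takes effect on reboot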
I'm a big opensuse fan, but none of my opensuse machines are running btrfs. Although, besides the stability (which can't really be much worse than ext4/xfs, which is what I apparently chose on those two machines), I think what drove me nuts about opensuse's use of btrfs was all the subvolumes. They're cool, but just another thing for me to deal with, and my computing philosophy for the past few years can be summarized as "KISS, unless it's really hurting".
IIRC, doesn't BTRFS also allow you to do cool things like change RAID levels dynamically? (E.g. You can be running a 2-disk RAID-1 array, pop in another disk and tell BTRFS to make it a RAID-5 array instead, then a year later pop in 2 more disks and switch to RAID6, all with no downtime.) I imagine that wouldn't be possible if you were doing RAID at a different layer.
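(If I remember correctly, the conversion is just a rebalance with convert filters, roughly like this; the device and mountpoint are made up:

    btrfs device add /dev/sdd /mnt
    btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt   # rewrite data as RAID5, keep metadata RAID1

and the balance runs while the filesystem stays mounted and in use.)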
Linux's built-in software RAID (implemented at the block level) has supported online RAID level changes for many years. Check out the "grow mode" section of the mdadm(8) man page sometime; it goes into great detail about which operations are supported.
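A sketch of the kind of reshape it describes, e.g. turning a 2-disk RAID1 into a 3-disk RAID5 with the array online (device names are examples):

    mdadm /dev/md0 --add /dev/sdd              # add the new disk as a spare first
    mdadm --grow /dev/md0 --level=5            # convert the mirror into a 2-disk RAID5
    mdadm --grow /dev/md0 --raid-devices=3     # then reshape onto the third disk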
Yeah, shout out to mdadm and the md driver. I've been using it for years and years and it's rock solid. Being able to grow arrays (online) and convert plain drives into mirror sets is great. I feel that md and lvm are underappreciated by a large number of Linux users…
Well, there are other reasons; you want to write code that operates on the data, and neither the code nor the data fits on a single machine - you have to target an abstraction which spans machines. Block storage is too low-level an abstraction.
That isn't to say that using high performance block storage isn't still a win even when the redundancy is multiplied at a higher level. The higher level redundancy is also about colocating more data with the code - i.e. it's not just redundant for integrity, but to increase the probability it's close to the code.
Of course. Most production monoliths are deployed on networked block storage - aka SAN - and NUMA is already structurally distributed memory, even on a single box. But it's not the right paradigm to scale well, no more than chatty RPC that pretends the network doesn't exist is the right way to design a distributed system.
I think the high profile ones (Google, Amazon etc) use relatively dumb OS drivers and do the fancy distributed FS abstraction stuff in userspace. Certainly stuff like Ceph and Gluster don't have very good reputations and are mostly sold to relatively clueless "enterprise" customers.
According to Kent on Reddit, bcachefs technically had online defrag since it was just bcache -- via the copying garbage collector for reclaiming space. So bcachefs will simply inherit that feature by design, which is great.
Apparently copygc is off right now because reasons, though (I'm going to assume it's almost certainly the related extent/compression issue that's holding this up from being enabled, which you can see referenced on the home page, at the bottom).
I should have been more clear - copygc is off by default in upstream bcache, it's on in bcachefs (and required, in order to guarantee a capacity when doing random writes)
I think ZFS is the only commercially viable open source CoW total storage management option. These new Linux filesystems are way too late to the party, and it will take a decade for them to reach maturity even once they hit basic 1.0 feature parity.
In parallel I see XFS as the long term evolution for Linux file systems. It will continue to scale slightly up from where it sits today and address fail in place, flash, metadata checksums, snapshots etc where total storage management is done by overlays like HDFS, object stores, etc.
I think ZFS is fantastic for businesses but there are a couple places where it falls short compared to bcachefs for me:
- For non-business users who want a RAID, ZFS is too inflexible. You can't add or remove disks to a RAIDZ vdev. If you want the space efficiency of RAIDZ, you have to expand your array in units of entire vdevs. If you want replicas, you have to expand in at least pairs of disks. BTRFS and bcachefs both allow you to replicate more flexibly and reshape your array.
- ZFS doesn't work particularly well with SSDs as caches. ZIL and L2ARC are nice but they're not as nice as a full bcache-style tiering setup. bcachefs tiers let you do crazy things like a 4-tier storage setup with Nearline HDD -> 15k SAS HDD -> SATA SSD -> NVMe SSD.
- ZFS is pretty complex to manage in general and major features like ZIL and L2ARC are arcanely documented. So far, bcachefs is pretty straightforward to use.
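To be fair, bolting the SSDs onto an existing pool is at least short to type; it's knowing when a SLOG or L2ARC actually helps that is poorly documented. Pool and device names below are made up:

    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1   # dedicated (mirrored) ZIL device, a.k.a. SLOG
    zpool add tank cache /dev/sdb                          # L2ARC read cache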
While I really like these sort of file systems, I'm not holding my breath.
This isn't a simple filesystem project, but plays in the next-gen space ZFS opened up.
There will be a lot to do, especially IO scheduling, RAID safety with shitty drive firmwares, consistency guarantees with fsync/partial flushes etc.
I'm pessimistic about it being mainlined in the near future; the core team will be wary of a second btrfs.
What I would like to see is a APFS/exFAT crossover with COW and data checksums without all the volume mgmt with ports for all possible operating systems so everyone can use it for their SDcards, usb-sticks and external drives without making tradeoffs and using fuse.
> with COW and data checksums without all the volume mgmt
The fact that the raidz volume is not an opaque block device allows ZFS to be aware of data corruption when comparing checksums and self heal if the data can be re-constructed from the array.
I'm not saying any attempt at a new filesystem should have to bundle the two layers together, but they should allow for communication between the abstractions.
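That communication is what makes a scrub on raidz useful: it doesn't just detect the bad copy, it rewrites it from redundancy. Roughly (the pool name is an example):

    zpool scrub tank        # read everything, verify checksums, repair from redundancy where possible
    zpool status -v tank    # per-device read/write/checksum error counters, plus any unrecoverable files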
> What I would like to see is a APFS/exFAT crossover with COW and data checksums without all the volume mgmt with ports for all possible operating systems so everyone can use it for their SDcards, usb-sticks and external drives without making tradeoffs and using fuse.
+1. Filesystems without bit-rot protection on flash drives are going to become at least as big a problem as optical disc rot.
What's the problem with fuse? It allows sharing code between Linux, OS X, (Free)BSD and even Windows (via dokan).
Yes, it will not offer you the same performance as an in-kernel driver (due to context switches), but given that CPU power always increases, no big problem there.
> Yes, it will not offer you the same performance as an in-kernel driver (due to context switches), but given that CPU power always increases, no big problem there.
This might be the case if you're running something incredibly easy on I/O like large sequential read/writes, but if you do anything at all challenging on I/O like opening desktop applications (Photoshop, lots of random reads), editing or viewing high bitrate video (very high throughput) or god forbid running a database, this is a huge problem.
2. Support varies between OSes. For example OpenBSD's FUSE does not have the default_permissions/allow_other flags, which makes for example encfs (and any other virtual filesystems that are backed by multiple files) a pain to use since OpenBSD 6.0 removed user mounting.
I'm more optimistic. What you may or may not know is that bcachefs is a tweak on bcache, which has already been mainlined and is pretty stable (I've personally been running bcache for a couple years on my home linux machine).
The point is a lot of the things you bring up are already covered by bcache. Bcachefs "just" adds a filesystem layer on the bcache tree structure.
If you think we need an alternate effort and/or competition to build an advanced, native filesystem for Linux (I do), please consider a subscription on Patreon (https://www.patreon.com/bcachefs). Kent has a long history of shipping sophisticated, high-quality code.
Plus he is willing to help out when you need to nail down a bug, as I recently discovered with bcache. My first Linux kernel patch might be a fix of a deadlock in bcache :-)
Chris Mason and the btrfs team are clearly talented. But the initial excitement of btrfs has sadly dissipated and its promise as the next generation Linux fs remains unrealised. It now feels a bit jaded and the momentum is spent.
I suspect many have lost patience with the promise of COW and unfortunately for bcachefs this history will cast a shadow on its development and potential.
Database performance remains problematic on COW, and while things like snapshots and ad hoc disk and volume management are interesting, even exciting, one soon realises that unless one has a pressing need they are just nice to have. Eventually boring ext4 ticks all the boxes and one may as well forget about the fs and focus elsewhere.
I don't think COW in general is a big issue for databases. You can get pretty good performance out of ZFS (very stable and consistent behavior), for example. The COW is not free, of course, but you get interesting features in return, and if you need them (e.g. snapshots), it's usually much better than LVM + a non-COW filesystem.
The fact that some COW filesystems perform poorly does not mean all COW filesystems do.
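For what it's worth, the usual database tuning on ZFS is little more than matching the recordsize to the database page size and leaning on snapshots for quick rollbacks. A sketch with made-up dataset names (8k matches PostgreSQL's page size):

    zfs create -o recordsize=8k tank/pgdata
    zfs snapshot tank/pgdata@pre-migration
    zfs rollback tank/pgdata@pre-migration   # near-instant rollback if the migration goes sideways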
This takes me back. 9 years ago I was playing around with ZFS COW and OS X sparse bundle containers to host disk images for multiple "versions" (exploiting CoW) of the same VM image. I wrote up an article on what I was doing [1]. Never persevered though, as it was a bit too fragile (at that time ZFS on OS X was not at all ready for prime time).
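The ZFS half of it was really just snapshot + clone, something like this (dataset names invented for illustration):

    zfs snapshot tank/vms/base@gold
    zfs clone tank/vms/base@gold tank/vms/scratch1   # instant CoW copy; only divergent blocks use new space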
Funny, but every so often I wonder what it might be like in a parallel world where Apple bought Sun instead of Oracle.
I'm looking forward to BcacheFS. ZFS on Linux is great when it works well, but it's an absolute pain when it breaks. Not only does it taint the kernel, but it doesn't really mesh very well in the kernel due to the usage of the SPL -- a layer used to convert Linux APIs to Solaris Kernel APIs. In addition, ZFS doesn't use as much native Linux memory management as I'd like, instead it manages its own pool of memory. This makes troubleshooting more difficult. This mechanism is further aggravated with the use of the kmem cgroup.
For example, if you have a dirty page in a cgroup, and the cgroup OOMs, the kernel will trigger writes. If any of these writes require memory allocations, they'll probably fail since the current cgroup is OOM. ZFS subsequently gets stuck in an infinite loop, and locks up. See: https://github.com/zfsonlinux/zfs/issues/5535
I understand that a lot of the ZFS work comes from LLNL & government funding. I'm not blaming them, as it works for their use case of machines that are running dedicated, controlled workloads.
We're experimenting with Btrfs, and we'll see how it goes.
You probably shouldn't. It's ready for adventurous testers, and is pretty stable, but unless you're willing to report bugs or hack on it, you should probably stay away.
There are reasons to still want it, despite its newness; for example, the latest updates bring huge improvements in metadata efficiency (low metadata overhead -> more metadata in the cache -> larger working set). Someone on the IRC channel reported it's somewhere around 20x faster than most filesystems when it comes to "iterate millions of files recursively", blowing everything else out of the water. (This seems somewhat synthetic, and I'd say it mostly is -- but OTOH, "tons of files in a directory" being really slow is life, and has bitten me multiple times in a prior job). In general, improved metadata efficiency helps everywhere, though. For example, if you're doing backups on a really big filesystem recursively, you'll have to traverse the metadata inodes a lot to get e.g. last modified time. bcachefs will likely do awesome here in terms of performance.
Another unique feature I recall is that it has very, very good tail latency -- bcachefs almost never blocks on I/O unnecessarily, so you don't get random 'lag spikes' when things like the page cache get flushed out (which may halt some other I/O ops). This makes the system feel much more consistent in general.
There's lots of good info in the architecture document and Patreon posts from Kent:
I spent the last week testing ext4/btrfs/zfs on Linux and I found that zfs is rather slow and btrfs has improved its performance a lot in recent years (I should refine the script a bit, upload some graphs and make a post).
I like that we are seeing competition in this space. I think it's good for business.
I do however see some big red flags in the linked page:
> Starting from there, bcachefs development has prioritized incremental development, and keeping things stable, and aggressively fixing design issues as they are found
From what the developer has stated on reddit, it's more like he wants to aggressively make changes on the filesystem right now, before any attempt at mainlining into the kernel, to not end up like btrfs, which in his view, was mainlined prematurely.
You know, honestly, I've always wondered why prisons don't have a bunch of computers for retraining. I mean, if there's a chance of conspiracy a specific inmate shouldn't have access, but your run-of-the-mill street thug would probably really benefit from learning Linux system administration, web site building, coding... It would probably be a lot easier to find gainful employment in a high-demand field and would help break the cycle. If they were allowed to work, the people with longer sentences could help break the cycle for their dependants as well.
TRIM isn't super important given bcache's write pattern (sequential writes to large aligned blocks). It doesn't do random in-place overwrites of small blocks.
bcache originally supported discard/TRIM commands (toggled by mount options), but taking a quick look, it might have been removed in bcachefs in the course of development.
I imagine ultimately TRIM will be supported, though (I don't see a reason why it wouldn't be, and considering Kent is focused on hammering out the design I imagine it'll inevitably fit in well).