- Very old kernel used (3.10!), makes me wonder how old the packages like btrfs-progs are as well.
- BTRFS not mounted with compression (compress=lzo)
- Don't use QCOW2, just don't, it's slow and you're just adding extra layers where you don't need to.
It would be interesting to see you re-run these tests using a modern kernel, say at least 4.4, and either raw block devices or logical volumes, along with mounting BTRFS properly with the compress=lzo option.
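Something like this is what I have in mind (a sketch; the device name and mount point are hypothetical):

    # btrfs on a logical volume instead of inside a qcow2 file,
    # mounted with transparent compression
    mkfs.btrfs /dev/vg0/vmstore
    mount -o compress=lzo /dev/vg0/vmstore /var/lib/libvirt/images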
The benchmark configuration appears to be designed to evaluate the use of storage technologies for a KVM host. Consequently, saying to use raw block devices when giving tips for improving btrfs performance is contradictory.
Also, there are a large number of people that will not run a newer kernel for several years because they are on RHEL6 or RHEL7, so while newer kernels are interesting, we should not discount the results on the basis that the kernel is old. The latest ZFSOnLinux code is able to run on those kernels, so while btrfs remains stagnant there, ZFS will continue to improve.
As for rerunning the tests, using recordsize=4K and compression=lz4 on ZFS should improve its performance here too. Putting the VM images on zvols (where it would be volblocksize=4K) rather than qcow2 also would help. In ZoL, zvols are block devices.
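Roughly this, as a sketch (pool/dataset names and sizes are hypothetical):

    # tune an existing dataset that holds the image files
    zfs set recordsize=4K tank/vmimages
    zfs set compression=lz4 tank/vmimages
    # or give the guest a zvol instead; volblocksize must be set at creation
    zfs create -V 40G -o volblocksize=4K tank/vm-disk1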
I disagree (politely) with your assumption about QCOW2 and BTRFS performance. If I get some time this week I'll do some objective benchmarking (I design and build Linux-based storage systems) and, hey, I might be proven wrong, but regardless I can let you know the results if you're interested?
People not willing to update their kernel in production environments is a social problem, not a technical problem. The kernel is one of the most reliable, well tested and reviewed software projects in the world. When you upgrade your kernel, 99.5% of the time you get new features, performance and bug fixes without any negative impact. There are of course rare corner cases, especially if running proprietary hardware, where vendors are generally slower to release updates that take advantage of a modern kernel.
The problem there is that the culture and the traditionally slow-moving operations engineering mindset haven't all embraced the well-proven fact that regular, small changes are safer and have various added benefits.
There is also a serious language barrier between many engineers and management / project managers, who are unlikely to understand the benefit of upgrading to a kernel version that, say, properly supports the new SCSI blk-mq backend for storage. So there either needs to be a degree of trust in and respect for (proven) engineers and teams, or they need to be clearly taught the value of a fast release cycle and of practising quick (hopefully automated) patching. That's where I think books like Gene Kim's The Phoenix Project and his soon-to-be-released The DevOps Handbook come in; the latter may in fact be more useful to people performing PM/PO tasks.
You said "Don't use QCOW2, just don't, it's slow and you're just adding extra layers where you don't need to.". I was agreeing with you when I said that it also has that effect on ZFS. What is my "assumption about QCOW2 and BTRFS performance"?
As for newer kernels, code churn tends to break things that previously worked, which upsets customers who want bugs to decrease as a function of time, rather than go up. Vendors like Redhat will not update distributions like RHEL to newer kernels because of that and they will not support kernels that they do not ship. That is why people run older kernels.
That's complete FUD regarding newer kernels - the Linux kernel gets more stable over time, not less. It's like updating your drivers - you don't hold back your graphics card drivers because the new stable release might be less stable than the outdated one that came bundled with your PC.
KVM (or any overwrite workload, for that matter) is the worst possible workload for btrfs because of COW. We have ideas to address this but honestly it's not high on the list.
There are no docs. I'm not a qcow2 expert, what I know is very basic so anything I say about qcow2 can be very wrong.
So qcow has a read-only base image that gets updated when we change things. The image format just has the changes from the original image. So you update a package, it adds some metadata to point at the new stuff, adds the data in, and you are done.
So with btrfs you have this image on top of btrfs, so you update a file and its metadata inside the image. Say you start with a pristine image that's in nice big extents. You update a package which changes small chunks all over the file. Let's say you update 12 4K extents. So instead of one extent you now have 36 extents. This affects everything: fsyncs take longer because there are more extents we have to write out, the space is more fragmented so cold cache reads are more expensive, and the csums are no longer contiguous so they also take up a larger, more fragmented area. It has this really terrible cascading effect.
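One way to watch this happen is to count the image's extents before and after an update inside the guest (path hypothetical; filefrag comes from e2fsprogs but works on btrfs):

    filefrag /var/lib/libvirt/images/guest.img       # just the extent count
    filefrag -v /var/lib/libvirt/images/guest.img    # list every extent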
I tested it with preallocated raw images on Btrfs and that setup was slow as well. If I get you right, it must be possible to somehow align VM disk sectors to Btrfs sectors for a performance boost. Or is compression and deduplication usage going to kill performance anyway?
Btrfs internally uses logical addresses for all extents. The mapping from logical to physical is done via the chunk tree, which indicates not only the physical sector but also the device. So the reference for a file extent says nothing about what device the extent is on or the replication (raid profile), since that is all a function of the chunk and dev trees.
I think Btrfs for a guest FS is best pointed at an LV rather than qcow2. It's been a while since I benchmarked that compared to 'qemu-img create -f qcow2 -o nocow=on', which will set the +C attribute on the file, making it nocow. The nocow attribute helps a lot with this problem.
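For reference, a minimal sketch of the two ways to set that up (paths and sizes hypothetical):

    # create the image with the nocow option (sets +C on the new file)
    qemu-img create -f qcow2 -o nocow=on guest.qcow2 40G
    # or mark an empty images directory +C so new files created in it inherit nocow
    chattr +C /var/lib/libvirt/images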
ZFS also does better when the guests are stored on zvols. If volblocksize=4K is set on creation, it can avoid read-modify-write overhead. Regular files can be used too, but those have somewhat higher overhead. In that case, recordsize=4K can be used.
That said, I do not think the tests involved nesting CoW file systems.
> Shouldn't ZFS have similar problems with overwrite workloads?
ZFS does suffer from read-modify-write on partial record writes. The effect of that is apparent in the benchmarks. However, the benchmarks are being done on mechanical disks, which have low IOPS. The IOPS of a mechanical disk are roughly the same on a given sequence of IOs at different positions regardless of whether they are 4KB or 128KB in size, so it only has to pay a penalty once. If the record size were changed to 4KB, this penalty would disappear and ZFS performance should increase, provided that the VM internals are properly aligned.
Also, read-modify-write overhead reduces IOPS by at most a factor of 2, and bandwidth to the smaller of the link bandwidth and the IOPS times the record size. A CoW filesystem should be able to perform roughly at that level when it does read-modify-write on records/extents. Unless btrfs' internal extents are huge, there is an issue somewhere. Of course, having huge extents by default on which read-modify-write is done could also be considered a design issue.
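As a back-of-envelope illustration (assuming a ~150 IOPS mechanical disk and 128K records; the numbers are just picked for the example):

    # worst case, every 4K write becomes read-record + write-record, i.e. 2 IOs
    echo "$((150 / 2)) effective write IOPS"
    echo "$((75 * 128 / 1024)) MB/s worst-case write bandwidth (vs. the link bandwidth)"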
ZFS has a 7 year head start on BTRFS and has traditionally had about 5 times as many core contributors at any given time so I imagine they've solved this in some novel way by now.
Wasn't ZFS even "production ready" before BTRFS development even began?
I don't remember the state of it when it was first introduced into Solaris; maybe someone's memory is better than mine. Was ZFS better off in '06-'07 than BTRFS is now?
btrfs development started in 2007 while ZFS development started in 2001. If you want to do a point in time comparison, look at how ZFS was in 2010. You can get the last copy of OpenSolaris for the comparison. It would need to be done on physical hardware due to poor support for virtualization back then.
By the way, ZFS was deemed production ready after 4 years of development.
That's true. One quick side note though: BTRFS actually had a fairly long "draft period" (multiple years) of design. I have no clue how long that was on ZFS. I just know that the original author mentioned that somewhere.
I am the guy who asked the LZJB question. To summarize my recollection, formal design work on ZFS started in 2001 when Matthew Ahrens started working at Sun. Jeff Bonwick had promised Matthew Ahrens a job at Sun making a filesystem 6 months to a year before then, when Matt was still in college. I am sure that both Jeff and Matt had some thoughts on it during that time, but there was no formal effort until Matt's employment started.
The idea that btrfs has fewer contributors ought to be a surprise for many. Users tend to assume the opposite. ZFS having 5x more seems to be on the high side to me though. How did you determine that?
By the way, I was under the impression that ZFS development started in 2001 while btrfs development started in 2007. That would be a 6 year difference.
Not sure how that tracks. These file systems are huge code bases with lots of problems and areas for improvement. When you only have 2-4 people who understand most of it at a given time things move slowly. If I could tell somebody to go fix balance my life would be a whole lot easier, but in reality we end up being interrupt driven and have to prioritize differently.
It's a general purpose file system, it does great with metadata heavy workloads and normal streaming writes. The overwrite case is special because of COW. Eventually we'll be comparable to everybody else but we aren't now.
I reinstalled my laptop with btrfs on / because "why not" about a year ago.
I ran into issues with maintaining a huge ~1M file Maildir, it seemed to do very badly with huge directories. Some kernel thread would be at 99% CPU while I was trying to populate a Maildir, and the entire system ground to a halt.
More importantly I would run into issues like running out of space on / with 20G left, but it was the "wrong kind of space". I.e. I had run out of metadata space but had plenty of file space left and had to run a rebalancing operation on the filesystem.
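For anyone who hasn't hit this, roughly what it looks like and the usual fix (a sketch, not my exact commands):

    btrfs filesystem df /             # shows data and metadata allocation separately
    btrfs balance start -dusage=5 /   # repack nearly-empty data chunks so metadata can grow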
I didn't need any of the COW etc. advanced features that btrfs provides, and didn't test them, but as just a normal user needing a general purpose filesystem it was a bit too much hassle, especially with needing to administer the filesystem in a manner similar to how I might administer a RDBMS.
A subvolume for your Maildir, mounted with "-o nodatacow", would have prevented the first issue.
If btrfs could modify that behavior automatically using heuristics it would truly be general-purpose. However there are some inherent tradeoffs of that approach that users need to consider, and I consider that downside a basic constraint of the Unix filesystem abstraction.
I think I did try stuff like that, including mounting / with nodatacow and other similar options.
I walked away with the impression that btrfs was a powerful tool for certain use-cases, but definitely not something I'd call a "general purpose" filesystem given the need to be fairly knowledgeable about its internals to use it for common desktop use-cases.
To me a general purpose filesystem is something like ext4, it doesn't have amazing performance, it's not bad either, but generally nothing unexpected happens with it and I can just leave it there and don't have to worry about it.
The btrfs filesystem seemed like the opposite of that. A very powerful tool whose power and flexibility made it less general purpose by virtue of needing to keep close tabs on how you were using it.
nodatacow means no checksums. Whatever performance problem it was intended to resolve should also recur the moment that a snapshot is done, because then CoW re-enters the picture.
NOCOW is horribly expensive because we still have to go check and make sure that there are no snapshots pointing at the changing extents. It only solves the fragmentation issue, and if you don't prealloc your image it doesn't even do that.
Are you sure that disabling CoW solves btrfs' fragmentation issue when you pre-allocate the image? If the volume is snapshotted regularly, CoW should be active once on each extent until the next snapshot reactivates it. That should mean that the fragmentation would still occur, although more slowly. Is that correct or am I misunderstanding something?
I am only superficially familiar with btrfs internals, but I do not see any way to implement snapshots without either doing CoW or duplicating the data in its entirety. If you are checking to see if the data is part of a snapshot, then you should be doing CoW.
(I'm not an expert, just a user) Snapshotting effectively disables NOCOW. Snapshotting always works, and needs COW to do so. Which means ryao is right that fragmentation will occur. btrfs doesn't copy a NOCOW file in its entirety if it's in a snapshot, or I would have run out of disk space by now. (I really need to move my VM images to a non-backed-up subvolume.)
How about if you have a NOCOW subvolume with just a few files on it, e.g. 3 VM images. Is it still expensive to check for snapshots? Does the cost of checking disappear if you have no snapshots of the volume? As ryao mentioned, you'd still get fragmentation if you have snapshots, so in that case is using NOCOW for a VM image counterproductive?
Seems like the expense of checking could be largely removed with clever enough metadata caching, which probably no one has had time to implement.
We have to look up the physical extent in the extent reference tree, so the cost is independent of the number of snapshots and more a function of the fragmentation of the extent tree. The metadata is all cached of course, but fragmentation means you are likely to not find the entries in cache.
The other aspect that I haven't talked about is our fsync performance is kind of shit compared to other fs'es. Now this does get better in the nocow case but it's still pretty heavy and needs optimization.
Is btrfs a good choice from the power consumption standpoint?
I'm on a laptop and trying to maximize my battery life, and am wondering if my choice of btrfs limits it.
The reality is that many are not interested in waiting and adopt ZFS permanently. This was my experience ~6 years ago. Had btrfs worked better then, I would probably have been a btrfs contributor today.
Of what I know now about the internals of each, I am glad that I went with ZFS. The decisions in ZFS are more robust in terms of reliability and performance, e.g. the merkle tree, the use of 256-bit checksums, mandatory ditto blocks, the ARC algorithm, the intent log, a variable-height indirect block tree, etcetera.
That sentiment is familiar; I used to subscribe to it when I was following this closely 3-4 years ago. That it hasn't changed in that time does not inspire hope.
This is more than a year old and the hardware this was tested on wasn't even that exceptional when it was released 7 years ago. I'm kind of assuming that the tester did something strange for BTRFS in particular because his results disagree with every other benchmark out there. Using BTRFS on top of software RAID 10 is also inappropriate as this should be done by creating a filesystem containing all four devices.
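i.e. something like this instead of layering it on md (device names hypothetical):

    # native btrfs raid10 across all four disks for both data and metadata
    mkfs.btrfs -d raid10 -m raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde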
That being said I'd love to see more benchmarks of BTRFS compared to other filesystems and on hardware that isn't so archaic. I think it's safe to say that this article is not representative of reality as Phoronix has tons of benchmarks that all don't have nearly this big of difference between BTRFS and Ext4. Here's an article that's even more outdated than the parent and it still shows BTRFS performing acceptably across the board.
I found it just recently when I upgraded my VM image storage, and BTRFS was still significantly slower on HDDs using Ubuntu 16.04. Though I only compared it to EXT4 on LVM with QCOW2 + backing files.
Any idea if there was newer performance comparison for VM storage?
I could argue otherwise, but that is beside the point on enterprise distributions where these tests were done. The latest ZFS code is always available on those while the btrfs code is stagnant there. The btrfs developers delegate responsibility of backporting their changes to distribution developers, who never do any backports. Consequently, the newer btrfs code is not very relevant as far as benchmarks on RHEL and others are concerned.
Not sure what you mean here. They showed that ZFS performed well on linux. Are you saying that ZFS performs poorly on BSD and Solaris?
Love me some ZFS and it works well pretty much everywhere. OpenZFS shows good promise of keeping (bringing) the BSD, Illumos and Linux versions in line with each other.
Another reason is that ZFS is tightly integrated with other things (for example jails and zones) on those systems, having switches related to other parts. I don't think that it is quite there yet on Linux, even though I am sure that will come.
ClusterHQ's tests were unable to detect a difference between the platforms, with the exception of SELinux where performance was degraded unless xattr=sa was set. The code bases are almost identical, so performance differences tend to be caused by external factors and/or compatibility shims. There are also cases where one platform has a performance improvement before the others because the releases and porting are not done in lock step.
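For reference, that workaround is a one-liner (dataset name hypothetical):

    zfs set xattr=sa tank/containers    # store xattrs in the inode instead of separate objects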
BTRFS had copy-on-write disabled, while with ZFS that's not possible (the whole idea of the FS is to be copy-on-write). That actually makes BTRFS look even worse compared to ZFS, because BTRFS writes once instead of twice and has most of its features not working, while still performing pretty badly. But then it's also younger.
Both ZFS and BTRFS (I only know the specifics of ZFS) can be configured to be better for DB workloads. On ZFS, I know a couple of people using it because it has nice properties.
Anyway. You might not want to use ZFS or BTRFS for a (pure) database system when performance is the important thing (compared to data security).
What I find kind of missing is UFS because of the BSD world (it's kind of what Ext4 is in the sense of "your general purpose FS with good performance for DBs"), but okay.
And yeah, that all may sound a bit biased towards ZFS, but so far my experience with ZFS has been rather pleasant compared to BTRFS. But again, you might wanna use UFS or Ext4.
It's also kind of "wrong" to compare those file systems. It's like: while you could use Redis and PostgreSQL for the same things, it's probably not what you want to do, for one reason or the other.
But then of course it's good, because you might really have a case where you want a comparison of those two things to make the right decision for your specific application. Like those cases where you say "It's slow, but it makes a lot of things easier" or "It doesn't guarantee that, but when I take care of it in the design I will need only a fraction of the resources, and the other downsides don't annoy me thaaat much".
Because the question came up: COW means Copy on Write, and it really means what it says. You copy the data (so you have to find and write new blocks, duplicating your data on write), and with a full-blown database that does the same thing again itself (WAL, autovacuum, keeping lots of metadata, etc.), you really shouldn't be surprised that it's a lot slower on write-heavy systems. It is more than expected.
On the other hand, because the FS does a lot of things in a similar way to a database, beyond snapshots and replication and all that cool stuff, ZFS handles metadata extremely quickly, and that's why some CDNs that have many small files use it. Besides, it's just an amazing thing for managing your data in general.
Also, if you wanna learn more about ZFS and get a good understanding of how you can run databases and other things on ZFS and really utilize it (not just gaining more performance), then I highly recommend the book FreeBSD Mastery: Advanced ZFS.
I've been curious about ZFS, btrfs, etc. But as a layperson, I don't have the technical chops, gumption, or wherewithal to figure out what's what. Reading posts (comments) about the edge cases where they fail (data, performance, missing features) leaves me more baffled.
The presentation is focused on how the fuzzing technique works rather than individual file system performance, but the "time to first bug" is quite telling. Copied here:
That sort of analysis tests something that is a single thing on most of those filesystems but happens to be two different things on btrfs and ZFS: specifically, corruption that passes checksums and corruption that does not pass checksums. btrfs' result was obtained by modifying it to disable checksums. How btrfs or ZFS do without the benefit of checksums should only matter for people exchanging disk images. In that case, it would be better to have a userland driver regardless of the filesystem. NetBSD is able to run its filesystem drivers in userspace for this reason.
There is plenty there to read. I would rather not spend my Sunday reading that to see if I can think of something equivalent. However, you might find these pages helpful:
Sorry, I didn't mean for you to read it, just linked to be clear. The Jepsen effort verifies the reliability claims of various persistence schemes. Independent verification, if you will.
ZFS has a utility called ztest that performs stochastic testing on the core ZFS code in userspace with assertions enabled. It catches many reliability issues and is something that anyone can run. There is also the zinject framework for doing fault injection.
As for independent verification, here are the top two results from searching google from my IP address for "Zfs paper reliability":
There is nothing surprising to me there. The merkle tree with good quality 256-bit checksums strongly guarantees that corruptions are detected and redundancy allows for correction.
The main ways of messing up would be a buggy driver, something else in the kernel damaging the driver's data structures or hardware corrupting memory. The stochastic testing gives some protection against the first of those. I guess you could target the two things that live outside of the merkle tree (labels and ZIL blocks), but those are self checksummed and used in ways that minimize the potential for issues.
Here is a summary of data integrity features that I wrote:
There is a UFS driver for Linux, but it is a reimplementation rather than a port, so its performance numbers would not be comparable. Also, it attempts to support many variants of UFS rather than just one:
It does not support the latest UFS developments in FreeBSD, NetBSD, etcetera, so its performance is also limited by the older disk format versus the newer formats used by drivers on other platforms.
Interesting. But in our case, we are running Hyper-V VMs (so sadly NTFS on the host), where the VMs use EXT4, and a few use BTRFS.
The only problem that we had was with an uncontrolled poweroff a year ago. It looks like BTRFS had some trouble recovering from it, but we managed to restore all data from the partitions. However, the flexibility that BTRFS offers (transparent compression, increasing hard disk space on demand, etc.) is really nice. I hope that it improves more, so we can use it without any issues.
Note that ext4 on LVM can do this as well, you just need to plan for it in advance. All of my systems are configured as fs-on-lvm, even the single-disk ones. I've found it's just less hassle: I can do all storage migration or storage expansion without ever taking the machine offline.
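e.g. growing a filesystem online is one command (VG/LV names and size hypothetical):

    lvextend -r -L +20G vg0/root    # -r resizes the ext4 filesystem along with the LV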
I just realized one of my Ubuntu VMs was built without LVM and only found out because I was going to expand storage on it. Luckily, it's just a toy VM on the home server so no biggie. But yes, having fs-on-lvm should just be the default now since it makes expansion so much easier.
I'd like to see NTFS somewhere in these comparisons, just to shake some of that "grass is greener" thinking. From the amount I've read on the various file systems, I feel like there's a lot left on the table that we could get out of our hardware just by using a better filesystem. But I've not seen a whole lot about NTFS, because it's all we've got in Windows land, I suppose.
Got a Synology NAS showing up today. Based on the results here and the fuzzing link posted several days ago (and elsewhere ITT), it looks like I'll be going with ext4 for now.
Amazingly, I've had a ReadyNAS device last me over 10 years and here's to hoping this one lasts a similar period of time.