About ZFS Performance (percona.com)
165 points by okket on May 27, 2018 | 40 comments



ZFS is an industrial-scale technology. It's not a scooter you just hop on and ride. It's like a 747, with a cockpit full of levers and buttons and dials. It can do amazing things, but you have to know how to fly it.

I've run ZFS for a decade or more, with little or no tuning. If I dial my expectations back, it works great, much better than ext4 for my use case. But once I start trying to use deduplication, I need to spend thousands of dollars on RAM, or the filesystem buckles under the weight of the dedup tables.

My use case is storing backups of other systems, with rolling history. I tried the "hardlink trick" with ext4, but the space consumption was out of control, because small changes to large files (log files, ZODB) caused duplication of the whole file. And managing the hard links took amazing amounts of time and disk I/O.

ZFS solved that problem for me. Just wish I could do deduplication without having to have 64GB of RAM. But, I take what I can get.


I totally agree with your comment, except for the last bit about 64GB of RAM being unreasonably high.

Is it unreasonable to spend thousands of dollars on memory for enterprise-grade, production-level servers? In my experience, you're almost certainly better off with over a hundred GB of RAM in a server if you want to maximize overall compute density.

To be clear, plenty of desktops and all workstation-class laptops support 64GB+ of RAM.


I think that agrees with my "industrial grade" comment. :-) In production we are currently deploying boxes with 256GB of RAM. But for our backup server, it's hard to justify spending more on RAM than on the disks in it. The box is capable of it; it's just the cost that makes it unappealing for this use.

For various reasons, the backup server doesn't get much priority.


Today one would use reflinks with XFS or btrfs for a finer-grained "hardlink trick" solution.
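
For illustration, a reflink copy with GNU coreutils looks like this (the paths are made up, and on XFS reflink support has to be enabled at mkfs time, e.g. mkfs.xfs -m reflink=1):

# cp --reflink=always /backups/prev/db.img /backups/today/db.img

Blocks are shared until either copy is modified, so a small change to a large file only consumes space for the changed extents.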

But that still won't deduplicate renamed files if you're using something like rsync w/--link-dest for the backups.


Are there any deduplicating filesystems for Linux similar to the (very basic) NTFS deduplication in Windows? I’ve fiddled with quite a few Linux dedup solutions over the years but nothing seems to be production ready. Even ZFS isn’t that useful as it only chunks on multiples of the block size.

The basic feature set for deduplication in my book is non-block-aligned chunking with compression. NTFS has had transparent post-process dedup since 2012 and is pretty much ideal for file, VM, and backup servers. It just works in the background with little performance impact but huge space savings.

It feels dirty using Windows servers as NAS for our Linux machines, but that’s what we’re doing today.


With XFS and reflink, out-of-band deduplication is totally possible and is a userspace [1] issue. But XFS is not doing anything to assist in accelerating the identification of duplicate blocks; instead, it simply implements ioctls for what is essentially extent sharing.

[1] https://github.com/markfasheh/duperemove
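
As a rough sketch of that userspace approach (the directory and hashfile paths are placeholders), duperemove scans a tree, hashes extents, and submits matching ones to the kernel for sharing:

# duperemove -dr --hashfile=/var/tmp/dedupe.hash /srv/backups

Without -d it only reports the duplicates it found instead of actually deduplicating them.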


btrfs has some dedup options, but I just recently looked and they all seem like kludges. I'd really like HAMMER on Linux, but that doesn't seem to be going anywhere. And I don't really want to run DragonFly for my backup servers, for maintenance reasons.


The comparison isn't apples to apples, though. You'd have to set up bcache or dm-cache with the NVMe drive in front of XFS to compare it with ZFS plus an L2ARC on an NVMe drive. The article calls these technologies exotic, but bcache is generally considered stable.

The point still stands, ZFS is fast enough 99.99% of the time (when tuned correctly), and simplifies a lot of administrative tasks.
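
For anyone who wants to try that comparison, a minimal bcache-in-front-of-XFS setup looks roughly like this (device names are assumptions; test on scratch disks first):

# make-bcache -B /dev/sdb -C /dev/nvme0n1
# echo writeback > /sys/block/bcache0/bcache/cache_mode
# mkfs.xfs /dev/bcache0

The default cache mode is writethrough; writeback is what also accelerates writes.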


> bcache is generally considered stable

Is it? How about the recent bug in Linux 4.14 that could lead to data loss with bcache (fixed in 4.14.2)? On the other hand ZFS for Linux also had a regression recently.


To be fair to bcache, that bug wasn't in the bcache code. It was in core kernel bio code and could easily have affected other subsystems.


That was someone introducing a regression in a core kernel API; bcache was just the victim.


With bcache one would have a write-back cache, so it would speed up writes as well (which would be superior to L2ARC).

Comparable to the L2ARC would be adding the NVMe as swap and then increasing the MySQL buffer pool to 400GB or something (I don't know if that would work without issues, though). With zswap, that would even have compression.
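
A sketch of that idea, assuming a spare NVMe partition and a kernel with the lz4 crypto module available (I haven't benchmarked it):

# mkswap /dev/nvme0n1p2 && swapon /dev/nvme0n1p2
# echo lz4 > /sys/module/zswap/parameters/compressor
# echo 1 > /sys/module/zswap/parameters/enabled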


ZFS also supports ZIL, would that improve write speeds?
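
For reference, a separate log device (SLOG) is added to an existing pool like this, assuming a pool named tank and a spare NVMe device; it only helps synchronous writes:

# zpool add tank log /dev/nvme0n1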


Yes, it would. And then it would be an apples-to-apples comparison. But that was not the result the Percona team intended to show, as discussed in the comments. They simply wanted to show that "if you throw more hardware at it, ZFS will outperform XFS with less hardware."

Don't get me wrong: ZFS is an incredible masterpiece, but there are almost no realistic benchmarks. They are either in favor of ZFS, like the benchmark from Percona, or in favor of other systems when L2ARC/ZIL is not used.


ZFS is “fast enough” depending on the use case.

Throwing expensive hardware at it to make it perform better than XFS is... an interesting “benchmarking” methodology.


That is mentioned in the article though:

> Of course, I could use flashcache or bcache with XFS and improve the XFS results but these technologies are way more exotic than the ZFS L2ARC.


The message you're replying to mentioned that and explicitly disagreed with it.


The point to take away from this is the entirely unsurprising conclusion that locally-attached unthrottled ephemeral disk has higher throughput than network-attached IOPS-throttled EBS. So yeah, sure, if you take the time to pre-cache all your EBS-stored data onto the local ephemeral drive and only do reads from the cache, then you will get more query throughput on the local ephemeral drive than on the remote EBS drive. But I'm not sure what that is supposed to tell us about ZFS or XFS.


The post is meant to explain how ZFS functions and how you can tune it for the best performance. The author notes that the benchmarks are not fair; they are there to show how much improvement you could potentially gain from tuning ZFS.

The title is "About ZFS performance," not "An in depth comparison of ZFS and XFS."


Part of the point is that you can easily take advantage of that fast local storage to add transparent caching to a database that is otherwise unaware of it.


For a transaction processing database workload, it's generally good practice to run with "zfs set primarycache=metadata", which means that ZFS won't attempt to cache anything except metadata. This might have reduced the 15% cache overhead the author observed.

I'll be interested to read the future work on using larger page sizes. Conventional wisdom would hold that it was a bad idea (at least, for a write-heavy workload) because of the write-amplification that it produces, but it sets me wondering to what extent ZIL-offloading could mitigate that. Then again, there's really no point in putting the ZIL on ephemeral storage.
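
For anyone trying this, those knobs are per-dataset properties; a sketch, with tank/mysql as a made-up dataset name and 16k chosen to match the InnoDB page size:

# zfs set primarycache=metadata tank/mysql
# zfs set recordsize=16k tank/mysql

Note that recordsize only affects blocks written after the change, so it is best set before loading the data.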


ZFS is such a delight to work with. I learned something new in this article. Thanks for linking it.


It may interest you to know that in addition to running on ZFS, the rsync.net platform also has the option to 'zfs send' and 'receive' (over SSH):

https://arstechnica.com/information-technology/2015/12/rsync...
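
The mechanics, for anyone unfamiliar (pool, dataset, and host names are placeholders):

# zfs snapshot tank/data@2018-05-27
# zfs send tank/data@2018-05-27 | ssh user@backuphost zfs receive backups/data

Later runs can use zfs send -i to ship only the blocks changed since the previous snapshot.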


I normally couldn't care less that you guys try to slip a thinly-veiled advertisement into any HN discussion that concerns ZFS, but, geez, c'mon guys, this is really pushing it.

At the very least, throw in that mention of the HN discount.


Agreed.


uncalled for. Really?


> ZFS is much less affected by the file level fragmentation, especially for point access type.

I'm disappointed they didn't show that comparison. How much does the original result change in case of fragmented db files?


Actually, it is quite the opposite. ZFS is a copy-on-write filesystem: if you write in the middle of a file, the data block gets moved to a new place on your disk. For a typical database load, your db files get more and more fragmented over time.


"RAIDZ" seems to be most popular, but if you want better performance go with mirrors. Then let yours VM's have their own FS on top of zvol. For example NTFS on top of a zvol you can get GB rw/s even with spinning HDD's.


If you have 24 disks for a pool, striping across four 6-disk RAIDZ2 vdevs works well too.
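
Roughly (disk names are placeholders):

# zpool create tank \
      raidz2 sda sdb sdc sdd sde sdf \
      raidz2 sdg sdh sdi sdj sdk sdl \
      raidz2 sdm sdn sdo sdp sdq sdr \
      raidz2 sds sdt sdu sdv sdw sdx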


Another interesting next-generation file system project is Red Hat Stratis: https://stratis-storage.github.io/

After dropping btrfs support some time ago, they started developing their own next-gen file system based on LVM and XFS. It is now available as a technology preview on Fedora 28.


I read this article and took away the conclusion that to get acceptable performance from ZFS compared to XFS, I have to do extensive tuning and throw in a half terabyte of NVMe storage as a cache.

Not very impressive.


I agree they didn't present it very well. But in reality if you do want to get as much as possible out of your database, you'll need to look at tuning at that level anyway - regardless of the database, or the filesystem you use.

I expect that XFS with bcache would need much less tuning up front though. Adding bcache in front of the original configuration should give a similar improvement to l2arc.


You can also enable LZ4 compression on the ZFS dataset itself for even faster performance:

# zfs set compression=lz4 mysqldatavol


At a cost of CPU to do the work.


Your CPU spends most of its time idle, and your disks are tens of thousands of times slower. For systemic performance, if you can trade almost anything for clock cycles, you should do that.


Yes, sort of like saying you pay for car insurance at the cost of a better car. The CPU overhead for LZ4 is low, low, low.


The trend of crappy posts from Percona continues; they must suddenly want attention.

Percona was best at consulting; their DBAs were worth it. The software may be close to irrelevant now, but not their DBAs.


This is the worst GDPR (or whatever) implementation I've seen yet on mobile: the button to accept (I assume it is to accept) renders underneath the fixed social buttons at the bottom. And the modal is fixed, too, so no amount of scrolling exposes it.


Been doing hardware RAID for years. Went for the ZFS meme, got half the performance, and even that modicum required unnaturally deep queues. "Lost" 10k+ of my employer's money.

How many drives are these guys using, and how does it scale compared to theoretical performance?



