About ZFS Performance (percona.com)
165 points by okket on May 27, 2018 | 40 comments



ZFS is an industrial-scale technology. It's not a scooter you just hop on and ride. It's like a 747, with a cockpit full of levers and buttons and dials. It can do amazing things, but you have to know how to fly it.

I've run ZFS for a decade or more, with little or no tuning. If I dial my expectations back, it works great, much better than ext4 for my use case. But once I start trying to use deduplication, I need to spend thousands of dollars on RAM, or the filesystem buckles under the weight of the dedup tables.

My use case is storing backups of other systems, with rolling history. I tried the "hardlink trick" with ext4, but the space consumption was out of control, because small changes to large files (log files, ZODB) caused duplication of the whole file. And managing the hard links took amazing amounts of time and disk I/O.

ZFS solved that problem for me. Just wish I could do deduplication without having to have 64GB of RAM. But, I take what I can get.


I totally agree with your comment, except for the last bit about 64GB of RAM being unreasonably high.

Is it unreasonable to spend thousands of dollars on memory for enterprise-grade, production-level servers? In my experience, you're almost certainly better off with over a hundred GB of RAM in a server if you want to maximize overall compute density.

To be clear, plenty of desktops and all workstation-class laptops support 64GB+ of RAM.


I think that agrees with my "industrial grade" comment. :-) In production we are currently deploying boxes with 256GB of RAM. But for our backup server, it's hard to justify spending more on RAM than on the disks in it. The box is capable of it; it's just the cost that makes it unappealing for this use.

For various reasons, the backup server doesn't get much priority.


Today one would use reflinks with XFS or btrfs for a finer-grained "hardlink trick" solution.
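
For illustration, a reflink copy with GNU coreutils looks like this (the paths are made up, and on XFS reflink support has to be enabled at mkfs time, e.g. mkfs.xfs -m reflink=1):

# cp --reflink=always /backups/prev/db.img /backups/today/db.img

Blocks are shared until either copy is modified, so a small change to a large file only consumes space for the changed extents.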

But that still won't deduplicate renamed files if you're using something like rsync w/--link-dest for the backups.


Are there any deduplicating filesystems for Linux similar to the (very basic) NTFS deduplication in Windows? I’ve fiddled with quite a few Linux dedup solutions over the years but nothing seems to be production ready. Even ZFS isn’t that useful as it only chunks on multiples of the block size.

The basic feature set for deduplication in my book is non-block-aligned chunking with compression. NTFS has had transparent post-process dedup since 2012 and is pretty much ideal for file, VM, and backup servers. It just works in the background with little performance impact but huge space savings.

It feels dirty using Windows servers as NAS for our Linux machines, but that’s what we’re doing today.


With XFS and reflink, out-of-band deduplication is totally possible and is a userspace [1] issue. But XFS is not doing anything to assist in accelerating the identification of duplicate blocks; instead, it simply implements ioctls for what is essentially extent sharing.

[1] https://github.com/markfasheh/duperemove
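
As a rough sketch of that userspace approach (the directory and hashfile paths are placeholders), duperemove scans a tree, hashes extents, and submits matching ones to the kernel for sharing:

# duperemove -dr --hashfile=/var/tmp/dedupe.hash /srv/backups

Without -d it only reports the duplicates it found instead of actually deduplicating them.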


btrfs has some dedup options, but I just recently looked and they all seem like kludges. I'd really like HAMMER on Linux, but that doesn't seem to be going anywhere. And I don't really want to run DragonFly for my backup servers, for maintenance reasons.


The comparison isn't apples to apples, though. You'd have to set up bcache or dm-cache with the NVMe drive in front of XFS to compare it with ZFS plus an L2ARC on an NVMe drive. The article calls these technologies exotic, but bcache is generally considered stable.

The point still stands, ZFS is fast enough 99.99% of the time (when tuned correctly), and simplifies a lot of administrative tasks.
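
For anyone who wants to try that comparison, a minimal bcache-in-front-of-XFS setup looks roughly like this (device names are assumptions; test on scratch disks first):

# make-bcache -B /dev/sdb -C /dev/nvme0n1
# echo writeback > /sys/block/bcache0/bcache/cache_mode
# mkfs.xfs /dev/bcache0

The default cache mode is writethrough; writeback is what also accelerates writes.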


> bcache is generally considered stable

Is it? How about the recent bug in Linux 4.14 that could lead to data loss with bcache (fixed in 4.14.2)? On the other hand ZFS for Linux also had a regression recently.


To be fair to bcache, that bug wasn't in the bcache code. It was in core kernel bio code and could easily have affected other subsystems.


That was someone introducing a regression in a core kernel API; bcache was just the victim.


With bcache one would have a write-back cache, so it would speed up writes as well (which would be superior to L2ARC).

Comparable to the L2ARC would be adding the NVMe as swap and then increasing the MySQL buffer pool to 400GB or something (I don't know if that would work without issues, though). With zswap, that would even have compression.
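
A sketch of that idea, assuming a spare NVMe partition and a kernel with the lz4 crypto module available (I haven't benchmarked it):

# mkswap /dev/nvme0n1p2 && swapon /dev/nvme0n1p2
# echo lz4 > /sys/module/zswap/parameters/compressor
# echo 1 > /sys/module/zswap/parameters/enabled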


ZFS also supports ZIL, would that improve write speeds?
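
For reference, a separate log device (SLOG) is added to an existing pool like this, assuming a pool named tank and a spare NVMe device; it only helps synchronous writes:

# zpool add tank log /dev/nvme0n1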


Yes, it would. And then it would be an apples-to-apples comparison. But that was not the result the Percona team intended to show, as discussed in the comments. They simply wanted to show that "if you throw more hardware at it, ZFS will outperform XFS with less hardware."

Don't get me wrong: ZFS is an incredible masterpiece, but there are almost no realistic benchmarks. They are either in favor of ZFS, like the benchmark from Percona, or in favor of other systems when L2ARC/ZIL is not used.


ZFS is “fast enough” depending on the use case.

Throwing expensive hardware at it to make it perform better than XFS is... an interesting “benchmarking” methodology.


That is mentioned in the article though:

> Of course, I could use flashcache or bcache with XFS and improve the XFS results but these technologies are way more exotic than the ZFS L2ARC.


The message you're replying to mentioned that and explicitly disagreed with it.


The point to take away from this is the entirely unsurprising conclusion that locally-attached unthrottled ephemeral disk has higher throughput than network-attached IOPS-throttled EBS. So yeah, sure, if you take the time to pre-cache all your EBS-stored data onto the local ephemeral drive and only do reads from the cache, then you will get more query throughput on the local ephemeral drive than on the remote EBS drive. But I'm not sure what that is supposed to tell us about ZFS or XFS.


The post is meant to explain how ZFS functions and how you can tune it for the best performance. The author notes that the benchmarks are not fair; they are there to show how much improvement you could potentially gain from tuning ZFS.

The title is "About ZFS performance," not "An in depth comparison of ZFS and XFS."


Part of the point is that you can easily take advantage of that fast local storage to add transparent caching to a database that is otherwise unaware of it.


For a transaction processing database workload, it's generally good practice to run with "zfs set primarycache=metadata", which means that ZFS won't attempt to cache anything except metadata. This might have reduced the 15% cache overhead the author observed.

I'll be interested to read the future work on using larger page sizes. Conventional wisdom would hold that it was a bad idea (at least, for a write-heavy workload) because of the write-amplification that it produces, but it sets me wondering to what extent ZIL-offloading could mitigate that. Then again, there's really no point in putting the ZIL on ephemeral storage.
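
For anyone trying this, those knobs are per-dataset properties; a sketch, with tank/mysql as a made-up dataset name and 16k chosen to match the InnoDB page size:

# zfs set primarycache=metadata tank/mysql
# zfs set recordsize=16k tank/mysql

Note that recordsize only affects blocks written after the change, so it is best set before loading the data.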


ZFS is such a delight to work with. I learned something new in this article. Thanks for linking it.


It may interest you to know that in addition to running on ZFS, the rsync.net platform also has the option to 'zfs send' and 'receive' (over SSH):

https://arstechnica.com/information-technology/2015/12/rsync...
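
The mechanics, for anyone unfamiliar (pool, dataset, and host names are placeholders):

# zfs snapshot tank/data@2018-05-27
# zfs send tank/data@2018-05-27 | ssh user@backuphost zfs receive backups/data

Later runs can use zfs send -i to ship only the blocks changed since the previous snapshot.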


I normally couldn't care less that you guys try to slip a thinly-veiled advertisement into any HN discussion that concerns ZFS, but, geez, c'mon guys, this is really pushing it.

At the very least, throw in that mention of the HN discount.


Agreed.


uncalled for. Really?


> ZFS is much less affected by the file level fragmentation, especially for point access type.

I'm disappointed they didn't show that comparison. How much does the original result change in case of fragmented db files?


Actually, it is quite the opposite. ZFS is a copy-on-write filesystem: if you write in the middle of a file, the data block gets moved to a new place on your disk. For a typical database load, your db files get more and more fragmented over time.


"RAIDZ" seems to be most popular, but if you want better performance go with mirrors. Then let yours VM's have their own FS on top of zvol. For example NTFS on top of a zvol you can get GB rw/s even with spinning HDD's.


If you have 24 disks for a pool, striping across four 6-disk RAIDZ2 vdevs works well too.
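
Roughly (disk names are placeholders):

# zpool create tank \
      raidz2 sda sdb sdc sdd sde sdf \
      raidz2 sdg sdh sdi sdj sdk sdl \
      raidz2 sdm sdn sdo sdp sdq sdr \
      raidz2 sds sdt sdu sdv sdw sdx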


Another interesting next-generation file system project is Red Hat Stratis: https://stratis-storage.github.io/

After dropping btrfs support some time ago, they started developing their own next-gen file system based on LVM and XFS. It is now available as a technology preview on Fedora 28.


I read this article and took away the conclusion that to get acceptable performance from ZFS compared to XFS, I have to do extensive tuning and throw in a half terabyte of NVMe storage as a cache.

Not very impressive.


I agree they didn't present it very well. But in reality if you do want to get as much as possible out of your database, you'll need to look at tuning at that level anyway - regardless of the database, or the filesystem you use.

I expect that XFS with bcache would need much less tuning up front though. Adding bcache in front of the original configuration should give a similar improvement to l2arc.


You can also enable LZ4 compression on the ZFS dataset itself for even faster performance:

# zfs set compression=lz4 mysqldatavol


At a cost of CPU to do the work.


Your CPU spends most of its time idle, and your disks are tens of thousands of times slower. For systemic performance, if you can trade almost anything for clock cycles, you should do that.


Yes, sort of like saying you pay for car insurance at the cost of a better car. The CPU overhead for LZ4 is low, low, low.


The trend of crappy posts from Percona continues; they must suddenly want attention.

Percona was best at consulting; their DBAs were worth it. The software may be close to irrelevant now, but not their DBAs.


This is the worst GDPR (or whatever) implementation I've seen yet on mobile: the button to accept (I assume it is to accept) renders underneath the fixed social buttons at the bottom. And the modal is fixed, too, so no amount of scrolling exposes it.


Been doing hardware RAID for years. Went for the ZFS meme, got half the performance, and even that modicum required unnaturally deep queues. "Lost" 10k+ of my employer's money.

How many drives are these guys using, and how does it scale compared to theoretical performance?



