ZFS is an industrial-scale technology. It's not a scooter you just hop on and ride. It's like a 747, with a cockpit full of levers and buttons and dials. It can do amazing things, but you have to know how to drive it.
I've run ZFS for a decade or more, with little or no tuning. If I dial my expectations back, it works great, much better than ext4 for my use case. But once I start trying to use deduplication, I need to spend thousands of dollars on RAM or the filesystem buckles under the weight.
My use case is storing backups of other systems, with rolling history. I tried the "hardlink trick" with ext4, but the space consumption was out of control, because small changes to large files (log files, ZODB) caused duplication of the whole file. And managing the hard links took enormous amounts of time and disk I/O.
ZFS solved that problem for me. I just wish I could do deduplication without needing 64GB of RAM. But I take what I can get.
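For anyone trying to size that RAM up front, a rough sketch (the pool name "tank" is just a placeholder): zdb can simulate dedup on an existing pool and report the estimated table size, and the usual rule of thumb is on the order of a few hundred bytes of RAM per unique block.

    # Simulate dedup on an existing pool and print a dedup table histogram
    # plus estimated dedup/compress ratios (read-only, but can run a long time)
    zdb -S tank

    # If dedup is already enabled, show the current dedup table statistics
    zpool status -D tank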
I totally agree with your comment, except for the last bit about 64GB of RAM being unreasonably high.
Is it unreasonable to spend thousands of dollars on memory for enterprise-grade, production-level servers? In my experience, you'd almost certainly be better off with over a hundred GB of RAM in a server if you want to maximize overall compute density.
To be clear, plenty of desktops and all workstation-class laptops support 64GB+ of RAM.
I think that agrees with my "industrial grade" comment. :-) In production we are currently deploying boxes with 256GB of RAM. But for our backup server, it's hard to justify spending more on the RAM than on the disks in it. The box is capable of it; it's just the cost that makes it unappealing for this use.
For various reasons, the backup server doesn't get much priority.
Are there any deduplicating filesystems for Linux similar to the (very basic) NTFS deduplication in Windows? I’ve fiddled with quite a few Linux dedup solutions over the years but nothing seems to be production ready. Even ZFS isn’t that useful as it only chunks on multiples of the block size.
The basic feature set for deduplication in my book is non-block-aligned chunking with compression. NTFS has had transparent post-process dedup since 2012 and is pretty much ideal for file, VM, and backup servers. It just works in the background with little performance impact but huge space savings.
It feels dirty using Windows servers as NAS for our Linux machines, but that’s what we’re doing today.
With XFS and reflink, out-of-band deduplication is totally possible and is a userspace [1] issue. But XFS is not doing anything to assist in accelerating the identification of duplicate blocks, instead it simply implements ioctls for what is essentially extent sharing.
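As a concrete example of that userspace approach, a tool like duperemove can scan a reflink-capable XFS filesystem and ask the kernel to share identical extents; the device, mount point, and hashfile below are just placeholders.

    # XFS needs reflink support enabled at mkfs time (default in recent xfsprogs)
    mkfs.xfs -m reflink=1 /dev/sdb1

    # Hash file contents and submit duplicate extents to the kernel for sharing
    duperemove -dr --hashfile=/var/tmp/dedup.hash /data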
btrfs has some dedup options, but I just recently looked and they all seem like kludges. I'd really like HAMMER on Linux, but that doesn't seem to be going anywhere. And I don't really want to run DragonFly for my backup servers, for maintenance reasons.
The comparison isn't apples to apples, though. You'd have to set up bcache or dm-cache with the NVMe drive in front of XFS to compare it with ZFS plus an L2ARC on an NVMe drive. The article describes this as exotic technology, but bcache is generally considered stable.
The point still stands: ZFS is fast enough 99.99% of the time (when tuned correctly), and it simplifies a lot of administrative tasks.
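For reference, the bcache side of that comparison is only a handful of commands; a minimal sketch, with hypothetical device names:

    # Register the slow backing device and the NVMe cache device
    make-bcache -B /dev/xvdb
    make-bcache -C /dev/nvme0n1

    # Attach the cache set (UUID from bcache-super-show) and put XFS on top
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
    mkfs.xfs /dev/bcache0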
Is it? How about the recent bug in Linux 4.14 that could lead to data loss with bcache (fixed in 4.14.2)? On the other hand, ZFS on Linux also had a regression recently.
With bcache one would have a write-back cache, so it would speed up writes as well (which would be superior to L2ARC).
Comparable to the L2ARC would be adding the NVMe as swap and then increasing the MySQL buffer pool to 400GB or something (I don't know if that would work without issues, though). With zswap that would even have compression.
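Roughly, the two knobs being compared here look like this; the bcache device name and sysfs paths assume a stock kernel, so treat it as a sketch:

    # bcache: switch from the default write-through to write-back caching
    echo writeback > /sys/block/bcache0/bcache/cache_mode

    # zswap: compress swapped-out pages in RAM before they hit the NVMe swap device
    echo 1 > /sys/module/zswap/parameters/enabled
    echo lz4 > /sys/module/zswap/parameters/compressor   # if the lz4 module is available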
Yes, it would. And then it would be an apples-to-apples comparison. But that was not the result the Percona team was trying to show, as discussed in the comments. They simply wanted to show that "if you throw more hardware at it, ZFS will outperform XFS with less hardware."
Don't get me wrong: ZFS is an incredible masterpiece, but there are almost no realistic benchmarks. They are either in favor of ZFS, like the benchmark from Percona, or in favor of other systems when the L2ARC/ZIL is not used.
The point to take away from this is the entirely unsurprising conclusion that locally-attached unthrottled ephemeral disk has higher throughput than network-attached IOPS-throttled EBS. So yeah, sure, if you take the time to pre-cache all your EBS-stored data onto the local ephemeral drive and only do reads from the cache, then you will get more query throughput on the local ephemeral drive than on the remote EBS drive. But I'm not sure what that is supposed to tell us about ZFS or XFS.
The post is meant to explain how ZFS functions and how you can tune it for the best performance. The author notes that the benchmarks are not fair; they are there to show how much improvement you could potentially gain from tuning ZFS.
The title is "About ZFS performance," not "An in depth comparison of ZFS and XFS."
Part of the point is that you can easily take advantage of that fast local storage to add transparent caching to a database that is otherwise unaware of it.
For a transaction processing database workload, it's generally good practice to run with "zfs set primarycache=metadata", which means that ZFS won't attempt to cache anything except metadata. This might have reduced the 15% cache overhead the author observed.
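For example, on a hypothetical dataset holding InnoDB data files, that setting (plus matching the record size to the 16K InnoDB page) looks like:

    # Let the database's own buffer pool cache data; ZFS caches only metadata
    zfs set primarycache=metadata tank/mysql

    # Match the ZFS record size to the InnoDB page size to limit read-modify-write
    zfs set recordsize=16k tank/mysql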
I'll be interested to read the future work on using larger page sizes. Conventional wisdom would hold that it was a bad idea (at least, for a write-heavy workload) because of the write-amplification that it produces, but it sets me wondering to what extent ZIL-offloading could mitigate that. Then again, there's really no point in putting the ZIL on ephemeral storage.
I normally wouldn't care that you guys try to slip a thinly-veiled advertisement into any HN discussion that concerns ZFS, but geez, c'mon guys, this is really pushing it.
At the very least, throw in that mention of the HN discount.
Actually, it's quite the opposite. ZFS is a copy-on-write filesystem: if you write in the middle of a file, the data block gets written to a new place on your disk.
For a typical database load, your DB files get more and more fragmented over time.
"RAIDZ" seems to be the most popular layout, but if you want better performance, go with mirrors. Then let your VMs have their own filesystem on top of a zvol. For example, with NTFS on top of a zvol you can get GB/s reads and writes even with spinning HDDs.
After dropping btrfs support some time ago, they started developing their own next-gen file system based on LVM and XFS.
It is now available as a technology preview on Fedora 28.
I read this article and took away the conclusion that to get acceptable performance from ZFS compared to XFS, I have to do extensive tuning and throw in a half terabyte of NVMe storage as a cache.
I agree they didn't present it very well. But in reality if you do want to get as much as possible out of your database, you'll need to look at tuning at that level anyway - regardless of the database, or the filesystem you use.
I expect that XFS with bcache would need much less tuning up front though. Adding bcache in front of the original configuration should give a similar improvement to l2arc.
Your CPU spends most of its time idle, and your disks are tens of thousands of times slower. For systemic performance, if you can spend clock cycles to save on almost anything else, you should do that.
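ZFS's cheap lz4 compression is the textbook example of that trade; enabling it on a pool (the name here is a placeholder) spends otherwise-idle CPU cycles to cut the amount of data the slow disks have to move.

    # Spend a little CPU to reduce bytes written to and read from disk
    zfs set compression=lz4 tank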
This is the worst GDPR (or whatever) implementation I've seen yet: on mobile, the button to accept (I assume it is to accept) renders underneath the fixed social buttons at the bottom, and the modal is fixed too, so no amount of scrolling exposes it.
Been doing hardware RAID for years. Went for the ZFS meme, got half the performance and even that modicum required unnaturally deep queues. "Lost" 10k+ of my employer's money.
How many drives are these guys using, and how does that scale compared to theoretical performance?