Other file systems and RAID setups do not checksum your data. If you have a mirror of two disks (RAID1) and during a read two blocks differ from one another, most (all?) RAID controllers (hardware and software) will simply choose the lower numbered disk as canonical and silently "repair" the bad block. This leads to loads of silent data corruption on what we might consider a reliable storage solution.
By contrast, ZFS will store both the block and its checksum and will use the block with the correct checksum as canonical.
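To make the difference concrete, here is a toy Python model of a two-way mirror read. It is only a sketch: `MirroredBlock`, `read_naive` and `read_verified` are names made up for illustration, SHA-256 stands in for whichever checksum a real pool is configured with, and real ZFS keeps the checksum in the parent block pointer rather than alongside the data.

```python
import hashlib
from dataclasses import dataclass
from typing import List


def checksum(data: bytes) -> bytes:
    """SHA-256 used here purely as a stand-in checksum."""
    return hashlib.sha256(data).digest()


@dataclass
class MirroredBlock:
    """One logical block stored on two disks, plus the checksum recorded at write time."""
    copies: List[bytes]  # copies[0] is the lower-numbered disk
    expected: bytes


def write_block(data: bytes) -> MirroredBlock:
    return MirroredBlock(copies=[data, data], expected=checksum(data))


def read_naive(block: MirroredBlock) -> bytes:
    """Checksum-less mirror: trust the lower-numbered disk and overwrite the other."""
    block.copies[1] = block.copies[0]
    return block.copies[0]


def read_verified(block: MirroredBlock) -> bytes:
    """Checksumming mirror: return a copy that verifies and repair the ones that do not."""
    good = next((c for c in block.copies if checksum(c) == block.expected), None)
    if good is None:
        raise IOError("every copy fails verification; the error is reported, not hidden")
    block.copies = [good for _ in block.copies]
    return good
```

If you corrupt `copies[0]` and call `read_naive`, the bad data is returned and also copied over the good disk. `read_verified` returns the original data and rewrites the bad copy instead.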
In other words, if you care about your data at all, use ZFS. I am frankly surprised it's not the standard file system for most situations as it is the only production filesystem that can actually be trusted with data.
P.S.: I have been told that at least on Linux if you have more than two drives, the Linux software RAID controller will try to choose the version of the block that is agreed upon by most drives, if that's possible. This is no guarantee, but it's better than randomly choosing one version.
P.P.S: BTRFS and friends seem to not yet be as production ready as ZFS. Conversely, ZFS works beautifully on Linux thanks to the ZFS on Linux project.
I can only speak from my own limited experience, but for me ZFS on Ubuntu works as advertised. I was trying to put together a long term storage solution without becoming a storage expert. ZFS fit the bill. If you are nervous, go with a BSD: they are great choices.
I can only speak from my own limited experience, but for me RAID6 on linux works as advertised...
I'll gladly take silent data corruption over a hard total failure of the entire volume any day. At least I have a chance at fixing the former, and for my use cases that's preferable.
Every time ZFS on Linux is brought up it's met with skepticism. As a lot of people trying out ZFS in virtual machines discovered the hard way (lost volumes), there is a lot that can go wrong (and it does seem like it can go wrong), and any version running on Linux will not have been tested as much, which makes it a risk. And risk is the thing you are trying to minimize by using ZFS in the first place...
Now BTRFS isn't ready for production yet so that leaves, well, nothing if you want checksums and cheap snapshots.
I guess I'll have to wait another 5 years and see if BTRFS is ready, until then it seems the only thing I can hope for is luck.
> Examples of things you don't want to do if you want to keep your data intact include using non-ECC RAM
I'm planning to build a (Linux) development workstation soon and was intending to run ZFS on at least a proportion of the disks in the system, if not the whole thing. But building a workstation that supports ECC appears to be ridiculously hard and expensive - most of the Haswell CPUs don't support ECC. Can anyone point me at a sensible CPU/motherboard combo that's comparable to something like an i7-4770k/Gigabyte GA-Z87MX-D3H?
(As an indication of the state of the component market, pcpartpicker.com doesn't support searching CPUs or motherboards by ECC support, but it does allow you to search by what colour the motherboard is!)
I feel your pain -- I've been trying to figure out the state of (cheap-ish) ecc support for a while now (the freenas forums can be a big help here).
Basically you can roll the dice with AMD, or just go with Intel (what I'll be doing). If you want reasonable performance, get a socket 1150 xeon v3 (basically a re-branded i7 with ecc support). The low-mid xeon v3 appear to score pretty well on a price/performance ratio. Can't be overclocked, and you'd probably get more single thread performance/price with an i5 -- but apart from that they seem pretty solid. As others have mentioned, (some) i3s are also an option -- but as you'll likely have to pay a little more for a main board with ecc support, that doesn't make much sense as I see it.
The final option is to get one of the newish atom based Avoton boards, like ASROCK C2750D4 INTEL C2750 AVOTON OCTACORE MITX -- not for a workstation, though. But should be nice as a NAS.
If anyone knows of a useful overview of AMD cpus and mainboard combinations that support ECC -- please let us know. I've yet to find anything beyond the anecdotal "Here's my build, and it's got ECC sticks in it, and maybe the ECC is actually enabled, but I haven't really checked."
YMMV, but I've used ZFS on all my workstations and servers for over 2 years now. Many of them are ZFS on top of LUKS/cryptsetup for FDE. None of the machines use ECC RAM, and due to lots of things (e.g. me doing dumb shit, bad cutting-edge kernel builds, fubar GFX drivers causing X to boot to a black screen, etc.) I frequently either hard reset them or Alt + Prtscreen + REISUB them. I've never had a single issue with ZFS.
By comparison, I've had issues with virtually every other filesystem that exists for Linux: ext(2,3,tho not 4), reiserfs, and especially the much hated xfs, death to xfs etc :)
I run this hardware, picked because... well, cheap:
GIGABYTE GA-990FXA-UD5 ( AMD 990FX/SB950 - Socket AM3+ - FSB5200 )
4 x DDR3 4GB DDR1866 (PC3-15000) - KINGSTON HyperX [KHX1866C9D3/4G]
or
4 x 16gb of the same type as above
AMD FX 8-Core FX-8320 [Socket AM3+ - 1000Kb - 3.5 GHz - 32nm ]
The ZFS setups all include root + swap running on ZFS with FDE, this sort of thing for workstations:
You ever try JFS? I standardised on it for all my machines a few years ago and it's served me very well - fast fsck and no corruption so far. Bit long in the tooth now but that's no bad thing for a filesystem.
I have not, but the only reason for not trying it is because of my increasing use of RAID in hardware and VMs, I wanted something that was completely agnostic to disks, which is something ZFS delivers but JFS didn't seem to when I checked.
Basically I wanted something that doesn't intertwine the concept of file-system with the concept of disks or partitions, but as space. So changes to the underlying pool of disks (or partitions, or loopback files, etc) don't mean so much painful rebuilding for me, something I spent many hundreds of hours on in the past, often at stupid o'clock in the morning. Did I mention how much I hate XFS? That's why :)
While the recommendation is in general sound, I don't think the article applies to using ZFS on a workstation. It's about using ZFS as a storage subsystem (of multiple TB size) where data must be kept safe and without errors.
You can still get most of the benefits on a workstation with non-ECC RAM, consider that any other file system would have the same challenge to deal with on non-ECC RAM.
So, as in all things, it's a money/benefits tradeoff: Does the data you're working with warrant spending a ridiculous amount of money?
You need a Xeon v3 for the ECC support, basically. Also obviously getting ECC ram for stability and getting a K series CPU for overclocking is just an incompatible mindset.
Most AMD chips support ECC on the cpu, many compatible motherboards include it but don't bother to specify on the spec sheet. I was running an Asus M5A97 with ECC with ZFS for a few before it died a tragic early death.
Supermicro has a ton of mainboards with ECC support. Many of them have integrated "RAID" or JBOD. 8 SAS + 6 SATA is typical. I haven't checked the prices for a while, but last time I bought, it was about $250 for a mainboard with integrated 8x SAS JBOD and 6x SATA. Very cheap, when you consider no separate JBOD card is required.
Those boards and their SAS chipsets (LSI SAS2008, etc.) tend to be well supported on platforms capable of running ZFS, like Solaris, FreeBSD etc.
Other typical goodies include IPMI (integrated IP-KVM), good quality dual NICs and generally good quality components.
I don't know what your budget is, but I found that the ASRock E3C226D2I Mini ITX "Server Motherboard" fit my needs well. Being an LGA 1150 board, it seems to support the i7-4770k CPU you mentioned and is about $100 more than your comparable setup.
It has 2 USB 3.0 ports, supports up to 16GiB of RAM, and has 6 SATA 6Gb/s ports, so I think it would also work well for a workstation board, if you added a video/audio card -- HDMI out is the only glaring omission for a workstation.
I'm using it in my FreeNAS fileserver and have been happy with it.
What I would like to see is a Feynman-style "Introduction to Filesystems" geared toward programmers. And when I say Feynman-style, I mean lucid, devoid of jargon, and dense with meaningful concepts.
Granted, most application programmers don't deal with the file system in any meaningful way - most interaction is deferred to other processes (e.g. the datastore). But I for one would be interested in certain silly things like, oh, writing a program that can (empirically) tell the difference between a spinning disk and a solid state disk, or what file system it's running on, just from performance characteristics. Other fun things would be to determine just how fast you can write data to disk, and what parameters make this rate faster or slower.
Many of these considerations don't have anything to do with ZFS per se, but come up in designing any non-trivial storage system. These include most of the comments about IOPS capacity, the comments about characterizing your workload to understand if it would benefit from separate intent log devices and SSD read caches, the notes about quality hardware components (like ECC RAM), and most of the notes about pool design, which come largely from the physics of disks and how most RAID-like storage systems use them.
Several of the points are just notes about good things that you could ignore if you want to, but are also easy to understand, like "compression is good." "Snapshots are not backups" is true, but that's missing the point that constant-time snapshots are incredibly useful even if they don't also solve the backup problem.
Many of the caveats are highly configuration specific: hot spares are good choices in many configurations; the 128GB DRAM limit is completely bogus in my experience; and the "zfs destroy" problem has been largely fixed for a long time now.
There was a cheesy "CSI" video with some Sun engineers building a zfs pool from a crapton of USB sticks. I think I might do that to have a play with it before I spend ~$1500AUD building a NAS.
I actually did this with 64x32GB eBay special (read: godawful slow) MicroSD's. More to make a joke about my "enterprise flash storage system" as we were in the middle of a bunch of storage vendor trials, and sick of dealing with buzzword sales guys.
It was... slow. But the blinky lights were pretty awesome.
We did a similar experiment to http://macguild.org/raid.html (RAID 5 over USB floppy drives) with a bunch of small slow USB sticks we'd collected between us (~64..256Mb IIRC, using just 60M on each to treat them as same sized devices) and various Linux RAID arrangements. It was a good way to play with the tools in an environment where we can break things artificially, as well as an interesting little play thing. Another good place to practise is in VMs of course. Both options are far enough away from a real physical NAS that any performance tests you do are largely irrelevant and some of the failure modes aren't as easy to properly simulate as you might think, but still a useful learning or confidence building exercise if you are new to working with the relevant tools.
Or you could just use multiple partitions on a single scratch disk (or even a ramdisk). A great way to get familiar with the workflow and tools without going to a lot of effort. This is how I first experimented with ZFS.
You can also create files on disk and add them to pools.
There's actually a trick you can use to create a failed ZFS array (e.g. if you want to create a 5+1 array with only five disks), where you use dd to create a sparse file of the appropriate size, which lets you create a '1 TB' file while only writing 1 byte to disk. Add it to your zpool along with the rest of your disks, then fail it out, remove it, and replace it with a physical disk.
It's a good way of taking a drive with 2 TB of data on it and adding it to a ZFS filesystem which includes that disk while also keeping the data, but without using a separate disk (or copying the data twice).
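The sparse-file part of that trick can be tried without ZFS at all. The Python sketch below creates a file with a large apparent size and almost nothing allocated on disk. `make_sparse_file` and `sizes` are hypothetical helper names, the 1 GiB size is arbitrary, and it assumes a filesystem that supports sparse files (most Linux ones do).

```python
import os
import tempfile


def make_sparse_file(path: str, size_bytes: int) -> None:
    """Create a file of the given apparent size without writing its contents."""
    with open(path, "wb") as f:
        # Extending via truncate leaves a hole; no data blocks are written.
        f.truncate(size_bytes)


def sizes(path: str) -> tuple:
    """Return (apparent size, bytes actually allocated) for a file."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    fake_disk = os.path.join(tmp, "fake-disk.img")
    make_sparse_file(fake_disk, 1 << 30)
    apparent, allocated = sizes(fake_disk)
    print(f"apparent={apparent} bytes, allocated={allocated} bytes")
```

A file like this is what gets handed to the pool as the stand-in "disk" before it is failed out and replaced with the real one.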
I actually run such a USB stick RAID setup (using regular Linux RAID/ext4). Its write speed is horrible but it's really cheap and serves well as my personal vault.
I'd say more like "I'm not smart enough to run a high-performance NAS". If you follow the tutorials ZFS is the friendliest raidlike system I've ever used, and the most tolerant of sysadmin mistakes (which isn't to say you can't destroy all your data, but that's true of any filesystem). But sure, if you need particular IOPS numbers then you should probably hire a professional or buy a prebuilt NAS system that provides those guarantees. Or be prepared to benchmark and experiment until you get it right.
ZFS is perfect for a home NAS. That article is talking more about business / enterprise usage, so it obviously mentions the horror stories of consumer-grade hardware. But in all practicality running ZFS over a bunch of WD Reds with non-ECC RAM is fine for home use (given how much worse the alternatives are).
Just remember to keep back ups. It doesn't matter how good you might think your storage array is; always make back ups.
I've been using ZFS for a home NAS for a few years, and it's been great. I only know three commands, and it's served me perfectly, even with low RAM and no ECC. Just don't make the mistake I made: check whether your disks use 4K sectors and tell ZFS when creating the pool.
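For reference, the setting in question is usually called `ashift`, the base-2 logarithm of the sector size the pool should assume. A small Python sketch of the arithmetic follows. `ashift_for` and `physical_sector_size` are hypothetical helper names and the sysfs path is Linux-specific. Drives that report 512-byte sectors while using 4K internally will still fool it, so treat the result as a hint.

```python
from pathlib import Path


def ashift_for(sector_bytes: int) -> int:
    """Return log2 of the sector size, e.g. 12 for 4096-byte sectors."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


def physical_sector_size(device: str) -> int:
    """Read the physical sector size Linux reports for a device such as 'sda'."""
    path = Path("/sys/block") / device / "queue" / "physical_block_size"
    return int(path.read_text().strip())
```

On ZFS on Linux the value is passed at creation time, as in `zpool create -o ashift=12 ...`; other platforms may set it differently.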
You can just use FreeNAS - it uses ZFS underneath and provides a nice web admin interface. Of course you can also ssh in and run commands on the shell if you want.
I don't see any mention of "for god's sake don't let your pool get more than 90% full or the molasses gremlins will ooze out of your disk controllers and gum up the plates" in that list...
I'd add one more: it is impossible to remove a vdev from a pool. That means in a small server / home scenario, you have less flexibility to throw in an extra mirror vdev of old disks you had lying around, or gradually changing your storage architecture if HDD sizes increase faster than your data consumption.
However, none of these diminish what I think is the main use case of ZFS: very robustly protecting against corruption below the total-disk-failure level.
Best practice is to build a new vdev with the 4x4TB drives and move the entire dataset across. It isn't like other systems where you can just bolt in another drive and it expands to absorb that.
The critical concept here is that where other systems use physical disks, ZFS uses vdevs. vdevs are 1 or more disks but are presented to the RAID system as a single 'storage entity'. Thus, you don't add disks to a storage pool, you add vdevs.
Doing some more reading, this sounds like it "degrades" the array each time and could be risky. Other replies to my comment seem to suggest it can't be done.
I believe there are no problems with your scenario. There's a zpool replace command which does this:
zpool replace [-f] pool old_device [new_device]
Replaces old_device with new_device. This is equivalent to attaching
new_device, waiting for it to resilver, and then detaching
old_device.
The size of new_device must be greater than or equal to the minimum
size of all the devices in a mirror or raidz configuration.
new_device is required if the pool is not redundant. If new_device is
not specified, it defaults to old_device. This form of replacement
is useful after an existing disk has failed and has been physically
replaced. In this case, the new disk may have the same /dev path as
the old device, even though it is actually a different disk. ZFS
recognizes this.
-f Forces use of new_device, even if it appears to be in use.
Not all devices can be overridden in this manner.
Thanks for pointing this out. I still don't know how to see all of this in a home NAS context however. It would be really great if someone could explain.
Let's say I have 6 SATA ports. I have 4 drives that I collected from various computers that I now want to unify in a home-built NAS:
A. 1TB
B. 2TB
C. 4TB
D. 1TB
E. -empty-
F. -empty-
Now all my drives are full and I want to either add a disk or replace a disk; How do I:
1. replace disk A. with a 4TB disk
2. add a 4TB disk on slot E.
As long as you have room to connect one extra drive there's no "degrading" - add new disk, zpool replace, remove old disk, repeat. I've done this and it worked, and I've had the new disk fail partway through and not had to spend any time resilvering, so I don't think there's any degrading. (In any case you're certainly in no worse a situation than you would be if you had a single-drive failure. If you're worried about having a second drive fail while you're replacing a disk after a failure, use raidz2).
Obviously if you only have 4 drive slots and have to remove a disk to replace it then you're going to have less redundancy while you're in the process of doing the replacement.
You can replace the 4x3TB drives with 4x4TB drives for more space. But you can't replace the 4x3TB drives with 3x4TB drives or 2x6TB drives. If you started with two mirrored pairs and want to change to a raidz, you can't without destroying the pool. You can upgrade disks and add more disks, but you can't remove disks without replacing them.
I'd thought that being able to use different-sized drives was a selling point, for piecemeal upgrades, but the article says to never mix sizes across vdevs or zpools.
You can upgrade the capacity of a pool, but you have to upgrade each drive, one at a time, with time to resilver in between, and the bigger capacity doesn't come online until all drives are incorporated.
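A rough way to see why the extra space only appears at the end: in a raidz vdev every member is treated as if it were the size of the smallest one. The Python sketch below uses that simplified model (it ignores metadata and padding overhead, and `raidz_usable_tb` and `upgrade_steps` are made-up helper names) to walk through swapping 3 TB drives for 4 TB ones, one at a time.

```python
from typing import List


def raidz_usable_tb(disk_sizes_tb: List[float], parity: int = 1) -> float:
    """Approximate usable capacity: (disks - parity) * smallest disk."""
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than parity devices")
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)


def upgrade_steps(disks: List[float], new_size_tb: float, parity: int = 1) -> List[float]:
    """Usable capacity before the upgrade and after each single-disk replacement."""
    disks = list(disks)
    history = [raidz_usable_tb(disks, parity)]
    for i in range(len(disks)):
        disks[i] = new_size_tb  # replace one disk, then wait for the resilver
        history.append(raidz_usable_tb(disks, parity))
    return history


print(upgrade_steps([3, 3, 3, 3], 4))  # prints [9, 9, 9, 9, 12]
```

In this model the capacity stays at 9 TB through the first three swaps and only becomes 12 TB after the fourth.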
Another gotcha: If you intend to physically move a pool of disks from one system to another, or even move them to different SATA or SAS ports on the same system (or different drive bays in a JBOD), you should EXPORT THE POOL FIRST. I came within a hair's breadth of losing a pool when I didn't do that, and import attempts resulted in "corrupted data" messages. I managed to straighten that out and recover the pool, but it was a close call.
When it is not exported, the pool stores the device names of its component disks, and if the wrong disks end up on the wrong devices, you get the "corrupted data" problem even though the data really isn't corrupted.
Its claim is one can add nodes to the cluster to expand its storage capability. Given enough physical space it might be cheaper to find 3-4 machines (towers) with 4 drive bays than building one new server with 12+ bays. Then use linux's fuse client to access it.
I've never seen a clustered filesystem that was friendly or simple enough for the small office / home use case. With ZFS I pile a bunch of disks in a server, run samba on it, and there you go. Still, best of luck if you try it.
Only problem with ZFS for budget/home use as I see it, is the inability to grow zpools (add a disk to get more space on an existing filesystem). Basically the design is more "buy what you need (eg: 4x2TB), then replace (eg: 4x4TB as prices come down) -- and throw away your old disks".
I have a "two generation" setup; nominally 4x1TB + 4x2TB (though one of the disks in the latter raidz is actually 4TB because I had to replace a failed 2TB disk). It's gone fairly well so far; 8 disks + 1 for the OS is about all I can fit in an ordinary PC, and I didn't feel too bad about throwing away 500GB disks to replace them with 2TB ones; replacing the 1TB disks with 4TB ones will be the same.
If you're building something today, and get an Avoton board, you'll have 10+ SATA ports on the board. I don't know what kind of midi-tower only supports 4x 3.5" drives -- maybe you're thinking hot-swap, front-accessible?
(3.5" & 5.25" Black Tray-less 5 x 3.5" HDD in 3 x 5.25" Bay SATA Cage) in front -- if you feel you need front-accessible drives.
I'd say you normally want fewer boxes to take care of (even if failure will be more catastrophic) on such a small scale. Depends on what you need of course.
I'll say that reading about the ext3 to btrfs conversion made me laugh - it's a mad scientist thing that builds all the btrfs pointers in the ext3 free space, pointing to the data blocks as stored by ext3, and it keeps the filesystem readable as ext3 until it's completed and mounted as btrfs.
While I don't disagree that the original title is perhaps mildly link-baity, I don't think "Read me first" is very apt, either; this article assumes a nontrivial degree of familiarity with ZFS and its concepts.
Something along the lines of "ZFS gotchas" or "ZFS caveats", perhaps?
Unrelated: I really don't think dang's post here warranted downvoting. I expect that it's being done by folks who really want HN's moderators to know they're unhappy about titles being changed as capriciously as they sometimes are. This post is an actual moderator soliciting actual input on how to make at least one instance of that problem better, however, and I think that should be encouraged, not dinged.
Ok, we changed it to "ZFS Gotchas" because the body of the article more or less uses that word.
It's ok for dang to get dinged. But "capricious"? No. I realize it sometimes seems that way to people paying sporadic attention, but every change we make is in keeping with the HN guidelines, and those are hardly secret or unclear.
We make mistakes, of course, and are always open to improvement. The helpful way to criticize an HN title is to suggest a better one.
It may be the case that every change is in keeping with the guidelines, but it's certainly not the case that the guidelines are applied consistently. Looking at stories where the submitter added some useful, non-linkbaity detail that wasn't in the original title, it seems entirely arbitrary whether that detail will be left in place or cut from the title. And belittling your users' experience like that ("sporadic attention"?) will make you few friends.
Better policy a): Don't change the titles users submit. Trust the community to flag linkbait titles.
Better policy b): Don't allow users to submit a title. Automatically scrape the title from the linked page. Have moderators change the titles when they're unhelpful or misleading (this is much less likely to annoy users than the current system, because the moderator wouldn't be replacing the submitter's title, it would be the HN system replacing another part of the HN system).
edit: Suggest a better title rather than suggest a better policy? How is one supposed to suggest a better title? There is no form for that, and it's not something to clutter the comments with.
So by changing it to a more boring title, you ensure that fewer people read the article.
Brilliant.
Incidentally, "Things Nobody Told You About ZFS" is the subtitle of the article in question, and arguably is a more informative description of its contents than "ZFS: Read Me First".
What I've resolved to do when having to choose between a title and subtitle, or potentially creating my own title, is to include a rationale comment after I make the submission. dang seems pretty reasonable, but also has a lot of work between moderation and community outreach here as well as other responsibilities. Short of opening up more moderator/admin slots it seems reasonable (to me) to take some of that burden on ourselves to make dang's job easier and ensure that post titles end up somewhere nearer what we want for attracting readers and for clarity on the subject they point to.
I wish the author had provided an explanation. The main issue I'm familiar with is that throughput and IOPS capacity generally don't increase linearly with storage capacity, so the time to recover from a drive failure increases significantly with larger drives. The author may be saying that you should use raidz2 or raidz3 with 1TB drives because the time to resilver 1TB is long enough that the odds of sustaining another drive failure with raidz1 are too high, or alternatively that you should use 750GB or smaller drives with raidz1 to keep the resilver times lower in order to reduce the odds of a second failure during resilvering.
It is not due to the time for resilvering. It is due to the rated probability of a non-recoverable read error of 1 bit (or more) on modern drives. This probability is high enough that you have a 32% chance of it on reading 1TB. However, this is actually less of a problem on ZFS compared to hardware RAID because ZFS will only read actual data, not blindly every sector.
HW RAID does not read every sector blindly, there is a level of error detection there. And an errored sector in one read does not mean it errors in every read.
Now, the error detection schemes at the disk level may be insufficient. I don't know enough about how it's done on modern drives (but I suspect that every manufacturer has its own scheme).
I am as well (5x 3TB in Raidz1). I'm pretty sure it's because the likelihood of having an unreadable bit/byte/sector on one of the non-failed disks gets higher as the capacity increases, and thus there is a good chance that you'll lose some data. This article discusses the theory. http://www.zdnet.com/blog/storage/why-raid-5-stops-working-i...
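The arithmetic behind this argument is easy to play with. The Python sketch below assumes a rated error rate of one unrecoverable error per 10^14 bits read (a figure often seen on consumer drive datasheets, not one stated in this thread) and assumes bit errors are independent, which real drives do not strictly obey. `p_unrecoverable_read` is a made-up helper name.

```python
import math


def p_unrecoverable_read(bytes_read: float, bit_error_rate: float = 1e-14) -> float:
    """Probability of at least one unrecoverable error while reading bytes_read bytes.

    Assumes each bit fails independently with probability bit_error_rate,
    i.e. P = 1 - (1 - rate) ** bits.
    """
    bits = bytes_read * 8
    # log1p/expm1 keep precision when the per-bit rate is tiny.
    return -math.expm1(bits * math.log1p(-bit_error_rate))


TB = 1e12  # decimal terabyte, as used on drive labels

for label, volume in [("1 TB", 1 * TB), ("4 x 3 TB", 12 * TB)]:
    print(f"{label}: {p_unrecoverable_read(volume):.1%}")
```

Under those assumptions a single 1 TB read comes out in the single-digit percent range, while reading four full 3 TB drives (what a 5x 3TB raidz1 rebuild would need if the pool were full) comes out above 60%. The result is very sensitive to the rate and the volume you plug in, so use the datasheet value for the drives in question.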
"zpool status" will show if there have been errors reading data from individual devices. If a drive experiences enough failures, at least on illumos and Solaris-based systems, it will be marked degraded or faulted and removed from service. You can view individual failures on these systems with "fmdump -e". Here's a made-up worked example:
https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great
Also note, if you're using a SAS HBA: the LSI 1068E-R based HBAs, when flashed with IT firmware, have a persistent drive mapping setting. When it is active, it will always give a specific physical disk the same device name, even if it gets moved to a different physical port. Let's say you have a drive with serial number ABC123 on port 0 as c4t0d0 and serial number XYZ987 on port 1 as c4t1d0. You could swap the two drives and ABC123 would still be c4t0d0 and XYZ987 would still be c4t1d0.
If all ports on the controller were in use, and you yanked c4t7d0 and slid in a new drive, it would become c4t8d0 (unless you used lsiutil to remove the persistent mapping).
Ok, say you implemented dedup accidentally and now your table is huge. I'm sure it is a slog to go back, but HOW DO YOU DO IT? I've tried the method here: https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs... and didn't get it fixed. Any more thoughts? Please don't recommend I make a new pool and send all data over; I don't have the space at 70% capacity.
Overall, this is quite good. Some of it is oversold as "things nobody told you about" and some of it could really benefit from real data. A few things are a little too close to seemingly unsupported folk-wisdom rather than sound advice.