While I'm no ZFS expert, I've been using it for several years now, and my understanding is this: take what a normal fsck-type tool does and build those features into the underlying FS and supporting toolchain. For what ZFS does and how it works, it really doesn't make sense to me for it to have an "fsck," whatever that means. Really, it's hard to even imagine what an "fsck" would do for ZFS. You'd just end up rewriting bits of the toolchain or asking for the impossible.
I asked this in the other thread, but I'll ask here again. Excluding semantics, what is it that people want fsck to do specifically that zfs doesn't provide a method for already? Seriously, the question to me seems akin to asking why manufacturers don't publish the rpm spec for SSDs. It's a really odd thing to ask and can't be answered without an exhaustive review of the mechanics of the system.
I can't help but get the feeling that a lot of people complaining about ZFS have very little knowledge or familiarity with it and/or BSD/Unix in general. ZFS is not like any Linux FS. It doesn't use fstab, the toolchain is totally different, the FS is fundamentally different. It was built for Solaris and really reflects their ideology, which is completely foreign to people who only have familiarity with Linux. Accept it and move on or don't, but I've yet to see any evidence to back up these claims other than "this is what is done in Linux for everything else" which is just FUD.
Yes, emphatically agreed. I co-founded Fishworks, the group within Sun that shipped a ZFS-based appliance. The product was (and, I add with mixed emotion, still is) very commercially successful, and we shipped hundreds of thousands of spindles running ZFS in production, enterprise environments. And today I'm at Joyent, where we run ZFS in production on tens of thousands of spindles and support software customers running many tens of thousands more. Across all of that experience -- which has (naturally) included plenty of pain -- I have never needed or wanted anything resembling a traditional "fsck" for ZFS. Those who decry ZFS's lack of an fsck simply don't understand (1) the semantics of ZFS and specifically of pool import, (2) the ability to roll back transactions and/or (3) the presence of zdb (which we've taken pains to document in illumos[1], the repository of record for ZFS). So please, take it from someone with a decade of production experience with ZFS: it does not need fsck.
Hi. rsync.net here. I wonder if you concur with us that ZFS does need a defrag utility?
I know you can (and we have) use the copy-back-and-forth method to rebalance how things are organized across multiple vdevs inside one pool (roughly the shuffle sketched below), but it would be nice if that were not a manual process, or could optionally happen as part of a scrub.
Or something.
People can and do run their pools up higher than 80% utilization. It happens. It's happened to you. There should be a non-surgical way to regain balanced vdevs after such a state...
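For anyone following along, the manual shuffle we mean is roughly the following -- a sketch only, with made-up pool and dataset names, and assuming you have the free space to hold a second copy:

# snapshot the dataset, then copy it within the same pool so the new
# writes get spread across all vdevs, including recently added ones
zfs snapshot tank/data@rebalance
zfs send tank/data@rebalance | zfs recv tank/data.new
# after verifying the copy, retire the original and swap the names
zfs destroy -r tank/data
zfs rename tank/data.new tank/data

The obvious downsides (needing space for a second copy, and quiescing writers during the swap) are exactly why it would be nice to have this built in.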
Ah, yes -- now we're talking about a meaningful way in which ZFS can be improved! (But then, you know that.) Metaslab fragmentation is a very real issue -- and (as you point out) when pools are driven up in terms of utilization, that fragmentation can become acute. (Uday Vallamsetty at Delphix has an excellent blog entry that explores this visually and quantitatively.[1]) In terms of fixing it: ZFS co-inventor Matt Ahrens did extensive prototyping work on block pointer rewrite, but the results were mixed[2] -- and it was a casualty of the Oracle acquisition regardless. I don't know if the answer is rewriting blocks or behaving better when the system has become fragmented (or both), but this is surely a domain in which ZFS can be improved. I would encourage anyone interested in taking a swing at this problem to engage with Uday and others in the community -- and to attend the next ZFS Day[3].
We'll see what we can do. We have some ZFS firefighting that dominates our to-do list currently[1][2] but if we can work through that in short order we will dedicate some funds and resources to the FreeBSD side of ZFS and getting "defrag" in place.
I meant to attend ZFS Day and will try to come in 2013 if it is held.
[1] Space accounting. How much uncompressed space, minus ZFS metadata, does that ZFS filesystem actually take up? Nobody knows.
Hey Bryan, I loved your Fork Yeah! talk at USENIX. Everyone who's interested in the history and future of Solaris and ZFS should watch it: https://www.youtube.com/watch?v=-zRN7XLCRhc
I know it's a bit off topic but could you elaborate on the issues you see with BTRFS?
From an operations perspective, ZFS is a god-send. Correct me if I'm wrong, but AFAIK BTRFS lacks any significant quantum of improvement over ZFS; it's merely a "better licensed," not-invented-here, mostly duplicated effort with a couple of refinements, and it isn't nearly as well thought out from the experience of ops folk.
- Online scrubbing rather than fsck, so there is little/no downtime if the filesystem is interrupted, e.g., by a datacenter power failure. An fsck/RAID rebuild of a large filesystem can mean a lengthy outage for users.
- "Always" consistent: Start writing the data of a transaction to unallocated space (or ZIL) and update metadata last.
- Greatly configurable block device layer:
* RAIDZ, RAIDZ2, RAIDZ3, mirror, concat ...
* ZIL (fs journal), L2ARC (cache) can be placed on different media or even combinations of media.
- Send & receive snapshots across the network.
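To illustrate that last point, a minimal send/receive over ssh might look like the following (host and dataset names are made up):

zfs snapshot tank/home@friday
zfs send tank/home@friday | ssh backuphost zfs recv backup/home
# later runs only ship the delta between the two snapshots
zfs snapshot tank/home@saturday
zfs send -i tank/home@friday tank/home@saturday | ssh backuphost zfs recv backup/home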
Reading about ZFS on Wikipedia and here [1] I see that there is a procedure called scrubbing. From [1]: "ZFS provides a mechanism to perform routine checking of all inconsistencies. This functionality, known as scrubbing, is commonly used in memory and other systems as a method of detecting and preventing errors before they result in hardware or software failure."
So I guess "alias fsck='zpool scrub tank'" would quell everyone's concerns.
Scrubbing is more like: read the block, calculate its checksum. Is it the same as the checksum stored for the block (not at the block, an important difference)? Nice. Not the same checksum? Go to the mirror and fetch the correct copy, or use the parity algorithm to reconstruct the missing block from the raidz1/2/3 set, and write it back to the disk with the error.
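In command terms that's simply (pool name as in the alias above):

# walk every allocated block in the pool and verify its checksum
zpool scrub tank
# watch progress, and see any files with errors that could not be repaired
zpool status -v tank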
I see. From the description it's quite obvious that ZFS handles most fault cases automatically. Still, admins might want to force it to recheck all the blocks, and this command achieves exactly that. Frankly, I don't know exactly what fsck does, and I guess it's different for every FS. For ZFS, scrubbing seems to be an adequate alternative.
Hence, if anyone asks you whether ZFS has fsck, it's easier to say yes, it does, and it's called scrubbing.
For most people, fsck and filesystems that rely on fsck are a known quantity. They understand that a filesystem does what it can to keep itself consistent, but that sometimes outside assistance is necessary in the form of fsck. When you show them ZFS and say "Oh, it doesn't need a fsck program", they assume that you're bullshitting them to cover up for the fact that it doesn't have a fsck program yet.
As far as the Linux/Unix thing, I think you're reading way too far into it. Linux neither invented nor popularized filesystem consistency checking programs. Unix filesystems (such as UFS) often have consistency checking programs as well, though they might not be called "fsck". Windows has Scandisk. ZFS is the odd one out here, and it's not surprising for people to treat it as such. Give it time, and if ZFS's approach becomes more widespread, people will come around.
But ZFS has the ability to scrub pools, it can roll back transactions, and it can import degraded arrays, as well as many other things. Again, people seem to be failing to understand that the things fsck does are there either in the FS itself or in the tools. I stand by the argument that this whole debate is due to fear and a lack of understanding (aka ignorance).
I brought up UFS because people who only use Linux often complain in a similar way about how BSD does "partitioning." The two debates seem very similar to me.
I've had to deal with one scenario where ZFS fsck would have been very useful, though it comes up rarely: corruption leading to valid checksums, but bad ZFS data.
In my case[1], our setup died due to some power problems, and somehow NULL pointers got written to disk with valid checksums. Normally this wouldn't happen, but when it does, it's a PITA to debug, because trying to traverse/read the disk gives a kernel fault instead of the segfault you might see in a user-level fsck program. This was a real pain, as we had encrypted disks, and every reboot meant going through the disk attach steps again (enter password, etc.).
As a result, a userspace implementation of scrubbing would be useful since in this sort of rare instance, I'd be able to probe the fsck process with a good debugger and see why it's crashing. Since it's in userspace, the fsck program can also quit more sanely, with a full report on where it found the corruption. I was able to get my data back via some ad-hoc patches, but it was an... interesting experience having to debug in kernel vs in userspace.
zdb isn't a substitute for most of these things, as the on-disk compression and RAIDZ sharding make it difficult to actually see the raw data structures. Max Bruning wrote a post a while back with an on-disk data walkthrough[2], where he wrote some patches to fix this, but they haven't made their way upstream yet. Additionally, FreeBSD and Linux don't have mdb. :(
I wrote that code several years ago. It is not quite the right way to go.
zdb now does decompression, though slightly differently from what I implemented.
Syntax is: zdb -R poolname vdev:offset:size:d
The "d" at the end says to decompress. zdb tries different decompression algorithms until it finds one that is correct.
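So a concrete invocation looks something like this (the vdev:offset:size triple below is made up; in practice you take it from a DVA shown by zdb or mdb):

# read 0x200 bytes at offset 0x400000 on vdev 0 of the pool and try to decompress them
zdb -R tank 0:400000:200:d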
As for my mdb changes, I really think mdb should be able to pick up kernel ctf info so that it can print data structures on disk. That I could probably get working on illumos fairly easily.
My method used zdb to get the data uncompressed, then used mdb to print it out a la ::print. I actually think something like "offset::zprint [decompression] type" in mdb is the way to go. It would mean no need for zdb, which usually gives too much or not enough, and is not interactive (hence, not really a good debugger as far as I'm concerned). Better would be:
# mdb -z poolname
20000::walk uberblock | ::print -t uberblock_t
And from there, something like:
offset::zprint lzjb objset_phys_t
where offset comes from a DVA displayed in the uberblock_t.
Some people seem to get my idea and think it's good. Others either don't get it, or don't care.
Someone like Delphix might really like it.
Interesting rant (from 2009). At NetApp the WAFL file system is also always consistent on disk, so it too doesn't need fsck. That said, WAFL had 'wack' (WAfl ChecK) which could go through and check that the on disk image was correct.
Unlike UFS or FFS or EXTn, the file system can't be corrupted by loss of power mid-write, but like ZFS it can be corrupted by bugs in the code that write a corrupted version to disk. So the tool does something similar to fsck, but it is simpler -- more of a data structure check than a "recreate the flow of buffers through the buffer cache to save as much as possible" exercise.
"At NetApp the WAFL file system is also always consistent on disk, so it too doesn't need fsck."
How does it manage to stay consistent if a cosmic ray strikes it and flips one or more bits?
How does it manage to stay consistent if you physically bump in to the drives and cause physical damage by having the disk head briefly touch the disk surface?
Wouldn't you need a filesystem consistency check and repair tool like fsck in these cases?
"How does it manage to stay consistent if a cosmic ray strikes it and flips one or more bits?"
At the time (and I think it's still true) cosmic rays do not have sufficient energy to flip a magnetic domain on disk. Memory bit flips are detected by ECC, and channel errors (between the I/O card and memory and/or disk) are identified with CRC codes.
"How does it manage to stay consistent if you physically bump in to the drives and cause physical damage by having the disk head briefly touch the disk surface?"
The disks are part of a RAID 4 or RAID 6 group (RAID 6 preferred for drives > 500GB, required for drives >= 2TB), so physically damaging a drive results in a group reconstruction of the data on that drive.
NetApp has always had a pretty solid "don't trust anything" sort of mantra that has been tested and fortified a few times by various events. The ones I got to see first hand were an HBA that corrupted traffic through it in flight, drives that returned a different block than you asked for, and drives that acknowledged they had written data to the drive when in fact they had not.
Back in the early 2000s, anything that could happen to a disk with a probability of once in a billion operations or higher, they got to see about once a month. It was an interesting challenge that required a certain discipline to deal with. When I went to Google and saw their "we assume everything is crap, we just fix it in software" model, it gave me another perspective on how to tackle the problem of storage reliability.
Both schemes work and have their plusses and minuses.
Worked at a company that bought a ton of NetApp filers to support a webmail service. NetApp sales engineers swear on their mothers' graves that there is no such thing as fsck for wafl, that wafl always transitions from one gold-plated consistent state to the next with no possibility of metadata inconsistency. OK.
Three months later, big outage. On-site techs report the filers display "fast wack" on the front panel. Call NetApp support. What is "fast wack"? That's the fsck. Assholes!
It turned out that the filer had become corrupted somehow, and wack itself could not comprehend a filesystem with more than 2 billion files. Inode numbers were stored in a signed int32. Major, major surgery, hotpatching of filer firmware, three days of downtime, serious negative press coverage.
Bottom line: whenever anyone tells you their filesystem is guaranteed to be consistent, kick that person right in the shins.
1. When there is a bug in the code that writes the ZFS on-disk state, why should the bug be addressed by fsck code? That would assume you know of the bug beforehand -- but then you would be better off fixing the bug in the code that writes.
2. When there is a bug in the on-disk state, it should be addressed by the code that reads the data, not by an fsck tool.
2.1. Correcting a bug in the on-disk state should be done on the basis of exact knowledge of the bug, not by a generic check tool.
3. Repair is always based on assumptions. Those can be correct or incorrect. The more you know about the problem that led to the repair-worthy state, the more likely it is that the assumptions are correct.
4. What is the reasoning behind the argument that when your metadata is corrupt, the data is still correct, so you can repair the metadata corruption without problems? It sounds more sensible to fall back to the last known correct and consistent state of metadata and data, based on the on-disk state represented by the pointer structure of the uberblock with the highest transaction group commit number that has a correct checksum. The transaction group rollback at mount does exactly this.
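For what it's worth, that rollback is exposed on the command line through the recovery option of zpool import (a sketch; the pool name is made up):

# dry run: check whether discarding the last few transaction groups would make the pool importable
zpool import -F -n tank
# actually import, rewinding to the newest uberblock whose tree verifies
zpool import -F tank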
I lost a ZFS pool once. The cause ultimately turned out to be a slowly failing PSU. (It was an expensive OCZ PSU, too, which is why I didn't suspect it as quickly as I probably should have. OCZ did replace it under warranty without argument.)
It was a development machine, so it wasn't being backed up. I thought it was just one disk going bad; by the time it was clear that it was something worse than that, it was too late. Most of the important contents of the pool had been checked into the VCS, but not everything. I wound up grepping the raw disk devices to find the latest versions of a couple of files.
Any filesystem would have had serious trouble in such a situation, of course. But I can't help thinking that picking up the pieces might have been easier with, say, EXT3.
On the other hand, I think it speaks well for ZFS that a slowly failing PSU seems to be almost the only way to lose a pool.
So if you have an unmountable ZFS pool, instead of reaching for fsck (which doesn't and won't exist), you can do:
zpool clear -F data
And it will take advantage of the copy-on-write nature of ZFS to roll back to the last verifiably consistent state of your filesystem. That's a lot better than the uncertain fixes applied by fsck to other file systems. It even tells you how old the restored state is.
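If I remember the recovery flags correctly, there is also a dry-run mode, and it is worth verifying the pool afterwards (same pool name as above):

# dry run: report whether the rewind would succeed, without committing to it
zpool clear -F -n data
# confirm the pool is healthy after the rollback
zpool status -v data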
A slight disagreement: the advantage of a ZFS online consistency checker would be to help ensure that there are no bugs in ZFS.
It appears that ZFS lacks a full consistency checker -- scrub only walks the tree and computes checksums; notably absent from this procedure appears to be validation of the DDT. While ZFS claims to be always consistent on disk -- and I certainly believe that the intent is that it be so! -- I seem to have tripped over some bug ( http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016... ) which corrupted the DDT, and now I have no way of rebuilding it, so I dropped $$$ (for me) on a new disk array and did a zfs send | zfs recv so that everything got rebuilt. That's sort of crazy, if I may be so bold.
I suppose I could take the pool offline for several days and poke at it with zdb, but that is not really desirable either.
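For the record, zdb can at least dump dedup-table statistics without days of manual poking, though that is still a long way from validating every DDT entry (pool name assumed):

# print a summary histogram of the dedup table; add more Ds for more detail
zdb -D tank
zdb -DD tank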
The article doesn't ever consider that ZFS might have bugs. Dodgy disks, bad firmware, power failures, yes. But no consideration that the ZFS code could contain problems.
If you are happy that the ZFS code is perfect, then it makes sense to rely upon its consistency checks, snapshot features, etc (and I'm not criticising those). But what if ZFS isn't 100%? How do you recover your data?
Even ZFS isn't a replacement for backups. Think of it more as a better indicator of when things are happening that clearly shouldn't be (e.g. hardware problems).
True and true. But a decent checker like e2fsck should be paranoid & untrusting of on-disk data structures, and able to retrieve at least some data from a badly-mangled filesystem. It is a fallback tool but a useful one.
I don't quite understand the difference between that and the way ZFS is untrusting, with multiple levels of checksums, copy-on-write, scrubbing, etc.
Is there a particular type of data corruption that fsck would recover that ZFS would not?
Suppose there is some bug in ZFS or the OS. Checksums will fail (problem is detected), but conceivably, there could be a type of bug that an fsck tool could fix, without having to roll back to a previous version. Another potential advantage is that a fix by fsck could be much quicker than restoring a complete backup.
OK. Suppose you are running a buggy ZFS, and it causes problems. Why would you trust a filesystem repair tool written by the people who wrote your buggy filesystem to fix those for you without causing more problems? If anything, I would trust the filesystem more, as it (one would hope) sees way more usage.
If your OS or disks are buggy, you are hosed, anyway, as that checking tool would run on the same OS and hardware.
You should look at ZFS as having a built-in fsck that is automatically invoked when needed.
ZFS should be able to restore a failure in a single block quickly. With fsck you have to unmount the drive to even check. If you have a bad write operation (in ZFS) it should be detected even before the block pointer is set.
I'm not too familiar with exactly how fsck works, but it seems to mainly stitch bad metadata back together so there's still no guarantee that your data is perfectly restored.
This would be especially true if a bad FS write operation caused the data to be corrupt.
If the data is really corrupt the FS doesn't have any inherent way of knowing what the correct data should be.
ZFS may not need fsck, but it would be great if Oracle would re-open-source it. I've considered using it, but I can't trust that it has a future.
I'm also rather confused by Oracle contributing to btrfs while also building ZFS privately. My intuition is that if they open-sourced ZFS and offered it under a dual BSD/GPL license, it would become the fs standard overnight.
It has a future. FreeBSD, and companies based on it like iXsystems, have adopted it, and developers are actively hacking on it. FreeBSD 10 will feature TRIM support as well as a new data compression scheme (LZ4): https://wiki.freebsd.org/WhatsNew/FreeBSD10
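Enabling the new compression is a one-liner once you're on a build that has it (dataset name made up):

zfs set compression=lz4 tank/data
# see what it is actually buying you
zfs get compression,compressratio tank/data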
ZFS is open source, and it is available in FreeBSD, for example. Most of the development has been going on in illumos, and the team behind that includes some of the original heavy hitters who worked on ZFS.
Why use a dual BSD/GPL license? I think that would inevitably lead to a fork between the two versions -- for something like ZFS that's going to be used on many different platforms and which is already well-established, the Apache 2.0 license probably makes the most sense.
And after a dozen paragraphs on how ZFS is unlikely to get corrupted, the meat of the content: "my opinion is that you shouldn't try to repair it anyway".
Anyway: You do not repair the last state of the data. And in my opinion: You should not try to repair it ... at least not by automatic means. Such a repair would be risky in any case. [..] In this situation i would just take my money from the table and call it a day. You may lose the last few changes, but your tapes are older.
This "you do not need an emergency repair tool because in an emergency I think you should just forget it" is exactly the claim that this blog post was supposed to be countering. Explaining why a do-the-best-you-can repair utility is not necessary, and the argument it boils down to is "because I don't think you should do that".
The basic problem with filesystem repair is that it repairs the metadata but can do nothing about your data. So when your filesystem enables you to fall back to a known consistent state of metadata and data thanks to COW, you should fall back to that known consistent state -- and not to something that was patched up by a generic tool. How would you know, in such a situation, that the repair didn't get something wrong somewhere among your thousands and thousands of files? The copy-on-write behaviour of ZFS has its advantages.
And as I already wrote in a different comment: if there is a bug in the code writing the on-disk state, it should be addressed with exact knowledge of that bug, in the code reading the on-disk state -- code that doesn't make assumptions about what might be halfway correct, but does the correct thing with the incorrect on-disk state.
Yes, I know you said both of those things, and I agree -- ZFS has inbuilt error detection and healing. Any code that can detect errors and heal them should be there. And if you have corruption, the only long-term, safe, ass-covering advice to give is to restore to a pre-corrupt state.
But the argument went:
Detractor: ZFS needs fsck.
You: No it doesn't.
Detractor: The ZFS creators' attitude has always been "we don't think it should exist," but there's no more reason than that. It still gets corrupted, so it still needs an fsck tool.
You: Here is a big blog post about why it doesn't: OK so it can get corrupted but I don't think an fsck tool should exist.
You know how useful it is to post on StackExchange "help I have this situation, I know conceptually there is a way out, but how can I actually do it?" and get the replies "you shouldn't want to do that"? It's not helpful at all.
When I can leverage the copy-on-write nature (which is really more redirect-on-write) of the filesystem to recover from a defunct on-disk state to a known state, a filesystem check is just a suboptimal solution. That is what I wanted to express with the article. Of course, most filesystems are not COW and can't use this, and thus the notion that a filesystem check is needed prevails. But in the end a filesystem check is just about forcing a filesystem into a state of metadata correctness, without caring much about the data. I wouldn't count that as a way out when there are better ways.
I think the situation is pretty similar to the "shoot the messenger" problem of ZFS. Some people are annoyed that ZFS reports errors due to corruption and blocks access to the data (in the absence of any redundancy, of course). However, the alternative would be reading incorrect data. Which is worse: knowing that you have to recover data, or processing incorrect data without knowing it?
Been running it since 0.6.0-rc14 on a Proliant Microserver with ECC RAM and I am happy with it. 4x2TB RAIDZ internal, and 2x1TB USB3 zpools with SSD for zil and logs. Shared over GigE using Samba4 and AFP.
Performance is decent enough with lz4 compression and dedup off. Dedup on takes more CPU, but nothing the 2.2GHz Turion can't handle. The main thing is that stability has improved a lot too.
If you want the utmost performance, maybe this isn't for you, but for NAS/backup/streaming type usage ZFS on Linux is nearly perfect.
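For anyone curious, a layout along those lines is roughly the following (device names are hypothetical; on Linux you'd normally use /dev/disk/by-id paths instead):

# four-disk raidz with an SSD split between the intent log and the read cache
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd \
    log /dev/sde1 cache /dev/sde2
zfs set compression=lz4 tank
zfs set dedup=off tank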
Maybe I'm just tired this morning but I wish this article could just get to the point. I feel like I'm reading a mystery novel but I am never going to make it to the end and find out who did it.
This article explains in-depth why ZFS doesn't need a fsck tool. It's very much to the point in its entirety. I'm sorry content-free articles posted here lately have diluted your expectations.
I agree this article is not content-free, but all these points could be made in far fewer and less convoluted words. And they buried the real news (at least to me): that the hardware is lying to the OS. Did not mean to offend.
It reminds me of how people used to think all filesystems needed to be explicitly defragmented because of design flaws in FAT, which was designed for floppies (and wasn't especially well-designed even at that).