
The whole fsck discussion seems baffling to me.

While I'm no ZFS expert, I've been using it for several years now, and my understanding is this: take what a normal fsck-type tool does and build those features into the underlying filesystem and its supporting toolchain. Given what ZFS does and how it works, it really doesn't make sense to me for it to have an "fsck," whatever that would even mean. It's hard to imagine what an "fsck" for ZFS would do. You'd just end up rewriting bits of the toolchain or asking for the impossible.

I asked this in the other thread, but I'll ask here again. Semantics aside, what specifically do people want fsck to do that ZFS doesn't already provide a method for? Seriously, the question seems to me akin to asking why manufacturers don't publish an RPM spec for SSDs. It's a really odd thing to ask, and it can't be answered without an exhaustive review of the mechanics of the system.

I can't help but get the feeling that a lot of people complaining about ZFS have very little knowledge of or familiarity with it and/or BSD/Unix in general. ZFS is not like any Linux FS: it doesn't use fstab, the toolchain is totally different, and the FS is fundamentally different. It was built for Solaris and really reflects that ideology, which is completely foreign to people who only have familiarity with Linux. Accept it and move on, or don't, but I've yet to see any evidence to back up these claims other than "this is what is done in Linux for everything else," which is just FUD.




Yes, emphatically agreed. I co-founded Fishworks, the group within Sun that shipped a ZFS-based appliance. The product was (and, I add with mixed emotion, still is) very commercially successful, and we shipped hundreds of thousands of spindles running ZFS in production, enterprise environments. And today I'm at Joyent, where we run ZFS in production on tens of thousands of spindles and support software customers running many tens of thousands more. Across all of that experience -- which has (naturally) included plenty of pain -- I have never needed or wanted anything resembling a traditional "fsck" for ZFS. Those who decry ZFS's lack of an fsck simply don't understand (1) the semantics of ZFS and specifically of pool import, (2) the ability to roll back transactions and/or (3) the presence of zdb (which we've taken pains to document in illumos[1], the repository of record for ZFS). So please, take it from someone with a decade of production experience with ZFS: it does not need fsck.

[1] http://illumos.org/man/1m/zdb
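To make those three points concrete, here is a rough sketch of the commands involved (the pool name "tank" is a placeholder; the flags are the standard ones documented in the zpool and zdb man pages):

    # (1) pool import semantics: ZFS decides at import time whether
    #     the pool is consistent enough to use
    zpool import -f tank

    # (2) transaction rollback: rewind to an earlier transaction group
    #     if the most recent ones are damaged
    zpool import -Fn tank    # dry run: show what a rewind would discard
    zpool import -F tank     # discard the last few transactions and import

    # (3) zdb: inspect on-disk state read-only
    zdb -C tank              # dump the pool configuration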


Hi. rsync.net here. I wonder if you concur with us that ZFS does need a defrag utility?

I know you can (and we have) use the copy-back-and-forth method to rebalance how data is organized across multiple vdevs inside one pool (sketched below), but it would be nice if that were not a manual process, or if it could be an option on a scrub.

Or something.

People can and do run their pools up higher than 80% utilization. It happens. It's happened to you. There should be a non-surgical way to regain balanced vdevs after such a state...
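To spell out the copy-back-and-forth being described, a rough sketch of the manual workflow (the dataset names are made up, and it assumes enough free space to hold a second copy during the shuffle):

    # snapshot the dataset whose blocks you want rewritten
    zfs snapshot tank/data@rebalance

    # copy it within the pool; the newly written blocks get spread
    # across the vdevs as they exist today
    zfs send tank/data@rebalance | zfs receive tank/data.new

    # swap in the rewritten copy and destroy the fragmented original
    zfs rename tank/data tank/data.old
    zfs rename tank/data.new tank/data
    zfs destroy -r tank/data.old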


Ah, yes -- now we're talking about a meaningful way in which ZFS can be improved! (But then, you know that.) Metaslab fragmentation is a very real issue -- and (as you point out) when pools are driven up in terms of utilization, that fragmentation can become acute. (Uday Vallamsetty at Delphix has an excellent blog entry that explores this visually and quantitatively.[1]) In terms of fixing it: ZFS co-inventor Matt Ahrens did extensive prototyping work on block pointer rewrite, but the results were mixed[2] -- and it was a casualty of the Oracle acquisition regardless. I don't know if the answer is rewriting blocks or behaving better when the system has become fragmented (or both), but this is surely a domain in which ZFS can be improved. I would encourage anyone interested in taking a swing at this problem to engage with Uday and others in the community -- and to attend the next ZFS Day[3].

[1] http://blog.delphix.com/uday/2013/02/19/78/

[2] http://permalink.gmane.org/gmane.os.illumos.devel/5203

[3] http://zfsday.com/zfsday/


We'll see what we can do. We have some ZFS firefighting that dominates our to-do list currently[1][2], but if we can work through that in short order, we will dedicate some funds and resources to the FreeBSD side of ZFS and getting "defrag" in place.

I meant to attend ZFS Day and will try to come in 2013 if it is held.

[1] Space accounting. How much uncompressed space, minus ZFS metadata, does that ZFS filesystem actually take up? Nobody knows.

[2] extattrs + busy ZFS == crash


Hey Bryan, I loved your Fork Yeah! talk at USENIX. Everyone who's interested in the history and future of Solaris and ZFS should watch it: https://www.youtube.com/watch?v=-zRN7XLCRhc

I know it's a bit off topic but could you elaborate on the issues you see with BTRFS?


From an operations perspective, ZFS is a god-send. Correct me if I'm wrong, but AFAIK BTRFS offers no significant quantum of improvement over ZFS; it's merely a "better licensed," not-invented-here, mostly duplicated effort with a couple of refinements, and it isn't nearly as well thought out from the experience of ops folk. What ZFS gives you:

- Online scrubbing rather than fsck, so there is little/no downtime if the filesystem is interrupted, e.g., by a datacenter power failure. An fsck/RAID rebuild of a large filesystem can mean a lengthy outage for users.
- "Always" consistent: start writing the data of a transaction to unallocated space (or the ZIL) and update metadata last.
- Greatly configurable block device layer:
  * RAIDZ, RAIDZ2, RAIDZ3, mirror, concat ...
  * The ZIL (fs journal) and L2ARC (cache) can be placed on different media or even combinations of media.
- Send & receive snapshots across the network (see the sketch below).
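That last point, as a minimal sketch (the host and dataset names here are made up):

    # take a snapshot and replicate it to another machine over ssh
    zfs snapshot tank/home@monday
    zfs send tank/home@monday | ssh backuphost zfs receive backup/home

    # later, send only the blocks changed since the previous snapshot
    zfs snapshot tank/home@tuesday
    zfs send -i tank/home@monday tank/home@tuesday | ssh backuphost zfs receive backup/home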


I totally agree.

From what I understand, ZFS will never be supported on RHEL (and that's mostly what I work with), so I'm hoping for the best with BTRFS.


Reading about ZFS on Wikipedia and here [1] I see that there is a procedure called scrubbing. From [1]: "ZFS provides a mechanism to perform routine checking of all inconsistencies. This functionality, known as scrubbing, is commonly used in memory and other systems as a method of detecting and preventing errors before they result in hardware or software failure."

So I guess "alias fsck='zpool scrub tank'" would quell everyone's concerns.

[1] http://docs.oracle.com/cd/E19082-01/817-2271/gbbwa/
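For the curious, that alias boils down to a couple of commands you can run on a live pool (the pool name "tank" is just the docs' example):

    # walk every allocated block, verify its checksum, and repair it
    # from mirror/raidz redundancy where possible
    zpool scrub tank

    # watch progress and see any errors that were found or repaired
    zpool status -v tank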


Scrubbing is more like: read the block and calculate its checksum. Is it the same as the checksum stored for the block (not at the block, an important difference)? Nice. Not the same checksum? Go to the mirror and fetch the correct copy, or use the parity algorithm to reconstruct the missing block from the raidz1/2/3 set, and write it back to the disk with the error.


I see. From the description it's quite obvious that ZFS handles most of the fault cases automatically. Still, admins might want to force it to recheck all the blocks, and this command achieves exactly that. Frankly, I don't know what exactly fsck does, and I guess it's different for every FS. For ZFS, scrubbing seems to be an adequate alternative.

Hence, if anyone asks you whether ZFS has fsck, it's easier to say yes, it does, and it's called scrubbing.


For most people, fsck and filesystems that rely on fsck are a known quantity. They understand that a filesystem does what it can to keep itself consistent, but that sometimes outside assistance is necessary in the form of fsck. When you show them ZFS and say "Oh, it doesn't need a fsck program", they assume that you're bullshitting them to cover up for the fact that it doesn't have a fsck program yet.

As far as the Linux/Unix thing, I think you're reading way too far into it. Linux neither invented nor popularized filesystem consistency checking programs. Unix filesystems (such as UFS) often have consistency checking programs as well, though they might not be called "fsck". Windows has Scandisk. ZFS is the odd one out here, and it's not surprising for people to treat it as such. Give it time, and if ZFS's approach becomes more widespread, people will come around.

tl;dr: It's about trust, not ignorance.


But ZFS has the ability to scrub pools, it can roll back transactions, and it can import degraded arrays, among many other things. Again, people seem to be failing to understand that the things fsck does are there either in the FS itself or in the tools. I stand by the argument that this whole debate is due to fear and a lack of understanding (aka ignorance).

I brought up UFS because people who only use Linux often complain similarly about the way BSD does "partitioning." The two debates seem very similar to me.


I've had to deal with one scenario where ZFS fsck would have been very useful, though it comes up rarely: corruption leading to valid checksums, but bad ZFS data.

In my case[1], our setup died due to some power problems, and somehow NULL pointers got written to disk with valid checksums. Normally this wouldn't happen, but when it does, it's a PITA to debug, because trying to traverse/read the disk gives a kernel fault instead of the segfault you might see in a user-level fsck program. This was a real pain, as we had encrypted disks, and every reboot meant going through the disk-attach steps (enter password, etc.) all over again.

As a result, a userspace implementation of scrubbing would be useful since in this sort of rare instance, I'd be able to probe the fsck process with a good debugger and see why it's crashing. Since it's in userspace, the fsck program can also quit more sanely, with a full report on where it found the corruption. I was able to get my data back via some ad-hoc patches, but it was an... interesting experience having to debug in kernel vs in userspace.

zdb isn't a substitute for most of these things, as the on-disk compression and RAIDZ sharding make it difficult to actually see the raw data structures. Max Bruning wrote a post a while back with an on-disk data walkthrough[2], where he wrote some patches to fix this, but they haven't made their way upstream yet. Additionally, FreeBSD and Linux don't have mdb. :(

[1] http://lists.freebsd.org/pipermail/freebsd-fs/2012-November/...

[2] http://mbruning.blogspot.com/2009/12/zfs-raidz-data-walk.htm...

edit: formatting


I wrote that code several years ago. It is not quite the right way to go.

zdb now does decompression, though slightly differently from what I implemented.

Syntax is: zdb -R poolname vdev:offset:size:d

The "d" at the end says to decompress. zdb tries different decompression algorithms until it finds one that is correct.

As for my mdb changes, I really think mdb should be able to pick up kernel ctf info so that it can print data structures on disk. That I could probably get working on illumos fairly easily. My method used zdb to get the data uncompressed, then used mdb to print it out ala ::print. I actually think something like "offset::zprint [decompression] type" in mdb is the way to go. It would mean no need for zdb, which usually gives too much or not enough, and is not interactive (hence, not really a good debugger as far as I'm concerned). Better would be:

# mdb -z poolname

20000::walk uberblock | ::print -t uberblock_t

And from there, something like:

offset::zprint lzjb objset_phys_t

where offset comes from a DVA displayed in the uberblock_t.

Some people seem to get my idea and think it's good. Others either don't get it, or don't care. Someone like Delphix might really like it.

Just my 2 cents. max



