The reason deduplication won't help much with large mail archives is that it operates at the block level. In order to be effective, you need to have two copies of the same data that are also divided along the same block boundaries (the second part is arguably the bigger problem).
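To make the boundary issue concrete, here's a quick Python sketch (the block size and data are arbitrary stand-ins, not what any real filesystem uses): hash the fixed-size blocks of two byte-identical payloads, give one of them a small prefix, and none of the block checksums line up anymore.

    import hashlib, os

    BLOCK = 4096  # arbitrary block size for illustration; real filesystems vary

    def block_hashes(data: bytes) -> set[str]:
        """Hash every fixed-size block - all a block-level dedup engine ever compares."""
        return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
                for i in range(0, len(data), BLOCK)}

    payload = os.urandom(64 * BLOCK)            # stand-in for 256 KB of mail data

    copy_a = payload                            # one copy, blocks start at offset 0
    copy_b = b"short inserted text" + payload   # same bytes, shifted by a small prefix

    print(len(block_hashes(copy_a) & block_hashes(copy_b)))  # 0: nothing lines up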
Also, many applications already try to avoid duplicate data to varying degrees (backup utilities, VCSes, etc.), which makes dedup even less beneficial.
If you are dealing with a lot of text, though (like a mail archive, or lots of source code), compression really shines. It has far less memory overhead than dedup, fairly low CPU overhead, and it can significantly increase effective disk throughput.
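As a very rough illustration (synthetic data, and the exact repetition here inflates the ratio well past what a real mailbox would see; plain mail text more typically lands somewhere around 2-4x):

    import zlib

    # Crude stand-in for mail-archive text: repetitive headers plus plain ASCII prose.
    # The exact repetition makes this compress far better than a real mbox would.
    mail_text = (b"Received: from mail.example.com by mx.example.org\n"
                 b"From: someone@example.com\nSubject: Re: dedup vs compression\n"
                 b"Body text goes here, mostly plain English prose.\n\n") * 5000

    compressed = zlib.compress(mail_text, 1)   # level 1: cheap and fast compression
    print(len(mail_text), "->", len(compressed),
          "bytes, ratio %.1fx" % (len(mail_text) / len(compressed)))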
You'll notice I was careful to say "under many common workloads" in my other post; that was because of all this. For most people, most of the time, dedup would bring no real benefit and a noticeable performance hit.
You're missing my point: it's not duplicate blocks in a single mail archive that I'm concerned about, it's duplicate blocks across multiple versions of the same mail archive.
New mails are typically appended to the end of the archive, but the rest of it remains more or less the same across versions.
Also, a well-written block-level deduplication algorithm will work regardless of boundary shifts. I designed and wrote the proprietary block-level algorithm used in Genie Timeline, and I can tell you that it saves corporations terabytes of data on PSTs alone. It doesn't rely on fixed block boundaries and is capable of detecting data that has slid or shifted.
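I obviously can't share our implementation, but the general idea behind shift-tolerant dedup is easy to sketch. Here's a generic content-defined chunking example in Python (not our algorithm; the rolling hash, mask size, and lack of min/max chunk limits are just illustrative choices). Because the cut points come from the bytes themselves, inserting data early in a file only disturbs the chunks right around the insertion:

    import hashlib, os, random

    # Generic content-defined chunking: a rolling hash over the trailing bytes picks
    # the cut points, so chunk boundaries move with the data instead of being fixed.
    random.seed(0)
    GEAR = [random.getrandbits(32) for _ in range(256)]   # per-byte random table
    MASK = (1 << 13) - 1                                  # ~8 KB average chunks (arbitrary)

    def chunks(data: bytes):
        h = start = 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
            if (h & MASK) == 0:           # cut point chosen by the content itself
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]            # whatever is left at the end

    def chunk_hashes(data: bytes) -> set[str]:
        return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

    original = os.urandom(1 << 20)                  # 1 MB of stand-in data
    shifted = b"a few inserted bytes" + original    # same data after an insertion

    a, b = chunk_hashes(original), chunk_hashes(shifted)
    print(f"{len(a & b)} of {len(a)} chunks unchanged despite the shift")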
I'm admittedly not an expert, but isn't that a very different scenario?
In other words, ZFS operates purely at the block level - it can't make any assumptions about the contents beyond "this block is the same as that block" or "this block is not the same as that block." It can't (based on my understanding) recognize that "this block is almost the same as that block" and shift things accordingly, because then it would be altering application data and potentially corrupting it.
If the application is making those shifts, on the other hand, it works just fine: the application can be smart enough to shuffle and unshuffle its own data in the most efficient way it knows how, but the filesystem doesn't have that luxury.
If I'm still missing something, please let me know. Anyway, it'd be pretty straightforward to test this properly, so maybe I'll do that.
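Here's roughly what I'd expect such a test to show, modeled in Python rather than on an actual pool (the block size and data are made up, and this only captures the block-alignment question, not ZFS's checksumming or dedup table):

    import hashlib, os

    BLOCK = 128 * 1024   # pretend recordsize (ZFS defaults to 128K); sizes are made up

    def block_hashes(data: bytes) -> set[str]:
        return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
                for i in range(0, len(data), BLOCK)}

    archive_v1 = os.urandom(200 * BLOCK)             # stand-in for yesterday's mbox
    appended = archive_v1 + os.urandom(5 * BLOCK)    # new mail tacked onto the end
    edited = archive_v1[:1000] + b"inserted bytes" + archive_v1[1000:]  # mid-file shift

    v1 = block_hashes(archive_v1)
    print("append-only   :", len(v1 & block_hashes(appended)), "of", len(v1), "blocks shared")
    print("mid-file shift:", len(v1 & block_hashes(edited)), "of", len(v1), "blocks shared")

If that matches what a real pool does, then an archive that only ever grows at the end should dedup fine across versions, which would cover your appended-mail case; anything that shifts data mid-file, though, changes every later block.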