Sometimes the corruption is in the metadata, in which case you are lucky and the duplication can help.
But neither method you mentioned lets you recover your data if the data itself is corrupted.
If you really want to recover data, use something with error-correction, like Parchive. RAR 5.0's recovery record is another solution, also based on Reed-Solomon error correction codes.
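To make the idea concrete, here is a minimal Python sketch of the principle behind recovery data, using plain XOR parity as a stand-in for Reed-Solomon (the function names and toy block sizes are made up for illustration). One parity block can rebuild exactly one lost block; Reed-Solomon, which Parchive and RAR 5.0 actually use, generalizes this so that N parity blocks can rebuild any N lost blocks. In practice you would just run a Parchive tool over your archive rather than roll your own.

```python
def make_parity(blocks):
    """XOR equal-sized blocks together into a single parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def rebuild_missing(blocks, parity):
    """Reconstruct the one block marked as None using the parity block."""
    missing = blocks.index(None)
    rebuilt = bytearray(parity)
    for j, block in enumerate(blocks):
        if j != missing:
            for i, b in enumerate(block):
                rebuilt[i] ^= b
    blocks[missing] = bytes(rebuilt)
    return blocks

chunks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = make_parity(chunks)

damaged = [b"AAAA", b"BBBB", None, b"DDDD"]   # third chunk was lost
print(rebuild_missing(damaged, parity))       # third chunk comes back as b"CCCC"
```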
Archive tools usually don't use those methods because performance and archive size are the prime reasons people use an archive format, not robustness to errors. That's in large part because storage and network protocols are so reliable that such errors rarely occur, and they are reliable precisely because they themselves use error-correction codes.
> But neither method you mentioned lets you recover your data if the data itself is corrupted.
My experience is different: if I've made a ZIP archive of 1000 roughly similar files, even if it's a few GB big, and I'm left with only the first third of the archive, I can easily recover roughly the first third of the files; with half of the archive, roughly half of the files. That's exactly because common ZIP implementations recognize that the directory at the end is missing but still let me extract everything they can read, using the local header in front of every file, even with the central directory completely gone.
If the corruption affects file 3 of the 1000 files, I can extract both the first two and all of files 4 through 1000. If the corruption is in the first 30% of a tar.gz, most probably I won't be able to extract anything but files 1 and 2; the remaining 998 are then lost.
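For the curious, here's roughly what that salvage pass looks like in Python: scan for local file header signatures and keep whatever still checks out against its CRC. This is a simplified sketch, not a real repair tool; it assumes the sizes are present in the local header (no data descriptors) and only handles stored and deflated entries.

```python
import struct, zlib

LOCAL_SIG = b"PK\x03\x04"

def salvage(path):
    """Best-effort extraction from a truncated/corrupted ZIP by scanning
    for local file headers instead of trusting the central directory."""
    data = open(path, "rb").read()
    pos = 0
    recovered = {}
    while True:
        pos = data.find(LOCAL_SIG, pos)
        if pos < 0 or pos + 30 > len(data):
            break
        # Local file header: signature, version, flags, method, time, date,
        # crc32, compressed size, uncompressed size, name len, extra len.
        (_, _, flags, method, _, _, crc, csize, usize,
         nlen, xlen) = struct.unpack("<4s5HIII2H", data[pos:pos + 30])
        name = data[pos + 30:pos + 30 + nlen].decode("utf-8", "replace")
        start = pos + 30 + nlen + xlen
        blob = data[start:start + csize]
        try:
            if method == 0:            # stored
                raw = blob
            elif method == 8:          # deflate (raw stream, hence -15)
                raw = zlib.decompress(blob, -15)
            else:
                raise ValueError("unsupported method %d" % method)
            if zlib.crc32(raw) == crc:
                recovered[name] = raw
        except Exception:
            pass                       # this entry is damaged, keep scanning
        pos = start + csize if csize else pos + 4
    return recovered
```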
And listing the files inside a non-corrupted ZIP archive, no matter how big, is effectively instantaneous, thanks to that same central directory. Try listing the files inside a tar.gz that is a few GB big, then compare. ZIP is a very convenient format for fast access to the list of files and to a single file in the archive, and it's still quite usable when part of the archive is missing.
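You can see the asymmetry directly with the standard library: `zipfile` only needs the central directory at the end of the file, while listing a tar.gz forces a decompression pass over the whole stream. A rough sketch (the archive names are hypothetical):

```python
import time, zipfile, tarfile

def list_zip(path):
    # Reads only the central directory at the end of the file.
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()

def list_targz(path):
    # Has to decompress the gzip stream and walk every tar header.
    with tarfile.open(path, "r:gz") as tf:
        return tf.getnames()

for fn in ("big.zip", "big.tar.gz"):   # hypothetical multi-GB archives
    t0 = time.perf_counter()
    names = list_zip(fn) if fn.endswith(".zip") else list_targz(fn)
    print(fn, len(names), "entries in", round(time.perf_counter() - t0, 3), "s")
```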
Edit, in response to your response: I never mentioned Reed-Solomon anywhere in this discussion. I defended what I consider a good design decision of the ZIP format, having both the central directory and the metadata in front of every compressed file, and I gave some real-life examples of corrupted files I've encountered where ZIP gave me the best recovery ratio. I've never experienced the kind of "data mismatch" you refer to. You're welcome to describe it in more detail; I'd like to read how exactly you ran into it. If the answer is that you yourself wrote a program that ignored the full ZIP format, or used a badly written library, then that is the real problem. Once again: "Almost all programming can be viewed as an exercise in caching." —Terje Mathisen
You asked about corruption, specifically when "a small part of the whole archive file gets corrupted". I addressed that point. I'm not going to cover all of the advantages and disadvantages between the two systems, because they are well-known and boring.
You of course must choose the method that matches your expected corruption modes. An incomplete write is a different failure mode than the one you asked me to address, and Reed-Solomon isn't designed for it. Other methods are, and they are often used in databases.
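The usual database trick for torn writes is to frame each record with its length and a checksum, as in a write-ahead log, so recovery stops at the first record that doesn't verify. A minimal sketch of the idea, with made-up framing and file names:

```python
import struct, zlib

def append_record(f, payload):
    # Frame: 4-byte length + 4-byte CRC32, followed by the payload itself.
    f.write(struct.pack("<II", len(payload), zlib.crc32(payload)) + payload)

def read_records(f):
    """Yield complete, verified records; stop at the first torn/corrupt one."""
    while True:
        header = f.read(8)
        if len(header) < 8:
            return                      # clean EOF or truncated header
        length, crc = struct.unpack("<II", header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) != crc:
            return                      # torn write: discard the tail
        yield payload

with open("journal.bin", "wb") as f:    # hypothetical log file
    for msg in (b"first", b"second", b"third"):
        append_record(f, msg)

with open("journal.bin", "rb") as f:
    print(list(read_records(f)))        # [b'first', b'second', b'third']
```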
And you of course must choose a method which fits your access model.
Just like you would of course choose tar over zip if you want to preserve all of the filesystem metadata on a Unix-based machine. ZIP doesn't handle hard links, for example.
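A quick way to see the hard-link difference with the standard library (file names are made up; as far as I know `tarfile` detects the second name through its inode cache, while `zipfile` simply stores a second copy):

```python
import os, tarfile, zipfile

# Two names for the same file contents via a hard link.
with open("data.txt", "w") as f:
    f.write("x" * 1_000_000)
os.link("data.txt", "alias.txt")

with tarfile.open("demo.tar", "w") as tf:
    tf.add("data.txt")
    tf.add("alias.txt")       # stored as a hard-link entry, not a second copy
    print([(m.name, m.islnk()) for m in tf.getmembers()])

with zipfile.ZipFile("demo.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("data.txt")
    zf.write("alias.txt")     # ZIP has no link entry type: a full second copy
```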
My point, again, is that data duplication causes well-known problems when there is a data mismatch, and while the duplication in ZIP, a consequence of its append-only/journalling format, helps with some forms of error recovery, there are better methods for error recovery if that is your primary concern.
What you call the "data duplication" is actually a "metadata redundancy" which results in the faster access to the list of all the files in the archive (a good example of the pre-computed pre-cached info). The format is good, if some library can't cope with the format, the problem is with the library making the corruption of the archive. And that archive is still, even after being corrupted by wrongly written tool, recoverable because of the existing redundancy.
The concept of metadata redundancy isn't specific to archives; filesystems do it too. It's always an engineering trade-off, and I consider the one in the ZIP format a good one, based on the real-life examples I've given.
I agree that it's a good format. It's an engineering compromise. The benefits you see also cause creshal's comment:
> Beginning and the end, which continues to bite us in the ass until today, where we regularly stumble over bugs when one part of a toolchain uses one entry and the rest the other.
Do you think creshal is mistaken?
Of course they are due to bugs in the toolchain. That's a natural consequence of the file format. That's the whole point of creshal's comment.
What do you think I'm trying to say? I don't understand your responses given what I thought I was trying to say.