Zip – How not to design a file format (2021) (greggman.com)
68 points by sph 11 months ago | 75 comments



I was sitting 10 feet from PK when a lot of these decisions were made, but you have to realize what a different time it was, and how much becomes obvious only in retrospect.

There are more than a few misunderstandings and errors in the article, but mostly it's a valid 20/20 hindsight perspective. We knew it at the time too; when a couple of us formed a different company later, it was with the express idea that we knew all the things that sucked about .zip files and would love a do-over. We never got around to it though; our thinking at the time was to get established with other products first and then come out with a new format. Our initial stuff ended up being pretty successful and we exited to McAfee and didn't look back. Different times; maybe we did the world a disservice by not tackling archiving first :-)

One note about putting "PK" on the file format. In the dev culture Phil learned his craft in, there was no source control or change tracking. Files were exchanged via 'sneakernet'. This was true even at PKWare in the 89-91 timeframe. Change tracking was done via file renaming, and if you added a new variable you put your initials on it, so if someone had a question they'd know who touched it. I used usenet at home but not at work.

Our communication to the outside world was an 8-line BBS in the next room and the EXECPC BBS that one of us would dial into at least once a day to answer questions on.

The multi-part/disk-spanning stuff wasn't added until pretty late in the game, so it was kind of hacky. The format overall is what you get when something takes off and people start asking to use it in ways that weren't anticipated on day one, but you want to maintain backwards compatibility. For sure, time was spent answering questions from devs who were implementing their own versions, especially Info-ZIP, which was IIRC the most important-seeming effort at the time, and a guy who worked in the office a few nights a week on a port to a mainframe environment which escapes me at the moment.

It was pretty heady to be able to walk up to practically any computer on the show floor at Comdex and type 'pkzip' and get a response, but even then the idea of just how ubiquitous it would become was impossible to anticipate.


The article's main point seems to be complaining about ambiguity.

> Scan from the back, find the end-of-central-directory-record and then use it to read through the central directory, only looking at things the central directory references.

It was pretty obvious 30 years ago that this was the correct way to unzip. Streaming content wasn't even a thing back then (dialup being a luxury). It was the only way to ensure that a self-extractor could work reliably.
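(For the curious, a minimal sketch of that back-scan in Python, with a hypothetical file name; the EOCD record starts with the signature `PK\x05\x06` and, because the trailing comment can be up to 65535 bytes, it has to be searched for within roughly the last 65557 bytes of the file.)

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory signature (0x06054b50)
EOCD_FIXED = 22            # fixed-size part of the EOCD record
MAX_COMMENT = 0xFFFF       # trailing archive comment can be up to 65535 bytes

def find_central_directory(path):
    """Return (total_entries, cd_size, cd_offset) by scanning from the back."""
    with open(path, "rb") as f:
        f.seek(0, 2)                       # seek to end of file
        file_size = f.tell()
        tail_len = min(file_size, EOCD_FIXED + MAX_COMMENT)
        f.seek(file_size - tail_len)
        tail = f.read(tail_len)

    pos = tail.rfind(EOCD_SIG)             # last occurrence wins
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")

    # Fixed EOCD fields, little-endian: disk numbers, entry counts,
    # central directory size/offset, comment length.
    (_sig, _disk, _cd_disk, _entries_here, total_entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<IHHHHIIH", tail, pos)
    # Note: a comment that happens to contain the signature bytes can fool this
    # scan -- which is exactly the ambiguity the article complains about.
    return total_entries, cd_size, cd_offset

# Usage (hypothetical file):
# print(find_central_directory("example.zip"))
```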


The problem is not whether the reading method is obvious once you consider that era; it's that neither the reading nor the writing method is exactly specified, so there can be considerable variety among implementations.

For example the OP complains about a passing mention (4.3.1):

> Files MAY be added or replaced within a ZIP file, or deleted.

While the author considers this to imply (but not acknowledge) that local file headers and central directory headers may disagree with each other, I'm not even sure, because the specification never clearly confirms whether random data can appear before any local file header. The only thing that has been acknowledged is SFX (4.1.9), and it is not even specified where the "extraction code" can be embedded. Can it be placed between two files? If such a thing is indeed possible, shouldn't there be a provision to avoid ambiguity due to the stray data? Say, it could have said that deleting a file in place should mask all occurrences of the local file header signature 0x04034b50 so that the vacant space can no longer be interpreted as a file at all.


Exactly, some of the quirks are just a result of designing this for those old systems. ZIP needed to work acceptably on machines with less than a megabyte of RAM. The file format contains ambiguity and redundancy, yes, but given well-behaved software it's the "good enough" of archive formats. Even in the *nix world we didn't get anything better; tar isn't even seekable and has its own idiosyncrasies, especially if you compress it, because the compression happens afterwards and is applied to the whole archive.


Tar IS seekable, as it was designed especially for tapes.

The modern usage of it adds compression, which you have to undo first before doing your archive operations ;)

Zip was > tar (and cpio) in that regard because you didn't have that two-step process and only had to have enough spare disk space for the extracted file, rather than the extracted file + the decompressed tar.


Tar is seekable in the literal sense: you have to seek around until you find the file you want. You have to look at each header; if it's not the file you want, you skip over that one file and read the next header, because you only know how long that one file is. So you cannot even know where the third file is without having processed the headers of the previous two. It's "read, skip, read, skip, read, skip" until you maybe find the file you wanted. ZIP otoh has the central directory index where you can look things up much faster.
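A minimal sketch of that read/skip loop over an uncompressed tar (hypothetical file name; this assumes plain ustar-style 512-byte headers and ignores GNU extensions):

```python
BLOCK = 512  # tar headers and data padding come in 512-byte blocks

def list_tar(path):
    """Walk an uncompressed tar: read a header, skip the data, repeat."""
    entries = []
    with open(path, "rb") as f:
        while True:
            header = f.read(BLOCK)
            if len(header) < BLOCK or header == b"\0" * BLOCK:
                break                                   # end-of-archive marker
            name = header[0:100].rstrip(b"\0").decode("utf-8", "replace")
            size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)  # octal size field
            entries.append((name, size))
            # Skip this entry's data, rounded up to the next 512-byte block.
            f.seek((size + BLOCK - 1) // BLOCK * BLOCK, 1)
    return entries

# Usage (hypothetical file):
# print(list_tar("example.tar"))
```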


Zip still > tar in that regard, since you can list archive contents (and extract individual files) without having to go through the entire thing every time, hence zip being suitable as a local file system (à la opendocument or ooxml).


Zip is terrible as a local file system, unless you consider read-only use (random access, and not many files). Writing to or appending to a file requires some form of garbage organization/compaction, which requires a rewrite of the entire CEN. Unlike the regular files/entries, the CEN doesn't have a checksum, so corruption there is pretty bad.

Each file being compressed individually means low compression ratios; the deflate overhead might be acceptable for slow media (although in that case the memory overhead would be non-trivial).


The massive downsize of zip is that each file is compressed individually, which means very poor compression. Overall the compression ratio is significantly lower, and the same goes for decompressing it all. It's only useful for random reads of a few select files.

Tar -> read, skip, read, until you find what you need; compression doesn't change much, as most stream compression algorithms have a flush method.


> The massive downsize...

I expect you meant "downside", because when talking about file compression, a "massive downsize" sounds like a good thing :-)


indeed, I suck at typing


It's also the only way such a file can be editable, which was a major design goal of the zip format and which is not really possible with tar files.


Not sure what you're saying here. Tar supports appending new files and new versions of files to an existing archive without rewriting the entire archive. I don't believe it supports marking files as deleted (not sure why, tbh), but you can add an empty file with the same name for non-tape storage to prevent extracting the original version. It does this by simply appending the new version of the file to the end of the archive, which is understandable, as unlike zip files, tar (tape archive) was designed explicitly for tape, where there is no random access. Unless you're referring to editing files in place inside a tar file, which should actually be easier than doing so with a zip file (as long as the new version isn't longer than the old version, and you're not actually storing it on tape).


So there are three options: index at the beginning, index at the end, and no index.

Index at the beginning means it’s easy to look up, but impossible to add anything.

No index, which is tar, means it’s impossible to tell what’s in it without reading the entire file.

Index at the end, which is zip, means you have to do slightly more work to find the index, but then you can read it without reading the rest of the file, and move it further along the file if you want to extend the file and you can edit it or add to it.

And if you want compression, tar is just a compressed blob and the only thing you can do is read it from beginning to end. That's just not very helpful unless the only things you want to do are package a bunch of things and unpack all of them. A major use case, but one met perfectly fine by hundreds of formats.
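To make the index-at-the-end point concrete, a small sketch with a hypothetical example.zip: as far as I know this is how Python's zipfile append mode behaves, writing the new member's local header and data where the old central directory began, then writing a fresh central directory and end record after them on close.

```python
import zipfile

# Append a new member without rewriting the existing ones; only the trailing
# central directory and end-of-central-directory record get rewritten.
with zipfile.ZipFile("example.zip", mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes/added-later.txt", "appended without touching earlier members\n")

# The updated index is still readable the normal way.
with zipfile.ZipFile("example.zip") as zf:
    print(zf.namelist())
```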


Well, depending on the definition you could count ZMODEM as a streaming + compression (RLE) protocol. I used it a lot almost 30 years ago!


As I mentioned, dialup for many was a luxury.


Streaming was a thing if you include tape


Cassettes were mostly used on 8 bit machines ;)

Tape drives were relatively expensive compared to floppy disks.


Tape drives were used on a lot of early computers, including the hardware that ran early Unices (e.g. 16-bit PDP-11) and which tar/gzip were written for

https://en.wikipedia.org/wiki/IBM_7-track

https://en.wikipedia.org/wiki/9-track_tape

https://en.wikipedia.org/wiki/DECtape

"tape" includes reel-to-reel systems as well as cassette.


And exactly none of them were used on contemporary machines that PKZIP ran on.


Tape drives for backup were used at the time, and are still used. Though I can't blame PKZIP for not caring about that use case

https://en.wikipedia.org/wiki/Digital_Linear_Tape

https://en.wikipedia.org/wiki/Data8

https://en.wikipedia.org/wiki/Linear_Tape-Open


Huh. I wasn't aware that cassette tapes were used on the systems that PKZip ran on either.

Pretty sure DOS didn't have tape drive support ordinarily, and I wasn't aware that PKZip was widely ported to anything that wasn't DOS or a clone/derivative?


PK(A)Zip was available on the Amiga in 1990

https://groups.google.com/g/comp.sys.amiga/c/_XvqYUvBwvc/m/c...

Unix support came later, in the form of infozip in August 1992


I mean they were, but maybe sometimes for backups of hard drives? Certainly not as a common medium to exchange software.


https://en.wikipedia.org/wiki/IBM_cassette_tape

Sorry, but I guess you had to be there to have that knowledge?


Cool, TIL.

I was kind of there, but I don't feel like this is something I should necessarily have been aware of:

> apart from one diagnostic tape available from IBM, there seems never to have been any software sold on tape, and the interface was not included on the followup PC XT.

Yeah, I don't think I knew anyone with a first-gen IBM PC, rather than an XT or XT clone. Even so, for something as niche as the personal computer was pre-1983, this seems like an extra-niche bit of lore. I wouldn't be surprised if a good proportion of people who were properly there didn't have that knowledge.

But, sure, be a douche about it if it makes you feel more superior. Whatever.

> An IBM PC with just an external cassette recorder for storage could only use the built-in ROM BASIC as its operating system, which supported cassette operations. IBM PC DOS had no support for cassette tape,

So, are you saying that PKZip supported running on the pre-XT ROM BASIC OS, as well as PC-DOS/MS-DOS?


> But, sure, be a douche about it if it makes you feel more superior. Whatever.

No, but don't try to assert conjecture as facts. "Whatever"

> So, are you saying that PKZip supported running on the pre-XT ROM BASIC OS, as well as PC-DOS/MS-DOS?

Now you're being a douche, no? I said that the hardware had cassette tape access.

The IBM PC and PCjr both had the hardware to access tape drives. And yes, you could access the cassette tape routines, INT 15h, functions 0/1/2/3 ( https://stanislavs.org/helppc/int_15.html ), through DOS, though as you allude to, there was little need to do so.

However BASICA could load programs from tape. https://www.youtube.com/watch?v=KTmzUBb924A

There, something else you can learn today. I'll leave it to you to discover exactly what.


> No, but don't try to assert conjecture as facts.

I hedged with "I wasn't aware", "pretty sure", "ordinarily", and ended with a question mark. I don't know how I could have been more guarded in expressing my recollection as non-authoritative. OK, maybe "pretty sure" was a bit on the strong side. sigh Sorry for that.

> I said that the hardware had cassette tape access.

I thought your point was that IBM cassette drives, unlike the tape systems I mentioned, "were used on contemporary machines that PKZIP ran on."

If your intended meaning was something other than "original IBM PCs running ROM BASIC OS so they could access an IBM Cassette tape was a supported platform by PKZIP", then I apologise for not being smart enough to make whatever leaps of logic were required to understand what your actual point was.


And the winner of the mental gymnastics award goes to ...


I'm trying to parse your comment.

Are you being pedantic by suggesting cassettes aren't tapes? Colloquially, in the UK at least, cassettes were called tapes.

Plus both tapes and cassettes are streaming formats.

And other tape formats came along later for the PC market anyway.


Self-extractors are too dangerous to run, except perhaps in a sandbox.


Apart from 100% of windows installers, and curl | bash installers ? ;-)


Remember we're talking about late 80s/early 90s. Self extractors were how you distributed binaries on floppies to people that might not have had PKZIP and couldn't just go and download it.


Probably "how not to specify a file format" is a better title? Lots of issues in the APPNOTE are already well-known, Info-Zip had its own version of an unofficially corrected specification [1] [2] for example. Even the international standard (ISO/IEC 21320-1:2015) is not helpful, as it is specified by subsetting the APPNOTE, not correcting it.

[1] https://libzip.org/specifications/appnote_iz.txt

[2] https://entropymine.wordpress.com/2019/08/22/survey-of-zip-a...


Any updated version of this revised spec? Because the official APPNOTE.TXT [1] was last updated in November 2022, while your first link was last updated in April 2004.

1: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT


Maybe I missed it in the article, but

> the central directory might not reference all the files in the zip file because otherwise this statement about files being added, replaced, or delete has no point to be in the spec

I think the author missed one optimization that I have observed from various archiving utilities. Suppose you want to compress deeply/nested/folders/with/one-file into an archive. Many archivers will not bother creating dedicated entries for deeply, nested, folders, etc. They will simply have a single entry with that path, and it is assumed that the intermediate folders exist. This threw me off at $DAYJOB when I needed to generate a flat list of "valid paths" in a ZIP archive, "valid" in the sense that the path exists after decompressing the archive.

Note: I don't think this is good design of a file format, and although I've never been able to find documentation explicitly mentioning this behavior, I do believe this is the primary reason the central directory does not have to contain entries for each file/directory in an archive.
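(A hedged sketch of one way to recover those implied parents, with hypothetical names: treat every prefix of each entry's path as a directory that will exist after extraction, whether or not the central directory carries a dedicated entry for it.)

```python
import posixpath
import zipfile

def valid_paths(zip_path):
    """Every path that exists after extraction, including implied parent
    directories that have no dedicated entry in the central directory."""
    paths = set()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            paths.add(name)
            parent = posixpath.dirname(name.rstrip("/"))
            while parent:                      # walk up and add each ancestor
                paths.add(parent + "/")
                parent = posixpath.dirname(parent)
    return sorted(paths)

# Usage (hypothetical archive):
# print(valid_paths("deeply-nested.zip"))
```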


Don't worry, you aren't the only one. Python craps its pants when this happens too.

You see, Python packages (i.e. wheels) are Zip files, and can be installed w/o unpacking them. There are various tools in the standard library, e.g. `importlib.resources` (which superseded `pkg_resources`), which can be used to list all the stuff that came with the package as if it were sitting in directories. But in this case they fail...


My baptism by fire was when I had to write a Protobuf parser and generator. It's the exact same story: the format author didn't think it through, didn't realize the bad consequences that would follow, and made vague or unrealistic claims in the documentation (or forgot to mention things in the documentation).

Since then, I've seen a lot of this stuff... I cannot think of a format suitable for archives / data transfer / configuration that was well designed. And by this I mean that the claims made by the author of the format are shown to be true in a formal way.

It feels very sad that despite countless failures, which all result from the lack of a tool that can check a specification and tell you if it makes sense, nobody really works on that. It seems obvious that you'd want a tool that tells you how well you are doing, and programmers are very dedicated to tools that do formal verification (such as compile-time type checkers), but when it comes to something with farther-reaching consequences it's a "whatever".


A way to deal with the problem is to make a reference implementation instead of a specification document. A document can also be written for convenience, but if code and documentation disagree, short of an obvious bug, the code is right and the docs need to change to reflect that.

And it is actually a common practice.

This way, nothing is vague or missing. If you wonder what is going to happen if a certain field has a certain value, run the code and see what happens; it may not be great, but you know exactly what to expect. With code, you also have access to all the developer tools: compilers, linters, test frameworks, debuggers, coverage analyzers, fuzzers... and even provers. Ideally the reference implementation should be open source with a permissive license, be written with readability in mind, and come with a good test suite.

The problem with zip as described in the article is that the specs are half-assed and the real test for a correct zip file is that it should work with pkzip. But we can't really use pkzip as a reference either because it is proprietary.


A way to deal with the problem?

I think this is how the problem was created in the first place: the authors wrote the reference implementation, but then didn't test it enough, didn't think of all possible ways things can go wrong.

Also, I'm not saying that this necessarily results in bad formats. E.g. I don't particularly like Thrift, but it's kind of OK in the sense that it doesn't do much weird stuff unexpectedly. I don't believe they had any formal way to verify that their plan was going to work; they were "lucky" because they had to work with Protobuf a lot and figured out some of its shortfalls before they wrote their own.

So, unless you have some sort of mechanical way to verify that your goals are achieved by the format you are creating, it's going to work in the same way that any given C program won't corrupt memory: you might get lucky, or after a lot of testing you'll conclude that it's very unlikely that the program will corrupt memory, but you can never be sure.

And, really, I don't blame the authors. I'm not saying they were lazy, or didn't pay enough attention. It's just hard to do when you don't have a watchdog who will absolutely not allow any and all transgressions. I wouldn't count on myself to do that w/o such a watchdog -- I don't have that kind of confidence even after implementing many different formats used for similar purpose.


Are there any standard alternatives today? Especially for bundles of data - music, images, json, etc. There's .tar.gz which has (I think) a variety of implementations, but it has issues with seeking and the compression isn't great, and brings with it a whole bagload of linux stuff. Maybe something using zstd?


I think ISO is seekable, and from the sound of it you want something that works on an OS where you have very little control over the utilities it offers (i.e. MS Windows), where I think ISO is supported by the pre-installed system utilities. The downside is that these aren't compressed. I don't think the OS that you chose gives you more options... it has native support for more filesystems, but using those as archives seems very problematic (but maybe I don't know something).

Had you not been constrained by the choice of operating system, a reasonable OS + FS combination would offer you some way to make snapshots. Filesystems are usually optimized for random access (with the exception of things like TAR, which is designed for streaming). Most modern general-purpose filesystems support compression, so that wouldn't be a problem. Often these filesystems offer a way to mount the filesystem image w/o involving OS utilities like loop devices, so it would give you a way to access archive contents as if you were accessing any other filesystem.


The bioinformatics community uses block-based gzip compression (bgzip) [0]. The gzip standard allows for concatenated blocks, so, using an additional index file, you can seek to arbitrary locations and decompress just the block you need.

gzip compression is maybe not optimal now, and the block segmentation reduces the efficiency even further.

Though not very standard, there is also a tar indexer program [1] that allows you to create an index on tar files to do the same.

My information is at least a couple years old so things may have changed.

[0] http://www.htslib.org/doc/bgzip.html

[1] https://github.com/devsnd/tarindexer
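(A minimal sketch of the seek-and-decompress idea, assuming you already have an index that maps regions of interest to the compressed byte offsets where blocks start; the real bgzip `.gzi` index format differs in its details, and the file name here is hypothetical.)

```python
import zlib

def read_block(path, compressed_offset):
    """Decompress the single gzip member that starts at compressed_offset.
    This works because each bgzip block is an independent, complete gzip member."""
    with open(path, "rb") as f:
        f.seek(compressed_offset)
        chunk = f.read(64 * 1024)              # bgzip blocks are at most 64 KiB
    d = zlib.decompressobj(wbits=31)           # 31 = expect a gzip header/trailer
    return d.decompress(chunk)                 # stops at the end of the first member

# Usage (hypothetical file and offset taken from an index):
# data = read_block("reads.fastq.gz", compressed_offset=123456)
```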


Do you consider rar a standard? It's a pretty good format, even though there aren't any good open-source implementations. But if you're willing to pay for your software, WinRAR command line versions are available for most platforms.

7zip is the most obvious free alternative. There is also a 7zip fork that offers zstd [1]. The command line experience for 7zip isn't very good however.

1: https://github.com/mcmilk/7-Zip-zstd


tar.gz/bz2/zstd might be more sane, as it separates the concept of concatenating multiple files together from the compression aspect. Perhaps the thing we can improve upon is the tar format, which IIRC was designed for tape devices.


Separating these concepts makes seekability much harder, and imho has dubious benefits at best.

I'm all for formats that separate the exact compression algorithm from the file storage format. Both rar and 7zip do this. For example 7zip offers you 4 compression algorithms (LZMA/LZMA2/PPMD/Bzip2) and forks extending the choice while keeping the same file format.

And if you don't keep the storage and compression layer in the dark about the existence of the other you can chunk compression more sensibly, keep your file directory uncompressed (or compressed separately), embed recovery information (like RAR's embedded par2), and choose different algorithms for different files (including not compressing certain file types)


True, you have convinced me. That said, I think the reality is that Zip and tar+compression are good enough and there is no real need to change the status quo. Both are extensible in their compression algo, which is the thing people care most about in their archives. I would go as far as to say people these days use 7z and rar out of inertia rather than actual need.


Not exactly standardized, but .7z should be versatile enough to cover most use cases, except as a basis for application data files. For that case, you can actually design your own file format, which is surprisingly easy if you know your own requirements.


tar is just the way you pack multiple files and directories into one file; you can pipe the tar file through any compression algorithm you like, it doesn't matter. The tar command has many compression options built in: https://man.archlinux.org/man/tar.1#Compression_options And you can also specify any compression program you like through `-I CMD`, e.g. `-I "zstd -10"` would compress the tar archive with zstd at compression level 10.



Trying to generalise and extract a lesson from this...

It was a quick DOS shareware app in the 1980s. There are better file-compression solutions now, for modern multicore processors on multitasking cross-platform OSes, but this one caught on, because it was shareware, and now has critical mass so we have to deal.

Generalising a lot, going a century back:

QWERTY was a handy keyboard format back when typewriters were catching on. There are better layouts now, but because QWERTY had a big win that led to another big win (it was designed to stop type levers jamming, which meant making English users continually alternate left and right hands, and that happens to be quite ergonomically efficient), it caught on and now has critical mass, so we have to deal with it.

Generalising just a little, just a couple of decades back from PKZip, and recursing 1 step:

UNIX was a quick hack of a text-terminal-driven minicomputer OS in the late 1960s, based on some ideas from mainstream OS R&D back then. Then they ported it to other text-only minicomputers, so they came up with a quick hack of a language to do it (C). There are better solutions now, including what the UNIX creators went on to do in the 1980s (an OS for LANs of homogeneous graphical single-user workstations, Plan 9) and in the 1990s (a similar OS for heterogeneous LANs of workstations, some running other OSes: Inferno), and better languages (Limbo, Oberon), but Unix and direct copies of Unix (Linux, xBSD) got critical mass, so we have to deal.


> Contrary to popular belief, the QWERTY layout was not designed to slow the typist down, but rather to speed up typing. Indeed, there is evidence that, aside from the issue of jamming, placing often-used keys farther apart increases typing speed, because it encourages alternation between the hands

-- https://en.wikipedia.org/wiki/QWERTY#Properties

So you took the common urban legend and altered it with some half-truths ;)


Hang on, what?

You prefix "Contrary to..." with a > which usually indicates quoting...

But that's not a quote. I didn't say that. I didn't say anything at all about that. I didn't say anything about slowing anyone down or speeding up typing.

So you are replying to my comment by questioning what someone else said.

I have nothing to say about what you're talking about.


> a > which usually indicates quoting

And followed by -- and a link, which usually indicates the source of the quote.


Not to me it doesn't, no. Never heard of that convention before, and I've been online since 1985, but hey.

I did not refer to that Wikipedia page, or any Wikipedia page.

I can't argue with something I never said or meant. I don't see the relevance. I did not mention typing speed at all.


Or nearly any programming language that is actually in widespread use. Any protocol, standard, markup language........

Iprovens law. A) Anything in widespread use will be used for things it isn't intended for, and thus won't be suitable for those things. The more widespread the use, the greater the proportion of its use is for things it isn't intended for.

B) Anything in widespread use won't be able to change because it will break existing things.

(I wasn't sure if you were intending both or one of these)


It's an L.

I would say those generalise way too far, but that doesn't make the observation entirely invalid.

But the thing is that in some fields and in some areas of software, the industry has proved willing to drop older stuff and make breaking changes where it works out necessary.

A 10Y old web browser barely works now. A 20Y old web browser doesn't.

Despite its inadequacies, a huge number of people and machines have moved to IPv6.

Javascript now isn't the quick hack that it was 25Y ago.

We dropped DOS and Win9x and most other OSes with roots in the 1980s. When it proved impossible to extend them any more, the industry let them die and moved on.


OK but having your initials as the first two bytes in billions of files circulated every year online is kinda badass


It was a well-established tradition before PKZip came along

https://en.wikipedia.org/wiki/DOS_MZ_executable


Badass? Sounds obnoxiously self-indulgent to me. Looking at you, toml.


What I don't understand is the use case for data descriptors. If the entire file is available, then the central directory is sufficient. If the file is being streamed, then there's no way to know with certainty where the data descriptor is, especially for a scenario like an uncompressed zip stored within a zip.

I suppose a parser could try to compare the CRC and uncompressed size fields every time it encounters the data descriptor magic bytes until it finds one that matches, but the magic bytes aren't even mandatory. The data descriptor can just be CRC + compressed size + uncompressed size with no marker.

From basic testing, `unzip`, libarchive/bsdtar, and the Python zipfile module all choke on files like this when read from a stream (as expected).

Why were data descriptors ever included in the spec?
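(For reference, a hedged sketch of why streaming readers have to guess: the data descriptor is three little-endian 32-bit fields -- CRC-32, compressed size, uncompressed size -- optionally preceded by the signature 0x08074b50. The sketch assumes the buffer holds at least 16 bytes.)

```python
import struct

DD_SIG = 0x08074b50  # optional data descriptor signature ("PK\x07\x08")

def parse_data_descriptor(buf):
    """Return (crc32, compressed_size, uncompressed_size, bytes_consumed).
    The signature is optional, so a reader must try both layouts and can only
    guess (e.g. by checking the fields against data it has already seen) --
    and a CRC that happens to equal 0x08074b50 makes even that guess wrong."""
    maybe_sig, a, b, c = struct.unpack_from("<IIII", buf)
    if maybe_sig == DD_SIG:
        return a, b, c, 16        # signed form: marker + three fields
    crc, csize, usize = struct.unpack_from("<III", buf)
    return crc, csize, usize, 12  # bare form: three fields, no marker
```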


I implemented a tool in Go based on the "BagIt" format spec for open data repositories. I remember investigating whether it's possible to validate the contents of a very large ZIP file in an API without uploading multiple gigabytes, that is, drop the upload if the central index does not contain the required entries. I learned that ZIP is an indexed format and that the index is (usually) at the end, while TAR is a streaming format (because it was written for non-random-access tape). You can inspect the index on the fly, but you will have already received almost all of the data at that point.

The ZIP format (and derivatives) makes sense for the purpose it was likely designed for, which is not useful for use cases where a streaming or incremental approach would be a better solution.


What we can do today:

Use different, better-structured formats. (Except that just about everything can handle a 'well formed' zip file.)

Take care to emit well-formed zip files.

Never trust un-sanitized input. This includes ensuring that any malformed input does not escalate to a security issue such as a buffer overflow, or a path escape when not allowed (see the sketch after this list).

Be tolerant of poorly formed zip files.

Have repair tools which operate in useful ways, such as 'streaming' (front to back), with optional rename / replace / etc. on name collision / delete, or utilize the Nth located (backwards or forwards?) directory record set.
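On the path-escape point above, a minimal sketch of the kind of check an extractor needs (directory and file names here are hypothetical): resolve each entry's target path and refuse anything that lands outside the extraction root.

```python
import os
import zipfile

def safe_extract(zip_path, dest_dir):
    """Extract only entries whose resolved path stays inside dest_dir."""
    dest_dir = os.path.realpath(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            target = os.path.realpath(os.path.join(dest_dir, info.filename))
            # Reject absolute paths and "../" tricks that escape dest_dir.
            if os.path.commonpath([dest_dir, target]) != dest_dir:
                raise ValueError(f"unsafe path in archive: {info.filename!r}")
            zf.extract(info, dest_dir)

# Usage (hypothetical names):
# safe_extract("untrusted.zip", "extracted")
```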


> Except just about everything can handle a 'well formed' zip file

This depends on the extensions that are used. For example, a ZIP file can use LZMA compression, and not all decompressors can handle it.


If LZMA is your intended target, I more strongly suggest selecting a more modern container such as the 7z archive format.

Someone selects a zip file not for its compression, but for how widely accessible it is. Most likely this involves basic compression on text files, error-detection checksums, and storage (no compression) of already well-compressed images and possibly videos.


Anyone here old enough to remember from the BBS days when Phil Katz was embroiled in a legal dispute over his program PKARC and the ARC compression file format? At the time it was cast as a David and Goliath story (with Phil being David) when it was really just two small home based software developers fighting it out.

Long story short, Phil lost the ARC dispute, which is why I assume he moved on to the ZIP format. In the end, Phil Katz was taken from us too soon because his personal demons got the better of him.


This dispute was the subject of the last chapter of Jason Scott’s BBS documentary. The whole thing is worth watching, of course.

https://m.youtube.com/watch?v=uNXCd2EATSo&list=PL7nj3G6Jpv2G...


Sure, I gave a talk recently where I mentioned P.K. and the opening of the ZIP format, which made it the default compression format. And zip became a proper noun (aside from the Zoning Improvement Plan version, that is).

>In the end, Phil Katz was taken...

Indeed, at 37 -- it's almost a cautionary tale nowadays.


Regarding back-reading (and forward too, honestly)... you can store files in a .zip (i.e. not compressed at all). Some of those files could be perfectly valid .zip files too. *grin*


How about a .zip that stores a copy of itself?

https://alf.nu/ZipQuine


How about zip-bombs?


Another fun trick is that zip being back-read means it can be smushed together with a file that's forwards-read, and you get a dual-format file. That was a common way to smuggle archives on image boards.

pico-8 does that by design: the "cartridges" are image files, except it uses steganography to encode the program in the image instead of just smuggling it in at the end.


This is also how self-extracting executables work. It's just a program that extracts whatever zip file exists after the program itself. Rename it from .exe to .zip and you have a valid zip file that just happens to have Windows executable code before the start of the zip contents.
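A quick illustration of that trick (hypothetical file names; the stub here is just placeholder bytes, not a real extractor): most zip readers, Python's zipfile included, locate the archive from the end-of-central-directory record and compensate for any bytes prepended before the first local header.

```python
import zipfile

# Build a tiny archive, then glue some arbitrary "stub" bytes in front of it.
with zipfile.ZipFile("payload.zip", "w") as zf:
    zf.writestr("hello.txt", "hi\n")

with open("payload.zip", "rb") as src, open("polyglot.bin", "wb") as dst:
    dst.write(b"#!/bin/sh\necho pretend this is extractor code\n")  # stand-in stub
    dst.write(src.read())

# The combined file still opens as a zip: the reader works backwards from the
# end-of-central-directory record and adjusts the member offsets accordingly.
with zipfile.ZipFile("polyglot.bin") as zf:
    print(zf.read("hello.txt"))
```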


The article is missing a nice reference to a potential exploit in Firefox addons caused by ambiguous parsing of zip files. https://bugzilla.mozilla.org/show_bug.cgi?id=1534483


(2021)



