Zip – How not to design a file format (2021) (greggman.com)
68 points by sph 11 months ago | 75 comments



I was sitting 10 feet from PK when a lot of these decisions were made, but you have to realize what a different time it was, and how much becomes obvious only in retrospect.

There are more than a few misunderstandings and errors in the article, but mostly it's a valid 20/20 hindsight perspective. We knew it at the time too; when a couple of us formed a different company later, it was with the express idea that we knew all the things that sucked about .zip files and would love a do-over. We never got around to it though; our thinking at the time was to get established with other products first and then come out with a new format. Our initial stuff ended up being pretty successful and we exited to McAfee and didn't look back. Different times; maybe we did the world a disservice by not tackling archiving first :-)

One note about putting "PK" on the file format. In the dev culture Phil learned his craft in, there was no source control or change tracking. Files were exchanged via 'sneakernet'. This was true even at PKWare in the 89-91 timeframe. Change tracking was done via file renaming, and if you added a new variable you put your initials on it, so if someone had a question they'd know who touched it. I used usenet at home but not at work.

Our communication to the outside world was an 8-line BBS in the next room and the EXECPC BBS that one of us would dial into at least once a day to answer questions on.

The multi-part/disk-spanning stuff wasn't added until pretty late in the game, so it was kind of hacky. The format overall is what you get when something takes off and people start asking to use it in ways that weren't anticipated on day one, but you want to maintain backwards compatibility. For sure, time was spent answering questions from devs who were implementing their own versions, especially Info-ZIP, which was IIRC the most important-seeming effort at the time, and a guy who worked in the office a few nights a week on a port to a mainframe environment which escapes me at the moment.

It was pretty heady to be able to walk up to practically any computer on the show floor at Comdex and type 'pkzip' and get a response, but even then the idea of just how ubiquitous it would become was impossible to anticipate.


The article's main point seems to be complaining about ambiguity.

> Scan from the back, find the end-of-central-directory-record and then use it to read through the central directory, only looking at things the central directory references.

It was pretty obvious 30 years ago that this was the correct way to unzip. Streaming content wasn't even a thing back then (dialup being a luxury). It was the only way to ensure that a self-extractor could work reliably.
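(For the curious, a minimal sketch of that back-scan in Python, with a hypothetical file name; the EOCD record starts with the signature `PK\x05\x06` and, because the trailing comment can be up to 65535 bytes, it has to be searched for within roughly the last 65557 bytes of the file.)

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory signature (0x06054b50)
EOCD_FIXED = 22            # fixed-size part of the EOCD record
MAX_COMMENT = 0xFFFF       # trailing archive comment can be up to 65535 bytes

def find_central_directory(path):
    """Return (total_entries, cd_size, cd_offset) by scanning from the back."""
    with open(path, "rb") as f:
        f.seek(0, 2)                       # seek to end of file
        file_size = f.tell()
        tail_len = min(file_size, EOCD_FIXED + MAX_COMMENT)
        f.seek(file_size - tail_len)
        tail = f.read(tail_len)

    pos = tail.rfind(EOCD_SIG)             # last occurrence wins
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")

    # Fixed EOCD fields, little-endian: disk numbers, entry counts,
    # central directory size/offset, comment length.
    (_sig, _disk, _cd_disk, _entries_here, total_entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<IHHHHIIH", tail, pos)
    # Note: a comment that happens to contain the signature bytes can fool this
    # scan -- which is exactly the ambiguity the article complains about.
    return total_entries, cd_size, cd_offset

# Usage (hypothetical file):
# print(find_central_directory("example.zip"))
```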


The problem is not whether the reading method is obvious once you consider that era; it's that neither the reading nor the writing method is exactly specified, so there can be considerable variety among implementations.

For example the OP complains about a passing mention (4.3.1):

> Files MAY be added or replaced within a ZIP file, or deleted.

While the author considers this to imply (but not acknowledge) that local file headers and central directory headers may disagree with each other, I'm not even sure, because the specification never clearly confirms whether random data can appear before any local file header. The only thing that has been acknowledged is SFX (4.1.9), and it is not even specified where the "extraction code" can be embedded. Can it be placed between two files? If such a thing is indeed possible, shouldn't there be a provision to avoid ambiguity due to the stray data? Say, it could have said that deleting a file in place should mask all occurrences of the local file header signature 0x04034b50 so that the vacant space can no longer be interpreted as a file at all.


Exactly, some of the quirks are just a result of designing this for those old systems. ZIP needed to work acceptably on machines with less than a megabyte of RAM. The file format contains ambiguity and redundancy, yes, but given well-behaved software it's the "good enough" of archive formats. Even in the *nix world we didn't get anything better; tar isn't even seekable and has its own idiosyncrasies, especially if you compress it, because the compression happens afterwards and is applied to the whole archive.


Tar IS seekable, as it was designed especially for tapes.

The modern usage of it adds compression, which you have to undo first before doing your archive operations ;)

Zip was > tar (and cpio) in that regard because you didn't have that two-step process and only had to have enough spare disk space for the extracted file, rather than the extracted file + the decompressed tar.


Tar is seekable in the literal sense: you have to seek around until you find the file you want. You have to look at each header; if it's not the file you want, you skip over that one file and read the next header, because you only know how long that one file is. So you cannot even know where the third file is without having processed the headers of the previous two. It's "read, skip, read, skip, read, skip" until you maybe find the file you wanted. ZIP otoh has the central directory index where you can look things up much faster.
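A minimal sketch of that read/skip loop over an uncompressed tar (hypothetical file name; this assumes plain ustar-style 512-byte headers and ignores GNU extensions):

```python
BLOCK = 512  # tar headers and data padding come in 512-byte blocks

def list_tar(path):
    """Walk an uncompressed tar: read a header, skip the data, repeat."""
    entries = []
    with open(path, "rb") as f:
        while True:
            header = f.read(BLOCK)
            if len(header) < BLOCK or header == b"\0" * BLOCK:
                break                                   # end-of-archive marker
            name = header[0:100].rstrip(b"\0").decode("utf-8", "replace")
            size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)  # octal size field
            entries.append((name, size))
            # Skip this entry's data, rounded up to the next 512-byte block.
            f.seek((size + BLOCK - 1) // BLOCK * BLOCK, 1)
    return entries

# Usage (hypothetical file):
# print(list_tar("example.tar"))
```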


Zip still > tar in that regard, since you can list archive contents (and extract individual files) without having to go through the entire thing every time, hence zip being suitable as a local file system (à la opendocument or ooxml).


Zip is terrible as a local file system, unless you consider read-only use (random access, and not many files). Writing to or appending to a file requires some form of garbage organization/compaction, which requires a rewrite of the entire CEN. Unlike the regular files/entries, the CEN doesn't have a checksum, so corruption there is pretty bad.

Each file being compressed individually means low compression ratios; the deflate overhead might be acceptable for slow media (although in that case the memory overhead would be non-trivial).


The massive downsize of zip is that each file is compressed individually, which means very poor compression. Overall the compression ratio is significantly lower, and the same goes for decompressing it all. It's only useful for random reads of a few select files.

Tar -> read, skip, read, until you find what you need; compression doesn't change much, as most stream compression algorithms have a flush method.


> The massive downsize...

I expect you meant "downside", because when talking about file compression, a "massive downsize" sounds like a good thing :-)


indeed, I suck at typing


It's also the only way such a file can be editable, which was a major design goal of the zip format and which is not really possible with tar files.


Not sure what you're saying here. Tar supports appending new files and new versions of files to an existing archive without rewriting the entire archive. I don't believe it supports marking files as deleted (not sure why, tbh), but you can add an empty file with the same name for non-tape storage to prevent extracting the original version. It does this by simply appending the new version of the file to the end of the archive, which is understandable, as unlike zip files, tar (tape archive) was designed explicitly for tape, where there is no random access. Unless you're referring to editing files in place inside a tar file, which should actually be easier than doing so with a zip file (as long as the new version isn't longer than the old version, and you're not actually storing it on tape).


So there are three options: index at the beginning, index at the end, and no index.

Index at the beginning means it’s easy to look up, but impossible to add anything.

No index, which is tar, means it’s impossible to tell what’s in it without reading the entire file.

Index at the end, which is zip, means you have to do slightly more work to find the index, but then you can read it without reading the rest of the file, and move it further along the file if you want to extend the file and you can edit it or add to it.

And if you want compression, tar is just a compressed blob and the only thing you can do is read it from beginning to end. That's just not very helpful unless the only things you want to do are package a bunch of things and unpack all of them. A major use case, but one met perfectly fine by hundreds of formats.
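To make the index-at-the-end point concrete, a small sketch with a hypothetical example.zip: as far as I know this is how Python's zipfile append mode behaves, writing the new member's local header and data where the old central directory began, then writing a fresh central directory and end record after them on close.

```python
import zipfile

# Append a new member without rewriting the existing ones; only the trailing
# central directory and end-of-central-directory record get rewritten.
with zipfile.ZipFile("example.zip", mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes/added-later.txt", "appended without touching earlier members\n")

# The updated index is still readable the normal way.
with zipfile.ZipFile("example.zip") as zf:
    print(zf.namelist())
```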


Well, depending on the definition you could count ZMODEM as a streaming + compression (RLE) protocol. I used it a lot almost 30 years ago!


As I mentioned, dialup for many was a luxury.


Streaming was a thing if you include tape


Cassettes were mostly used on 8 bit machines ;)

Tape drives were relatively expensive compared to floppy disks.


Tape drives were used on a lot of early computers, including the hardware that ran early Unices (e.g. 16-bit PDP-11) and which tar/gzip were written for

https://en.wikipedia.org/wiki/IBM_7-track

https://en.wikipedia.org/wiki/9-track_tape

https://en.wikipedia.org/wiki/DECtape

"tape" includes reel-to-reel systems as well as cassette.


And exactly none of them were used on contemporary machines that PKZIP ran on.


Tape drives for backup were used at the time, and are still used. Though I can't blame PKZIP for not caring about that use case

https://en.wikipedia.org/wiki/Digital_Linear_Tape

https://en.wikipedia.org/wiki/Data8

https://en.wikipedia.org/wiki/Linear_Tape-Open


Huh. I wasn't aware that cassette tapes were used on the systems that PKZip ran on either.

Pretty sure DOS didn't have tape drive support ordinarily, and I wasn't aware that PKZip was widely ported to anything that wasn't DOS or a clone/derivative?


PK(A)Zip was available on the Amiga in 1990

https://groups.google.com/g/comp.sys.amiga/c/_XvqYUvBwvc/m/c...

Unix support came later, in the form of infozip in August 1992


I mean they were, but maybe sometimes for backups of hard drives? Certainly not as a common medium to exchange software.


https://en.wikipedia.org/wiki/IBM_cassette_tape

Sorry, but I guess you had to be there to have that knowledge?


Cool, TIL.

I was kind of there, but I don't feel like this is something I should necessarily have been aware of:

> apart from one diagnostic tape available from IBM, there seems never to have been any software sold on tape, and the interface was not included on the followup PC XT.

Yeah, I don't think I knew anyone with a first-gen IBM PC, rather than an XT or XT clone. Even so, for something as niche as the personal computer was pre-1983, this seems like an extra-niche bit of lore. I wouldn't be surprised if a good proportion of people who were properly there didn't have that knowledge.

But, sure, be a douche about it if it makes you feel more superior. Whatever.

> An IBM PC with just an external cassette recorder for storage could only use the built-in ROM BASIC as its operating system, which supported cassette operations. IBM PC DOS had no support for cassette tape,

So, are you saying that PKZip supported running on the pre-XT ROM BASIC OS, as well as PC-DOS/MS-DOS?


> But, sure, be a douche about it if it makes you feel more superior. Whatever.

No, but don't try to assert conjecture as facts. "Whatever"

> So, are you saying that PKZip supported running on the pre-XT ROM BASIC OS, as well as PC-DOS/MS-DOS?

Now you're being a douche, no? I said that the hardware had cassette tape access.

The IBM PC and PCjr both had the hardware to access tape drives. And yes, you could access the cassette tape routines, INT 15h, functions 0/1/2/3 ( https://stanislavs.org/helppc/int_15.html ), through DOS, though as you allude to, there was little need to do so.

However BASICA could load programs from tape. https://www.youtube.com/watch?v=KTmzUBb924A

There, something else you can learn today. I'll leave it to you to discover exactly what.


> No, but don't try to assert conjecture as facts.

I hedged with "I wasn't aware", "pretty sure", "ordinarily", and ended with a question mark. I don't know how I could have been more guarded in expressing my recollection as non-authoritative. OK, maybe "pretty sure" was a bit on the strong side. sigh Sorry for that.

> I said that the hardware had cassette tape access.

I thought your point was that IBM cassette drives, unlike the tape systems I mentioned, "were used on contemporary machines that PKZIP ran on."

If your intended meaning was something other than "original IBM PCs running ROM BASIC OS so they could access an IBM Cassette tape was a supported platform by PKZIP", then I apologise for not being smart enough to make whatever leaps of logic were required to understand what your actual point was.


And the winner of the mental gymnastics award goes to ...


I'm trying to parse your comment.

Are you being pedantic by suggesting cassettes aren't tapes? Colloquially, in the UK at least, cassettes were called tapes.

Plus both tapes and cassettes are streaming formats.

And other tape formats came along later for the PC market anyway.


Self-extractors are too dangerous to run, except perhaps in a sandbox.


Apart from 100% of windows installers, and curl | bash installers ? ;-)


Remember we're talking about late 80s/early 90s. Self extractors were how you distributed binaries on floppies to people that might not have had PKZIP and couldn't just go and download it.


Probably "how not to specify a file format" is a better title? Lots of issues in the APPNOTE are already well-known, Info-Zip had its own version of an unofficially corrected specification [1] [2] for example. Even the international standard (ISO/IEC 21320-1:2015) is not helpful, as it is specified by subsetting the APPNOTE, not correcting it.

[1] https://libzip.org/specifications/appnote_iz.txt

[2] https://entropymine.wordpress.com/2019/08/22/survey-of-zip-a...


Any updated version of this revised spec? Because the official APPNOTE.TXT [1] was last updated in November 2022, while your first link was last updated in April 2004.

1: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT


Maybe I missed it in the article, but

> the central directory might not reference all the files in the zip file because otherwise this statement about files being added, replaced, or delete has no point to be in the spec

I think the author missed one optimization that I have observed from various archiving utilities. Suppose you want to compress deeply/nested/folders/with/one-file into an archive. Many archivers will not bother creating dedicated entries for deeply, nested, folders, etc. They will simply have a single entry with that path, and it is assumed that the intermediate folders exist. This threw me off at $DAYJOB when I needed to generate a flat list of "valid paths" in a ZIP archive, "valid" in the sense that the path exists after decompressing the archive.

Note: I don't think this is good design of a file format, and although I've never been able to find documentation explicitly mentioning this behavior, I do believe this is the primary reason the central directory does not have to contain entries for each file/directory in an archive.
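(A hedged sketch of one way to recover those implied parents, with hypothetical names: treat every prefix of each entry's path as a directory that will exist after extraction, whether or not the central directory carries a dedicated entry for it.)

```python
import posixpath
import zipfile

def valid_paths(zip_path):
    """Every path that exists after extraction, including implied parent
    directories that have no dedicated entry in the central directory."""
    paths = set()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            paths.add(name)
            parent = posixpath.dirname(name.rstrip("/"))
            while parent:                      # walk up and add each ancestor
                paths.add(parent + "/")
                parent = posixpath.dirname(parent)
    return sorted(paths)

# Usage (hypothetical archive):
# print(valid_paths("deeply-nested.zip"))
```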


Don't worry, you aren't the only one. Python craps its pants when this happens too.

You see, Python packages (i.e. wheels) are Zip files, and can be installed w/o unpacking them. There are various tools in the standard library, e.g. `importlib.resources` (which superseded `pkg_resources`), which can be used to list all the stuff that came with the package as if it were sitting in directories. But in this case they fail...


My baptism by fire was when I had to write a Protobuf parser and generator. It's the exact same story: the format author didn't think it through, didn't realize the bad consequences that would follow, and made vague or unrealistic claims in the documentation (or forgot to mention things in the documentation).

Since then, I've seen a lot of this stuff... I cannot think of a format suitable for archives / data transfer / configuration that was well designed. And by this I mean that the claims made by the author of the format are shown to be true in a formal way.

It feels very sad that despite countless failures, which all result from the lack of a tool that can check a specification and tell you if it makes sense, nobody really works on that. It seems obvious that you'd want a tool that tells you how well you are doing, and programmers are very dedicated to tools that do formal verification (such as compile-time type checkers), but when it comes to something with farther-reaching consequences it's a "whatever".


A way to deal with the problem is to make a reference implementation instead of a specification document. A document can also be written for convenience, but if code and documentation disagree, short of an obvious bug, the code is right and the docs need to change to reflect that.

And it is actually a common practice.

This way, nothing is vague or missing. If you wonder what is going to happen if a certain field has a certain value, run the code and see what happens; it may not be great, but you know exactly what to expect. With code, you also have access to all the developer tools: compilers, linters, test frameworks, debuggers, coverage analyzers, fuzzers... and even provers. Ideally the reference implementation should be open source with a permissive license, be written with readability in mind, and come with a good test suite.

The problem with zip as described in the article is that the specs are half-assed and the real test for a correct zip file is that it should work with pkzip. But we can't really use pkzip as a reference either because it is proprietary.


A way to deal with the problem?

I think this is how the problem was created in the first place: the authors wrote the reference implementation, but then didn't test it enough, didn't think of all possible ways things can go wrong.

Also, I'm not saying that this necessarily results in bad formats. E.g. I don't particularly like Thrift, but it's kind of OK in the sense that it doesn't do much weird stuff unexpectedly. I don't believe they had any formal way to verify that their plan was going to work; they were "lucky" because they had to work with Protobuf a lot and figured out some of its shortfalls before they wrote their own.

So, unless you have some sort of mechanical way to verify that your goals are achieved by the format you are creating, it's going to work in the same way that any given C program won't corrupt memory: you might get lucky, or after a lot of testing you'll conclude that it's very unlikely that the program will corrupt memory, but you can never be sure.

And, really, I don't blame the authors. I'm not saying they were lazy, or didn't pay enough attention. It's just hard to do when you don't have a watchdog who will absolutely not allow any and all transgressions. I wouldn't count on myself to do that w/o such a watchdog -- I don't have that kind of confidence even after implementing many different formats used for similar purpose.


Are there any standard alternatives today? Especially for bundles of data - music, images, json, etc. There's .tar.gz which has (I think) a variety of implementations, but it has issues with seeking and the compression isn't great, and brings with it a whole bagload of linux stuff. Maybe something using zstd?


I think ISO is seekable, and from the sound of it you want something that works on an OS where you have very little control over the utilities it offers (i.e. MS Windows), where I think ISO is supported by the pre-installed system utilities. The downside is that these aren't compressed. I don't think the OS that you chose gives you more options... it has native support for more filesystems, but using those as archives seems very problematic (but maybe I don't know something).

Had you not been constrained by the choice of operating system, a reasonable OS + FS combination would offer you some way to make snapshots. Filesystems are usually optimized for random access (with the exception of things like TAR, which is designed for streaming). Most modern general-purpose filesystems support compression, so that wouldn't be a problem. Often these filesystems offer a way to mount the filesystem image w/o involving OS utilities like loop devices, so it would give you a way to access archive contents as if you were accessing any other filesystem.


The bioinformatics community uses block-based gzip compression (bgzip) [0]. The gzip standard allows for concatenated blocks, so, using an additional index file, you can seek to arbitrary locations and decompress just the block you need.

gzip compression is maybe not optimal now, and the block segmentation reduces the efficiency even further.

Though not very standard, there is also a tar indexer program [1] that allows you to create an index on tar files to do the same.

My information is at least a couple years old so things may have changed.

[0] http://www.htslib.org/doc/bgzip.html

[1] https://github.com/devsnd/tarindexer
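(A minimal sketch of the seek-and-decompress idea, assuming you already have an index that maps regions of interest to the compressed byte offsets where blocks start; the real bgzip `.gzi` index format differs in its details, and the file name here is hypothetical.)

```python
import zlib

def read_block(path, compressed_offset):
    """Decompress the single gzip member that starts at compressed_offset.
    This works because each bgzip block is an independent, complete gzip member."""
    with open(path, "rb") as f:
        f.seek(compressed_offset)
        chunk = f.read(64 * 1024)              # bgzip blocks are at most 64 KiB
    d = zlib.decompressobj(wbits=31)           # 31 = expect a gzip header/trailer
    return d.decompress(chunk)                 # stops at the end of the first member

# Usage (hypothetical file and offset taken from an index):
# data = read_block("reads.fastq.gz", compressed_offset=123456)
```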


Do you consider rar a standard? It's a pretty good format, even though there aren't any good open-source implementations. But if you're willing to pay for your software, WinRAR command line versions are available for most platforms.

7zip is the most obvious free alternative. There is also a 7zip fork that offers zstd [1]. The command line experience for 7zip isn't very good however.

1: https://github.com/mcmilk/7-Zip-zstd


tar.gz/bz2/zstd might be more sane, as it separates the concept of concatenating multiple files together from the compression aspect. Perhaps the thing we can improve upon is the tar format, which IIRC was designed for tape devices.


Separating these concepts makes seekability much harder, and imho has dubious benefits at best.

I'm all for formats that separate the exact compression algorithm from the file storage format. Both rar and 7zip do this. For example 7zip offers you 4 compression algorithms (LZMA/LZMA2/PPMD/Bzip2) and forks extending the choice while keeping the same file format.

And if you don't keep the storage and compression layer in the dark about the existence of the other you can chunk compression more sensibly, keep your file directory uncompressed (or compressed separately), embed recovery information (like RAR's embedded par2), and choose different algorithms for different files (including not compressing certain file types)


True, you have convinced me. That said, I think the reality is that Zip and tar+compression are good enough and there is no real need to change the status quo. Both are extensible in their compression algo, which is the thing people care most about in their archives. I would go as far as to say people these days use 7z and rar out of inertia rather than actual need.


Not exactly standardized, but .7z should be versatile enough to cover most use cases, except as a basis for application data files. For that case, you can actually design your own file format, which is surprisingly easy if you know your own requirements.


tar is just the way you pack multiple files and directories into one file; you can pipe the tar file through any compression algorithm you like, it doesn't matter. The tar command has many compression options built in: https://man.archlinux.org/man/tar.1#Compression_options And you can also specify any compression program you like through `-I CMD`, e.g. `-I "zstd -10"` would compress the tar archive with zstd at compression level 10.



Trying to generalise and extract a lesson from this...

It was a quick DOS shareware app in the 1980s. There are better file-compression solutions now, for modern multicore processors on multitasking cross-platform OSes, but this one caught on, because it was shareware, and now has critical mass so we have to deal.

Generalising a lot, going a century back:

QWERTY was a handy keyboard format back when typewriters were catching on. There are better layouts now, but because QWERTY had a big win that led to another big win (it was designed to stop type levers jamming, which meant making English users continually alternate left and right hands, and that happens to be quite ergonomically efficient), it caught on and now has critical mass, so we have to deal with it.

Generalising just a little, just a couple of decades back from PKZip, and recursing 1 step:

UNIX was a quick hack of a text-terminal-driven minicomputer OS in the late 1960s, based on some ideas from mainstream OS R&D back then. Then they ported it to other text-only minicomputers, so they came up with a quick hack of a language to do it (C). There are better solutions now, including what the UNIX creators went on to do in the 1980s (an OS for LANs of homogeneous graphical single-user workstations, Plan 9) and in the 1990s (a similar OS for heterogeneous LANs of workstations, some running other OSes: Inferno), and better languages (Limbo, Oberon), but Unix and direct copies of Unix (Linux, xBSD) got critical mass, so we have to deal.


> Contrary to popular belief, the QWERTY layout was not designed to slow the typist down, but rather to speed up typing. Indeed, there is evidence that, aside from the issue of jamming, placing often-used keys farther apart increases typing speed, because it encourages alternation between the hands

-- https://en.wikipedia.org/wiki/QWERTY#Properties

So you took the common urban legend and altered it with some half-truths ;)


Hang on, what?

You prefix "Contrary to..." with a > which usually indicates quoting...

But that's not a quote. I didn't say that. I didn't say anything at all about that. I didn't say anything about slowing anyone down or speeding up typing.

So you are replying to my comment by questioning what someone else said.

I have nothing to say about what you're talking about.


> a > which usually indicates quoting

And followed by -- and a link, which usually indicates the source of the quote.


Not to me it doesn't, no. Never heard of that convention before, and I've been online since 1985, but hey.

I did not refer to that Wikipedia page, or any Wikipedia page.

I can't argue with something I never said or meant. I don't see the relevance. I did not mention typing speed at all.


Or nearly any programming language that is actually in widespread use. Any protocol, standard, markup language........

Iprovens law. A) Anything in widespread use will be used for things it isn't intended for, and thus won't be suitable for those things. The more widespread the use, the greater the proportion of its use is for things it isn't intended for.

B) Anything in widespread use won't be able to change because it will break existing things.

(I wasn't sure if you were intending both or one of these)


It's an L.

I would say those generalise way too far, but that doesn't make the observation entirely invalid.

But the thing is that in some fields and in some areas of software, the industry has proved willing to drop older stuff and make breaking changes where it works out necessary.

A 10Y old web browser barely works now. A 20Y old web browser doesn't.

Despite its inadequacies, a huge number of people and machines have moved to IPv6.

Javascript now isn't the quick hack that it was 25Y ago.

We dropped DOS and Win9x and most other OSes with roots in the 1980s. When it proved impossible to extend them any more, the industry let them die and moved on.


OK but having your initials as the first two bytes in billions of files circulated every year online is kinda badass


It was a well-established tradition before PKZip came along

https://en.wikipedia.org/wiki/DOS_MZ_executable


Badass? Sounds obnoxiously self-indulgent to me. Looking at you, toml.


What I don't understand is the use case for data descriptors. If the entire file is available, then the central directory is sufficient. If the file is being streamed, then there's no way to know with certainty where the data descriptor is, especially for a scenario like an uncompressed zip stored within a zip.

I suppose a parser could try to compare the CRC and uncompressed size fields every time it encounters the data descriptor magic bytes until it finds one that matches, but the magic bytes aren't even mandatory. The data descriptor can just be CRC + compressed size + uncompressed size with no marker.

From basic testing, `unzip`, libarchive/bsdtar, and the Python zipfile module all choke on files like this when read from a stream (as expected).

Why were data descriptors ever included in the spec?
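(For reference, a hedged sketch of why streaming readers have to guess: the data descriptor is three little-endian 32-bit fields -- CRC-32, compressed size, uncompressed size -- optionally preceded by the signature 0x08074b50. The sketch assumes the buffer holds at least 16 bytes.)

```python
import struct

DD_SIG = 0x08074b50  # optional data descriptor signature ("PK\x07\x08")

def parse_data_descriptor(buf):
    """Return (crc32, compressed_size, uncompressed_size, bytes_consumed).
    The signature is optional, so a reader must try both layouts and can only
    guess (e.g. by checking the fields against data it has already seen) --
    and a CRC that happens to equal 0x08074b50 makes even that guess wrong."""
    maybe_sig, a, b, c = struct.unpack_from("<IIII", buf)
    if maybe_sig == DD_SIG:
        return a, b, c, 16        # signed form: marker + three fields
    crc, csize, usize = struct.unpack_from("<III", buf)
    return crc, csize, usize, 12  # bare form: three fields, no marker
```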


I implemented a tool in Go based on the "BagIt" format spec for open data repositories. I remember investigating whether it's possible to validate the contents of a very large ZIP file in an API without uploading multiple gigabytes, that is, drop the upload if the central index does not contain the required entries. I learned that ZIP is an indexed format and that the index is (usually) at the end, while TAR is a streaming format (because it was written for non-random-access tape). You can inspect the index on the fly, but you will have already received almost all of the data at that point.

The ZIP format (and derivatives) makes sense for the purpose it was likely designed for, which is not useful for use cases where a streaming or incremental approach would be a better solution.


What we can do today:

Use different, better-structured formats. (Except that just about everything can handle a 'well formed' zip file.)

Take care to emit well-formed zip files.

Never trust un-sanitized input. This includes ensuring that any malformed input does not escalate to a security issue such as a buffer overflow, or a path escape when not allowed (see the sketch after this list).

Be tolerant of poorly formed zip files.

Have repair tools which operate in useful ways, such as 'streaming' (front to back), with optional rename / replace / etc. on name collision / delete, or utilize the Nth located (backwards or forwards?) directory record set.
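On the path-escape point above, a minimal sketch of the kind of check an extractor needs (directory and file names here are hypothetical): resolve each entry's target path and refuse anything that lands outside the extraction root.

```python
import os
import zipfile

def safe_extract(zip_path, dest_dir):
    """Extract only entries whose resolved path stays inside dest_dir."""
    dest_dir = os.path.realpath(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            target = os.path.realpath(os.path.join(dest_dir, info.filename))
            # Reject absolute paths and "../" tricks that escape dest_dir.
            if os.path.commonpath([dest_dir, target]) != dest_dir:
                raise ValueError(f"unsafe path in archive: {info.filename!r}")
            zf.extract(info, dest_dir)

# Usage (hypothetical names):
# safe_extract("untrusted.zip", "extracted")
```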


> Except just about everything can handle a 'well formed' zip file

This depends on the extensions that are used. For example, a ZIP file can use LZMA compression, and not all decompressors can handle it.


If LZMA is your intended target, I more strongly suggest selecting a more modern container such as the 7z archive format.

Someone selects a zip file not for its compression, but for how widely accessible it is. Most likely this involves basic compression on text files, error-detection checksums, and storage (no compression) of already well-compressed images and possibly videos.


Anyone here old enough to remember from the BBS days when Phil Katz was embroiled in a legal dispute over his program PKARC and the ARC compression file format? At the time it was cast as a David and Goliath story (with Phil being David) when it was really just two small home based software developers fighting it out.

Long story short, Phil lost the ARC dispute, which is why I assume he moved on to the ZIP format. In the end, Phil Katz was taken from us too soon because his personal demons got the better of him.


This dispute was the subject of the last chapter of Jason Scott’s BBS documentary. The whole thing is worth watching, of course.

https://m.youtube.com/watch?v=uNXCd2EATSo&list=PL7nj3G6Jpv2G...


Sure, I gave a talk recently where I mentioned P.K. and the opening of the ZIP format, which made it the default compression format. And zip became a proper noun (aside from the Zoning Improvement Plan version, that is).

>In the end, Phil Katz was taken...

Indeed, at 37 -- it's almost a cautionary tale nowadays.


Regarding back-reading (and forward too, honestly)... you can store files in a .zip (i.e. not compressed at all). Some of those files could be perfectly valid .zip files too. *grin*


How about a .zip that stores a copy of itself?

https://alf.nu/ZipQuine


How about zip-bombs?


Another fun trick is that zip being back-read means it can be smushed together with a file that's forwards-read, and you get a dual-format file. That was a common way to smuggle archives on image boards.

pico-8 does that by design: the "cartridges" are image files, except it uses steganography to encode the program in the image instead of just smuggling it in at the end.


This is also how self-extracting executables work. It's just a program that extracts whatever zip file exists after the program itself. Rename it from .exe to .zip and you have a valid zip file that just happens to have Windows executable code before the start of the zip contents.
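A quick illustration of that trick (hypothetical file names; the stub here is just placeholder bytes, not a real extractor): most zip readers, Python's zipfile included, locate the archive from the end-of-central-directory record and compensate for any bytes prepended before the first local header.

```python
import zipfile

# Build a tiny archive, then glue some arbitrary "stub" bytes in front of it.
with zipfile.ZipFile("payload.zip", "w") as zf:
    zf.writestr("hello.txt", "hi\n")

with open("payload.zip", "rb") as src, open("polyglot.bin", "wb") as dst:
    dst.write(b"#!/bin/sh\necho pretend this is extractor code\n")  # stand-in stub
    dst.write(src.read())

# The combined file still opens as a zip: the reader works backwards from the
# end-of-central-directory record and adjusts the member offsets accordingly.
with zipfile.ZipFile("polyglot.bin") as zf:
    print(zf.read("hello.txt"))
```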


The article is missing a nice reference to a potential exploit in Firefox addons caused by ambiguous parsing of zip files. https://bugzilla.mozilla.org/show_bug.cgi?id=1534483


(2021)



