I have questions about the code. Why do you need to say int('0x1', 16) and int('0x2', 16)? Why not just write 0x1 and 0x2? Or just plain 1 and 2?
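For what it's worth, the three spellings are exactly equivalent, so the int('0x1', 16) form is just extra indirection; a quick check in the REPL:

    >>> int('0x1', 16) == 0x1 == 1
    True
    >>> int('0x2', 16) == 0x2 == 2
    True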
I'm also perplexed by the goal, as this seems to just call zipfile.write under the hood [0], which already streams to a zip file without accumulating a memory buffer?
I think the appeal is that it's a generator, so if you need to encapsulate/cram the bytes of the zip over some other transport, you can just naturally ask for a few more each time without having to accumulate them in memory.
Of course, by crafting a special file-like object you could avoid this too, but perhaps a bit less elegantly.
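For the curious, here's a rough sketch of that generator-plus-file-object idea. This is my own toy version, not zipfly's actual code; names like _ByteSink and stream_zip are made up, and members over 4 GB would additionally need force_zip64=True.

    import os
    import zipfile

    class _ByteSink:
        """Write-only 'file' that just collects whatever zipfile emits."""
        def __init__(self):
            self._buf = bytearray()
        def write(self, data):
            self._buf += data
            return len(data)        # zipfile's offset bookkeeping needs the count
        def flush(self):
            pass
        def drain(self):
            chunk = bytes(self._buf)
            self._buf.clear()
            return chunk

    def stream_zip(paths, chunk_size=64 * 1024):
        sink = _ByteSink()
        with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                with open(path, "rb") as src, zf.open(os.path.basename(path), mode="w") as dst:
                    while True:
                        block = src.read(chunk_size)
                        if not block:
                            break
                        dst.write(block)
                        piece = sink.drain()
                        if piece:            # deflate may still be buffering
                            yield piece
        yield sink.drain()                   # central directory, written on close

Whatever is consuming the transport can then just iterate it, e.g. `for chunk in stream_zip(paths): send(chunk)`.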
I'm a little perplexed by the "marketing" around this --- all the archivers I know of don't require more memory than the compression state (which AFAIK for ZIP/deflate is not much more than a 32 KB window), since it's natural for files to be larger than available RAM.
I think it's meant for a pretty narrow use case: serving compressed files through frameworks (as mentioned, for example Django or Flask) that expect to serve file objects, but without writing to disk.
The "usual"/naive solution (if you stay within the python ecosystem) is to compress the files and write to a BytesIO or other in-memory file like object, and then have your framework serve it. The naive solution leads to writing the whole file to memory before serving (thus memory inflation).
This library just looks like a pretty straightforward way to implement the same idea, but with chunking to bound memory usage. At bottom it's doing the same thing, just using generators to yield a chunk at a time.
It's a useful utility for that context. Nothing groundbreaking; it's something that most intermediate-and-up developers could stitch together in probably a few days (especially if they had to brush up on DEFLATE and the generator protocol), but it's nice to not have to.
This is a valid concern and a good enough reason to have such a library. I've written a similar thing for uploading large files to S3 (via Django) by streaming them without the file ever touching the file system (S3ChunkUploader). The reason was that the large files were deemed security-sensitive and the containers were limited to 2 GB of disk space; just uploading four 500 MB files at the same time would have been an attack vector.
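Not the S3ChunkUploader code itself, but the general shape of that kind of upload with boto3's multipart API looks something like this (bucket, key and the 8 MB part size are placeholders; S3 requires parts of at least 5 MB except the last):

    import boto3

    def stream_to_s3(fileobj, bucket, key, part_size=8 * 1024 * 1024):
        # Stream an incoming file-like object (e.g. a request body) to S3
        # in bounded memory, never touching the local file system.
        s3 = boto3.client("s3")
        upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
        parts, number = [], 1
        try:
            while True:
                chunk = fileobj.read(part_size)
                if not chunk:
                    break
                resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                                      PartNumber=number, Body=chunk)
                parts.append({"ETag": resp["ETag"], "PartNumber": number})
                number += 1
            s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                                         MultipartUpload={"Parts": parts})
        except Exception:
            s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"])
            raise

boto3's upload_fileobj does essentially the same thing for you; spelling it out just makes the bounded-memory property visible.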
The functions you link look like they're for simple deflate streams (i.e. a single file), while the OP appears to be about streaming zip archives which can contain multiple files with metadata.
I believe the comparison is just to the bundled zipfile module and BytesIO, which would be the quick and dirty way to make a zipfile without creating actual files, but would be memory intensive.
Looks like it just splits the output into 16 MB chunks, so just standard deflate. The actual compression is handled by Python's zipfile module, which is probably C code underneath.
zipfile uses zlib, which is C. But it's even better than that: it releases the GIL, so it gives a linear speedup with multiple threads. If you need to (de)compress a bunch of files, you can do them all at once quite easily using e.g. concurrent.futures.
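To make that concrete, something like this is enough to get the parallelism (file names are invented, and each file is read whole just to keep the sketch short):

    import zlib
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    def compress_one(path):
        # zlib releases the GIL while compressing, so plain threads give
        # real parallelism for this workload.
        data = Path(path).read_bytes()
        Path(path + ".zz").write_bytes(zlib.compress(data, 6))
        return path

    paths = ["a.bin", "b.bin", "c.bin", "d.bin"]   # hypothetical inputs
    with ThreadPoolExecutor(max_workers=4) as pool:
        for done in pool.map(compress_one, paths):
            print("compressed", done)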
If you want that speedup on the command line without Python, check out pigz. It's gzip with parallelism. Easy 10-20x speedup for some jobs.
Appreciate the quotes from the zip_tricks README as well as the resemblances between Buzon and WeTransfer. Glad some of the work we did proved inspirational ;-)
I need to open a very large CSV file in Python, which is around 25GB in .zip format. Any idea how to do this in a streaming way, i.e. stopping after reading the first few thousand rows?
> I need to open a very large CSV file in Python, which is around 25GB in .zip format. Any idea how to do this in a streaming way, i.e. stopping after reading the first few thousand rows?
Replace the `file_paths` list in my proof of concept with your large file(s), delete the rest (lines 61-68, 77-79) and it should just work.
Works fine with Python's standard library. Files in a ZipFile can be read in a streaming manner. There is no need to store all the data in memory.
    import io, csv, zipfile

    max_lines = 10
    with zipfile.ZipFile("data.zip") as z:
        for info in z.infolist():
            # each member is decompressed incrementally as you read from it
            with z.open(info.filename) as f:
                reader = csv.reader(io.TextIOWrapper(f))
                for i_line, line in enumerate(reader):
                    if i_line >= max_lines:
                        break
                    print(line)
[0] https://github.com/BuzonIO/zipfly/blob/master/zipfly/zipfly....