I have questions about the code. Why do you need to say int('0x1', 16) and int('0x2', 16)? Why not just write 0x1 and 0x2? Or just plain 1 and 2?
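For what it's worth, the three spellings are exactly equivalent, so the int('0x1', 16) form is just extra indirection; a quick check in the REPL:

    >>> int('0x1', 16) == 0x1 == 1
    True
    >>> int('0x2', 16) == 0x2 == 2
    True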
I'm also perplexed by the goal, as this seems to just call zipfile.write under the hood [0], which already streams to a zip file without accumulating a memory buffer?
I think the appeal is that it's a generator, so if you need to encapsulate/cram the bytes of the zip over some other transport, you can just naturally ask for a few more each time without having to accumulate them in memory.
Of course, by crafting a special file-like object you could avoid this too, but perhaps a bit less elegantly.
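For the curious, here's a rough sketch of that generator-plus-file-object idea. This is my own toy version, not zipfly's actual code; names like _ByteSink and stream_zip are made up, and members over 4 GB would additionally need force_zip64=True.

    import os
    import zipfile

    class _ByteSink:
        """Write-only 'file' that just collects whatever zipfile emits."""
        def __init__(self):
            self._buf = bytearray()
        def write(self, data):
            self._buf += data
            return len(data)        # zipfile's offset bookkeeping needs the count
        def flush(self):
            pass
        def drain(self):
            chunk = bytes(self._buf)
            self._buf.clear()
            return chunk

    def stream_zip(paths, chunk_size=64 * 1024):
        sink = _ByteSink()
        with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                with open(path, "rb") as src, zf.open(os.path.basename(path), mode="w") as dst:
                    while True:
                        block = src.read(chunk_size)
                        if not block:
                            break
                        dst.write(block)
                        piece = sink.drain()
                        if piece:            # deflate may still be buffering
                            yield piece
        yield sink.drain()                   # central directory, written on close

Whatever is consuming the transport can then just iterate it, e.g. `for chunk in stream_zip(paths): send(chunk)`.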
I'm a little perplexed by the "marketing" around this --- all the archivers I know of don't require more memory than the compression state (which AFAIK for ZIP/deflate is not much more than a 32 KB window), since it's natural for files to be larger than available RAM.
I think it's meant for a pretty narrow use case: serving compressed files through frameworks (as mentioned, for example Django or Flask) that expect to serve file objects, but without writing to disk.
The "usual"/naive solution (if you stay within the python ecosystem) is to compress the files and write to a BytesIO or other in-memory file like object, and then have your framework serve it. The naive solution leads to writing the whole file to memory before serving (thus memory inflation).
This library just looks like a pretty straightforward way to implement the same idea, but with chunking to bound memory usage. At bottom it's doing the same thing, just using generators to yield a chunk at a time.
It's a useful utility for that context. Nothing groundbreaking; it's something that most intermediate-and-up developers could stitch together in probably a few days (especially if they had to brush up on DEFLATE and the generator protocol), but it's nice to not have to.
This is a valid concern and a good enough reason to have such a library. I've written a similar thing for uploading large files to S3 (via Django) by streaming them without the file ever touching the file system (S3ChunkUploader). The reason was that the large files were deemed security-sensitive and the containers were limited to 2 GB of disk space; just uploading four 500 MB files at the same time would have been an attack vector.
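Not the S3ChunkUploader code itself, but the general shape of that kind of upload with boto3's multipart API looks something like this (bucket, key and the 8 MB part size are placeholders; S3 requires parts of at least 5 MB except the last):

    import boto3

    def stream_to_s3(fileobj, bucket, key, part_size=8 * 1024 * 1024):
        # Stream an incoming file-like object (e.g. a request body) to S3
        # in bounded memory, never touching the local file system.
        s3 = boto3.client("s3")
        upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
        parts, number = [], 1
        try:
            while True:
                chunk = fileobj.read(part_size)
                if not chunk:
                    break
                resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                                      PartNumber=number, Body=chunk)
                parts.append({"ETag": resp["ETag"], "PartNumber": number})
                number += 1
            s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                                         MultipartUpload={"Parts": parts})
        except Exception:
            s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"])
            raise

boto3's upload_fileobj does essentially the same thing for you; spelling it out just makes the bounded-memory property visible.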
The functions you link look like they're for simple deflate streams (i.e. a single file), while the OP appears to be about streaming zip archives which can contain multiple files with metadata.
I believe the comparison is just to the bundled zipfile module and BytesIO, which would be the quick and dirty way to make a zipfile without creating actual files, but would be memory intensive.
Looks like it just splits the output into 16 MB chunks, so just standard deflate. The actual compression is handled by Python's zipfile module, which is probably C code underneath.
zipfile uses zlib, which is C. But it's even better than that: it releases the GIL, so it gives a linear speedup with multiple threads. If you need to (de)compress a bunch of files, you can do them all at once quite easily using e.g. concurrent.futures.
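To make that concrete, something like this is enough to get the parallelism (file names are invented, and each file is read whole just to keep the sketch short):

    import zlib
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    def compress_one(path):
        # zlib releases the GIL while compressing, so plain threads give
        # real parallelism for this workload.
        data = Path(path).read_bytes()
        Path(path + ".zz").write_bytes(zlib.compress(data, 6))
        return path

    paths = ["a.bin", "b.bin", "c.bin", "d.bin"]   # hypothetical inputs
    with ThreadPoolExecutor(max_workers=4) as pool:
        for done in pool.map(compress_one, paths):
            print("compressed", done)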
If you want that speedup on the command line without Python, check out pigz. It's gzip with parallelism. Easy 10-20x speedup for some jobs.
Appreciate the quotes from the zip_tricks README as well as the resemblances between Buzon and WeTransfer. Glad some of the work we did proved inspirational ;-)
I need to open a very large CSV file in Python, which is around 25GB in .zip format. Any idea how to do this in a streaming way, i.e. stopping after reading the first few thousand rows?
> I need to open a very large CSV file in Python, which is around 25GB in .zip format. Any idea how to do this in a streaming way, i.e. stopping after reading the first few thousand rows?
Replace the `file_paths` list in my proof of concept with your large file(s), delete the rest (lines 61-68, 77-79) and it should just work.
Works fine with Python's standard library. Files in a ZipFile can be read in a streaming manner. There is no need to store all the data in memory.
    import io, csv, zipfile

    max_lines = 10
    with zipfile.ZipFile("data.zip") as z:
        for info in z.infolist():
            # each member is decompressed incrementally as you read from it
            with z.open(info.filename) as f:
                reader = csv.reader(io.TextIOWrapper(f))
                for i_line, line in enumerate(reader):
                    if i_line >= max_lines:
                        break
                    print(line)
[0] https://github.com/BuzonIO/zipfly/blob/master/zipfly/zipfly....