Actually, it does not compress less efficiently, but you do not learn this fact from the README. Instead you have to look inside the man page (https://raw.githubusercontent.com/madler/pigz/master/pigz.1):

The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread.
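For illustration, here is a minimal sketch of what that means in zlib terms (Python's zlib module, not pigz's actual code; the data is made up): a block compressed with the previous block's tail supplied as a preset dictionary can keep matching against that earlier data even at its very first byte.

    import zlib

    WINDOW = 32 * 1024  # deflate's history window

    # Hypothetical data: the previous block ends with text the current block repeats.
    previous_block = b"example log line: user logged in from 10.0.0.1\n" * 3000
    current_block  = b"example log line: user logged in from 10.0.0.1\n" * 10

    def deflate(block, zdict=b""):
        if zdict:
            co = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=zdict)  # raw deflate, preset dict
        else:
            co = zlib.compressobj(9, zlib.DEFLATED, -15)
        return co.compress(block) + co.flush()

    print("without preset dictionary:", len(deflate(current_block)))
    print("with preset dictionary:   ", len(deflate(current_block, previous_block[-WINDOW:])))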




I'll check it out. However, if you have to wait for the previous block before compressing the following block, I don't see how you can parallelize it completely. My assumption is that you would have to shard the file at a higher level and still compress those shards independently. That should get close to the same results as a sequential compressor, but for small files both this effect and Amdahl's law would start to rear their heads. I suppose you could get around that by not parallelizing compression below some minimum size.
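A rough Amdahl's-law back-of-envelope (illustrative fractions, not measurements) shows that small-file effect:

    # speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction
    def speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    # Large file: block compression dominates (say p = 0.95).
    # Tiny file: setup and I/O dominate (say p = 0.30), so extra cores barely help.
    for p in (0.95, 0.30):
        print(p, [round(speedup(p, n), 2) for n in (2, 4, 8)])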


> However, if you have to wait for the previous block before compressing the following block, I don't see how you can parallelize it completely.

That is not what the man page says. The block size for any given session is fixed, so you know the boundaries of each block prior to compression.

Each block gets a copy of the last 32 kbytes of data from the end of the prior block.

The algorithm used by gzip compresses by finding repeated strings within the last-seen 32 kbyte window of uncompressed data, so there is no compression dependency between blocks: the current block only needs a copy of the prior block's last 32k of uncompressed data, not its compressed output.


So the lossage there is that you don't get to compress and generate that dictionary at the same time. Interesting.


There's no "generating" of a dictionary. The gzip algorithm's "dictionary" is simply the last-seen 32k bytes of uncompressed data at the point in the file where the compressor is presently working (technically, in compression circles, it is a 'windowed' algorithm, not a 'dictionary' algorithm). It compresses by finding a repeated string in that 32k window that matches a string at the current location, and outputting an instruction that says (in effect): "seek backwards in the uncompressed data 200 bytes, then copy 100 bytes from there to here".
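As a toy illustration of that copy instruction (not real DEFLATE bit-level coding, just the back-reference idea):

    # "Seek backwards `distance` bytes, then copy `length` bytes from there to here."
    def copy_match(out, distance, length):
        # Copy byte-by-byte so overlapping matches (distance < length) also work,
        # which is how long runs like "aaaa..." get encoded.
        for _ in range(length):
            out.append(out[-distance])

    buf = bytearray(b"a repetitive string, ")
    copy_match(buf, distance=19, length=18)  # re-emits "repetitive string," from 19 bytes back
    print(buf.decode())                      # -> "a repetitive string, repetitive string,"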

So as long as each parallel block has the final 32k of the prior block prepended, the output from the compression algorithm will be identical between a straight sequential compression and a pigz parallel compression, because at byte 0 of the current block the 32k window of uncompressed prior data is available to match against, just as if the compressor were running sequentially.
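A rough single-process sketch of that scheme using Python's zlib (raw deflate only, no gzip header/trailer, and not pigz's actual code; block and data sizes are made up): each block is compressed from scratch, seeded with the previous block's last 32k of raw data, and the pieces still concatenate into one stream a normal sequential decompressor can read.

    import zlib

    CHUNK  = 128 * 1024   # pigz's default block size
    WINDOW = 32 * 1024    # deflate's history window

    data = b"some moderately repetitive sample text for the demo\n" * 100000
    blocks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

    parts = []
    for i, block in enumerate(blocks):
        if i == 0:
            co = zlib.compressobj(9, zlib.DEFLATED, -15)
        else:
            # Seed with the last 32k of the *uncompressed* prior block, so matching
            # at byte 0 of this block works as if we had arrived here sequentially.
            # Only the prior block's raw bytes are needed, not its compressed output,
            # which is why pigz can hand each block to a separate thread.
            co = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=blocks[i - 1][-WINDOW:])
        out = co.compress(block)
        # Non-final blocks end on a byte boundary without the final-block bit set,
        # so the independently produced pieces splice into one valid deflate stream.
        out += co.flush(zlib.Z_FULL_FLUSH if i < len(blocks) - 1 else zlib.Z_FINISH)
        parts.append(out)

    stream = b"".join(parts)
    # A plain sequential decompressor reads it back; its own 32k history window
    # plays the role of the preset dictionaries, so none are needed here.
    assert zlib.decompressobj(-15).decompress(stream) == data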

The only growth from pigz comes from needing to round each parallel compressed block up to a multiple of 8 bits (the Huffman codes that are output are bitstrings that don't align to 8-bit byte boundaries). But worst case that is 7 bits per parallel block. Given the performance gains on multi-CPU systems, a net increase of a few hundred bytes is not likely to matter. If those few hundred bytes did matter, then one should use bzip2 or lzip or xz and get much higher compression ratios (at the expense of much longer compression time).
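To put numbers on that (a hypothetical 100 MB input, pigz's default 128k block size, and the 7-bits-per-block worst case described above):

    file_size  = 100 * 1024 * 1024            # hypothetical 100 MB input
    block_size = 128 * 1024                   # pigz default block size
    blocks = -(-file_size // block_size)      # ceiling division: 800 blocks
    print(blocks, "blocks ->", blocks * 7 // 8, "bytes of padding, worst case")  # ~700 bytes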



