I learned about pigz in the appendix of O'Reilly's High Performance MySQL book. I used it, along with other techniques from there, to improve our MySQL backup/restore time by 7x, which in turn won first place at a company hackathon.
I love it! I just wish somebody would make a GPU-based compression library (it doesn't have to be gzip or bzip2); every mobile device today ships with a GPU. There are some techniques out there, like this one (http://on-demand.gputechconf.com/gtc/2014/presentations/S445...), but I am still waiting for a solid implementation.
Used extensively while a system engineer at a hosting company; 10/10, would use again. Excellent utility if you have the cpu cycles to spare and need to cut time; gzip is almost always your bottleneck. Didn't seem to quite scale linearly, but what does.
pigz is extremely fast and very capable, plus it's packaged for almost every mainstream Linux distribution and can act as a drop-in replacement for gzip, since it supports the same flag syntax.
pbzip2 only does parallel compression. lbzip2 can parallelize compression and decompression.
With both pbzip2 and lbzip2 I got errors every now and then while everything worked with bzip2. YMMV.
Also note that bzip2 is block-based (due to the BWT) and thus parallelizing it does not compromise compression ratios the way parallelized gzip implementations do.
At 32 cores you can saturate Gbit links (even with good compression ratios!). I hope we will see some more bzip2 love in the future due to its perfect fit for parallelism.
pbzip2 produces a separate bzip2 stream for every compressed block, which can't be read properly by many popular decoders, usually ones that are based around the bzip2 high-level interface [0]. There's a warning about this in Python 2's documentation, which is a better offering than most [1].
Many decoders simply stop reading after decompressing the first 900k without so much as a peep that something that may astonish you has happened.
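For what it's worth, if you're stuck with a decoder that only exposes a single-stream interface, you can loop over the streams yourself with the low-level decompressor. A rough Python sketch (the function name is just for illustration):

    import bz2

    def decompress_multistream(path):
        # pbzip2 writes one bzip2 stream per block; a naive decoder stops
        # silently after the first stream, so keep starting new decompressors
        # on the leftover bytes until the input is exhausted.
        out = bytearray()
        with open(path, "rb") as f:
            data = f.read()
        while data:
            d = bz2.BZ2Decompressor()
            out += d.decompress(data)
            data = d.unused_data  # bytes belonging to the next stream, if any
        return bytes(out)

(Python 3's bz2.open handles multi-stream files for you; the point is just that a decoder has to expect more data after the first end-of-stream.)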
There is also pixz for the xz format, which can be a godsend since xz is such a slow compressor. Just make sure you have the CPU to spare, since it's a hog.
This feature is present in Solaris 11.3 and later. Actually, it is in some later patches to 11.2 as well. I added it to speed up suspend and resume of kernel zones.
My guess is that this compresses less efficiently, since you would have to shard the dictionaries. It might be close, though, for large files. I was surprised that there were no speed or efficiency comparisons in the README.
The max window size for zlib is 32 KB, so I don't think the default sharding at 128 KB would change much. You can pass the -b parameter if you find that a bigger block size works better on your data.
If you are looking for details of the design of pigz, there is a very well-documented overview in the source of pigz.c:
I tested it on 680 MB of text. gzip compresses it to 246.0 MB; pigz compresses it to 245.5 MB. I see a similar percent change on a 3.8 MB text file. So they are approximately equivalent.
First, thanks for the numbers, it's useful to see real world examples.
Second, and this isn't meant to be a critique (I'm just trying to understand phenomena I see), is there a reason you prefer presenting it as a percentage decrease? Every time I read "X% decrease" I feel obliged to read the source numbers because I'm never sure if the person is using the terminology correctly or not (you are), since so often people mess that up. For myself, I generally use "X ran in Y% of the time Z took." specifically because I don't want people to misinterpret. Is the "X% decrease" presentation preferred/taught, or considered standard? Am I alone in feeling it's more likely to be misinterpreted?
(Sorry your comment is the one I brought this up on, I've just been wondering this for a while.)
In my experience, it is more typical to use percent change or relative change in the physical sciences, and this is how I was taught. Just to be clear, if you have values t1 and t2, the relative change is (t1 - t2)/t1. There is a one-to-one correspondence with what you described, which is t2/t1.
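To make that concrete with toy numbers: if t1 = 100 s and t2 = 80 s, the relative change is (100 - 80)/100 = 20% (a 20% decrease), while t2/t1 = 80%, i.e. the new run takes 80% of the old time.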
I think teej explained it well. If I say "the new value is +20% or -20%", it is immediately obvious those have the same magnitude and opposite direction. But, for some people, when I say "the new value is 120% or 80% of the old value", it is not immediately obvious that they have the same magnitude. It requires a small extra step for the reader to realize that this means the same amount of relative change.
I always state it as a relative change. I find that people can get really confused if you say "X ran in 110% of the time of Y," even though that is stated in a clear way.
My preferred way of communicating this concept is "we observed a +10% change in X compared to Y." I always use a +/- sign and this helps signal that I'm talking about a relative change.
If I am comparing percents, I'll always specify "relative" or "absolute" change, though I prefer to use relative change. Occasionally, if the change is small, I will use basis points instead of percents to communicate absolute change.
The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread.
I'll check it out. However, if you have to wait for the previous block to compress the following block, I don't see how you can parallelize it completely. My assumption is that you would have to shard the file at a higher level and still compress those shards independently. That should get close to the same results as using a sequential compressor, but for small file sizes both this effect and Amdahl's law would start to rear their heads. I suppose you could get around that by automatically not parallelizing compression below some minimum size.
> However, if you have to wait for the previous block to compress the following block, I don't see how you can parallelize it completely.
What you describe above is not what the man page says. The block size for any given session is fixed, so you know the boundaries of each block prior to compression.
Each block gets a copy of the last 32 KB of data from the end of the prior block.
The algorithm used by gzip compresses by finding repeated strings in the last-seen 32 KB window of uncompressed data, so there is no compression dependency between blocks once a copy of the last 32 KB of uncompressed data from the prior block is made available to the current block.
There's no "generating" of a dictionary. The gzip algorithm is based upon the "dictionary" being the last seen 32k bytes of uncompressed data based upon where in the file the compressor is presently working (technically in compression circles it is a 'windowed' algorithm, not a 'dictionary' algorithm). It compresses by finding a repetitive string in that 32k window that matches a string at the current location, and outputting an instruction that says (in effect): "seek backwards in the uncompressed data 200 bytes, then copy 100 bytes from there to here".
So as long as each parallel block has the final 32 KB of the prior block prepended, the output from the compression algorithm will be essentially identical between a straight sequential compression and a pigz parallel compression, because at byte 0 of the current block the 32 KB window of uncompressed prior-block data is available to match against, just as if it were running sequentially.
The only growth from pigz comes from needing to round each parallel compressed block up to a multiple of 8 bits (the Huffman codes that are output are bitstrings that don't align with 8-bit byte boundaries). But worst case that is 7 bits per parallel block. Given the performance gains on multiple-CPU systems, a few hundred bytes of net increase is not likely to matter. If those few hundred bytes did matter, then one should use bzip2 or lzip or xz and get much higher compression ratios (at the expense of much longer time).
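If you want to see the preset-dictionary idea in action, here is a rough Python sketch (not pigz's actual code or framing, just zlib's preset-dictionary mechanism): it compresses data in independent blocks and optionally seeds each block with the previous 32 KB, so you can compare the total size with and without the dictionary on your own data:

    import zlib

    def blockwise_size(data, block_size=128 * 1024, use_dict=True):
        # Compress `data` in independent blocks; optionally seed each block's
        # compressor with the previous 32 KB as a preset dictionary, which is
        # the same idea pigz uses to keep the ratio close to a single stream.
        # Returns the total compressed size only -- the per-block zlib streams
        # are not stitched into a real gzip file here.
        total = 0
        tail = b""  # last 32 KB of uncompressed input seen so far
        for off in range(0, len(data), block_size):
            block = data[off:off + block_size]
            if use_dict and tail:
                comp = zlib.compressobj(6, zlib.DEFLATED, zlib.MAX_WBITS,
                                        8, zlib.Z_DEFAULT_STRATEGY, tail)
            else:
                comp = zlib.compressobj(6)
            total += len(comp.compress(block)) + len(comp.flush())
            tail = (tail + block)[-32 * 1024:]
        return total

On compressible text, the use_dict=False variant should come out somewhat larger, which is the ratio hit you'd pay for sharding without a dictionary.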
The first stable version of xz with threading was 5.2.0, released in December 2014. A lot of people are running older packages, or were until recently, and habitually use pixz instead.
Current versions of Debian and Ubuntu ship 5.1.1 fwiw, so it's still quite a large number of people. In fact 5.2.0 doesn't even seem to be in Debian unstable yet. Not 100% sure why, but browsing through the wishlist bug open for years [1], it seems to be due to an unfortunately common reason: even widely used open-source software sometimes has surprisingly few maintainers, in this case seemingly one person, who was maintaining the xz package for years but got busy with other things, and nobody else has picked it up. The good-ish news is that the 5.1.x version now seems to be growing old enough that it's starting to block other packages' upgrades (because their upstream now assumes a newer version), which will probably cause enough other Debian maintainers to notice for the situation to be sorted out. Though I did of course use passive voice in that last sentence, as I wait for "the situation to be sorted out".
Upstream xz supports it too, but not until a relatively recent release. If you're interested, I have a PPA for Ubuntu Trusty which includes up-to-date xz-utils + dpkg-deb set up to use it. This is a lifesaver if you deploy large bundles of software in the deb format:
The main purpose of this "pixz" appears to be its chunking of the compressed data so that it is partially decompressible (i.e. random access). "xz" already has -T/--threads= for multithreaded processing (although it does seem like pixz has a different default: "all cores" instead of "1 thread").
The --index option that I added (see another comment) to pigz allows random access, so that decompression could be multi-threaded.
The workload I was interested in (compressing virtual machine memory during suspend) tends to be I/O bound. pigz struck a good balance so that CPUs and disks both stayed busy on an important subset of the machines of interest. This balanced saturation seemed to be optimal for minimizing the suspend time. If the suspend image was on fast storage (that is, storage that could saturate a 10 gig link), multi-threading decompression for resume made a big difference.
I tried pixz and, I think, pbzip2, and found that while the compression ratio was better than what pigz offers, the delta in CPU-bound run time during compression was unacceptable.
The scales may tip in favor of pixz when compression speed is less important than (e.g.) minimizing the amount of bandwidth required for thousands of downloads of the compressed content.
xz in multithreaded mode supports random access too, at least theoretically. But there's no reasonable way with xz to actually find the file in a tarball you want to access, it's that bit that pixz provides.
Another nice thing about pixz is it does parallel decompression, as well as compression.
I was thinking about that "no reasonable way" comment. When you uncompress the first block, you will find the first tar header. From that you can know the uncompressed offset of the next tar header. If the compressed stream does support random access, you should be able to uncompress a block (assuming the uncompressed block size is a multiple of 512 bytes) to get to the next tar header. You can repeat this until you get to the file you are looking for.
With large files, this approach would be of huge value. If the files tend to be no larger than block_size - 512, there will be no speedup.
Of course, this would need to be implemented directly in tar, not by piping the output of a decompression command through tar.
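To sketch just the header-walking part in Python (the names are made up; the read_at callback stands in for "decompress the block(s) covering this byte range", which a random-access format would provide):

    def tar_member_offsets(read_at):
        # Walk tar headers to locate each member without reading its contents.
        # `read_at(offset, length)` should return `length` uncompressed bytes
        # starting at `offset`.
        offset = 0
        while True:
            header = read_at(offset, 512)
            if len(header) < 512 or header == b"\x00" * 512:
                return  # end-of-archive marker (or truncated input)
            name = header[0:100].rstrip(b"\x00").decode("utf-8", "replace")
            size = int(header[124:136].rstrip(b" \x00") or b"0", 8)  # octal field
            yield name, offset + 512, size
            # member data is padded to a multiple of 512; next header follows it
            offset += 512 + ((size + 511) // 512) * 512

    # Sanity check against a plain uncompressed tar:
    #   with open("archive.tar", "rb") as f:
    #       read_at = lambda off, n: (f.seek(off), f.read(n))[1]
    #       for name, data_off, size in tar_member_offsets(read_at):
    #           print(name, data_off, size)

(This ignores GNU long-name entries and other extensions, but it shows that only one 512-byte header per member needs to be decompressed to keep seeking forward.)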
Why doesn't parallelized gzip get more attention? For larger files, it can be quite a pain to see a multicore machine sit mostly idle. Something like this works pretty well, but I've never seen it done outside of my own stuff:
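(A rough Python sketch of the kind of thing I mean; the chunk size and names are just illustrative. The trick is that gunzip treats concatenated gzip members as a single stream, so each chunk can be compressed independently and the outputs simply written back-to-back.)

    import gzip
    from concurrent.futures import ProcessPoolExecutor

    CHUNK = 32 * 1024 * 1024  # 32 MB per independently compressed gzip member

    def _chunks(path):
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    return
                yield block

    def parallel_gzip(src, dst, workers=None):
        # Compress `src` as a series of concatenated gzip members, one per
        # chunk. The result is a valid .gz file; the ratio is slightly worse
        # than a single stream because the dictionary resets at every chunk.
        with ProcessPoolExecutor(max_workers=workers) as pool, open(dst, "wb") as out:
            # note: Executor.map pulls the whole input up front, so this
            # simple version is not memory-friendly for really huge files
            for member in pool.map(gzip.compress, _chunks(src)):
                out.write(member)

    # parallel_gzip("big.log", "big.log.gz")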
I was doing that too, until I tried lbzip2 (http://lbzip2.org/). It does basically the same thing and is also a drop-in replacement, but is noticeably faster!
If you know that what you compress is alphabetised text, and space is more important to you than time, then please use lzip over gzip. For a parallel version, there is plzip:
Is this gunzip-compatible? If not, then as long as gzip remains the only viable/compatible compression mode for HTTP, I think advances in gzip compression are still valuable.
I don't think the comment you replied to suggested that gzip is not valuable, only that there's specialized compression available for a special case (alphabetized data).
It's amazing to see what somebody like Mark Adler can achieve when they focus on a specific niche. You often hear about the importance of specialization when positioning a business, but it isn't mentioned as much for individuals. His work serves as an ideal case of specialization: if I wanted to ask an expert for advice on compression, or was looking for an outside expert on compression, he is immediately who I'd think of.
I use this extensively in schlepping data around for analytics work. Any data that moves over the network going into or out of a data store gets compressed with pigz.
http://lbzip2.org/ also exists for bzip2. It works really well, especially since bz2 files are split into discrete blocks which can be compressed and decompressed independently.
We use this in production at Graphistry, super useful for latency-sensitive dynamic media applications! (In this case, we needed to add an auto-tuner + node bindings.)
The big challenge seems to be parallel gunzip, as that requires a special stream with an index to work; no general-purpose solution is available so far.
Looks like that has 8 cores? I would imagine it would make a huge difference depending on other loads on the system. I gave it a try on my file server at home that has an AMD E-350 processor with 2 cores and it shaved off a good 42% of the total time:
> time gzip -v xubuntu-16.04-desktop-amd64.iso
xubuntu-16.04-desktop-amd64.iso: 1.5% -- replaced with xubuntu-16.04-desktop-amd64.iso.gz
gzip -v xubuntu-16.04-desktop-amd64.iso 119.07s user 4.66s system 98% cpu 2:05.22 total
> time pigz -v xubuntu-16.04-desktop-amd64.iso
xubuntu-16.04-desktop-amd64.iso to xubuntu-16.04-desktop-amd64.iso.gz
pigz -v xubuntu-16.04-desktop-amd64.iso 128.19s user 6.64s system 184% cpu 1:12.97 total