A parallel implementation of gzip for modern multi-processor multi-core machines (github.com/madler)
280 points by odeke-em on Aug 20, 2016 | 67 comments



I learned about pigz in the appendix of the High Performance MySQL O'Reilly book. I used it, along with other techniques from there, to improve our MySQL backup/restore time by 7x. This in turn won first place at a company hackathon.
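Roughly the kind of pipeline involved, as a sketch (database name and options here are placeholders, not the book's exact recipe):

  mysqldump --single-transaction mydb | pigz > mydb.sql.gz    # backup
  pigz -dc mydb.sql.gz | mysql mydb                           # restore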


Have you compared with mydumper / myloader?


I love it! I just wish somebody would make a GPU-based compression library (it doesn't have to be gzip or bzip2). Every mobile device today ships with a GPU, and there are some techniques out there like this (http://on-demand.gputechconf.com/gtc/2014/presentations/S445...), but I am still waiting for a solid implementation.


You may find work from my colleagues interesting:

Parallel Lossless Data Compression on the GPU: http://www.idav.ucdavis.edu/publications/print_pub?pub_id=10...

Fast Parallel Suffix Array on the GPU: http://escholarship.org/uc/item/83r7w305

I think a fast compression library could be built on top of the second (suffix array) paper. We have one implementation in our library CUDPP 2.0; you can try it out if you like: https://github.com/cudpp/cudpp/blob/master/src/cudpp/app/sa_...


Doesn't transferring all the data to/from the GPU kill any potential speedup the GPU may offer?


Pied Piper already made it.


Used extensively while a system engineer at a hosting company; 10/10, would use again. Excellent utility if you have the cpu cycles to spare and need to cut time; gzip is almost always your bottleneck. Didn't seem to quite scale linearly, but what does.

--rsyncable support too.


pigz is extremely fast and very capable, plus it's packaged for almost every mainstream Linux distribution, and it can act as a drop-in replacement for gzip since it supports the same flag syntax.
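For example, the usual tar pipeline just works if you swap in pigz (directory name is a placeholder):

  tar -cf - somedir | pigz > somedir.tar.gz
  pigz -dc somedir.tar.gz | tar -xf -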

On the bzip2 side there is pbzip2, which is also a drop-in replacement: http://compression.ca/pbzip2/


pbzip2 only does parallel compression. lbzip2 can parallelize compression and decompression.
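For example (lbzip2 takes the same flags as bzip2; the file name is just a placeholder):

  lbzip2 -k big.log        # parallel compression, keep the original
  lbzip2 -dk big.log.bz2   # parallel decompression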

With both pbzip2 and lbzip2 I got errors every now and then while everything worked with bzip2. YMMV.

Also note that bzip2 is block-based (due to the BWT) and thus parallelizing it does not compromise compression ratios the way parallelized gzip implementations can.

At 32 cores you can saturate Gbit links (even with good compression ratios!). I hope we will see some more bzip2 love in the future due to its perfect fit for parallelism.


Errors sound deeply concerning. What kind of errors? Is it too risky to use [pl]bzip2 for archival purposes?


Well, errors at a rate of one every few terabytes, mostly during decompression, and they always caused a program abort IIRC.

But that was a few years ago, so my best advice would be: retest.

You are testing your archives anyway, right? ;-)
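For anyone wondering how: the stock tools (and their parallel drop-ins) have a test flag. File names here are placeholders:

  gzip -t backup.tar.gz && echo OK     # pigz accepts -t as well
  bzip2 -t backup.tar.bz2 && echo OK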


>You are testing your archives anyway, right? ;-)

Umm no, I handle errors which I expect to be reported to me.


pbzip2 produces a separate bzip2 stream for every compressed block, which can't be read properly by many popular decoders, usually ones that are based around the bzip2 high-level interface [0]. There's a warning about this in Python 2's documentation, which is a better offering than most [1].

Many decoders simply stop reading after decompressing the first 900k without so much as a peep that something that may astonish you has happened.

  [0] http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#hl-interface  
  [1] https://docs.python.org/2/library/bz2.html#bz2.BZ2File


There is also pixz for the xz format, which can be a godsend since xz compression is so slow. Just make sure you have the CPU to spare, since it's a hog.
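From memory (double-check the man page), basic usage is something like:

  pixz < foo.tar > foo.tpxz       # compress
  pixz -d < foo.tpxz > foo.tar    # decompress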


Xz does this natively now, with the -T parameter.
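For example, with xz >= 5.2 (-T0 means use as many threads as there are cores):

  xz -T0 -k bigfile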


There is also pbgzip if you are into indexed access (genome data).


pigz parallelizes compression. I've made changes that make it so that it can parallelize uncompression as well.

https://github.com/mgerdts/pigz

This feature is present in Solaris 11.3 and later. Actually, it is in some later patches to 11.2 as well. I added it to speed up suspend and resume of kernel zones.


My guess is that this compresses less efficiently as you would have to shard the dictionaries. Might be close though for large files. I was surprised that there were no speed or efficiency comparisons in the README.


The max window size for zlib is 32 KB, so I don't think the default sharding at 128 KB would change much. You can pass the -b parameter if you find out a bigger shard works better on your data.
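For example (numbers here are made up, tune them for your data):

  pigz -b 512 -p 8 big.csv    # 512K input blocks, 8 compression threads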

If you are looking for details of the design of pigz, there is a very well-documented overview in the source of pigz.c:

https://github.com/madler/pigz/blob/master/pigz.c#L187


Had a quick scroll through that source code: I didn't know you could implement try/catch in C via macros... mind blown.


This book (C Interfaces and Implementations: Techniques for Creating Reusable Software) has a whole section on it:

https://www.amazon.com/Interfaces-Implementations-Techniques...


I tested it on 680 MB of text: gzip compresses it to 246.0 MB, pigz to 245.5 MB. I see a similar percent change on a 3.8 MB text file, so they are approximately equivalent.


What was the difference in speed between runs?


With 2 cores (4 logical), a 58.9% decrease in time for the 680 MB file a.txt:

  # time pigz -c /tmp/a.txt > /dev/null
  real    0m19.352s
  user    1m16.148s
  sys     0m0.344s

  # time gzip -c /tmp/a.txt > /dev/null
  real    0m47.093s
  user    0m46.940s
  sys     0m0.104s


First, thanks for the numbers, it's useful to see real world examples.

Second, and this isn't meant to be a critique (I'm just trying to understand phenomena I see), is there a reason you prefer presenting it as a percentage decrease? Every time I read "X% decrease" I feel obliged to read the source numbers because I'm never sure if the person is using the terminology correctly or not (you are), since so often people mess that up. For myself, I generally use "X ran in Y% of the time Z took." specifically because I don't want people to misinterpret. Is the "X% decrease" presentation preferred/taught, or considered standard? Am I alone in feeling it's more likely to be misinterpreted?

(Sorry your comment is the one I brought this up on, I've just been wondering this for a while.)


In my experience, it is more typical to use percent change or relative change in the physical sciences, and this is how I was taught. Just to be clear, if you have values t1 and t2, the relative change is (t1 - t2)/t1. There is a one-to-one correspondence with what you described, which is t2/t1.
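Using the gzip/pigz timings posted upthread as a concrete worked example:

  relative change: (47.093 - 19.352) / 47.093 ≈ 0.589, i.e. a 58.9% decrease
  ratio:           19.352 / 47.093 ≈ 0.411, i.e. pigz ran in 41.1% of the time gzip took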

I think teej explained it well. If I say "the new value is +20% or -20%", it is immediately obvious those have the same magnitude and opposite direction. But, for some people, when I say "the new value is 120% or 80% of the old value", it is not immediately obvious that they have the same magnitude. It requires a small extra step for the reader to realize that this means the same amount of relative change.


I always state it as relative change. I find that people can get really confused if you say "X ran in 110% of the time of Y", even though it is stated in a clear way.

My preferred way of communicating this concept is "we observed a +10% change in X compared to Y." I always use a +/- sign and this helps signal that I'm talking about a relative change.

If I am comparing percents, I'll always specify "relative" or "absolute" change though I prefer to use relative change. Occasionally if the change is small, I will use basis points instead of percents to communicate absolute change.


You may be interested in these benchmarks: http://vbtechsupport.com/1614/


Actually, it does not compress less efficiently, but you do not learn this fact from the README. Instead you have to look inside the man page (https://raw.githubusercontent.com/madler/pigz/master/pigz.1):

The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread.


I'll check it out. However, if you have to wait for the previous block to compress the following block, I don't see how you can parallelize it completely. My assumption is that you would have to shard the file at a higher level and still compress those shards independently. That should get close to the same results as using a sequential compressor, but for small file sizes both this effect and Amdahl's law would start to rear their heads. I suppose you could get around that by automatically not parallelizing compression below some minimum size.


> However, if you have to wait for the previous block to compress the following block, I don't see how you can parallelize it completely.

That is not what the man page says. The block size for any given session is fixed, so you know the boundaries of each block prior to compression.

Each block has a copy of the last 32kbytes of data at the end of the prior block.

The algorithm used by gzip compresses by finding repetitive strings in the last seen 32kbyte window of uncompressed data, so there is no compression dependency between blocks, even with a copy of the last 32k of uncompressed data from a prior block being present for the current block.


So the lossage there is that you don't get to compress and generate that dictionary at the same time. Interesting.


There's no "generating" of a dictionary. The gzip algorithm is based upon the "dictionary" being the last seen 32k bytes of uncompressed data based upon where in the file the compressor is presently working (technically in compression circles it is a 'windowed' algorithm, not a 'dictionary' algorithm). It compresses by finding a repetitive string in that 32k window that matches a string at the current location, and outputting an instruction that says (in effect): "seek backwards in the uncompressed data 200 bytes, then copy 100 bytes from there to here".

So as long as each parallel block has the final 32k of the prior block prepended, the output from the compression algorithm will be identical between a straight sequential compression and a pigz parallel compression. Because at byte 0 of the current block, the 32k window of the uncompressed prior block is available to perform matching against, just as if it were running sequentially.

The only growth from pigz comes from needing to round each parallel compressed block up to a multiple of 8 bits (the Huffman codes that are output are bitstrings that don't line up with 8-bit byte boundaries). But worst case that is 7 bits per parallel block. Given the performance gains on multiple CPU systems, a few hundred bytes net increase is not likely to matter. If those few hundred bytes did matter, then one should use bzip2 or lzip or xz and get much higher compression ratios (at the expense of much longer time).
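If those bytes are ever a concern, it's easy to measure the overhead on your own data by comparing compressed byte counts (path is a placeholder):

  gzip -9 -c /path/to/data | wc -c
  pigz -9 -c /path/to/data | wc -c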


And the same for LZMA: https://github.com/vasi/pixz

(it's relatively easy to remember those commands)


Doesn't xz already support threads out of the box, by the --threads flag?


The first stable version of xz with threading was 5.2.0, released in December 2014. A lot of people are running older packages, or were until recently, and habitually use pixz instead.


Current versions of Debian and Ubuntu ship 5.1.1 fwiw, so it's still quite a large number of people. In fact 5.2.0 doesn't even seem to be in Debian unstable yet. Not 100% sure why, but browsing through the wishlist bug open for years [1], it seems to be due to an unfortunately common reason: even widely used open-source software sometimes has surprisingly few maintainers, in this case seemingly one person, who was maintaining the xz package for years but got busy with other things, and nobody else has picked it up. The good-ish news is that the 5.1.x version now seems to be growing old enough that it's starting to block other packages' upgrades (because their upstream now assumes a newer version), which will probably cause enough other Debian maintainers to notice for the situation to be sorted out. Though I did of course use passive voice in that last sentence, as I wait for "the situation to be sorted out".

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=731634


Upstream xz supports it too, but not until a relatively recent release. If you're interested, I have a PPA for Ubuntu Trusty which includes up-to-date xz-utils + dpkg-deb set up to use it. This is a lifesaver if you deploy large bundles of software in the deb format:

https://launchpad.net/~mikepurvis/+archive/ubuntu/dpkg


The main purpose of this "pixz" appears to be its chunking of the compressed data so that it is partially decompressible (i.e. random access). "xz" already has -T/--threads= for multithreaded processing (although it does seem like pixz has a different default value, "all cores" instead of "1 thread").


The --index option that I added to pigz (see another comment) allows random access so that uncompression can be multi-threaded.

The workload I was interested in (compressing virtual machine memory during suspend) tends to be I/O bound. pigz struck a good balance that kept CPUs and disks both busy in an important subset of the machines of interest. This balanced saturation seemed to be optimal for minimizing the suspend time. If the suspend image was on fast storage (that is, storage that could saturate a 10 gig link), multi-threading uncompression for resume made a big difference.

I tried pixz and I think pbzip2, and found that while the compression ratio was better than pigz offers, the delta in CPU-bound run time during compression was unacceptable.

The scales may tip in favor of pixz when compression speed is less important than (e.g.) minimizing the amount of bandwidth required for 1000's of downloads of the compressed content.


xz in multithreaded mode supports random access too, at least theoretically. But there's no reasonable way with xz to actually find the file in a tarball you want to access; it's that bit that pixz provides.

Another nice thing about pixz is it does parallel decompression, as well as compression.

(Disclaimer: I'm the original author of pixz.)


I was thinking about that "no reasonable way" comment. When you uncompress the first block, you will find the first tar header. From that you can know the uncompressed offset of the next tar header. If the compressed stream does support random access, you should be able to uncompress a block (assuming the uncompressed block size is a multiple of 512 bytes) to get to the next tar header. You can repeat this until you get to the file you are looking for.

With large files, this approach would be of huge value. If the files tend to be no larger than block_size - 512, there will be no speedup.

Of course, this would need to be implemented directly in tar, not by piping the output of a decompression command through tar.


Why doesn't parallelized gzip get more attention? For larger files, it can be quite a pain to see a multicore machine sit mostly idle. Something like this works pretty well, but I've never seen it done outside of my own stuff:

  split -l1000000 --filter='gzip > $FILE.gz' <(zcat large_file.txt.gz)
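One nice property of that trick (file names below assume split's default xaa, xab, ... output prefixes): gzip members can simply be concatenated and the result is still a valid gzip stream, so the shards can be glued back together:

  cat x??.gz > large_file_recombined.txt.gz
  zcat large_file_recombined.txt.gz | cksum    # sanity check against the original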


I'm using pbzip2 to solve this problem. http://compression.ca/pbzip2/


I was doing that too, until I tried lbzip2 ( http://lbzip2.org/). Does basically the same, also a drop-in replacement, but is even faster by a noticeable amount!


Previous discussion 6 years ago: https://news.ycombinator.com/item?id=1233317


If you know that what you compress is alphabetised text, and space is more important to you than time, then please use lzip over gzip. For a parallel version:

http://www.nongnu.org/lzip/plzip.html
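Usage mirrors lzip, something like this from memory (file name is a placeholder):

  plzip -9 dump.sql       # parallel compression, produces dump.sql.lz
  plzip -d dump.sql.lz    # decompression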


Is this gunzip-compatible? If not, then as long as gzip remains the only viable/compatible compression mode for HTTP, I think advances in gzip compression are still valuable.


No, it's not. Lzip is more useful for storing text long term, or for sending very large text files as email attachments.

Text log files and plain-text database dumps are usually very large; lzip can reduce their size by a factor of 5 to 10 compared to gzip.


I don't think the comment you replied to suggested that gzip is not valuable, only that there's specialized compression available for a special case (alphabetized data).


It's amazing to see what somebody like Mark Adler can achieve when they focus on a specific niche. You often hear about the importance of specialization when positioning a business, but it isn't mentioned as much for individuals. His work is an ideal case of specialization: if I wanted to ask an expert for advice on compression, or was looking for an outside expert on compression, he is immediately who I'd think of.


You can also use 7-zip for multithreaded gz/xz/bz2/7z/zip compression.


Why not just use GNU parallel?

`parallel gzip ::: file1 file2`

I wish there were a standard for this type of stuff, so that an app would check for existing child spawns and act accordingly (IPC).


That doesn't help if you have only one big file or stream.


cat bigfile | parallel --pipe -k gzip > out.gz

parallel --pipepart -a bigfile -k gzip > out.gz

The last is faster.


I use this extensively in schlepping data around for analytics work. Any data that moves over the network going into or out of a data store gets compressed with pigz.


How does the strategy for parallelization of gzip in this project differ from the strategy used for LZMA2 parallelism as implemented by 7z?


http://lbzip2.org/ also exists for bzip2. It works really well, especially since bz2 files are split into discrete blocks which can be un/compressed independently.


We use this in production at Graphistry, super useful for latency-sensitive dynamic media applications! (In this case, we needed to add an auto-tuner + node bindings.)


The big challenge seems to be parallel gunzip, as that requires a special stream with an index to work; no general-purpose solution is available so far.


Probably useful for servers more than anything, where perf really does matter at scale and the content type is gzip.


Web servers have multiple connections so this isn't so applicable. Maybe special purpose ones zipping large datasets on the fly?


We combine both strategies for content: multicore native encoding + decoding :)


For bzip2 lovers the tool is pbzip2 ;)


Huh. I was living under the incorrect assumption that gzip was inherently single-threaded.


Any idea how this would compare to gzip on a microserver? I'm thinking of Atom C2750 bare metal from Packet.


Looks like that has 8 cores? I would imagine it would make a huge difference depending on other loads on the system. I gave it a try on my file server at home, which has an AMD E-350 processor with 2 cores, and it shaved a good 42% off the total time:

  > time gzip -v xubuntu-16.04-desktop-amd64.iso 
  xubuntu-16.04-desktop-amd64.iso:	  1.5% -- replaced with xubuntu-16.04-desktop-amd64.iso.gz
  gzip -v xubuntu-16.04-desktop-amd64.iso  119.07s user 4.66s system 98% cpu 2:05.22 total

  > time pigz -v xubuntu-16.04-desktop-amd64.iso           
  xubuntu-16.04-desktop-amd64.iso to xubuntu-16.04-desktop-amd64.iso.gz 
  pigz -v xubuntu-16.04-desktop-amd64.iso  128.19s user 6.64s system 184% cpu 1:12.97 total



