Precomp: Further compress files that are already compressed (github.com/schnaader)
59 points by marcopolis on Oct 31, 2020 | 25 comments



Precomp works by brute-forcing the zlib/... parameters until recompression reproduces the input byte-identically; then decompressing; and (optionally) recompressing with a stronger compressor.
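
For intuition, here's a minimal sketch of that brute-force idea in Python, assuming the input is a plain zlib stream; this is just an illustration, not Precomp's actual code:

    import zlib

    def find_zlib_params(original_stream: bytes):
        """Search for zlib parameters that reproduce original_stream exactly.
        Returns (decompressed_data, (level, memlevel)) on success, or None if
        no simple parameter combination recreates the stream byte-for-byte."""
        raw = zlib.decompress(original_stream)      # recover the uncompressed data
        for level in range(1, 10):                  # brute-force compression level...
            for memlevel in range(1, 10):           # ...and memory level
                co = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS, memlevel)
                if co.compress(raw) + co.flush() == original_stream:
                    return raw, (level, memlevel)   # byte-identical reconstruction found
        return None

If such a match is found, an archiver only needs to store the decompressed data (which a strong compressor handles much better) plus the tiny parameter tuple, and can still restore the original stream bit-for-bit later.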

It was closed-source for a long time, and AntiX was an open-source version of it.

One interesting use case is archiving Android ROMs for projects such as LineageOS. Each ROM is a zip file; at the byte level two builds differ almost entirely, so storing a long history takes a lot of space. But after running them through precomp, the differences between two nightly builds are quite small, and they can be packed together into a solid archive for a significant (90%+) space saving.


Author of Precomp here. First of all, thanks for the attention; the sudden rise in the GitHub stats surprised me, but now I know the reason. So, some comments from my side here. I'll answer some of the threads; feel free to ask questions.

The project has been around for quite a long time. I started it in 2006, but as I'm basically working on it alone (though that got better with the change to open source) and don't have much spare time (studying, work, father of two kids), updates are less frequent than I'd wish.

The upside of this long history is that the program itself is quite stable, so e.g. it's hard to find data that leads to a crash or incorrect decompression unless it's specially crafted for that purpose.

The biggest challenge at the moment is the codebase, which is very monolithic (one big precomp.cpp) and mixes newer parts (e.g. the preflate integration) with old C-style code. On the plus side, the code is platform independent (in the branch, it even compiles on ARM) and compiling should be no problem using CMake.

Another thing missing because of the lack of time is documentation. There's some basic information in the README and the program syntax reveals the meaning of most parameters, but there could be much more. A lot of information can be found at the encode.su forum, but of course it's very unstructured and often related to bugs, questions about the program/algorithm, or problems with certain files.

That said, just throw your data at Precomp and see how it performs. Both ratio and duration heavily depend on what data is fed in, but since some of the supported streams like zlib/deflate or JPG are used everywhere, there are many (sometimes surprising) examples like APK packages and Linux distribution images where it outperforms the usual compressors like 7-Zip. And last, but not least, the usual GitHub things apply: feel free to check out the existing issues, create new ones, play with the source code, fork it, and create pull requests.


Shouldn't the name be "postcomp" if it's for files which are already compressed?


Apparently precompressor is a general term [1] that applies here:

> A precompressor is a computer program, which alters file content so that a real lossless compression program will achieve better results than without precompressing. This is possible by creating such patterns that compression programs will recognize.

In this case the altered content is (quoting the README) "decompressed data from the original file together with reconstruction data". So, confusingly enough, its precompressing includes decompressing.

I don't think "precomp" is a really accurate name for the whole tool, but that might be where it came from.

With no flags it will precompress (including decompressing) and compress. With -cn, it will precompress alone. With -r, it will decompress and...I dunno, postcompress (including compressing)?

[1] https://findwords.info/term/precompressor


No, it takes decompressed data and tries to recompress it better than it was, by adding a pre-compressor to the compression process.


The terminology in the readme is strangely inverted. 'Compress' seems to mean decompress and vice versa. This extends to the source code itself. Perhaps a language barrier?


> Perhaps a language barrier?

The two words in German are Komprimieren (compress) vs Dekomprimieren (decompress). Not much potential to mix up here.


Yeah, the name was better for the initial versions of Precomp that basically only did the decompression part (like using "-cn" now) and prepared the data before further compressing it with other tools. It also has the downside of being a common term (abbreviation of precomposition) in image/video editing.

Also, for other stream types like JPG and MP3, even "-cn" directly compresses the data without being able to see the intermediate decompressed representation, so in these cases Precomp (re)compresses.

But well, it's just a name and I decided to stick with it; changing it several times would've led to confusion, too.


Recomp


Haven't tried this, but another related tool that works great for re-compressing existing files is AdvanceCOMP [0]. It can losslessly optimize zip, png and mng files in place, which is particularly handy for zip files.

I used this on a project to drastically reduce the size of generated PowerPoint decks, in order to stay below the email attachment limits for our distro list recipients. Very handy!

0: http://www.advancemame.it/comp-readme


That kind of tool is useful too, but for different purposes. Precomp lets you recover the original compressed data stream byte-for-byte, not just a differently compressed version of the same uncompressed data.


I could not find any stats or compression ratios in there. Also, the backward compatibility disclaimer makes me worried. If I use v1 to precompress a file, then I will need to keep the v1 binary along with the precompressed file or I risk losing the data.


Some mixed results for older versions (0.4 and 0.4.3) can be found here: http://schnaader.info/precomp_results.php

But both ratio and time heavily depend on whether there is data in the file that Precomp can process. Since deflate/zlib/zip/gzip is used in other filetypes and containers (.docx/.apk/.jar/...) and stuff like game resource files, results can be surprisingly good, especially compared to traditional compressors that can't handle the compressed data at all. On the other hand, newer compression methods (.xz/.7z/.rar/brotli) aren't supported (yet), so Precomp can't do much about them.

Not being backward compatible is more of a protection during the alpha phase and to avoid confusion; the error Precomp gives when trying to process files created with older versions could be changed to a warning. In most cases, the version check could be patched out and the data of older .pcf files would be restored correctly with newer versions.


Thanks for the clarifications!


Curious, where did you read the backwards compatibility disclaimer? I couldn't find it, but did find this note:

> The different versions are completely compatible, PCF files are exchangeable between Windows/Linux/*nix/macOS systems.


At the “About” link http://schnaader.info/precomp.php: “Precomp is not backwards compatible. If you want to recompress some PCF file made with a different version of Precomp, you'll have to use the binaries of the older versions”.


So this is essentially an optimizer (eg no bogus recursive compression claims). It decompressed specific regions of certain file formats with better algorithms.

A hypothetical (simpler, less powerful) equivalent would be a program that reads gzip files compressed at a low level and recompresses them with gzip set to the highest level.
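
A minimal sketch of that hypothetical tool in Python (file paths are placeholders); note that, unlike Precomp, it replaces the original compressed bytes instead of preserving them:

    import gzip

    def recompress_gzip(path_in: str, path_out: str) -> None:
        """Decompress a gzip file and re-emit it at the highest compression level."""
        with gzip.open(path_in, "rb") as f:
            raw = f.read()                              # original uncompressed data
        with gzip.open(path_out, "wb", compresslevel=9) as f:
            f.write(raw)                                # same data, stronger setting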


s/decompressed/recompressed/g


Fixed, thanks!



Is the novelty that it only works on certain filetypes, so is optimized for those?


Sort of. If I'm understanding correctly:

The filetypes it's designed for are ones that specify the "deflate" compression algorithm: they can be thought of as headers+deflate(raw). It applies the better (higher-compression-ratio) lzma algorithm to the raw data. That's better than applying it to the deflate form, which is what a compressor given the .zip file would have to do. I.e., zip might end up with lzma(deflate(raw)) where this tries lzma(raw).
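
A toy illustration of that ordering difference in Python (hypothetical file name; exact numbers depend on the data, but lzma(raw) usually wins):

    import lzma
    import zlib

    raw = open("some_file.bin", "rb").read()     # hypothetical uncompressed payload
    deflated = zlib.compress(raw, 6)             # roughly what a .zip would store

    size_outer = len(lzma.compress(deflated))    # lzma(deflate(raw)): compressing the .zip as-is
    size_inner = len(lzma.compress(raw))         # lzma(raw): what precompression enables

    print(size_outer, size_inner)                # the second is usually noticeably smaller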

So the novel things are:

* It understands these specific file formats well enough to pick out the deflate part that it should uncompress (when it applies its own compression) and recompress (when it applies its decompression).

* I think it promises to exactly reproduce the original deflate(raw) form. You can't necessarily do that from lzma(raw) alone. I'm not an expert, but from a Wikipedia article skim, apparently deflate works with a combination of dictionary encoding and Huffman coding. This program probably stores the original tables for those (with the value sides as ranges of the original file rather than repeating them), and maybe some other details to handle this. Like, if the originally compressing program called flush more frequently, it produces a different stream, and this needs to be able to reconstruct that. Maybe it stores something like a list of deflate operations rather than either the deflate-compressed result or the raw data. This strikes me as kind of a tricky problem, not in any theoretical computer science sense, but in a practical getting-all-the-details-right software engineering sense.
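
For intuition only, here's a toy Python model of that "exact reproduction" contract: either find parameters that recreate the zlib stream exactly, or bail out and keep the stream verbatim (real Precomp/preflate track much finer-grained reconstruction data, e.g. flush points and window sizes):

    import zlib

    def precompress_stream(stream: bytes):
        """Toy precompression: raw data plus tiny reconstruction info if an
        exact parameter match exists, otherwise the stream untouched."""
        raw = zlib.decompress(stream)
        for level in range(1, 10):
            co = zlib.compressobj(level)
            if co.compress(raw) + co.flush() == stream:
                return ("zlib-level", level, raw)    # reconstruction info + raw data
        return ("verbatim", None, stream)            # e.g. 7zip/kzip output: give up

    def restore_stream(tag, level, payload):
        if tag == "zlib-level":
            co = zlib.compressobj(level)
            return co.compress(payload) + co.flush() # byte-identical by construction
        return payload                               # stream was stored untouched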


If that's the case, it sounds like Dropbox's Lepton [1], which recompresses JPEGs but can be losslessly decompressed back to the original JPEG file.

[1] https://dropbox.tech/infrastructure/lepton-image-compression...


To match the original zlib/gzip stream, it needs to match the exact hash table and parsing algorithm used by the exact zlib implementation used originally, to choose the same matches. It would not do to use a stronger deflate compressor that produces optimal parsing to generate smaller deflate streams but then not be able to match the original file. The hash of the compressed data and even the length of the compressed data might be important to the application using the file.


Looks like it uses the preflate library [1] which back-references precomp in its README:

> "precomp" is a tool which used to be able to do the bit-correct reconstruction very efficiently, but only for deflate streams that were created by the ZLIB library. (It only needed to store the three relevant ZLIB parameters to allow reconstruction.) It bailed out for anything created by 7zip, kzip, or similar tools. Of course, "precomp" also handles JPG, PNG, ZIP, GIF, PDF, MP3, etc, which makes a very nice tool, and it is open source. UPDATE The latest development version of "precomp" incorporates the "preflate" library.

So precomp used to work as you describe: it depended on having exactly the same zlib algorithm as the one that produced the original data. Now it uses preflate, which has a hybrid approach:

> for deflate streams created by ZLIB at any compression level, we want to be able to reconstruct it with only a few bytes of reconstruction information (like "precomp")

> for all other deflate streams, we want to be able to reconstruct them with reconstruction information that is not much larger than "reflate"

> "reflate" can reconstruct any deflate stream (also 7zip, kzip, etc), but it is only close to perfect for streams that were created by ZLIB, compression level 9. All lower compression levels of ZLIB require increasing reconstruction info, the further from level 9, the bigger the required reconstruction data. "reflate" only handles deflate streams, and is not open source. As far as I know, it is also part of PowerArchiver.

I'd be curious to learn a little more about the "reconstruction information" but I don't see a write-up. So far I'm not quite curious enough to dig through preflate's source code.

[1] https://github.com/deus-libri/preflate



