Would you recommend it as a learning exercise?

andrewf · on Feb 24, 2015

Maybe. There are lots of finicky details, like inconsistent bit ordering and arbitrary lookup tables. I think it's a long week of work at least, probably more.

On the other hand, it is probably the simplest compression format in widespread use that uses variable-length encoding, and there are a few implementations around for you to look at. Things like bzip2 and LZMA2 really don't have independent specifications and multiple implementations, and are more complicated to boot.

Byte-oriented Lempel-Ziv formats like LZ4, LZO and LZJB would be a good way to dip your toes into a production compression format. The LZW encoding used by GIF files is also "simpler" in principle but I personally find it harder to wrap my head around.
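
To show why byte-oriented formats are gentler, here is a minimal sketch of a decoder for a made-up token format. The layout (tag byte, length byte, two-byte little-endian offset) and the name `toy_lz_decode` are assumptions for illustration only; this is not LZ4, LZO or LZJB, and there is no handling of truncated input.

```python
def toy_lz_decode(data: bytes) -> bytes:
    """Decode a hypothetical byte-oriented LZ stream.

    Token layout (invented for this example):
      0x00 n <n raw bytes>      literal run of n bytes
      0x01 n off_lo off_hi      copy n bytes starting `off` bytes back
    """
    out = bytearray()
    i = 0
    while i < len(data):
        tag, n = data[i], data[i + 1]
        i += 2
        if tag == 0x00:
            # Literal run: copy the next n bytes straight to the output.
            out += data[i:i + n]
            i += n
        elif tag == 0x01:
            # Match: 16-bit little-endian distance back into the output.
            off = data[i] | (data[i + 1] << 8)
            i += 2
            if off == 0 or off > len(out):
                raise ValueError("bad match offset")
            # Copy one byte at a time so overlapping matches (off < n)
            # repeat the most recent bytes, as LZ77-style formats expect.
            for _ in range(n):
                out.append(out[-off])
        else:
            raise ValueError("unknown tag")
    return bytes(out)


# A 3-byte literal followed by a 6-byte match at distance 3.
stream = bytes([0x00, 3]) + b"abc" + bytes([0x01, 6, 3, 0])
print(toy_lz_decode(stream))
```

Everything sits on byte boundaries, so there is no bit reader and no code tables to build, which is most of what makes these formats a comfortable first step.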

If you want to experiment with binary formats, it might also be interesting to write a decoder for a simple image format, like PCX files, or even for old game file formats like Doom's WAD [http://doom.wikia.com/wiki/WAD].
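
For a sense of scale, the run-length layer of PCX fits in a few lines. This is a rough sketch covering only that layer: a byte with its top two bits set is a run count (low six bits) followed by the byte to repeat, anything else is a literal. The function name and the `expected_len` parameter are my own choices, and parsing the 128-byte header, planes and palette is left out.

```python
def pcx_rle_decode(data: bytes, expected_len: int) -> bytes:
    """Expand PCX run-length encoded bytes until expected_len bytes exist."""
    out = bytearray()
    i = 0
    while len(out) < expected_len and i < len(data):
        b = data[i]
        i += 1
        if (b & 0xC0) == 0xC0:
            # Top two bits set: low six bits are the repeat count,
            # and the next byte is the value to repeat.
            count = b & 0x3F
            if i >= len(data):
                raise ValueError("truncated run")
            out += bytes([data[i]]) * count
            i += 1
        else:
            # Any other byte stands for itself.
            out.append(b)
    return bytes(out[:expected_len])


# 0xC3 0x41 is "A" three times; 0xC1 0xC5 is how a literal 0xC5 must be stored.
print(pcx_rle_decode(bytes([0xC3, 0x41, 0x42, 0xC1, 0xC5]), 5))
```

Note the quirk that literal values of 0xC0 and above have to be written as a run of length one, which is the kind of detail you only really notice by writing the decoder.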

unwind · on Feb 24, 2015

LZJB is fantastic. Here's my take on a Python implementation of both compression and decompression: https://github.com/unwind/python-lzjb. It's ~150 lines for both, and BSD 2-clause licensed. It's not very fast (I wrote it for fairly small amounts of data) but hopefully clear enough to learn from.