
Asymptotically it doesn't matter - that's the genius of all LZ schemes (LZ77 =~ LZSS, LZ78 =~ LZW). All the other details just take care of the early, pre-asymptotic stage.



No, it still matters - LZW assumes a flat probability for every entry in its dictionary. After a while, both the encoder and decoder know more accurate probabilities, but LZW has no way to take advantage of that.
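
To make the "flat probabilities" point concrete, here is a minimal LZW encoder sketch (my own toy illustration in Python, not any particular library's implementation): every dictionary entry is emitted as an equal-width integer code no matter how often its sequence actually occurs, which is exactly the modeling opportunity LZW leaves on the table.

    def lzw_encode(data: bytes) -> list[int]:
        # Dictionary starts with all single bytes; every entry gets an
        # equal-width integer code -- no per-entry probability model.
        table = {bytes([i]): i for i in range(256)}
        codes, current = [], b""
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in table:
                current = candidate
            else:
                codes.append(table[current])
                table[candidate] = len(table)  # new, longer entry
                current = bytes([byte])
        if current:
            codes.append(table[current])
        return codes

    print(lzw_encode(b"abababab"))  # [97, 98, 256, 258, 98]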


But asymptotically it doesn't matter, because the LZ78/LZW code words represent longer and longer sequences, with the more probable sequences getting shorter codes. The additional letter (per code word, in the case of LZ78) doesn't matter asymptotically. Practically, of course, it does.


How do the more probable sequences get shorter codes?


That's basically the objective of every compression algorithm. Huffman coding does it directly, by construction; Lempel-Ziv does it indirectly - in LZ78/LZW, you get, essentially and asymptotically (though randomly, per individual sequence), a code length proportional to -log_2 P(sequence), because sequences are added to the dictionary roughly in order of probability.

In LZ77/LZSS it is basically the same, except the dictionary is not constructed explicitly but rather implicitly from the history. In both cases, you need to refer to Lempel and Ziv's proof that the codeword length does indeed converge to -log_2 P(sequence), which is optimal by Shannon's source coding theorem.
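
A toy numeric check of that -log_2 P target (my own illustration, not part of the proof): for a known, dyadic distribution, Huffman's code lengths match -log_2 p exactly, and that is the benchmark LZ only approaches asymptotically.

    import heapq
    import math

    # Hypothetical source with known (dyadic) symbol probabilities.
    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    # Ideal (Shannon) code length per symbol is -log2 p.
    ideal = {sym: -math.log2(p) for sym, p in probs.items()}
    print(ideal)  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}

    # Huffman by construction: repeatedly merge the two least probable
    # subtrees, adding one bit to the depth of every symbol inside them.
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    print(heap[0][2])  # {'a': 1, 'b': 2, 'c': 3, 'd': 3} -- matches -log2 p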

I highly recommend Cover & Thomas's book on information theory ("Elements of Information Theory") if you find this interesting and are not afraid of math. And Shannon's original 1948 paper ("A Mathematical Theory of Communication") is surprisingly and exceptionally readable - though it doesn't cover LZ, for obvious reasons.


Okay, so it's because the most probable sequences are more likely to be encountered first, which assumes the data is more or less statistically uniform throughout. (I.e., it doesn't work as well when the data starts with a large blob of random data and the more regular data comes afterwards. But I guess that's true for most general compression algorithms. At least LZW has a reset code.)

I'll see if I can find the C&T book. Shannon's paper was part of the course material at uni, IIRC; at least I think I read it long ago.


It's not just a matter of being encountered first, it's a matter of being encountered more often ("encountered first" is a special case of "more often"). But you really need to delve into the proof to see why that's enough for optimality. At the very least, _I personally_ don't know of a simple, intuitive way to explain it.

It is important to remember what each compression algorithm is optimal for.

LZ77 and LZ78 are optimal for observable Markov sources -- that is, sources in which, given the x most recent consecutive outputs (where x is the Markov order), the distribution of the next symbol is fixed. While this is rarely exactly the case, it is reasonable to model e.g. English text as a 10th-order Markov source (with respect to characters) or a 3rd-order one (with respect to words).
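
For concreteness, here is what an observable Markov source means in code (a hand-rolled sketch with a made-up order-2 transition table): the next symbol's distribution is fully determined by the last two outputs.

    import random

    # Hypothetical order-2 Markov source over {'a', 'b'}: the next
    # symbol's distribution depends only on the last two outputs.
    transitions = {
        ("a", "a"): {"a": 0.1, "b": 0.9},
        ("a", "b"): {"a": 0.7, "b": 0.3},
        ("b", "a"): {"a": 0.4, "b": 0.6},
        ("b", "b"): {"a": 0.9, "b": 0.1},
    }

    def emit(n: int, seed: int = 0) -> str:
        rng = random.Random(seed)
        out = ["a", "b"]  # arbitrary initial context
        for _ in range(n):
            dist = transitions[tuple(out[-2:])]
            out.append(rng.choices(list(dist), weights=list(dist.values()))[0])
        return "".join(out)

    print(emit(40))  # the kind of stream LZ can asymptotically model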

The source you described is NOT Markov overall, although it might be Markov in the tail (past the random blob). Asymptotically, of course, it is Markov :)


Hmm... I thought that in LZW, once a sequence has been assigned a code, it can never be assigned a shorter one; codes are assigned in order, and they only get longer over time (until a reset is triggered). So the frequency of a particular sequence cannot influence the length of the code assigned to it. I guess I'll read up on it a bit more before commenting further. I do see that it could end up being optimal in the asymptotic case.


You are, of course, correct. But consider how LZW compresses at all: you cannot reassign the codes for single letters, yet you gain by giving one code to a whole sequence - the code is longer than a letter's, but shorter than the individual codes it replaces. The same logic applies here: even though you cannot reassign an existing code, you will keep getting new codes for longer sequences that include the shorter one, represent more letters, and are overall more likely to occur than the shorter sequence followed by something else.

Assuming no reset, over time the shorter codes fall out of use because the longer ones eventually include everything they represent, extended by all possibilities (with the more probable extensions getting shorter codes by virtue of appearing more often / earlier).
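
You can watch that happen with the same toy LZW scheme as above (again my own sketch): track how many input bytes each emitted code covers, and on a repetitive stream the spans grow as the dictionary accumulates longer entries.

    def lzw_spans(data: bytes) -> list[int]:
        # For each emitted code, record how many input bytes it covered.
        table = {bytes([i]): i for i in range(256)}
        spans, current = [], b""
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in table:
                current = candidate
            else:
                spans.append(len(current))
                table[candidate] = len(table)
                current = bytes([byte])
        if current:
            spans.append(len(current))
        return spans

    print(lzw_spans(b"abc" * 8))
    # [1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 1] -- later codes cover longer spans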



