As the article points out, this guy is the Z in LZ, which also forms the compression portion of the Deflate algorithm (zip/gzip). A huge portion of today's lossless compression algorithms are, generation after generation, built on Ziv's work.
...and LZ is such a simple algorithm - replacing repeated occurrences of a sequence with a shorter backreference - that I'm surprised it wasn't discovered sooner. Huffman wrote about his algorithm in the early 50s, but LZ only showed up in the late 70s.
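To make the idea concrete, here is a toy sketch in Python (nothing like the real DEFLATE bit format, and the match search is deliberately naive):

    # Toy LZ77: replace a repeat of earlier data with a (distance, length) backreference.
    def lz77_toy(data, window=4096, min_len=3):
        out, i = [], 0
        while i < len(data):
            best_len, best_dist = 0, 0
            # naive search of the already-seen window for the longest match of data[i:]
            for j in range(max(0, i - window), i):
                k = 0
                while i + k < len(data) and data[j + k] == data[i + k]:
                    k += 1
                if k > best_len:
                    best_len, best_dist = k, i - j
            if best_len >= min_len:
                out.append(('ref', best_dist, best_len))   # backreference to earlier data
                i += best_len
            else:
                out.append(('lit', data[i]))               # literal symbol
                i += 1
        return out

    # lz77_toy("abcabcabcabc") -> [('lit','a'), ('lit','b'), ('lit','c'), ('ref', 3, 9)]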
The important contribution is not the simple algorithm itself (which I suspect others came up with before, and I know others have independently come up with since):
Rather, it was what Lempel and Ziv proved about both of their simple algorithms -- the one published in 1977, usually referred to as LZ77, which is the basis of all <length,distance> variants, and the one published in 1978, usually referred to as LZ78, which is the basis of LZW and other <word,char> variants.
They proved that for a very large class of sources (those described by Markov models and context trees, among many others), these algorithms asymptotically give the best possible compression. In fact, some applications (also proved correct by Ziv) use the compressed datastream statistics to estimate source entropy, because that's much easier to implement than any direct measurement.
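As a rough illustration of that entropy-estimation trick (just the general idea, not the exact estimator from the papers): since an LZ-family coder approaches the entropy rate, compressed size over input length gives an upper estimate of bits per symbol.

    import random
    import zlib

    def entropy_estimate_bits_per_byte(data: bytes) -> float:
        compressed = zlib.compress(data, 9)   # DEFLATE = LZ77 + Huffman
        return 8.0 * len(compressed) / len(data)

    # A fair-coin source over the bytes '0'/'1' has 1 bit/symbol of entropy;
    # the estimate should come out a bit above 1 (overhead, finite data), far below 8.
    sample = bytes(random.choice(b"01") for _ in range(100_000))
    print(entropy_estimate_bits_per_byte(sample))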
Both Ziv and Lempel did a lot of other things. Lempel did more in the area of switching systems; his other contributions are numerous and interesting, but less profound in any specific field, whereas Ziv's contribution to Information Theory over the last 60 years or so has been much more profound.
The pointwise optimality of the LZ algorithm - the fact that it asymptotically beats every other information-lossless finite-state compressor, like Huffman encoding, on /every/
string - is an ingenious theorem. For some versions of the algorithm (LZ77 in particular), the question was open until the late 1990s, when Paul Shields settled it.
The algorithm is easy, but the analysis of its effectiveness is not at all easy.
I think repeated sequences in data only became relevant once there was large image data. I mean, Shannon was concerned with transmitting language.
Not at all. Shannon's work is incredibly profound, and deals with distortion of analog signals, channel capacity under different constraints, and many other things. He was very much aware of, and discussing, repeated sequences in the 1952 class in which Huffman constructed his code and proved it optimal. (According to lore, Shannon said "hey, Fano and I have this Shannon-Fano code, which we know is close but not optimal. Improve on it and you can skip the final")
Shannon's original paper, "A Mathematical Theory of Communication", is very readable (especially the discrete sources and channels; you need some calculus for the continuous sources and channels stuff), and yet it touches on many things and has many insights. At least as late as 2000, one very well-published prof at the uni I was attending told me that whenever he wants to branch out in a new direction, he rereads it 5-10 times and realizes yet another thing that Shannon considered trivial and that was still unexplored.
Also, there are basically no repeated sequences in large image data. JPEG, H264, and all their friends are all about compressible approximations that are still close enough, and those rarely if ever contain repeated data.
The first compression algorithm I implemented (on an IBM PC/AT in the mid '80s) was LZW. It's my favorite type of algorithm - elegant but hard to get your head wrapped around. Once you "get it" it seems so obvious.
LZW is just a (nontrivial and very effective) trick to improve on the early behavior of LZ78; they are both asymptotically optimal.
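A toy encoder makes the trick visible: the dictionary starts pre-seeded with every single byte, so even the very first output symbols are dictionary indices rather than raw literals. (An illustrative sketch only, with none of the code-width handling a real LZW needs.)

    def lzw_encode(data: bytes):
        dictionary = {bytes([i]): i for i in range(256)}   # pre-seeded with all single bytes
        out, w = [], b""
        for b in data:
            wc = w + bytes([b])
            if wc in dictionary:
                w = wc                              # keep extending the current phrase
            else:
                out.append(dictionary[w])           # emit the longest known phrase
                dictionary[wc] = len(dictionary)    # and learn the new, longer one
                w = bytes([b])
        if w:
            out.append(dictionary[w])
        return out

    print(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT"))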
LZ78 is incredibly simple, elegant AND easy to get your head around. I implemented it once in 2 lines of Python, some 25 years ago. If I can shake off the rust, maybe I'll be able to rederive it.
Nearly all breakthrough inventions seem obvious in hindsight. I guess putting the pieces together and connecting the dots looking forward is what’s incredibly hard. The search space can still be immense.
The LZ algorithms are like magic. LZ78 factorization can be written in 5 lines of Python, and can then be used as a mysteriously effective distance metric. The power of math and "useless" Kolmogorov complexity!
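Something in that spirit (a sketch, not golfed down to 5 lines, and the distance below is only a crude cousin of the normalized compression distance):

    # LZ78 factorization: split the string into phrases, each being the shortest
    # prefix that has not appeared as a phrase before.
    def lz78_phrases(s):
        phrases, out, w = set(), [], ""
        for ch in s:
            w += ch
            if w not in phrases:
                phrases.add(w)
                out.append(w)
                w = ""
        if w:
            out.append(w)
        return out

    # Phrase counts as a crude compression-based distance between strings.
    def lz_distance(a, b):
        ca, cb = len(lz78_phrases(a)), len(lz78_phrases(b))
        cab = len(lz78_phrases(a + b))
        return (cab - min(ca, cb)) / max(ca, cb)

    print(lz78_phrases("aababcabcd"))                  # ['a', 'ab', 'abc', 'abcd']
    print(lz_distance("banana" * 20, "banana" * 20))   # identical strings -> smaller
    print(lz_distance("banana" * 20, "qwerty" * 20))   # unrelated strings -> larger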
Back when I was at university, I had a project assigned to me titled "Lempel-Ziv compression of binary strings" or something like that. It was one of the most fun projects I worked on.
I understand there is a tradition in papers of putting the primary author first, but in the end it is the letter Z that almost became synonymous with compression.
.Z, .zip, .7z, .gz, .xz, .bz2. Of course, many of these use LZ77 or LZMA, and LZ stands for Lempel–Ziv, but only the Z remained in the format names.
To be pedantic: Nobody today uses LZ77. LZ77 is quite primitive and not very efficient. However, nearly everyone today uses an algorithm that is descended from LZ77, so "LZ77" is just a sort of shorthand for "any algorithm descended from LZ77". LZMA and Zip's deflate are both examples of LZ77 descendants.
(To be even more pedantic, all modern algorithms are descendants of LZSS, which was one of the first improvements on LZ77.)
Out of the ones you listed, only .Z and .bz2 are not LZ77-LZSS descendants.
.Z is an implementation of LZW, which is a descendant of LZ78, which is a significantly different (but still related) algorithm to LZ77. LZ78 never really caught on beyond a few implementations of LZW, though.
.bz2 is the only one completely unrelated to Ziv's work. It uses the Burrows-Wheeler transform, instead.
I've been trying to figure out where the first z comes from. At least I tend to think "z" is for compression because of ZIP, but it predates that.
.Z was used for compress(1), probably because pack(1) used .z, but where did that come from?
My best theory is that it's from Steve Zucker, who wrote pack(1) at RAND. There were other versions of pack(1) too, but if his was the first, it seems possible that he chose the z extension for Zucker.
It uses the Burrows-Wheeler transform and Huffman coding. It is one of the few lossless compression algorithms that do not actually build on the work of Ziv.
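The transform itself is tiny. A toy version (ignoring the sentinel/index bookkeeping you need to invert it) just sorts all rotations and takes the last column, which clusters similar characters so the later move-to-front and Huffman stages in bzip2 can code them cheaply:

    def bwt(s):
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    print(bwt("banana"))   # 'nnbaaa' - repeated characters end up next to each other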
Well deserved. Just in case you were not aware of the variants of LZ[iv] for dictionary-based (or sliding-window-based) lossless compression -> https://tinyurl.com/y3h76c4j
This is extremely well deserved.