Hacker News new | past | comments | ask | show | jobs | submit login
IEEE Medal of Honor Goes to Data Compression Pioneer Jacob Ziv (ieee.org)
201 points by tosh on Jan 17, 2021 | hide | past | favorite | 30 comments



As the article points out, this guy is the Z in LZ- which also forms the compression portion of the Deflate algorithm (zip/gzip). A huge portion of the lossless compression algorithms of today are generationally based on Ziv's work.

This is extremely well deserved.


...and LZ is such a simple algorithm - replacing repeated occurrences of a sequence with a shorter backreference - that I'm surprised it wasn't discovered sooner. Huffman wrote about his algorithm in the early 50s, but LZ only showed up in the late 70s.


The important contribution is not in the simple algorithm (which I suspect others have come up with before, and I know others have independently come up with after):

Rather, it was what Lempel and Ziv proved about both of their simple algorithms -- The one published in 1977, usually referred to as LZ77 which is the basis of all <length,distance> variants, and the one published in 1978, usually referred to as LZ78 which is the basis of LZW, and other <word,char> variants.

They proved that for a very large class of sources (those described by markov models, context trees, among many others), these algorithms asymptotically give the best possible compression. In fact, some applications (also proved correct by Ziv) use the compressed datastream statistics to estimate source entropy, because it's much easier to implement than any direct measurement.

Both Ziv and Lempel did a lot of other things; However, Lempel did more in the area of switching systems and his other contributions are numerous and interesting, but were less profound in any specific field -- Whereas Ziv's contribution to Information Theory through the last 60 years or so has been much more profound.


The pointwise otimality of the LZ algorithm - the fact that it beats every other information-lossless finite-state compressor like Huffman encoding asymptotically on /every/ string, is an ingenious theorem. For versions of the algorithm (LZ77 in particular), it was open till the late 1990s when Paul Shields settled it.

The algorithm is easy, but the analysis of its effectiveness is not at all easy.


>...and LZ is such a simple algorithm -

then you have monsters like this

https://en.wikipedia.org/wiki/Asymmetric_numeral_systems


I think repeated sequences in data only became relevant when there was large image data. I mean, like Shannon was concerned about language transmitted.


Not at all. Shannon's work is incredibly profound, and deals with distortion of analog signals, channel capacity under different constraints, and many other things. He was very much aware and discussing repeated sequences in the 1952 class in which Huffman constructed his code and proved it is optimal. (According to lore, Shannon said "hey, Fano and I have this Shannon-Fano code, which we know is close but not optimal. Improve on it and you can skip the final")

Shannon's original paper, "A mathematical theory of communication" is very readable (especially discrete sources and channels; you need some calculus for the continuous sources and channel stuff), and yet touches on many things and has many insights. At least as late as 2000, one very-well-published prof at the uni I was attending told me that whenever he wants to break out to a new direction, he rereads it 5-10 times and realizes another thing that Shannon considered trivial and that was yet unexplored.

Also, there's basically no repeated sequences in large image data. JPEG, H264 and all their friends are all about compressible approximations that are still close enough, and which rarely if ever have repeated data.


The first compression algorithm I implemented (on and IBM PC/AT in the mid '80s) was LZW. It's my favorite type of algorithm - elegant but hard to get your head wrapped around. Once you "get it" it seems so obvious.


LZW is just a (nontrivial and very effective) trick to improve on the early behavior of LZ78; they are both asymptotically optimal.

LZ78 is incredibly simple, elegant AND easy to get your your head around. I had implemented it once in 2 lines of Python. some 25 years ago. If I can shake the rust maybe I'll be able to rederive it.


It's a bit of a shame that the LZ78-LZW linage of algorithms basically died out at that point.

Most of modern lossless compression is instead based on the LZ77-LZSS-... linage, which has a massive number of different improvements and variations.


  I had implemented it once in 2 lines of Python. some 25 years ago. If I can shake the rust maybe I'll be able to rederive it. 
Please do! I would love to see a concise and readable implementation in Python.


Many good inventions are like that.


Nearly all break through inventions seem obvious in hindsight. I guess putting the pieces together and connecting the dots looking forward is what’s incredibly hard. The search space can still be immense.


The LZ algorithms are like magic. LZ78 factorization can be written in 5 line of python, and then can be used in a mysteriously effective distance metric. The power of math and "useless" Kolmogorov complexity!


Congratulations!

Back when I attended faculty, I had a project assigned to me titled "Lempel-Ziv compression of binary strings" or something like that. It was one of the most fun projects I worked on.


aside from the obvious one of him being the z in .lzma, Wyner-Ziv compression is really important in a tiny subfield of information theory


In an alternate universe, the etymology of “ZIP” would be derived from Ziv, plus a P for Phil Katz.


Phil Katz - not sure what to say. I use him as an example of how good things dont always make ones life better.


Why does this medal go to only one pioneer, not including Abraham Lempel?


Although the family of algorithms is called LempeLl-Ziv, Ziv was the primary author, so it should really be called ziv-Lempel


I understand there is a tradition in papers to put the primary author first but in the end, it is the letter Z almost became synonymous with compression.

.Z, .zip, .7z, .gz, .xz, .bz2. Of course, many use LZ77 or LZMA, LZ stands for Lempel–Ziv, but only the Z remained in the format.


To be pendantic: Nobody today uses LZ77. LZ77 is quite primitive and not very efficient. However, nearly everyone today uses an algorithm that is descended from LZ77, so "LZ77" is just a sort of shorthand for "any algorithm descended from LZ77". LZMA and Zip's deflate are both examples of LZ77 descendants.

(To be even more pedantic, all modern algorithms are descendants of LZSS, which was one of the first improvements on LZ77.)

Out of the ones you listed, only .Z and .bz2 are not LZ77-LZSS descendants.

.Z is an implementation of LZW, which is a descendant of LZ78, which is a significantly different (but still related) algorithm to LZ77. It never really caught on beyond a few implementations of LZW, though.

.bz2 is the only one completely unrelated to Ziv's work. It uses the Burrows-Wheeler transform, instead.


I've been trying to figure out where the first z comes from. At least I tend to think "z" is for compression because of ZIP, but it predates that.

.Z was used for compress(1), probably because pack(1) used .z, but where did that come from?

My best theory is that it's from Steve Zucker who wrote pack(1) at RAND. There were other versions of pack(1) too, but if his was the first, it seems possible that he choose the z extension for Zucker.

Does anyone have more info or other theories?


I don't believe bzip2 is based on lz at all.


It uses arithmetic coding, I believe.

EDIT: wikipedia says it did but changed to use Huffman encoding due to patents? But those patents have expired.


It uses the Burrows-Wheeler transform and Huffman coding. It is one of the few lossless compressions algorithms that do not actually build on the work of Ziv.


You can also add zstandard to the list.


Well deserved. Just in case you were not aware of the variants of LZ[iv] for dictionary based (or sliding window based) lossless compression -> https://tinyurl.com/y3h76c4j


Your URL but without the tinyurl redirect: https://ethw.org/w/images/thumb/6/6d/Compression_hierarchy.p...


Wouldn't it be cool if his name was actually Jacob Zip?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: