I have mixed feelings about this. It's always good to teach people things they don't know, and compression is a worthy subject.
But the author seems to lament that not much of interest is happening in lossless compression. That's true, but there's a good reason for it: Lossless compression is just about as good as it's ever going to get, and it's easy to prove that. Seeking dramatic improvements in general lossless compression is like seeking faster-than-light travel: Ain't gonna happen. Except for specialized niches, any improvements to lossless compression are going to be tiny and incremental. Tiny and incremental might well matter for petabyte data stores like Google manages, but it won't matter for most people.
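(For anyone who wants the proof sketch: it's a counting argument. A lossless compressor has to map distinct inputs to distinct outputs, and there are 2^n bit strings of length n but only 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 strings that are strictly shorter, so no scheme can shrink every n-bit input; whatever it shortens is paid for by something it lengthens. In practice the floor for a given kind of data is its Shannon entropy, and mature codecs already sit very close to it.)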
Lossy compression is a different story. There's still lots of unexplored territory in finding an optimum bit rate that matches the information processing capabilities of human eyes and ears.
I had a similar feeling a decade ago, but I now think otherwise. We may have reached a good-enough compression ratio, but not good-enough performance, and modern algorithms relentlessly improve on that front. Lossy compression, on the other hand, always tends to exploit whatever computing power is currently available; the fact that it can keep doing so for decades suggests that improvements in lossy compression are more "incremental" than those in lossless compression.
I think the greatest breakthrough in lossless compression would come when the development of compute outpaces storage density. Maximum possible compression (the entropy limit, I think?) isn't impossible per se, but it is computationally infeasible with today's computing resources (rough sketch of that entropy floor below).
Of course, largely hypothetical now, but I envision a time in the distant future where we produce so much data that we must compress it to its principled minimum.
I do agree that current lossless compression is pretty pressed for further innovation given current resource constraints.
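To make the entropy floor concrete: once data already sits at maximum entropy, no amount of compute will shrink it further. A minimal Python sketch (zlib is just standing in for "any general-purpose codec", and random bytes stand in for data with no remaining structure):

    import os
    import zlib

    blob = os.urandom(1 << 20)        # 1 MiB of random bytes: already ~8 bits/byte of entropy
    out = zlib.compress(blob, 9)      # even at the highest level...
    print(len(blob), "->", len(out))  # ...the output comes back slightly *larger*, not smaller

The interesting (and computationally hard) part is getting structured data down toward its entropy, not pushing past it.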
What if compute capability became so powerful that the easiest way to store something was to create it, store it for a millisecond, record the exact place and time since the Big Bang at which it was stored as a comparatively short set of coordinates, and then delete it? Then accessing the file would be as simple as simulating the universe up to that millisecond and reading the file back.
Probably easier to just have every computer connected on a much faster network. Query the internet for a hash, have it sent to you after a 100ms lookup at 1TBps.
Seems impossible considering in my lifetime I’ve basically seen ISP speeds go from 5/1 on my first cable connection to 20/768 20 years later on DSL. I always find it a bit weird knowing how important the internet is to commerce and the economy and my only real way to access it is through the same phone line that has been around forever.
You are right, although we might be able to make it lossless in practical terms with probabilistic approaches like combining multiple algorithms, selecting the hash length, and using heuristics to pick the most likely source.
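As a toy version of that hash-as-address idea (purely illustrative: a dict stands in for "every computer on the network", and SHA-256 is an arbitrary choice):

    import hashlib

    STORE: dict[bytes, bytes] = {}    # the "network": everything anyone has ever published

    def put(data: bytes) -> bytes:
        """Publish data; its 32-byte SHA-256 digest becomes the address."""
        key = hashlib.sha256(data).digest()
        STORE[key] = data
        return key

    def get(key: bytes) -> bytes:
        """Fetch by hash -- only works because the full data still lives somewhere."""
        return STORE[key]

    payload = b"some large file contents " * 100_000   # ~2.5 MB
    addr = put(payload)
    assert get(addr) == payload
    print(f"{len(payload)} bytes reachable from a {len(addr)}-byte key")

The catch is exactly the probabilistic part mentioned above: truncate the key below a full digest and collisions stop being astronomically unlikely, so you would need those heuristics to decide which of the colliding sources was actually meant.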
I somewhat agree. I think that Shannon's theorem about information entropy comes into play, and AFAIK we are already very close, so improvements are bound to be small as we try to get closer and closer to the theoretical limit.
Still, I am reminded of Huffman coding: researchers at the time thought they had reached the optimum. They were proven phenomenally wrong when it was realized that one should compress sequences rather than individual bytes (I believe LZ77 or LZW was the first such algorithm; rough sketch of the difference below).
If we take that into account and follow the same line of thought, improvements may still be possible if we could compress based on a larger body of data. For example, headers of the same file type contain the same common information; in theory, these would only need to be stored once on the drive. This is something deduplicated storage is already trying to achieve, so nothing new here, just pointing to one avenue that could still yield future improvements :)
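A rough sketch of that Huffman-vs-sequences gap, using only the standard library (the order-0 bound below is the best any symbol-by-symbol code can do; zlib's DEFLATE, which layers LZ77 under Huffman coding, is the sequence-aware stand-in):

    import math
    import zlib
    from collections import Counter

    def order0_bound_bytes(data: bytes) -> float:
        """Shannon floor for any per-byte code (e.g. Huffman) that ignores context."""
        n = len(data)
        bits = -sum(c * math.log2(c / n) for c in Counter(data).values())
        return bits / 8

    # Repetitive, header-like data: lots of repeated *sequences*, not just skewed byte counts.
    data = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n" * 10_000
    print(f"input                : {len(data)} bytes")
    print(f"per-byte (Huffman)   : {order0_bound_bytes(data):,.0f} bytes at best")
    print(f"zlib (LZ77 + Huffman): {len(zlib.compress(data, 9))} bytes")

The per-byte floor only captures that some characters are more common than others; the LZ pass gets to say "this whole header appeared before", which is the same intuition behind the deduplication point.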
If you only care about compressing a megabyte a minute, there's not a lot left to do. But in practice speed is usually a major consideration, and better methods let you fit significantly better compression into your time budget.
Also, going one byte at a time with minimal context is pretty good for text but pretty bad for many other kinds of data. There's a lot of room to improve on those fronts.
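A crude illustration of the time-budget point, with synthetic data and only standard-library codecs (a real comparison would include zstd and lz4, which aren't in the standard library; exact numbers vary by machine and don't matter, only the shape of the trade-off does):

    import lzma
    import random
    import time
    import zlib

    # Synthetic "text": random words drawn from a small vocabulary, so there is structure to find.
    random.seed(0)
    vocab = [bytes(random.choices(range(97, 123), k=random.randint(3, 9))) for _ in range(2000)]
    data = b" ".join(random.choices(vocab, k=500_000))

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        print(f"{name:8s} ratio {len(data) / len(out):5.2f}   {elapsed * 1000:7.0f} ms")

    bench("zlib -1", lambda d: zlib.compress(d, 1))   # fast, weaker
    bench("zlib -9", lambda d: zlib.compress(d, 9))   # slower, tighter
    bench("lzma",    lambda d: lzma.compress(d))      # slowest, tightest

Every rung on that speed ladder buys ratio, so a faster codec effectively buys better compression inside a fixed time budget.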
How do we know that we've hit a wall with lossless compression? As a non-expert, I've been really impressed with the likes of zstd and lz4, which are relatively new.
I should clarify: There's still room for improvement in the speed of lossless compression/decompression operations, and that's where most improvements happen nowadays. But there's little room for improvement in compression ratio, because most modern algorithms produce outputs that are very close to random already.
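One rough way to check the "close to random" part is to measure the byte-level entropy of a codec's output; order-0 entropy is only a crude proxy for randomness, but decent codecs land near the 8 bits/byte that truly random data would have (Python sketch, standard-library codecs only):

    import lzma
    import math
    import random
    import zlib
    from collections import Counter

    def bits_per_byte(data: bytes) -> float:
        """Empirical order-0 entropy of a byte string."""
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

    random.seed(0)
    words = b"the quick brown fox jumps over the lazy dog".split()
    text = b" ".join(random.choices(words, k=300_000))   # ~1.4 MB of word salad

    for name, out in [("raw", text), ("zlib -9", zlib.compress(text, 9)), ("lzma", lzma.compress(text))]:
        print(f"{name:8s} {len(out):>9} bytes   {bits_per_byte(out):.2f} bits/byte")

If meaningful structure were still left in those outputs, the codec would have squeezed it out itself.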