My adventures in random data compression (roberts.pm)
8 points by Uptrenda on June 3, 2020 | 11 comments



This is pure nonsense. Random data is provably not compressible. So all the complex schemes the author came up with are useless, and it's not just a lack of CPU power that kept them from showing results.


Reminds me of when I was about 13 and I had this brilliant idea for an insanely good compression scheme that could handle random data.

I was teaching myself programming using QBasic and I had just discovered the random number function. My idea was essentially to find a random seed which would make the random number generator generate the same byte sequence as the input block. Then you just had to store the random seed, huge win!

Of course, it didn't take me too long until I realized just how long it would take to compress, and I hadn't even thought about the limited number of byte sequences a 16-bit random seed could generate or the fact that the random numbers it would generate aren't all that random...
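For the curious, the scheme boils down to something like this toy Python sketch (Python's random module standing in for QBasic's RND; seed space and block size kept tiny so the search actually terminates):

    import random

    def compress(block: bytes):
        # Brute-force search for a seed whose PRNG stream reproduces the block.
        for seed in range(2**16):                      # QBasic-era 16-bit seed space
            rng = random.Random(seed)
            if bytes(rng.randrange(256) for _ in range(len(block))) == block:
                return seed                            # store just the seed: "huge win"
        return None   # pigeonhole: only 2**16 of the 256**len(block) blocks are reachable

    def decompress(seed: int, length: int) -> bytes:
        rng = random.Random(seed)
        return bytes(rng.randrange(256) for _ in range(length))

Even before worrying about speed, the counting is fatal: 2^16 seeds can never cover even the 2^24 possible 3-byte blocks.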

But for a few hours there I felt really, really clever :)


Exactly.

A more specific objection is that the ‘pigeonhole principle’ applies.

Another way of looking at it is reducing it to absurdity: if it worked, you could repeatedly apply such a compression scheme until you are left with just a single bit representing the original data.
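You can watch the pigeonhole bite with a few lines of Python, counting the n-bit strings against all strictly shorter bit strings:

    from itertools import product

    n = 3
    inputs  = [''.join(p) for p in product('01', repeat=n)]                    # 2^n = 8
    shorter = [''.join(p) for k in range(n) for p in product('01', repeat=k)]  # 1+2+4 = 7

    print(len(inputs), "inputs vs", len(shorter), "strictly shorter outputs")
    # 8 vs 7: any lossless "compressor" must send two inputs to the same output.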


As I started reading this, I thought, well, he means “high entropy” when he says random, and the scheme will be about finding bits of structure here and there, and he brought up Golomb coding, and that seems like a reasonable place to start... and then I realized he really did mean random, and he really did think the blockchain was the way to go, and he really did use a cobbled-together Spark installation rather than learn SIMD/GPU programming... and, yeah, it’s all nonsense, and the proof unsurprisingly does not fit in the margin.


Well, he's not the first nor will he be the last to try this nonsense. There was even a patent for compressing random data! Maybe he will learn something useful in the process, like why he's wrong. Why am I not surprised to see blockchains linked with this :)


I'm fairly certain compressing random data is nonsensical. In fact there's a cash prize challenge for doing it that has been unmet for 20 years. https://marknelson.us/posts/2012/10/09/the-random-compressio...

Not sure I can explain it properly myself, but I think it's impossible by definition. Intuitively, if you could compress any random data then you should be able to compress 2 bits into 1, which is clearly impossible.


It's possible to compress (some) random data, based on just luck. It's not generally possible to compress any random data.

If I generate a thousand random 1024 byte sequences, I'm sure I could pick a few of them that I can compress, so long as I don't have to compress all of them.

It's also possible that the total of my compressed subset plus the uncompressed rest is smaller than the total of the original sequences! This still doesn't mean I could compress any random sequence, just specific ones: ones that had lower entropy, by pure luck.

To translate to your analogy: I can make a compressor that compresses 2 bits into 1, BUT it only works for 50% of all two-bit sequences...
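Concretely, such a "works for half of all inputs" compressor is a few lines of Python:

    def compress(two_bits: str):
        # Only inputs that start with '0' are handled: drop the leading zero.
        return two_bits[1] if two_bits.startswith('0') else None

    def decompress(one_bit: str) -> str:
        return '0' + one_bit    # unambiguous, since only '0?' inputs were accepted

    for s in ('00', '01', '10', '11'):
        print(s, '->', compress(s))
    # 00 -> 0, 01 -> 1, 10 -> None, 11 -> None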


You're right... it's impossible.

This quote from the competition sums it up:

"Again, I will emphasize that Challenge 2 is at its heart provably impossible to beat. No program can compress all files of a given size, and the chances of any program being able to compress 1,000 different files of this length is so vanishingly small that it can safely be ruled out, even if every computer on earth was working on it from now until we are engulfed in flames during Sol’s red giant phase."


Author here.

I think most of the criticism of random data compression is far too sweeping and absurd. For example, saying that it's impossible to compress random data because, if it were possible, you could compress data forever and make any file a single bit... is basically reductio ad absurdum.

I don't see anyone making that claim (and I certainly didn't in my article, where I specified a limit of 1k). People like to bring up the pigeonhole principle as if it's a mathematical proof that there can never be a practical algorithm for compressing random data. But the truth is, if there are so many different ways to represent data, then we might never run into a situation where it's actually relevant.

The other fault I see in these comments is the knee-jerk reaction to using a PoW-inspired algorithm for set reduction. Why is that? Could it be because blockchain technology is no longer in your favourite hype cycle and you need to signal to all your tech buddies that you're in the know by bandwagoning like everyone else? PoW is far superior to other ways of reducing the sets and has the added bonus that nonces are more compressible.
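For anyone who only knows PoW as a buzzword, the primitive itself is trivial. A generic Bitcoin-style nonce search looks something like the Python sketch below (just the primitive, to be clear, not my full set-reduction pipeline):

    import hashlib

    def find_nonce(data: bytes, zero_bits: int) -> int:
        # Find a nonce whose SHA-256 digest has at least `zero_bits` leading zero bits.
        target = 1 << (256 - zero_bits)
        nonce = 0
        while True:
            digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce
            nonce += 1

    print(find_nonce(b"hello", 16))   # takes ~2^16 hashes on average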

The point that I got to with this research is having 14 bits left over per set to save offsets after filtering results. I was running some code yesterday and noticed that using 3 x 3-bit fields to save the zero-bit count information in the hashes produced low offsets. I think this could be compressed with the same simple quotient idea I already wrote about to fit in the 14 bits, so that you could quickly filter and recover the sets and dereference the original data.

This would reduce 1024 bytes to around ~1015, which would be more than enough to solve the 20+ year compression challenge and refute claims that random data can't be compressed. Of course, the cluster I built to run my code is too slow and I'm getting tired of waiting, so I will need a break from this. In the meantime you can stick to the insults and other predictions in the comments. As we all know from experience: Hacker News is a positive and constructive community that is excellent at predicting the future.


Let's restructure the problem slightly, so the objection is clearer.

Instead of binary data, think of decimal numbers. We'll take a five-digit number and compress it, using a function F, to a four-digit one.

We could come up with many different ways of doing this, but let's just throw away the most significant digit:

F(00123) = 0123
F(01234) = 1234

Great - but now what happens with a five-digit number whose leading digit isn't zero?

F(12345) = 2345

We lose information. We can’t encode the fifth digit in any of the other digits because that would change their meaning.
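In code, this F is one line, and the collisions are immediate:

    def F(n: int) -> int:
        return n % 10000          # drop the most significant digit of a 5-digit number

    print(F(2345), F(12345), F(42345))   # 2345 2345 2345: three inputs, one output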

It should be evident that there is no scheme where you can encode a five-digit number as a four-digit number without either being lossy:

for some X: F’(F(X)) != X

or having a restricted domain, i.e. F’(F(X)) = X only where the leading digit is 0.

Increasing the number of digits just obfuscates the problem by moving it from the realm of numbers, where it's clear there can be no one-to-one, bi-directional mapping between two sets of unique numbers whose cardinalities are not equal, to the realm of ‘data’, where we can confuse the issue.

At the end of the day, a decompressor that only accepts 2^(1015 * 8) unique input values can produce at most 2^(1015 * 8) unique outputs. And while that's a lot of outputs, it's far fewer than the 2^(1024 * 8) you need to be able to represent every random 1024-byte input.
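To put a number on it, a quick back-of-the-envelope calculation in Python:

    # Fraction of 1024-byte files that could even be assigned a distinct
    # 1015-byte representation: at most 2^(1015*8) / 2^(1024*8) = 2^-72.
    print(2**(1015 * 8) / 2**(1024 * 8))   # ~2.1e-22
    print(2.0**-72)                        # same thing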

Ill leave you with a quote from Oscar Wilde: “A true friend stabs you in the front, not the back”


Exactly. Each compressed input can only decompress to one output.

But the number of possible compressed inputs is far smaller than the output space. Therefore many outputs can never be produced.

Real compression algorithms for structured data preserve the above property. All outputs are representable, but the shorter codes (and thus most of the coding symbol space) are devoted to the most common patterns. Since the coding symbol space is finite, random or less compressible data are left with longer bitstrings than if stored uncompressed. For a reasonably structured file, the overall outcome is compression; a random file sees a small increase in size due to coding overhead.
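This is easy to check empirically; here's a quick sketch using Python's zlib (any general-purpose compressor behaves the same way):

    import os, zlib

    structured  = b"the quick brown fox " * 60    # repetitive, low-entropy, 1200 bytes
    random_data = os.urandom(len(structured))     # high-entropy, same length

    print(len(structured), "->", len(zlib.compress(structured)))     # shrinks a lot
    print(len(random_data), "->", len(zlib.compress(random_data)))   # grows by a few bytes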



