
From the link I posted in my other comment, a thought experiment... Yes, some random files can be compressed by a given program, but not all random files. The proof is fairly simple, once you think it through:

Theorem: No program can compress without loss all files of size >= N bits, for any given integer N >= 0.

Proof: Assume that the program can compress without loss all files of size >= N bits. Compress with this program all the 2^N files which have exactly N bits. All compressed files have at most N-1 bits, so there are at most (2^N)-1 different compressed files [2^(N-1) files of size N-1, 2^(N-2) of size N-2, and so on, down to 1 file of size 0]. So at least two different input files must compress to the same output file. Hence the compression program cannot be lossless.
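To make the counting concrete, here's a quick Python sketch (the enumeration below just stands in for any would-be lossless compressor; N = 8 is an arbitrary example):

    # Pigeonhole check: any map from all 2^N strings of N bits into the
    # strictly shorter strings must send two inputs to the same output,
    # because there are only 2^N - 1 strings of length < N.
    from itertools import product

    N = 8
    inputs = [''.join(bits) for bits in product('01', repeat=N)]                  # 2^N inputs
    shorter = [''.join(bits) for L in range(N) for bits in product('01', repeat=L)]
    print(len(inputs), len(shorter))   # 256 vs 255 -> a collision is unavoidable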




The challenge was to provide a decompressor for one file, not any or all files.

This proof of the impossibility of a compressor that can compress any file has been known for decades.


Yes, but you don't get to see the file until after you've sent in the $100. So in order for this to be a good bet, you'd need a method that can compress a random file with at least 2% ($100/5000) probability. That's still quite difficult.

An example of an algorithm that does compress 2% (actually 1/32 ~= 3%) of random files: just drop the initial five bits of the file, and replace them with zeros upon decompression. There's a 1/32 chance they were all zeros, in which case you win. The difficult part is encoding the decompression procedure into a program of less than five bits. :-)
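A rough Python sketch of that scheme (treating the file as a string of bit characters; the catch, as said, is that the decompress step itself would have to fit in under five bits, which it obviously doesn't):

    # Toy version of the "drop five leading zero bits" scheme.
    def compress(bits):
        # succeeds only if the file happens to start with five zero bits (1/32 of random inputs)
        if bits.startswith('00000'):
            return bits[5:]
        return None   # give up on everything else

    def decompress(bits):
        return '00000' + bits   # put the dropped zeros back

    original = '00000' + '1011010'
    assert decompress(compress(original)) == original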


>you'd need a method that can compress a random file with at least 2% ($100/5000) probability

No, you'd need to be able to create a custom algo for that file with that probability. Quite a difference.


The "custom algorithm" thing is a red herring. You'll need to use some method to come up with your custom algorithm, once you see the input file, and whatever method you choose is itself a compression procedure, targeted at a fixed decoder which is essentially a Bash shell with a C compiler (* ). If you intend to treat different inputs differently (e.g., "if it contains the string "AB", I'll use this algorithm, if it has more zeros than ones, I'll use this other one, etc"), this just means that your compression procedure contains some 'if' statements. The laws of information theory don't care whether your compression procedure is actually a computer program or just implicitly encoded into human actions; it's still impossible to reliably represent random data with fewer than the usual number of bits.

(* ) There's a little bit of leeway here because the challenge as stated isn't actually precise about the execution environment. You could probably sneak in a small number of illicit bits by negotiating out-of-band whether the decompressor is to be a C program, ELF binary, Perl script, etc. It's not obvious how to make this useful though.


> If you intend to treat different inputs differently (e.g., "if it contains the string "AB", I'll use this algorithm, if it has more zeros than ones, I'll use this other one, etc"), this just means that your compression procedure contains some 'if' statements. The laws of information theory don't care whether your compression procedure is actually a computer program or just implicitly encoded into human actions; it's still impossible to reliably represent random data with fewer than the usual number of bits.

While you're obviously correct about the information theory aspect of this, in this particular challenge, the contestant enjoys the privilege of pruning his decompressor to match just the one file. All the other "if" statements can be thrown away. So now a lossy algorithm can be used, that just happens not to be lossy on that one particular file.


I understand that the contestant gets to specify a "decompressor"; my point is that this is not information-theoretically relevant. The real decompressor in this challenge is the Linux/C runtime environment. The "decompression" program you send is really just part of the compressed payload, and whatever steps you, as a human, take to produce the compressed payload (encompassing both the code and data portions) are ultimately governed by information theoretic limitations.

Another way to say this is that although you have the 'privilege' of designing a custom decompressor for this one file, this is not really a benefit, since your custom decompressor counts against the space limit (in contrast to the typical information-theoretic setting, where the channel can have an arbitrary pre-specified decoder).


Yeah, it's not a benefit, except if the challenge-giver makes a mistake and offers a file that happens to have nicely-compressible characteristics. Basically you're spending $100 in the hope that the file you're tasked with compressing just happens to have low enough entropy that it can be compressed by more than the decompressor's size.

Not a good bet to take.

> my point is that this is not information-theoretically relevant.

Exactly. This whole post is about the challenger trying to be clever with the rules and trying to sidestep the core information theory.


Continuing that logic:

You could get to about 6% if you can also do the same at the end of the file: there's roughly a 1 in 16 chance that a random sequence of data will start or end with 5 zeros. The same trick could be done for 1s to bring it up to roughly 12%. In other words, about 12% of sequences will either begin or end with 5 ones or 5 zeros.

Since you can inspect the data before writing the decompressor, if you can figure out how to pad the beginning or end of a byte sequence with five 1s or five 0s in 5 bytes or less, then your expected outcome is $625 (minus $100 to play). Using 6 bytes would give an expected outcome of $312.50, and 7 bytes, $156.25.
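Spelling out that expected-value arithmetic in Python (taking the figures above at face value, i.e. assuming each extra unit of decompressor length halves the success probability; as pointed out further down, the bits-vs-bytes distinction matters):

    payout, entry = 5000, 100
    # ~1/8 chance: begins or ends with five 0s or five 1s (roughly, ignoring overlap)
    for size in (5, 6, 7):
        p = (1 / 8) / 2 ** (size - 5)
        print(size, p * payout, p * payout - entry)
    # -> 5: 625.0 gross, 525.0 net; 6: 312.5, 212.5; 7: 156.25, 56.25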

7*'0'+a

That's 7 bytes. It's not really an executable decompressor, but it does capture the necessary steps required to decompress.


Unfortunately you're confusing bits and bytes (I did the same thing too at first). Fitting a decompressor in five bits is a lot harder. :-)


I think there are two types of communication we need to distinguish: pre-file, the communication we do before getting the file, and post-file, the communication we do after we have it.

Pre-file we can communicate as much as we want without being penalized. We can say, for instance, that we want our bitstring to be interpreted as an ELF binary to be run on Ubuntu 14, with specific libraries installed. Or we can say that it will be a zip file. Or we can say that it should be pre-concatenated with 0s. We can specify hundreds of thousands of lines of code that should be compiled now and later executed with the bitstring as input.

Then we get the file, and now every bit we communicate is penalized. If we have asked for a Turing-complete format, we may now send a decoder specific to the file.


as pointed out... bits vs bytes... I think the overall idea here is interesting. It is "cheating" because the implicit "EOF" and "start of file" allow us to forego indexing the data. This "cheat" is similar to the technique used in the article.

Similar cheats would be to leverage hardware instructions (imagine if x86 had a "pad with N zeros" instruction that fit in 7 bits). That's still cheating because the information is hiding in the architecture.


Sure. It's possible to win, and it's also possible to lose, depending on the specific file provided. This is much better than reformulating the problem to an impossible one (compress any file, instead of this one specific file).


Actually even that is not feasible:

Assume that you split an m-bit string into a k-bit decompressor and l bits of input data. If k + l < m, that is, you actually compress, then there are only 2^(k+l) possible output strings for that split. Summing over every total length shorter than m,

    sum_{i=0}^{m-1} 2^i = 2^m - 1

so there are fewer than 2^m possible outputs in total, and at least one of the 2^m strings of m bits cannot be compressed by any algorithm.
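Checking that count for a small m in Python:

    m = 10
    shorter = sum(2 ** i for i in range(m))   # strings strictly shorter than m bits
    print(shorter, 2 ** m)                    # 1023 < 1024: at least one m-bit string is left out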


Yes, as the challenge-giver, knowing which string(s) to choose that have this property is the hard part.


The problem wasn't well formulated, since it didn't clearly define what can or cannot be done to the decompressor based on knowledge of the output file.

But consider a more strongly formulated problem, where you must provide the decompressor first and then a file is randomly selected to be put through it. As long as it's still a matter of compressing one particular file, it is statistically possible to win (especially at $5000:$100), the idea being that you can reinterpret data to make certain outputs cheaper and others more expensive.


It was well formulated; the whole point is that you can do whatever you want to the decompressor.




