I didn't see anything that mentions run-time in the challenge. I think a good compression challenge should say something about run-time. Theoretically, it would be possible to hash parts of the file and then brute-force the hashes in the decompressor. This would take a lot of time, but it would work.
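For concreteness, here is a minimal sketch of that scheme at toy scale. The chunk and digest sizes are my own choices, picked so the brute force actually finishes in seconds; the final print shows how it fares.

```python
import hashlib
from itertools import product

# Toy model of the proposal: store a short digest per chunk, then brute-force
# each chunk back in the "decompressor".  A 1-byte truncated sha256 digest
# stands in for the stored hash, and each chunk is 2 bytes, so the
# "compressed" form is half the size of the input.
DIGEST_BYTES = 1
CHUNK_BYTES = 2

def digest(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()[:DIGEST_BYTES]

def compress(data: bytes) -> list[bytes]:
    return [digest(data[i:i + CHUNK_BYTES])
            for i in range(0, len(data), CHUNK_BYTES)]

def brute_force(target: bytes) -> bytes:
    # Returns the *first* 2-byte preimage found, which is usually not the
    # chunk we started with.
    for candidate in product(range(256), repeat=CHUNK_BYTES):
        chunk = bytes(candidate)
        if digest(chunk) == target:
            return chunk
    raise ValueError("no preimage found")

original = b"compress-me!"
restored = b"".join(brute_force(d) for d in compress(original))
print(original, restored, original == restored)   # almost certainly False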
There are an unbounded number of inputs that can produce the same hash digest. Saving just the hashes (digests) and then finding preimages would not guarantee the same result: you would find collisions, but not necessarily the original input.
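Here is what that looks like at toy scale; the one-byte truncation is my own stand-in for "a digest smaller than the data":

```python
import hashlib

# Truncate sha256 to 1 byte and list every 2-byte input that maps to the
# same digest as b"ab".  The original is just one preimage among many, and
# nothing in the digest says which one it was.
target = hashlib.sha256(b"ab").digest()[:1]
preimages = [bytes([x, y]) for x in range(256) for y in range(256)
             if hashlib.sha256(bytes([x, y])).digest()[:1] == target]
print(len(preimages), b"ab" in preimages)   # roughly 256 preimages, True
```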
In fact, as the data being hashed increases in size, the amount of data required to identify which preimage is the correct one must be greater than or equal to the difference in size between the preimage and the digest.
For example, let's say you are using a perfect hash function with a 64-bit digest. If you feed data into the hash function in 64-bit chunks and then try to brute-force the results, each preimage you find will be the right one, and you can assemble the original file, but you have exactly as many bytes as when you started!
Now let's say you feed in 65 bits at a time, saving 1/65 of the space in the resulting list of digests. But unfortunately, there are two 65-bit preimages for each of your stored digests, so you need a bit to decide which one is correct.
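Just the bookkeeping for that, assuming an idealised uniform 64-bit hash:

```python
# Count preimages per digest and the bits needed to pick the right one,
# for 64-bit and 65-bit chunks fed to an ideal 64-bit hash.
digest_bits = 64
for chunk_bits in (64, 65):
    preimages_per_digest = 2 ** (chunk_bits - digest_bits)
    disambiguation_bits = chunk_bits - digest_bits
    stored_bits = digest_bits + disambiguation_bits
    print(chunk_bits, preimages_per_digest, stored_bits)
# 64-bit chunks: 1 preimage, store 64 bits      -> no saving
# 65-bit chunks: 2 preimages, store 64 + 1 bits -> still no saving
```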
No, it wouldn't work! Hashes do not defy information theory; they lose information. Such brute-forcing would only find hash collisions and "decompress" to a different text than the original.
I endorse the creativity in your approach, but you are mistaken; this will not (deterministically) work.
Given a hash function that maps an input I (of size N bits, arbitrary content) to a digest D (of fixed size M bits), then for each output digest D_0 there will be, on average, 2^(N-M) inputs that hash to that D_0. How will you tell which is the "right" one?
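You can check the 2^(N-M) count empirically at toy scale; the 16-bit inputs and 8-bit truncated digest here are my own choices:

```python
import hashlib
from collections import Counter

# Hash every 16-bit input, truncate the digest to 8 bits, and count how many
# inputs land on each digest value.
counts = Counter(hashlib.sha256(i.to_bytes(2, "big")).digest()[:1]
                 for i in range(2 ** 16))
print(len(counts), min(counts.values()), max(counts.values()))
# 256 digest values, each with roughly 2^(16 - 8) = 256 preimages
```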
Except that if the hash is smaller than the source data, then even with a good hash there will be multiple source datasets that hash to the same result, which makes your decompression program unreliable. You could well brute-force the wrong answer.
Do you mean that there could be two 3 KB files with the same sha256 hash, and that the probability of hitting that collision is greater than the probability of finding the correct file for the hash?
Let's divide the 3 MB file into 1000 parts, each ~3 KB in size. Take the sha256 hash of all 1000 parts and the sha256 hash of the actual file. The size of these hashes is less than the actual file and leaves ample room for a decompressor. Now start brute-forcing, and assume we have all the processing power and time we need. Would the probability of a collision happening in all 1000 parts be very low then? Given a good hashing function, could it be so low that we can disregard it? If 1000 is not enough, can we increase the splits to 2000 or 3000?
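For what it's worth, the size arithmetic does work out, assuming 32 bytes per sha256 digest:

```python
# Stored data for the proposal above: 1000 part hashes plus one whole-file
# hash, sha256 throughout (32 bytes per digest).
file_bytes = 3 * 1024 * 1024            # the 3 MB file
stored_bytes = (1000 + 1) * 32          # 1001 sha256 digests
print(stored_bytes, file_bytes, file_bytes - stored_bytes)
# 32032 bytes of hashes vs 3145728 bytes of input -> plenty of room left over
```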
It's not that there might be a collision, but that collisions are guaranteed. How many 3 MB files are there? 2^(3*1024*1024*8) = 2^25165824. How many distinct sets of 1001 sha256 hashes are there? 2^(256*1001) = 2^256256.
By the pigeonhole principle, we can't fit 2^25165824 objects into 2^256256 holes without sharing; indeed, each file will share its set of 1001 sha256 hashes with, on average, 2^24909568 other files.
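Writing those exponents out (this is just the counting above, nothing more):

```python
# Bits in a 3 MB file vs bits in 1001 sha256 digests.
file_exponent = 3 * 1024 * 1024 * 8     # 2^25165824 possible 3 MB files
hash_exponent = 256 * 1001              # 2^256256 possible sets of 1001 digests
print(file_exponent, hash_exponent, file_exponent - hash_exponent)
# 25165824 256256 24909568 -> ~2^24909568 files share each set of hashes
```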
The reason that we are able to work with hashes, and that compression works in practice, is that we don't have to deal with every possible 3 MB file. Most of those files are gibberish and will never be seen, and finding two real files that match even one hash is incredibly difficult. But once we start talking about brute-forcing, we start encountering problems -- and having to dedicate an awful lot of processing power to the problem isn't the biggest one...
There is just one 3 MB file, and we divide that file into 1000 parts. I agree there will be collisions per hash (for each part). But I'm skeptical that all 1000 brute-forced parts would combine into a file that also collides with the hash of the original file (remember, we do have the hash of the original file). If the final hash does not match the hash of the original file, we would have to start over and brute-force new bits for each of the 1000 file parts.
Do you mean to say that collisions are guaranteed, and that a collision inside any file part will also cause a collision in the hash of the original file when the parts are combined?
Not every collection of 1000 correctly hashed parts will make a correctly hashed whole, but there are an awful lot of different collections of parts that will hash correctly (2^24909824 of them), and of those, one in 2^256 will also match the full-file hash.
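The same bookkeeping for that count, in case anyone wants to check it:

```python
# How many files match all 1000 per-part hashes, and how many of those also
# match the whole-file hash.
file_bits = 3 * 1024 * 1024 * 8
match_parts_exp = file_bits - 1000 * 256   # 2^24909824 files match every part hash
match_all_exp = match_parts_exp - 256      # of those, 1 in 2^256 match the full hash
print(match_parts_exp, match_all_exp)      # 24909824 24909568
```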