Hacker News new | past | comments | ask | show | jobs | submit login

Except if the hash is smaller than the source data, then with a good hash, there will be multiple source datasets that hash to the same result, which makes your decompression program unreliable. You could well bruteforce the wrong answer.



Do you mean that there could be two 3K files with the same sha256 hash and the probability of hitting the collision is greater than the probability of finding the correct hash for the file ? Let's divide the 3 mb file into 1000 parts, each having a ~3k size. Take sha256 hash of all 1000 parts and sha256 hash of the actual file. The size of these hashes are less than the actual file + leaves ample room for a decompressor. Now start brute-forcing and assume we have all the run-time power and time. Would the probability of collision happening in all 1000 parts be very low then ? Given a good hashing function could it be so low that we can disregard it ? If 1000 is not enough, can we increase the splits to 2000 or 3000 ?


It's not that there might be a collision, but that collisions are guaranteed. How many 3Mb files are there? 2^(3M8) = 2^25165824. How many distinct hashes are there? 2^(2561001) = 2^256256.

By the pigeonhole principle, we can't fit 2^25165824 objects into 2^256256 holes; indeed each file will have on average 2^24909568 other files that share the same set of 1001 sha256 hashes.

The reason that we are able to work with hashes, and that compression works in practice, is that we don't have to deal with every 3M file. Most of these files are gibberish and will never be seen, and finding two files that match even one hash is incredibly difficult. But once we start talking about brute-forcing, we start encountering problems -- and having to dedicate an awful lot of processing power to the problem isn't the biggest one...


There is just one 3MB file and we divide that file into 1000 parts. I agree there will be collision per hash (for each part). But I'm skeptical that all 1000 hashes will produce such bits so that the final file will cause collision on the hash of the original file (remember we do have hash of the original file). If the final hash does not match the hash of original file we would have to recompute all the hashes again by randomly generating the bits for each 1000 file-parts. Do you mean to say that the collisions are guaranteed and that the collision inside any file-part will also cause a collision in the original file hash when the parts are combined ?


Not every collection of 1000 correctly hashed parts will make a correctly hashed whole, but there are an awful lot of different collections of parts that will hash correctly (2^24909824 permutations of them) and of those, one in 2^256 will also match the full-file hash.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: