
Quote from Mike:

    > Rather, you simply split the file into 218 parts ending with the
    > character "5" and then stripped that final character from each part.  Thus the
    > "decompressor" is nothing more than a reassembler,
    > concatenating the parts and reappending
    > the character "5" after each.
Well, that's exactly the definition of lossless compression. Look at how e.g. js crunch works: you build a dictionary of common sequences, split the file on those sequences recursively, and then reassemble it by joining in reverse. Gzip, bzip2, &c, &c, it's all the same thing: split the file on a common sequence and reassemble it by rejoining on that sequence. Patrick just created a customized compressor that goes only one level deep.

Normally you'd need a delimiter to separate those chunks, one that doesn't occur in the chunks themselves, e.g. made safe through padding or escaping. That, in turn, increases the file size, and now you're in trouble. What Patrick did was use EOF as a fresh "delimiter" that doesn't occur anywhere in the data, and at a cost of zero bytes, no less.

Cheating, or inventive.
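
For illustration, here's a minimal sketch of that scheme in Python (the function names and the choice of split byte are mine, and it splits on every occurrence rather than the 218 specific points from the quote): the "compressor" cuts the data at the chosen byte and would store each part as its own file, letting end-of-file stand in for the stripped byte; the "decompressor" just joins the parts back together with that byte.

    SPLIT = b"5"

    def compress(data):
        # each part (except possibly the last) originally ended in SPLIT;
        # bytes.split drops that byte, which is the whole "compression"
        return data.split(SPLIT)

    def decompress(parts):
        # re-append the stripped byte between parts; in the real scheme the
        # parts would be read back from separate files, in order, up to EOF
        return SPLIT.join(parts)

    data = b"any old data with 5s scattered through it: 5 55 5"
    parts = compress(data)              # each part would become its own file
    assert decompress(parts) == data
    # the counted bytes across all parts shrink by one per split point; that
    # information now lives in the number and sizes of the part files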




The EOF doesn't come at a cost of zero bytes; it costs as much as storing the length of each constituent file. The extra space used is in the file system's accounting.


It's at a cost of zero bytes of competition score. Mike screwed up by allowing an alphabet of 257 symbols and then only counting 256 of them. Pretty much any compression or repacking algorithm could have been used at that point.


Dylan16807, that's a very concise way to put it; thanks for making that comment.


In return, Patrick screwed up by relying on the extra filesystem metadata without actually ensuring that it would be left unaltered; only the file contents were guaranteed to be left untouched.
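
To put rough, assumed numbers on that accounting point (the part count comes from the quote above; the 4-byte size field is just an illustrative minimum, and real per-file filesystem overhead is larger):

    parts = 218             # parts ending in "5", per Mike's description
    saved = parts * 1       # one counted byte removed per part
    size_field = 4          # assume even a bare 4-byte length per extra file
    metadata = parts * size_field
    print(saved, metadata)  # 218 bytes shaved off the score, at least 872 bytes of accounting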


It's inventive, but what if instead of doing it at the byte level, we did it at the bit level? Instead of splitting files on every "5" encountered, we split them at every 1 in binary, which would give about 50% "compression".


That would essentially be a kind of run-length encoding. I doubt that a 50% compression ratio is achievable, though.
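
A quick sketch of that bit-level variant, just to make the run-length-encoding point concrete (bits represented as a string purely for clarity): each part is a run of zeros, so stripping the 1s only moves their positions into the part boundaries.

    def split_on_ones(bits):
        return bits.split("1")       # parts are (possibly empty) runs of "0"

    def reassemble(parts):
        return "1".join(parts)       # put the stripped 1s back

    original = "0110100010111"
    parts = split_on_ones(original)  # ['0', '', '0', '000', '0', '', '', '']
    assert reassemble(parts) == original
    # roughly half the bits are gone from the parts, but storing each part
    # boundary (a separate file, or even a length field) costs far more than
    # the single bit saved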


I wonder how you got a downvote for that; it's quite accurate. A version of this simplistic token replacement even has a Wikipedia page: http://en.wikipedia.org/wiki/Byte_pair_encoding
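
For reference, a small sketch of byte pair encoding along the lines of that page (symbols kept as a list of ints so that values >= 256 can serve as the "unused" replacement tokens; a toy, not an efficient implementation):

    from collections import Counter

    def bpe_compress(data, unused_symbols):
        table = []
        for sym in unused_symbols:
            pairs = Counter(zip(data, data[1:]))
            if not pairs:
                break
            (a, b), count = pairs.most_common(1)[0]
            if count < 2:
                break                          # no pair worth replacing
            out, i = [], 0
            while i < len(data):
                if i + 1 < len(data) and (data[i], data[i + 1]) == (a, b):
                    out.append(sym)            # replace the most common pair
                    i += 2
                else:
                    out.append(data[i])
                    i += 1
            data = out
            table.append((sym, a, b))
        return data, table

    def bpe_decompress(data, table):
        for sym, a, b in reversed(table):      # undo replacements in reverse
            expanded = []
            for x in data:
                expanded.extend([a, b] if x == sym else [x])
            data = expanded
        return data

    text = list(b"abababcabab")
    packed, table = bpe_compress(text, unused_symbols=[256, 257])
    assert bpe_decompress(packed, table) == text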



