This is my blog post; happy to answer if anyone has any questions.

I'll add that since I wrote these two blog posts, other people have sent me more interesting work along these lines:

(1) I link to this at the end of the post (using zstd dictionaries): https://github.com/cyrilou242/ftcc

(2) Today someone sent me this (bag-of-words doing better than gzip): https://arxiv.org/abs/2307.15002v1




No questions from me. Just want to say: Thank you for doing all this work!


Your conclusion: “using ideas from text compression for text classification tasks is an interesting idea and may lead to other interesting research.”

Would you say this idea is interesting enough for you personally to research it further?


Text-similarity embeddings aren't very interesting and will correlate with gzip, especially when the test itself is text similarity and the texts being tested have distinct vocabularies.

The really useful ones are based on SBERT, and measure the likelihood that the answer is contained in the text that was embedded.

An example from my unit tests: "what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but a very weak match with "my jewelry is stored safely in the safe".
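
A rough sketch of this kind of check using the sentence-transformers library (the specific model name below is an illustrative assumption, not necessarily the setup described above):

    # Hedged sketch: score a question against candidate sentences with a
    # QA-tuned sentence-transformers model; the model choice is an assumption.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

    query = "what is my safe passcode?"
    candidates = [
        "my lockbox pin is 1234",
        "my jewelry is stored safely in the safe",
    ]

    query_emb = model.encode(query, convert_to_tensor=True)
    cand_embs = model.encode(candidates, convert_to_tensor=True)

    # Cosine similarity: the sentence that actually contains the answer should
    # score noticeably higher than the one that merely shares surface words.
    for text, score in zip(candidates, util.cos_sim(query_emb, cand_embs)[0]):
        print(f"{score.item():.3f}  {text}")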

I learned this from https://news.ycombinator.com/item?id=35377935. Thank you to whoever posted it; it blew my mind and gave me a powerful differentiator.


Found this fascinating, thanks. I’ve been circling SBERT over the last few weeks (along with a range of other techniques to improve the quality of retrieval). Reading your comment and the linked post and comments has really cemented for me that we’re on the right track.


For me, no. Mainly because "text classification" is a pretty limited application and one I don't plan to spend much time on. For NLP tasks that require a deeper "understanding", I don't see how compression algorithms can help much (at least directly).


Just conceptually, compression is an analog of understanding.

To be able to compress something, you need to understand it first.

We use this every day: we compress things by naming them.

Once we name something, we don’t need to explain or describe it; we can just use the name instead.

That allows us to compress our communications, and it directly affects the parties’ understanding of the information.

That’s just conceptually. At a math/algorithm level, I don’t really know the specifics of your research or the paper in question.


One could say that you need to understand something about the artifact you are compressing, but to be clear, you can compress text without understanding anything about its semantic content; this is what gzip does. The only understanding needed for that level of compression is that the thing being compressed is a string over a binary alphabet.


Of course, which is why gzip is a good baseline for "better" compressors that do have semantic understanding.

The whole idea of an autoencoder is conceptual compression. You take a concept (say: human faces) and create a compressor that is so overfit to that concept that when given complete gobbledygook (random seed data) it decompresses that into something with semantic meaning!
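
A toy PyTorch sketch of that framing (arbitrary dimensions, untrained, purely illustrative):

    # The encoder "compresses" an input to a small latent code, the decoder
    # "decompresses" it back; after training on a single concept, even random
    # codes decode into something resembling that concept.
    import torch
    import torch.nn as nn

    input_dim, latent_dim = 784, 16   # e.g. a 28x28 image squeezed to 16 numbers

    encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                            nn.Linear(128, latent_dim))
    decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                            nn.Linear(128, input_dim))

    x = torch.rand(1, input_dim)       # stand-in for a real training example
    code = encoder(x)                  # "compression": 784 values -> 16
    reconstruction = decoder(code)     # "decompression" back to input space

    # The "gobbledygook" case: feed the decoder random latent data.
    sample = decoder(torch.randn(1, latent_dim))
    print(code.shape, reconstruction.shape, sample.shape)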


It may sound strange out of context, but the most memorable quote I've encountered in any book or any piece of writing anywhere, at least in terms of informing my own understanding of language and the construction of meaning through communication, came in a book on screenwriting by William Goldman. The guy who wrote The Princess Bride, of all things.

The sentence was simply, (and in capitals in the original), "POETRY IS COMPRESSION."


Would make a good haiku line 2


Yes, I agree. That's why I said "directly" (with regard to compression algorithms used for understanding). Indirectly, yes, compression and intelligence/understanding are closely related.


Thanks for linking to these other results. I found them very interesting. The latter is simply doing set intersection counts to measure distance, and it works well relative to the original technique. Has anyone compared the accuracy of these to naive Bayes?
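
For reference, a minimal sketch of a set-intersection distance of that flavor (a loose reading of the idea, not the paper's exact formulation):

    # Represent each document as a set of words and score similarity by the
    # size of the overlap (optionally Jaccard-normalized), then classify with
    # 1-nearest-neighbour, analogous to the compression-distance kNN setup.
    def word_set(text: str) -> set:
        return set(text.lower().split())

    def overlap(a: str, b: str) -> int:
        return len(word_set(a) & word_set(b))

    def jaccard(a: str, b: str) -> float:
        union = word_set(a) | word_set(b)
        return len(word_set(a) & word_set(b)) / len(union) if union else 0.0

    train = [("the team won the match", "sports"),
             ("the gpu runs the model", "tech")]

    def classify(text: str) -> str:
        return max(train, key=lambda pair: overlap(text, pair[0]))[1]

    print(classify("who won the match"))   # -> "sports" (overlap 3 vs 1)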


I'm idly curious how much of a speedup you achieved.


I don't have complete numbers on this (I think it depends a lot on the size of the training set), but for one dataset, normalized time for a batch:

    original    : 1.000
    precomputed : 0.644 (first improvement)
    gziplength  : 0.428 (+ 2nd improvement)
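
For illustration, a rough sketch of the basic precomputation idea (caching each training text's compressed length so it isn't recomputed for every query pair); this is a toy version, not the actual code behind the numbers above:

    # Hedged sketch: NCD-based 1-nearest-neighbour with cached training lengths.
    import gzip

    def clen(data: bytes) -> int:
        return len(gzip.compress(data))

    train = [(b"the team won the match", "sports"),
             (b"the gpu runs the model", "tech")]

    # Compute C(y) once per training example, not once per (query, example) pair.
    cached = [(y, label, clen(y)) for y, label in train]

    def classify(x: bytes) -> str:
        cx = clen(x)
        def ncd(y: bytes, cy: int) -> float:
            cxy = clen(x + b" " + y)
            return (cxy - min(cx, cy)) / max(cx, cy)
        return min(cached, key=lambda t: ncd(t[0], t[2]))[1]

    print(classify(b"who won the match"))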



