Text-similarity embeddings aren't very interesting and will correlate with gzip, especially when the benchmark is itself a text-similarity task and the categories being tested have distinct vocabularies.
The really useful ones are based on SBERT, and measure the likelihood that the answer is contained in the text that was embedded.
ex. from my unit tests:
"what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but a very weak match to 'my jewelry is stored safely in the safe'
Found this fascinating, thanks. I’ve been circling SBERT over the last few weeks (along with a range of other techniques to improve the quality of retrieval). Reading your comment and the linked post and comments has really cemented for me that we’re on the right track.
For me, no. Mainly because "text classification" is a pretty limited application and one I don't plan to spend much time on. For NLP tasks that require a deeper "understanding", I don't see how compression algorithms can help much (at least directly).
One could say that you need to understand something about the artifact you are compressing, but, to be clear, you can compress text without understanding anything about its semantic content; that is what gzip does. The only understanding needed for that level of compression is that the thing being compressed is a string over a binary alphabet.
Of course, which is why gzip is a good baseline for "better" compressors that do have semantic understanding.
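For anyone following along, that gzip baseline usually means normalized compression distance plus nearest-neighbour classification. A rough sketch of the idea, not the exact code from the paper or posts:

    import gzip

    def clen(s: str) -> int:
        # length of the gzip-compressed UTF-8 bytes of s
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a: str, b: str) -> float:
        # normalized compression distance: smaller means "more similar"
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    def classify(text: str, labelled: list[tuple[str, str]]) -> str:
        # 1-nearest-neighbour over NCD: copy the label of the closest training example
        return min(labelled, key=lambda pair: ncd(text, pair[0]))[1]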
The whole idea of an autoencoder is conceptual compression. You take a concept (say: human faces) and create a compressor that is so overfit to that concept that when given complete gobbledygook (random seed data) it decompresses that into something with semantic meaning!
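A minimal PyTorch sketch of that framing, purely illustrative: the encoder is the "compressor" down to a small latent code, the decoder the "decompressor" back up, and after training on one concept even a random code decodes to something on-concept.

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, dim: int = 784, latent: int = 32):
            super().__init__()
            # encoder: lossy "compressor" down to a small latent code
            self.encoder = nn.Sequential(
                nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, latent)
            )
            # decoder: "decompressor" back up to the original space
            self.decoder = nn.Sequential(
                nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim), nn.Sigmoid()
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.decoder(self.encoder(x))

    # After training on, say, face images, even a random latent code
    # "decompresses" into something face-like rather than pure noise.
    model = AutoEncoder()
    random_code = torch.randn(1, 32)
    generated = model.decoder(random_code)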
It may sound strange out of context, but the most memorable quote I've encountered in any book or any piece of writing anywhere, at least in terms of informing my own understanding of language and the construction of meaning through communication, came in a book on screenwriting by William Goldman, the guy who wrote The Princess Bride, of all things.
The sentence was simply, (and in capitals in the original), "POETRY IS COMPRESSION."
Yes, I agree. That's why I said "directly" (with regard to compression algorithms used for understanding). Indirectly, yes, compression and intelligence/understanding are closely related.
Thanks for linking to these other results. I found them very interesting. The latter is simply doing set intersection counts to measure distance, and it works well relative to the original technique. Has anyone compared the accuracy of these to naive Bayes?
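For reference, that set-intersection idea boils down to something like the following (my own simplification; the paper's exact scoring may differ):

    def token_overlap_distance(a: str, b: str) -> float:
        # distance from shared unique tokens; the paper's exact formula may differ
        sa, sb = set(a.lower().split()), set(b.lower().split())
        if not sa or not sb:
            return 1.0
        # Jaccard-style normalization: more shared tokens -> smaller distance
        return 1.0 - len(sa & sb) / len(sa | sb)

    # For the naive Bayes comparison, sklearn's CountVectorizer + MultinomialNB
    # on the same train/test split would be the obvious baseline to run.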
I'll add that since I wrote these two blog posts, other people have sent me other interesting work:
(1) I link to this at the end of the post (using zstd dictionaries; see the sketch after this list): https://github.com/cyrilou242/ftcc
(2) today someone sent me this (bag-of-words better than gzip): https://arxiv.org/abs/2307.15002v1
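For (1), the core idea, sketched here with the python-zstandard package rather than FTCC's actual code, is to train one compression dictionary per class and pick the class whose dictionary compresses a new text smallest:

    import zstandard as zstd

    def train_class_dicts(texts_by_class: dict[str, list[str]], dict_size: int = 16384):
        # one zstd dictionary per class, trained on that class's training texts
        return {
            label: zstd.train_dictionary(dict_size, [t.encode("utf-8") for t in texts])
            for label, texts in texts_by_class.items()
        }

    def classify(text: str, class_dicts) -> str:
        # pick the class whose dictionary compresses the new text the most
        data = text.encode("utf-8")
        sizes = {
            label: len(zstd.ZstdCompressor(dict_data=d).compress(data))
            for label, d in class_dicts.items()
        }
        return min(sizes, key=sizes.get)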