Text-similarity embeddings aren't very interesting and will correlate with gzip, especially when the benchmark is itself a text-similarity task and the categories being tested have distinct vocabularies.
The really useful ones are based on SBERT, and measure the likelihood that the answer is contained in the text that was embedded.
ex. from my unit tests:
"what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but a very weak match to 'my jewelry is stored safely in the safe'
Found this fascinating, thanks. I’ve been circling SBERT over the last few weeks (along with a range of other techniques to improve the quality of retrieval). Reading your comment and the linked post and comments has really cemented for me that we’re on the right track.
For me, no. Mainly because "text classification" is a pretty limited application and one I don't plan to spend much time on. For NLP tasks that require a deeper "understanding", I don't see how compression algorithms can help much (at least directly).
One could say that you need to understand something about the artifact you are compressing, but, to be clear, you can compress text without understanding anything about its semantic content; that is what gzip does. The only understanding needed for that level of compression is that the thing being compressed is a string over a binary alphabet.
Of course, which is why gzip is a good baseline for "better" compressors that do have semantic understanding.
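For anyone following along, that gzip baseline usually means normalized compression distance plus nearest-neighbour classification. A rough sketch of the idea, not the exact code from the paper or posts:

    import gzip

    def clen(s: str) -> int:
        # length of the gzip-compressed UTF-8 bytes of s
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a: str, b: str) -> float:
        # normalized compression distance: smaller means "more similar"
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    def classify(text: str, labelled: list[tuple[str, str]]) -> str:
        # 1-nearest-neighbour over NCD: copy the label of the closest training example
        return min(labelled, key=lambda pair: ncd(text, pair[0]))[1]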
The whole idea of an autoencoder is conceptual compression. You take a concept (say: human faces) and create a compressor that is so overfit to that concept that when given complete gobbledygook (random seed data) it decompresses that into something with semantic meaning!
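A minimal PyTorch sketch of that framing, purely illustrative: the encoder is the "compressor" down to a small latent code, the decoder the "decompressor" back up, and after training on one concept even a random code decodes to something on-concept.

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, dim: int = 784, latent: int = 32):
            super().__init__()
            # encoder: lossy "compressor" down to a small latent code
            self.encoder = nn.Sequential(
                nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, latent)
            )
            # decoder: "decompressor" back up to the original space
            self.decoder = nn.Sequential(
                nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim), nn.Sigmoid()
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.decoder(self.encoder(x))

    # After training on, say, face images, even a random latent code
    # "decompresses" into something face-like rather than pure noise.
    model = AutoEncoder()
    random_code = torch.randn(1, 32)
    generated = model.decoder(random_code)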
It may sound strange out of context, but the most memorable quote I've encountered in any book or any piece of writing anywhere, at least in terms of informing my own understanding of language and the construction of meaning through communication, came in a book on screenwriting by William Goldman, the guy who wrote The Princess Bride, of all things.
The sentence was simply, (and in capitals in the original), "POETRY IS COMPRESSION."
Yes, I agree. That's why I said "directly" (with regard to compression algorithms used for understanding). Indirectly, yes, compression and intelligence/understanding are closely related.
Thanks for linking to these other results. I found them very interesting. The latter is simply doing set intersection counts to measure distance, and it works well relative to the original technique. Has anyone compared the accuracy of these to naive Bayes?
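For reference, that set-intersection idea boils down to something like the following (my own simplification; the paper's exact scoring may differ):

    def token_overlap_distance(a: str, b: str) -> float:
        # distance from shared unique tokens; the paper's exact formula may differ
        sa, sb = set(a.lower().split()), set(b.lower().split())
        if not sa or not sb:
            return 1.0
        # Jaccard-style normalization: more shared tokens -> smaller distance
        return 1.0 - len(sa & sb) / len(sa | sb)

    # For the naive Bayes comparison, sklearn's CountVectorizer + MultinomialNB
    # on the same train/test split would be the obvious baseline to run.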
I'll add that since I wrote these two blog posts, other people have sent me other interesting work:
(1) I link to this at the end of the post (using zstd dictionaries; see the sketch after this list): https://github.com/cyrilou242/ftcc
(2) today someone sent me this (bag-of-words better than gzip): https://arxiv.org/abs/2307.15002v1
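For (1), the core idea, sketched here with the python-zstandard package rather than FTCC's actual code, is to train one compression dictionary per class and pick the class whose dictionary compresses a new text smallest:

    import zstandard as zstd

    def train_class_dicts(texts_by_class: dict[str, list[str]], dict_size: int = 16384):
        # one zstd dictionary per class, trained on that class's training texts
        return {
            label: zstd.train_dictionary(dict_size, [t.encode("utf-8") for t in texts])
            for label, texts in texts_by_class.items()
        }

    def classify(text: str, class_dicts) -> str:
        # pick the class whose dictionary compresses the new text the most
        data = text.encode("utf-8")
        sizes = {
            label: len(zstd.ZstdCompressor(dict_data=d).compress(data))
            for label, d in class_dicts.items()
        }
        return min(sizes, key=sizes.get)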