Hacker News

Switching to Word2Vec embeddings led to a substantial improvement in my cosine-similarity evaluations for text similarity, though, granted, I was looking for actual similarity rather than relevance. I tried many different methods and got lots of mediocre results initially.

code: https://github.com/jimmc414/document_intelligence/blob/main/... https://github.com/jimmc414/document_intelligence
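The usual way to get a document-level score out of Word2Vec is to average the word vectors of each document and compare the averages with cosine similarity. A minimal sketch of that idea, with made-up toy vectors standing in for a real pretrained model (in practice you'd load something like gensim's KeyedVectors; the vocabulary and values here are purely illustrative):

```python
import math

# Toy stand-in for pretrained Word2Vec vectors (values are made up);
# in practice these would come from a trained model, e.g. gensim KeyedVectors.
vectors = {
    "invoice": [0.9, 0.1, 0.0],
    "bill":    [0.8, 0.2, 0.1],
    "payment": [0.7, 0.3, 0.2],
    "weather": [0.0, 0.9, 0.4],
}

def doc_vector(tokens):
    """Average the word vectors of the in-vocabulary tokens."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Documents about the same topic should score higher than unrelated ones.
sim_close = cosine(doc_vector(["invoice", "payment"]), doc_vector(["bill"]))
sim_far = cosine(doc_vector(["invoice", "payment"]), doc_vector(["weather"]))
print(sim_close > sim_far)
```

Averaging loses word order, which is part of why this measures topical "actual similarity" rather than relevance.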




Interesting, do you happen to have some quantitative results on this/additional insights/etc?

I've interpreted transformer vector similarity as "likelihood to be followed by the same thing", which is close to word2vec's "sum of likelihoods of all words to be replaced by the other set" (kinda), but also very different in some contexts.


There's no simplified definition like that; vectors can even capture logical properties. It all comes down to what the model was tuned for: https://www.sbert.net/examples/training/nli/README.html


This is very interesting. You had better results here than with OpenAI's ada-002 and other embeddings like BGE?


As opposed to SentenceBERT, or what?


DistilBERT and RoBERTa



