
Would you have some links in mind to models whose embeddings were pulled from LLMs?

The last one I have in mind is BERT and its variants.




BERT embeddings were what proved that taking embeddings from the last hidden state works (in that case, using the [CLS] token representation, which is IMO silly). Most of the top embedding models on the MTEB leaderboard are mean-pooled LLMs: https://huggingface.co/spaces/mteb/leaderboard
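In case "mean-pooled" sounds abstract, here is a minimal sketch of what that pooling usually looks like with the Hugging Face transformers library. The model name is just a placeholder, and real embedding models layer their own prompt formats and pooling details on top of this:

    # Minimal sketch of mean pooling over a transformer's last hidden state.
    # "bert-base-uncased" is a placeholder; any encoder or decoder model with
    # a last_hidden_state output works the same way.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    texts = ["An example sentence to embed."]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        last_hidden = model(**batch).last_hidden_state      # (batch, seq, dim)

    # Average only over real tokens, not padding.
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (batch, seq, 1)
    embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
    embeddings = torch.nn.functional.normalize(embeddings, dim=-1)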

The not-quite-large embedding model I like to use now is nomic-embed-text-v1.5 (based on a BERT architecture), which supports an 8192-token context window and Matryoshka Representation Learning (MRL) for reducing the dimensionality if needed: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
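A rough usage sketch, assuming the sentence-transformers integration and the "search_document:"/"search_query:" task prefixes described on the model card (the card's exact truncation recipe may include an extra normalization step; this just shows the general MRL idea of truncate-and-renormalize):

    # Rough sketch, not the canonical recipe from the model card.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                                trust_remote_code=True)

    docs = ["search_document: Embeddings are dense vector representations of text."]
    emb = model.encode(docs)                    # full-dimensional vectors

    # Matryoshka (MRL): keep a prefix of the dimensions and renormalize.
    dim = 256
    small = emb[:, :dim]
    small = small / np.linalg.norm(small, axis=1, keepdims=True)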


Works for what? The leaderboards? tiktoken BPE tokens, GPT-2 BPE tokens, SentencePiece, GloVe, word2vec, ... take your pick; they all end up in a latent space of arbitrary dimensionality and arbitrary vocab size where they can be mapped onto images. This is never going to work for language. The only thing the leaderboards are good for is enabling you to charge more for your model than everyone else for a month or two. The only meaning hyperparameters like dimensionality and vocab size have is their message that more is always better and scaling up is what matters.


Works for:

Bitext mining: given a sentence in one language, find its translation in a collection of sentences in another language using the cosine similarity of embeddings.

Classification: identify the kind of text you're dealing with using logistic regression on the embeddings.

Clustering: group similar texts together using k-means clustering on the embeddings.

Pair classification: determine whether two texts are paraphrases of each other by using a binary threshold on the cosine similarity of the embeddings.

Reranking: given a query and a list of potential results, sort relevant results ahead of irrelevant ones according to the cosine similarity of embeddings (a quick sketch follows below).

Etc etc.

These are MTEB benchmark tasks: https://arxiv.org/pdf/2210.07316.pdf. If you have no need for something like that, good for you; you don't need to care how well embeddings work for these tasks.
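To make concrete how several of those tasks bottom out in the same operation, here's a toy sketch of reranking and pair classification on precomputed vectors. The vectors would come from whatever embedding model you use, and the 0.8 threshold is an arbitrary illustration, not a recommended value:

    # Toy sketch: reranking and pair classification both reduce to
    # cosine similarity between embedding vectors.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rerank(query_vec: np.ndarray, candidate_vecs: list[np.ndarray]) -> list[int]:
        # Return candidate indices sorted from most to least similar.
        scores = [cosine(query_vec, c) for c in candidate_vecs]
        return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)

    def is_paraphrase(vec_a: np.ndarray, vec_b: np.ndarray, threshold: float = 0.8) -> bool:
        # Pair classification: binary threshold on cosine similarity.
        return cosine(vec_a, vec_b) >= threshold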


Easy there, Firthmeister. I'm familiar with the canon. If getting some desirable behavior in your application is good enough for you, then feel free to ignore what I'm saying.


Embeddings and vector stores wouldn't have taken off in the way that they did if they didn't actually work.


They've taken off because they have utility in information retrieval systems. They work for getting info into Google (Stanford) Knowledge Panels. I don't think it really goes any further than that. They are most useful to the few orgs that went from dominating NLP research to controlling it outright by convincing everyone scale is the only way forward and owning scale. Alternatives to word embeddings aren't even considered or discussed. They are assumed as a starting point for pretty much all work in NLP today even though they are as uninteresting today as they were when word2vec was published in 2013. They do not and will not work for language.


I'm not sure if it's the most modern setup there is, but https://www.youtube.com/watch?v=UPtG_38Oq8o gives an exceptionally friendly explanation.



