IMO we are well past peak cosine-similarity-search as a service. Most people I talk to in the space don't bother using specialized vector DBs for that.

I think there's space for a much more interesting product that is longer-lived (since it's harder to implement than just cosine-similarity-search on vectors), which is:

1. Fine-tuning OSS embedding models on your real-world query patterns

2. Storing and recomputing embeddings for your data as you update the fine-tuned models

MTEB averages are fine, but hardly anyone uses the average result: most use cases are specialized (e.g. classification vs clustering vs retrieval). The best models try to be decent at all of those, but I'd bet that fine-tuning on a specific use case would beat a general-purpose model, especially on your own dataset (your retrieval is probably meaningfully different from someone else's: code retrieval vs document Q&A, for example). And your queries are usually specialized! People using embeddings for RAG are generally not also trying to use the same embeddings for clustering or classification, and the reverse is true too (your recommendation system is likely different from your search system).

And if you're fine-tuning new models regularly, you need storage + management, since you'll need to recompute the embeddings every time you deploy a new model.

I would pay for a service that made (1) and (2) easy.
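
To make (2) concrete, here's a rough Python sketch of the storage/recompute side, with an illustrative sqlite schema and a stock model name standing in for whatever fine-tuned checkpoint you just deployed:

    # Rough sketch of (2): keep embeddings keyed by (doc_id, model_version) so that
    # deploying a new fine-tuned checkpoint triggers a recompute for any doc that
    # doesn't yet have a vector under that version. Schema and helper are illustrative.
    import sqlite3
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # In practice this would be your latest fine-tuned checkpoint, not a stock model.
    MODEL_VERSION = "sentence-transformers/all-MiniLM-L6-v2"

    db = sqlite3.connect("embeddings.db")
    db.execute("""CREATE TABLE IF NOT EXISTS embeddings
                  (doc_id TEXT, model_version TEXT, vector BLOB,
                   PRIMARY KEY (doc_id, model_version))""")

    def re_embed(docs: dict[str, str], model_version: str) -> None:
        """Recompute embeddings for docs that lack a vector under this model version."""
        missing = [d for d in docs
                   if db.execute("SELECT 1 FROM embeddings WHERE doc_id=? AND model_version=?",
                                 (d, model_version)).fetchone() is None]
        if not missing:
            return
        model = SentenceTransformer(model_version)
        vectors = model.encode([docs[d] for d in missing], normalize_embeddings=True)
        for doc_id, vec in zip(missing, vectors):
            db.execute("INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?)",
                       (doc_id, model_version, np.asarray(vec, dtype=np.float32).tobytes()))
        db.commit()

    re_embed({"doc-1": "RD TSHRT XS", "doc-2": "blue denim jacket, large"}, MODEL_VERSION)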




I no longer work there, but Lucidworks has had embedding training as a first-class feature in Fusion since January 2020 (I know because I wrapped up adding it just as COVID became a thing). We definitely saw that even with slightly out-of-band use of language - e.g. e-commerce strings like "RD TSHRT XS" - embedding search with open (and closed) models would fall below bog-standard* BM25 lexical search. Once you trained a model, performance would kick up above lexical search…and if you combined lexical _and_ vector search, things were great.

Also, a member of our team developed an amazing RNN-based model that, to this day, beats the pants off most embedding models when it comes to speed, and is no slouch on CPU either…

(* I'm being harsh on BM25 - it is a baseline that people often forget in vector search, but it can be a tough one to beat at times)
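
For anyone wondering what "combined lexical _and_ vector search" can look like mechanically, here's a minimal sketch of one common fusion approach, reciprocal rank fusion; the doc ids and the two ranked lists are placeholders standing in for real BM25 and embedding-index results:

    # Sketch of reciprocal rank fusion (RRF), one common way to combine a lexical
    # ranking with a vector-search ranking over the same query.
    from collections import defaultdict

    def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
        """Fuse ranked lists of doc ids; k dampens the influence of top ranks."""
        scores = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_hits   = ["sku-42", "sku-17", "sku-99"]   # lexical hits for "RD TSHRT XS"
    vector_hits = ["sku-42", "sku-03", "sku-17"]   # embedding hits for the same query
    print(rrf([bm25_hits, vector_hits]))           # -> ['sku-42', 'sku-17', ...]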


Heh. A lot of what search people have known for a while is suddenly being re-learned by the population at large, in the context of RAG, etc. :)


The thing with tech is, if you're too early, it's not like you eventually get discovered and adopted.

When the time is finally right, people just "invent" what you made all over again.


Totally. And this has even happened in search. Open source search engines like Elasticsearch, etc. did this... Google etc. did this in the early Web days, and so on :)


Sorry, what is it that people in search _have_ known?

I know nothing about search, but a bit about ML, so I'm curious


That ranking is a lot more complicated than cosine similarity on embeddings


What’s the model?


We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems, with everything that comes with that. I would love to chat about 1 and 2, so feel free to email me (email is in my profile).


> 1. Fine-tuning OSS embedding models on your real-world query patterns

This is not as easy as you make it sound :) Typically, the embeddings are multi-modal: the query string maps to a relevant document that I want to add as context to my prompt. If I collect lots of new query strings, I need to know the ground-truth "relevant document" each one maps to. Then I can use the two-tower embedding model to learn the "correct" document/context for a query.

I have thought about this problem for LLMs that do function calling. What you can do is collect query strings and the function-calling results, then ask GPT-4: "is this a 'good' answer?". GPT-4 can be a teacher model for collecting training data for my two-tower embedding model.

Reference: https://www.hopsworks.ai/dictionary/two-tower-embedding-mode...
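
For a rough idea of the training step itself, here's a minimal sentence-transformers-style sketch over (query, relevant document) pairs; the base model and the pairs are toy placeholders, and the relevance labels would come from click logs or the GPT-4 teacher above:

    # Fine-tune an embedding model on (query, relevant-doc) pairs.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    # Each example pairs a logged user query with the document judged relevant for it.
    train_examples = [
        InputExample(texts=["rd tshrt xs", "Red cotton t-shirt, size XS"]),
        InputExample(texts=["rotate api keys", "Docs: rotating API credentials"]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

    # In-batch negatives: other docs in the batch act as negatives for each query.
    train_loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)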


I think the fact that fine-tuning embeddings well isn't easy is exactly why it's a more useful service than hosted cosine-similarity search ;)



I've been working on (3): embedding translation, with the goal of translating something like OpenAI embeddings to UAE-Large. So far I have had success using the translated embeddings for cosine similarity, with around a 99.99% validation rate, but only 80% using Euclidean distance.


I’m fascinated by embedding translations and by compatible embeddings with different numbers of dimensions. Can you share more about your work / findings?


I mean, the simplest answer is a matmul... Given embeddings x and y, find M such that Mx ~= y. Easy to train, so long as you've got access to both models to compute embeddings over whatever you're interested in...

(Easy to extend to a two-layer MLP as needed. Maybe ensure that x and y are zero-mean and unit-length to make training the matmul a bit easier.)
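
Something like this, as a numpy sketch; the random arrays stand in for real paired embeddings of the same texts, and the dimensions roughly match ada-002 -> UAE-Large:

    # Given paired embeddings of the same texts from both models, solve for a linear
    # map with least squares: X @ M ~= Y (row-vector form of "Mx ~= y").
    import numpy as np

    rng = np.random.default_rng(0)
    n, d_src, d_tgt = 4096, 1536, 1024
    X = rng.standard_normal((n, d_src))   # source-model embeddings (placeholder)
    Y = rng.standard_normal((n, d_tgt))   # target-model embeddings (placeholder)

    # Zero-mean, unit-length rows, as suggested above, tend to make the fit easier.
    def normalize(A):
        A = A - A.mean(axis=0)
        return A / np.linalg.norm(A, axis=1, keepdims=True)

    Xn, Yn = normalize(X), normalize(Y)
    M, *_ = np.linalg.lstsq(Xn, Yn, rcond=None)   # shape (d_src, d_tgt)

    translated = Xn[:5] @ M                       # map a few source vectors across
    print(translated.shape)                       # (5, 1024)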


This sounds to me like what https://rungalileo.io is offering


Question: does this require specialized hardware at all? GPUs?


It doesn't require it in theory, but in practice it's required because CPUs are too slow at fine-tuning and computing embeddings.


Are you aware of any service or OSS solution for this?



