
IMO we are well past peak cosine-similarity-search as a service. Most people I talk to in the space don't bother using specialized vector DBs for that.

I think there's space for a much more interesting product that is longer-lived (since it's harder to implement than just cosine-similarity-search on vectors), which is:

1. Fine-tuning OSS embedding models on your real-world query patterns

2. Storing and recomputing embeddings for your data as you update the fine-tuned models.

MTEB averages are fine, but hardly anyone uses the average result: most use cases are specialized (e.g. classification vs clustering vs retrieval). The best models try to be decent at all of those, but I'd bet that fine-tuning on a specific use case would beat a general-purpose model, especially on your own dataset (your retrieval is probably meaningfully different from someone else's: code retrieval vs document Q&A, for example). And your queries are usually specialized! People using embeddings for RAG are generally not also trying to use the same embeddings for clustering or classification; and the reverse is true too (your recommendation system is likely different from your search system).

And if you're fine-tuning new models regularly, you need storage + management, since you'll need to recompute the embeddings every time you deploy a new model.

I would pay for a service that made (1) and (2) easy.
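To make point (2) concrete, here is a minimal sketch of the bookkeeping involved: keep the raw text next to each embedding, tag every embedding with the model version that produced it, and recompute whatever is stale when a new fine-tuned model ships. `EmbeddingStore`, `toy_embed_v1` and `toy_embed_v2` are hypothetical names made up for illustration, and the toy embed functions stand in for real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class EmbeddingStore:
    """Keeps raw text next to its embedding and the model version that made it."""
    texts: Dict[str, str] = field(default_factory=dict)
    vectors: Dict[str, Vector] = field(default_factory=dict)
    versions: Dict[str, str] = field(default_factory=dict)

    def put(self, doc_id: str, text: str) -> None:
        # New or changed text invalidates any embedding stored for this id.
        self.texts[doc_id] = text
        self.vectors.pop(doc_id, None)
        self.versions.pop(doc_id, None)

    def stale_ids(self, model_version: str) -> List[str]:
        # A document is stale if it has no embedding from this model version.
        return [d for d in self.texts if self.versions.get(d) != model_version]

    def reembed(self, embed: Callable[[str], Vector], model_version: str) -> int:
        # Recompute only what is stale; return how many documents were touched.
        stale = self.stale_ids(model_version)
        for doc_id in stale:
            self.vectors[doc_id] = embed(self.texts[doc_id])
            self.versions[doc_id] = model_version
        return len(stale)


# Toy stand-ins for two successive fine-tuned models (assumptions, not real models).
def toy_embed_v1(text: str) -> Vector:
    return [float(len(text)), 0.0]


def toy_embed_v2(text: str) -> Vector:
    return [float(len(text)), 1.0]


store = EmbeddingStore()
store.put("a", "red shirt")
store.put("b", "blue jeans")
store.reembed(toy_embed_v1, "v1")  # first pass embeds both documents
store.reembed(toy_embed_v2, "v2")  # deploying v2 forces a full recompute
```

A real service would also need batching, backfills and a way to serve queries during the switch, none of which this sketch attempts.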




I no longer work there, but Lucidworks has had embedding training as a first-class feature in Fusion since January 2020 (I know because I wrapped up adding it just as COVID became a thing). We definitely saw that even with just slightly out-of-band use of language - e.g. in e-commerce, things like "RD TSHRT XS", embedding search with open (and closed) models would fall below bog-standard* BM25 lexical search. Once you trained a model, performance would kick up above lexical search…and if you combined lexical _and_ vector search, things were great.

Also, a member on our team developed an amazing RNN-based model that still today beats the pants off most embedding models when it comes to speed, and is no slouch on CPU either…

(* I'm being harsh on BM25 - it is a baseline that people often forget in vector search, but it can be a tough one to beat at times)
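The comment above does not say how lexical and vector results were combined. One common way to do it, shown here purely as an illustration and not as what Fusion does, is reciprocal rank fusion: each document scores 1 / (k + rank) in every ranking it appears in, and the scores are summed. The function name and the document ids are made up; k = 60 is a commonly used default.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical_hits = ["a", "b", "c"]  # e.g. top results from BM25
vector_hits = ["c", "a", "d"]   # e.g. top results from embedding search
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
# fused == ["a", "c", "b", "d"]
```

Rank-based fusion sidesteps the fact that BM25 scores and cosine similarities live on different scales.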


Heh. A lot of what search people have known for a while, is suddenly being re-learned by the population at large, in the context of RAG, etc :)


The thing with tech is, if you're too early, it's not like you eventually get discovered and adopted.

When the time is finally right, people just "invent" what you made all over again.


Totally. And this has even happened in search. Open source search engines like Elasticsearch, etc did this... Google etc did this in the early Web days, and so on :)


Sorry, what is it that people in search _have_ known?

I know nothing about search, but a bit about ML, so I'm curious


That ranking is a lot more complicated than cosine similarity on embeddings


What’s the model?


We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems and everything that comes with that. I would love to chat about 1 and 2 so feel free to email me (email is in my profile).


> 1. Fine-tuning OSS embedding models on your real-world query patterns

This is not as easy as you make it sound :) Typically, the embeddings are multi-modal: the query string maps to a relevant document that I want to add as context to my prompt. If I collect lots of new query strings, I need to know the ground truth "relevant document" each one maps to. Then I can use the two-tower embedding model to learn the "correct" document/context for a query.

I have thought about this problem for LLMs that do function calling. And what you can do is collect query strings and the function calling results, and ask GPT-4 - "is this a 'good' answer?". GPT-4 can be a teacher model for collecting training data for my two-tower embedding model.

Reference: https://www.hopsworks.ai/dictionary/two-tower-embedding-mode...
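A rough sketch of the teacher-labeling loop described above, assuming the logged data is a list of (query, retrieved document) pairs. The `judge` callable is a placeholder for the call to the teacher model; `toy_judge` below is a trivial word-overlap stand-in so the snippet runs without any API, and all names here are hypothetical.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Example(NamedTuple):
    query: str
    document: str
    label: int  # 1 = teacher judged the pair relevant, 0 = not relevant


def label_pairs(
    logged: Iterable[Tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> List[Example]:
    """Turn logged (query, document) pairs into labeled training examples."""
    return [Example(q, d, int(judge(q, d))) for q, d in logged]


def toy_judge(query: str, document: str) -> bool:
    # Stand-in for asking a teacher LLM "is this a good answer?".
    doc_words = set(document.lower().split())
    return any(word in doc_words for word in query.lower().split())


logged = [
    ("reset password", "How to reset your password"),
    ("reset password", "Billing FAQ"),
]
training_set = label_pairs(logged, toy_judge)
```

The positive and negative pairs can then feed whatever contrastive or two-tower training setup you use.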


I think the fact that finetuning embeddings well isn't easy is why it's a more useful service than hosted cosine similarity search ;)



I've been working on (3) embeddings translation with the goal being to translate something like OpenAI embeddings to UAE-Large. So far, I have had success using them for cosine similarity with around a 99.99% validation rate, but only 80% using Euclidean distance.


I’m fascinated by embeddings translations and compatible embeddings with different numbers of dimensions. Can you share more about your work / findings?


I mean, the simplest answer is a matmul... Given embedding x, y, find M such that Mx ~= y. Easy to train so long as you've got access to both models to compute embedding over whatever you're interested in...

(Easy to extend to a two-layer MLP as needed. Maybe ensure that x and y are zero mean and unit length to make training the matmul a bit easier.)
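A minimal sketch of the "find M such that Mx ~= y" idea, using plain gradient descent on mean squared error and only the standard library so it runs anywhere. The data is synthetic: `M_TRUE`, `XS` and `YS` are made up for the demo, with 3-dimensional inputs mapped to 2-dimensional outputs to show that the two embedding spaces need not have the same size. In practice you would compute the x and y vectors with the two real models and probably use a numerical library.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product M @ x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def fit_linear_map(xs: List[Vector], ys: List[Vector],
                   lr: float = 0.1, epochs: int = 500) -> Matrix:
    """Fit M minimizing the mean of ||M x - y||^2 / 2 by batch gradient descent."""
    d_in, d_out, n = len(xs[0]), len(ys[0]), len(xs)
    M = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(epochs):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for x, y in zip(xs, ys):
            pred = matvec(M, x)
            for i in range(d_out):
                err = pred[i] - y[i]
                for j in range(d_in):
                    # Gradient of the loss w.r.t. M[i][j], averaged over samples.
                    grad[i][j] += err * x[j] / n
        for i in range(d_out):
            for j in range(d_in):
                M[i][j] -= lr * grad[i][j]
    return M


# Synthetic demo: source vectors are 3-d, target vectors are 2-d.
random.seed(0)
M_TRUE = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]
XS = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
YS = [matvec(M_TRUE, x) for x in XS]
M_FIT = fit_linear_map(XS, YS)  # should land close to M_TRUE
```

Real embedding spaces are not exactly linearly related, so expect a residual error there; that is where the MLP extension mentioned above comes in.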


This sounds to me like what https://rungalileo.io is offering


Question, does this require specialized hardware at all? GPUs?


It doesn't require it in theory, but in practice it's required because CPUs are too slow at fine-tuning and computing embeddings.


Are you aware of any service or OSS solution for this?



