Do we think about vector storage wrong? (hachyderm.io)
139 points by softwaredoug on Sept 5, 2023 | 89 comments



The focus on the top 10 in vector search is a product of wanting to prove value over keyword search. Keyword search is going to miss some conceptual matches. You can try to work around that with tokenization and complex queries covering all the variations, but it's not easy.

Vector search isn't all that new a concept. For example, the annoy library (https://github.com/spotify/annoy) has been around since 2014. It was one of the first open source approximate nearest neighbor libraries. Recommendations have always been a good use case for vector similarity.

Recommendations are a natural extension of search, and transformer models made building the vectors for natural language possible. To prove the worth of vector search over keyword search, the focus was always on showing how the top N matches include results not possible with keyword search.

In 2023, there has been a shift towards acknowledging keyword search also has value and that a combination of vector + keyword search (aka hybrid search) operates in the sweet spot. Once again this is validated through the same benchmarks which focus on the top 10.

On top of all this, there is also the reality that the vector database space is very crowded and some want to use their performance benchmarks for marketing.

Disclaimer: I am the author of txtai (https://github.com/neuml/txtai), an open source embeddings database


Amusingly, vector search is conceptually nearly as old as keyword search.

Most older IR textbooks promote the idea of viewing keyword search as a maximization problem of the weighted bit vector product between the query and the document, where the weight matrix is something like BM-25.
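To make that framing concrete, here's a toy sketch of keyword scoring as a weighted vector product (the IDF weighting is purely illustrative, not actual BM25, which adds term-frequency saturation and length normalization on top of the same idea):

    import numpy as np

    # Documents and query as binary term vectors over a tiny vocabulary.
    vocab = ["vector", "search", "keyword", "database"]
    docs = np.array([
        [1, 1, 0, 1],   # "vector search database"
        [0, 1, 1, 0],   # "keyword search"
    ])
    query = np.array([1, 1, 0, 0])  # "vector search"

    df = docs.sum(axis=0)                         # document frequency per term
    idf = np.log((len(docs) + 1) / (df + 1)) + 1  # simple smoothed IDF
    scores = (docs * idf) @ query                 # weighted dot product per doc
    print(scores)  # ranking by this score is the maximization problem above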


I was interested in dense vector search in 2004 and a bit depressed at how the indexing algorithms weren't that good. Around 2013 I was involved with an actual product which was a search engine for patents.

The thing about vector search today is that it is an absolute gold rush. Pinecone had the right idea but the wrong product (sorry, SaaS is toxic). Many of the latecomers are… late, and will find the whole startup and VC process will cost precious months when customers want to build products right now.

Note that for that patent search engine we had quite a different idea than is fashionable today: like very old TREC, we cared about the quality of results 1000 or 2000 results in, on the assumption that a patent researcher wanted to be comprehensive. When I first got into IR it was in the context of “how is Google so much better than other search engines?”, and one part of it was that, based on the way search engines were being evaluated at TREC, Google was not a better search engine, because it was so focused on the first page.

It has come full circle now, I have talked to people recently who complain that current TREC evaluations are too focused on the first few results.


I assume we're not talking about the Texas Real Estate Commission?



> The focus on the top 10 in vector search is a product of wanting to prove value over keyword search

I'm surprised to see less mention of the use of vector search in RAG. The reason the top 10 (or fewer) is so important in RAG is that you only have so much context to work with (and even with large-context models, you would still prefer to minimize the total amount of context you're sending the model).

Everyone I know working seriously on vector search is using it to plug into a prompt for an LLM. For this use case top 10 is a fairly important metric.
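Concretely, the pattern is roughly: embed the question, take the top-k nearest chunks, and pack them into the prompt. A minimal sketch (`chunks` and `chunk_vecs` are assumed to be precomputed text chunks and their embeddings; the prompt template is just an example):

    import numpy as np

    def top_k_chunks(query_vec, chunk_vecs, chunks, k=10):
        # cosine similarity against every stored chunk embedding
        sims = chunk_vecs @ query_vec / (
            np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
        return [chunks[i] for i in np.argsort(-sims)[:k]]

    def build_prompt(question, context_chunks):
        # only the top-k chunks fit in the model's context window
        context = "\n\n".join(context_chunks)
        return (f"Answer using only the context below.\n\n"
                f"{context}\n\nQuestion: {question}")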


I agree. My experience is that hybrid search does provide better results in many cases, and it's honestly not as easy to implement as it may seem at first. In general, getting search right can be complicated today, and the common thinking of "hey, I'm going to put up a vector DB and use that" is simplistic.

Disclaimer: I'm with Vectara (https://vectara.com), we provide an end-to-end platform for building GenAI products.


Isn't one big problem of vector search that it is bounded by the complexity of the kNN problem in multiple dimensions, whereas keyword search like BM25 is very fast and scalable because of inverted indices? Annoy and others need to sacrifice accuracy for speed, I believe. Hybrid search can be used to first filter a smaller number of candidates and then rerank them using vector search. Performance-wise, it is naturally pretty slow and hard to scale compared to keyword search.


The elephant in the room is the Curse of Dimensionality. As the number of vector dimensions increases, "nearest neighbor" search behaves in a surprising manner. In a high-dimensional cube, randomly sampled vectors are mostly near the edge of the cube, not at all what you expect based on two or three dimensions. There are other hitches, too.
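A quick numerical illustration of the "mostly near the edge" effect (a throwaway numpy sketch; the 1% threshold and sample size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    # Fraction of uniform random points in the unit cube that lie within 1%
    # of at least one face, for increasing dimensionality.
    for d in (2, 3, 32, 256, 1024):
        pts = rng.uniform(size=(10_000, d))
        near_edge = np.any((pts < 0.01) | (pts > 0.99), axis=1).mean()
        print(f"d={d:5d}  fraction near an edge ~ {near_edge:.3f}")
    # At d=2 only ~4% of points are near an edge; by d=256 it's ~99%.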

When using vector search, you're making two leaps - that your document can really be summarized in a vector (keeping its important distinguishing semantic features), and that it's possible to retrieve similar documents using high-dimensional vectors. Both of these are lossy and have all sorts of caveats.

What would be great is if these new vector DB companies showed examples where the top-ten "search engine results page" based on semantic embedding / vector search is clearly better than a conventionally-tuned, keyword-based system like Elasticsearch. I haven't seen that, but I have seen tons of numbers about precision/recall/etc. What's the point of taking 10ns to retrieve good vector results if it's unclear whether they're solving a real problem?

https://en.wikipedia.org/wiki/Curse_of_dimensionality


These vectors are lower-dimensional than traditional vectors though, aren't they? Vector embeddings are in the hundreds to low thousands range of dimensions (roughly between 128-1024), whereas TF-IDF has the same dimension as your vocabulary. It's also not just about being flat-out better, but about increasing the recall of queries, as you're grabbing content that doesn't contain the keywords directly, but is still relevant. You are also free to mix the two approaches together in one result set, which gives the best of both.


The problems with dimensionality certainly show up even with 256 dimensions. PCA-ing down to a few hundred dimensions is still a problem, and then you have to deal with PCA lossiness too!


Nobody used TF-IDF for vector lookups without applying a PCA first though.


You can try the 3 methods: keyword, kNN, and hybrid on a collection in the AI domain here: https://search.zeta-alpha.com/. YMMV: sometimes kNN has more interesting results, sometimes keyword is closer to what you wanted. Hybrid is sometimes also a good middle ground, but it fails when one of the retrievers has very bad results (because it will interleave them).


Thanks, interesting take!


If you only have to sacrifice a tiny bit of accuracy, it's not really a problem is it? You can quantify these things experimentally to see whether it's acceptable for your use case.

Given that vector and hybrid search are used in many situations at scale, I think to make the kind of claims you're trying to make, it might help to quantify exactly when you think it will be too slow or what experiences you have had of struggling to make it fast enough.


In terms of accuracy loss, it's also worth considering what additional conceptual matches are available that aren't with keyword search.

In terms of performance, vector search is definitely slow but it's come a long way. There are a number of solid 384 dimension models that work well.


The dimensionality isn't really a problem if you are selecting the right features, which is usually the secret sauce for any particular application.

It's funny seeing this come up now as this was my full time job ten years ago and even then most of the research had been completed decades prior.


We (at Weaviate) support separate BM25, vector, and combined hybrid search, so I can talk about the performance aspect of all of these a lot. The tl;dr is: It's complicated. What you say, may be true in some cases, but not in others. One does not always scale better than the other.

For example, ANN indexing with an index like HNSW scales pretty much logarithmically. A bit oversimplified, but roughly we see search times double with every 10x of the search space. In addition, calculations like euclidean distance and dot product (and therefore cosine sim) can be parallelized very well at a hardware level, e.g. AVX2/AVX512/neon and similar SIMD techniques.

With BM25, this isn't quite as simple. We have algorithms such as WAND and Block-Max-WAND, which help eliminate _scoring_ elements that cannot reach a top-k spot. However, the distribution of terms plays a big role here. As a rule of thumb, the rarer a word is, the fewer documents require scoring. Let's say you have 1B documents and a 2-term query, but each term matches only 100k documents. If you AND-combine those terms, you will have at most 100k matches; if you OR-combine them, you will have at most 200k. The fact that there were 1B documents indexed played no role. But now think of two terms that each match 500M objects. Even with the aforementioned algorithms – which rely on the relative impact of each term – there is a risk we would now have to score every single document in the database. This is, again, over-simplified, but my point is the following:

ANN latency is fairly predictable. BM25 latency depends a lot on the input query. Our monitoring shows that in production cases, when running a hybrid search, sometimes the vector query is the bottleneck, sometimes the BM25 query is.

> Hybrid search can be used to first filter a smaller number of candidates and then rerank them using vector search

My post is already getting quite long, but want to quickly comment on this two-step approach. For Weaviate, that's not the case. BM25 and vector searches happen independently, then the scores are aggregated to combine a single result set.
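For readers wondering what that aggregation can look like, here's reciprocal rank fusion as a generic example (just one common fusion scheme, not necessarily the exact one Weaviate ships):

    def reciprocal_rank_fusion(bm25_ranked, vector_ranked, k=60):
        # Each input is a list of doc ids ordered best-first from one retriever.
        # A doc's fused score is the sum of 1/(k + rank) over the lists it
        # appears in; k damps the influence of the very top ranks.
        scores = {}
        for ranked in (bm25_ranked, vector_ranked):
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # "b" ranks well in both lists, so it tops the fused ranking.
    print(reciprocal_rank_fusion(["a", "b", "c"], ["b", "d", "a"]))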

Yes, you could also use BM25 as a filter set first, then re-rank the results with embeddings, however you would lose the BM25 scores. If you want to use keywords just as filters, you can do that much, much cheaper than through BM25 scoring. In Weaviate, matching keywords to create a filter set would only require AND-ing or OR-ing a few roaring bitmaps, which is orders of magnitude more efficient than BM25 scoring.

Disclaimer: associated with Weaviate.


Thanks, that's very interesting!


> In 2023, there has been a shift towards acknowledging keyword search also has value and that a combination of vector + keyword search (aka hybrid search) operates in the sweet spot.

Do you have evidence that this hybrid approach is better than a vector search? That would be surprising.


Imagine "day 2" search tasks like the user wanting to filter results for a specific time range

You could force the data team to go through hoops like multimodal embeddings, multiple indexes, post-filtering... or figure it out

Near-term, it makes sense to do one thing well, so many new vector DBs are really mutable vector indexes with lackluster search. But eventually, if you are not a narrowly-scoped kernel (faiss) but the main search platform needing $$$-level market share for growing payrolls, you're incentivized to figure search out beyond ANN, and your competitors are advertising that they are doing the same. So you add post-filtering, and then eventually, maybe even push-down/dynamic filtering.

Source: We have been happily using on-the-fly GPU kNN embeddings for years in pygraphistry, but when we added billion-scale vector indexing to power louie.ai RAG search for real-time news & database querying, we outgrew pure ANN on our first user, and I'm thankful we did due diligence on our DB selection.


And what did that due diligence reveal?


Most prominent for a surprising number of the vendors we looked at last year (https://gradientflow.com/the-vector-database-index/):

- We would have paid more for storing billion+ rows in-memory than our gov & enterprise customers were paying for a full product. Since then, ivfpq etc have gotten more popular to do responsive searches for bigger-than-memory datasets. If we needed small in-memory workloads, we could have done faiss+pandas+fastapi and moved on.

- Vendors pitched themselves as managed databases yet their users complain about them going down & losing data

- Many did not support basic search operators: time filtering, ...

- Few/none had real modern security, e.g., row-level multi-tenant ABAC

I haven't been tracking on a per-vendor basis as each publicizes making piecemeal progress on these. We found one that did #1-3 without failing us yet, and we're biting the bullet to implement #4. A scalable OSS answer to #4 might be the one that causes us to switch.


Yes, we published benchmarks here: https://www.pinecone.io/blog/hybrid-search/


In relation to txtai: https://neuml.hashnode.dev/benefits-of-hybrid-search

If you run a web search for "vector hybrid search" you'll come across plenty of articles for a number of vector databases discussing the topic.


I'd actually want evidence for the opposite. Why vector retrieval for search? What's the evidence for that?

BM25 is often a baseline for improving upon. But I see improvements using lots of techniques, not just vector retrieval.


It would be good to have a BM25 baseline entry on this leaderboard: https://huggingface.co/spaces/mteb/leaderboard

And I'm sure you're aware of the BEIR paper: https://arxiv.org/abs/2306.07471. Elastic references that in this blogpost: https://www.elastic.co/blog/improving-information-retrieval-...

I agree that BM25 retrieval + vector re-ranking can work. But vector search does bring results to the table that vanilla BM25 can't, even with a large retrieval window. So I do think there is a place for both with the usual "it depends on your data/requirements" caveat.


The opposite is also true. BM25 prefers lexical matches and brings these candidates back that vector search often doesn’t.

I am not disagreeing that vectors are useful, but I think benchmark-based evidence is not the same as deploying a solution that must scale, be constantly updated, and serve the many use cases customers want, like filtering, search syntax, etc.

And I think there's a real danger of herding to vector retrieval (and even then, to one view of it), which cuts off exploration of diverse solutions.


Hybrid retrieval can possibly be the best of both worlds.

I invested a lot of time with the last txtai release adding a minimal dependency Python-based BM25 component (https://neuml.hashnode.dev/building-an-efficient-sparse-keyw...). And keyword-only indexes are supported if one desires.

I'm with you 100% on not herding to any one way for any problem. I still remember the pre-2023 world where you weren't pressured to work LLMs into everything.


Been following this space for a bit. In my view, the literature on the internet is distorted about these things because of the need for certain OSS project creators and (non OSS) companies to promote their own products. It typically goes like this:

[Insert technique name] is a good technique, but it can be further improved if you enable [some complex workflow]. [Some example of how it improved a particular case]. We at [Company] already do this at scale and you can use it out of the box if you use it. Go make better things.

Nothing against people doing this stuff. It just makes me think how difficult is it to promote open source projects today.

Regarding vector DBs, most have advocated for Weaviate and Pinecone. Incidentally, all run into quantization problems at scale (1M+ vectors) if the clusters are to be adjusted, and handle it differently. I tried it for a 6000-vector product[1] and got the best results by using a pkl file rather than any of the DBs.

[1] Not sharing the product here as such. We just scraped a govt tax website to create a search engine on top of that. Helps as the data is not indexed by Google, and we had no interest from customers in wanting to chat. They were interested in the original source along with a summary.


Bit sad how it's going. People go straight to product and venture scaled growth.

Build protocols, not products! I tell people.

It's terrible advice actually, unless you like eating noodles and scraping by. But it's still the right way.

We're still waiting for an antirez-style Redis of vector DBs. Maybe he should do one! :)


The good news is that this already exists in the Redis search module [1], which allows you to do similarity search against indexed embeddings, among other features, and offers comparable performance against other ANN libraries [2], depending on your performance criteria.

I've been using it for a side project to do semantic search on books[3] and have been really happy with its performance. (Not affiliated with any of this, was mostly interested in exploring existing well-performing, fairly standard tools with low latency)

[1] https://redis.io/docs/interact/search-and-query/search/vecto... [2] https://ann-benchmarks.com/#redisearch [3] https://viberary.pizza/how


Bit of a side note on Redis Search. While performance is pretty good, it is not production quality yet, mostly around monitoring indexing errors and quirky query results. Silent indexing errors pretty much make it impossible to use it in production, especially for structured data.


Redis Enterprise edition does have a vector database. I haven't used it but am curious about other people's experiences.

https://redis.com/solutions/use-cases/vector-database/


> Regarding Vector DBs, most have advocated for weaviate and pinecone. Incidentally, all run into quantization problems at scale (1M+ vectors)...

Can you elaborate? We (at Pinecone) have dozens of customers with over a billion embeddings, still serving queries in under 300ms (p95) with index updates in <1s. If you saw performance issues with just 1M embeddings then something else is going on that we may be able to help with.


It's a specific case. Most customers won't run into it. It's to do with quantization, or specifically what happens when you add enough vectors that the original clustering changes (one way this could be hit is when you onboard an IM/consulting customer with enough data to alter the clusters, among others). It is not a performance issue, to be fair. Just that altering a cluster takes time.

I was just testing to see if there is a DB which does not run into such problems, before putting forward a business case. I am not discouraging the use, just saying that most cases do not need a vector DB. The reason we did not choose one was because our business case's number of vectors was small.


I may be misunderstanding, but I'll try to answer: quantization typically means retrieval will be slower (if referring to techniques like product quantization), but that is the case whether you're at 10K vectors or 1B vectors; afaik it doesn't really make a difference because you're only quantizing the query vector at query time (it has been a while since I read anything on quantization, so I could be mistaken).

Maybe your question is referring to the need to have quantization at larger index sizes? In which case, yes, that would typically be true because you want to either (1) minimize the index size when quantizing it, or (2) optimize the query space (to search through less). Whether you want (1) or (2) will impact the type of quantization being performed (basically 1 == product quantization, and 2 == inverted index).

So once you get to the 1M+ size, you need to consider quantization in some form - or you can go with graph-based retrieval, if you don't mind using a lot of disk space.


You likely won't see performance (speed) benefits if you have just tens of thousands of vectors and of course brute force (vector matrix multiplication) will have better precision & recall.

Vector databases make sense when you have a lot of data or some other requirements (e.g. persistence, online updates, complex filtering, etc) that you would rather not implement yourself.


Exactly. Brute force methods work great for thousands of pieces of data of (almost) any kind. For vector data no vector DB is needed; for structured data even a simple CSV file with a Python script will do the job and a SQL DB is not necessary.
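For scale, brute-force vector search over an in-memory array is only a couple of lines (a minimal sketch using cosine similarity; `vectors` and `texts` are whatever you've embedded):

    import numpy as np

    def search(query_vec, vectors, texts, k=5):
        # cosine similarity of the query against every stored vector
        sims = (vectors @ query_vec) / (
            np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec) + 1e-9)
        top = np.argsort(-sims)[:k]
        return [(texts[i], float(sims[i])) for i in top]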


The thing is, depending on the dimensions of the vectors you're using, the growth rate can be pretty drastic. Each OpenAI vector is 1,536 dimensions, about 6 KB as float32. So sure, with 1k vectors, just brute force it. But what if you have 100k vectors and a ton of users hammering at the index searching through it? Each search is now 100k dot products, roughly 150,000,000 multiply-adds. Go up one more order of magnitude and you definitely need an index. And I don't think 1 million vectors is that much.


The number of vectors is determined by (other than the original dataset, of course) how you choose to chunk the data you have available. Bigger chunks work better in terms of search (empirically) and they also keep the number of vectors down. For OpenAI, based on prevalent norms and their cookbooks, 1M vectors likely means 1M (more like 700K) PDF pages of text (at a token size of 1000 per embedding). That is a lot of textual data for a decent-size company. Enterprises might reach that stage. Consulting firms definitely would, though they have already trained and announced their own models.


700k PDF pages is not a lot. Also, you might be a business serving other businesses and indexing their documents, and at reasonable scale (again, not Google scale), 700k pages is again not a lot.

Another way to look at 700k pages: that's about 2,333 books of 300 pages each.


Eh... Because keyword and BM25 do not work that well, especially for enterprise search? Many of us have surely complained about how bad enterprise search is, right? A company usually has a hodgepodge of documents. They don't link to each other that much, so PageRank does not work well. They get fewer than 1 click per day, if any at all, so signals based on user interactions will not be useful or will be expensive to build. We often search with keywords that show up in a fair number of documents, but not enough to make BM25 rank the matches high. Query understanding hardly exists in enterprise search, because building query autocompletion, query correction, query reformulation, etc. all requires some level of semantic understanding. If it's not embedding-based, it has to be stats-based, which requires a volume of clicks and user evaluation that enterprise search can't afford. The list goes on and on. There is a reason that Google's Search Appliance never took off, and people are still complaining about the quality of enterprise search.

Amazingly, a true semantic search would work well. And to do semantic search in scale, you'll need a vector database.


Dear god this is accurate. A product I worked on a few years ago wanted to add a feature searching various PDFs that were a slurry of the same words. After being disappointed with the results of keyword search, we deemed semantic search too complicated to implement and scrapped the feature. Great to see it's getting easier.


I work in an adjacent area, and this is the most useful and information-dense summary I've read in any forum in at least a week.

Word!


This is tangential, but another issue I've had with existing vector databases is write throughput. Weaviate and Pinecone build an index immediately after the creation of a "table" (or rather, the analogue thereof) and this makes write throughput abysmal at times.

We have around 2 million 768-length embeddings + metadata -- writing too fast to Weaviate quickly results in OOMs. We circumvented this by very slowly trickling our data through to Weaviate, but this causes many-hours-long writes.

I'm experimenting with dumping our data into postgres and building an hnsw index with pgvector once all writes are done. This has reduced our time to maybe 2 hours at most (the bulk of that being index building time, of course).
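For anyone curious, the "load first, index after" approach looks roughly like this (a sketch with made-up table and column names; pgvector >= 0.5 is assumed for HNSW support):

    import psycopg2

    conn = psycopg2.connect("dbname=docs")
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS chunks ("
                "id serial PRIMARY KEY, body text, embedding vector(768))")

    # ... bulk-load all rows here; COPY is far faster than row-by-row INSERTs ...

    # Build the HNSW index only once the writes are done, so ingestion isn't
    # paying graph-maintenance costs on every insert.
    cur.execute("CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops) "
                "WITH (m = 16, ef_construction = 64)")
    conn.commit()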


Hey @rvrs, I work on Weaviate and we are working on some improvements to increase write throughput:

1. gRPC. Using gRPC to write vectors has had a really nice performance boost. It is released in Weaviate core but there is still some work to do on the clients. Feel free to get in contact if you would like to try it out.

2. Parameter tuning. Lowering `efConstruction` can speed up imports (see the hnswlib sketch after this list).

3. We are also working on async indexing https://github.com/weaviate/weaviate/issues/3463 which will further speed things up.
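To illustrate point 2 with the standalone hnswlib library (not Weaviate itself, just the same underlying HNSW trade-off): lower ef_construction means a faster build/import at the cost of graph quality and, usually, recall.

    import numpy as np
    import hnswlib

    dim, n = 768, 20_000
    data = np.random.rand(n, dim).astype(np.float32)

    index = hnswlib.Index(space="cosine", dim=dim)
    # lower ef_construction -> faster index build / import
    index.init_index(max_elements=n, ef_construction=64, M=16)
    index.add_items(data)

    index.set_ef(64)  # query-time ef, tuned separately for recall vs latency
    labels, distances = index.knn_query(data[:5], k=10)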

In comparison with pgvector, Weaviate has more flexible query options such as hybrid search and quantization to save memory on larger datasets.


I once solved worst-case write performance in an open source MVCC data store (Apache Lucy, a “loose C” port of Lucene) like so:

1) Limit the maximum size of an update. This causes gradual degradation of the index, which is ordinarily dealt with through periodic consolidation — but now this consolidation will be disabled in the main write mechanism.

2) Enable a background consolidation process which runs concurrently. At the end of the consolidation process a short write lock will need to be obtained while you replay all deletions that have occurred since consolidation started, but the time required isn’t problematic.

This fix required major surgery to the Lucy indexer core code, so it's something that would only be feasible if the data store itself supported it.


@rvrs, Have you tried Pinecone's S1 pods? Curious which indexing type you went with when experimenting (https://docs.pinecone.io/docs/indexes)

Disclaimer: I'm from Pinecone, hello!


Yeah basically all the vector "database" solutions in market have chosen data-dependent indexes, so you need the data upfront. Imagine if regular databases needed all data upfront before they could build indexes. It's kind of crazy...


Remember the "Mongo DB is web scale" craze from now over a decade ago? That's the current state of vector DBs, despite vector DBs having been around forever.

Do you have more than a terabyte of embeddings? No? Do yourself a favor and use pgvector with an HNSW index, it'll do just fine. Operationally, it is very hard to beat Postgres.


An OpenAI embeddings vector is 1536 4 byte floats. 1 TiB is roughly 174K such embeddings vectors.


Your math is wrong. 100k such vectors at 32-bit floats is about 600 MB.

I think your point is right though. Searching through these requires an index of some sort at any reasonable scale (not Google scale).


Hmmm, yeah, I used 2^30 instead of 2^40. Should not comment before caffeinating.


And if you just have tens of thousands, straight up SQL can work pretty well, too.


If you have tens of thousands, a `for` loop over an array in memory is plenty fast.

I've worked on a project that searched through 50 million vectors with linear brute-force search, just with a sprinkle of AVX (and it used the brute-force search, because the curse of dimensionality killed all the smarter approaches).


Wow!

Not the point of the article, but it is striking to see how mastodon is way better in terms of usability/aesthetic and general UX than Twitter or 'x'.

I've been mildly interested in mastodon for quite some time from a political perspective (remove large support of big companies where possible), but I didn't realize it was just sooo much better to use.


And there are tons of third party clients. I think Tusky is the best one I’ve seen for Android, and there’s an interesting web-based one called Elk that’s very nice. You load up https://elk.zone and then use it as a front-end to sign in to your server.


Also the mastodon web client installs great as a PWA. Just add your home server as a shortcut to your home screen - no client install needed.


I really enjoy Mastodon and believe it has staying power.

Not rocket ship “going to explode and burn out” power. There’s enough there to keep people engaged. The decentralized nature means there’s always corners of it alive and growing.


This space is the ultimate in premature optimization and obsession over implementation details (including the classic "it's written in programming language X so it has to be better") with relatively little real-world experiential knowledge in the mix.


I'm not sure I understand. Isn't vector storage about storage and retrieval strictly, being a separate thing from the embedding generation itself?

My mental model is that the job of vector storage is to just fetch the closest vectors to a target vector for some proximity metric, with the model (or method , more generally) that generates embeddings being its own distinct problem. You may later do things like passage re-ranking and maximal marginal relevance to refine it, but again, different task.

I understand vector search against tons of other vectors is an approximation endeavor due to practical concerns, so it doesn't seem strange for vector store accuracy to be measured in terms of top-k retrieval. Post processing and the quality of the vectors themselves is a different thing to judge.


Anyone have insights on using Elasticsearch as a vector database as opposed to a specialised vector DB? We have a complex keyword search already, and would like to introduce knn query ability to find similar documents. ES offers that in recent versions, but I’m wondering whether that’s a good idea.


> I’m wondering whether that’s a good idea.

That depends... How fast do you need it to be, how many vectors will you be searching across, and how often will the index be updated? If the answers are anything resembling "very, many, and often" then you'll want to at least compare with a DB that's been purpose-built for vector search. I'm talking >10M vectors, <100ms, <hourly updates. If the workload is anything less than that, then just use whatever is most convenient -- which in your case could be Elastic.

(Disclosure: I'm from Pinecone. The above is based on independent testing.)


Do you have any links handy to those test results?


This video goes through some of the testing methodology and techniques: https://youtu.be/7E-eiUN9d6U?si=zSoiGH2QAlQoxxcr


I looked at this a few months ago. Elasticsearch has some limitations with respect to the vector length (1024, I think), which rules out using a lot of the popular off-the-shelf models. A key insight is that the performance of off-the-shelf models isn't great compared to a hand-tuned query and BM25 (the Lucene ranking algorithm). I've seen multiple people make the point that the built-in ranking is pretty hard to beat unless you specialize your models to your use case.

A key consideration for the vector size limitation is that storing and working with large amounts of huge vectors gets expensive quickly. Simply storing lots of huge embeddings can take up a lot of space.

And of course using knn with huge result sets is very expensive, especially if the vectors are large. Having the ability to filter down the result set with a regular query and then ranking the candidates with a vector query helps keep searches responsive and cost low.

If you are interested in this, you might want to look at Opensearch as well. They implemented vector search independently from Elasticsearch. Their implementation supports a few additional vector storage options and querying options via native libraries. I haven't used any of that extensively but it looks interesting.


I believe the 1024 limit has been upped in recent versions of Elasticsearch

https://github.com/elastic/elasticsearch/issues/92458


Ah, that’s neat, thank you for the input! We’re actually using a homegrown (word2vec descendant) model to build document vectors - it works well on its own, but implementing a search engine on top efficiently has proven to be futile.


It's been a minute since I've looked at embeddings in this context but isn't Johnson-Lindenstrauss going to be applicable here so that you can get away with 1024-long (or shorter) vectors?
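For reference, the idea in a few lines (a throwaway numpy sketch; scikit-learn packages the same thing as GaussianRandomProjection):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 1_000, 4_096, 1_024
    X = rng.normal(size=(n, d))
    P = rng.normal(size=(d, k)) / np.sqrt(k)  # random projection matrix
    Y = X @ P

    # Pairwise distances are approximately preserved, so neighborhoods
    # (and hence ANN results) mostly survive the shrink to k dimensions.
    i, j = 0, 1
    print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))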


Jimmy Lin, a researcher at U Waterloo, recently published "Lucene is All You Need": https://arxiv.org/abs/2308.14963

FWIW, I know Elasticsearch has been investing a lot in this space lately. It's not perfect, but it's bound to get better.


That paper does a terrible job of making Lucene look useful, though. 10qps from a server with 1TB of RAM is not great (and I know Lucene HNSW can perform better than that in the real world, so I am somewhat mystified that this paper is being pushed by the community).


One thing I have been wondering about is how, concretely, in practice embeddings will be used. Will companies create embeddings for most / all content that they have? Will they create an embedding for every sentence, paragraph or page of text (or all of those). Will they store hierarchies of embeddings? Do they then store the original text so they can invert the process (that seems obvious)?


User/Item/Query embeddings are the most common. That way you can generate per-user recommendations, or search results for a given query (with personalization using side information). Video will be interesting, once we have video embeddings (maybe this exists already). It depends on the use-case but a few of your ideas are certainly possible. Generally I've seen them at a coarse rather than fine level, but I'm sure that's out there too.

This looks like a good overview if you want to read about it: https://recsysml.substack.com/p/two-tower-models-for-retriev...
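For a sense of the shape of a two-tower model, a bare-bones sketch (hypothetical feature sizes; real systems add negative sampling, training loops, side features, etc.):

    import torch
    import torch.nn as nn

    class Tower(nn.Module):
        def __init__(self, in_dim, emb_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, emb_dim))

        def forward(self, x):
            # normalize so the dot product equals cosine similarity
            return nn.functional.normalize(self.net(x), dim=-1)

    user_tower, item_tower = Tower(in_dim=64), Tower(in_dim=100)
    users, items = torch.randn(8, 64), torch.randn(1000, 100)

    # Item embeddings are precomputed and indexed; at serving time you embed
    # the user/query and do a nearest-neighbor lookup over them.
    scores = user_tower(users) @ item_tower(items).T  # (8, 1000)
    topk = scores.topk(10, dim=-1).indices            # per-user top-10 items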


1. Check out this cool chat-with-my-docs app I made with LangChain!

2. Okay, let's productize!

3. Ouch, the results aren't quite production-ready...

4. Huh, the bottleneck seems to be the search & retrieval process...

5. Oh, I guess I don't know anything about search, time to catch up on the field...

6. Ok, now I need to build an inverted index or knowledge graph and augment it with embeddings.

7. Hm, actually, language models (including BERT-based, Splade) could be pretty useful in building structure out of my dataset...

IMO the best way to incorporate transformers into search has yet to be discovered, although whatever OpenEvidence.com is doing seems to work well. If I were still in grad school I would look into some kind of PageRank / graph NN approach to the problem, or maybe hyperbolic embeddings.

Reason being that for many corpora there exists some kind of internal structure or linking that can be exploited for speed and accuracy, which you don't quite get from nearest neighbors in a Euclidean space.


For starters I don't believe that people even understand or agree what "vector" even means in this context.


A great deal of the current excitement around vector search is related to embeddings.

I tried to provide a very clear explanation of embeddings here, since I know lots of people haven't yet learned what they are or why they are interesting: https://simonwillison.net/2023/Sep/4/llm-embeddings/


Yes, the word is a bit overloaded... I guess from skimming through the article that here it is vector in the sense of word2vec algorithms.


Yes, it's all wrong, because:

a) Recall is designed to measure binary relevance, but vector scores are not good relevance judgments and they aren't binary.

b) Most models optimise purely for distance, which makes nDCG look great but causes content to clump together. This loses local ranking precision, and the noise from embedding order is significantly greater than the approximation error in the ANN system.

c) Bi-encoders have significantly greater error than cross-encoders. Basically every vector DB is blowing at least one order of magnitude more resources than it needs to in order to optimise bi-encoding efficiency, which is wrong anyway.
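For readers unfamiliar with the bi- vs cross-encoder distinction, the usual compromise looks roughly like this: retrieve cheaply with a bi-encoder, then re-rank a short candidate list with a cross-encoder (a sketch using sentence-transformers; the model names are common public checkpoints used only as examples):

    import numpy as np
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    bi = SentenceTransformer("all-MiniLM-L6-v2")
    cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    docs = ["...corpus documents..."]
    doc_vecs = bi.encode(docs, convert_to_tensor=True)

    query = "how should hybrid search be tuned?"
    # Bi-encoder: one cheap vector per document, indexable ahead of time.
    hits = util.semantic_search(bi.encode(query, convert_to_tensor=True),
                                doc_vecs, top_k=100)[0]

    # Cross-encoder: scores every (query, doc) pair, so only the candidates.
    pairs = [(query, docs[h["corpus_id"]]) for h in hits]
    reranked = [hits[i]["corpus_id"] for i in np.argsort(-cross.predict(pairs))[:10]]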

Disclaimer: I work at Algolia.


Who's focusing on top 10 recall? I recently saw someone ask if we (I work at Zilliz) can update Milvus' recall to 10 million nearest neighbors lol


Asking as a newbie in database programming.

Are vector databases built on top of rd trees?


Do we "obsess over perfect accuracy in top 10 for vector dbs"? I don't see a ton of data that points towards us (vector DB ppl) building towards traditional TREC/BEIR #s.

I do see lots of us in the vector DB space demonstrating on real data the speed and semantic retrieval capabilities of vector DBs. Not many of us publish traditional IR benchmark #s simply because vector search is a pretty different ballgame than traditional keyword/BM25 search (as you know).

However, since most search practitioners have developed their mental models around things like TFIDF and BM25 + Learning To Rank, many vector DBs tailor their language to fit that heuristic.

If you look at vector DBs on their own -- I like to look at them as a piece of infra to use in conjunction with pre/post processing + RAG + BM25 + whatever else you want -- you'll find that measuring their accuracy against traditional IR benchmarks is a bit like fitting a square peg into a round hole. It almost fits! But not quite.

I think one of the starkest differences between the vector search and traditional search when it comes to retrieval evaluation is the great difficulty involved in creating judgment lists to measure vector search performance. Since vector search inherently retrieves documents that are approximately conceptually similar to a query, the initial recall set, especially with an IRL dataset (e.g. Reddit, Shopify, etc.) is a moving target. Language evolves. Idiosyncrasies pop up. Etc.

Not to say that those ^ issues don't plague keyword search as well. But the powerful thing about vector databases IMO is that unlike traditional indices that need lots of fine tuning and IR experts, a vector DB will capture these new additions to your domain's vocabulary immediately, based on the surrounding context in the document.

Of course you can do similar things with traditional search engines by incorporating curated lists of synonyms, antonyms, stop words, etc. but that takes time (despite engines like Solr having built-in lists of these kinds, search engineers often have to curate them further, manually, because their product's domain is so niche) and lots and lots of query-stream analysis (I remember personally doing lots of the latter at Reddit, hehe).

TLDR:

- I think pure vector search platforms should be evaluated differently than traditional keyword search platforms.

- I think vector search is a tool -- simply one of many -- that search engineers should use to make their search engine results more relevant.

Disclaimer: I'm from Pinecone (and hi, Doug!)


Speaking of tools, I want to add keyword search to my CEQ project. I’m currently just using pinecone for retrieval, but I’m not always getting results for specific niche technical terms. Do you have any advice?


If you want to add true keyword search (i.e. traditional BM25 matching-type search capabilities), you should try hybrid search. LMK if you have issues! https://docs.pinecone.io/docs/hybrid-search


Definitely!

Have you tried adding whatever terms you want retrieved back as 'metadata' attached to your vectors?

https://docs.pinecone.io/docs/metadata-filtering
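Roughly what that looks like with the Python client (a sketch; the index name, metadata field name, and term values are placeholders, so check the linked docs for the current syntax):

    import pinecone

    pinecone.init(api_key="YOUR_KEY", environment="YOUR_ENV")
    index = pinecone.Index("my-index")

    query_embedding = [0.0] * 1536  # stand-in for the real query embedding

    # ANN query constrained to vectors whose metadata field matches any of
    # the listed niche terms, via the $in operator.
    results = index.query(
        vector=query_embedding,
        top_k=10,
        filter={"terms": {"$in": ["sparql", "rdf", "owl"]}},
        include_metadata=True,
    )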


Oh my god. I didn’t realize you could use the $in clause like that. I already have the text content indexed so this will work nicely. Embarrassing, but thank you!


> I don't see a ton of data that points towards us (vector DB ppl) building towards traditional TREC/BEIR #s.

This is highly accurate; most vector database companies don't talk about the shortcomings of vector representations for search.

> TLDR: - I think pure vector search platforms should be evaluated differently than traditional keyword search platforms - I think vector search is a tool -- simply one of many -- that search engineers should use to make their search engine results more relevant.

This is a contradiction. On one hand you say that you want to improve relevance; on the other hand you say that vector search as a tool cannot be evaluated like other models (tools).

We have plenty of open information retrieval datasets (both full retrieval and ranking) where you can compare different methods or tools and assess the relevance impact.


Forget RAGs and LLMs, Vim search and a little regex is all you need


It's possible that our current perspective on vector storage could be flawed or incomplete, and exploring alternative approaches may yield valuable insights and improvements in data management and computation.



