As others have correctly pointed out, building a vector search or recommendation application requires a lot more than similarity alone. We have seen HNSW become commoditised and the real value lies elsewhere. Just because a database has vector functionality doesn't mean it will actually service anything beyond "hello world" type semantic search applications. IMHO these have questionable value, much like the simple Q&A RAG applications that have proliferated. The elephant in the room with these systems is that if you are relying on machine learning models to produce the vectors, you are going to need to invest heavily in the ML components of the system. Domain-specific models are a must if you want to be a serious contender to an existing search system, and all the usual considerations still apply regarding frequent retraining and monitoring of the models. Currently this is left as an exercise to the reader - and a very large one at that. We (https://github.com/marqo-ai/marqo, I am a co-founder) are investing heavily in making the ML production-worthy and in continuous learning from feedback as part of the system. There is a lot more to think about: how you represent documents with multiple vectors, multimodality, late interaction, the interplay between embedding quality and HNSW graph quality (i.e. recall), and much more.
In general I find they're incredibly good for rapidly building out search engines for things that would normally be difficult to do with plain text search.
The most obvious example is code search where you can describe the function's behavior and get a match. But you could also make a searchable list of recipes that would allow a user to search something like "a hearty beef dish for a cold fall night". Or searching support tickets where full text might not match, "all the cases where users had trouble signing on".
Interestingly, Q & A is ultimately an (imho fairly boring) implementation of this pattern.
The really nice part is that you can implement working demos of these projects in just a few lines of code once you have the vector DB set up. Once you start thinking in terms of semantic search rather than text matching, you realize you can build old-Google-style search engines for basically any text available to you.
One thing that is a bit odd about the space, from what I've experienced and heard, is that setup and performance on most of these products is not all that great. Given that you can implement the demo version of a vector DB in a few lines of numpy, you would hope that investing in a full vector DB product would get you an easily scalable solution.
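For reference, the "few lines of numpy" demo version looks roughly like this - a minimal brute-force sketch, where the array names and sizes are just illustrative and the embeddings would come from whatever model you use:

    import numpy as np

    # stand-in for real document embeddings: one row per document
    corpus_embeddings = np.random.rand(1000, 384).astype(np.float32)

    def search(query_embedding, k=5):
        # cosine similarity of the query against every document, take the top k
        docs = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
        q = query_embedding / np.linalg.norm(query_embedding)
        return np.argsort(-(docs @ q))[:k]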
Everyone I talk to who is building some vector db based thing sooner or later realizes they also care about the features of a full-text search engine.
They care about filtering, they care to some degree about direct lexical matches, they care about paging, getting groups / facet counts, etc.
Vectors, IMO, are just one feature that a regular search engine should have. Currently Vespa does the best job of this, though lately it seems the Lucene-based engines (Elasticsearch and OpenSearch) are really working hard to compete.
My company is using vector search with Elasticsearch. It’s working well so far. IMO Elastic will eat most vector-first/only products because of its strength at full-text search, plus all the other stuff it does.
I tend to agree - search, and particularly search-for-humans, is really a team sport - meaning, very rarely do you have a single search algo operating in isolation. You have multiple passes, you filter results through business logic.
Having said that, I think pgvector has a chance for less scale-intense needs - embedding as a column in your existing DB and a join away from your other models is where you want search.
I don’t get why you’d want to bolt RBAC onto these new vector dbs, unless it’s because they’ve caused this problem in the first place…
They have beef with ES since they took the software, made a bunch of cash on it, then never contributed back. ES called them out and it started a feud.
I'd go on ES over Amazon-built software any day. I worked on RDS and I've used RDS at several companies, it's a mess.
Longer story:
One day one of our tables went missing on Aurora; we couldn't figure out why, it was still in the schema, etc. Devops panicked and restarted the instance, and then another table was missing. We ended up creating 10 empty tables and restarting it until it hit one of those.
We contacted RDS support after that, and the conclusion of their 3 month investigation is: "Yeah, it's not supposed to do that."
There are some really smart people working at Amazon; unfortunately the incentive is to push new stuff out and get promoted ASAP. If you can do that better than others and before your house of cards falls, you're safe. If the house of cards crumbles after you're gone, it's someone else's problem.
>Longer story: One day one of our tables went missing on Aurora; we couldn't figure out why, it was still in the schema, etc. Devops panicked and restarted the instance, and then another table was missing. We ended up creating 10 empty tables and restarting it until it hit one of those.
Are there any reports of this? How come this is the first time I've heard of it? How can companies trust these kinds of managed DB services?
We worked with dedicated support on this, but I don't think they had enough knowledge to dig deep into it and just gave up. There is a huge backlog of critical issues at most AWS services. It looks great from the outside in, but the sausage making process is extremely messy.
Amazon forked ElasticSearch into OpenSearch. When deciding which platform to go with (we are an AWS customer) I decided to stick with the company whose future depends on their search product (Elastic), not the one that could lose interest and walk away and suffer almost no consequences (AWS). If OpenSearch is still around in 5 years, and keeping pace with ElasticSearch, then maybe I'd consider it the next time I'm making this choice.
Also there's a lot more to ElasticSearch than full-text search (aggregations, lifecycle management, Kibana). Doesn't seem like Kendra is going to be a replacement for our use case.
Until very recently, “dense retrieval” was not even as good as bm25, and still is not always better.
I think a lot of people use dense retrieval in applications where sparse retrieval is still adequate and much more flexible, because it has the hype behind it. Hybrid approaches also exist and can help balance the strengths and weaknesses of each.
Vectors can also work in other tasks, but largely people seem to be using them for retrieval only, rather than applying them to multiple tasks.
A lot of these things are use-case dependent. Even the characteristics of BM25 vary a lot depending on whether the query is over- or under-specified, the nature of the query, and so on.
I don't think there will ever be a single answer for the best way of doing information retrieval over a search-engine-scale corpus of documents that is superior for every type of query.
More commonly you use approximate KNN vector search with LLM-based embeddings, which can find many fitting documents that bm25 and similar would never manage to.
The tricky part is to properly combine the results.
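One simple way to combine them is reciprocal rank fusion; a minimal sketch, assuming you already have the two ranked lists of document ids (k=60 is just the commonly used constant, not tied to any particular engine):

    def reciprocal_rank_fusion(vector_results, bm25_results, k=60):
        # merge two ranked lists of doc ids into one fused ranking
        scores = {}
        for results in (vector_results, bm25_results):
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # reciprocal_rank_fusion(["d3", "d1", "d7"], ["d1", "d9", "d3"]) -> ["d1", "d3", "d9", "d7"]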
Vector search is not exclusively in the domain of text search. There is always image/video search.
But pre-filtering is important, since you want to reduce the set of items to be matched on, and it feels like Elasticsearch/OpenSearch are faring better in this regard. Mixed scoring derived from both sparse and dense calculations is also important, which is another strength of ES/OS.
Much more mature and feature-rich than much of the competition listed in the article.
To some degree it's more a platform you can use to efficiently and flexibly build your own more complicated search system, which is both a benefit and a drawback.
some good parts:
- very flexible text search (bm25), more so than Elasticsearch (or at least easier to use/better documented when it comes to advanced features)
- fast, flexible-enough vector search, with good filtering capabilities
- built-in support for defining more complicated search pipelines, including multi-phase search (also known as reranking)
- quite a nice approach for finer control over which kinds of indices are built for which fields
- safety checks when doing schema changes to make sure you don't accidentally break anything, which you can override if you are sure that's what you want
- a ton of control in a cluster over where which search system resources get allocated (e.g. which schemas get stored on which storage clusters, which cluster nodes should act as storage nodes, which should e.g. only do preprocessing or post-processing steps in a search pipeline, and which should e.g. be used for calculating embeddings using some LLM or similar). Not something you need for demos but definitely something you need once your customers have enough data.
- child documents, and document references
- multiple vectors per document
- quite an interesting set of data types for fields and related ways you can use them in a search pipeline
- a flexible, reasonably easy to use system for plugins/extensions (though Java only)
- support for building search pipelines which have sub-searches in external, potentially non-Vespa systems
- really well documented
Though the main benefit *and drawback* is that it's not just a vector database, but a full-fledged search system platform.
generally if you have multiple embeddings for the same document you have two choices:
- create one document for each embedding and make sure non-embedding-specific attributes are the same across all of these document clones -- Vespa makes this more convenient by having child documents
- have a field with multiple vectors, i.e. there are multiple vectors in the HNSW index which point to the same document -- Vespa supports this too. It's what I meant.
Vespa is currently the only vector-search-enabled search system which supports both in a convenient way, but then there are so many "vector databases" popping up every month that I might have missed some.
Check out FeatureBase, when you get a chance. Vectors and super fast operations on sets. I'm using it for managing keyterms extracted from the text and stored along with the vectors.
I'm building a RAG for my personal use: Say I have a lot of notes on various topics I've compiled over the years. They're scattered over a lot of text files (and org nodes). I want to be able to ask questions in a natural language and have the system query my notes and give me an answer.
The approach I'm going for is to store those notes in a vector DB. When I ask my query, a search is performed and, say, the top 5 vectors are sent to GPT for parsing (along with my query). GPT will then come back with an answer.
I can build something like this, but I'm struggling in figuring out metrics for how good my system is. There are many variables (e.g. amount of content in a given vector, amount of overlap amongst vectors, number of vectors to send to GPT, and many more). I'd like to tweak them, but I also want some objective way to compare different setups. Right now all I do is ask a question, look at the answer, and try to subjectively gauge whether I think it did a good job.
Any tips on how people measure the performance/effectiveness for these types of problems?
For small personal projects it's kind of hard to build metrics like this because the volume of indexed content in the database tends to be pretty low. If you're indexing paragraphs you might consistently be able to fit all relevant paragraphs in the context itself.
What I can recommend is to take the coffee tasting approach. Don't try to test and evaluate individual responses; instead, lock the seed used in generation and use the same prompt for two different runs. Change one variable and do a relative comparison of the two outputs. The variables probably worth testing for you, off the top of my head:
* Choice of models and/or tunes
* System prompts
* Temperature of the model against your queries
* Threshold for similarity for document inclusions (you only want relevant documents from your RAG, set it too low and you'll get some extra distractions, too high and useful information might be left out of the context).
If you set up a system to track the comparisons, either automatically or by hand, that just indicates which side of the change worked better for your use case, and test that same change for a bunch of different prompts, you should be able to tally up whether the control or the change was preferred more often.
Keep those data points! The data points are your bench log and can be invaluable later on for anything you do with the system to see what changed in aggregate, what had the most outsized impact, etc and can guide you to build useful tooling for testing or finding existing solutions out there.
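A minimal sketch of what that tracking could look like (all names here are just illustrative; the judge can be you clicking a button or another model):

    from collections import Counter

    def compare_configs(prompts, run_control, run_variant, judge):
        # run_*: prompt -> output (fixed seed, same prompt for both)
        # judge: (a, b) -> "control", "variant" or "tie"
        tally, bench_log = Counter(), []
        for prompt in prompts:
            a, b = run_control(prompt), run_variant(prompt)
            verdict = judge(a, b)
            tally[verdict] += 1
            bench_log.append({"prompt": prompt, "control": a, "variant": b, "verdict": verdict})
        return tally, bench_log  # keep bench_log around: it's the bench log mentioned above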
I use lots and lots of domain-specific test cases at several layers, numbering in the hundreds or thousands. The score is the number of test cases that pass, so it requires a different approach than all-or-nothing tests. The layers depend on your RAG "architecture", but I test the RAG query generation and scoring (comparing ordered lists is the simplest, but I also include a lot of fuzzy comparisons), the LLM scoring the relevance of retrieved snippets before feeding into the final answering prompt, and the final answer. The most annoying part is the prompt to score the final answer, since it tends to come out looking like a CollegeBoard AP test scoring rubric.
This requires a lot of domain-specific work. For example, two of my test cases are "Is it [il]legal to build an atomic bomb" run against the entire USCode [1], so I have a list of sections that are relevant to the question that I've scored before eventually getting an answer of "it is illegal", followed by several prompts that evaluate nuance in the answer ("it's illegal except for…"). I have hundreds of these test cases, approaching a thousand. It's a slog.
[1] 42 U.S.C. 2122 is one of the “right” sections in case anyone is wondering. Another step tests whether 2121 is pulled in based on the mention in 2122
The main thing is that there's no "objective" way, but if you rank and label your own data then you can certainly get a ranking that's subjectively well performing according to you.
RAG in this case is essentially the same as a recommender system so you can approach it with the same metrics you would there.
You'll need to build a data set with known correct answers, but then it's basically standard IR evaluation: NDCG (Normalized Discounted Cumulative Gain) is a good place to start; MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision) are other options. You could also just look at the accuracy of getting your result in the top k results for various thresholds of k (which can be interpreted as the probability of getting your result in the top k).
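For reference, minimal versions of a couple of these, assuming a labelled set of queries where you know which document ids are relevant (all names are just illustrative):

    import math

    def reciprocal_rank(ranked_ids, relevant_ids):
        # 1/rank of the first relevant result, 0 if none found; average over queries for MRR
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                return 1.0 / rank
        return 0.0

    def ndcg_at_k(ranked_ids, relevant_ids, k):
        # binary-relevance NDCG@k: discounted gain of hits, normalized by the ideal ordering
        dcg = sum(1.0 / math.log2(rank + 1)
                  for rank, doc_id in enumerate(ranked_ids[:k], start=1)
                  if doc_id in relevant_ids)
        ideal = sum(1.0 / math.log2(rank + 1)
                    for rank in range(1, min(k, len(relevant_ids)) + 1))
        return dcg / ideal if ideal > 0 else 0.0

    def hit_at_k(ranked_ids, relevant_ids, k):
        # average this over your query set to get "probability of a hit in the top k"
        return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0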
Included here is a bit of the old tried and true: NDCG/MRR/Precision @k - what you really want for measuring your information retrieval systems.
But we also talk through a bit of the "new", how to use Evals to generate the building blocks for those metrics above. You will want both hand labels and the automated Evals in the end to evaluate your system.
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.
Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling and retrieval augmented generation.
txtai adopts a local-first approach. A production-ready instance can be run locally within a single Python instance. It can also scale out when needed.
txtai can use Faiss, Hnswlib or Annoy as its vector index backend. This is relevant in terms of the ANN-Benchmarks scores.
Hence why I’d be interested to know more about the supporting details for the different categories. It may help uncover some inadvertent errors in the analysis, but also would just serve as a useful jumping-off point for people doing their own research as well.
Totally agree that the rubric is a puzzling assortment. PostgreSQL supports role-based access control (RBAC). Not to mention, with PostgreSQL and the pgvector extension, you have a whole list of languages ready to use it:
C++ pgvector-cpp
C# pgvector-dotnet
Crystal pgvector-crystal
Dart pgvector-dart
Elixir pgvector-elixir
Go pgvector-go
Haskell pgvector-haskell
Java, Scala pgvector-java
Julia pgvector-julia
Lua pgvector-lua
Node.js pgvector-node
Perl pgvector-perl
PHP pgvector-php
Python pgvector-python
R pgvector-r
Ruby pgvector-ruby, Neighbor
Rust pgvector-rust
Swift pgvector-swift
Wonder how many of those other Vector databases play nice.
That stood out to me as well. I've been playing with pgvector, and there's no reason you can't use row/table role-based security.
I think there's an unmentioned benefit to using something like pgvector also. You don't need a separate relational database! In fact you can have foreign keys to your vectors/embeddings which is super powerful to me.
Same for Developer experience. If you used Postgres or any other relational db (which I think covers a large % of devs), you could easily argue the dev experience is 3/3 for pgvector.
Not only 3/3 but also includes full text search built in. Tables look like:
    CREATE TABLE mything_embedding (
        id bigserial PRIMARY KEY,
        mything_id integer REFERENCES mything (id),  -- fkey to mything table
        embedding vector(1536),
        fulltext tsvector
    );
    CREATE INDEX ON mything_embedding USING gin (fulltext);
    CREATE INDEX ON mything_embedding USING hnsw (embedding vector_cosine_ops);
Then you can pull results that match either the tsvector AND/OR the similarity with a single query, and it's pretty performant. You can also choose at the query level whether you want exact matching or fuzzy.
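A rough sketch of such a query from Python, assuming the table above and plain psycopg2 (the 0.5 distance cutoff and ordering purely by vector distance are arbitrary choices, not a recommendation):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # assumed connection details
    emb = [0.01] * 1536                     # query embedding from your model
    emb_literal = "[" + ",".join(str(x) for x in emb) + "]"  # pgvector's text format

    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT mything_id,
                   ts_rank(fulltext, plainto_tsquery('english', %(q)s)) AS text_rank,
                   embedding <=> %(emb)s::vector AS cos_dist
            FROM mything_embedding
            WHERE fulltext @@ plainto_tsquery('english', %(q)s)
               OR embedding <=> %(emb)s::vector < 0.5
            ORDER BY embedding <=> %(emb)s::vector
            LIMIT 20
            """,
            {"q": "hearty beef dish", "emb": emb_literal},
        )
        rows = cur.fetchall()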
I made this table to compare vector databases in order to help me choose the best one for a new project. I spent quite a few hours on it, so I wanted to share it here too in hopes it might help others as well. My main criteria when choosing a vector DB were speed, scalability, DX, community and price. You'll find all of the comparison parameters in the article.
Happy to connect. The benchmark numbers are mostly from ANN Benchmarks. For my use case, the nytimes-256 dataset was most relevant so I used that for the QPS benchmark. I also took a look at the benchmarks you've made at https://qdrant.tech/benchmarks/ and there qdrant seems to be outperforming many others. If I've gotten something wrong here, I'm glad to update the article :)
I'd love to know how vector databases compare in their ability to do hybrid queries, vector similarity filtered by metadata values. For example, find the 100 items with the closest cosine similarity where genre = jazz and publication date between 1990 and 2000.
Can the vector index operate on a subset of records? Or when searching for 100 closest matches does the database have to find 1000 matches and then apply the metadata filter, and hope that doesn't reduce the result set down to zero and exclude relevant vectors?
It seems like measuring precision and recall for hybrid queries would be illuminating.
I can't speak to the others, but pgvector indices can "break" hybrid queries. For example, if you select using a where clause specifying metadata (where genre = jazz) and order by distance from a vector (embedding of sound clip); if the index doesn't have a lot (or any) vectors in the sphere of the query vector that also match the metadata it can return no results. I discuss this in a blog post here [1].
Curious about the lack of Vespa, especially given the thoroughness of the article and its long-time reputation. OpenSearch is also missing, but perhaps it can be considered lumped in with Elasticsearch since both are based on Lucene. The products are starting to diverge, though, so it would be nice to see, especially since it is open source.
For the performance-based columns, it would also be helpful to see which versions were tested. There is so much attention on vector databases lately that they are all making great strides forward. The Lucene updates are notable.
What advantage are vector databases providing above using an index in conjunction with a mature database? I’m not sold on this as a separate technology.
Vector search is useful, but I don’t understand why I would go out of my way when I could implement FAISS or HNSWlib as an adjunct to postgres or a document store.
Vector extensions to your current database or search engine makes far more sense than adding yet another dependency to manage and operate. The vector database folks will have to become a real database or full featured search engine to survive and compete with the incumbents that will all have good solutions for vector similarity search.
The thing is, if you need a vector _database_ there is no reason why it can't be a pg extension. And if your project is only small scale there is probably some HNSW pg extension library you could use.
But what is most often needed instead of a vector database is an efficient, fast, responsive approximate-KNN vector search system with fast attribute filtering, which overlaps with a fast and efficient text search system (e.g. bm25 based).
And if you then go to billion vector scale things become tricky performance wise.
And then you reach the same point at which companies do things like taking a warehouse approach, where you have a read-only, extremely read-optimized, mostly in-memory variant of the db that is accessed for searches only, and changes from the main db are streamed to the read-only search instance, potentially while losing snapshot views, transactions and similar.
You could say that approximate KNN vector search is the new must-have feature for unstructured fuzzy text search, and while you can have unstructured fuzzy text search in pg, it's also often not the go-to solution if your database exists just to provide that search.
Because any production use case I'm aware of sooner or later uses both kinds of search and combines the results.
E.g. vector search is fundamentally terrible at finding keywords, but keyword search is fundamentally terrible at finding equivalent things which use slightly different words.
Strongly disagree with PGVector's DX being worse than Chroma. Installing, configuring, and working with Chroma was infuriating -- it's alpha software and has the bugs and rough edges to prove it. The tools to support and interface with postgres are battle-tested and so much nicer by comparison; getting Chroma working took over a week, ripping it out and replacing with PGVector took a couple hours.
Also agree with this[0] article that vector search is only one type of search, and even for RAG isn't necessarily the one you want to start with.
Yeah, I had a similar experience with Chroma DB. On paper, it checked all my boxes. But yea, it's alpha software with the first non-prerelease version only coming out in July 2023 (so it's 3 months old).
I ran into some dumb issues during install like the SQLite version being incorrect, and there wasn't much guidance on how to fix these problems, so gave up after struggling for a few hours. Switched to PGVector which was much simpler to setup. I hope Chroma DB improves, but I wouldn't recommend it for now.
Thanks for your input, I've only tried Chroma a little bit so far and had a pretty good experience. What they also have going for them is a big community on discord that can be helpful.
Gonna add some information here since this isn't very descriptive.
milvus-lite is a bit like SQLite in that it runs in-process. Here are some scenarios you'd want to use it in:
- You want to use Milvus directly without having to install it using Milvus Operator, Helm, Docker Compose, etc.
- You do not want to launch any virtual machines or containers while you are using Milvus.
- You want to embed Milvus features in your Python applications.
I quickly took a look at the RediSearch ANN Benchmarks and they seem to stack up well against the others (more or less the same level as Milvus) when it comes to QPS and latency.
I'm currently in the market for a self-hosted DB for a personal project. The project is an app you can run on your own system that provides QA on your text files. So I'm looking for something lightweight, but I'm also looking for the best possible search, and ANN retrieval is just a single part of that.
Their definition of Hybrid Search is, I think, wrong.
Though these terms tend not to be consistently defined at all, so "wrong" is maybe the wrong word.
Their definition seems to be about filtering results during (approximate) KNN vector search.
But that is filtering, not hybrid search. Though it might sometimes be implemented as a form of hybrid search, that's an internal implementation detail, and you should probably hope it's not implemented that way.
Hybrid search is when you do both a vector search and a more classical text based search (e.g. bm25) and combine both results in a reasonable way.
The way you explain hybrid search aligns with my understanding. Pinecone has a good article about it here https://www.pinecone.io/learn/hybrid-search-intro/. From my understanding, all vector DBs support this.
This is interesting because it does not mention the vector database powered by Apache Cassandra or the hosted serverless version, DataStax Astra. Here is a write-up we did on 5 hard problems in vector search and how we solved them. https://thenewstack.io/5-hard-problems-in-vector-search-and-...
In full transparency: I work for DataStax and lead engineering for the vector database.
I don't think we need specialized databases for vectors. Relational databases can easily be extended with vector data types and operations. They will eventually catch up by supporting what was once a unique feature of the new systems: https://medium.com/@magda7817/two-things-to-keep-in-mind-bef...
Yeah, this is my sense too. They will be slower to add these new requirements but they should be able to add these vector capabilities within a year or so. It's then a question of ability of smaller vector db companies to mature and add regular db capabilities, while innovating.
Agreed on pgvector being simple and a great choice for POCs and low scale, especially if you're familiar with Postgres. Our team released something new last week built for folks looking to use PostgreSQL at scale as a vector store [0], featuring a DiskANN index type.
Quick question regarding the scalability and support of multiple vector databases under a single cloud service. Suppose an enterprise Saas product served multiple customers with each requiring a unique RAG vector knowledge-base for product and company info. Do any of these solutions allow for a large number (dozens or hundreds) of small distinct Knowledge bases? Do any offer easily integrated automated pipelines for documents to be parsed and ingested?
Postgres with PGVector is the best database, plus vectors.
All of the "Vector DBs" suffer horribly when trying to do basic things.
Want to include any field that matches a field in an array of keys? Easy in SQL. Requires an entire song and dance in Pinecone or Weaviate.
After implementing Chroma, Weaviate, Pinecone, Sqlite with HNSW indices and Qdrant-- I'm not impressed. Postgres is measurably faster since so much relies on pre-filtering, joins, etc.
Strongly disagree about the Pinecone developer experience. Not that they don't have SDKs, but last I checked they didn't have documentation on how to approach local dev environments.
The implication being that you spin up a separate index for $70/mo, and then you have to upsert any relevant data yourself. Sure that's not difficult, but why do you have to do it at all? Why doesn't Pinecone make it easy to replicate data to another index for use in dev/staging?
You might like the 'Which Search Engine?' panel I ran at Buzzwords earlier this year with some of the leading contenders (Vespa, Qdrant, Elastic, Solr, Weaviate) https://www.youtube.com/watch?v=iI40L4wMtyI - vector search was obviously part of the discussion
20M vectors @ 768 dims is about 62GB at 32-bit float, not even quantized. AWS RDS will put it at 83 USD/month (db.t4g.small, 2 vCPU, 2GB RAM). But that's without egress, backups, etc.
Seems acceptable at least for a POC?
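For what it's worth, the back-of-the-envelope arithmetic behind that 62GB figure (raw float32 vectors only, no index overhead):

    n_vectors, dims, bytes_per_float32 = 20_000_000, 768, 4
    raw_bytes = n_vectors * dims * bytes_per_float32
    print(raw_bytes / 1e9)  # ~61.4 GB, so "about 62GB" before any HNSW/graph overhead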
A better option if you already have the data in the same instance, but the developer experience being rated low scares me. Anyone tried it? How did it go?
I'm interested to try some of these others next time around, but I've used qdrant self-hosted in two projects and been pleased. Milvus was recommended so I gave that a try but found it over complicated. Pgvector seems like an obvious choice if you are already using postgres and if that performance is ok.
It was a while ago now so the details have faded, but for one all of the docker services it had to spin up vs the single container that qdrant runs. I'm sure there is a reason for this, but I haven't needed it.
Latency from embedding models is still going to be the performance bottleneck no matter how fast the DB is. Plus adding all the overhead of synthesising answers and summaries from an LLM is going to weigh you down.
If you are building a search engine or a QA bot, the embedding of the query still needs to be calculated. The results do depend on the quality of the model, and if you are using a large one it does take time.
We conducted benchmark tests on Elastic's queries per second (QPS) performance using datasets of 500,000 and 1 million vectors. The result was that Zilliz is 13x and 22x faster, respectively. https://zilliz.com/blog/elasticsearch-cloud-vs-zilliz
We also conducted a benchmark comparing Pgvector to both Milvus (open source) and Zilliz (managed, with a free tier option). When running the OSS Milvus on 2 CPUs and 8 GiB memory, Pgvector was found to be 5 times slower. You can check out the detailed performance charts at the bottom of this blog post:
https://zilliz.com/blog/getting-started-pgvector-guide-devel...
Feel free to explore our open-source benchmarking tool, which allows you to examine our methodology and even compare it with your vector database. https://github.com/zilliztech/VectorDBBench
Yeah, that's the difference we've seen according to the QPS for the ANN Benchmarks. The same story seems to be true for other datasets too. We're looking at a 0.9 recall.
Many of them are open source and you can host them yourself. That would make it more cost effective. Also someone mentioned https://turbopuffer.com/. That seems like a good alternative if you're looking for something economical.
Somehow I felt that at least part of the article was generated by an LLM. It's unfortunate to see that a new bias has started to creep in: whatever I read now, I second-guess and wonder whether it may be partially or fully generated by LLMs.