Vector databases: analyzing the trade-offs (thedataquarry.com)
170 points by chop on Aug 20, 2023 | 55 comments



I work on Typesense [1] - historically considered an open source alternative to Algolia.

We then launched vector search in Jan 2023, and just last week we launched the ability to generate embeddings from within Typesense.

You'd just need to send JSON data, and Typesense can generate embeddings for your data using OpenAI, PaLM API, or built-in models like S-BERT, E-5, etc (running on a GPU if you prefer) [2]

You can then do a hybrid (keyword + semantic) search by just sending the search keywords to Typesense, and Typesense will automatically generate embeddings for you internally and return a ranked list of keyword results interwoven with semantic results (using Rank Fusion).
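To give a rough idea of what rank fusion does, here is a minimal sketch of one common variant, reciprocal rank fusion (RRF). This is an illustration only: the function name, the document ids, and the constant k=60 are assumptions for the example, and it is not a description of Typesense's internal implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked highly by several retrievers float to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a semantic search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that appear in both lists (doc_a, doc_c) end up ahead of documents that only one retriever found.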

You can also combine filtering, faceting, typo tolerance, etc - the things Typesense already had - with semantic search.

For context, we serve over 1.3B searches per month on Typesense Cloud [3]

[1] https://github.com/typesense/typesense

[2] https://typesense.org/docs/0.25.0/api/vector-search.html

[3] https://cloud.typesense.org


We store a couple million documents in Typesense and the vector store is performing great so far (average search time is a fraction of overall RAG time). Didn’t realise you’ve updated to support creating the embeddings automatically; great news!


This is very difficult for me to understand. Can you explain like I'm an undergrad? What exactly does this mean? What is an embedding? What is the difference between keyword and semantic search?


Here's an example of semantic search:

Let's say your dataset has the words "Oceans are blue" in it.

With keyword search, if someone searches for "Ocean", they'll see that record, since it's a close match. But if they search for "sea" then that record won't be returned.

This is where semantic search comes in. It can automatically deduce semantic / conceptual relationships between words and return a record with "Ocean" even if the search term is "sea", because the two words are conceptually related.

The way semantic search works under the hood is using these things called embeddings, which are just a big array of floating point numbers for each record. It's an alternate way to represent words, in an N-dimensional space created by a machine learning model. Here's more information about embeddings: https://typesense.org/docs/0.25.0/api/vector-search.html#wha...

With the latest release, you essentially don't have to worry about embeddings (except maybe picking one of the model names to use and experiment with) and Typesense will do the semantic search for you by generating embeddings automatically.
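If it helps to see the idea in code, here is a toy sketch of comparing embeddings with cosine similarity. The three-dimensional vectors below are made up by hand purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written stand-ins for model-generated embeddings (assumption).
toy_embeddings = {
    "Oceans are blue": [0.9, 0.1, 0.0],
    "Invoices are due on Friday": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend this is the embedding of "sea"

# Pick the record whose vector points in the most similar direction.
best = max(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
)
print(best)
```

The "sea" query lands closest to the ocean record even though the two share no keywords, which is the whole point of semantic search.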


We also use Typesense for vector search in production at Struct.ai, and it works amazingly well.

I'm surprised the original post doesn't benchmark Typesense.


I'm glad we're getting away from the paradigm that the only thing that matters is recall at a given speed.

SOOOO much more matters than that! Any production database is going to have a huge medley of concerns and constraints. A reasonable recall at reasonable speed, but much easier to integrate and maintain, is going to be far far preferred. Not to mention a good retrieval system needs a much broader range of features than just dense vector retrieval.

It's a sign the space is maturing away from being an academic/"benchmarking" competition space to one with actual industry concerns.


> […] much easier to integrate and maintain, is going to be far far preferred

Agreed. My team recently settled on Qdrant because it was fast and painless to set up and get started using.


Adding txtai to the list: https://github.com/neuml/txtai

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling and retrieval augmented generation.

txtai adopts a local-first approach. A production-ready instance can be run locally within a single Python instance. It can also scale out when needed.


What are the benefits of a dedicated embeddings database over adding a vector index to an existing engine, like sqlite-vss or pg_vector or Elasticsearch?

Vector search still feels like more of an index type feature than a separate product to me.


pg_vector doesn't do nearly as well (relative to a dedicated vector DB) in terms of speed and accuracy. That being said, I think it's still clearly the correct choice in a lot of cases. An order of magnitude speedup may not matter at all at small or medium scale. And pg_vector (or something else) is likely to continue to improve significantly. Contrast that with the costs (in terms of time and money) of running additional infrastructure.


pg_vector doesn't perform well compared to other methods, at least according to ANN-Benchmarks (https://ann-benchmarks.com/).

txtai is more than just a vector database. It also has a built-in graph component for topic modeling that utilizes the vector index to autogenerate relationships. It can store metadata in SQLite/DuckDB with support for other databases coming. It has support for running LLM prompts right with the data, similar to a stored procedure, through workflows. And it has built-in support for turning data into vectors.

For vector databases that simply store vectors, I agree that it's nothing more than just a different index type.


I find pg_vector very useful since it also acts as the default DB for a ton of other functions. Interested to see if folks think that a combination of Postgres with something like Qdrant is the way to go? Are the benefits worth the trade-off in terms of ease of use and flexibility?


I would make a distinction between adding an index to sqlite/postgres and adding one to elastic. The ANN algorithms usually require that the index fit in RAM, which can lead to high memory requirements on large datasets. Something like elastic will do a lot better because of its ability to scale horizontally.

I assume that is the rationale behind a dedicated database, but I generally feel the same way as you.


Vector search seems to have very different usage patterns than a regular database, both in terms of writes and queries.


Note you can also just use pretty much any SQL db, which is not fast but can be good enough. I typically get my results in about 250 ms. If your app is already using SQL it's nice to not introduce a new dependency.
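For the curious, here is a minimal sketch of what that can look like, assuming the simplest possible approach: store each embedding as JSON text in a regular table and score every row in application code. The table name, columns, and two-dimensional vectors are invented for the example; a full scan like this is only reasonable for small tables.

```python
import json
import math
import sqlite3


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)")

# Hand-written toy vectors stand in for real model output (assumption).
rows = [
    ("Oceans are blue", [0.9, 0.1]),
    ("Lakes are calm", [0.7, 0.3]),
    ("Invoices are due", [0.1, 0.9]),
]
conn.executemany(
    "INSERT INTO docs (body, embedding) VALUES (?, ?)",
    [(body, json.dumps(vec)) for body, vec in rows],
)


def search(conn, query_vector, limit=2):
    """Brute-force search: score every stored vector, keep the best few."""
    scored = []
    for doc_id, body, raw in conn.execute("SELECT id, body, embedding FROM docs"):
        scored.append((cosine_similarity(query_vector, json.loads(raw)), doc_id, body))
    scored.sort(reverse=True)
    return scored[:limit]


results = search(conn, [1.0, 0.0])
for score, doc_id, body in results:
    print(f"{score:.3f} {body}")
```

The usual SQL filters (WHERE clauses on other columns) can narrow the candidate rows before scoring, which is often what keeps this approach fast enough.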


Communication cost. You certainly get cache coherency and prefetching that you wouldn't get if you were following pointers. In a distributed context that's a big hit.


It was my understanding that txtai is not a vector database, but rather it uses databases. It pulls together a large set of tools that most people use together in a very nice way, such that the API is more consistent for researchers to show what they are doing.


That is a great question. txtai is indeed different than most on that list.

It's similar in that it writes data to vector index formats such as Faiss, Hnswlib. It has metadata filtering via SQLite/DuckDB to filter on additional fields.

It's different in that it can use other vector databases for its file format. And it has significant logic via workflows for data transformation. Then there is the graph component for topic modeling.

So that's where I came up with the term "embeddings database" which I consider a vector database and much more.


I find it a little funny that Redis is considered here. We use it! We just store vectors in redis, fetch what we need, and run cosine similarity in memory. It’s very fast and works well. It’s not suitable for large amounts of data, but if your “knowledge base” can be measured in MB of vectors (instead of GB or TB) then it’s worth considering.

I’m just not sure if I’d consider it a database. It’s just a long lived cache for us.
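Roughly, the pattern looks like the sketch below. To keep it self-contained, a plain dict stands in for Redis, and the key names and toy vectors are made up. Vectors are normalized once before being stored, so cosine similarity at query time is just a dot product.

```python
import heapq
import math
import struct


def pack_vector(vec):
    """Normalize to unit length and serialize as little-endian float32 bytes."""
    norm = math.sqrt(sum(x * x for x in vec))
    unit = [x / norm for x in vec]
    return struct.pack(f"<{len(unit)}f", *unit)


def unpack_vector(blob):
    """Deserialize float32 bytes back into a tuple of floats."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


# A dict standing in for the cache (assumption); values are raw bytes,
# which is what you would get back from a key-value store.
cache = {
    "doc:1": pack_vector([3.0, 4.0]),
    "doc:2": pack_vector([4.0, 3.0]),
    "doc:3": pack_vector([-1.0, 0.0]),
}


def top_k(query, cache, k=2):
    """Return the k keys whose stored vectors are most similar to the query."""
    q = unpack_vector(pack_vector(query))

    def score(key):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(a * b for a, b in zip(q, unpack_vector(cache[key])))

    return heapq.nlargest(k, cache, key=score)


print(top_k([1.0, 0.9], cache))
```

This scans every vector on each query, which is why it only makes sense while the whole set fits comfortably in memory.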


How do you generate your vectors?

I'm working on something that needs a similar, small and fast, vector search implementation. Crucially we also need fast indexing speed for our use case, but a bottleneck we're hitting is the time it takes to generate vector embeddings for larger documents in our dataset (a few megabytes in our case). Wondering what's the fastest way to approach that?


Are you tied to any particular transformer model? Using a smaller model, throwing more hardware at the problem, or generating embeddings in parallel are easy ways to make it faster. Depending on what you're doing with the output you may also consider truncating your documents (can be good for stuff like clustering) or breaking apart your documents (can improve search performance).

Another option if you just want search (and aren't training or tuning your own models) is a managed search offering where you aren't responsible for generating embeddings.


Thanks for the advice! We're not tied to any model, no.

Naively I guess, at first we hoped to get by using a 3rd party API. We're hosted in GCP and tried using the Vertex AI `textembedding-gecko` model initially. But now we're investigating running models on our own infra, although not sure where we've got with it yet as someone else is working on that.


If you're committed to using a 3rd-party API, then parallelizing your API calls seems like the easiest way to speed things up. The benefits of a 3rd party API are - of course - that you're likely going to be able to generate embeddings using a much more powerful model. That being said, you may not need something as powerful as PaLM and having everything go over a network might just take too long. IME (which is entirely use-case dependent) something like SentenceTransformers (even the smallest pretrained models) can get you up and running on your own infra pretty quickly and generate embeddings with reasonable performance in a reasonable amount of time on modest hardware.
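As a sketch of the parallelization idea: the snippet below fans out calls with a thread pool. `embed_chunk` is a hypothetical placeholder that returns a fake vector; in practice it would wrap whatever embeddings API or local model you use, and the worker count would need tuning against that provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_chunk(text):
    """Placeholder for a real embeddings call (hypothetical).

    Returns a tiny fake vector so the sketch runs without network access.
    """
    return [float(len(text)), float(text.count(" "))]


def embed_all(chunks, max_workers=8):
    """Embed chunks concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_chunk, chunks))


vectors = embed_all(["oceans are blue", "sea", "invoices are due"])
print(vectors)
```

Threads work well here because the real call would be I/O-bound (waiting on the network); for a CPU-bound local model, batching inputs is usually the better lever.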


It's just OpenAI embeddings. We fetch them and just push them to Redis with a 30 day TTL. The backing data that's embedded rarely changes, so we don't need to create new embeddings very often. We batch what needs to be embedded.

The full RAG workflow - using ADA to embed the user input, deserialize embeddings, run cosine similarity, and call gpt-3.5-turbo - is about 3 seconds end-to-end to get a result.


Thanks!

OpenAI embeddings are 1 per request payload, right? Have you hit any rate limits doing that?

We have a performance budget of ~1 second for the generate-index-search pipeline, which may or may not be feasible. I discounted OpenAI because it seemed like we're guaranteed to hit the rate limit if we flood them with concurrent requests for embeddings. Typical corpus size that we need to work with is 20 concurrent documents ranging from ~100 KB to ~2 MB. Chunking those documents to fit the 8k token context window balloons the request count further.


You absolutely want to chunk them smaller than 8k. Have you tested different chunk strategies? It can make a huge difference for actually recalling useful information in small enough chunks to be usable.
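As a starting point for experimenting, here is a minimal sketch of one simple strategy: fixed-size chunks with some overlap. It counts whitespace-separated words rather than model tokens, and the default sizes are arbitrary assumptions, so treat it as a baseline to compare against rather than a recommendation.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to chunk_size words, each sharing
    `overlap` words with the previous chunk so context isn't cut cold."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

Other strategies worth comparing include splitting on sentence or paragraph boundaries, or on document structure such as headings.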


Thanks for the tip, I haven't played around with chunk size much at all so far.


marqo.ai has excellent indexing throughput as vector generation and vector retrieval are both contained within a marqo cluster. You can use it with multi-gpu, cpu, etc. It's also horizontally scalable.


For use with LLMs, I implemented my own vector DB code in Common Lisp and Swift.

When I work in Python, my favorite is absolutely Chroma embedded.

The article is wrong about Chroma embedded being in memory only. It also works fine with write through to a local disk, including maintaining index files.


Shameless self-plug for our embedded vector database milvus-lite (https://github.com/milvus-io/milvus-lite):

    pip install milvus


Thank you, I will try it. I noticed that you put the entire implementation inside the package’s __init__.py file. Interesting, and I had not seen that done before.


I have seen that pattern before, and for me it's a bit of an antipattern. Usually you wouldn't look for substantial code there, and in most cases it is nicer to organize your code in modules. You can import from those in the __init__.py file, this way achieving the same effect as having all code live in __init__.py. But it's a matter of preference.


Another one mentioned below to try is txtai: https://github.com/neuml/txtai

It can run embedded in a single Python instance and has no issues running in production that way.


Full disclosure: I work for Pinecone, which is conspicuously absent from the write-up despite being the first and most popular vector database.

While it’s great to see efforts to make sense of the (admittedly noisy) vector database market, I’m struggling to grok large chunks of this. For example I can’t tell what the author means by “serverless”, but given that they put a whole bunch of open-source, self-hosted solutions in that part of the diagram it’s definitely not the commonly understood meaning.

For anyone diving into the topic, here is another introductory article to help you: https://www.pinecone.io/learn/vector-database/


Pinecone is not absent and it's definitely not the first vector database.


This little thread helps solidify my opinion towards using only the few in the small section they point out, namely Qdrant, Chroma, or Weaviate. The fact that someone who works at Pinecone didn't take the time to read an article in their field, decided to comment anyway, and then stated they didn't know what serverless or embedded may mean with regard to databases pretty much sums up the state of many vector databases, especially Pinecone imo.

I don't even really love Qdrant or chroma (haven't tried weaviate, but it's at least in the right region according to the article), but at least they are embedded.

I pretty much refuse to use any DB that requires API keys, putting data off premises, or even setting up ACLs. I don't even use Postgres much, because of the complex ACLs and having to set up ports.

SQLite and DuckDB are truly incredible, can store gigantic databases (>2TB is still quite performant) and you can just hand any collaborator the entire DB on a disk, without having to worry about complex password junk.


I have to agree here - including pinecone in the images of major competitors but doing absolutely no analysis of the product is bizarre. I would expect almost everyone would come into the article aware of pinecone and pg_vector, and often just at a “it’s a leading vector database but that’s all I know” level. It leaves a giant question mark in my mind of how pinecone fits into the space of competitors. Its absence (beyond an image with its name called out) is baffling.

I am also a bit surprised at the hostility to the pinecone employee here. I see a bunch of other companies and projects jumping in and everything is cool. But for some reason the pinecone dude catches a lot of grief.

What’s going on? Is there some weird toxic subculture in the vector database space that’s got its knives out for pinecone for some reason?


> Is there some weird toxic subculture in the vector database space

(I work for Weaviate)

Unfortunately, some players in the space (who are on this list) are cheating and playing an unfair game. I guess that this comes with a rapidly growing space.

I hope we can all quickly go back to focussing on our respective communities and educating the market (together) on the awesome things one can do with vector DBs.


Just ctrl-f "pinecone" and got nothing.

edit: pinecone is mentioned in the earlier parts of the blog series, though.


It's in both the images discussing the players in the space.


Umm.. perhaps look at the images?


> despite being the first […]

This is false

> […] and most popular vector database

Based on what?


We conducted a benchmark to evaluate the precision, throughput (QPS), insert speed, build speed, and cost-effectiveness of Pinecone, Qdrant, MyScale, Weaviate, and Zilliz (Milvus). This information will be valuable for those seeking to choose a vector database for production purposes. The results can be found at https://myscale.github.io/benchmark/.


I work on MyScale (https://myscale.com), a fully-managed vector database based on ClickHouse.

Some unique features of MyScale:

1. This solution is built on ClickHouse and offers comprehensive SQL support. Our users leverage vector search for a wide range of interesting OLAP use cases.

2. We utilize a proprietary vector search algorithm called the Multi-Scale Tree Graph (MSTG). This algorithm is significantly faster than HNSW for both vector index building and filtered vector searches.

3. We utilize NVMe SSDs for the vector index cache, which greatly reduces the cost of hosting millions of vectors.


Can we start getting a similar flood of tools to generate the embeddings now? That’s my bottleneck. Searching them works well on numerous databases that support arrays/vectors.


There are dozens of different models that can generate embeddings: https://docs.marqo.ai/1.0.0/Models-Reference/dense_retrieval...

Most frameworks, like Haystack, can wrap embeddings generation for you.


I work at FeatureBase and I'm storing vectors from the Instructor Large library/model into our solution. Getting good results, which I should probably quantify at some point. One thing that FeatureBase does well is allow filtering of the vector space via SQL.

I would say that most people seem to prefer an engine that embeds and stores things as a service, but using Instructor is only a few lines of code and runs locally.


Just to quickly add to ukuina's comment, marqo.ai does embedding generation and vector search end to end, so you can put in documents and the embeddings are automatically generated.


Lots of tools can do this, and they've been around for a long time. For example, txtai has been able to generate embeddings with sentence-transformers since 2020.


SentenceTransformers all-MiniLM-L6-v2 is still your best bet since you can generate them in batches with GPU acceleration.


Marqo is an end-to-end vector search engine that handles both embedding creation and retrieval: https://github.com/marqo-ai/marqo


surprised there’s no mention of kdb+ in here. iirc it’s older and more performant than most of these


kdb doesn't support efficient vector similarity searches, or efficient storage of high-dimensional vectors. It isn't really in the same class as the vector databases discussed in this post. It's more suited for time series data.


i’m not sure this is correct, but to each his own


np.array



