Are we at peak vector database? (softwaredoug.com)
235 points by softwaredoug 5 months ago | 142 comments



IMO we are well past peak cosine-similarity-search as a service. Most people I talk to in the space don't bother using specialized vector DBs for that.

I think there's space for a much more interesting product that is longer-lived (since it's harder to implement than just cosine-similarity-search on vectors), which is:

1. Fine-tuning OSS embedding models on your real-world query patterns

2. Storing and recomputing embeddings for your data as you update the fine-tuned models.

MTEB averages are fine, but hardly anyone uses the average result: most use cases are specialized (e.g. classification vs clustering vs retrieval). The best models try to be decent at all of those, but I'd bet that fine-tuning on a specific use case would beat a general-purpose model, especially on your own dataset (your retrieval is probably meaningfully different from someone else's: code retrieval vs document Q&A, for example). And your queries are usually specialized! People using embeddings for RAG are generally not also trying to use the same embeddings for clustering or classification; and the reverse is true too (your recommendation system is likely different from your search system).

And if you're fine-tuning new models regularly, you need storage + management, since you'll need to recompute the embeddings every time you deploy a new model.
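
For (1), a minimal sketch of what this could look like with sentence-transformers (the model name and the (query, relevant passage) pairs are placeholders, not a recommendation):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # (query, relevant passage) pairs mined from your own query logs -- placeholders
    pairs = [("rd tshrt xs", "Red T-Shirt, size XS"),
             ("pod exit code 137", "Exit code 137 means the container was OOM-killed")]

    model = SentenceTransformer("all-MiniLM-L6-v2")        # any OSS embedding model
    train = [InputExample(texts=[q, d]) for q, d in pairs]
    loader = DataLoader(train, shuffle=True, batch_size=32)
    loss = losses.MultipleNegativesRankingLoss(model)      # in-batch negatives

    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    model.save("my-domain-embedder-v2")                    # then re-embed the corpus (point 2)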

I would pay for a service that made (1) and (2) easy.


I no longer work there, but Lucidworks has had embedding training as a first-class feature in Fusion since January 2020 (I know because I wrapped up adding it just as COVID became a thing). We definitely saw that with even slightly out-of-band use of language - e.g. e-commerce queries like "RD TSHRT XS" - embedding search with open (and closed) models would fall below bog-standard* BM25 lexical search. Once you trained a model, performance would kick up above lexical search…and if you combined lexical _and_ vector search, things were great.

Also, a member on our team developed an amazing RNN-based model that still today beats the pants off most embedding models when it comes to speed, and is no slouch on CPU either…

(* I'm being harsh on BM25 - it is a baseline that people often forget in vector search, but it can be a tough one to beat at times)


Heh. A lot of what search people have known for a while, is suddenly being re-learned by the population at large, in the context of RAG, etc :)


The thing with tech is, if you're too early, it's not like you eventually get discovered and adopted.

When the time is finally right, people just "invent" what you made all over again.


Totally. And this has even happened in search. Open source search engines like Elasticsearch, etc did this... Google etc did this in the early Web days, and so on :)


Sorry, what is it that people in search _have_ known?

I know nothing about search, but a bit about ML, so I'm curious


That ranking is a lot more complicated than cosine similarity on embeddings


What’s the model?


We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems and everything that comes with that. I would love to chat about 1 and 2 so feel free to email me (email is in my profile).


> 1. Fine-tuning OSS embedding models on your real-world query patterns

This is not as easy as you make it sound :) Typically, the embeddings are multi-modal: the query string maps to a relevant document that I want to add as context to my prompt. If I collect lots of new query strings, I need to know the ground truth "relevant document" it maps to. Then I can use the two-tower embedding model to learn the "correct" document/context for a query.

I have thought about this problem for LLMs that do function calling. And what you can do is collect query strings and the function calling results, and ask GPT-4 - "is this a 'good' answer?". GPT-4 can be a teacher model for collecting training data for my two-tower embedding model.

Reference: https://www.hopsworks.ai/dictionary/two-tower-embedding-mode...
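
A rough sketch of that teacher idea (the prompt wording and the YES/NO protocol are my own assumptions, not a recipe):

    from openai import OpenAI

    client = OpenAI()

    def judge(query: str, retrieved_doc: str) -> bool:
        """Ask a stronger model whether the retrieved doc actually answers the query."""
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Query: {query}\n\nDocument: {retrieved_doc}\n\n"
                                  "Is this document a good answer to the query? Reply YES or NO."}])
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    # pairs the judge accepts become positives for training the two-tower model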


I think the fact that finetuning embeddings well isn't easy is why it's a more useful service than hosted cosine similarity search ;)



I've been working on (3) embeddings translation with the goal being to translate something like OpenAI embeddings to UAE-Large. So far, I have had success using them for cosine similarity with around a 99.99% validation rate, but only 80% using Euclidean distance.


I’m fascinated by embeddings translations and compatible embeddings with different numbers of dimensions. Can you share more about your work / findings?


I mean, the simplest answer is a matmul... Given embeddings x, y, find M such that Mx ~= y. Easy to train so long as you've got access to both models to compute embeddings over whatever you're interested in...

(easy to extend to two layers mlp as needed. maybe ensure that x and y are zero mean and unit length to make training the matmul a bit easier.)
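
A minimal least-squares sketch of that, assuming you have paired embeddings X and Y computed over the same texts with both models:

    import numpy as np

    def fit_translation(X, Y):
        """X: (n, d1) model-A embeddings, Y: (n, d2) model-B embeddings, same n texts."""
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        M, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solves X @ M ~= Y
        return M

    # translate a new model-A embedding x into model-B space: y_hat = x @ M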


This sounds to me like what https://rungalileo.io is offering


Question, does this require specialized hardware at all? GPUs?


It doesn't require it in theory, but in practice it's required because CPUs are too slow at fine-tuning and computing embeddings.


Are you aware of any service or OSS solution for this?


Hill I am willing to die on:

Peak "$XXXXXXXX" database is when your particular flavor of DB is completely consumed into traditional RDBMSes.

Vector databases (and all other incremental or transformational improvements) are just features of regular plain traditional RDBMSes that have not been implemented in traditional RDBMSes yet.

I have seen every new DB tech subsumed by traditional databases over time as compute capability improved.

No exceptions.

The list is endless:

- object databases (e.g. blobs, JSON)

- OLAP

- in-DB programming (XX-SQL, e.g. PL/SQL, T-SQL, ANSI SQL)

- column-oriented data stores

- key-value

- graph databases

- NoSQL

- Cloud, distributed, whatever

- statistical analysis databases

- document databases

All these used to be standalone, very expensive, specialty products but are now just one more checkbox on the Oracles/SQL-Servers/DB2s of this world.

All these have been swallowed by the borg of commercial databases without so much as a burp.

There is no winning the commercial market long term for these products. Big business buys traditional RDBMSes because they are the kitchen sink. They do EVERYTHING, and they will eventually do this new hot thing too; the business will just have to pay big dollars for it. Which is not a problem for big business.

There is a reason that cartoon about the Oracle org hierarchy was made (bottom right): all the company does is make product (Engineering) and protect that product. And it is very good at making good product.

https://i0.wp.com/stratechery.com/wp-content/uploads/2013/07...


Traditional DBs already kinda support vector search via the pgvector extension and such.

There is a YC startup, Lantern, that also built their own open-source extension for Postgres that is better for vector DB use cases: https://github.com/lanterndata/lantern

But yeah! Traditional DBs already support this, if you consider this extension to be part of Postgres.


Exactly my take, I see no moat here. If there were a way to short the vector DB startup phenomenon and I had the resources I would do it.


Literally. We do a lot of vector DB and RAG stuff (who isn't these days, right?) and after a bunch of testing and benchmarking went with pgvector integrated into our existing PostgreSQL database. Operationally simple, performs perfectly adequately. I'm sure there are some niche use-cases where the dedicated vector DBs make sense, but for anyone just getting into it, don't underestimate PostgreSQL and pgvector.


I got interested in vector search around 2004, read a lot of papers about vector search algorithms and was not really impressed with the tradeoffs involved (it's not the clear win that B-Trees are for 1-d indexing) and wound up using full scans unless I had sparse vectors.

When Pinecone came out and started blogging heavily it seemed that they'd read the same papers I did but came to the conclusion the glass was half full instead of half empty. I could have missed it, but I haven't seen anything in the literature that's a huge improvement over 20 year old algos.

Circa 2014 I worked on a search engine for patents and related literature that made vectors for 20 million + documents and they decided to use full scan and (i) it performed so well (in terms of accuracy) that we sold a license to the USPTO on day two after we put up the demo, and (ii) there were a lot of things about it that were slow like the build system, index building and model training but vector search wasn't one of them.

My YOShInOn RSS reader has about a million documents in 2024 and it uses vectors for classification and clustering. Using vectors for search is a clear extension and I've done some prototyping of searches with full-scan and performance is "good enough" (full scan has 'mechanical sympathy'.) I'd probably stuff my vectors into FAISS if I wanted to do anything more and forget about it.

Sending my vectors to some cloud service so they can pay AWS prices to store them? That's for the birds. I respect Pinecone for being early to the party but I think those who jumped in in 2022 were laggards.


Is your YOShInOn RSS reader available as an app or open-source library?


My experience comes from around the same time frame -- I spent about a year on an aborted spectral dimension reduction project, and I only recently realized how similar the problem still is today.

I'm not sure if that makes me more or less qualified to do vector DB's -- I tend to block out things that I learned a lot about in the past without much result.


Speaking as an author of one of the primary libraries for doing this stuff (faiss): it is not, because it is still an open-ended research problem how approximate high-dimensional dense or sparse nearest neighbor search should work, let alone maximum inner product search, where the research story is even worse, or other non-metric-space similarity measures. All of the current techniques still have quite unacceptable tradeoffs involved.

While traditional database indexing is also still an open-ended research problem (e.g., read amplification/write amplification tradeoffs and the like), it produces exact solutions. That isn't the case at all for vector indexing beyond brute-force search, or exact indexing like k-D/BSP trees which don't work well in high dimensions due to the curse of dimensionality.


Why is the research story for MIPS even worse than for ANN?


There is no good geometry to be exploited, and the query vectors might be (and are usually) distributed quite differently than the indexed vectors.

For Euclidean (L2) distance indexes where the vectors are partitioned based on geometry (e.g., pretty much every indexing type, including cell-probe like IVF, most forms of LSH, or graph based indices), query vectors can be naturally associated geometrically with candidate nearest neighbor vectors, so the distribution of queries doesn't matter as much.

For inner product, it's hard to do much better than spherical clustering (what one would usually do for cosine similarity, which is to project all vectors to the surface of a unit hypersphere, and searching for nearest neighbors via cosine similarity is exactly equivalent to L2 search). But, in general the maximum inner product in the indexed set may lie nowhere near to the projection of the query vector onto the surface of the hypersphere.

The vector with the maximum inner product for a query might be nearly perpendicular to it (e.g., a vector very far out and almost perpendicular) rather than a vector that is parallel to the query but with tiny norm. In two dimensions, an example could be (1, 0) as a query vector, but (1, 10^6) as a database vector (or vice versa). The inner product is 1 but the two vectors are very far apart in Euclidean distance. If you project the vectors to the unit 1-sphere, the query vector is still (1, 0) but the database vector now becomes (1 / sqrt(10^12 + 1), 10^6 / sqrt(10^12 + 1)) ~= (0.000000999..., 0.99999...) (apologies if there's an error here), which would also be in a very different cell if one were using a graph-based or IVF partitioning.
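
A throwaway numpy version of that 2-D example (vectors picked purely for illustration):

    import numpy as np

    q  = np.array([1.0, 0.0])     # query
    v1 = np.array([1.0, 1e6])     # huge norm, nearly perpendicular to q
    v2 = np.array([0.5, 0.0])     # parallel to q, tiny norm

    print(q @ v1, q @ v2)         # 1.0 vs 0.5 -> v1 wins maximum inner product

    unit = lambda v: v / np.linalg.norm(v)
    print(unit(q) @ unit(v1), unit(q) @ unit(v2))   # ~1e-6 vs 1.0 -> v2 wins cosine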

Neural search techniques do show some promise here though (say, using a neural net to predict which vector buckets to look at).


How much of this is due to DNNs (e.g. VAEs but also others) forcing embeddings to distribute in a Gaussianish manner? Is the data intrinsically missing geometry or could a more subtle learning algorithm give a cleaner manifold and therefore more efficiently indexable structure?


Thanks!

What kinds of use cases cause this kind of situation, where the query and indexed vectors are from different distributions?


This is a ridiculous rant. “ oh no! We have choices”. Then you list out every choice available for what is a new space people are exploring and the list is barely a half dozen long? It’s more like this is peak “claiming everything is peak”.


Author here, well yeah, I agree it's probably ridiculous. Sort of testing the waters to see if I'm way off base.

I think what I mean to say is that, in my experience, practitioners and vendors alike are overly focused on "just put embeddings somewhere and do cosine similarity" and that's the only problem to solve. In fact, that's a teeny tiny part of it. Hence "peak vector DB".

So I think the market needs some education that it's harder than that. That part is my rant :). I've spoken / worked on enough problems now to see that disconnect between market and reality.

Though I think "vector DB" is actually a place for capital/brainpower to concentrate to solve these other problems. And I think we'll see the vector DB vendors pivot there. It's just taking a while for the market and investors to see this...


It sounds like you’ve conflated “gold rush” with “peak”. All sorts of novel technologies had mad rushes when they’re new, but that does not mean they have peaked. The dot bomb era with its ridiculous overvalued useless startups was a gold rush, but it was in no way peak Internet.


> practitioners and vendors alike are overly focused on "just put embeddings somewhere and do cosine similarity" and that's the only problem to solve

I agree, and as one who does exactly and only this on the search side, it's also something that falls flat on its face if you don't think a little more about the data and tasks involved.

I wrote about it here[0], but the gist of it for our use case is that if we don't intentionally include what may be considered "less relevant" data then we stand a good chance at failing our main generative task.

[0]: https://phillipcarter.dev/2024/01/15/three-properties-of-dat...


Normally having a lot of choices is a good thing, but here we are facing a dozen vector DBs with very similar features - at the root it's just some version of ANN implemented in C++/Rust/whatever; the "peak" means there's nothing new. People are flooding into this field not because there's something worth inventing, but more out of fear of lagging behind and missing the quick money. That's what I feel about vector DBs in Jan 2024.


Yeah, I'm happy there's a lot of development in this area - even if it's fueled by the LLM frenzy, good nearest neighbor search solutions are useful in a lot of domains. Though I worked a little bit on this problem over 10 years ago (with an application to visual SLAM), and it is a bit amusing to see that a lot of the ideas and even the libraries are still the same!


We've hit peak peak.


Embeddings are good at capturing surface-level information but can't capture implicit/deeper/conclusion-level information. Say you have a collection of 100,000 math problems, and you want to embed them to search for problems whose result is "0". Any number of problems can give this result and it is not explicit in the problem statement. But if you solve the problems you can see the data was in there, just not apparent.

In general you can see the raw text as a simulation premise that will generate inferences when "executed". The inferenced part is like the hidden part of the iceberg, you don't see it but it is there, implicit in the source text. Not just in math, but in all fields.

Embeddings are only good at superficial retrieval. The text needs to be fully analyzed with LLMs before embedding. Thus my conclusion is that we still have a long way to go, we haven't peaked.


What do you mean by fully analyzed? It’s the LLM that does the embedding.


Oh the embedding LLMs are usually lightweight BERT models with few layers and <<1B weights, while LLMs are easily 10-100x larger. The idea is to ingest the text in a LLM to extract the facets you are going to search and add those extra tokens to the original text. Then you do regular RAG.


You want to enrich the base text before embedding? With what kind of prompt?

It’s a pretty simple thing to add to a pipeline. Have you tried?


How?


What are your actionable suggestions?

I am currently testing embeddings/RAG and could use some insight on how to make the results better.


> The text needs to be fully analyzed with LLMs before embedding.

If you happen to know what kinds of questions you will be asking about your RAG index, you should pre-process the texts to add QA pairs. Otherwise you can prompt the LLM to do chain-of-thought inferences based on the source text and add them to the material.
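
A rough sketch of that kind of pre-processing (here `llm` is just an assumed callable that takes a prompt and returns text):

    def enrich_for_embedding(chunk: str, llm) -> str:
        """Prepend LLM-inferred questions and conclusions to the raw chunk before embedding."""
        questions = llm(f"List three questions this text answers:\n\n{chunk}")
        inferences = llm(f"State the non-obvious conclusions implied by this text:\n\n{chunk}")
        return f"{questions}\n{inferences}\n\n{chunk}"

    # then index embed(enrich_for_embedding(chunk, llm)) instead of embed(chunk)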


I guess you log queries to see what is popular and then reprocess texts based on those?


Aside from a feedback loop from usage, is there a way to guess?

I guess you put a whole doc into the LLM and ask what questions it answers?

And then use those question plus a piece of the text and do an embedding?


When I prototype RAG systems I don’t use a “vector database.” I just use a pandas dataframe and I do an apply() with a cosine distance function that is one line of code. I’ve done it with up to 1k rows and it still takes less than a second.
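
For the curious, a minimal version of that (the column name and query variable are whatever you use):

    import numpy as np

    cosine = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    df["score"] = df["embedding"].apply(lambda e: cosine(e, query_vec))
    top_10 = df.nlargest(10, "score")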


This is exactly what I do. No one talks about how many GPUs you need to generate enough embeddings that you need to do something else.

Here's some back of the envelope math. Let's say you are using a 1B parameter LLM to generate the embedding. That's 2B FLOPs per token. Let's assume a modest chunk size, 2K tokens. That's 4 trillion FLOPs for one embedding.

What about the dot product in the cosine similarity? Let's assume an embedding dim of 384. That's 2 * 384 = 768.

So 4 trillion ops for the embedding vs 768 for the cosine similarity. That's a factor of roughly five billion.

So you could have billions of embeddings - brute forced - before the lookup became more expensive than generating the embedding.
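
Spelling the arithmetic out:

    embed_flops = 2 * 1e9 * 2000    # 2 FLOPs/param/token * 1B params * 2K tokens = 4e12
    dot_flops   = 2 * 384           # one dot product at dim 384
    print(embed_flops / dot_flops)  # ~5.2e9 comparisons per embedding generated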

What does that mean at the application level? It means that the time needed to generate millions of embeddings is measured in GPU weeks.

The time needed to lookup an embedding using an approximate nearest neighbors algorithm from millions of embeddings is measured in milliseconds.

The game changed when we switched from word2vec to LLMs to generate embeddings.

A factor of billions is such a big difference that it breaks the assumptions earlier systems were designed under.


This analysis is bad.

The embedding is generated once. Search is done whenever a user inputs a query. The cosine similarity is also not done on a single embedding; it's done on millions or billions of embeddings if you are not using an index. So the actual conclusion is that once you have a billion embeddings, a single search operation costs as much as generating an embedding.

But then, you are not even taking into account the massive cost of keeping all of these embeddings in memory ready to be searched.


I think the context was prototyping.


Prototyping is one scenario I have seen this in. Prototyping is iterative - you experiment with the chunk size, chunk content, data sources, data pipeline, etc. Every change means regenerating the embeddings.

Another one is where the data is sliced based on a key, eg user id, particular document being worked on right now, etc


Everyone is piling on you but I'd love to see what their companies are doing. Cosine similarity and loading a few thousand rows sounds trivial, but most of the enterprise/b2b chat/copilot apps have a relatively small amount of data whose embeddings can fit in RAM. Combine that with natural sharding by customer ID and it turns out vector DBs are much more niche than an RDBMS. I suspect most people reaching for them haven't done the calculus :/


People rushing to slap “AI” on their products don’t really know what they need? Yea that’s absolutely what’s happening now


1k rows isn't really at a point where you need any form of database. Vector or BOW, you can just brute-force the search with such a minuscule amount of data (arguably this should be true into the low millions).

The problem is what happens when you have an additional 6 orders of magnitude of data, and the data itself is significantly larger than the system RAM, which is a very realistic case in a search engine.


1k is not much. My first RAG had over 40K docs (all short, but still...)

The one I'm working on right now has 115K docs (some quite big - I'll likely have to prune the largest 10% just to fit in my RAM).

These are all "small" - for personal use on my local machine. I'm currently RAM limited, otherwise I can think of (personal) use cases that are an order of magnitude larger.

Of course, for all I know, your method may still be as fast on those as on a vector DB.


I must be missing something -- why is the size of the documents a factor? If you embedded a document it would become a vector of ~1k floats, and 115k * 1k floats is a few hundred MB, trivial to fit in modern-day RAM.


Embeddings are a type of lossy compression, so roughly speaking, using more embedding bytes for a document preserves more information about what it contains. Typically documents are broken down into chunks, then the embedding for each chunk is stored, so longer documents are represented by more embeddings.

Going further down the AI == compression path, there’s: http://prize.hutter1.net/


> Embeddings are a type of lossy compression

Always felt they're more like hashes/fingerprints for the RAG use cases.

> Typically documents are broken down into chunks

That's what I would have guessed. It's still surprising that the embeddings don't fit into RAM though.

That said (the following I just realized), even if the embeddings don't fit into RAM at the same time, you really don't need to load them all into RAM if you're just performing a linear scan and doing cosine similarity on each of them. Sure it may be slow to load tens of GB of embedding info... but at this rate I'd be wondering what kind of textual data one could feasibly have that goes into the terabyte range. (Also, generating that many embeddings requires a lot of compute!)


> Always felt they're more like hashes/fingerprints for the RAG use cases.

Yes, I see where you’re coming from. Perceptual hashes[0] are pretty similar, the key is that similar documents should have similar embeddings (unlike cryptographic hashes, where a single bit flip should produce a completely different hash).

Nice embeddings encode information spatially, a classic example of embedding arithmetic is: king - man + woman = queen[1]. “Concept Sliders” is a cool application of this to image generation [2].

Personally I’ve not had _too_ much trouble with running out of RAM due to embeddings themselves, but I did spend a fair amount of time last week profiling memory usage to make sure I didn’t run out in prod, so it is on my mind!

[0] https://en.m.wikipedia.org/wiki/Perceptual_hashing

[1] https://www.technologyreview.com/2015/09/17/166211/king-man-...

[2] https://github.com/rohitgandikota/sliders


Example from OpenAI embedding:

Each vector is 1536 numbers. I don't know how many bits per number, but I'll assume 64 bits (8 bytes). So total size is 1536 * 115K * 8 bytes, which comes to about 1.3 GB.

So yes, not a lot.

I still haven't set it up so I don't know how much space it really will take, but my 40K doc one took 2-3 GB of RAM. It's not a pandas DF but an in-memory DB, so perhaps there's a lot of overhead per row? I haven't debugged.

To be clear, I'm totally fine with your approach if it works. I have very limited time so I was using txtai instead of rolling my own - it's nice to get a RAG up and running in just a few lines of code. But for sure, if the overhead of txtai is really that significant, I'll need to switch to pure pandas.


Even on the production side there is something to be said about just doing things in memory, even over larger datasets. Certainly like all things there is a possible scale issue but I would much rather spin up a dedicated machine with a lot of memory than pay some of the wildly high fees for a Vector DB.

Not sure if others have gone down this path but I have been testing out ways to store vectors to disk in files for later retrieval and then doing everything in memory. For me the tradeoff of a slightly slower response time was worth it compared to the 4-5 figure bill I would be getting from a vector DB otherwise.


True.

Also, you are probably doing it wrong by turning a matrix-vector multiplication into a for loop (over rows). The vectorized version performs much better:

sim = np.vstack(df.col) @ vec
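
If the stored vectors and the query aren't already unit length, normalize first so the matmul gives cosine similarity rather than raw dot products (same assumed dataframe column as above):

    import numpy as np

    A = np.vstack(df.col)                             # (n_rows, dim)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    q = vec / np.linalg.norm(vec)
    sim = A @ q
    top_k = np.argsort(-sim)[:10]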


There is certainly some scale at which a more sophisticated approach is needed. But your method (maybe with something faster than python/pandas) should be the go-to for demonstration and kept until it's determined that the brute force search is the bottleneck.

This issue is prevalent throughout infrastructure projects. Someone decides they need a RAG system and then the team says "let's find a vector db provider!" before they've proven value or understood how much data they have or anything. So they waste a bunch of time and money before they even know if the project is likely to work.

It's just like the old model of setting up a hadoop cluster as a first step to do "big data analytics" on what turns out to be 5GB of data that you could fit in a dataframe or process with awk https://adamdrake.com/command-line-tools-can-be-235x-faster-... (edit: actually currently on the HN front page)

It's a perfect storm of sales-led tooling where leadership is sold something they don't understand, over-engineering, and trying to apply waterfall project management to "AI" projects that have lots of uncertainty and need a de-risking-based project approach where you show that it's likely to work and iterate, instead of building a big foundation first.


> 5GB of data that you could fit in a dataframe or process with awk

These days anything less than 2TB should be done 100% in memory.


What’s your AWS bill like ?


Even up to 1M or so rows you can just store everything in a numpy array or PyTorch tensor and compute similarity directly between your query embedding and the entire database. Will be much faster than the apply() and still feasible to run on a laptop.


You may benefit from polars; it can multi-core better than pandas, and has some of the niceties from Arrow (which was written / championed by the power duo of Wes and Hadley, authors of pandas and the R tidyverse respectively).


I agree pandas or whatever dataframe library you like is ideal for prototyping and exploring, rather than setting up a bunch of infrastructure in a dev environment. Especially if you have labels and are evaluating against a ground truth.

You might be interested in SearchArray which emulates the classic search index side of things in a pandas dataframe column

https://github.com/softwaredoug/searcharray


Thanks for the article, and I definitely agree you are better off starting simple, like a parquet file and faiss, and then testing out options with your data. I say that mainly so you can test chunking strategies, because of how big an effect chunking has on everything downstream, whatever vector DB or BERT path you take -- chunking is a much bigger impact source than most people acknowledge.
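
Even a dumb fixed-size chunker with overlap (sizes here are arbitrary) is enough to see how much chunking moves the downstream numbers:

    def chunk(text: str, size: int = 1000, overlap: int = 200):
        """Fixed-size character chunks with overlap; tune size/overlap per corpus."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]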


I'm expecting to deploy a 6-figure "row count" RAG in the near future... with CTranslate2, matmul-based, at most lightly (like, single digits?) batched, and probably defaulting to CPU because the encoder-decoder part of the RAG process is just way more expensive and the database memory hog along with relatively poor TopK performance isn't worth the GPU.


That's kinda why I use LanceDB. It works on all three OSes, doesn't require large installs, and is quite easy to use. The files are also just Parquet, so no need to deal with SQL.


I mean, you have 1k rows and it is a "prototype".


Think about the number of flops needed for each comparison in brute force search.

You'll realize that it scales well beyond 1k.


use np.dot, takes 1 line


1k rows? Sounds like kindergarten.


up to 100k rows you don't get faster by using a vector store, just use numpy


And often you have tags that filter it down even further.


What RAG systems do you prototype?


You could do it by hand at that scale too


I believe you've reached peak anything when it's been incorporated into PostgreSQL.


pgvector has you covered: https://github.com/pgvector/pgvector


Wow. Back in the day, I had to do cosine similarity indexing with pg-cube. It only did Euclidean distance, so I had to store a separate column with normalized vectors (for unit vectors, ranking by Euclidean distance gives the same order as cosine similarity).


What's cosine similarity, what do you use it for and why is it good to have it into your db instead of somewhere else (like a lib)?


Euclidean distance stops "making sense" as the number of dimensions goes up: https://stats.stackexchange.com/questions/99171/why-is-eucli...

Cosine similarity measures the angle between two vectors instead, and doesn't suffer from the curse of dimensionality.

I guess it's important to have this in your DB so you can make "nearby" queries (give me text that's similar to this other text) in an efficient way.
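
Concretely, for two embedding vectors a and b it's just the normalized dot product:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))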


Cosine similarity suffers from a curse of dimensionality, just as distance does (it’s just one dimension less). The cosine of the angle between two random vectors in N dimensions approaches zero with a power of N. The main reason this metric is useful in practice is because it better relates to how certain neural networks use/train their embeddings internally.


Oh yeah, dimensionality curse wasn't the reason for cosine, it was the NN output.


Exactly. Some other context implied here, these vectors were ML embeddings. In that case, like 100 dimensions that vaguely represented our input data in compact form. There was probably a better solution out there, this was just the most readily available for us.


> In the same way NoSQL forced us to rethink databases.

Did it? After using Mongo in my current job (not my choice), I'd choose Postgres again for my next project.


The thing to know about Mongo is, every database involves design choices that balance ergonomics, performance, and reliability. Every one, except Mongo which, according to their sales team, is the best at everything and has no faults, unless your technical choices are incorrect. In fact I just learned (in a lunch and learn with their team) that when you de-normalize data, inconsistency issues aren't really a problem, and joins are so unusably slow in ALL use cases anyways. Went ahead and just threw my DDIA book in the trash, as they nodded approvingly.


>Real-time recommendations, but driven by vector (and other kinds of) retrieval that looks more like a search engine - not batch computed, nightly jobs common these days.

This is already the case. Recommendations are just a fancy search where the query is a vector representing the user. Whether the learning is batched or not doesn't change the fact that it will use vector search for at least candidate generation.


We are past it, it was months ago :) The points are basically right, but a lot of folks realize all this.


Not yet. There is excitement for vector databases in some specialized areas but it hasn't really filtered out to the wider rank-and-file software engineering circles. You know it will be 'peak vector database' when you'll see blog posts on migrating your relational data to a vector database (with a follow-up 2 years later about moving back to PostgreSQL due to the shitshow that ensued).


I'll add txtai (https://github.com/neuml/txtai) to the list.

There is still plenty of room for innovation in this space. Just need to focus on the right projects that are innovating and not the ones (re)working on problems solved in 2020/2021.


I agree. Honestly even the fundamentals of vector databases aren't really "solved" in the way they are for other databases. Vector indexing, embedding generation, horizontal scaling, etc. can probably still improve a lot. And don't forget, even if Postgres and MySQL are the only traditional databases in town, every tech company had their own SQL database once. Many of them are still around too. No need to get pissy about these companies.


Agreed. For example, here is a post about integrating vector search results with semantic graphs for RAG - https://news.ycombinator.com/item?id=39141420

And here's a post on an alternative way to integrate vectors with traditional databases (Postgres, MySQL) - https://neuml.hashnode.dev/external-database-integration

As others have said in this thread, cosine similarity on arrays of vectors isn't novel. But there are many possibilities past that, many we haven't thought of yet too.


We're at the peak of blog posts listing vector databases.


I think https://vespa.ai/ has the right approach in this space by focusing on being hybrid - vectors alone aren't great for production use cases, it's the combining of vectors+text that lets you use ranking to get meaningful result.

(I'm an investor so I'm biased; but it's also the reason why I invested)


I believe the next step is an "Algolia" of sorts for cosine-similarity search.

Why bother with chunking data, syncing it, and then tagging metadata onto it? DB providers should be smart enough to optimize the chunking strategy for the kind of content being indexed and then provide a simple API endpoint to query against their data.

"RAG in a can".


That's exactly what Vectara is (full disclosure, I work there)


So, there have been kNN plugins for Elasticsearch for some time. Is there really a need for a new search platform?

If there is, what API differentiates it, and why can’t this be expressed in either Elasticsearch or Postgres?


Investors don't really care if it actually creates more value. They only care if the story can attract the public. They just want to profit by taking next investors' money.


Are there no distinguishing features between these vector databases? I'm not familiar with them so I was looking for any comment on that in the article, whether some make different tradeoffs than others, are easier to operate or implement, more scalable, etc. That together with their relative novelty might help explain why there are so many.


The big LLM companies are well positioned to build a lot of what a vector database is used for into their existing APIs and offerings. Both simplifying DX and devops.

Then on the other side existing databases will want to add functionality to be used as vector databases as well.

I think there’s lots of innovation ahead and it’s too soon to know what the end outcome will be.


""“how can so many vector databases need to exist?”.""

Same with languages.

Why so many languages.

Why can't we all get behind a few, do we need more than 6? For every case/problem? Put all our combined resources towards a smaller set.

We need a few DB's, a few languages, a few frameworks. Do we need hundreds?

Like everyone rolls their own everything.


Don't confuse a feature with a product. Postgres works great and you can layer in cosine similarity along with full-text search in a single query if you need to.


Why would you need a vector database when your system response time is dominated by calls to off-prem LLMs? Linear search through a flat file of embeddings. Done.


"We would say Cassandra is a columnar data store, alongside the Scylla or HBase."

Cassandra and Scylla are row based distributed key value stores.


Why not a "vector filesystem" for Linux?

I know it's subjective, but databases have started to feel like running a window manager and desktop on a server.

I feel like software needs to take a step back and rethink itself after years of putting chimps at typewriters searching for Shakespeare.

Why not a Linux kernel with a module(s) to provide the same assurances, SQL operations? Write directly to the filesystem?

Why are all the mathematical concepts that we derive software from packaged into endless conceptual blobs of black-box state?


Is there a good choice available for running inside a browser, client-side? Without a server to create or run inferences


Someday enterprises will actually pay someone for LLM tech and infra. Someday…


Lots of companies are paying OpenAI, so someday is yesterday?

https://www.reuters.com/technology/openai-annualized-revenue...


Most of that is recycled. They don’t break it down because it would make the obvious, obvious. Microsoft pays OpenAI but requires them to use Azure, and OpenAI pays Microsoft the same money back. This is why they continually need billions in investment, because they are far from profitable.

The same principle applies to defense. The US gives Israel and Ukraine tens of billions, but that’s a credit to buy from US defense firms. That money gets recycled right back to US weaponry.


Your logic seems highly flawed. I get what you are saying in the example, yes the government provides weapons which are paid for by the government but produced by defense companies.

But in the Microsoft example it is customers who are paying Microsoft to use OpenAI via Azure. That's a free market of money inflows. Same with all the people using OpenAI directly. Not sure how you would even think of the money being recycled in this scenario. Yes, of course there is some back-scratching in the sense that Microsoft invested in OpenAI with a large portion of that investment in Azure credits, which makes the investment quite nice from MSFT's side, but there is still real demand for Azure services to use the OpenAI APIs.


For big enterprises it is either included in existing licenses (i.e. Microsoft Word has ChatGPT embedded) or pilots. No one is cutting massive checks to Microsoft specifically for OpenAI services.

This is obvious, but if you need some journalist to validate what is already logically clear:

https://www.wsj.com/tech/ai/ais-costly-buildup-could-make-ea...

https://www.wsj.com/tech/ai/ai-deals-microsoft-google-amazon...


I can see how it's easy to get confused in this area, but there are indeed large checks getting written for using services like OpenAI or Anthropic.

You are really conflating too many things at once.

1) Yes, big tech is having a hard time monetizing their bespoke AI tooling within their own ecosystem.

2) Yes, big tech has made investments in the AI space where they are providing a portion of that funding as credits to use in their cloud offerings.

3) Here is where you are incorrect though. Companies are writing large checks for the raw compute/access to AI models. It is true across the spectrum of Azure OpenAI, OpenAI directly, AWS Bedrock etc, there are a lot of companies both big and small using these services heavily. To think otherwise is naive.


The investment is massive, but real tangible products that have purchasers for sustainable contracts are minuscule. We are in the experimental phase and hype cycle, and the trough of despair is next. I do think real products will come from this, but the actual productivity enhancements at the scale necessary to justify the investment have not materialized.


That's rumors from the clickbait subscription site The Information, without much proof.


To your point, the market for vector db solutions feels very undifferentiated. I am genuinely curious -- what are the types of ANN use-cases that truly require XXms lookup latency, XXX QPS, and capacity for billions of documents?


I don't know, let's ask an AI about it :)


As someone who has been using pgvector for a while and is vaguely curious about alternatives without having the bandwidth to investigate -- is there anything out there that offers truly differentiated advantages over pgvector? I'm extremely wary of non-OSS solutions in this area, it seems ripe for enshittification and attempts at vendor lock-in.


I use PgVector myself but here's the advantages to a true vector db.

- Vectors are massive data-wise. In our current production database they take up 95% of the memory - should they be stored separately?

- Better support for easily re-embedding, hybrid search, certain RAG workflows

- Stronger performance once you're dealing with millions of vectors.

I would still stick with PgVector until you're dealing with non-trivial scale.


I'd also start with pgvector (it's easy to switch), but the limitations around hybrid search and filtering + ANN are real and if you're doing any kind of RAG-like thing it's worth being aware of them upfront. pgvector is also an open-source project with way less manpower behind it than a bunch of venture-backed companies, so while you can expect it to pick up important features, it takes much longer (support for HNSW indices was a good example).


What is taking the most time at scale? Is this ingest, index build or lookups ?


ingest and index build can take time


What volumes are we talking about?

There are ways to speed things up dramatically. Index build just became multithreaded (see above).

We have ideas on what to do with ingest.

Also, do you ingest from S3?


np.dot is also multi-threaded, based on BLAS


If you're still in the "millions of documents" scale range, then PostgreSQL on a beefy EPYC can probably handle everything fast enough so that it doesn't make sense to spend engineering time on using a vector db which would only shave off a few ms in latency.


No


Nope


There are none that run on the edge, yet. A few more miles to go before we "peak".


Hmmm. Does pgvector count? That's supported by NeonDB, serverless compute on PostgreSQL.

https://neon.tech/docs/extensions/pgvector


(Neon CEO) It’s about to get a lot better too. Pgvector now supports multi-threaded build

https://github.com/pgvector/pgvector/issues/409#issuecomment...


Another very significant contribution to the pg ecosystem. You guys are awesome, thank you for everything you're doing.


Lol rare collaboration between neon, AWS, and supabase.

But if Postgres wins we all win!


By "edge", I was talking about mobile / IOT devices.

The closest I can see is the VSS extension[1] for Sqlite.

[1]: https://github.com/asg017/sqlite-vss


Incorrect, most of the libraries can run on edge, they're just C++.


What is the use case for this?


Running machine learning on device.

Context: I'm working on an e2ee alternative to Google Photos[1] where we have to cluster embeddings (for face recognition) and run similarity searches (for semantic search[2]) on device.

[1]: https://ente.io

[2]: https://openai.com/research/clip


hnswlib?


From a cursory glance, usearch[1] seems more portable.

[1]: https://github.com/unum-cloud/usearch


Neat!



