
When I prototype RAG systems I don’t use a “vector database.” I just use a pandas dataframe and I do an apply() with a cosine distance function that is one line of code. I’ve done it with up to 1k rows and it still takes less than a second.
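
Roughly, a minimal sketch of that pattern (the embed() stub, column names, and vector dimension below are just placeholders, not anything specific to my setup):

import numpy as np
import pandas as pd

# Stand-in embedder for illustration; swap in a real model (e.g. a sentence-transformers encoder).
def embed(text, dim=384):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = ["first document chunk", "second document chunk", "third document chunk"]
df = pd.DataFrame({"text": chunks})
df["embedding"] = df["text"].apply(embed)

query_vec = embed("some query")
df["score"] = df["embedding"].apply(lambda v: cosine_sim(v, query_vec))
print(df.nlargest(2, "score")[["text", "score"]])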



This is exactly what I do. No one talks about how much GPU time it takes to generate enough embeddings that you actually need to do something else.

Here's some back of the envelope math. Let's say you are using a 1B parameter LLM to generate the embedding. That's 2B FLOPs per token. Let's assume a modest chunk size, 2K tokens. That's 4 trillion FLOPs for one embedding.

What about the dot product in the cosine similarity? Let's assume an embedding dim of 384. That's 2 * 384 = 768 FLOPs.

So 4 trillion ops for the embedding vs 768 for the cosine similarity. That's a factor of roughly five billion.

So you could have a billion embeddings - brute forced - before the lookup became more expensive than generating the embedding.
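
Spelling that arithmetic out (these are the assumptions above, not measurements):

params = 1e9                      # 1B parameter embedding model
flops_per_token = 2 * params      # ~2 FLOPs per parameter per token
chunk_tokens = 2_000
embed_flops = flops_per_token * chunk_tokens   # ~4e12 FLOPs per chunk embedding
dim = 384
dot_flops = 2 * dim                            # ~768 FLOPs per cosine comparison
print(embed_flops / dot_flops)                 # ~5.2e9, i.e. billions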

What does that mean at the application level? It means that the time needed to generate millions of embeddings is measured in GPU weeks.

The time needed to lookup an embedding using an approximate nearest neighbors algorithm from millions of embeddings is measured in milliseconds.

The game changed when we switched from word2vec to LLMs to generate embeddings.

A factor of billions is such a big difference that it breaks the assumptions earlier systems were designed under.


This analysis is bad.

The embedding is generated once. Search is done whenever a user inputs a query. The cosine similarity is also not done on a single embedding; it's done on millions or billions of embeddings if you are not using an index. So the actual conclusion is that once you have a billion embeddings, a single search operation costs as much as generating an embedding.

But then, you are not even taking into account the massive cost of keeping all of these embeddings in memory ready to be searched.


I think the context was prototyping.


Prototyping is one scenario I have seen this in. Prototyping is iterative - you experiment with the chunk size, chunk content, data sources, data pipeline, etc. Every change means regenerating the embeddings.

Another one is where the data is sliced based on a key, e.g. user ID, the particular document being worked on right now, etc.


Everyone is piling on you, but I'd love to see what their companies are doing. Cosine similarity and loading a few thousand rows sounds trivial, but most of the enterprise/B2B chat/copilot apps have a relatively small amount of data whose embeddings can fit in RAM. Combine that with natural sharding by customer ID and it turns out vector DBs are much more niche than an RDBMS. I suspect most people reaching for them haven't done the calculus :/


People rushing to slap “AI” on their products don’t really know what they need? Yeah, that’s absolutely what’s happening now.


1k rows isn't really at a point where you need any form of database. Vector or BOW, you can just brute-force the search with such a minuscule amount of data (arguably this should be true into the low millions).

The problem is what happens when you have an additional 6 orders of magnitude of data, and the data itself is significantly larger than the system RAM, which is a very realistic case in a search engine.


1k is not much. My first RAG had over 40K docs (all short, but still...)

The one I'm working on right now has 115K docs (some quite big - I'll likely have to prune the largest 10% just to fit in my RAM).

These are all "small" - for personal use on my local machine. I'm currently RAM limited, otherwise I can think of (personal) use cases that are an order of magnitude larger.

Of course, for all I know, your method may still be as fast on those as on a vector DB.


I must be missing something -- why is the size of the documents a factor? If you embedded a document it would become a vector of ~1k floats, and 115k * 1k floats is a few hundred MB, trivial to fit in modern-day RAM.


Embeddings are a type of lossy compression, so roughly speaking, using more embedding bytes for a document preserves more information about what it contains. Typically documents are broken down into chunks, then the embedding for each chunk is stored, so longer documents are represented by more embeddings.
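
Roughly what the chunking step looks like (chunk size and overlap here are arbitrary, and real pipelines usually split on tokens or sentences rather than characters):

def chunk_text(text, size=2000, overlap=200):
    # Naive character-based chunking; each chunk gets its own embedding.
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

# A long document therefore maps to many embeddings, not one.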

Going further down the AI == compression path, there’s: http://prize.hutter1.net/


> Embeddings are a type of lossy compression

Always felt they're more like hashes/fingerprints for the RAG use cases.

> Typically documents are broken down into chunks

That's what I would have guessed. It's still surprising that the embeddings don't fit into RAM though.

That said (the following I just realized), even if the embeddings don't fit into RAM at the same time, you really don't need to load them all into RAM if you're just performing a linear scan and doing cosine similarity on each of them. Sure, it may be slow to load tens of GB of embedding info... but at that point I'd be wondering what kind of textual data one could feasibly have that goes into the terabyte range. (Also, generating that many embeddings requires a lot of compute!)
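
If the embeddings really did outgrow RAM, one way to do that linear scan without holding everything at once is a memory-mapped file. A rough sketch, assuming the vectors were saved row-major as float32 in a flat binary file (the file name and dimension are made up):

import numpy as np

dim = 1536
emb = np.memmap("embeddings.f32", dtype=np.float32, mode="r").reshape(-1, dim)

def top_k(query, k=5, batch=100_000):
    q = query / np.linalg.norm(query)
    scores = np.empty(emb.shape[0], dtype=np.float32)
    for start in range(0, emb.shape[0], batch):
        # Pull one batch of rows into RAM, score it, move on.
        block = np.asarray(emb[start:start + batch])
        scores[start:start + len(block)] = block @ q / np.linalg.norm(block, axis=1)
    return np.argsort(scores)[-k:][::-1]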


> Always felt they're more like hashes/fingerprints for the RAG use cases.

Yes, I see where you’re coming from. Perceptual hashes[0] are pretty similar, the key is that similar documents should have similar embeddings (unlike cryptographic hashes, where a single bit flip should produce a completely different hash).

Nice embeddings encode information spatially, a classic example of embedding arithmetic is: king - man + woman = queen[1]. “Concept Sliders” is a cool application of this to image generation [2].

Personally I’ve not had _too_ much trouble with running out of RAM due to embeddings themselves, but I did spend a fair amount of time last week profiling memory usage to make sure I didn’t run out in prod, so it is on my mind!

[0] https://en.m.wikipedia.org/wiki/Perceptual_hashing

[1] https://www.technologyreview.com/2015/09/17/166211/king-man-...

[2] https://github.com/rohitgandikota/sliders


Example from OpenAI embedding:

Each vector is 1536 numbers. I don't know how many bits per number, but I'll assume 64 bits (8 bytes). So the total size is 1536 * 115K * 8 bytes / 1024^3, which gives about 1.3 GB.

So yes, not a lot.

I still haven't set it up so I don't know how much space it really will take, but my 40K doc one took 2-3 GB of RAM. It's not a pandas DF but an in-memory DB, so perhaps there's a lot of overhead per row? I haven't debugged it.

To be clear, I'm totally fine with your approach if it works. I have very limited time so I was using txtai instead of rolling my own - it's nice to get a RAG up and running in just a few lines of code. But for sure, if the overhead of txtai is really that significant, I'll need to switch to pure pandas.


Even on the production side there is something to be said for just doing things in memory, even over larger datasets. Certainly, like all things, there is a possible scale issue, but I would much rather spin up a dedicated machine with a lot of memory than pay some of the wildly high fees for a vector DB.

Not sure if others have gone down this path, but I have been testing out ways to store vectors to disk in files for later retrieval and then doing everything in memory. For me the tradeoff of a slightly slower response time was worth it compared to the 4-5 figure bill I would be getting from a vector DB otherwise.
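
Not claiming this is exactly that setup, but the basic shape of "persist to plain files, search in memory" can be as simple as the following (sizes, dimensions, and file names are placeholders):

import numpy as np

# Placeholder data; in practice these come from your embedding pipeline.
ids = np.arange(100_000)
vectors = np.random.default_rng(0).standard_normal((100_000, 768)).astype(np.float32)

# Persist once after embedding.
np.save("vectors.npy", vectors)
np.save("ids.npy", ids)

# At query time: load into RAM, normalize, brute-force.
vecs = np.load("vectors.npy")
ids = np.load("ids.npy")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

def search(query_vec, k=10):
    scores = vecs @ (query_vec / np.linalg.norm(query_vec))
    top = np.argpartition(scores, -k)[-k:]
    return ids[top[np.argsort(scores[top])[::-1]]]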


True.

Also, you are probably doing it wrong by turning what should be a single matrix multiplication into a for loop over rows. The vectorized version performs much better:

import numpy as np
sim = np.vstack(df.col) @ vec  # stack the embedding column into a matrix; one matmul scores every row


There is certainly some scale at which a more sophisticated approach is needed. But your method (maybe with something faster than python/pandas) should be the go-to for demonstration and kept until it's determined that the brute force search is the bottleneck.

This issue is prevalent throughout infrastructure projects. Someone decides they need a RAG system and then the team says "let's find a vector db provider!" before they've proven value or understood how much data they have or anything. So they waste a bunch of time and money before they even know if the project is likely to work.

It's just like the old model of setting up a hadoop cluster as a first step to do "big data analytics" on what turns out to be 5GB of data that you could fit in a dataframe or process with awk https://adamdrake.com/command-line-tools-can-be-235x-faster-... (edit: actually currently on the HN front page)

It's a perfect storm of sales-led tooling where leadership is sold something they don't understand, over-engineering, and trying to apply waterfall project management to "AI" projects that have lots of uncertainty and need a de-risking-based approach where you show that it's likely to work and iterate, instead of building a big foundation first.


> 5GB of data that you could fit in a dataframe or process with awk

These days anything less than 2TB should be done 100% in memory.


What’s your AWS bill like?


Even up to 1M or so rows you can just store everything in a numpy array or PyTorch tensor and compute similarity directly between your query embedding and the entire database. Will be much faster than the apply() and still feasible to run on a laptop.
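
For example, with PyTorch (random placeholder data; the dimensions and k are arbitrary):

import torch
import torch.nn.functional as F

db = F.normalize(torch.randn(1_000_000, 384), dim=1)   # placeholder database of embeddings
query = F.normalize(torch.randn(384), dim=0)

scores = db @ query                          # one matrix-vector product scores every row
values, indices = torch.topk(scores, k=10)   # best matches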


You may benefit from polars; it can multi-core better than pandas, and has some of the niceties from Arrow (which was written / championed by the power duo of Wes and Hadley, authors of pandas and the R tidyverse, respectively).


I agree pandas or whatever data frame library you like is ideal for prototyping and exploring, rather than setting up a bunch of infrastructure in a dev environment. Especially if you have labels and are evaluating against a ground truth.

You might be interested in SearchArray which emulates the classic search index side of things in a pandas dataframe column

https://github.com/softwaredoug/searcharray


Thanks for the article. I definitely agree you are better off starting simple, like a Parquet file and FAISS, and then testing out options with your data. I say that mainly to test chunking strategies, because of how big an effect chunking has on everything downstream, whatever vector DB or BERT path you take -- chunking has a much bigger impact than most people acknowledge.


I'm expecting to deploy a 6-figure "row count" RAG in the near future... with CTranslate2, matmul-based, at most lightly batched (like, single digits?), and probably defaulting to CPU, because the encoder-decoder part of the RAG process is just way more expensive, and the database memory hog along with relatively poor TopK performance isn't worth the GPU.


That's kinda why I use LanceDB. It works on all three OSes, doesn't require large installs, and is quite easy to use. The files are also just Parquet, so no need to deal with SQL.


I mean, you have 1k rows and it is a "prototype".


Think about the number of flops needed for each comparison in brute force search.

You'll realize that it scales well beyond 1k.


use np.dot, takes 1 line


1k rows? Sounds like kindergarten.


Up to 100k rows you don't get faster by using a vector store; just use numpy.


And often you have tags that filter it down even further.


What RAG systems do you prototype?


You could do it by hand at that scale too



