
Question: I have ~10,000 128 element query vectors, and want to find the nearest neighbor (cosine similarity) for each of them in a dataset of ~1,000,000 target vectors. I can do this using brute force search on a GPU in a few minutes, which is fast but still a serious bottleneck for me. Is this an appropriate size of dataset and task for acceleration with some sort of vector database or algorithm more intelligent than brute force search?



Use an approximate method like FAISS, and then do exact cosine similarity on the results of that.

Short answer: most of these databases use some type of precomputation to make approximate nearest neighbor search faster. HNSW[0], FAISS[1], ScaNN[2], etc. are all methods for approximate nearest neighbors, but they use different techniques to speed up that approximation. For your use case it will likely result in a speedup.

[0] https://www.pinecone.io/learn/hnsw/

[1] https://engineering.fb.com/2017/03/29/data-infrastructure/fa...

[2] https://ai.googleblog.com/2020/07/announcing-scann-efficient...


There's no one answer to this, but I'd say that anything past 10k vectors would benefit greatly from a vector database. A vector DB will abstract away the building of a vector index along with other core database features such as caching, failover, replication, horizontal scaling, etc. Milvus (https://milvus.io) is open-source and always my go-to choice for this (disclaimer: I'm a part of the Milvus community). An added bonus of Milvus is that it supports GPU-accelerated indexing and search in addition to batched queries and inserts.

All of this assumes you're okay with a bit of imprecision - vector search with modern indexes is inherently probabilistic, e.g. your recall may not be 100%, but it will be close. Using a flat indexing strategy is still an option, but you lose a lot of the speedup that comes with a vector database.


I hesitate to mention this, because you probably know it and are doing it this way. But some other poster mentioned "and then do cosine similarity". In this case, you're going to want to preprocess and normalize each row of both matrices to have unit norm. Then cosine similarity is simply a matrix multiply between the two matrices (one transposed), and a pass over the results to find the top-k per query using a max queue type data structure.
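A small numpy sketch of that recipe (my own illustration, with sizes shrunk so it runs quickly; I use argpartition rather than an explicit max queue, which accomplishes the same top-k selection):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((200, 128))    # queries (stand-in for 10,000)
A = rng.standard_normal((5_000, 128))  # targets (stand-in for 1,000,000)

# Preprocess: unit-normalize every row once.
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
A /= np.linalg.norm(A, axis=1, keepdims=True)

# One matrix multiply gives all pairwise cosine similarities: (200, 5000).
sims = Q @ A.T

# Top-k per query without a full sort: argpartition is O(n) per row.
k = 5
topk = np.argpartition(-sims, k, axis=1)[:, :k]
# Order those k hits by similarity, best first.
order = np.argsort(-np.take_along_axis(sims, topk, axis=1), axis=1)
topk = np.take_along_axis(topk, order, axis=1)
```

At the full scale (10,000 x 1M) the similarity matrix would be large, so in practice you'd process the queries in batches and keep only the top-k per batch.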


Could this matrix be compressed to binary form for storage in a binary index?


That wouldn't really help. Let me explain in a bit more detail. The result depends on the query matrix, which will be different for each set of queries. We have a query matrix Q of dimension 10,000x128, and another vector matrix A that is 1,000,000x128. We preprocess both Q and A so each row has unit norm:

    Q[i,:] /= norm(Q[i,:])
    A[k,:] /= norm(A[k,:])
So with that preprocessing, the cosine similarity of row i of Q and row k of A is:

    cossim(i,k) = dot(Q[i,:], A[k,:])
If you multiply QxA.T (10,000 x 128)x(128 x 1M) you get a result matrix (10,000 x 1M) with all the cosine similarity values for each combination of query and vector.

If you make a pass across each row with a priority queue of size n, you can find the top-n cosine similarity values for that query in O(1,000,000 x log n) time.

Now you could store the resulting matrix, but Q is going to change for each call, and we really only care about the top-n values for each query, so storing it wouldn't really accomplish anything.
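For reference, that priority-queue pass over one query's row of similarities might look like this (a sketch of my own; heapq keeps a min-heap of the n best seen so far, so each of the 1M entries costs at most O(log n)):

```python
import heapq

import numpy as np

def top_n(row, n):
    """Return the top-n (similarity, index) pairs from one query's row, best first."""
    heap = []  # min-heap of size <= n; weakest of the current best sits on top
    for idx, sim in enumerate(row):
        if len(heap) < n:
            heapq.heappush(heap, (sim, idx))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, idx))  # evict the weakest, insert the new hit
    return sorted(heap, reverse=True)

rng = np.random.default_rng(0)
row = rng.random(100_000)  # stand-in for one row of the (10,000 x 1M) result matrix
best = top_n(row, 5)
```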

Edited: fixed lots of typos


I asked GPT-3 about it: I embed fragments of this page as vectors, weight them by relevance to the query (using np.dot(v1, v2)), and use the best ones to build the prompt for submission to the OpenAI APIs. I'm interested in storing these vectors in a very fast DB for memories.

pastel-mature-herring~> Could this matrix be compressed to binary form for storage in a binary index?

angelic-quokka|> It is possible to compress the matrix to binary form for storage in a binary index, but this would likely decrease the accuracy of the cosine similarity values.
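For a sense of that trade-off, one common compression is sign-binarization: keep only the sign bit of each dimension and compare vectors by Hamming distance instead of cosine similarity. A rough numpy sketch (my own illustration, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.standard_normal((1000, 128))
query = rng.standard_normal(128)

# 128 floats -> 128 bits (16 bytes) per vector, via the sign of each dimension.
bits = np.packbits(vecs > 0, axis=1)   # shape (1000, 16), dtype uint8
qbits = np.packbits(query > 0)         # shape (16,)

# Hamming distance = number of differing bits; a crude proxy for cosine distance.
hamming = np.unpackbits(bits ^ qbits, axis=1).sum(axis=1)
nearest = int(np.argmin(hamming))
```

This cuts storage 32x and makes comparisons cheap bitwise ops, but as the reply says, the ranking only approximates the true cosine ordering, so it's typically used as a first-pass filter before exact re-scoring.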


Agreed with fzliu; you can also use https://weaviate.io (disclaimer: I'm affiliated with Weaviate). You might also like this article, which describes why one might want to use a vector search engine: https://db-engines.com/en/blog_post/87


I'll look into weaviate.


Solr uses Lucene's approximate nearest neighbor (ANN) implementation.

This site has some nice information on how ANN performs for vector search.

http://ann-benchmarks.com/


There's a more relevant benchmark of vector search engines end-to-end, not just the algorithms: https://qdrant.tech/benchmarks/


Thanks!


You can fit 1M vectors in the free tier of www.pinecone.io if you want to experiment. I'm not sure how fast having that many query vectors would be. (I'm a happy Pinecone customer, but only use a single query vector.)


Huh, at one point you could have multiple queries, but it looks like that is deprecated now.

https://www.pinecone.io/docs/api/operation/query/

So maybe it wouldn't work for your use case.


What about Annoy? https://github.com/spotify/annoy I used it in the past. It probably has its limitations, but it worked great for me.


I just tried this using my brute force embedding search library that runs on CPUs and it does it in 171s (16-core AMD). How often do you need to do this? This may be shorter than the initial indexing time for most approximate libraries if you aren't doing ad hoc queries.

https://github.com/spullara/bfes-java



