Shameless plug from someone not related to the project. Try https://vespa.ai , fully open-source, very mature hybrid search with dense and approximate vector search. A breeze to deploy and maintain compared to ES and Solr. If I could name a single secret ingredient for my startup, it would be Vespa.
Vespa looks pretty compelling; indexing looks like a dream.
I'd recommend basically anything else over a customized ES / Solr cluster. Some of the least fun clusters to manage. Great for simple use-cases / anything you see in a tutorial. The moment you walk off the beaten path with them, best of luck.
Solr has its quirks for sure, but I've seen multi-terabyte indices running with great relevance and performance. I would call it a mechanic's search engine: very powerful, but you need to get your hands dirty.
Super promising, thanks. Will definitely watch for production readiness. This direction is a big deal for us: we find most data is cold (logs, docs, ...), not hot, and this finally starts looking more relevant than just running ES or rolling our own.
Dense vector search in Solr is a welcome addition, but getting started requires a lot of pieces that aren’t included.
So I made this a couple of months ago to make it super easy to get started with this tech. If you have a sitemap, you can start the docker compose setup and index your website with a single command.
Question: I have ~10,000 128 element query vectors, and want to find the nearest neighbor (cosine similarity) for each of them in a dataset of ~1,000,000 target vectors. I can do this using brute force search on a GPU in a few minutes, which is fast but still a serious bottleneck for me. Is this an appropriate size of dataset and task for acceleration with some sort of vector database or algorithm more intelligent than brute force search?
Use an approximate method like faiss and then do cosine similarity on the results of that.
Short answer is most of these databases use some type of precomputation to make approximate nearest neighbor search faster. HNSW[0], FAISS[1], SCANN[2], etc. are all methods of doing approximate nearest neighbors, but they use different techniques to speed up that approximation. For your use case it will likely result in a speedup.
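For a concrete sense of what that looks like, here's a minimal faiss sketch: an HNSW index over inner product on L2-normalized vectors, so scores are cosine similarities. The M and efSearch values are just illustrative, not tuned recommendations.

    import numpy as np
    import faiss

    d = 128
    targets = np.random.rand(1_000_000, d).astype('float32')  # your 1M target vectors
    queries = np.random.rand(10_000, d).astype('float32')     # your 10k query vectors

    # normalize in place so inner product == cosine similarity
    faiss.normalize_L2(targets)
    faiss.normalize_L2(queries)

    index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
    index.add(targets)
    index.hnsw.efSearch = 64                # higher = better recall, slower queries

    scores, ids = index.search(queries, 1)  # approximate top-1 neighbor per query

Build time is the main cost; once the graph is built, the 10k queries should run in seconds rather than minutes.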
There's no one answer to this, but I'd say that anything past 10k vectors would benefit greatly from a vector database. A vector DB will abstract away the building of a vector index along with other core database features such as caching, failover, replication, horizontal scaling, etc. Milvus (https://milvus.io) is open-source and always my go-to choice for this (disclaimer: I'm a part of the Milvus community). An added bonus of Milvus is that it supports GPU-accelerated indexing and search in addition to batched queries and inserts.
All of this assumes you're okay with a bit of imprecision - vector search with modern indexes is inherently probabilistic, e.g. your recall may not be 100%, but it will be close. Using a flat indexing strategy is still an option, but you lose a lot of the speedup that comes with a vector database.
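For reference, a rough pymilvus 2.x-style sketch of that flow. The collection name, field names, and index parameters below are illustrative assumptions, not a canonical setup; check the Milvus docs for current APIs.

    from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

    connections.connect(host="localhost", port="19530")

    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
    ]
    targets = Collection("targets", CollectionSchema(fields))

    targets.insert([vectors])  # vectors: list of 128-d float lists, batched in practice
    targets.create_index("embedding", {
        "index_type": "HNSW",
        "metric_type": "IP",   # inner product on normalized vectors == cosine
        "params": {"M": 16, "efConstruction": 200},
    })
    targets.load()

    hits = targets.search(query_vectors, "embedding",
                          {"metric_type": "IP", "params": {"ef": 64}}, limit=1)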
I hesitate to mention this, because you probably know it and are doing it this way. But some other poster mentioned "and then do cosine similarity". In this case, you're going to want to preprocess and normalize each row of both matrices to have unit norm. Then cosine similarity is simply a matrix multiply between the two matrices (one transposed), and a pass over the results to find the top-k per query using a max queue type data structure.
That wouldn't really help. Let me explain in a bit more detail. The results depend on the query matrix, which will be different for each set of queries. We have a query matrix Q of dimension 10,000x128, and another vector matrix A that is 1,000,000x128. We preprocess both Q and A so each row has unit norm:
Q[i,:] /= norm(Q[i,:])
A[k,:] /= norm(A[k,:])
So now with that preprocessing the cosine similarity of a given row i of Q and k of A is:
cossim(i,k) = dot(Q[i,:], A[k,:])
If you multiply Q x A.T, i.e. (10,000 x 128) x (128 x 1,000,000), you get a result matrix (10,000 x 1,000,000) with all the cosine similarity values for each combination of query and vector.
If you make a pass across each row with a bounded priority queue of size n, you can find the top-n cosine similarity values for each query in O(1,000,000 x log n) time.
Now you could store the resulting matrix, but Q is going to change for each call, and we really only care about the top-n values for each query, so storing it wouldn't really accomplish anything.
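A small numpy sketch of the above, using argpartition per row instead of a priority queue, and chunking the queries so the full 10,000 x 1,000,000 matrix is never materialized at once (chunk size and top_n are arbitrary here):

    import numpy as np

    Q = np.random.rand(10_000, 128).astype(np.float32)     # query matrix
    A = np.random.rand(1_000_000, 128).astype(np.float32)  # target matrix

    # normalize rows to unit norm so dot product == cosine similarity
    Q /= np.linalg.norm(Q, axis=1, keepdims=True)
    A /= np.linalg.norm(A, axis=1, keepdims=True)

    top_n = 10
    top_ids = []
    for start in range(0, Q.shape[0], 1000):
        sims = Q[start:start + 1000] @ A.T                       # (chunk, 1M) cosine similarities
        idx = np.argpartition(-sims, top_n, axis=1)[:, :top_n]   # top-n per row, unsorted
        top_ids.append(idx)
    top_ids = np.vstack(top_ids)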
I asked GPT-3 about it, using an array of vectors for fragments of this page, weighted by relevance to the query (using np.dot(v1, v2)). That's used to build the prompt for submission to the OpenAI APIs. I'm interested in storing these vectors in a very fast DB for memories.
pastel-mature-herring~> Could this matrix be compressed to binary form for storage in a binary index?
angelic-quokka|> It is possible to compress the matrix to binary form for storage in a binary index, but this would likely decrease the accuracy of the cosine similarity values.
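Roughly what that prompt-building step looks like, in case it helps anyone. embed() here is a hypothetical placeholder for whatever embedding call you use (e.g. the OpenAI embeddings endpoint):

    import numpy as np

    def build_prompt(question, fragments, embed):
        # embed() is whatever embedding call you use -- hypothetical placeholder
        fragment_vecs = np.array([embed(f) for f in fragments])
        query_vec = np.array(embed(question))

        # np.dot(v1, v2) as the relevance weight, as described above
        scores = fragment_vecs @ query_vec
        best = [fragments[i] for i in np.argsort(-scores)[:5]]

        return "\n".join(best) + "\n\nQ: " + question + "\nA:"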
Agreed with fzliu, you can also use https://weaviate.io (disclaimer, I'm affiliated with Weaviate). You might also like this article which describes why one might want to use a vector search engine: https://db-engines.com/en/blog_post/87
You can fit 1M vectors in the free tier of www.pinecone.io if you want to experiment. I'm not sure how fast having that many query vectors would be. (I'm a happy Pinecone customer, but only use a single query vector.)
I just tried this using my brute force embedding search library that runs on CPUs and it does it in 171s (16-core AMD). How often do you need to do this? This may be shorter than the initial indexing time for most approximate libraries if you aren't doing ad hoc queries.
It was to be expected after the recent ES releases. However, dedicated vector search engines offer better performance and more advanced features. Qdrant https://github.com/qdrant/qdrant is written in Rust. Fast, stable, and super easy to deploy. (Disclaimer: affiliated with the project.)
I don't think that there's much preventing Elasticsearch or Solr from being as fast. Vector search is expensive, so combining it with other filtering and querying makes a lot of sense. It's just another tool for search engineers to use. Using vector search exclusively would probably come at the cost of search quality and performance.
Exactly, filtering is vital. That is why Qdrant offers extended filtering that happens at the same stage as retrieval, rather than as pre- or post-filtering, which requires two stages and hurts both performance and quality.
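A rough qdrant-client sketch of filtered vector search, where the filter condition is applied during retrieval rather than afterwards. The collection name and payload field are made up for illustration:

    from qdrant_client import QdrantClient
    from qdrant_client.models import Filter, FieldCondition, MatchValue

    client = QdrantClient(host="localhost", port=6333)

    hits = client.search(
        collection_name="docs",
        query_vector=[0.1] * 128,  # your query embedding
        query_filter=Filter(must=[
            FieldCondition(key="lang", match=MatchValue(value="en")),
        ]),
        limit=10,
    )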
It is not trivial to implement vector search functionality inside a complex framework that is not initially designed for vectors. Besides that, vector search is not only about semantic search.
A much-awaited enhancement. Saves the trouble of having to deploy a separate vector DB like Milvus.
I don't like the query syntax though. Maybe a more developer-friendly indexing+query flow is possible: vectorize fields and queries transparently using a lib like DL4J running in the same JVM. That could further simplify both app development and deployment.