Vald: A Highly Scalable Distributed Vector Search Engine (github.com/vdaas)
102 points by ngaut on April 8, 2021 | 35 comments



I’m happy a lot of work is being done in this space. Lucene-based search has long dominated, and search solutions like this, Vespa, Jina, etc. are pushing the open-source search space forward and, at a minimum, forcing Lucene to grow. In particular:

- Native, first-class support for dense vectors, so there's no need to hack around the inverted index as in this talk by Simon Hughes: https://haystackconf.com/2019/vectors/

- Using native programming languages, avoiding the JVM and all the memory and GC overhead

- Native languages will hopefully make it easier to deploy code that uses GPUs!

- A chance to focus on info retrieval when current Lucene solutions are either a bit crufty (Solr) or very focused on logs/analytics (Elasticsearch)

- A chance to have one system that unifies search and recommendations in one backend, with no need to split them between two systems

- A chance to rethink the language used to specify query ranking factors. Solr has a unique DSL that’s essentially write-only; Elasticsearch is very verbose for anything non-trivial.

Still, I wonder whether, even if Lucene catches up a little, its current mindshare, reliability, and install base will continue to dominate. For example, see LUCENE-9004 for adding dense vector search: https://issues.apache.org/jira/plugins/servlet/mobile#issue/...

The holy grail is a system that does a good job at classic full text search AND dense vector search. I wonder what system has the best chance of delivering that.


Could you share any hints on how to determine whether an application's search domain is suitable for vector-based search?

(context: I'm maintaining a recipe search engine, and it strikes me that recipes could be considered as part of a vector space, but I'm wary of adopting new technologies with something that works 'well enough' for now (Elasticsearch, in this case))


Any search domain is a good fit for vector search if your method for producing vectors is good! It seems like it should be possible with recipes. Are you performing search based on recipe text? It shouldn't take long to try out a simple POC with standard text embedding models (GloVe, BERT, etc.) and inspect the vectors via t-SNE or just a vector index.

PS: I work at Pinecone; you can see if any of the examples here help you: https://www.pinecone.io/docs/examples/
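
To make that concrete, here is a minimal POC sketch (the model name, recipe texts, and brute-force cosine scoring are all just illustrative; any embedding model and vector index would do):

    # Embed recipe texts with an off-the-shelf model and run a
    # brute-force nearest-neighbour query over the results.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    recipes = [
        "Tomato soup with basil and garlic",
        "Chocolate chip cookies with sea salt",
        "Thai green curry with chicken and coconut milk",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = model.encode(recipes, normalize_embeddings=True)  # unit vectors

    query = model.encode(["spicy coconut chicken"], normalize_embeddings=True)
    scores = (vecs @ query.T).ravel()   # cosine similarity, since normalized
    for i in np.argsort(-scores):
        print(f"{scores[i]:.3f}  {recipes[i]}")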


The vectors I have in mind could include (or perhaps be inferred from) recipe text, but they might also include things like nutritional content and a freshest-ingredients-in-the-kitchen signal (maybe an ingredient x time-since-purchase vector?) for matching the nearest recipes.

The examples are helpful and well documented, thank you. I'd mention that I'm constraining the technology options to self-hostable FLOSS software, with a strong bias towards concepts that are already well understood by most software engineers, but I'll keep an eye out for Pinecone.


That is definitely an interesting use case. You can look for methods that combine multiple vectors into a single combined vector, or you may want to filter on both metadata and vectors. Who knows, you might be able to come up with something unique for your example!
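
For illustration only (nothing engine-specific), one simple way to combine such signals is to normalize each sub-vector and concatenate them with weights reflecting how much you trust each signal:

    import numpy as np

    def combine(text_vec, nutrition_vec, freshness_vec, weights=(1.0, 0.5, 0.5)):
        """Normalize each signal, scale by its weight, and concatenate
        into a single vector for the search index."""
        parts = []
        for v, w in zip((text_vec, nutrition_vec, freshness_vec), weights):
            v = np.asarray(v, dtype=np.float32)
            norm = np.linalg.norm(v)
            parts.append(w * (v / norm) if norm > 0 else v)
        return np.concatenate(parts)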


I’m really glad this exists and is open source. Two years ago I was working on a project that really could have benefitted from a strong ANN search engine but couldn’t find one. In the end I brute-forced it (we only had 100k vectors to search, so with a few optimizations it was fine), but I always wished I could have solved it using something like this.
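
For reference, the brute force ended up looking roughly like this (a simplified sketch assuming pre-normalized float32 vectors, not the actual project code):

    import numpy as np

    def topk_exact(db, q, k=10):
        """Exact k-NN by cosine similarity over ~100k vectors.
        db: (n, d) float32 with pre-normalized rows; q: (d,) normalized query."""
        scores = db @ q                        # one BLAS matmul, no Python loop
        top = np.argpartition(-scores, k)[:k]  # partial selection, O(n)
        return top[np.argsort(-scores[top])]   # fully sort only the k winners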


There was a similar situation in a project I worked on; at the time FAISS/Annoy/SPTAG were considered, but we were really looking for something more managed: think Elasticsearch rather than Lucene. As luck would have it we stumbled upon https://www.milvus.io/ and it certainly checks a lot of boxes. But I'm rather excited to see some competition in this space; it's a broadly applicable trick, so I'll definitely take a look at Vald next time it comes up as well!


We recognize that Milvus is a strong competitor and we think Milvus is a great platform. Like Milvus, Vald will soon be sending pull requests to ann-benchmarks to show that it is one of the fastest ANN engines available over the network.


Annoy and ScaNN are two good options - ScaNN is too new to have solved your previous problem, but Annoy would have been great.

- https://github.com/spotify/annoy

- https://ai.googleblog.com/2020/07/announcing-scann-efficient...


With only 100k vectors, why didn’t you just use Annoy? It’s been around for many years and it is fast, reliable, super easy to use and handles pretty large data quite well too.

https://github.com/spotify/annoy
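
Basic usage really is just a few lines (the dimension and tree count here are arbitrary):

    import random
    from annoy import AnnoyIndex

    dim = 128
    index = AnnoyIndex(dim, "angular")   # "angular" is a cosine-like distance
    for i in range(100_000):
        index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
    index.build(10)                      # 10 trees; more trees -> better recall
    index.save("vectors.ann")            # mmap-able, so indexes are easy to share

    print(index.get_nns_by_item(0, 10))  # the 10 nearest neighbors of item 0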


I looked at Annoy at the time, but from memory it required rebuilding the index every time a new vector was inserted, which was an issue for the project I was working on.

I assume that Vald takes care of the reindexing in the background, which would save a lot of the work.


Main author and architect of Weaviate (https://github.com/semi-technologies/weaviate) here. This real-time requirement was one of the major design principles from the get-go in Weaviate.

In Weaviate, any imported vector is immediately searchable; you can update and delete your objects or the vectors attached to them, and all results are immediately reflected. In addition, every write goes to a write-ahead log, so writes are persisted even if the app crashes.

We wanted to make sure that Weaviate really combines the advantages of AI-based search with the comfort and guarantees you would expect from an "old school" database or search engine.



FAISS?


Also Annoy and more recently ScaNN


This might be of interest to some - Vald without the Kubernetes requirement (or integration): https://github.com/rinx/alvd


Hi! alvd is our team member's side project. Thank you for sharing.


What’s the typical vector dimension/size that this would work well for?

I did something like this naively, using array indexes over vectors of about length 20 on Cloudant (CouchDB), in order to implement a text similarity search based on something simple like LDA auto-tagging and keyword-frequency clustering vectors.

I would query based on input text vectorized the same way, simply offsetting each element of that input vector by a fixed small value and using those offsets as start and end keys for a Couch view query.

I didn’t test this on larger vectors but have always been curious on how far it could go.

The matching algorithms in this project are probably more accurate, but this naive method worked pretty well for my use case, and on a maintenance-free DBaaS.
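
To make the key construction concrete, it was along these lines (a from-memory sketch with a hypothetical view name, not the exact code):

    import json
    import requests

    # Hypothetical CouchDB view, keyed on the document's vector.
    VIEW_URL = "http://localhost:5984/docs/_design/search/_view/by_vector"

    def similar_docs(vec, delta=0.05):
        """Offset each element of the input vector by +/- delta and use
        the results as start/end keys for the view range query."""
        params = {
            "startkey": json.dumps([x - delta for x in vec]),
            "endkey": json.dumps([x + delta for x in vec]),
        }
        return requests.get(VIEW_URL, params=params).json()["rows"]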

Would love to hear about more concrete usage examples in this area if anybody would like to share - thanks!

Edit - found my slightly more detailed blog post about this http://splatcollision.com/page/fast-vector-similarity-querie...


How does Vald hold up against Milvus in terms of performance and storage? We make a lot of use of Milvus: it often returns in a crazy-fast ~10 ms over a large set of vectors without making a single core sweat, and for about 200M vectors at dim 128 we only use around 40 GB of storage after quantization and index building.


Hi, I'm kpango, the developer of Vald. I'm glad to see that many people are interested in Vald, and I'd be happy to receive feedback on any inconveniences via a GitHub issue or PR.


Does Vald support boolean filtering? That is, can I attach a number of attributes to vectors, and then find the nearest neighbors that match a boolean filter on the attributes?


You might be interested in checking out Weaviate (https://github.com/semi-technologies/weaviate), which supports combining vector search with boolean filters. In Weaviate, the vector index is stored alongside an inverted index, so you get the best of both worlds. At the moment, on a filtered search the inverted index is hit first, so you essentially end up with an allow-list of IDs which is then fed to the vector index. If an ID isn't on the allow-list, the next neighbor is selected, and so on.

There are also plans to further optimize this based on how restrictive the boolean filter is. For example, if the filter still matches a lot of IDs (e.g. 50% of all candidates), the approach outlined above works really well. But if your filter is extremely restrictive and maybe only matches 1k out of 100M results, then this approach is no longer ideal, as the vector index will discard most of the IDs it sees. In that case it would be more efficient to simply perform a brute-force search on those 1k vectors. This and other optimizations around combined vector and scalar search are on the roadmap.
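
In rough Python, the strategy described above looks something like this (not Weaviate's actual implementation; `inverted_index` and `vector_index` are stand-ins, and the threshold would really be tuned or adaptive):

    import numpy as np

    BRUTE_FORCE_THRESHOLD = 10_000  # illustrative cutoff

    def filtered_search(query, filter_expr, inverted_index, vector_index,
                        vectors, k=10):
        """Pre-filter via the inverted index, then choose a search
        strategy based on how restrictive the filter turned out to be."""
        allow = inverted_index.match(filter_expr)   # set of allowed IDs

        if len(allow) <= BRUTE_FORCE_THRESHOLD:
            # Very restrictive filter: exact search over the survivors beats
            # walking the graph and discarding almost everything it visits.
            ids = np.fromiter(allow, dtype=np.int64)
            scores = vectors[ids] @ query           # vectors: raw (n, d) matrix
            return ids[np.argsort(-scores)[:k]]

        # Permissive filter: traverse the ANN index and skip IDs that are
        # not on the allow-list until k matches have been collected.
        results = []
        for cand in vector_index.neighbors(query):  # yields IDs by distance
            if cand in allow:
                results.append(cand)
                if len(results) == k:
                    break
        return results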


Thanks for the explanation. Is it true that a node in HNSW can be a jump point for the next iteration of the search, meaning its neighbors can well be candidates for a query even if the node itself is filtered out? If so, wouldn't skipping a node because it is not on the allow-list potentially result in a loss of recall? Would it work to keep expanding the search frontier until we find enough results that are both close enough to the given query and match the given filter?


Since Vald is a pure ANN search engine, it does not support graph search over boolean-filtered results. However, it does have post-filtering capabilities, so the same use case can be achieved with a post-filter: retrieve a TopK larger than the desired N, then filter the results down to N. The documentation for Vald's filter feature is in preparation and should be available soon. There are also functions that integrate seamlessly with TensorFlow and ONNX, so text data can be embedded in a vector space without boolean filtering, and similarity searches can be done purely with ANN by weighting vector queries. Our ideal world is one in which everything becomes a vector, and a wide variety of searches can be performed over vectors.


Thanks. The challenge with post-filtering is that the K in TopK becomes unpredictable if I'd like to get a fixed page size. I may end up issuing multiple queries with progressively larger Ks to get a single page of results.
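
Something like this sketch, with a hypothetical search client:

    def filtered_page(client, query, predicate, n=10, max_k=1024):
        """Fetch one page of n post-filtered results, doubling K until
        enough hits survive the filter. `client.search` is hypothetical."""
        k = n * 2
        while True:
            hits = client.search(query, top_k=k)
            kept = [h for h in hits if predicate(h)]
            if len(kept) >= n or k >= max_k:
                return kept[:n]
            k *= 2  # unpredictable: may take several round trips per page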


Exactly. See my reply elsewhere in this thread for how Weaviate solves the filter issue: an inverted index produces an allow-list of IDs, which is then passed to the vector index, where non-matching IDs are simply skipped.


Can't speak for Vald, but filtering is coming in the next release of Pinecone (https://www.pinecone.io), which is a managed vector search service.


Related to cloud-scale ANN: if anyone is interested in testing an embedding service that helps programmers scale NLP operations without the infra hassle, please reach out! (edward.benson@gmail.com)

This thread has been a wealth of information. I’ve had great success with Annoy so far; eager to kick the tires on Vald.



What could be possible use cases for a search engine like this?


Dense vectors are pervasive in neural networks. Take a layer's activations and you have a dense vector (which is what NN embeddings are).

Indexing them is really hard: the volume of the local neighborhood to be searched grows like radius^dimension, so you need a specialized engine.

Now, say you are a three-letter agency and have Facebook's image data (2B faces, 100B images). You train a deep learning image classification network to identify faces, but for training-time reasons you can only do it on a dataset of about 10k different faces and 10M images. The NN will identify characteristic visual features to make its classification, available as activations at the n-1 layer.

But you need to have the rest of the dataset searchable! You now index the activations of all 100B remaining images, and with this you can search by similarity of visual features. If you have an image of someone you want to track, get its activations and search for other images that have similar vectors.

This works for everything that a NN can model: speech, words, words in sentences, videos, etc.
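
As a concrete sketch of the activations-as-embeddings idea (using an off-the-shelf PyTorch classifier here; a real system would differ):

    import torch
    from torchvision import models, transforms
    from PIL import Image

    # Pretrained classifier; swapping the final layer for Identity exposes
    # the penultimate (n-1 layer) activations as a 2048-d embedding.
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.fc = torch.nn.Identity()
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    with torch.no_grad():
        img = preprocess(Image.open("face.jpg")).unsqueeze(0)  # (1, 3, 224, 224)
        embedding = model(img)   # (1, 2048): this is the vector you index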

----

On another note, Facebook has a feature-rich, GPU-powered, scalable indexing engine called FAISS:

https://github.com/facebookresearch/faiss
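
Basic FAISS usage for reference (the exact flat index shown here; FAISS also offers IVF/PQ and GPU variants for scale):

    import numpy as np
    import faiss

    d = 128
    xb = np.random.random((100_000, d)).astype("float32")  # database vectors
    xq = np.random.random((5, d)).astype("float32")        # query vectors

    index = faiss.IndexFlatL2(d)   # exact L2; swap in IndexIVFPQ etc. at scale
    index.add(xb)
    distances, ids = index.search(xq, 4)  # 4 nearest neighbors per query
    print(ids)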


Had exactly the same question and found this short clip (from Weaviate, an alternative vector search engine): https://www.youtube.com/watch?v=SDOl9fRObVg Given Vald's distributed nature, I think the aim is to reduce single points of failure.


Mostly anything related to similarity search; there is a GIF on the Weaviate GitHub that shows a few search examples: https://github.com/semi-technologies/weaviate#weaviate-

ML models output vectors, and you can use a vector search engine to store those vectors and quickly search through them. Weaviate also stores the data objects, by the way, like a DB: https://db-engines.com/en/blog_post/87


I'd recommend Milvus. There are plenty of use cases you can find on their Medium blogs: https://blog.milvus.io.


what have you done to VLAD>>>?



