Faiss: A library for efficient similarity search (github.com/facebookresearch)
263 points by tosh on March 30, 2023 | 52 comments



hnswlib (https://github.com/nmslib/hnswlib) is a strong alternative to faiss that I have enjoyed using for multiple projects. It is simple and has great performance on CPU.

After working through several projects that utilized local hnswlib and different databases for text and vector persistence, I integrated hnswlib with sqlite to create an embedded vector search engine that can easily scale up to millions of embeddings. For self-hosted situations of under 10M embeddings and less than insane throughput I think this combo is hard to beat.

https://github.com/jiggy-ai/hnsqlite


I totally agree and hnswlib is actually much faster than FAISS on CPU.

I'm really happy to see `hnswlib` as a Python dependency since I'm the one who implemented PyPI support: https://github.com/nmslib/hnswlib/pull/140


Interesting - is there a good reference to back this claim? Curious to hear what overheads Faiss would have if it's configured with similar parameters to build the HNSW graphs. Is that what you A/B-tested in practice?


http://ann-benchmarks.com

hnswlib implementation of hnsw is faster than faiss's implementation. Faiss has other index methods that are faster in some cases, but more complex as well.


Thank you for this! This project is really hnswlib-sqlite just shortened into hns(w)qlite.


I like Faiss but I tried Spotify's annoy[1] for a recent project and was pretty impressed.

Since lots of people don't seem to understand how useful these embedding libraries are here's an example. I built a thing that indexes bouldering and climbing competition videos, then builds an embedding of the climber's body position per frame. I then can automatically match different climbers on the same problem.

It works pretty well. Since the body positions are 3D it works reasonably well across camera angles.

The biggest problem is getting the embedding right. I simplified it a lot above because I actually need to embed the problem shape itself because otherwise it matches too well: you get frames of people in identical positions but on different problems!

[1] https://github.com/spotify/annoy


I looked a bit at the code; I think it would be low-hanging fruit to add additional sqlite fields besides the vector ones, even if any filtering happens somewhat suboptimally in post-processing.



Cool! Using it right now. Question: Why not store the hnswlib binary right within the SQLite? Then the whole index would be in one file.


yes, this is what I want to do


hnswlib supports pre-filtering


If anyone is interested in diving deeper into Faiss, we put together an unofficial manual after not finding many learning materials about it:

https://www.pinecone.io/learn/faiss/


I really like your learning series. What you’ve done for understanding conversational memory at https://www.pinecone.io/learn/langchain-conversational-memor... was truly helpful. Thanks!


Awesome. I used Faiss during my PhD studies on AI-based lidar map building, and I spent countless hours in the Faiss GitHub wiki pages, example code, and issues. Would have loved something like this back then. Bookmarked for the next time I need Faiss.


Faiss is a wonderful vector search library - in particular, the ability to do hybrid indexes e.g. IVF-PQ, IVF-SQ is great. We (https://milvus.io) use it as one of the indexing options (along with Annoy, Nmslib, and DiskANN) to power our vector database.


I've actually been looking at Milvus for storing embeddings, why do y'all have options for both Pulsar and Kafka inside Milvus? Seems like unnecessary choices for what we hoped would just be a plug and play vector database


The design of Milvus 2.x follows this paper we published a while back: https://arxiv.org/abs/2206.13843. In short, we used Pulsar to implement the write-ahead log, which provides coordination and a single source of truth across all Milvus components.

You're right in that it's a bit heavyweight, so we're working to see how we can make pub/sub and other cluster components lighter and more efficient overall.


Based on some superficial research yesterday, Weaviate looks like an easier option for messing around locally, but Milvus looks better for a production use case.


There is a WIP [0] on RAFT [1] integration in faiss as an implementation of CUDA GPU-backed indices, although you can use RAFT directly.

[0] https://github.com/facebookresearch/faiss/pull/2521 [1] https://github.com/rapidsai/raft


txtai combines Faiss and SQLite to support similarity search with SQL.

For example: SELECT id, text, date FROM txtai WHERE similar('machine learning') AND date >= '2023-03-30'

GitHub: https://github.com/neuml/txtai

This article is a deep dive on how the index format works: https://neuml.hashnode.dev/anatomy-of-a-txtai-index


Check out https://github.com/erikbern/ann-benchmarks for benchmarks on some of the different ANN libraries out there. I'd be interested in hearing others' experiences using these libraries in production.


Faiss is awesome but note that it's an ANN library that might not be suitable for all use cases; hence vector databases exist. Two things that might be of interest: the main contributor of Faiss on the Weaviate podcast: https://youtu.be/5o1YTp1IL5o and Weaviate as vector DB itself: https://weaviate.io


Faiss supports FLAT indexes, i.e. exact (non-approximate) nearest neighbor. Also, all the vector databases use ANN too, because exact nearest neighbor is hard at scale.

Or are you referring to the ability to store data in addition to the vectors? In which case, you can pair any time tested DB with the index avoiding the hype of DBs that might be gone in a year.


Vector indexes only form a component within vector databases - you'd want the database bits and bobs (scalability, replication, caching, etc) surrounding it in addition to the index itself. Milvus, for example, supports flat indexes - as far as I know, Weaviate could support flat indexing as well if it doesn't do so already.

Most vector databases focus on ANN because of scale. Once you get to around a million vectors or so, it becomes prohibitively expensive to perform brute-force querying and search.


We used Weaviate and Elastic in prod. Felt that Weaviate was not yet prod-ready wrt deployment (sharding and clustering are murky). We went with ES's dense_vector field, which is serving us well.



Recently used Faiss to optimize a k-means implementation. Way faster than the sklearn implementation; can't recommend it enough.


Why?


Sorry, a noob question: how is Faiss different from, say, pgvector, other than the fact that pgvector is developed mainly as a Postgres extension? I think I saw use cases of using pgvector for similarity search on OpenAI embeddings of text (which themselves are generated from a NN).


Scale, mainly. Faiss supports a bunch more algorithms and handles a much larger scale.


Although there is some work going on right now to add support for those types of algorithms in pgvector, to allow it to scale better (and also to have better recall/speed tradeoffs).


Thank you, that makes sense


Big fan of Faiss - I've tried using several others (milvus, weaviate, opensearch, etc) but none struck the usability and configurability chord as much as Faiss did.

I especially like their index-factory models. Once you figure out how to build it properly, you can easily push beyond 100M vectors (512-dim) on a single reasonably beefy node. Memory-mapping, sub-20ms latencies on 10M+ vectors, bring-your-own training sampling strategies, configurable memory-usage, PQ, the list goes on. Once you have this, distributing it across nodes becomes trivial and allows you to keep scaling horizontally.

Not sure if others have used their GPU bindings, but being able to train about 10x faster on your data is a game-changer for rapid-experimentation and evaluation, especially when you need to aggressively quantize at this scale. Also, the fact that you have an extremely portable GPU-trained index that you can run on a lightweight-CPU (potentially even a lambda) is very compelling to me.

That said, I'd love to see Faiss ported to the browser (using WASM) - if any of this sounded useful or intriguing, DM me, would love to share notes and learn more about how folks are using Faiss today.


Interesting - could it be the case that your use case is more suited for a library like FAISS instead of a vector DB? Would love to understand this. (I’m affiliated with Weaviate).


Note of high importance: "Faiss" reads and sounds exactly like "Fesse" (butt) in French.

I see some French names in the author list. The joke was intended, well done! :)


This may be a stupid question, but how would one generate the vectors that this uses? (I assume you can't just feed it images or 3d models?)


Good timing -- we actually just published a tutorial showing how to build semantic image search (including generating vectors with CLIP) with faiss this morning: https://blog.roboflow.com/clip-semantic-search/


It will depend on the type of data you are using, and then on what you are planning to do with it. Various open source models will take various modalities (e.g. images or 3D models) and will create an embedding/vector for them (even if just as an internal representation). You can take these and store them in a vector DB. This may help https://www.marqo.ai/blog/how-to-implement-text-to-image-sea...


This is typically done by taking the activations from a neural network. Activations from deeper layers are generally preferred since they are more representative.
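
A toy sketch of the idea using an untrained MLP (in practice you would take activations from a pretrained model such as CLIP or a ResNet, not a random network like this):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),  # <- use these activations as the embedding
    nn.Linear(16, 2),              # task head, discarded for embedding purposes
)
# Drop the final layer so the network outputs its penultimate activations.
embedder = nn.Sequential(*list(model.children())[:-1])

x = torch.rand(3, 8)  # 3 fake inputs
with torch.no_grad():
    emb = embedder(x)
print(emb.shape)  # → torch.Size([3, 16])
```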


Show HN: Auto-comment with a content marketing piece on every trending HN story about your startup’s topic using GPT-4

…written in Rust.


I've wondered, what's the right way to pronounce this library? "face"?


Huggingface Datasets also has an integrated interface to Faiss! https://huggingface.co/docs/datasets/faiss_es


How are people using vector DBs in production? Do you typically use and manage Faiss indexes alone or use something like Milvus, Pinecone, Weaviate, or Chroma?


A big difficulty in using vector DBs in production for things like embeddings or LLMs is that there is a lot that goes into converting and processing raw input into vector form (think chunking, formatting, encoding, inference, metadata, etc.). DBs like Pinecone just don't handle any of that, so you have to build out large systems to do it yourself.

There are some platforms and open source tools that handle it end to end. https://github.com/marqo-ai/marqo is one, for example that is both open source and has a cloud offering.


We put together some stories here: https://www.pinecone.io/learn/wild/

Usually folks use a vector database alongside a doc store like Postgres, Snowflake, Elastic…


We are using dense_vector field that Elastic Search offers and it's scaling quite well.


We're using Elasticsearch with binary vectors and our own sub-token filtering approach, but if we were to start over again today we'd simply use the dense vectors and ANN built into Elasticsearch, just with float vectors.


I’ve increasingly been using pgvector. It’s an excellent workflow when you can just include similarity search as another where clause in your sql query.


Any rust equivalent of faiss?


I know Rust has bindings to FAISS (see https://github.com/Enet4/faiss-rs), but I don't know if there's anything that would be considered comparable. A lot of work has gone into FAISS.


I forgot about https://github.com/qdrant/qdrant. It's a DB, not a library, so again it may not be an exact answer for what you're looking for.


Maybe https://github.com/hora-search/hora but I've never used it



