Faiss: A library for efficient similarity search (github.com/facebookresearch)
263 points by tosh on March 30, 2023 | 52 comments



hnswlib (https://github.com/nmslib/hnswlib) is a strong alternative to faiss that I have enjoyed using for multiple projects. It is simple and has great performance on CPU.

After working through several projects that utilized local hnswlib and different databases for text and vector persistence, I integrated hnswlib with sqlite to create an embedded vector search engine that can easily scale up to millions of embeddings. For self-hosted situations of under 10M embeddings and less than insane throughput I think this combo is hard to beat.

https://github.com/jiggy-ai/hnsqlite


I totally agree and hnswlib is actually much faster than FAISS on CPU.

I'm really happy to see `hnswlib` as a Python dependency since I'm the one who implemented PyPI support: https://github.com/nmslib/hnswlib/pull/140


Interesting - is there a good reference to back this claim? Curious to hear what overheads Faiss would have if it's configured with similar parameters to build the HNSW graphs. Is that what you A/B-tested in practice?


http://ann-benchmarks.com

hnswlib implementation of hnsw is faster than faiss's implementation. Faiss has other index methods that are faster in some cases, but more complex as well.


Thank you for this! This project is really hnswlib-sqlite just shortened into hns(w)qlite.


I like Faiss but I tried Spotify's annoy[1] for a recent project and was pretty impressed.

Since lots of people don't seem to understand how useful these embedding libraries are here's an example. I built a thing that indexes bouldering and climbing competition videos, then builds an embedding of the climber's body position per frame. I then can automatically match different climbers on the same problem.

It works pretty well. Since the body positions are 3D it works reasonably well across camera angles.

The biggest problem is getting the embedding right. I simplified it a lot above because I actually need to embed the problem shape itself because otherwise it matches too well: you get frames of people in identical positions but on different problems!

[1] https://github.com/spotify/annoy


I looked a bit at the code; I think it would be low-hanging fruit to add additional sqlite fields besides the vector ones, even if any filtering happens somewhat suboptimally in post-processing.



Cool! Using it right now. Question: Why not store the hnswlib binary right within the SQLite? Then the whole index would be in one file.


yes, this is what I want to do


hnswlib supports pre-filtering


If anyone is interested in diving deeper into Faiss, we put together an unofficial manual after not finding many learning materials about it:

https://www.pinecone.io/learn/faiss/


I really like your learning series. What you’ve done for understanding conversational memory at https://www.pinecone.io/learn/langchain-conversational-memor... was truly helpful. Thanks!


Awesome. I used Faiss during my PhD studies on AI-based lidar map building, and I spent countless hours in the Faiss GitHub wiki pages, example code, and issues. Would have loved something like this back then. Bookmarked for the next time I need Faiss.


Faiss is a wonderful vector search library - in particular, the ability to do hybrid indexes e.g. IVF-PQ, IVF-SQ is great. We (https://milvus.io) use it as one of the indexing options (along with Annoy, Nmslib, and DiskANN) to power our vector database.


I've actually been looking at Milvus for storing embeddings, why do y'all have options for both Pulsar and Kafka inside Milvus? Seems like unnecessary choices for what we hoped would just be a plug and play vector database


The design of Milvus 2.x follows this paper we published a while back: https://arxiv.org/abs/2206.13843. In short, we used Pulsar to implement the write-ahead log, which provides coordination and a single source of truth across all Milvus components.

You're right in that it's a bit heavyweight, so we're working to see how we can make pub/sub and other cluster components lighter and more efficient overall.


Based on some superficial research yesterday, Weaviate looks like an easier option for messing around locally, but Milvus looks better for a production use case.


There is a WIP [0] on RAFT [1] integration in faiss as an implementation of CUDA GPU-backed indices, although you can use RAFT directly.

[0] https://github.com/facebookresearch/faiss/pull/2521 [1] https://github.com/rapidsai/raft


txtai combines Faiss and SQLite to support similarity search with SQL.

For example: SELECT id, text, date FROM txtai WHERE similar('machine learning') AND date >= '2023-03-30'

GitHub: https://github.com/neuml/txtai

This article is a deep dive on how the index format works: https://neuml.hashnode.dev/anatomy-of-a-txtai-index


Check out https://github.com/erikbern/ann-benchmarks for benchmarks on some of the different ANN libraries out there. I'd be interested in hearing others' experiences using these libraries in production.


Faiss is awesome but note that it's an ANN library that might not be suitable for all use cases; hence vector databases exist. Two things that might be of interest: the main contributor of Faiss on the Weaviate podcast: https://youtu.be/5o1YTp1IL5o and Weaviate as vector DB itself: https://weaviate.io


Faiss supports FLAT indexes, i.e. exact (non-approximate) nearest neighbor. Also, all the vector databases use ANN too, because exact nearest neighbor is hard at scale.

Or are you referring to the ability to store data in addition to the vectors? In which case, you can pair any time tested DB with the index avoiding the hype of DBs that might be gone in a year.


Vector indexes only form a component within vector databases - you'd want the database bits and bobs (scalability, replication, caching, etc) surrounding it in addition to the index itself. Milvus, for example, supports flat indexes - as far as I know, Weaviate could support flat indexing as well if it doesn't do so already.

Most vector databases focus on ANN because of scale. Once you get to around a million vectors or so, it becomes prohibitively expensive to perform brute-force querying and search.


We used Weaviate and Elastic in prod. Felt that Weaviate was not yet prod-ready wrt deployment (sharding and clustering are murky). We went with ES's dense_vector field, which is serving us well.



Recently used Faiss to optimize a k-means implementation. Way faster than the sklearn implementation; can't recommend it enough.


Why?


Sorry, a noob question: how is Faiss different from, say, pgvector, other than the fact that pgvector is developed mainly as a Postgres extension? I think I saw use cases of using pgvector for similarity search on OpenAI embeddings of text (which themselves are generated from a NN).


Scale, mainly. Faiss supports a bunch more algorithms and handles a much larger scale.


Although there is some work going on right now to add support for those types of algorithms in pgvector, to allow it to scale better (and also to have better recall/speed tradeoffs).


Thank you, that makes sense


Big fan of Faiss - I've tried using several others (milvus, weaviate, opensearch, etc) but none struck the usability and configurability chord as much as Faiss did.

I especially like their index-factory models. Once you figure out how to build it properly, you can easily push beyond 100M vectors (512-dim) on a single reasonably beefy node. Memory-mapping, sub-20ms latencies on 10M+ vectors, bring-your-own training sampling strategies, configurable memory-usage, PQ, the list goes on. Once you have this, distributing it across nodes becomes trivial and allows you to keep scaling horizontally.

Not sure if others have used their GPU bindings, but being able to train about 10x faster on your data is a game-changer for rapid-experimentation and evaluation, especially when you need to aggressively quantize at this scale. Also, the fact that you have an extremely portable GPU-trained index that you can run on a lightweight-CPU (potentially even a lambda) is very compelling to me.

That said, I'd love to see Faiss ported to the browser (using WASM) - if any of this sounded useful or intriguing, DM me, would love to share notes and learn more about how folks are using Faiss today.


Interesting - could it be the case that your use case is more suited for a library like FAISS instead of a vector DB? Would love to understand this. (I’m affiliated with Weaviate).


Note of high importance: "Faiss" reads and sounds exactly like "Fesse" (butt) in French.

I see some French names in the author list. The joke was intended, well done! :)


This may be a stupid question, but how would one generate the vectors that this uses? (I assume you can't just feed it images or 3d models?)


Good timing -- we actually just published a tutorial showing how to build semantic image search (including generating vectors with CLIP) with faiss this morning: https://blog.roboflow.com/clip-semantic-search/


It will depend on the type of data you are using, and then on what you are planning to do with it. Various open source models will take various modalities (e.g. images or 3D models) and will create an embedding/vector for them (even if just as an internal representation). You can take these and store them in a vector DB. This may help https://www.marqo.ai/blog/how-to-implement-text-to-image-sea...


This is typically done by taking the activations from a neural network. Activations from deeper layers are generally preferred since they are more representative.
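
A toy sketch of the idea using an untrained MLP (in practice you would take activations from a pretrained model such as CLIP or a ResNet, not a random network like this):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),  # <- use these activations as the embedding
    nn.Linear(16, 2),              # task head, discarded for embedding purposes
)
# Drop the final layer so the network outputs its penultimate activations.
embedder = nn.Sequential(*list(model.children())[:-1])

x = torch.rand(3, 8)  # 3 fake inputs
with torch.no_grad():
    emb = embedder(x)
print(emb.shape)  # → torch.Size([3, 16])
```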


Show HN: Auto-comment with a content marketing piece on every trending HN story about your startup’s topic using GPT-4

…written in Rust.


I've wondered, what's the right way to pronounce this library? "face"?


Huggingface Datasets also has an integrated interface to Faiss! https://huggingface.co/docs/datasets/faiss_es


How are people using vector DBs in production? Do you typically use and manage Faiss indexes alone or use something like Milvus, Pinecone, Weaviate, or Chroma?


A big difficulty in using vector DBs in production for things like embeddings or LLMs is that there is a lot that goes into converting and processing raw input into vector form (think chunking, formatting, encoding, inference, metadata, etc.). DBs like Pinecone just don't handle any of that, so you have to build out large systems to do it yourself.

There are some platforms and open source tools that handle it end to end. https://github.com/marqo-ai/marqo is one, for example that is both open source and has a cloud offering.


We put together some stories here: https://www.pinecone.io/learn/wild/

Usually folks use a vector database alongside a doc store like Postgres, Snowflake, Elastic…


We are using dense_vector field that Elastic Search offers and it's scaling quite well.


We're using Elasticsearch with binary vectors and our own sub-token filtering approach, but if we were to start over again today we'd simply use the dense vectors and ANN built into Elasticsearch, just with float vectors.


I’ve increasingly been using pgvector. It’s an excellent workflow when you can just include similarity search as another where clause in your sql query.


Any rust equivalent of faiss?


I know Rust has bindings to FAISS (see https://github.com/Enet4/faiss-rs), but I don't know if there's anything that would be considered comparable. A lot of work has gone into FAISS.


I forgot about https://github.com/qdrant/qdrant. It's a DB, not a library, so again it may not be an exact answer for what you're looking for.


Maybe https://github.com/hora-search/hora but I've never used it



