Ask HN: What is the state of art approximate k-NN search algorithm today?

fzliu · 2024-01-17T17:37:28 1705513048

You need different indexing algorithms for different use cases - brute-force indexing, for example, is "SOTA" when it comes to recall (100%). If you have multiple use cases or if you might have domain shift, you'll want a vector database that supports multiple indexes.

Here's my 2¢:

- If you're just playing around with vector search locally and have a very small dataset, use brute-force search. Don't worry about indexes until later.

- If you have plenty of RAM and CPU cores and would like to squeeze out the most performance, use ScaNN or HNSW plus some form of quantization (product quantization or scalar quantization).

- If you have limited RAM, use IVF plus PQ or SQ.

- If you want to maintain reasonable latency but aren't very concerned about throughput, use a disk-based index such as DiskANN or Starling. https://arxiv.org/pdf/2401.02116.pdf

- If you have a GPU, use GPU-specific indexes. CAGRA (supported in Milvus!) seems to be one of the best. https://arxiv.org/abs/2308.15136

All of these indexes are supported in Milvus (https://milvus.io/docs/index.md), so you can pick and choose the right one for your application. Tree-based indexes such as Annoy don't seem to have a sweet spot just yet, but I think there's room for improvement in this subvertical.

ccleve · 2024-01-17T17:24:15 1705512255

There's a recently published paper on something called Starling. As far as I can tell, it's a more efficient DiskANN. It claims enormous performance benefits.

The paper is tedious and near-unreadable, so I didn't finish reading it. Maybe there's something here, maybe not. https://arxiv.org/pdf/2401.02116.pdf

glaforge · 2024-01-17T17:16:12 1705511772

DiskANN seems to be a recent & effective approach: https://github.com/microsoft/DiskANN

MrCoffee7 · 2024-01-17T21:57:23 1705528643

I would think it would depend some on the domain you were searching over. For example, here is a use case for FLSHclust: https://news.mit.edu/2023/search-algorithm-reveals-nearly-20...

jhj · 2024-01-17T17:19:15 1705511955

All approximate k-NN algorithms involve tradeoffs, with the main tradeoffs being usually fidelity (recall, vector reconstruction error, end-to-end application error with respect to some metric etc), speed (to query, in either serial or batch mode), and memory usage (vector compression, indexing overhead, etc).

There is no one algorithm which is "better" than others along all dimensions; instead, for a given setting you look and see which technique (or combination of techniques) would let you lie along the Pareto-optimal tradeoff curve between fidelity/speed/memory, and for the application, you decide which of these is most important.

LSH in industrial practice is only really useful for incredibly high-dimensional (think 100K+ dimensions) and/or massively sparse vectors (e.g., only 1K of those 100K dimensions have a non-zero value), or if latency is a very huge concern (e.g., online approximate self-attention of vectors). For dense vectors around 20-1000 dimensions, other techniques are better. The quality of results via LSH leaves much to be desired so I have not seen it used in most modern settings, but there is a huge literature on it as it is easier to prove things mathematically regarding its properties.

Graph-based methods (e.g., HNSW): These work best for intermediate sized vector databases (100K - 10M vectors or so) as they are very fast and have high quality of results, but the memory overhead is high (indexing structures). It is impractical to use it for massive-scale indexes (think 1 billion+ vectors, unless you shard the database into smaller pieces and query all together, but this is computationally expensive. If the query vector distribution differs substantially from the database distribution you tend to run into problems here too.

Cell-probe based methods (e.g., IVF): These are used for the largest databases (e.g., 1 billion - 1 trillion+ vectors; at least those where you are not simply sharding the database and still searching every shard) as they scale pretty well. The quality of results can leave a lot to be desired though, as the partitioning of the database into geometric cells and only searching some subset (n IVF cells, only searching ~sqrt(n) of those) will tend to have a hard upper limit on finding the true nearest neighbor (recall will saturate at around 30%, say). IVF is the foundation of Google ScaNN and is used heavily at FB/Meta as well.

These choices are independent of how vectors are encoded in the database (e.g., using product quantization, scalar quantization, raw floating point vectors, binarized vectors, etc).

There are some vector search problems (e.g., MIPS, or maximum inner product search, useful in recommendation problems or for things like approximate matrix multiplication/approximate self-attention) which still have no good solution (there is no metric space/geometry that makes sense here, and the query distribution usually varies quite widely from the database distribution).

"Neural search" or other ML-assisted search techniques (given a query vector, what partition of the database should you look in?) are improving over time as well.

This new paper of ours has a good overview of the tradeoffs involved https://arxiv.org/pdf/2401.08281.pdf

(I am the author of the GPU half of the Faiss library)

lundren · 2024-01-18T19:45:12 1705607112

If you are thinking of SOTA as just recall that's a bit different than if you are thinking about other performance considerations in real-world data sets.

For example, IVF generally requires that you process the entire dataset in order to build the index, so it doesn't become searchable for some time.

HNSW and DiskANN can be built incrementally so you can start searching them immediately.

Here's a short blog that talks about some hard problems in vector search and how to solve them.

As many have mentioned, DiskANN often outperforms HNSW. HNSW seems to be the standard that most vectorDBs are going with.

https://thenewstack.io/5-hard-problems-in-vector-search-and-...

waynecochran · 2024-01-17T17:14:17 1705511657

In my practical experience, the majority computer graphics and vision frameworks (e.g. point cloud processing in Point Cloud Library, feature matching in colmap) all use FLANN (https://www.cs.ubc.ca/research/flann/) which employs a variety of NN algorithms that specialize in high dimensional spaces (e.g. 128 dimensional SIFT descriptors).

prabhatjha · 2024-01-18T20:35:24 1705610124

This article does a good job of describing hard problems with ANN search and how different algorithms solve or don't solve. https://dzone.com/articles/5-hard-problems-in-vector-search-...

fotta · 2024-01-17T17:09:02 1705511342

I'm not sure if it's SOTA these days but we use ScaNN at my job. https://github.com/google-research/google-research/tree/mast...

jasonjmcghee · 2024-01-17T17:25:24 1705512324

Another worth mentioning in this thread is usearch, though not a separate algorithm, based on HNSW with a bunch of optimizations https://github.com/unum-cloud/usearch

lukasga · 2024-01-17T17:09:40 1705511380

Spotify recently open sourced Voyager (continuation of Annoy), which uses HNSW. https://github.com/spotify/voyager

brcmthrowaway · 2024-01-17T17:37:27 1705513047

What do vector databases use?