Semi-related: Anybody know what's the current state of the art for calculating n...

Radim · on Feb 11, 2018

Various approaches based on LSH (locality sensitive hashing) and random projections.

But "the best" really depends on your data profile, as the generic problem formulation, for completely generic vectors, is too… generic. See also the curse of dimensionality [0].

In practical terms, if you're looking for an actual implementation to use, here's a benchmark of some state-of-the-art implementations for high-dimensional vector search: http://radimrehurek.com/2013/12/performance-shootout-of-near... and also the updated https://github.com/erikbern/ann-benchmarks.

In particular, for something good AND open source:

- Leonid Boytsov's NMSLIB [1]

- Erik Bernhardsson's Annoy [2]

- Facebook's FAISS [3]

For a commercial engine focused more on practical use (index management, transactions, versioning, support), see ScaleText [4] (disclaimer: our product).

[0] https://en.wikipedia.org/wiki/Curse_of_dimensionality [1] https://github.com/searchivarius/nmslib [2] https://github.com/spotify/annoy [3] https://github.com/facebookresearch/faiss [4] https://scaletext.com/

photon-torpedo · on Feb 11, 2018

Very helpful, many thanks!