
Why is the research story for MIPS even worse than for ANN?



There is no good geometry to be exploited, and the query vectors are usually distributed quite differently from the indexed vectors.

For Euclidean (L2) distance indexes where the vectors are partitioned based on geometry (which covers pretty much every indexing type: cell-probe methods like IVF, most forms of LSH, and graph-based indices), query vectors can be naturally associated geometrically with candidate nearest neighbor vectors, so the distribution of queries doesn't matter as much.
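A toy sketch of the cell-probe idea (the centroids and vectors here are made up for illustration): indexed vectors are assigned to the cell of their nearest centroid, and at query time the query's own geometry tells you which cell to probe, because for L2 a vector's nearest neighbors live near it.

```python
import math

def l2(a, b):
    return math.dist(a, b)  # Euclidean distance, Python 3.8+

# Toy "IVF"-style index: two hypothetical centroids partition the space.
centroids = [(0.0, 0.0), (10.0, 10.0)]
db = [(0.5, 0.2), (1.0, -0.3), (9.0, 10.5), (10.2, 9.8)]

# Assign each indexed vector to its nearest centroid (its "cell").
cells = {0: [], 1: []}
for v in db:
    cells[min(range(2), key=lambda c: l2(v, centroids[c]))].append(v)

# At query time, probe only the query's nearest cell.
q = (9.5, 10.0)
probe = min(range(2), key=lambda c: l2(q, centroids[c]))
best = min(cells[probe], key=lambda v: l2(q, v))
assert probe == 1 and best == (9.0, 10.5)
```

For MIPS this routing breaks down: the cell a query falls into geometrically need not contain the vector with the largest inner product.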

For inner product, it's hard to do much better than spherical clustering (what one would usually do for cosine similarity: project all vectors to the surface of a unit hypersphere, after which nearest neighbor search via cosine similarity is exactly equivalent to L2 search). But in general the maximum-inner-product vector in the indexed set may lie nowhere near the projection of the query vector onto the surface of the hypersphere.
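The cosine/L2 equivalence is easy to check numerically: for unit vectors, ||a - b||^2 = 2 - 2(a . b), so ranking by cosine similarity (descending) and by L2 distance on the normalized vectors (ascending) gives the same order. A minimal sketch with made-up vectors:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical indexed set and query.
db = [[3.0, 5.0], [1.0, 0.0], [-2.0, 5.0]]
q = [2.0, 1.0]

# Rank by cosine similarity (descending) and by L2 distance between
# the normalized vectors (ascending): the two orderings coincide.
by_cos = sorted(range(len(db)), key=lambda i: -cosine(q, db[i]))
by_l2 = sorted(range(len(db)), key=lambda i: l2(normalize(q), normalize(db[i])))
assert by_cos == by_l2
```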

The maximum inner product for a query vector might be achieved by a vector that is nearly perpendicular to it but very far out (with a huge norm), beating a vector that is parallel to the query but has a tiny norm. In two dimensions, an example could be (1, 0) as a query vector and (1, 10^6) as a database vector (or vice versa). The inner product is 1, but the two vectors are very far apart in Euclidean distance. If you project the vectors to the unit 1-sphere, the query vector is still (1, 0) but the database vector becomes (1 / sqrt(10^12 + 1), 10^6 / sqrt(10^12 + 1)) ~= (0.000000999..., 0.99999...), which would also be in a very different cell if one were using a graph-based or IVF partitioning.
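The example above can be checked directly. Here the huge-norm vector (1, 10^6) beats a small parallel vector on inner product, yet it is astronomically far from the query in L2, and after projecting to the unit circle it is nearly perpendicular to the query:

```python
import math

q = (1.0, 0.0)    # the query vector from the example
x = (1.0, 1e6)    # database vector with a huge norm
y = (0.5, 0.0)    # database vector parallel to q but with a small norm

def ip(a, b):
    return a[0] * b[0] + a[1] * b[1]

def l2(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def normalize(v):
    n = math.hypot(*v)
    return (v[0] / n, v[1] / n)

# x wins on inner product (1.0 > 0.5) ...
assert ip(q, x) > ip(q, y)
# ... yet x is enormously far from q in Euclidean distance,
assert l2(q, x) > 1e5 and l2(q, y) < 1.0
# and after projection to the unit circle, x lands almost
# perpendicular to q (cosine similarity ~ 1e-6).
assert ip(normalize(q), normalize(x)) < 1e-5
```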

Neural search techniques do show some promise here, though (say, using a neural net to predict which vector buckets to probe).


How much of this is due to DNNs (e.g., VAEs, but also others) forcing embeddings to distribute in a Gaussianish manner? Is the data intrinsically missing geometry, or could a more subtle learning algorithm give a cleaner manifold and therefore a more efficiently indexable structure?


Thanks!

What kinds of use cases cause this kind of situation, where the query and indexed vectors are from different distributions?




