ScaNN: Efficient Vector Similarity Search (googleblog.com)
109 points by polm23 on July 30, 2020 | 33 comments



A practical tip: try hard not to need this sort of thing; you'd be surprised how often you can avoid it.

Before you start looking at approximate nearest neighbors, ask yourself the following:

1. Do I know all of the vectors ahead of time? (i.e. query vectors are not constructed at runtime, it's a fixed inventory of items being embedded)

2. Do I need to return an unlimited number of neighbors per query? (i.e. do you ever need more than like, a few thousand neighbors)

If the answer to 1 is "yes" and the answer to 2 is "no", then you should probably just precompute the neighbors exactly on one or more GPUs. It's feasible for millions of terms, and may actually be faster than the index-building steps for many ANN algorithms. Example code: https://github.com/explosion/sense2vec/blob/master/scripts/0...

A cache of the k nearest neighbors for vectors of dimension d will be k/d the size of the original vector data, because a 32-bit integer ID takes the same space as a 32-bit float component. So if your vectors are 300 dimensions, you can save 150 neighbors and serve the results with half the space of the original data. The responses are then instant and exact, and you'll never have problems with load management because the runtime cost is so low. Example demo: https://explosion.ai/demos/sense2vec
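A minimal sketch of the precompute step with NumPy (illustrative names; on a GPU you'd do the same batched matrix multiplies with e.g. PyTorch or CuPy):

    import numpy as np

    def precompute_neighbors(vectors, k=150, batch_size=1024):
        """Exact top-k cosine neighbors for every row in `vectors`.
        Returns an (n, k) int32 array of neighbor indices."""
        # Normalize once so a plain dot product equals cosine similarity.
        normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        n = normed.shape[0]
        neighbors = np.zeros((n, k), dtype=np.int32)
        for start in range(0, n, batch_size):
            batch = normed[start:start + batch_size]
            sims = batch @ normed.T                    # (batch, n) block of similarities
            rows = np.arange(batch.shape[0])
            sims[rows, start + rows] = -np.inf         # mask out self-matches
            # argpartition finds the top k in O(n); only those k get sorted.
            topk = np.argpartition(-sims, k, axis=1)[:, :k]
            order = np.argsort(-np.take_along_axis(sims, topk, axis=1), axis=1)
            neighbors[start:start + batch.shape[0]] = np.take_along_axis(topk, order, axis=1)
        return neighbors

The only trick is batching the similarity matrix so you never materialize the full n x n block in memory.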

Again, this doesn't work for queries you have to construct on the fly; for instance, it won't work if you need to construct a vector that incorporates information from the user's session or something. It only works if the process of constructing the vectors is already a batch job, e.g. embedding the user's history. It also won't work well if you have to page through arbitrary amounts of history, as storing more than a few thousand neighbors per query would get impractical. Generally the similarity accuracy degrades to uselessness long before that point though (especially if you're using an approximation technique).


How is this similar / different to Johnson-Lindenstrauss? (https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...)

The principle seems very similar: find a simpler representation that preserves distance.


The JL lemma is essentially an error bound, which reassures us that dimensionality reduction (compress big vectors —> small vectors) is not a lost cause. More specifically, it says that even if you perform a random projection—the most naive DR method—to compress a high-dimensional data matrix to a low-dimensional one (i.e. same number of rows, but many fewer columns), it will only distort the pairwise Euclidean distances between points (row vectors of the data matrix) by $this_much.

Example: if you crudely crunch MNIST from a [60000 x 784] matrix down to [60000 x 15] matrix (15 is arbitrary here—anything smaller than the original 784 works) using random projection, the JL lemma tells you how badly you mangled the similarity structure of your dataset in the worst case scenario. It turns out it’s not so bad, so we can use DR to compress our data matrix and then confidently do useful things with the vectors it contains (like search for similar ones).
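For concreteness, the distortion is easy to eyeball yourself. A sketch with NumPy (random data stands in for MNIST here, and the target dimension 64 is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 2000, 784, 64              # n points, original dim d, target dim k
    X = rng.normal(size=(n, d))          # stand-in for a real data matrix like MNIST

    # Random projection: a d x k Gaussian matrix, scaled so squared distances
    # are preserved in expectation.
    R = rng.normal(size=(d, k)) / np.sqrt(k)
    Y = X @ R

    # Compare pairwise distances before and after on a random sample of pairs.
    i = rng.integers(0, n, size=1000)
    j = rng.integers(0, n, size=1000)
    keep = i != j
    orig = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
    proj = np.linalg.norm(Y[i[keep]] - Y[j[keep]], axis=1)
    print((proj / orig).min(), (proj / orig).max())   # distortion stays within a modest band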

ScaNN is an approximate nearest neighbor (ANN) search algorithm. There is an entire zoo of them, but they all have the same goal: given a query vector q and a data matrix X (where q has the same number of elements as a row vector in X), find the k vectors in X that are “most similar” to q, without exhaustively checking all the possible matches. Most similar could be any similarity or distance measure, but in this case it’s the simple inner product.

To do this more quickly, the algorithm uses a type of vector quantization (i.e. sort the high-D vectors into buckets), which is similar to k-means. K-means can also be construed as a dimensionality reduction technique; you are just reducing to a single dimension. That way you're only looking in buckets where there's hope of finding close matches.

This works great in real life with socks—if you’re trying to find a match for a stray black dress sock, then 1) make sure you presort your socks so that they are approximately binned by similarity and 2) look in the bin that contains socks most similar to your “query sock” (i.e. don’t look in the colorful athletic sock or plain white dad-sock bins). I’m not actually kidding, if you have a billion unmatched socks in a box, try this and you’ll see.
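In code, the bucket-then-probe idea looks roughly like this (a sketch of the generic inverted-file approach, not ScaNN's actual implementation; build_index, search, and n_probe are illustrative names, and it assumes scikit-learn's KMeans):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_index(X, n_buckets=100):
        """Bucket the dataset with k-means (a crude inverted-file style index)."""
        km = KMeans(n_clusters=n_buckets, n_init=10).fit(X)
        buckets = {b: np.where(km.labels_ == b)[0] for b in range(n_buckets)}
        return km, buckets

    def search(q, X, km, buckets, k=10, n_probe=5):
        """Only score candidates from the n_probe buckets whose centroids are closest to q."""
        dist_to_centroids = np.linalg.norm(km.cluster_centers_ - q, axis=1)
        probe = np.argsort(dist_to_centroids)[:n_probe]
        candidates = np.concatenate([buckets[b] for b in probe])
        scores = X[candidates] @ q               # inner-product similarity, as in ScaNN
        return candidates[np.argsort(-scores)[:k]]

n_probe is the sock-bin knob: probe more bins and you get better recall at the cost of scoring more candidates.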

Anyways...as for this particular algorithm, someone else may be able to explain the details better. But my coarse understanding is that the loss function (which measures how good the bucketing is—how purely you pre-binned your socks) is defined in a way that accounts for the density of the region in question—a neighbor in NYC means something different than a neighbor in Wyoming, so you need more granular buckets for denser regions. However, I thought this was essentially how almost every ANN algorithm worked, so how this one blows all the others out of the water (faster for a given recall value) is mystifying to me.

[PS—someone please correct me if any aspect of my loosey goosey explanation is inaccurate]


In case anyone else was confused about why they didn't just project onto q first, if the goal is to keep <q,x'> as close as possible to the original <q,x>: apparently q is a random query vector, and their research shows how to define an alternative quantization loss that weights the quantization error more heavily when |<q,x>| is high [1].

It's worth noting that without this additional weighting you end up with the ordinary Euclidean (reconstruction) loss, under the assumption that q is distributed symmetrically.
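For reference, my reading of [1] is that the score-aware loss for a datapoint x and its quantized version x̃ has roughly this shape (my own notation, so treat it as a sketch rather than the paper's exact statement):

    \ell(x, \tilde{x}) \;=\; \mathbb{E}_{q}\left[\, w\big(\langle q, x \rangle\big)\, \langle q,\; x - \tilde{x} \rangle^{2} \,\right]

With w ≡ 1 and q spherically symmetric, the expectation is proportional to ||x - x̃||^2, i.e. the plain Euclidean reconstruction error; choosing w to grow with |<q,x>| is what makes quantization error on high-scoring pairs cost more.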

[1]: https://arxiv.org/abs/1908.10396


I'll definitely have to try this out; from the benchmarks the results look impressive. In the documentation I noticed a dependency on TensorFlow. I get why it's needed, but it would have been nice to find a way to do without it. As an Annoy user, I find Annoy very easy to use, from install to the API. Looking forward to the continued discoveries in approximate nearest neighbor tech.


An interesting result in an area where there are many obvious things you think would help but don't.


Unrelated, but I feel strangely sad reading this, agreeing, then going...

Wait, that’s a very arbitrary and general statement. Is this a bot?

(I know (hope, given I’ve read your comments before... :) you’re not, but still, a sign of the times I guess...)


This is definitely great foundational work, but if you click through to http://ann-benchmarks.com/, you can see that they show dominant performance on the glove-100 benchmark, but decidedly middling performance on the other benchmarks for which they provide results (glove-25 and lastfm-64). I don't see this doing anything to unseat HNSW as my go-to algorithm for its strong performance across a wide variety of benchmarks.


Personally I'm waiting for this to get added to http://ann-benchmarks.com/ to see how it measures up across datasets.

Edit: Looks like it's already there for some datasets. Interestingly it is primarily strong for the glove-100 dataset. I wonder if there are parts of scann that (accidentally) exploit structure specific to word vectors.


Look again, it’s in there.


This is great. This is already better than FLANN (by facebook) and Annoy (by spotify). https://github.com/erikbern/ann-benchmarks#evaluated

Still, we are very far away from creating a reverse video search engine. Also, most of these techniques are in-memory, not on-disk.

I am really interested in making a reverse video search engine.


Not quite the same thing, but videogrep (https://github.com/antiboredom/videogrep) works along audio -> ASR -> timed transcript -> text search lines.


I'm working on an ANN plugin for Elasticsearch. All data is stored on disk, you automatically get horizontal/distributed scaling handled by ES, and you can combine ANN queries with Elasticsearch queries. http://elastiknn.klibisz.com/

It's currently not as fast as the in-memory alternatives. Though it's not a perfect apples/apples comparison. Data is stored on disk, it's a JVM implementation rather than C/C++, and it's optimized for single queries rather than bulk.


YSK that ANN is already in the process of being added to Lucene:

https://github.com/elastic/elasticsearch/issues/42326

A different implementation is already in OpenDistro for Elasticsearch:

https://opendistro.github.io/for-elasticsearch-docs/docs/knn...


Yep, I'm aware.

The Lucene implementation seems early and slow-moving. Seems they are trying to create new storage formats and use graph-based search methods. OpenDistro wrapped a C++ binary that also uses a graph-based method. It works quite well, but only for L2 similarity and comes with the operational burden of running a rather large sidecar process completely disjoint from the JVM.

The approach I've taken is to support five similarity functions (L1, L2, Angular, Jaccard, Hamming), to support sparse and dense vectors, to implement everything inside the JVM with no sidecar processes and no changes to Lucene, and to use hashing-based search methods (i.e. LSH). IMO the last point is a clear advantage over graph-based methods, because the hashes are treated just like regular text tokens, which is clearly the optimal access pattern in ES/Lucene. Of course it will likely lose to a C++ implementation in raw speed because it's on the JVM, but IMO that matters less than making the plugin trivial to run and scale.
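For intuition, the hashing-as-tokens idea can be sketched like this (a rough sketch of random-hyperplane LSH, not elastiknn's actual code; all names are illustrative). Vectors pointing in similar directions share most sign bits, so they tend to collide on whole bands, and each band hash can be indexed and matched like an ordinary term:

    import numpy as np

    def lsh_tokens(v, planes, bands=8):
        """Random-hyperplane LSH: sign bits grouped into bands, emitted as text-like tokens."""
        bits = (planes @ v > 0).astype(int)      # one sign bit per random hyperplane
        per_band = len(bits) // bands
        return [
            "band%d_%s" % (b, "".join(map(str, bits[b * per_band:(b + 1) * per_band])))
            for b in range(bands)
        ]

    rng = np.random.default_rng(0)
    planes = rng.normal(size=(64, 300))          # 64 random hyperplanes for 300-d vectors
    print(lsh_tokens(rng.normal(size=300), planes))   # e.g. ['band0_01101001', ...]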

I don't think there's a definitively better approach yet. It's an interesting problem and it'll be interesting to see what ends up working well.


MPEG-7 has been working on this for decades. ffmpeg has an implementation you can use: https://ffmpeg.org/ffmpeg-all.html#signature-1 While that filter does pairwise comparisons, it should be possible to simply index the resulting fingerprints with Lucene and perform an inverted-index search; that should scale much better than the pairwise approach.


How would a reverse video search engine work? You give it a clip and it shows you which longer clips it is in?


I would imagine it using a Shazam-like algorithm, only for downsampled video and not just the audio.

A decade ago there were some hobby implementations of these algorithms that got cease and desist letters from Shazam, and that was all over HN.

https://m.youtube.com/watch?v=8Dj0rekeM7g


Yes, exactly as you described.


Could you take (embeddings of) frames from your query video and measure the pairwise similarity to frames from the reference videos, then rank by some summary function of that list of similarity scores?
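Concretely, something like this (a rough sketch; embed() is a stand-in for whatever image-embedding model you'd use, and max-then-mean is just one possible summary function):

    import numpy as np

    def video_similarity(query_frames, ref_frames, embed):
        """Score one reference video against a query clip via frame-embedding similarity."""
        Q = np.stack([embed(f) for f in query_frames])   # (num_query_frames, d), unit vectors
        R = np.stack([embed(f) for f in ref_frames])     # (num_ref_frames, d), unit vectors
        sims = Q @ R.T                                   # pairwise cosine similarities
        # Summary: best reference match per query frame, averaged over the clip.
        return sims.max(axis=1).mean()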

What are the bottlenecks besides performance?


FLANN was developed at UBC. You meant FAISS.


Yes, FAISS, not FLANN. My mistake. I can't correct it now because the edit window is gone.


This is really powerful; there are many profitable AI startups that could be built using this technology. I am almost mad that it's on the HN front page.


Like what? It doesn't seem fundamentally different from Annoy, which has existed for a while now.


The basic idea of product quantization is 10 years old now and there are many examples out there [0][1][2][etc]. This looks like a high quality library though.

[0] https://github.com/yahoo/lopq

[1] https://github.com/cgtuebingen/Product-Quantization-Tree

[2] https://github.com/technicolor-research/pq-fast-scan

[3] https://github.com/matsui528/rii


I thought that k-d trees were good for that problem. With some backtracking in the tree and constructing cutting planes, you can get exact nearest neighbor search.


kd-trees don't work well for high-dimensional vectors, so when working with embeddings, you typically have to use something else.


This hasn't been true for multiple decades at this point. SIFT was originally predicated on best-bin-first searching through kd-trees to find approximate nearest neighbors for vectors with over 4000 dimensions.

If you prioritize the bins you search by distance, you are more likely to find closer points sooner. If you are collecting a set number of closest points, you will reach a minimum distance for inclusion in that set sooner. If that minimum distance is shorter than the distance to a bin, you can skip it.
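Not FLANN itself, but SciPy's kd-tree exposes the same exact-vs-approximate trade-off through its eps parameter, which controls how aggressively branches get pruned during backtracking (a small sketch on random data):

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20_000, 128))    # 20k random vectors, 128-d
    tree = cKDTree(X)
    q = rng.normal(size=128)

    # Exact: backtracks until no unvisited cell could contain a closer point.
    exact = tree.query(q, k=10)[1]
    # Approximate: a branch is explored only if its closest possible point is
    # at least (1 + eps) times closer than the current k-th best distance.
    approx = tree.query(q, k=10, eps=1.0)[1]
    print(len(set(exact) & set(approx)), "of 10 neighbors agree")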


If you have heuristics fast enough to make kd-trees competitive, it would be interesting to have them added to http://ann-benchmarks.com/


What do you mean by "heuristics"? What I described is how FLANN works.

It is not only on those benchmarks but is one of the reasons they exist in the first place, since it instigated a lot of the research into approximate nearest neighbors.


I'm working on a project using kd-trees for high-dimensional embeddings. Could you send me some pointers?


You will have to be more specific.




