Introduction to Locality-Sensitive Hashing (2018)

tylerneylon · on Dec 23, 2022

Hi, I'm the author of this article. I'm grateful for the positive comments! I plan to explore FALCONN (pointed to by gajjanag) and the beautiful article of Bartosz Ciechanowski pointed out by lgvld.

My original motivation for writing this was to explain (in a follow-up article) an ANN algorithm I published in 2010: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973075.94

I never wrote that up (as a non-academic article), but the fact that people like this article is a nudge for me to do that.

The above paper was the first LSH algorithm which deterministically returns all close-enough points and guarantees no false positives for distant-enough points. I've heard there are also other algorithms with that property published since. If I have time, I would like to one day see how that algorithm compares against the others in Erik Bernhardsson's benchmarks.

ctxc · on Dec 23, 2022

Forgive my lack of knowledge, but I was lost the instant I found the first equation :(

H is the hash function and x are coordinates. But what is R Z?

tylerneylon · on Dec 23, 2022

R is the set of real numbers.

Z is the set of integers {.., -2, -1, 0, 1, 2, ...}.

Z is called "Z" due to influential German mathematicians who chose the letter to stand for "zahlen" ("number" in German).

As a fun fact, the letter "N" stands for "natural numbers," meaning either positive integers, or positive integers and zero. However, that letter (N) is _not_ standardized because mathematicians disagree about whether or not to include zero. When I separately told a couple of my professors this in grad school, they didn't believe me — one of them thinking it was crazy for zero to be excluded, the other thinking it was crazy to include it!

https://en.wikipedia.org/wiki/Natural_number

ctxc · on Dec 24, 2022

Thank you!

_a_a_a_ · on Dec 24, 2022

@author a question if you don't mind. I saw this article and it was very interesting but I get the impression (or heard somewhere) that LSH is actually more suitable to much higher dimensional problems than 2D. Put another way, I think LSH is perhaps inappropriate for a two-dimensional case like this example where you could solve it more simply, but that would not scale to higher dimensions easily or at all, which is where LSH would work. Is this clear, and am I correct in my understanding?

Thanks

mhuffman · on Dec 24, 2022

This is a really well-done article! Very approachable for a non-math-expert. If it is useful at all, I noticed while reading that there are two misspellings. "moives" for "moves" and "tht" for "that". I am looking forward to more of your articles.

fithisux · on Dec 24, 2022

Is there any blog where I can learn more on this fascinating area? Especially new discoveries?

I've read the part in the book of "Mining Massive Datasets and lots of old notes about "The Probabilistic Method".

But new things?

gajjanag · on Dec 23, 2022

By the way, https://github.com/FALCONN-LIB/FALCONN contains a really good LSH implementation. Also see https://www.mit.edu/~andoni/LSH/ if you want to know more about the research literature.

The fastest way for Euclidean space that I know that works well in practice is via Leech lattice decoding: https://www.semanticscholar.org/paper/Maximum-likelihood-dec... , or https://www.semanticscholar.org/paper/Efficient-bounded-dist... .

It is possible to create an implementation based on the above that decodes 24 dimensional points to the closest Leech lattice vector in < 1 microsecond per point on my AMD Ryzen laptop. Combine with some fast random projections/Johnson Lindenstrauss as described in the article to form the LSH.

This LSH family is unfortunately not present in FALCONN, but the alternatives in FALCONN are pretty good.

Source: thought extensively about LSH for my PhD.

sendfoods · on Dec 24, 2022

FALCONN doesn't look maintained, unfortunately

lgvld · on Dec 23, 2022

Thank you for sharing! Very interesting. I love visual explanations [1]. If I remember correctly, LSH can be used in computer graphics, by instance when computing fluids simulations (where _a lot_ of particles are involved).

The author also wrote an article in which he illustrates RNN in an original and stunning way: https://tylerneylon.com/a/randnn/

[1]: Not related, but you might also enjoy the work of Bartosz Ciechanowski: https://ciechanow.ski/

eruci · on Dec 23, 2022

I currently use a modified hilbert curve implementation for nn searches, will check this out to see how it compares.

my use case is here: https://geocode.xyz/2451481561372 -> https://geocode.xyz/45.17310,-75.62670 / https://geocode.xyz/2451481561373 -> https://geocode.xyz/45.17300,-75.62670

glass_of_water · on Dec 23, 2022

How do you deal with false negatives when using Hilbert curves for nearest neighbor searches? Based on my limited understanding, points close on space can be far from each other on the curve. Do you use multiple curves or something?

eruci · on Dec 24, 2022

I use two curves.

tipsytoad · on Dec 23, 2022

Fantastic article! How does this compare to ANN for the use case?

https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...

gpderetta · on Dec 23, 2022

Reminds me of the random projections dimensionality reduction schema which is often used with LSH.

In fact the described forest of trees schema can probably be interpreted as an LSH.

Disclaimer: I haven't touched this stuff for more than 10 years. Don't know what's the state of the art now.

uniqueuid · on Dec 23, 2022

I agree that random hyperplanes are underpinning both, but as far as I recall, usually only one set of random planes is used in LSH (i.e. one set of trees). The granularity of approximate proximity rests in the number of identical signs, i.e. being on the same side of the plane.

There is another good and more technical explanation (using bands) in chapter 2 of mining massive datasets by Leskovec, Rajaraman and Ullman.

gpderetta · on Dec 23, 2022

As I said, it has been a long time!

I have dim memories of using random k basis vectors to convert high dimensionality feature vectors to k dimensions, and doing m times to generate multiple projections as part of a an LSH schema. Min-hashing might have been involved.

senderista · on Dec 23, 2022

IIRC, minhashing is used to approximate Jacquard similarity (a set-theoretic measure), while random hyperplanes (aka simhashing) is used to approximate cosine similarity (a geometric/algebraic measure). So they solve different problems, even though some problems can be cast in terms of either framework.

bitL · on Dec 24, 2022

Reformer NLP model uses LSH to reduce the size of the transformer/attention layer while keeping the same accuracy, being able to model much longer context lengths in NLP or do image/video generation:

https://ai.googleblog.com/2020/01/reformer-efficient-transfo...

dang · on Dec 23, 2022

Discussed at the time:

Introduction to Locality-Sensitive Hashing - https://news.ycombinator.com/item?id=27614381 - June 2021 (63 comments)

gumby · on Dec 23, 2022

Interesting concept! Until I read it I thought locality was exactly what I would not want in a hash function.

I love it when I read something that upends my assumptions and opens up new possibilities.

AR_1 · on Dec 23, 2022

A well written article and the problem of efficient peer discovery in a large network of nodes will become even more pronounced in the near future.

Another design using proximity based hashing is to use a particle swarm optimization approach [1]

[1]

Proximity-based Networking: Small world overlays optimized with particle swarm optimization

https://arxiv.org/pdf/2006.02006.pdf

abalaji · on Dec 23, 2022

This is a fantastic article! Though, for the first half, I do think folks would benefit from having a rough understanding of the nuances of hash maps, if I may plug my own rough notes [1]

[1] https://github.com/adithyabsk/datastructures/blob/main/docs/...

flashfaffe2 · on Dec 24, 2022

Maybe this is off topic but how this algorithm adapt itself to time series?

I mean you want to cluster data which might be autocorrelated... Does this case is taken into account or do you need to preprocess your data beforehand?

foobar502 · on Dec 23, 2022

Read the article and googled a bit - what are more example use cases for LSH behind those described?

tylerneylon · on Dec 23, 2022

The use case I see the most in my career is to use LSH to help solve the "ANN" problem = approximate nearest neighbors (typically with ranked results). I've seen ANN used many times for near-duplicate detection and in recommendation systems.

Although I don't have access to the proprietary code used, it's most likely that an LSH algorithm is behind the scenes in every modern search engine (to avoid serving duplicates), many modern ranking systems such as Elasticsearch (because items are typically vectorized and retrieved based on these vectors), and most recommendation systems (for similar reasons as ranking). For example, all of these pages probably have an LSH algorithm at some point (either batch processing before your request, or in some cases real-time lookups):

* Every search result page on Google * Every product page on Amazon (similar products) * All music suggestions on Spotify or similar * Every video recommendation from TikTok, YouTube, or Instagram

etc.

molodec · on Dec 23, 2022

Another interesting use case for LSH is search results caching. Used by Amazon https://www.linkedin.com/feed/update/urn:li:activity:6943348... https://www.amazon.science/blog/more-efficient-caching-for-p...

senderista · on Dec 23, 2022

Yes, e.g. many IR systems use cosine similarity to compute query vector/term vector similarity, and simhashing approximates cosine similarity. OTOH, some IR systems instead use a set-theoretic measure, Jacquard similarity, which can be approximated by minhashing.

madiator · on Dec 23, 2022

Here's an example I can think of. Suppose you have a bunch of text documents, and you know that some documents are similar but not identical (e.g. plagiarized and slightly modified). You want to find out which documents are similar.

You can first run the contents through some sort of embedding model (e.g. the recent OpenAI embedding model [1]), and then apply LSH on those embeddings. The documents that have the same LSH value would have had very similar embeddings, and thus very similar content.

[1] https://beta.openai.com/docs/guides/embeddings

zamalek · on Dec 23, 2022

Collision detection in games. This problem is O(n^2) because you have to check every object against every other object.

You can almost only check objects that inhabit the same buckets (there are caveats, usually neighboring buckets are also checked), eliminating objects that couldn't possibly collide by virtue of e.g. being on the other side of the map. Of course this is still O(n^2) because every object could be in the same bucket (but that's unlikely).

nl · on Dec 24, 2022

Face matching. You use embedding created by DeepFace of MediaPipe and put them in a LSH and then the same face ends up close to each other.

magic123_ · on Dec 24, 2022

In Stuff Made Here's most recent video, he uses LSH to solve a 4000 pieces all-white jigsaw puzzle: https://www.youtube.com/watch?v=WsPHBD5NsS0

ww520 · on Dec 24, 2022

Geohash is one use case. Latitude and longitude address is converted to a geohash which has the nice property of nested rectangles have the same hash prefix.

zbird · on Dec 23, 2022

I like meself a well-written and formatted article. Thank you.

twelve40 · on Dec 24, 2022

I ran into lunatics who were asking things like these at a programming job interview at a prominent company. There should be a moral to this but I don't know what it is.

deepsun · on Dec 26, 2022

Sometimes when we search for an expert in a particular field, not just a general programmer, we ask increasingly deep questions, to estimate how deep is the knowledge. It's expected that at some point candidate would not know the answer. The goal is to hire a person that is more experienced than we are. It doesn't mean we knew it ourselves.