Introduction to Locality-Sensitive Hashing (2018) (tylerneylon.com)
237 points by signa11 on Dec 23, 2022 | 36 comments



Hi, I'm the author of this article. I'm grateful for the positive comments! I plan to explore FALCONN (pointed to by gajjanag) and the beautiful article of Bartosz Ciechanowski pointed out by lgvld.

My original motivation for writing this was to explain (in a follow-up article) an ANN algorithm I published in 2010: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973075.94

I never wrote that up (as a non-academic article), but the fact that people like this article is a nudge for me to do that.

The above paper gave the first LSH algorithm that deterministically returns all close-enough points and guarantees no false positives for distant-enough points. I've heard other algorithms with that property have been published since. If I have time, I would like to one day see how that algorithm compares against the others in Erik Bernhardsson's benchmarks.


Forgive my lack of knowledge, but I was lost the instant I found the first equation :(

H is the hash function and x is a point's coordinates. But what are R and Z?


R is the set of real numbers.

Z is the set of integers {.., -2, -1, 0, 1, 2, ...}.

Z is called "Z" due to influential German mathematicians who chose the letter to stand for "zahlen" ("number" in German).

As a fun fact, the letter "N" stands for "natural numbers," meaning either positive integers, or positive integers and zero. However, that letter (N) is _not_ standardized because mathematicians disagree about whether or not to include zero. When I separately told a couple of my professors this in grad school, they didn't believe me — one of them thinking it was crazy for zero to be excluded, the other thinking it was crazy to include it!

https://en.wikipedia.org/wiki/Natural_number


Thank you!


@author a question if you don't mind. I saw this article and it was very interesting but I get the impression (or heard somewhere) that LSH is actually more suitable to much higher dimensional problems than 2D. Put another way, I think LSH is perhaps inappropriate for a two-dimensional case like this example where you could solve it more simply, but that would not scale to higher dimensions easily or at all, which is where LSH would work. Is this clear, and am I correct in my understanding?

Thanks


This is a really well-done article! Very approachable for a non-math-expert. If it is useful at all, I noticed while reading that there are two misspellings: "moives" for "moves" and "tht" for "that". I am looking forward to more of your articles.


Is there any blog where I can learn more about this fascinating area? Especially new discoveries?

I've read the relevant part of the book "Mining Massive Datasets" and lots of old notes about "The Probabilistic Method".

But new things?


By the way, https://github.com/FALCONN-LIB/FALCONN contains a really good LSH implementation. Also see https://www.mit.edu/~andoni/LSH/ if you want to know more about the research literature.

The fastest approach for Euclidean space that I know of that works well in practice is via Leech lattice decoding: https://www.semanticscholar.org/paper/Maximum-likelihood-dec... , or https://www.semanticscholar.org/paper/Efficient-bounded-dist... .

It is possible to create an implementation based on the above that decodes 24-dimensional points to the closest Leech lattice vector in < 1 microsecond per point on my AMD Ryzen laptop. Combine this with some fast random projections/Johnson-Lindenstrauss as described in the article to form the LSH.

This LSH family is unfortunately not present in FALCONN, but the alternatives in FALCONN are pretty good.

Source: thought extensively about LSH for my PhD.


FALCONN doesn't look maintained, unfortunately


Thank you for sharing! Very interesting. I love visual explanations [1]. If I remember correctly, LSH can be used in computer graphics, for instance when computing fluid simulations (where _a lot_ of particles are involved).

The author also wrote an article in which he illustrates RNNs in an original and stunning way: https://tylerneylon.com/a/randnn/

[1]: Not related, but you might also enjoy the work of Bartosz Ciechanowski: https://ciechanow.ski/


I currently use a modified Hilbert curve implementation for NN searches; I'll check this out to see how it compares.

my use case is here: https://geocode.xyz/2451481561372 -> https://geocode.xyz/45.17310,-75.62670 / https://geocode.xyz/2451481561373 -> https://geocode.xyz/45.17300,-75.62670


How do you deal with false negatives when using Hilbert curves for nearest neighbor searches? Based on my limited understanding, points close in space can be far from each other on the curve. Do you use multiple curves or something?


I use two curves.


Fantastic article! How does this compare to ANN for the use case?

https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...


Reminds me of the random-projections dimensionality reduction scheme, which is often used with LSH.

In fact the forest-of-trees scheme described there can probably be interpreted as an LSH.

Disclaimer: I haven't touched this stuff for more than 10 years and don't know what the state of the art is now.


I agree that random hyperplanes underpin both, but as far as I recall, usually only one set of random planes is used in LSH (i.e. one set of trees). The granularity of approximate proximity rests on the number of identical signs, i.e. how many planes two points lie on the same side of.
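In rough Python (the dimension and number of planes below are made-up values, not anything from the article), the sign-based signature looks something like this:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_planes = 128, 16                      # made-up dimension and plane count
    planes = rng.standard_normal((n_planes, d))

    def signature(x):
        """Which side of each random hyperplane the point x falls on."""
        return planes @ x > 0                  # boolean vector of length n_planes

    def cosine_proximity_estimate(x, y):
        """Fraction of identical signs; higher means a smaller angle between x and y."""
        return np.mean(signature(x) == signature(y))

Two points that agree on most signs are, with high probability, at a small angle to each other, which is the granularity mentioned above.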

There is another good and more technical explanation (using bands) in chapter 2 of Mining Massive Datasets by Leskovec, Rajaraman, and Ullman.


As I said, it has been a long time!

I have dim memories of using k random basis vectors to convert high-dimensional feature vectors to k dimensions, and doing this m times to generate multiple projections as part of an LSH scheme. Min-hashing might have been involved.
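If it helps, a bare-bones version of that projection step might look like the sketch below (all the sizes are placeholders, and this is only my reconstruction of the scheme, not anyone's actual code):

    import numpy as np

    rng = np.random.default_rng(1)
    D, k, m = 10_000, 64, 8   # original dimension, reduced dimension, number of projections

    # m independent random projection matrices; the 1/sqrt(k) scaling roughly
    # preserves Euclidean distances (Johnson-Lindenstrauss style)
    projections = [rng.standard_normal((k, D)) / np.sqrt(k) for _ in range(m)]

    def project_all(x):
        """Return m different k-dimensional views of the high-dimensional vector x."""
        return [P @ x for P in projections]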


IIRC, minhashing is used to approximate Jaccard similarity (a set-theoretic measure), while random hyperplanes (aka simhashing) are used to approximate cosine similarity (a geometric/algebraic measure). So they solve different problems, even though some problems can be cast in terms of either framework.
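For the set-theoretic side, a minimal minhash sketch (the number of hash functions and the salted use of Python's built-in hash are arbitrary choices here):

    import random

    random.seed(0)
    NUM_HASHES = 100
    salts = [random.getrandbits(64) for _ in range(NUM_HASHES)]  # one salt per hash function

    def minhash(items):
        """Signature of a set: the minimum salted hash over the set, per hash function."""
        return [min(hash((salt, it)) for it in items) for salt in salts]

    def jaccard_estimate(sig_a, sig_b):
        """Fraction of matching signature entries estimates |A ∩ B| / |A ∪ B|."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES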


The Reformer NLP model uses LSH to reduce the cost of the transformer's attention layer while keeping roughly the same accuracy, making it possible to model much longer context lengths in NLP or do image/video generation:

https://ai.googleblog.com/2020/01/reformer-efficient-transfo...
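Very loosely, and greatly simplified compared to the real Reformer (which shares queries and keys, hashes in several rounds, and chunks the buckets), the idea is to compute attention only within LSH buckets instead of across the whole sequence. A toy numpy sketch of just that idea:

    import numpy as np

    rng = np.random.default_rng(0)

    def lsh_buckets(x, n_planes=4):
        """Hash each row of x to a bucket id using random hyperplane signs."""
        planes = rng.standard_normal((x.shape[1], n_planes))
        bits = (x @ planes > 0).astype(int)
        return bits @ (1 << np.arange(n_planes))     # pack the sign bits into an integer

    def bucketed_attention(q, k, v):
        """Softmax attention computed only among positions that share a bucket."""
        out = np.zeros_like(v)
        buckets = lsh_buckets(q)
        for b in np.unique(buckets):
            idx = np.where(buckets == b)[0]
            scores = q[idx] @ k[idx].T / np.sqrt(q.shape[1])
            weights = np.exp(scores - scores.max(axis=1, keepdims=True))
            weights /= weights.sum(axis=1, keepdims=True)
            out[idx] = weights @ v[idx]
        return out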


Discussed at the time:

Introduction to Locality-Sensitive Hashing - https://news.ycombinator.com/item?id=27614381 - June 2021 (63 comments)


Interesting concept! Until I read it I thought locality was exactly what I would not want in a hash function.

I love it when I read something that upends my assumptions and opens up new possibilities.


A well-written article. The problem of efficient peer discovery in a large network of nodes will only become more pronounced in the near future.

Another design using proximity-based hashing is a particle swarm optimization approach [1]:

[1] Proximity-based Networking: Small world overlays optimized with particle swarm optimization, https://arxiv.org/pdf/2006.02006.pdf


This is a fantastic article! Though, for the first half, I do think folks would benefit from having a rough understanding of the nuances of hash maps, if I may plug my own rough notes [1]

[1] https://github.com/adithyabsk/datastructures/blob/main/docs/...


Maybe this is off topic, but how does this algorithm adapt to time series?

I mean, you want to cluster data which might be autocorrelated... Is this case taken into account, or do you need to preprocess your data beforehand?


Read the article and googled a bit - what are some more example use cases for LSH beyond those described?


The use case I see the most in my career is to use LSH to help solve the "ANN" problem = approximate nearest neighbors (typically with ranked results). I've seen ANN used many times for near-duplicate detection and in recommendation systems.

Although I don't have access to the proprietary code used, it's most likely that an LSH algorithm is behind the scenes in every modern search engine (to avoid serving duplicates), many modern ranking systems such as Elasticsearch (because items are typically vectorized and retrieved based on these vectors), and most recommendation systems (for similar reasons as ranking). For example, all of these pages probably have an LSH algorithm at some point (either batch processing before your request, or in some cases real-time lookups):

* Every search result page on Google
* Every product page on Amazon (similar products)
* All music suggestions on Spotify or similar
* Every video recommendation from TikTok, YouTube, or Instagram

etc.



Yes, e.g. many IR systems use cosine similarity to compute query vector/term vector similarity, and simhashing approximates cosine similarity. OTOH, some IR systems instead use a set-theoretic measure, Jaccard similarity, which can be approximated by minhashing.


Here's an example I can think of. Suppose you have a bunch of text documents, and you know that some documents are similar but not identical (e.g. plagiarized and slightly modified). You want to find out which documents are similar.

You can first run the contents through some sort of embedding model (e.g. the recent OpenAI embedding model [1]), and then apply LSH on those embeddings. The documents that have the same LSH value would have had very similar embeddings, and thus very similar content.

[1] https://beta.openai.com/docs/guides/embeddings
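A sketch of that pipeline (embed() below is just a stand-in for whatever embedding model you call, and the sizes are made up), where documents whose sign signatures collide become near-duplicate candidates:

    from collections import defaultdict
    import numpy as np

    rng = np.random.default_rng(0)
    DIM, N_PLANES = 1536, 12                       # made-up embedding size and plane count
    planes = rng.standard_normal((N_PLANES, DIM))

    def embed(text):
        """Stand-in so the sketch runs; replace with a real embedding model."""
        local = np.random.default_rng(abs(hash(text)) % 2**32)
        return local.standard_normal(DIM)

    def lsh_key(vector):
        """Tuple of hyperplane signs; very similar embeddings tend to get the same key."""
        return tuple(planes @ vector > 0)

    def candidate_duplicates(documents):
        buckets = defaultdict(list)
        for doc in documents:
            buckets[lsh_key(embed(doc))].append(doc)
        # only documents sharing a bucket need a full similarity check
        return [group for group in buckets.values() if len(group) > 1]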


Collision detection in games. This problem is O(n^2) because you have to check every object against every other object.

With spatial hashing you can get away with checking (almost) only objects that inhabit the same buckets (there are caveats; usually neighboring buckets are also checked), eliminating objects that couldn't possibly collide by virtue of e.g. being on the other side of the map. Of course the worst case is still O(n^2), because every object could land in the same bucket (but that's unlikely).
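For instance, a 2D spatial-hash sketch (the cell size and the neighboring-cell scan are the usual knobs; this isn't from any particular engine):

    from collections import defaultdict
    from itertools import combinations, product

    CELL = 32.0   # bucket size; roughly the diameter of a typical object

    def bucket_of(x, y):
        return (int(x // CELL), int(y // CELL))

    def candidate_pairs(objects):
        """objects: list of (id, x, y). Returns pairs worth a real collision test."""
        grid = defaultdict(list)
        for obj_id, x, y in objects:
            grid[bucket_of(x, y)].append(obj_id)
        pairs = set()
        for (cx, cy), ids in grid.items():
            # scan the cell and its neighbors, so objects straddling a boundary aren't missed
            nearby = []
            for dx, dy in product((-1, 0, 1), repeat=2):
                nearby.extend(grid.get((cx + dx, cy + dy), ()))
            for a, b in combinations(set(nearby), 2):
                pairs.add((min(a, b), max(a, b)))
        return pairs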


Face matching. You use embeddings created by DeepFace or MediaPipe, put them in an LSH index, and then images of the same face end up close to each other.


In Stuff Made Here's most recent video, he uses LSH to solve a 4000-piece all-white jigsaw puzzle: https://www.youtube.com/watch?v=WsPHBD5NsS0


Geohash is one use case. A latitude/longitude pair is converted to a geohash, which has the nice property that nested rectangles share the same hash prefix.
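A stripped-down illustration of why the prefixes nest (this interleaves raw bits instead of producing the standard base-32 geohash characters):

    def geohash_bits(lat, lon, bits=20):
        """Interleave longitude/latitude bisection bits; nearby points share long prefixes."""
        lat_lo, lat_hi = -90.0, 90.0
        lon_lo, lon_hi = -180.0, 180.0
        out = []
        for i in range(bits):
            if i % 2 == 0:   # even bits refine longitude, odd bits refine latitude
                mid = (lon_lo + lon_hi) / 2
                out.append("1" if lon >= mid else "0")
                lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                out.append("1" if lat >= mid else "0")
                lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        return "".join(out)

    # two nearby points produce hashes that share a long common prefix
    print(geohash_bits(48.8584, 2.2945))
    print(geohash_bits(48.8606, 2.2976))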


I like meself a well-written and formatted article. Thank you.


I ran into lunatics who were asking things like these at a programming job interview at a prominent company. There should be a moral to this but I don't know what it is.


Sometimes when we search for an expert in a particular field, not just a general programmer, we ask increasingly deep questions to estimate how deep the knowledge goes. It's expected that at some point the candidate won't know the answer. The goal is to hire a person who is more experienced than we are. It doesn't mean we knew the answers ourselves.



