Jaccard Index (wikipedia.org)
252 points by dedalus on March 19, 2023 | 43 comments



I recently wrote a fun blog post (https://pncnmnp.github.io/blogs/odd-sketches.html) about how to estimate Jaccard Similarity using min hashing, what b-bit min hashing is, and how to improve upon its limitations using a 2014 data structure called odd sketches.
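For anyone who wants to play with the basic idea before reading the post, here is a minimal MinHash sketch (not the b-bit or odd-sketch part; salted SHA-1 stands in for a proper hash family, and the word sets are made up):

    import hashlib

    def minhash_signature(items, k=128):
        # one minimum per salted hash function
        return [
            min(int(hashlib.sha1(f"{salt}:{x}".encode()).hexdigest(), 16) for x in items)
            for salt in range(k)
        ]

    def estimate_jaccard(sig_a, sig_b):
        # fraction of hash functions whose minima collide
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    A = set("the quick brown fox jumps over the lazy dog".split())
    B = set("the quick brown cat naps under the lazy dog".split())
    print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))  # estimate, near the exact value below
    print(len(A & B) / len(A | B))                                       # exact: 5/11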

Jaccard Similarity's history is also quite interesting. From my blog:

> In the late 19th century, the United States and several European nations were focused on developing strategies for weather forecasting, particularly for storm warnings. In 1884, Sergeant John Finley of the U.S. Army Signal Corps conducted experiments aimed at creating a tornado forecasting program for 18 regions in the United States east of the Rockies. To the surprise of many, Finley claimed his programs were 95.6% to 98.6% accurate, with some areas even achieving a 100% accuracy rate. Upon publishing his findings, Finley's methods were criticized by contemporaries who pointed out flaws in his verification strategies and proposed their own solutions. This sparked a renewed interest in forecast verification, an episode now referred to as the "Finley Affair."

> One of these contemporaries was Grove Karl Gilbert. Just two months after Finley's publication, Gilbert pointed out that, under Finley's verification strategy, a 98.2% accuracy rate could be achieved simply by never forecasting a tornado. Gilbert then introduced an alternative strategy, which is now known as Jaccard Similarity.

> So why is it named Jaccard Similarity? As it turns out, nearly three decades after Sergeant John Finley's tornado forecasting program in the 1880s, Paul Jaccard independently developed the same concept while studying the distribution of alpine flora.


Recently I wanted to try and use this to filter out logs I don't care about, but it seemed a lot more involved than I initially thought.

I essentially wanted to use this as a way to flexibly filter out items without having to come up with a regex for every line item.

I wonder if anyone has done this before...


Could always rely on the Levenshtein distance. You have to be careful with similarity approaches, though, as you may end up filtering important messages because they are structurally similar to unimportant ones.


Maybe you could use a language model embedding to define some kind of semantic distance.


Just filter out logs by the file and line that generated them (i.e. the location of the log statement). Even if the actual log entry changes (e.g. because of a formatted str with vars), they will always have the same source.


Bayesian filters, like for email? You mark them as important or noise and over time the filter learns. These are extremely easy to put in place, and you don't have to pre-annotate since it learns as you go.


Yes Bayesian filters work well for this.

I had an idea Splunk had them built in? But it's about 5 lines of Python anyway.
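Not the parent's 5-liner, just a rough sketch of what such a naive Bayes log filter could look like (tokenized lines, add-one smoothing; the example log lines are made up):

    from collections import Counter
    import math

    class LogFilter:
        def __init__(self):
            self.counts = {"important": Counter(), "noise": Counter()}
            self.lines = {"important": 0, "noise": 0}

        def train(self, line, label):
            self.counts[label].update(line.lower().split())
            self.lines[label] += 1

        def score(self, line, label):
            # log prior + log likelihoods with add-one smoothing
            prior = math.log((self.lines[label] + 1) / (sum(self.lines.values()) + 2))
            vocab = len(set(self.counts["important"]) | set(self.counts["noise"]))
            denom = sum(self.counts[label].values()) + vocab + 1
            return prior + sum(math.log((self.counts[label][t] + 1) / denom)
                               for t in line.lower().split())

        def is_noise(self, line):
            return self.score(line, "noise") > self.score(line, "important")

    f = LogFilter()
    f.train("connection pool exhausted, retrying", "important")
    f.train("GET /healthz 200 OK", "noise")
    print(f.is_noise("GET /healthz 204 OK"))  # True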


I often feel modern tools should offer that


As an aside, if you find yourself having to compute them on the fly, know that the Roaring Bitmaps library is the way to go [1]. The bitmaps are compressed and can be streamed directly into SIMD computations (batching boolean transformations and popcnts 256 bits wide!). The Jaccard index is then just an intersection_len / union_len [2] away.

Of note: the author of that library is none other than Daniel Lemire [3], whose articles pop up quite often on HN

[1] https://roaringbitmap.org/

[2] https://roaringbitmap.readthedocs.io/en/latest/#roaringbitma...

[3] https://lemire.me/blog/
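For illustration, a minimal sketch assuming the Python roaringbitmap package and the intersection_len / union_len methods documented in [2]:

    from roaringbitmap import RoaringBitmap

    # e.g. document ids containing term A and term B (made-up data)
    a = RoaringBitmap(range(0, 1000))
    b = RoaringBitmap(range(500, 1500))

    # Jaccard index = |A ∩ B| / |A ∪ B|, without materializing the union
    print(a.intersection_len(b) / a.union_len(b))  # 500 / 1500 ≈ 0.33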


This is one of my favorite distance metrics* to show people!

For example, perhaps one person likes Reddit and HN, while someone else likes HN and SO.

Then their Jaccard Index would be 1/3, since they have one thing in common out of three.

* Technically it computes "similarity" (larger number == more similar), but `1 - Jaccard Index` is a distance (smaller number == more similar).
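In Python set terms, that example works out to:

    a = {"reddit", "hn"}
    b = {"hn", "stackoverflow"}
    similarity = len(a & b) / len(a | b)  # 1/3
    distance = 1 - similarity             # 2/3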


Very useful indeed. I used it to compute the similarity of patients with different diseases, e.g. if patient1 has 3 diseases and patient2 has 2 of those three, they should be treated as more similar than if patient1 has only 1 disease and patient2 has a different one. Euclidean distance would assign the same value in both cases.
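A tiny sketch of that kind of comparison, with made-up disease sets:

    patient1 = {"diabetes", "hypertension", "asthma"}
    patient2 = {"diabetes", "hypertension"}
    print(len(patient1 & patient2) / len(patient1 | patient2))  # 2/3

    patient3 = {"diabetes"}
    patient4 = {"asthma"}
    print(len(patient3 & patient4) / len(patient3 | patient4))  # 0.0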


The name may be an example of this: https://en.m.wikipedia.org/wiki/Stigler%27s_law_of_eponymy

It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v)[1] and now is frequently referred to as the Critical Success Index in meteorology.[2] It was later developed independently by Paul Jaccard…


In my field, cheminformatics, we refer to it as "Tanimoto similarity" because it was (quoting Wikipedia) "independently formulated again by T. Tanimoto."

It's an odd set of linkages to get there. First, "Dr. David J. Rogers of the New York Botanical Gardens" proposed a problem to Tanimoto, who published the writeup in an internal IBM report in 1958. (I understand there was a lot of mathematical research in taxonomy at the time.) In 1960 Rogers and Tanimoto published an updated version in Science.

In 1973 Adamson and Bush at Sheffield University developed a method for the automatic classification of chemical structures. They tried Dice, phi, and Sneath as their comparison methods but not Tanimoto. In their updated 1975 publication they write "Several coefficients have been proposed based on this criterion", with a list of citations, including the Rogers and Tanimoto paper as citation 14.

In 1986, Peter Willett at Sheffield revisits this work and finds that Tanimoto gives overall better results when applied to what are now called cheminformatics "fingerprints". He uses "Tanimoto", with no direct citation for the source of that definition.

This similarity method is easy to implement, and many organizations already have pre-computed fingerprints (they are used as pre-filters for graph queries), so the concept and nomenclature take off almost immediately, with "Tanimoto" as the preferred name.

It's not until 1991 that I can find a paper in my field referring to the earlier work by Jaccard (the paper uses "Tanimoto (Jaccard)").

I have found some papers in related fields (e.g., in IR and mass spectra analysis) which reference Tanimoto similarity, but nothing to the extent that my field uses it.


Another example of this sort of thing (that is vaguely related in that it's commonly used as a metric) is (what I call) the Matthews Correlation Coefficient https://en.wikipedia.org/wiki/Phi_coefficient

> In machine learning, it is known as the Matthews correlation coefficient (MCC) ... introduced by biochemist Brian W. Matthews in 1975.[1] Introduced by Karl Pearson,[2] and also known as the Yule phi coefficient from its introduction by Udny Yule in 1912


My colleague Nate wrote about using Jaccard for search engine regression testing: https://opensourceconnections.com/blog/2021/07/23/jaccard-in... If you make changes to a search algorithm and the order of results changes too much, it can alarm users, so this is a good way to reduce that risk.


I've used this to find similarity between terms in a specification document and identifiers in a program. The index is computed on the bag of bigrams from the two strings.

If one has a set of pairs that are similar, one can look for common bag differences in the matches. These can correspond to extra characters inserted in the identifier names for that particular program (for example, prefixes or suffixes related to module name or variable types.) Once these are found they can be used to tweak the similarity score.
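A minimal sketch of that bag-of-bigrams comparison (Counter as the multiset; the example strings are hypothetical):

    from collections import Counter

    def bigram_bag(s):
        s = s.lower()
        return Counter(s[i:i + 2] for i in range(len(s) - 1))

    def bag_jaccard(a, b):
        ba, bb = bigram_bag(a), bigram_bag(b)
        inter = sum((ba & bb).values())  # multiset intersection: min of counts
        union = sum((ba | bb).values())  # multiset union: max of counts
        return inter / union if union else 1.0

    print(bag_jaccard("maximum velocity", "mod_max_velocity_mm"))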


I've used this recently to do some fuzzy matching of column names in datasets. I also added it to a small Python one-liner library I've been making for practice. P.S. don't give me flack, I know this isn't an efficient way to do things.

jaccard = lambda A, B: len(set(A).intersection(set(B))) / len(set(A).union(set(B)))

https://github.com/b-mc2/MiniMath


One of the weaknesses with Jaccard similarity is how it focuses on matches/true positives. It neglects the importance of "negative space."

I was happy to see the Matthews correlation coefficient (MCC) used in the recent "1st and Future - Player Contact Detection" Kaggle competition. MCC balances the eight confusion matrix ratios, and I've gotten excellent results when using it in the past.
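For reference, MCC is computed from the four confusion-matrix counts; a minimal sketch (the counts in the example call are made up):

    import math

    def mcc(tp, fp, fn, tn):
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return (tp * tn - fp * fn) / denom if denom else 0.0

    print(mcc(tp=40, fp=10, fn=5, tn=945))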


It's not a weakness; it's a feature.

One that makes it the better choice in situations where negative space should in fact be ignored (comparing chest xrays is a typical example in medical imaging).


It's unclear to me why they should be ignored.


Consider two xrays and the output of an algorithm that tries to outline the lungs.

Suppose that one xray was taken in a larger machine, and therefore has more negative space around the lungs.

Also suppose that the algorithm delineated the lungs equally well in both cases (anatomically speaking).

If you assess performance using the Jaccard index, the metric is equal in both cases, as it should be, indicating equal performance of the algorithm w.r.t. the ground truth.

Whereas anything that takes accuracy of true negatives into account will necessarily give a higher performance in the xray from the larger machine, even if the person xrayed and the lung outline were identical.


I don't see how this is an improvement over MCC in this example, since in this case, unless I'm mistaken, MCC would hypothetically give the same (perfect) value to both x-rays, just as Jaccard would.


No, in this scenario the MCC will generally give a misleadingly higher value to the larger machine, since it takes into account the higher number of true negatives. That makes it a bad metric here.

Obviously, a 100% perfect segmentation would register as perfect in both, but in practice one rarely deals with such perfect predictions.

In general, there is no metric that is universally "better" in all scenarios. One is expected to choose the metric that best suits the particular goal one wishes to validate against.
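A toy 1-D illustration of that effect, with made-up masks: the same segmentation mistakes, scored with and without extra background padding.

    import math

    def confusion(pred, truth):
        tp = sum(p and t for p, t in zip(pred, truth))
        tn = sum(not p and not t for p, t in zip(pred, truth))
        fp = sum(p and not t for p, t in zip(pred, truth))
        fn = sum(not p and t for p, t in zip(pred, truth))
        return tp, tn, fp, fn

    def jaccard(pred, truth):
        tp, tn, fp, fn = confusion(pred, truth)
        return tp / (tp + fp + fn)

    def mcc(pred, truth):
        tp, tn, fp, fn = confusion(pred, truth)
        return (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # "lung" pixels in a small field of view
    pred  = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]   # same outline, one miss and one false alarm
    pad = [0] * 90                           # extra background from the larger machine

    print(jaccard(pred, truth), jaccard(pred + pad, truth + pad))  # 0.67 and 0.67
    print(mcc(pred, truth), mcc(pred + pad, truth + pad))          # 0.60 and 0.79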


Quoting myself from a while ago[0]

At reddit many moons ago, before machine learning was a buzzword, one early iteration of recommendations was based on Jaccard distance using the number of co-voters between subreddits. But with one twist: divide by the size of the smaller subreddit.

    relatedness a b =
        numerator   = | voters on(a) ∩ voters on(b) |
        denominator = | voters on(a) ∪ voters on(b) |
        weight      = min(|voters on(a)|, |voters on(b)|)
        numerator / (weight*denominator)
That gives you a directional relatedness, that is programming->python but not necessarily python->programming. Used this way you account for the giant subreddit problem[1] automatically but now the results are less “amitheasshole is related to askreddit” and more like “linguisticshumor is a more niche version of linguistics”.

The great thing is that it’s actually more actionable as far as recommendations go! Everybody has already heard of the bigger version of this subreddit, but they probably haven’t heard of the smaller versions. And it’s self-correcting: as a subreddit gets bigger we are less likely to recommend it, which is great because it needs our help less.

It's also easy to compute this because it lends itself to one giant SQL query that postgres or even sqlite[2] optimises reasonably well. It has some discontinuities around very tiny subreddits, so there was also a hacky heuristic to just exclude them. It does get fairly static, so once we've picked 3 subreddits to recommend when you're on subreddit A, if you don't like them we'll just keep showing them anyway. I had a hack in mind for that (use the computed values as random weights so we'll still occasionally show lower-scoring ones) but by this time people much smarter than I took over recommendations with more holistic solutions to the problem we were trying to solve in the first place. Still, as a first pass it worked great, and based on my experience I'd recommend simple approaches like this before you break out the linear algebra.

Side note, I tried co-commenters in addition to co-voters. The results tended to be more accurate in my spot tests, but the difference fell away in more proper cross-validation testing and I didn't look into where the qualitative difference was. But since there are more votes than comments on small subreddits, the number of recommendable subreddits was higher with votes. I reasoned that co-submitters (of posts) should be even more accurate, but that was thrown off by a small number of spammers and I didn't want to mess with combining those tasks at the time.

[0]: https://news.ycombinator.com/item?id=22178517

[1]: that votes are distributed according to a power law, meaning that everybody has voted on the largest subreddits so most clustering approaches recommend askreddit to everybody. That's okay for product recommendations where "you should buy the most popular CPU, it's most popular for a reason" but for subreddits you already know that so we want a way to bias to the most "surprising" of your votes.

[2]: I prototyped it on sqlite on my laptop and even with close to the production amount of data it ran reasonably well. Not fast, but fine. This was on considerably less traffic than today, mind.
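For what it's worth, a direct Python transcription of the pseudocode above (voters being a hypothetical dict from subreddit name to the set of user ids who voted there):

    def relatedness(voters, a, b):
        va, vb = voters[a], voters[b]
        numerator = len(va & vb)       # co-voters
        denominator = len(va | vb)     # voters on either subreddit
        weight = min(len(va), len(vb)) # size of the smaller subreddit
        return numerator / (weight * denominator)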


Super interesting! I'm currently writing my master's thesis on analyzing relationships between subreddits based on user and semantic similarity. For user similarity I use the Jaccard similarity between the unique set of authors of each subreddit.

Can I send you a message and quote you in my thesis? You can shoot me a short message as well: violets.parr-0c@icloud.com


Yeah sure, you can message me on reddit too. Same username


Ha, I came here to mention it and of course you’re already here explaining the background in far more detail than I could :)


The book Mining of Massive Datasets [1] has useful information on building an efficient similarity index using Jaccard/minhash. I would also recommend Otmar Ertl's papers on extensions of minhash that approximate Jaccard better in certain situations, e.g. superminhash [2].

[1] http://www.mmds.org/ (Chapter 3)

[2] https://arxiv.org/abs/1706.05698


I think a good example of the Jaccard Index is from the paper Lazy Prices[1] in which the authors use it to measure diffs in a company's annual and quarterly reports. They also use other similarity measures (cosine, minimum edit, simple) so you get a better understanding of how each applies to textual similarity as a whole.

[1] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1658471


I recently used Jaccard similarity as a measurement of distance between two sets of online articles. It’s amazing how versatile it is for all sorts of weird tasks.


I used to use Jaccard similarity combined with w-shingling at the character level to detect clusters of fraud sites. It was surprisingly effective, because it was able to pick up common patterns in the code even if they used completely different styles and text.

https://en.m.wikipedia.org/wiki/W-shingling
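A minimal sketch of that approach, with made-up page markup and character-level shingles of width 5:

    def shingles(text, w=5):
        # set of all character-level w-shingles in the text
        return {text[i:i + w] for i in range(len(text) - w + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    page1 = "<div class='promo'><a href='http://scam.example/offer'>claim now</a></div>"
    page2 = "<div class='promo'><a href='http://other.example/deal'>claim now</a></div>"
    print(jaccard(shingles(page1), shingles(page2)))  # high for near-duplicate templates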


Interesting - I also used Jaccard similarity to classify clusters of malicious ad traffic schemes. The idea worked well. It was unclear if the similarity was due to mimicry or authorship, but that did not matter for our use.


I just used this at work the other day to calculate similarities between different data models that had overlapping children models. One of our teams was going to go through manually to check these overlaps and consolidate, but by using this clustering algo based on Jaccard distance we were able to give them clusters to consolidate up front. Super cool stuff!


what is a children model? I'm curious but can't really follow what you wrote, can you add a bit more context?


Looks like a "reflection coefficient for sets."

Reflection coefficient in electrical or acoustic (or elastic) transmission across two media is the difference of their impedances over the sum of them.

Difference over sum is a pattern you see a lot.


I used this for a project re: similarity between two strings.

The Jaccard similarity between sets of uni- and bi-grams was a surprisingly effective metric.

DOG -> {d, o, g, do, og}

GOD -> {g, o, d, go, od}

intersection = {d, g, o}

union = {d, g, o, do, go, od, og}

J = 3 / 7 = ~43%
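The same computation as a quick sketch:

    def grams(s):
        s = s.lower()
        return set(s) | {s[i:i + 2] for i in range(len(s) - 1)}

    a, b = grams("DOG"), grams("GOD")
    print(len(a & b) / len(a | b))  # 3/7 ≈ 0.43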


This is well known to people working in cheminformatics as the Tanimoto similarity, used to calculate the similarity of two chemical structures based on their (folded) fingerprints.


Jaccard my Dice please.


Jests aside, I've mostly used the closely related Dice coefficient when measuring segmentation reliability.


They are effectively equivalent; there's a one-to-one mapping between the two.
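Concretely, the standard conversion between the two (for a non-empty union):

    def dice_from_jaccard(j):
        return 2 * j / (1 + j)

    def jaccard_from_dice(d):
        return d / (2 - d)

    print(dice_from_jaccard(1 / 3))  # 0.5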


What are the predicted bounding box and the ground truth bounding box, as related by a stop sign? I have no idea what's happening there.


The "predicted" box there would be a best guess from a statistical model powered by AI or computer vision, answering "where is the stop sign in this image?". The "ground truth" would be an annotation by a human answering the same question. The Jaccard similarity metric would say that these bounding boxes are highly similar, and so the prediction could be evaluated as high quality.
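For axis-aligned bounding boxes this is the familiar intersection-over-union (IoU); a minimal sketch with made-up coordinates:

    def iou(box_a, box_b):
        # boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        iw = max(0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union else 0.0

    predicted = (12, 10, 98, 102)        # model's guess at the stop sign
    ground_truth = (10, 10, 100, 100)    # human annotation
    print(iou(predicted, ground_truth))  # ~0.94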


didn't know there was a name for it



