HyperMinHash: Bringing intersections to HyperLogLog

stochastic_monk · on Nov 14, 2018

The subtitle is misleading, as hyperloglogs are excellent for intersections. HyperMinHash claims to have some asymptotically superior behavior by mixing concepts from HyperLogLogs and MinHash. However, I haven’t seen an optimized implementation and corresponding benchmarks, which makes me wonder how effective it really is.

yunwilliamyu · on Nov 14, 2018

Hi there! I'm the lead author on the paper that Seif (expertly) implements. You're right that my Python implementation isn't at all optimized, as I was more interested in the theory, though I think Seif's Go implementation does much better.

HyperLogLogs are normally used for intersections through inclusion-exclusion (though there are more sophisticated methods out there). As such, it is really good when the two sets A and B being intersected are about the same size and the intersection is large. It performs poorly when |A ^ B| << |A v B| because inclusion-exclusion means that a small error in the union size results in a large relative error in the intersection size.

HyperLogLog uses 6 bits per bucket in normal use. For our purposes, the rule of thumb we propose uses 16 bits per bucket. So we lose a factor of ~3 on the number of buckets. In exchange, we don't lose as much accuracy when |A ^ B| << |A v B| and also do much better for multi-set intersections (where inclusion-exclusion becomes unwieldy).

As an aside, there are a bunch of other really cool Jaccard index fingerprinting methods out there that are asymptotically better even than HyperMinHash. But, they aren't as easy to work with and lack some of the other nice properties of MinHash.

bruce_lipshitz · on Nov 14, 2018

> As an aside, there are a bunch of other really cool Jaccard index fingerprinting methods out there that are asymptotically better even than HyperMinHash. But, they aren't as easy to work with and lack some of the other nice properties of MinHash.

Would you be willing to list out these better methods and detail their pros / cons versus MinHash? It would be super helpful for those of us who are students in this field.

yunwilliamyu · on Nov 16, 2018

Things like b-bit MinHash [Li, Konig, 2010]. A good summary of the problem with asymptotics and some lower bounds is Pagh, Stockel, Woodruff, 2014, "Is Min-Wise Hashing Optimal for Summarizing Set Intersection." There's been more recent work, but that's a good place to start.

These other techniques have lower space complexity than MinHash (or even HyperMinHash) but have the con that they lose the streaming and composability properties of MinHash. One of the main advantages of a MinHash (or HLL) sketch is that you can combine sketch(A) with sketch(B) to get sketch(A v B). Alternately, the data structures can be progressively updated as you see a stream of items. The other fingerprinting techniques often require storing auxiliary data during the generation process.

stochastic_monk · on Nov 14, 2018

Thank you! I’ll probably further explore hmh. I have a draft implementation which I’ll compare to related structures soon. A key advantage that hyperminhash has is the ability to sample, if one is using a large enough sampled point and a reversible hash, which could be helpful for certain applications. (Since hll discards all input items)

In fact, I spend 8 bits per entry to facilitate SIMD acceleration, but I’m happy with the speed/memory tradeoff.

A lot of issues are eliminated by using Ertl’s MLE or JMLE formulas, but I do see how sets of combinatorial comparisons could be helped with hmh.

seiflotfy · on Nov 14, 2018

Author of the Library here: I can't fully agree that HyperLogLogs are good for intersections, that does not mean it can't be done, just there is always a high error rate. The easy way to do that is to maintain a separate MinHash sketch per HLL and use the Jaccard Index of several of those HLLs... an even more naive just use the Jaccard Index with the HLL registers but good luck getting good results out of that. This library is based on https://arxiv.org/pdf/1710.08436v4.pdf which has the benchmarks... Also I will be taking a stab at http://www.vldb.org/pvldb/vol11/p1097-nazi.pdf very soon :)

yunwilliamyu · on Nov 14, 2018

Should note here that my benchmarks are comparing HyperMinHash to MinHash. I didn't directly compare to HyperLogLog in the paper, though there exist other papers (mostly in the CS theory literature) that do compare MinHash to HyperLogLog in various intersection regimes.

stochastic_monk · on Nov 14, 2018

In my benchmarks, HyperLogLogs unilaterally outperform MinHash* by a constant factor, while Bloom Filters of the same size become more accurate than hyperloglogs in the 100-500kB range (though this number increases with the cardinalities of sets being compared). The biggest advantage a HyperLogLog has over MinHash and bloom filters is that it works well regardless of the sizes of the sets being compared, while MinHash is sensitive to differences in set size and bloom filters perform terribly in the small sketch domain. (These results aren't yet public, but they will be soon.)

* Part of this might be issues with the original Flajolet et al. estimation algorithm. Ertl provides several improved methods in https://arxiv.org/pdf/1702.01284.pdf.

pjungwir · on Nov 14, 2018

Neat! I can definitely imagine cases where this would be helpful.

Another thing I've wished for is being able to remove an element from an hll. I don't think intersection quite gets me there; it would need to be a separate operation.

Here is where I mentioned this once before, although that particular use case was achievable without needing an hll at all: https://news.ycombinator.com/item?id=14637903