Could you describe in a little more detail what you mean by this sentence: > ".....

j42 · on June 15, 2015

Sure thing.

Let's say we have two sets, and we're trying to find out how similar they are:

    setOne = ['the','brown','fox','jumped','a','log']
    setTwo = ['the','quick','brown','log','jumped','over','the','fox']

You could use an array intersection when its small, but if you want to do this efficiently at scale you need to take advantage of probabilistic data structures.

Let's say you created a HyperLogLog for each set:

    setOne = (bitfield representing setOne)
    setTwo = (bitfield representing setTwo)

Now HyperLogLogs are cool, because you can merge them together w/o losing anything. You can't retrieve the data, but you can efficiently check if a value exists inside. You can have a false positive, but never a false negative.

You might first try simple combinatorics (a la, similarity ~= |A or B| / |A ∪ B|) however this can get hairy depending upon the representation of the HyperLogLog (sparse/dense) and its respective cardinality.

Eventually you realize you can't accept an exponentially-compounding error rate, but still need the raw efficiency, and thus you sacrifice by doubling your storage cost.

Now instead of just the initial sets, you'd also build a parallel MinHash struct:

    setOne_minhashes = [<int>, <int>, <int>]
    setTwo_minhashes = [<int>, <int>, <int>]

    setOne = (bitfield representing setOne)
    setTwo = (bitfield representing setTwo)
    setOneMinHash = (bitfield representation of setOne minhashes)
    setTwoMinHash = (bitfield representation of setTwo minhashes)

I won't go into excessive detail about the minhash algorithm itself, but essentially it provides a way to sample a large set of values by selecting/retaining the smallest output hashes. What that means is, you can then intersect the minhash bitfields as many times over as you like, and extract a predictably accurate similarity index in linear time with definable confidence bounds.

vixsomnis · on June 15, 2015

I don't have a background in ML (I'm still somewhat early in my CS Bachelor's degree, and I'm not focusing on ML anyway), but this is very interesting, regardless that I had to look up nearly every term you used.

I was initially thinking a directed weighted graph might work well here, but I'm assuming that would scale terribly relative to something like this.