Could you describe in a little more detail what you mean by this sentence:
> "... use HyperLogLogs to encode these MinHash structs gaining a great deal of speed with a marginal error rate, all while allowing for arbitrary N-levels of intersection"
You could use a plain array intersection when the sets are small, but if you want to do this efficiently at scale you need to take advantage of probabilistic data structures.
Now HyperLogLogs are cool, because you can merge them together w/o losing anything. You can't retrieve the elements back out, but you can efficiently estimate how many distinct values are inside, with a small, bounded error rate.
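A minimal sketch of those two properties (merge and cardinality estimation) in Python. This is illustrative only, not a production HLL; the register count (`p=10`) and SHA-1 hash are arbitrary choices here:

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch (illustrative, not production-grade)."""

    def __init__(self, p=10):
        self.p = p                 # number of index bits
        self.m = 1 << p            # number of registers
        self.registers = [0] * self.m

    def _hash(self, value):
        # 64-bit hash of the value
        return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

    def add(self, value):
        h = self._hash(value)
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Lossless union: register-wise max."""
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # small-range correction (linear counting)
            est = self.m * math.log(self.m / zeros)
        return est
```

The key point is that `merge` is exact: the merged registers are identical to what you'd get by building one HLL over the union directly, which is why unions compose freely.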
You might first try simple combinatorics (à la Jaccard: similarity ≈ |A ∩ B| / |A ∪ B|, estimating the intersection by inclusion-exclusion as |A| + |B| − |A ∪ B|), however this can get hairy depending upon the representation of the HyperLogLog (sparse/dense) and its respective cardinality.
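With exact sets the arithmetic looks like this (illustrative only; with HLLs each cardinality below is a noisy estimate, and subtracting two noisy estimates is where the error blows up when the intersection is small):

```python
# Exact sets, just to show the inclusion-exclusion step itself.
A = set(range(0, 1000))
B = set(range(800, 1800))

union = len(A | B)                       # 1800
intersection = len(A) + len(B) - union   # inclusion-exclusion: 200
jaccard = intersection / union           # 200 / 1800
```

Note that each deeper level of intersection repeats this subtraction on already-noisy quantities, which is why the error compounds.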
Eventually you realize you can't accept an error rate that compounds with every level of intersection, but you still need the raw efficiency, and so you sacrifice by roughly doubling your storage cost.
Now instead of just the initial sets, you'd also build a parallel MinHash struct:
I won't go into excessive detail about the MinHash algorithm itself, but essentially it provides a way to sample a large set of values by selecting/retaining the smallest hash outputs. What that means is, you can then intersect the MinHash signatures as many times over as you like, and extract a predictably accurate similarity index in linear time, with definable confidence bounds.
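One common form of that idea is bottom-k MinHash; here's a rough sketch (the value `k=128` and the SHA-1 hash are my assumptions for illustration, not anything specified above):

```python
import hashlib

def _h(value):
    """Map a value to a 64-bit integer hash."""
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

def minhash(values, k=128):
    """Bottom-k MinHash: retain the k smallest hash outputs as a fixed-size sample."""
    return set(sorted({_h(v) for v in values})[:k])

def similarity(sig_a, sig_b, k=128):
    """Estimate the Jaccard similarity of the original sets from their signatures."""
    bottom_union = set(sorted(sig_a | sig_b)[:k])  # bottom-k of the union
    # A hash in the union's bottom-k came from the intersection
    # iff it appears in both signatures.
    return len(bottom_union & sig_a & sig_b) / len(bottom_union)
```

The signatures stay a fixed size regardless of the underlying set sizes, and because the estimator only looks at the bottom-k of the combined signatures, you can chain it across as many sets as you like, which is the "arbitrary N-levels of intersection" part.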
I don't have a background in ML (I'm still somewhat early in my CS Bachelor's degree, and I'm not focusing on ML anyway), but this is very interesting, even though I had to look up nearly every term you used.
I was initially thinking a directed weighted graph might work well here, but I'm assuming that would scale terribly relative to something like this.
Thanks