
They are between about 5 and 32 characters. I have converted them to bitstrings (one-hot encoded). They are grouped by length, so within each group they are fixed-length. If you have more information on the algorithm you describe, I would love to hear it. There are not many duplicates in a 1e9 sampling. For what it's worth, this is a protein library with huge diversity.



Swamidass and Baldi - http://www.igb.uci.edu/~pfbaldi/download/download.inc.php?pi... . With a followup in Nasr, Hirschberg and Baldi - Hashing Algorithms and Data Structures for Rapid Searches of Fingerprint Vectors http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2926297/ .

Both emphasize the Jaccard-Tanimoto similarity, which is a bit more complicated than the Hamming distance. The ideas are easily ported to a Hamming search.

The first paper uses the pre-computed popcount, like I mentioned. I don't recall if it also suggests sorting by popcount, but that's what I do. If both {N} and {M} are pre-sorted then you'll improve cache coherency, because the queries for |n|=i all search the same subrange of i-2<=|m|<=i+2. This is also easy to parallelize with OpenMP.
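
A rough sketch of that pruning in Python (assuming the bitstrings are stored as plain ints and both sets are pre-sorted by popcount; the names are illustrative, not from the papers):

    from bisect import bisect_left, bisect_right

    def search_by_popcount(queries, targets, max_dist=2):
        """Yield (q, t) pairs with Hamming distance <= max_dist.

        queries and targets are lists of ints (the bitstrings), both
        pre-sorted by popcount.  A query with popcount i can only match
        targets with popcount in [i - max_dist, i + max_dist], so each
        query scans one contiguous slice of targets.
        """
        t_counts = [t.bit_count() for t in targets]  # ascending; Python 3.10+
        for q in queries:
            i = q.bit_count()
            lo = bisect_left(t_counts, i - max_dist)
            hi = bisect_right(t_counts, i + max_dist)
            for t in targets[lo:hi]:
                if (q ^ t).bit_count() <= max_dist:
                    yield q, t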

The second paper observes that you can precompute additional information. For example, consider the subset of M where |m| = 100. You could also compute the popcount of just the first 1/2 of each m, or the popcount of the even-numbered bits. This must also be within +/-2 of the popcount of the same subset of your query.

That is, if the first 1/2 of your query has 40 bits set (so the remainder has 60 bits set) then you can only match those {m} whose first 1/2 has between 38 and 42 bits set.
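
In code, that extra filter is just another cheap necessary condition checked before the full XOR/popcount (a sketch; the half split is one arbitrary choice of subset, and in a real index the per-m values would be precomputed rather than recomputed per comparison):

    def half_popcounts(x, nbits):
        """Popcounts of the low and high halves of an nbits-wide bitstring."""
        half = nbits // 2
        return (x & ((1 << half) - 1)).bit_count(), (x >> half).bit_count()

    def passes_half_filter(q_halves, m_halves, max_dist=2):
        """Necessary condition for Hamming(q, m) <= max_dist:
        the popcount of each half can differ by at most max_dist."""
        return (abs(q_halves[0] - m_halves[0]) <= max_dist and
                abs(q_halves[1] - m_halves[1]) <= max_dist)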

The downside of more subdivisions like this is the increased amount of index overhead and the increased number of if/branches. At some point it's faster to just do a brute force linear search.

On my laptop, with 1024-bit strings, I can do 1M x 1M tests (with a high threshold like the one you suggest) within a few hours, using just the pruning from the first paper.

Send me an email (if my HN account name is ${U} then I'm ${U}@${U}scientific.com ) with some sample data and I'll see if my code works for what you need, and give you a pointer to the code.


Huh? I cannot reconcile that with what you wrote earlier: "I'm currently trying to find similar strings (max hamming of 2) between two very large (1e9) sets."

'Hamming' I only know as Hamming distance (https://en.m.wikipedia.org/wiki/Hamming_distance).

If that is what you are referring to, the first thing I would do is to divide and conquer by splitting the sets of strings by length (Hamming distance isn't defined for strings of unequal length)

After that, I would handle the larger strings by deriving 3 roughly equal-sized string keys from them. For example, split strings of length 14 into parts of lengths 5, 5, and 4. If two strings have Hamming distance <= 2, one of those derived keys must be equal in both strings by the pigeonhole principle. For length 14, and assuming an ACGT alphabet (or am I showing my ignorance here?), that divides and conquers by a factor of 256 or 1024. Since the naive "compare all pairs" algorithm is O(nm) for sets of size n and m, respectively, in comparisons, that should get you a speed up of a factor of at least 4 orders of magnitude.
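
A rough sketch of that grouping in Python (illustrative names; assumes the strings have already been grouped by length as described above):

    from collections import defaultdict

    def hamming(s, t):
        return sum(a != b for a, b in zip(s, t))

    def similar_pairs(strings_a, strings_b, max_dist=2):
        """All (a, b) pairs with Hamming distance <= max_dist.

        Both inputs hold equal-length strings.  With at most 2 mismatches
        spread over 3 slices, at least one slice must match exactly
        (pigeonhole), so only strings sharing a slice key are compared.
        """
        n = len(next(iter(strings_a)))
        cut1, cut2 = n // 3, 2 * n // 3
        parts = [slice(0, cut1), slice(cut1, cut2), slice(cut2, n)]

        found = set()
        for part in parts:
            buckets = defaultdict(list)
            for b in strings_b:
                buckets[b[part]].append(b)
            for a in strings_a:
                for b in buckets.get(a[part], ()):
                    if (a, b) not in found and hamming(a, b) <= max_dist:
                        found.add((a, b))
                        yield a, b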

Another approach might be to consider all possible pairs of replacements, and check whether they could make the strings equal. For example, assume the errors are that one A got replaced by a C and one T by an A. Then, do a replace all in all strings for those replacements, look for duplicates in the results, and then check the Hamming distances between the original strings that became equal after making those replacements (I would guess it isn't necessary to physically do the string replacements; adjusting the comparison function during sorting may be faster)

Again assuming a small alphabet, that should be doable, as all possible pairs of replacements isn't that many pairs.

And of course, the two can be combined.

(It would be a bit more messy to implement and would give a bit less of a speed up, but I think the gist of the above would work fine if you meant edit distance instead of Hamming distance)

Edit: an alternative could be to split each string into trigrams ("abcde" => {"abc", "bcd", "cde"}), index them with some text search engine such as Lucene, using no tokenizer, and then do a billion searches in that set. Anything with few changes will end up high in the search results, and it is fairly easy to determine hard bounds there. At 10ms per search and 1E9 searches, that would 'only' take 1E7 seconds, or about four months. It is easily parallelized and distributed, though.
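
To make those hard bounds concrete without pulling in Lucene, here is a toy in-memory version of the same trigram idea (an illustrative sketch only; a real 1E9-scale index would need Lucene or an on-disk structure):

    from collections import Counter, defaultdict

    def trigrams(s):
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    def build_index(strings):
        """For each trigram, record how often it occurs in each string."""
        index = defaultdict(dict)            # trigram -> {string_id: count}
        for sid, s in enumerate(strings):
            for g, c in trigrams(s).items():
                index[g][sid] = c
        return index

    def candidates(query, index, max_subs=2):
        """Ids of indexed strings that could be within max_subs substitutions.

        One substitution disturbs at most 3 of the len(query) - 2 positional
        trigrams, so a true match still shares at least
        len(query) - 2 - 3 * max_subs trigrams (counted with multiplicity).
        The filter only helps when that bound is positive, i.e. for strings
        longer than 3 * max_subs + 2 characters.
        """
        needed = len(query) - 2 - 3 * max_subs
        shared = Counter()
        for g, c_q in trigrams(query).items():
            for sid, c_t in index.get(g, {}).items():
                shared[sid] += min(c_q, c_t)
        return [sid for sid, c in shared.items() if c >= needed]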


Since the naive "compare all pairs" algorithm is O(nm) for sets of size n and m, respectively, in comparisons, that should get you a speed up of a factor of at least 4 orders of magnitude.

Could you flesh out that algorithm? Your "pigeonhole" insight is correct, but I don't see it leading to a speedup, much less a 10000x speedup. In practice, the limiting factor is likely memory access, and the length of the strings is irrelevant in the size ranges we're discussing. But likely I'm not understanding your approach.


Let's say you have two sets of 1E8 strings of length 9. The naive algorithm compares all pairs, so it does 1E16 "check if Hamming distance <= 2" calls.

If you group them, first by "first 3 characters equal", then by "middle 3 characters equal", then by "last 3 characters equal", with a 20-letter alphabet, you will, in each case, get on the order of 1E3 groups with on the order of 1E5 strings each. Comparing pairs in each group takes 1E10 "check if Hamming distance <= 2" calls, for a total of 1E13 such calls.

You have to do this 3 times, so this will give you about a factor of 300, assuming the grouping to be free (it isn't, of course, but sorting is O(n log n), and this is a tiny bit simpler)

For longer strings, the speed up increases, though. Length 12 gives you 1E16 vs 10000 groups of 1E4 strings each, for 3 times 1E12 "check if Hamming distance <= 2" calls. That gets you close to that 4 orders of magnitude.

Tangential: you have to implement your "check if Hamming distance <= 2" call correctly. It shouldn't compute the full Hamming(s,t) and then compare it with 2, but bail out as soon as the mismatch count exceeds 2. Also, you will call it at times where you know the first/middle/last third of the characters to be equal, so it should skip those characters (ideally, those shouldn't waste cache line memory at all, but getting there may be too costly in setup time)
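
Something along these lines (an illustrative sketch; the skip range is the third already known to match from the grouping step):

    def within_hamming(s, t, max_dist=2, skip=None):
        """True iff Hamming(s, t) <= max_dist, bailing out early.

        skip: optional (start, stop) range of positions already known
        to be equal, so it isn't rescanned.
        """
        mismatches = 0
        for i in range(len(s)):
            if skip is not None and skip[0] <= i < skip[1]:
                continue
            if s[i] != t[i]:
                mismatches += 1
                if mismatches > max_dist:
                    return False
        return True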

New idea: store both sets of strings in tries, iterate over all length-3 prefixes in the first trie, then walk the second one, pruning paths as soon as you hit branches where you know the Hamming distance will be larger than two, and peeking into the first trie to check whether pruning is possible. I think that can be made to effectively do what the 'split in 3 parts' trick does, but smarter, and it may be faster. Implementing it will require some thinking, though, and that thinking may lead to the conclusion that the idea doesn't work.


I did mean Hamming. I group by length as insertions and deletions are not expected. I convert to a one-hot vector so I can XOR to get the differences.

Thanks for the suggestions. I hadn't thought about splitting the strings into 3 roughly equal parts; that would definitely help narrow the search pool. FWIW, the strings use the protein alphabet (20 values instead of 4).
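
For illustration, one possible encoding along those lines (a minimal sketch; the exact bit layout is arbitrary as long as each (position, residue) pair gets its own bit):

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard residues
    BIT = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    def one_hot(seq):
        """Pack a protein string into an int, one set bit per residue."""
        x = 0
        for pos, aa in enumerate(seq):
            x |= 1 << (pos * len(AMINO_ACIDS) + BIT[aa])
        return x

    def hamming_chars(a, b):
        """Character-level Hamming distance of two equal-length sequences.

        Each mismatching position flips exactly two bits in the XOR
        (one bit turned off, one turned on), hence the division by two.
        """
        return (one_hot(a) ^ one_hot(b)).bit_count() // 2  # Python 3.10+

With that layout, a character-level threshold of 2 corresponds to at most 4 differing bits in the XOR of the encoded ints.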



