Roughly speaking, find index of most significant set bit. (So 255 gives you 7, 254 gives you 7, 256 gives you 8, etc.) At a guess, they are comparing things 8/16 bytes at a time, and getting a packed set of byte flags where bytes are set to 0xff to indicate inequality (so, imagining it 4 bytes at a time, comparing 0x12345678 and 0x12125656 would give you 0x00ff00ff) and they want to see where the first different byte is. In this example, the index is 24. (Then: 4-24/8=1. So the first different byte is at offset 1.)
As to how it works: you use the v&-v trick to clear all but the top bit of your comparison result. Because you've only got 1 bit set, this value is a power of two; if you multiply another value by this one, you're getting the equivalent of a left shift by that power (which is relevant - you need to be thinking in shifts rather than multiplies).
Because of the way the predicate results are formed, you only have 5 possible values you're going to get here: 2^31, 2^23, 2^15, 2^7, or 0. The outputs these correspond to:
31 -> 0
23 -> 1
15 -> 2
7 -> 3
0 -> (special case - all identical)
So you need a value that when shifted left 31 bits can be easily turned into 0, when shifted left 23 can be turned into 1, and so on.
A suitable value for this is something like 3<<22|2<<14|1<<6. (If I've made a mistake there, which I probably have, hopefully the working will make it obvious how to fix it.) Recalling that multiplying by 2^N gives you a left shift by N:
When multiplied by 2^31, you get 0.
When multiplied by 2^23, you get 1<<29.
When multiplied by 2^15, you get 2<<29|1<<21.
When multiplied by 2^7, you get 3<<29|2<<21|1<<13.
In all cases, you can shift the result right 29 bits (unsigned shift, or signed shift and mask with 3 afterwards) and get the value you're looking for: 0, 1, 2, or 3.
Zero you can check for before doing the multiply; or you could pop a 4 in at the bottom or something so that the result would be 4 in that case. (This is what I figured you'd want originally and that's why the result starts at bit 29. But actually I guess you'd probably check first.)
This extends naturally to wider comparisons. I think they're doing it 64 bits at a time in this case.
Exercise for the reader: (maybe these are too easy?? - however they weren't instantly obvious to me)
- Why is x * 2^N equivalent to x<<N?
- What if you have more than 1 bit set? Like x * (2^N+2^M)?
- Figure out some rules for generating useful constants like these
Can you also explain the bit twiddling bit in more detail? What operation it makes faster and how it falls into a larger picture?