This is not directly relevant, but Power ISA 3.1 (from May 2020) has 128-bit division operations (vdivuq and vdivsq) with a maximum latency of 61 cycles. I don't have access to a Power10 machine to see how it compares to what's presented here, but I thought it was an interesting addition to the ISA.
Yeah, it figures Lemire would have written it. I don't think OP actually needs or wants division when you can have faster algorithms with the same statistical properties.
Lemire’s approach uses rejection sampling, which means it needs an uncertain amount of input — a uint128 may or may not be enough.
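Roughly, the rejection loop looks like this (a sketch of Lemire's 64-bit variant, with next_word standing in for whatever source of random/hash bits you have):

    /// Unbiased range reduction with rejection (Lemire-style sketch).
    /// The retry loop is why the amount of input needed is uncertain.
    fn lemire_bounded(n: u64, mut next_word: impl FnMut() -> u64) -> u64 {
        let mut m = (next_word() as u128) * (n as u128);
        let mut lo = m as u64;
        if lo < n {
            // 2^64 mod n: low parts below this threshold would be biased, so redraw
            let threshold = n.wrapping_neg() % n;
            while lo < threshold {
                m = (next_word() as u128) * (n as u128);
                lo = m as u64;
            }
        }
        (m >> 64) as u64
    }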
But there is one related algorithm that doesn’t need rejection: https://github.com/apple/swift/pull/39143. This method only requires (output word size + 64) bits, which is really convenient for us since we probably don’t have 2^64 buckets. Same ballpark as (hash >> 64) * n, just with bias correction.
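As I understand the idea (a rough sketch, not the actual Swift code): treat the random words as the binary expansion of a real x in [0, 1) and return floor(x * n), truncated after output-width + 64 bits, which leaves a bias on the order of 2^-64 instead of requiring rejection:

    /// Bounded value in [0, n) from exactly two 64-bit words (sketch).
    /// The second word only refines the fraction; a carry out of it bumps
    /// the result by one. No rejection loop, bias on the order of 2^-64.
    fn bounded_no_rejection(n: u64, mut next_word: impl FnMut() -> u64) -> u64 {
        let m1 = (next_word() as u128) * (n as u128);
        let mut result = (m1 >> 64) as u64; // integer part of word1 * n / 2^64
        let fraction = m1 as u64;           // fractional part, in units of 2^-64
        let m2 = (next_word() as u128) * (n as u128);
        let (_, carry) = fraction.overflowing_add((m2 >> 64) as u64);
        if carry {
            result += 1;
        }
        result
    }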
With this method the output value stays strongly correlated with the high bits of the hash; it is not scrambled at all. Now whether that’s a problem is… someone else’s question. One can always do some shifts and XORs.
You don’t need to worry about this for mapping hashes to buckets, because a hash has a limited width, whereas unbiased random numbers are based on an unbounded stream of random bits.
In this article they are working with 128-bit hashes, so they already have plenty of bits for bias not to be a problem. A single straightforward multiplication is fine.
Some major blunders in the first paragraphs. Iterating over all 2^32 values of a 32-bit hash is trivial; iterating over all 2^64 values of a 64-bit hash is not feasible in reasonable time. She mixed up the numbers there.
Thus I don't see the real need to extend to 128 bits for the birthday paradox. At Google scale there might be, but not for pedestrians.
No blunder here.
They're saying that with a 64-bit hash you will likely see hash collisions with 2^32 items. It's easy to generate 2^32 items, so it follows that it is easy to generate a collision.
The high likelihood of collisions is explained by the birthday paradox.
With a 128-bit hash you don't have this problem anymore.
But if you are then mapping the elements to (most likely vastly fewer than 2^32) buckets... why do the hash collisions matter? Just use the top 64 bits of the hash to compute a bucket, using ((hash >> 64) * buckets as u128) >> 64.
The reason this works is that hash64 / 2^64, where hash64 is the top 64 bits of the hash, is almost uniformly distributed on [0, 1), so floor(hash64 * buckets / 2^64) is almost uniformly distributed on [0, buckets). This compiles to a single mul instruction on x86-64 and a single umulh instruction on AArch64.
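A minimal sketch of that mapping (the function name is mine):

    /// Map a 128-bit hash to a bucket index using only its top 64 bits:
    /// floor(hi64 * buckets / 2^64). One widening multiply, no division,
    /// and the result is always in [0, buckets).
    fn bucket_for(hash: u128, buckets: u64) -> u64 {
        let hi = (hash >> 64) as u64;
        (((hi as u128) * (buckets as u128)) >> 64) as u64
    }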
Agreed on multiplying the top 64 bits rather than dividing: if the hash is calculated via prime multiplications, the top bits are "more random" than the bottom bits.
But we don't really know the hash table type. It could be doing something like hash table -> linked list -> items, in which case they might still be utilising the full 128-bit hash (I'd usually just store the lower 64 bits TBH).
I'm not saying to throw away the rest of the hash, just to ignore it for the bucket calculation. You can utilize the full 128-bit hash for other parts.
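For example (a hypothetical layout, not necessarily what the article's hash table does): route on the top 64 bits and keep the low 64 bits as a stored fingerprint, so most lookups can reject non-matching entries without a full key comparison.

    /// Hypothetical entry layout: the low 64 bits of the 128-bit hash are
    /// kept as a fingerprint, while the top 64 bits only pick the bucket.
    struct Entry<K, V> {
        fingerprint: u64, // low 64 bits of the hash
        key: K,
        value: V,
    }

    fn split_hash(hash: u128, buckets: u64) -> (u64, u64) {
        let bucket = (((hash >> 64) * (buckets as u128)) >> 64) as u64;
        let fingerprint = hash as u64; // truncates to the low 64 bits
        (bucket, fingerprint)
    }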
Calculate the probability then. With a 64-bit hash, 2^32 tries give only a ~0.39347 probability of at least one collision. Which is unlikely, not likely.
You need a few times more tries to get a good probability; with 10x more you are well past 97%.
Which needs ~40m with 8 cores on my CPU, so it is not a practical fly-by attack anymore. With a big GPU it would be, though.
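For reference, the standard birthday-bound approximation p ≈ 1 - exp(-n^2 / (2d)) reproduces that number (a quick sanity check, independent of any particular hash function):

    // Probability of at least one collision among n uniformly random
    // values drawn from a space of size 2^bits (birthday approximation).
    fn collision_probability(n: f64, bits: i32) -> f64 {
        let d = 2f64.powi(bits);
        1.0 - (-(n * n) / (2.0 * d)).exp()
    }

    fn main() {
        let n = 2f64.powi(32);
        println!("{:.5}", collision_probability(n, 64));        // ~0.39347
        println!("{:.5}", collision_probability(10.0 * n, 64)); // ~1.00000
    }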
I do calculate 32-bit collisions for all hash functions regularly, but not for 64-bit functions, because that is not practical. Google is the only prominent player that cares about 128-bit hashes, but then they should at least get their numbers right, because most of their arguments are pure security theater.
2^32 is about 2 months of 1000 random requests per second (roughly 50 days), so every couple of months you'd have a random chance that your service breaks for someone if you're relying on there not being any collisions.
https://wiki.raptorcs.com/w/images/f/f5/PowerISA_public.v3.1... https://files.openpower.foundation/s/EgCy7C43p2NSRfR