I wonder how many people using UUIDs still write a conditional to verify uniqueness against the previously generated ones.
Edit: not implying there aren't perfectly good reasons to make the check that are unrelated to the probabilities of a collision in the ideal case. Just an idle musing.
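For anyone curious about the actual numbers behind that musing, here's a back-of-envelope sketch in Python. It assumes standard version-4 UUIDs (122 effective random bits) and the usual birthday approximation, which is only meaningful while the resulting probability is small:

    # Rough birthday-bound estimate for random (version 4) UUID collisions.
    # Assumes 122 effective random bits per UUID; the approximation only
    # holds while the resulting probability is small.

    def uuid4_collision_probability(n: int) -> float:
        """Approximate probability of at least one collision among n UUIDs."""
        space = 2 ** 122                      # distinct random UUIDv4 values
        return n * (n - 1) / (2 * space)      # birthday approximation

    for n in (10**6, 10**9, 10**12):
        print(f"{n:>16,} UUIDs -> P(collision) ~ {uuid4_collision_probability(n):.2e}")

Even at a trillion UUIDs that works out to roughly 1e-13, which is why the uniqueness check rarely pays for itself on probability grounds alone.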
And, of course, given the known weaknesses of MD5, that also assumes no untrusted party had an opportunity to manipulate the messages (or that a collision couldn't possibly benefit them anyway).
Fantastic article. Probabilistic algorithm design is fascinating to me because it's all about making sure that the system behaves as expected, rather than every individual operation being perfect. Implementing most random algorithms in practice strongly reminds me of studies of populations & behaviours of organic life (which is much better than the Rube Goldberg/spaghetti factories that sometimes arise as "solutions" to complex problems).
Another interesting related field is Dave Ackley's robust-first computing model[1], which operates with every atomic computational cell having some elements of randomness built in.
For those who are interested, Berkeley has publicly available lecture notes for their Randomness & Computation course. They go over some randomized algorithms as well as the mathematical tools necessary to analyze them. Just a heads up, the content is fairly theoretical.
Could someone explain or point me to a resource as to why using multiple buckets in the HyperLogLog algorithm makes it more accurate than a single bucket?
I don't have an article handy, but my intuition is that it reduces variance: a single register's estimate rests on one noisy "max leading zeros" observation, whereas HyperLogLog combines many registers (via a harmonic mean), so getting "unlucky" with the hashes in one bucket gets smeared out across all of the buckets.
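To make that intuition a bit more concrete, here's a rough Python simulation. It is not real HyperLogLog: it uses the harmonic-mean raw estimate but skips the bias-correction constant and the small/large-range corrections, and the stream size, bucket counts, and trial count are arbitrary choices. It's just enough to show the spread of the estimate shrinking as buckets are added:

    # Toy demo: each register holds one noisy "max leading zeros" observation.
    # Combining many registers (HLL proper uses a bias-corrected harmonic mean)
    # tightens the spread of the final estimate.
    import random
    import statistics

    HASH_BITS = 64

    def rho(x: int, bits: int) -> int:
        """1-indexed position of the first set bit in a `bits`-bit value."""
        for i in range(1, bits + 1):
            if (x >> (bits - i)) & 1:
                return i
        return bits + 1

    def raw_estimate(n_distinct: int, m: int, rng: random.Random) -> float:
        """Harmonic-mean estimate of n_distinct using m = 2**p registers."""
        p = m.bit_length() - 1
        registers = [0] * m
        for _ in range(n_distinct):
            h = rng.getrandbits(HASH_BITS)              # stand-in for hashing a distinct item
            bucket = h >> (HASH_BITS - p) if p else 0   # top p bits pick the register
            rest = h & ((1 << (HASH_BITS - p)) - 1)     # remaining bits feed the zero count
            registers[bucket] = max(registers[bucket], rho(rest, HASH_BITS - p))
        return m * m / sum(2.0 ** -r for r in registers)

    def relative_spread(m: int, n_distinct: int = 10_000, trials: int = 100) -> float:
        rng = random.Random(m)  # fixed seed per m, just for repeatability
        estimates = [raw_estimate(n_distinct, m, rng) for _ in range(trials)]
        return statistics.stdev(estimates) / statistics.fmean(estimates)

    for m in (1, 16, 64, 256):
        print(f"m={m:>4} buckets: relative spread ~ {relative_spread(m):.3f}")

If I've set it up right, the spread falls off roughly like 1/sqrt(m); the HyperLogLog paper quotes a standard error of about 1.04/sqrt(m) for the full bias-corrected estimator, which is the "smearing the unlucky hashes across buckets" effect stated quantitatively.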
There is nothing wrong with this if the probability of failing is smaller than the chance of an asteroid landing on your head.