Hacker News new | past | comments | ask | show | jobs | submit login

A simpler and more concise explanation in https://github.com/aaw/hyperloglog-redis.

"The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2 -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values, you're likely to have around 2 6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that's the idea."




Interesting to see Bitcoin use almost the same principle. Essentially it's "if there is a hash with enough leading zeroes, we can assume a lot of CPU time has been put into the network."

Of course that's just an artifact of what's really happening, which is pseudo-random values + laws of probability.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: