How a HyperLogLog works (opensourceconnections.com)
117 points by softwaredoug on Feb 4, 2015 | 17 comments



A simpler and more concise explanation appears in https://github.com/aaw/hyperloglog-redis:

"The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2 -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values, you're likely to have around 2 6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that's the idea."


Interesting to see Bitcoin use almost the same principle. Essentially it's "if there is a hash with enough leading zeroes, we can assume a lot of CPU time has been put into the network."

Of course that's just an artifact of what's really happening, which is pseudo-random values + laws of probability.
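
In Bitcoin terms the test is, very roughly, the following toy Python sketch (real Bitcoin compares the double SHA-256 of the block header against a fine-grained numeric target rather than counting zero bits):

    import hashlib

    def meets_difficulty(header, zero_bits):
        # Toy proof-of-work test: does the double SHA-256 of the
        # header begin with at least `zero_bits` zero bits?
        digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
        return int.from_bytes(digest, "big") >> (256 - zero_bits) == 0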


I don't think this does a very good job of explaining the HyperLogLog, honestly. A slightly less contrived example would've been useful.


I agree. Additionally, the words "distinct" or "cardinality" don't appear anywhere in this article, which is a major omission when discussing HyperLogLog. The primary use of the algorithm is to provide a cardinality estimate.


"Let's say we have a whiteboard and we have to count how many individuals visit our home. If our neighbor visits 5 times, they count as one unique visitor. If a friend comes over once, that's another unique visitor. So far, we have 2 unique visitors."

They may not specifically use the word "distinct" (opting for phrases like "unique visitor" instead), but they certainly make the concept clear.


Agreed, cardinality is the key word, and it points to why his example is inappropriate -- if you were just counting the number of people coming in, simple addition (and maybe taking a log()) would cost just as much. I believe the usefulness of HLL is in estimating the cardinality of sets you need to access -- you "sample" the set and get a quick estimate of the cardinality of certain objects.


I disagree. I think that the article does an EXCELLENT job of describing it. They avoided the term "cardinality" on purpose, since the target audience was people who might well be scared off by fancy words they weren't familiar with.

And the problem posed was not to count the number of visitors, but to count the number of different people who visited... the fancy word for that is the "cardinality of unique visitors" but "number of different people" is just as accurate.

Counting the number of people could NOT be done with the same amount of space. This example required two simple counters on the whiteboard... call it 30 bits of memory, assuming you didn't expect either counter to get more than 8. That's just about exactly enough space to store ONE social security number... and nowhere near enough to count the unique people.
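
To put rough numbers on that, a back-of-the-envelope Python sketch (sizes are illustrative, not from the article):

    import sys

    # Exact distinct counting: memory grows with the number of
    # distinct people seen.
    seen = {"ssn-%09d" % i for i in range(100_000)}
    print(sys.getsizeof(seen))  # the set's table alone is a few MB,
                                # and the stored strings add more

    # The whiteboard trick uses a couple of small counters instead:
    # constant memory, at the cost of only estimating the count.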


I agree.

My concerns are not really about the contrived use of only 2 buckets and decimal numbers; that may give readers a picture to start with.

But it's a pity to gloss over how the final cardinality estimate is computed. And it's a shame to publish such a sloppy gist: all the key points are buggy (bucket assignment, count of leading zeros, final outcome); a correct sketch of those steps is below.

I started my day thinking "Oh good, a post that may help me quickly understand how HyperLogLog works," but I was so confused after reading it that I had to dive into the paper and some high-quality code (the Redis implementation) to come away with a better understanding and no false assumptions.

It may be OK to simplify an algorithm to help people grasp it. But at the very least, as authors, we have to give a true picture.
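
As promised, a minimal Python sketch of those three steps, following the HLL paper (m = 16 buckets is an illustrative choice, and the paper's small- and large-range corrections are omitted):

    import hashlib

    M = 16           # number of buckets = 2^4
    ALPHA = 0.673    # bias-correction constant for m = 16, from the paper

    registers = [0] * M

    def add(value):
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        bucket = h >> 60                # bucket assignment: top 4 bits
        rest = h & ((1 << 60) - 1)      # remaining 60 bits
        rho = 61 - rest.bit_length()    # 1-based position of the first 1-bit
        registers[bucket] = max(registers[bucket], rho)

    def estimate():
        # Final outcome: alpha * m^2 times the harmonic mean of 2^register.
        return ALPHA * M * M / sum(2.0 ** -r for r in registers)

(For small cardinalities the paper falls back to linear counting; this sketch skips that.)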



Nick Johnson did a Damn Cool Algorithms post [0] on this that inspired me to play around with HLL myself and do a little writeup [1].

[0]: http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...

[1]: https://github.com/sergeio/hyperloglog


Here's a less contrived application. This paper uses HLLs to compute miss rate curves for LRU caches from traces of block requests: https://www.usenix.org/system/files/conference/osdi14/osdi14...

The short, simplified story is that a list of HLLs (a 'counterstack' in the paper, which is an unfortunate bit of historically motivated terminology) is used to count the number of distinct blocks requested. Each HLL is fed every block request. A new HLL is created after a certain number of requests, so the counts your HLLs report correspond to estimates of the number of unique blocks accessed in tails of the trace. The oldest HLL has seen every block in the trace, while the most recently created has seen a tail of the trace of length <= your creation-interval parameter.
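
A Python sketch of that structure (TinyHLL reuses the bucket sketch upthread; the class and parameter names are mine, not the paper's):

    import hashlib

    class TinyHLL:
        # Minimal m = 16 HLL, as sketched upthread.
        def __init__(self):
            self.regs = [0] * 16
        def add(self, value):
            h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
            bucket, rest = h >> 60, h & ((1 << 60) - 1)
            self.regs[bucket] = max(self.regs[bucket], 61 - rest.bit_length())
        def estimate(self):
            return 0.673 * 16 * 16 / sum(2.0 ** -r for r in self.regs)

    class CounterStack:
        # Start a new HLL every `interval` requests and feed every
        # request to every live HLL, so the i-th HLL estimates the
        # distinct blocks in the suffix starting at request i * interval.
        def __init__(self, interval):
            self.interval, self.hlls, self.seen = interval, [], 0
        def process(self, block_id):
            if self.seen % self.interval == 0:
                self.hlls.append(TinyHLL())
            self.seen += 1
            for hll in self.hlls:
                hll.add(block_id)
        def tail_estimates(self):
            return [hll.estimate() for hll in self.hlls]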

It turns out that you can use this structure to compute stack distances, which are the number of unique blocks accessed between two requests to the same block. The details are a little hairy, so see the paper for the full story.


Another good explanation, this one with a JavaScript simulation of HLL: http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyp...


I thought it was a clear explanation. It was easy to understand, and I now have a good intuition of how HLL works and why I might want to use it.


Seriously? Someone renamed the title? Nobody likes Ren & Stimpy these days?


Surely that someone favored homework over R&S, but "a HyperLogLog"? Why does it need an "a"? I would have expected "How HyperLogLog works".


Would you expect a title "How Hash Table works"? No, you'd expect "How a Hash Table works". Same thing here.


You really think that there are a lot of people on HN who know what Ren & Stimpy was?



