Hacker News new | past | comments | ask | show | jobs | submit login

To be honest, I'm not an expert at this stuff—merely enthusiastic about it. I could have some details wrong.

My intuition says that a straight average should give you fairly good results.. although antirez talks a bit about bias in HyperLogLog counts that he's trying to adjust for.

So maybe an average + some bias correction gives you better results?




On light reflection: probably if you do N tests, the modal result would be a good enough choice, or perhaps the median.

The issue is that (e.g.) the arithmetic mean would be very influenced by the stats that gave randomly bad results - the very results you are trying to exclude by doing the test multiple times and taking an average.


So I think this is roughly what antirez is talking about with 'registers' for a counter.

From the article:

The standard error of HyperLogLog is 1.04/sqrt(m), where “m” is the number of registers used. Redis uses 16384 registers, so the standard error is 0.81%.

My understanding is that the error is very small because you're averaging over 16,000 registers.

Edit: also, I think you want the mean because it'll give you back a precise/fractional power of 2 that is pretty close to the actual count. The median would give you the nearest integer power of two, which is probably not a good estimate of the real count.

Double edit: looks like there's some research around improving averaging by throwing away outliers: http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: