To be honest, I'm not an expert at this stuff—merely enthusiastic about it. I co...

ronaldx · on April 1, 2014

On light reflection: probably if you do N tests, the modal result would be a good enough choice, or perhaps the median.

The issue is that (e.g.) the arithmetic mean would be very influenced by the stats that gave randomly bad results - the very results you are trying to exclude by doing the test multiple times and taking an average.

dhotson · on April 1, 2014

So I think this is roughly what antirez is talking about with 'registers' for a counter.

From the article:

The standard error of HyperLogLog is 1.04/sqrt(m), where “m” is the number of registers used. Redis uses 16384 registers, so the standard error is 0.81%.

My understanding is that the error is very small because you're averaging over 16,000 registers.

Edit: also, I think you want the mean because it'll give you back a precise/fractional power of 2 that is pretty close to the actual count. The median would give you the nearest integer power of two, which is probably not a good estimate of the real count.

Double edit: looks like there's some research around improving averaging by throwing away outliers: http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...