Show HN: An improved version of HyperLogLog (github.com/axiomhq)
194 points by seiflotfy on June 26, 2017 | 44 comments



Somewhat similar to the Redis implementation in many ways: Redis also uses loglog-beta, a sparse representation, and a dense representation in packed format that doesn't waste a full byte per register. However, the Redis representation uses 6-bit registers.


It's heavily inspired by the Redis implementation; in fact, back then I looked at how you did loglog-beta in Redis and made a repo for it. Now I've just added TailCut to it. Do you mind if I do a PR for Redis?


Cool! Please drop me an email if possible; I would like to explore how the reduced register width affects Redis's ability to estimate very, very high cardinalities: imagine millions of events/second running for years. The problem with improving the current Redis implementation is backward compatibility. I made the error of representing the HyperLogLog in Redis as a string instead of using an opaque data type, so it is not possible to do conversion-on-load of old data, but it is possible to convert on first access.


If there is a version annotation, it is very easy to convert the old HyperLogLog to a 4-bit register representation. I will send you an email ASAP.


Author here. The three main things that make this stand out are: 1) 4-bit registers instead of 5 (HLL) or 6 (HLL++); most implementations use 1-byte registers out of convenience. 2) A sparse representation for lower cardinalities (like HyperLogLog++). 3) loglog-beta for dynamic bias correction at medium and high cardinalities.
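
For illustration, here's a minimal sketch of packing two 4-bit registers per byte (a toy version, not the repo's actual code):

    package main

    import "fmt"

    // Two 4-bit registers per byte: register i lives in the high
    // nibble (even i) or low nibble (odd i) of byte i/2.
    type regs []byte

    func (r regs) get(i int) uint8 {
        if i%2 == 0 {
            return r[i/2] >> 4
        }
        return r[i/2] & 0x0f
    }

    func (r regs) set(i int, v uint8) {
        if i%2 == 0 {
            r[i/2] = r[i/2]&0x0f | v<<4
        } else {
            r[i/2] = r[i/2]&0xf0 | v&0x0f
        }
    }

    func main() {
        r := make(regs, 8) // 16 registers in 8 bytes
        r.set(3, 9)
        r.set(4, 5)
        fmt.Println(r.get(3), r.get(4)) // 9 5
    }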


This is absolutely amazing. It seems like a great tradeoff: smaller space usage and better accuracy on average at the same time. The related paper is also worth reading: http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/p...


This is awesome! We make heavy use of HyperLogLogs in our monitoring systems[0], which are also written in Go and currently use Clark Duvall's library, which this library is based on.

I'm excited to try this out on our systems and see what results we get.

[0] https://github.com/stripe/veneur


I am happy to help if needed :D


How does the Xiao et al. paper (which introduces TailCut, "Better with Fewer Bits: Improving the Performance of Cardinality Estimation of Large Data Streams", http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/p...) compare with Ertl's "New cardinality estimation algorithms for HyperLogLog sketches", https://arxiv.org/abs/1702.01284?


The tailcut approach breaks one important property of cardinality estimation algorithms: adding the same value multiple times should change the state of the sketch only once, on the first insertion.

However, due to the chance of overflows, the same value can lead to multiple state changes when using the tailcut approach. The introduced error depends on the order and also on the frequency distribution of inserted elements. It would be interesting to know which assumptions about the input data were made for the promised accuracy. I could not find any hint in their paper of what the input data looked like in their experiments. It is fairly easy to obtain good results for the trivial case where all input values are distinct.

The tailcut approach reminds me of the HyperBitBit algorithm which also drops the property described above and promises a better storage factor (less error with same amount of memory).

It is true that the traditional 5-bit or 6-bit representations of HyperLogLog registers are suboptimal. Even with lossless compression (see https://pdfs.semanticscholar.org/6558/d618556f812328f969b60b...) a significant amount of memory can be saved.

It is interesting that the standard error is reduced from 1.04/sqrt(m) down to 1.0/sqrt(m) despite the loss of information after the tailcut. Therefore, I conclude that it must be the estimation method itself which is superior to existing methods. I will need to check.
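
To make the broken property concrete, here is a toy Go model (my own simplification, not the paper's exact scheme) with offsets relative to a shared base, capped at 3:

    package main

    import "fmt"

    // Toy model: registers store an offset from a shared base,
    // capped at maxOff. Truncation at the cap is the "tail cut".
    const maxOff = 3

    type sketch struct {
        base uint8
        off  []uint8
    }

    // update returns true if the sketch state changed.
    func (s *sketch) update(reg int, rank uint8) bool {
        if rank < s.base {
            return false
        }
        v := rank - s.base
        if v > maxOff {
            v = maxOff // overflow: the stored value is truncated
        }
        if v > s.off[reg] {
            s.off[reg] = v
            return true
        }
        return false
    }

    func main() {
        s := &sketch{off: make([]uint8, 4)}
        fmt.Println(s.update(0, 6)) // true: 6 is truncated to offset 3

        // Later the minimum register rises and the base is bumped;
        // all offsets shift down accordingly.
        s.base = 2
        for i := range s.off {
            if s.off[i] >= 2 {
                s.off[i] -= 2
            } else {
                s.off[i] = 0
            }
        }

        // The very same element arrives again: its rank 6 now maps
        // to offset 3 again, exceeding the stored 1, so the state
        // changes a second time.
        fmt.Println(s.update(0, 6)) // true again
    }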


Hey oertl, big fan of your paper https://arxiv.org/pdf/1706.07290.pdf :D I'm going to start working on "intersections" and some improvements for large-scale cardinalities based on your research. Can I hit you up by email with some questions concerning the intersection work (e.g., how does it compare to https://arxiv.org/pdf/1704.03911.pdf)?


Sure, send me an email.


So if a register overflows (and is truncated) "twice" by the same element before the base register (B) is increased, then after B is increased, the overflowing register is not increased and so appears not to have overflowed. On the other hand, if that element shows up in the stream once before and once after B is increased, the overflowing register may overflow again.

This makes even the MLE procedure partially incorrect in the first case, since it is the maximum likelihood estimation of a set of hashes in which the register discussed above did not overflow.

Is that about right?


I have not yet checked their ML approach. Regarding register updates you are right.

I have also seen that they did only 10000 simulation runs to determine the standard deviation to be 1.00/sqrt(m). I think more simulations would be necessary to get a confidence interval for the standard deviation that is small enough to allow differentiation from 1.04/sqrt(m).


Two different things: this reduces the size of the HyperLogLog; the other improves the accuracy.


HLL is a space/accuracy tradeoff. If you trim HLL to use the same space as if you were using HLL+TailCut but use the Ertl method for estimating cardinalities, which is more accurate: Ertl with a smaller `m` or HLL+TailCut with a larger `m`?


I think the latter. I will implement this next week. Nice call :D


I suspect you are right that HLL+TC will be more accurate. Based on Figure 5 vs. Figure 1 in the Ertl paper, I think Ertl's method improves HLL estimation for low cardinalities and for very high cardinalities, but not in the region where HLL is already at its most accurate.
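
Back-of-envelope, taking the constants quoted upthread at face value: in a fixed memory budget, 4-bit registers give 1.5x the register count of 6-bit ones, so HLL+TC's mid-range error is about 1.00/sqrt(1.5m) ≈ 0.82/sqrt(m), against roughly 1.04/sqrt(m) for the 6-bit sketch in the same memory.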


I was interested in using HLL for a project where I had ~20 million objects and each had a "status" with about 6 possibilities. I wanted to get fast counts of how many objects were in each status. But their status can change over time, so I needed a way to also remove items from the count. HLL doesn't support this. Does anyone know of an algorithm that does? (Keeping a separate HLL for "removed" doesn't help because objects can revisit the same status they had before.)


If what you want is an approximate answer, take a look at

* Counting Bloom Filter https://en.wikipedia.org/wiki/Bloom_filter#Counting_filters

* Count-Min Sketch https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch

Here's a library: https://datasketches.github.io/

However, you can probably fit an exact answer into memory with just a simple hashtable if you've only got 20 million objects. A single byte could be used as the state marker.

Either that, or just track 6 counts.
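
Roughly like this (a sketch; all names are made up):

    package main

    import "fmt"

    // Exact counts: one byte of state per object plus six counters.
    // 20M map entries is tens of MB, which fits in memory fine.
    type tracker struct {
        state  map[uint64]uint8 // object id -> status (0..5)
        counts [6]int64
    }

    func (t *tracker) setStatus(id uint64, status uint8) {
        if old, ok := t.state[id]; ok {
            t.counts[old]--
        }
        t.state[id] = status
        t.counts[status]++
    }

    func main() {
        t := &tracker{state: make(map[uint64]uint8)}
        t.setStatus(42, 1)
        t.setStatus(42, 3) // status change: moves the count over
        fmt.Println(t.counts) // [0 0 0 1 0 0]
    }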


A couple of days late, but you can solve this by recording changes rather than objects themselves.

So you log the timestamp of each status change along with the "id" of the element you're changing. Then, when an object goes from one status to another, you record a concat of the id and the old timestamp into the "remove" hyperloglog for the previous status, and the concat of the id and the new timestamp into the "add" hyperloglog for the current status.

Say your object A at time=1 is recorded as "At=1" for status R. If it moves to status Q at time=2, you record "At=1" into the "remove" hyperloglog for status R, and then "At=2" into the "add" hyperloglog for status Q. This works as long as objects don't change status more than once per unit of however you're measuring time. If you're using milliseconds and the status changes infrequently, you're probably good.

The downside is this requires you to track changes rather than just iterate through your objects, so it might not be compatible with your architecture. You would also have to iterate through all objects once to capture their initial status. And it is susceptible to errors increasing over time, as your "add" and "remove" hyperloglogs record more and more elements (i.e., changes).
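
A sketch of the scheme, assuming the axiomhq/hyperloglog API (New14/Insert/Estimate) and that no (id, timestamp) pair repeats:

    package main

    import (
        "fmt"

        "github.com/axiomhq/hyperloglog"
    )

    // Per-status "add" and "remove" sketches. Each (id, timestamp)
    // key is inserted at most once, so adds minus removes
    // approximates the current number of objects in the status.
    type statusHLL struct {
        adds, removes *hyperloglog.Sketch
    }

    func key(id string, ts int64) []byte {
        return []byte(fmt.Sprintf("%s@%d", id, ts))
    }

    func main() {
        r := statusHLL{hyperloglog.New14(), hyperloglog.New14()}
        q := statusHLL{hyperloglog.New14(), hyperloglog.New14()}

        r.adds.Insert(key("A", 1)) // A enters status R at t=1
        // At t=2, A moves R -> Q.
        r.removes.Insert(key("A", 1))
        q.adds.Insert(key("A", 2))

        // Estimates are approximate, so these are ~0 and ~1.
        fmt.Println(r.adds.Estimate()-r.removes.Estimate(),
            q.adds.Estimate()-q.removes.Estimate())
    }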


HLL is for calculating the number of unique objects when there are possible duplicates. Unless you're doing something really weird you shouldn't have any problem going through all 20 million objects without encountering duplicates.

You're probably better off just keeping a running total. You can either update this total online (keeping track of all state changes) or compute it whenever you need to (going through a mere 20 million items shouldn't take too long).

If, for whatever reason, you really want to approximate it quickly, just pick a subset at random and calculate the percentages from there. Some basic Bayesian statistics can tell you the accuracy of this approximation.


https://github.com/seiflotfy/cuckoofilter, but this will require you to know the maximum number of unique values per status you will have.


Perhaps I'm misunderstanding the problem, but is maintaining six integers and updating their values as objects move from one status to another not a suitable solution?


I'm sure you thought of this, but what are the drawbacks to simply INCR/DECR some counters when an object changes state?


Indeed after ruling out HLL I decided that was the way to go. :-) I guess an HLL is more appropriate when you can't so easily keep track of which items have already been added to each set. For example if you were counting unique visitors per day, you would have a lot of sets (one per day), so you wouldn't want to store n visitor IDs for every day. But when an item can only be in one set at a time (and the sets are finite) incr/decr is possible so you might as well go with it.


I don't understand what the result table measures. What are the numbers and what is the "Exact" column?


Exact is the actual cardinality: how many distinct elements are in the set. HLL can estimate the cardinality using a fraction of the space that storing the unique elements would require.
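
For example, assuming the axiomhq/hyperloglog API:

    package main

    import (
        "fmt"

        "github.com/axiomhq/hyperloglog"
    )

    func main() {
        sk := hyperloglog.New14() // 2^14 registers
        exact := 1000000
        for i := 0; i < exact; i++ {
            sk.Insert([]byte(fmt.Sprintf("item-%d", i)))
        }
        // The estimate should land within about 1% of exact
        // at this precision.
        fmt.Println("exact:", exact, "estimate:", sk.Estimate())
    }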


Then wouldn't it be useful to add two columns for space used?


It's always going to be half. The thing is, the space is allocated upon creation of the data structure and does not increase afterwards.


That doesn't match up with the tagline "An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 20-50% less space than other usual HyperLogLog implementations."


I am comparing the implementations from Influx and Axiom.


I'm looking for implementations of this in other languages. Specifically in Python or Java. Please hook me up if you know of any such implementation.


Do you guys know any way to calculate the intersection of two HLL instances? The algorithm allows merging two sets, but it's not that easy to take their intersection. |A| + |B| - |A ∪ B| is not optimal, so I would love to hear your suggestions.


Neustar Research has a good article on this which may be of interest to you. However, I think the approach they took was still based on the inclusion-exclusion principle (still a good read though).

* https://research.neustar.biz/2012/12/17/hll-intersections-2/

Another approach if you have the flexibility of changing your implementation, is to use a different type of data-sketch that is more amenable to set expressions. There is some discussion (and references) here:

* https://datasketches.github.io/docs/Theta/ThetaSketchFramewo...

* https://datasketches.github.io/docs/Tuple/TupleOverview.html
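
For reference, the inclusion-exclusion estimate the parent mentions would look roughly like this, assuming the axiomhq/hyperloglog API (New14/Clone/Merge/Estimate):

    package main

    import (
        "fmt"

        "github.com/axiomhq/hyperloglog"
    )

    // Inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|. The error
    // blows up when the intersection is small relative to the union,
    // which is why this approach is not considered optimal.
    func intersection(a, b *hyperloglog.Sketch) int64 {
        u := a.Clone()
        if err := u.Merge(b); err != nil {
            panic(err)
        }
        return int64(a.Estimate()) + int64(b.Estimate()) - int64(u.Estimate())
    }

    func main() {
        a, b := hyperloglog.New14(), hyperloglog.New14()
        for i := 0; i < 1000; i++ {
            a.Insert([]byte(fmt.Sprint("x", i)))
        }
        for i := 500; i < 1500; i++ {
            b.Insert([]byte(fmt.Sprint("x", i)))
        }
        fmt.Println(intersection(a, b)) // ≈ 500
    }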


I have a way; will post it soon.


Would love to hear about the solution. Are you going to implement it or publish an article?


There really are a lot of HyperLogLog implementations in Go.

I'd be interested in seeing a comprehensive comparison between implementations for speed, accuracy and space efficiency.


I'll take care of this over the weekend :D


Any progress?


Why use MetroHash instead of xxHash?


The answer is probably here: https://github.com/rurban/smhasher

Speed and cycles/hash are better with MetroHash.


In my test MetroHash gave me higher throughput; it must be something with the xxHash implementation in Go.


I tested both 64-bit variants a while ago and there was no throughput difference worth mentioning between the C++ and C implementations, respectively. (I settled on xxHash since it's been around longer and the code is portable; also, I didn't need to do a C++-to-C translation.)

If there is a large difference, perhaps one of the implementations is vectorized or botched.
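
A benchmark sketch if anyone wants to reproduce, assuming the github.com/cespare/xxhash and github.com/dgryski/go-metro packages (run with go test -bench=.):

    package hashbench

    import (
        "testing"

        "github.com/cespare/xxhash"
        metro "github.com/dgryski/go-metro"
    )

    var buf = make([]byte, 64) // short keys, typical HLL input
    var sink uint64            // keeps the calls from being elided

    func BenchmarkXXHash64(b *testing.B) {
        b.SetBytes(int64(len(buf)))
        for i := 0; i < b.N; i++ {
            sink = xxhash.Sum64(buf)
        }
    }

    func BenchmarkMetro64(b *testing.B) {
        b.SetBytes(int64(len(buf)))
        for i := 0; i < b.N; i++ {
            sink = metro.Hash64(buf, 0)
        }
    }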



