Hacker News
Redis new data structure: the HyperLogLog (antirez.com)
428 points by antirez on April 1, 2014 | 66 comments



This is nice. At my last job, we developed an extremely high performance HyperLogLog server (hlld). Take a look if you're interested:

https://github.com/armon/hlld

It processes over 1 million ops per second on a basic MacBook Air. In production we saw it handle more. We never had to scale it beyond one server to handle millions of distinct metrics.

As an ad company, my last company had a ridiculous fanout of metrics for every request. Every ad serve API request (of which we got thousands per second) would fan out to over 150 metrics EACH. So for 1,000 requests per second we'd have to modify 150,000 metrics. Raw SQL updates of course fall down under this (even if we tone back their durability). Instead, we opted to accumulate for 10 to 60 seconds in bloomd, hlld, and statsite, then aggregate and do a single bulk SQL update. Worked incredibly well.
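
Roughly, the pattern looks like this (a toy Python sketch; flush_to_sql is just a hypothetical stand-in for the bulk UPDATE, and the real pipeline used bloomd/hlld/statsite rather than an in-process dict):

    import threading
    import time
    from collections import defaultdict

    FLUSH_INTERVAL = 30  # seconds; we used windows of 10 to 60

    counters = defaultdict(int)   # metric name -> accumulated count
    lock = threading.Lock()

    def record(metric, value=1):
        # Hot path: called for every metric of every ad request, memory only.
        with lock:
            counters[metric] += value

    def flush_loop(flush_to_sql):
        # Periodically drain the in-memory counters and issue one bulk SQL
        # update instead of ~150k individual row updates per second.
        while True:
            time.sleep(FLUSH_INTERVAL)
            with lock:
                snapshot = dict(counters)
                counters.clear()
            if snapshot:
                flush_to_sql(snapshot)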

We used it in combination with bloomd: https://github.com/armon/bloomd Bloomd is another super high performance data structure server, but this time for bloom filters.

You can read the READMEs to learn why they're better for their specific tasks. Both have been running in production for 2+ years. If you're not going to be using the data structures that much then Redis is a good choice. We personally used these servers alongside Redis for k/v.


Hello! I'm sure a specialized server can squeeze out even the last couple of clock cycles, but the Redis implementation performs very reasonably. I ran the following two benchmarks simultaneously on my MacBook Air 11", which together trigger the worst-case scenario from the point of view of the HLL data structure:

    ./redis-benchmark -P 64 -r 100000000 -n 100000000 \
      pfadd hll __rand_int__
    pfadd hll __rand_int__: 320125.56
And

    ./redis-benchmark -P 64 -r 100000000 -n 100000000 \
      pfcount hll
    pfcount hll: 313309.62
Since we are adding random integers in the range from 0 to 100 million, we are basically triggering the worst case, in which register updates happen as frequently as possible.

At the same time, PFCOUNT is running in the other benchmark, forcing the cached cardinality to be recomputed as often as possible.

Yet we get around 640k ops/sec on a single core, which multiplied by the 4 cores of an Air i7 gets you to roughly 2.4M ops/sec.

EDIT 1: Note that while I'm using random data as the argument of PFADD, we are not using N HLLs but a single one, so this benchmark is surely biased because cache misses are rare. But using multiple keys would not be terrible, nothing like losing an order of magnitude; just a reasonable percentage of ops would go away.

EDIT 2: I had a process running in the background consuming a lot of CPU; I tried again and reached 900k ops/sec per core, but read EDIT 1 for more info.


Yes, Redis performs well. I don't think I ever actually compared hlld to Redis in my comment, except to say you LOSE data structures by using hlld specifically.

I was not making a comparison. I was showing another potential solution.


No problem at all with comparing! And thanks for commenting. I wrote my message because you wrote "If you're looking for much higher performance than what Redis ...". My point is just that, given the performance level of the Redis HLL implementation, adding a new component to an infrastructure for the sake of faster HLL should be considered with care; specialized solutions may of course be better in many other regards.


"If you're looking for much higher performance than what Redis will get for this specific data structure"

That's a comparison, specifically claiming much higher performance.


Whoops! Removed. Thanks.


There are several databases that will do 150k replicated updates a second on a 2-socket server. Heck, 150k multi-statement transactions per second has been realistic for years. 1 million on a MacBook Air, no, not yet: there isn't enough caching, JITing, and precompilation to optimize away the overhead of SQL in anything that currently exists.

This is not to talk down using the right tool for the job, but as soon as you want to do something a little more complicated than counting, you are right back to reinventing databases. That is basically what you are doing by implementing periodic group commit on top of a database.


We would've preferred not to have to write these things. If you knew me, you'd know I'm not a "build things because we can" type of person. We spent a week tuning PostgreSQL trying to get it fast enough and couldn't (due to the ACID guarantees we were trying to break). We then hired an outside DB consultancy for a handful of hours, since they claimed they could do it and a few hours of consulting is cheaper than us writing a DB. They ultimately came back and said "SQL isn't the right choice for this."

I can see how you could make your comment without this context. But really, I promise you SQL didn't work AT ALL. This is primarily due to the type of data we were putting in there: we needed aggregates like totals, averages, standard deviation, etc. To get those, the data has to be inserted first, and we were requesting them on the fly, in real time, at a pretty large rate. This pushed the CPU of the SQL servers up pretty hard WHILE they were also serving hundreds of thousands of insert/update ops/sec. We could've solved this by adding read slaves and all that, but now we're talking expensive CapEx for something we were confident we could fix with some C in a week or two.

Enter: the two servers linked in my parent comment.

We sacrificed durability, accuracy, and some level of safety (if the server crashes before an accumulation flush, which happens once every X seconds, you lose that window of data) for raw speed. We calculated the acceptable tolerances (max % off the real value) and tuned our data structures to them (bloomd and hlld both expose these tunables). We then ran both servers in parallel with our existing SQL solution for a few weeks to verify the numbers were correct. Finally, we switched over.

Like I said, when we built that we were doing maybe 450,000 metrics per second. That was 2 years ago. Now they're doing much much more than that, and both hlld and bloomd are running on a single m3.medium with no persistence (they aggregate then issue a single large UPDATE to SQL) with no issue.

The amortized cost of developing the servers (human labor), weighed against the thousands saved in server costs and the effectively zero ops overhead (they never go down, they're stateless, etc.), has been well worth it.

Hope that makes our decision a little clearer.


Very well put. I think people who have only built SQL-based solutions tend to underestimate the labor and ops overhead of SQL-based solutions for problems that don't fit the typical SQL use cases very well. It's nice to read of a long-running example like this.


If the data is partitionable, a main-memory database should be able to deliver aggregates in real time without much pain. If you're writing to disk at 1M tx/sec, though, yeah, that could perhaps be a minor bottleneck. I'm not quite sure what this has to do with SQL, though.

Of course writing a specific data structure will outperform a general solution; that goes without saying. No HA/persistence makes it even easier. And if the data structure is simple enough, then writing and maintaining the code yourself may pay off.


Oh, there are dozens of databases that will do 150k transactions per second. The issue with database throughput for most applications is the storage layer, which is more difficult to fine tune when working with a cloud service.

With Azure, you have to use one of their very large instances and attach and stripe virtual disks yourself to hit 40k IOPS, and I think that's just barely possible. With Amazon, you are limited to about 48k IOPS, and again you have to use EBS with provisioned IOPS and stripe. Not sure about Google's services.


Or take the VoltDB way and persist to memory (in the commercial version AFAIK).


I would heartily disagree with their definition of persistence.


Can you point me to one such perf benchmark? Thx.


As a quick side note, PostgreSQL has had a HyperLogLog data type for a while: https://github.com/aggregateknowledge/postgresql-hll

For people who need SQL and HyperLogLog, we found this extension to work pretty well.


Also working at an ad company and also using bloomd; it rocks.

For HyperLogLog, we actually use our column-based big data store. It is possible, and actually pretty easy, to express the HyperLogLog algorithm as a SQL query, which we use to aggregate several useful metrics. This is super useful when you can combine it with other metrics that don't use HyperLogLog, e.g. to build another fact table for your star schema.


Just wanted to say thank you for hlld/bloomd! At my old company we were looking into uses for them just because the idea and footprint were so cool. We had a couple of ideas for hlld, and the overhead was so small that we could basically add it to our servers for free.


Can you expand on how you combined it with bloomd?

Thanks.


HyperLogLog counters are one of my favourite things. They're incredibly clever.

You can also combine this idea with bloom filters (another probabilistic data structure) in a really interesting way. The goal is to fit a _lot_ of counters into a fixed amount of space.

The first part is to encode your HyperLogLog counter in unary, so each counter generally only takes up a few bits.

Quick unary tutorial: in unary you count with a sequence of 1's (followed by 0 to signify the end of the number). 3 is encoded as 1110. 5 is 111110. 7 is 11111110. You get the idea.

Also, remember we're just estimating cardinality, not keeping an exact count: we're estimating whether we've seen about 2^n things, not exactly n things. So counting to a billion takes roughly: 1 billion ~= 2^30 ~= 31 bits in unary.

So you can store all these counters in a big array of bits (a bitmap) and index into it using a hash function. As you probably guessed, all these unary counters will end up overlapping each other in weird ways.

But! The cool thing is that if you insert and retrieve multiple counters bloom-filter style, using different hash functions, and then average the results, the more of these probabilistic, messed-up counters you average, the better the results get!

It's kind of the opposite of the usual programmer "junk in, junk out" problem. In this case, the more junk you combine, the better the results. Pretty neat, huh? :)
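
If you want to see the core trick in code, here's a tiny, unoptimized HyperLogLog sketch in Python. It uses plain dense registers rather than the unary-in-a-bitmap variant described above, skips the small/large-range corrections the full algorithm applies, and the alpha constant is the usual bias correction for large register counts:

    import hashlib

    class SimpleHLL:
        def __init__(self, p=14):             # p=14 -> m=16384 registers, like Redis
            self.p = p
            self.m = 1 << p
            self.registers = [0] * self.m

        def add(self, item):
            # 64-bit hash: low p bits pick a register, the rest feeds the rank.
            h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], 'big')
            idx = h & (self.m - 1)
            rest = h >> self.p
            rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
            self.registers[idx] = max(self.registers[idx], rank)

        def count(self):
            alpha = 0.7213 / (1 + 1.079 / self.m)
            # Harmonic mean of 2^register over all registers.
            return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)

    hll = SimpleHLL()
    for i in range(100000):
        hll.add("user:%d" % i)
    print(int(hll.count()))   # should land within a few percent of 100000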

I don't have links to the papers behind these ideas, sorry. I learnt most of this from a presentation by Alex Sharp at PyCon AU. Edit: found a link to the talk - http://pyvideo.org/video/1621/big-data-with-python

This article explains some of this stuff a bit better: http://highscalability.com/blog/2012/4/5/big-data-counting-h...


Averaging the results correctly must be an interesting problem in itself?

I believe that standard averages wouldn't necessarily give you the maximum likelihood.


To be honest, I'm not an expert at this stuff—merely enthusiastic about it. I could have some details wrong.

My intuition says that a straight average should give you fairly good results, although antirez talks a bit about bias in HyperLogLog counts that he's trying to adjust for.

So maybe an average + some bias correction gives you better results?


On light reflection: probably if you do N tests, the modal result would be a good enough choice, or perhaps the median.

The issue is that (e.g.) the arithmetic mean would be heavily influenced by the runs that gave randomly bad results - the very results you are trying to exclude by doing the test multiple times and taking an average.


So I think this is roughly what antirez is talking about with 'registers' for a counter.

From the article:

The standard error of HyperLogLog is 1.04/sqrt(m), where “m” is the number of registers used. Redis uses 16384 registers, so the standard error is 0.81%.

My understanding is that the error is very small because you're averaging over 16,000 registers.
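
The quoted figure is easy to sanity-check:

    import math
    print(1.04 / math.sqrt(16384))   # 16384 registers -> 0.008125, i.e. ~0.81%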

Edit: also, I think you want the mean because it'll give you back a precise/fractional power of 2 that is pretty close to the actual count. The median would give you the nearest integer power of two, which is probably not a good estimate of the real count.

Double edit: looks like there's some research around improving averaging by throwing away outliers: http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...


It uses the harmonic mean, not the arithmetic mean.


Makes sense (as a heuristic?), thanks a lot.


With the name, I started reading thinking it was an April Fools' joke; then I went down the rabbit hole of reading the post and every linked paper.


Plot twist: it actually is an April Fools' joke.


Actually your post fooled me.


The idea of inexact computing seems to be gaining a lot of traction over the past few years.

The HyperLogLog data structure reminded me of the research at Rice University on inexact processors: http://www.ece.rice.edu/~al4/lingamneni-cf12.pdf

There's a lot to gain by being less precise!


The future of computation is probabilistically inexact numbers over an interval.

http://en.wikipedia.org/wiki/Interval_arithmetic

We currently have a huge over-tolerancing problem. We don't need exact math most of the time.

http://web.media.mit.edu/~bates/Summary_files/BatesTalk.pdf


Hypothesis: inexact computations will be required to make strong AI work in the medium term.

You can't simulate a brain accurately (and you probably don't need to). Neurons aren't very accurate themselves, and it does the job fine.


Top-of-the-line supercomputers consume ~10 megawatts of power; a human brain uses about 20 watts. Different problems, yes, but the gap in magnitude is staggering.


See also: BlinkDB, a database system providing "Queries with Bounded Errors and Bounded Response Times on Very Large Data"

http://blinkdb.org/


We're used to inexact computing from floating point math.


Well, inexact and probabilistic computing do have some differences.


If I've understood correctly, this relates to the German Tank Problem: using statistical analysis to estimate a count.

With German Tanks, you have limited data available to create an estimate from (a few serial numbers, in the canonical case). But in this case, you deliberately limit the data stored in order to keep it at a manageable level: hash to randomise and then store only summary stats.

Pretty cool. Actually very cool, I really like it.

However, are there any interesting use cases for this? It seems to me that it's not particularly useful until your list is long enough to make search/sort/storage difficult: that's rare even at web-scale, no?


Hello, the use case is that you can count what you usually don't store (for example, the different IPs connecting to your server every day), or count what is stored but would be too expensive to count otherwise (like iterating over a database table and counting the unique items in a given column).

Also, since it is so cheap, you can start counting even things that would be unreasonable to count otherwise but that may provide useful insight, for example unique users accessing the site for every hour of the day, unique words ever used by a given user in her comments, and so forth.
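
For example, counting unique visitors per hour is one PFADD per request. A sketch with the redis-py client (key names are arbitrary, and multi-key PFCOUNT returning the union's cardinality depends on your Redis version):

    import time
    import redis

    r = redis.Redis()   # assumes a local Redis built with the PF commands

    def record_visit(ip):
        # One HLL per hour; each key stays small no matter how much traffic.
        r.pfadd("visitors:%s" % time.strftime("%Y%m%d%H"), ip)

    def unique_visitors(hours):
        # Multi-key PFCOUNT (where supported) returns the cardinality of the union.
        return r.pfcount(*["visitors:%s" % h for h in hours])

    record_visit("203.0.113.7")
    record_visit("203.0.113.7")    # duplicates do not inflate the count
    record_visit("198.51.100.23")
    print(r.pfcount("visitors:%s" % time.strftime("%Y%m%d%H")))   # -> 2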


On a related note, the recently released HyperLogLog extension for postgres: https://github.com/aggregateknowledge/postgresql-hll


We used this data structure a few months ago to track an approximation of the cardinality of the followers of a set of Twitter users, which is handy for calculating the reach of a Twitter post. More recently, we used it again to track an approximation of the number of unique images that a user generated in a time period (each image is identified by a different URL).

Very handy data structure, and glad to see it land in Redis!


This blog post is a nice survey of other interesting probabilistic data structures: http://highlyscalable.wordpress.com/2012/05/01/probabilistic...


The funny thing is that whether or not it's an April Fools' joke, I am not able to verify it, because it's so well written (technically) :/


No April Fools at all :-) You can build the "unstable" branch and play with it.


Thanks for the clarification.

It's even more confusing for French speakers, as "Philippe Flajolet" really sounds like a made-up name ("Flajolet" sounds like the word for white beans).


Yeah it would be like seeing a paper written by "Peter Potato" or something like that...


Nick's blog is a must read for these sort of algorithms. See his original writeup on the HLL here: http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...


This is awesome on so many different levels.

Check out the implementation: https://github.com/antirez/redis/blob/unstable/src/hyperlogl...


The Druid.io folks recently blogged about three additional optimizations to HyperLogLog that compact the registers and can improve query speed:

https://groups.google.com/forum/#!searchin/druid-development...


I know that projects like http://druid.io use HLL to do time series analytics. Can someone explain, or link to an explanation of, how counting set cardinality can be used for time series analysis? (Do I need to add that I haven't found a satisfactory explanation myself? :)

Naively, I think it could be done by factoring a timestamp into the things to count:

  PFADD SOMEHLLVAR "#{timestamp1}#{event1}" "#{timestamp}#{event2}"
  PFADD SOMEHLLVAR "#{timestamp2}#{event3}" "#{timestamp}#{event2}"

  PFCOUNT "#{timestamp1}#{event1}" # -> 1 
  PFCOUNT "#{timestamp2}#{event1}" # -> 0
etc..

The problem with doing that is that you would need to iterate through each second in a given range to find the count of specific events... Also, it seems like a pretty wasteful way of encoding time <edit> thinking about it, it's probably not wasteful in the sense that the HLL size should stay the same, but it's probably sub-optimal in some other way
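
A less wasteful scheme (just a guess, not necessarily what Druid does) might be to keep one HLL per time bucket and event type and merge the buckets at query time, e.g. with redis-py:

    import redis

    r = redis.Redis()

    def track(event, user_id, hour):
        # One HLL per (event, hour) bucket, e.g. "hll:click:2014040112".
        r.pfadd("hll:%s:%s" % (event, hour), user_id)

    def uniques(event, hours):
        # Merge the hourly buckets and count once, instead of iterating
        # over every second in the range.
        keys = ["hll:%s:%s" % (event, h) for h in hours]
        r.pfmerge("hll:tmp:%s" % event, *keys)
        return r.pfcount("hll:tmp:%s" % event)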


A very good post that helps explain HyperLogLog: http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-...


Does anybody have a good example of this? E.g. to count unique hits over time without storing each hit?


Here's a cool example of using HLL in javascript to dynamically show set union/intersections for huge sets:

http://www.aggregateknowledge.com/science/blog/venn.html


The top answer about HyperLogLog on stackoverflow links to this awesome visualization: http://www.aggregateknowledge.com/science/blog/hll.html

SO Answer: http://stackoverflow.com/a/13074072/307293

Full blog post: http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-...


Very likable naming scheme ;) Kudos for the extraordinary work on redis. cheers


The HLL principle seems quite similar to this way of approximating a biased coin with a fair one by counting the 1s in binary

http://amakelov.wordpress.com/2013/10/10/arbitrarily-biasing...


Oh man, I started on this and native bloom filters a summer ago and never finished. These are awesome data structures when you're dealing with really large sets and only need a cardinality count or a set membership test, not all the members of the set. I can't wait to start using them!


Cool. I wonder if bloom filters might be up next?


There already exist SETBIT (for adding items), GETBIT (for querying items) and BITOP (for the union operation), which together can be used to implement bloom filters. HyperLogLog requires multi-bit registers (6 bits long in Redis), which would be inefficient with the current design, so I guess that's why Redis is adding a specialized implementation for it.
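
For example, a minimal client-side bloom filter on top of those commands could look like this with redis-py (M and K are arbitrary here; a real deployment would size them from the target false-positive rate, and BITOP OR gives you unions):

    import hashlib
    import redis

    r = redis.Redis()
    M = 8 * 1024 * 1024    # filter size in bits
    K = 7                  # number of hash functions

    def _positions(item):
        # Derive K bit positions by slicing one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(K):
            yield int.from_bytes(digest[4 * i:4 * i + 4], 'big') % M

    def bloom_add(key, item):
        for pos in _positions(item):
            r.setbit(key, pos, 1)

    def bloom_check(key, item):
        # May return false positives, never false negatives.
        return all(r.getbit(key, pos) for pos in _positions(item))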


I guess there are a couple of Lua implementations. However, this pushes some logic onto the client side (generate hashes, get the number of hashes, set the appropriate bits) that the server should ideally handle. I have experimented with adding native bloom filters here: https://github.com/devendralaulkar/redis/tree/bloomfilter Basically they reuse the bit operations internally and set multiple bits in a single command.


Couldn't HyperLogLog be implemented efficiently using Lua scripting?


Yes, it's possible. I did this a few years ago as a learning project, and the relevant Lua scripts can be seen here: https://gist.github.com/dbro/9920666

... but I wouldn't claim that it's done efficiently. From the comments in the code's tests:

    # some results based on running speed tests on a lightweight netbook:
    # standard error = 0.01, 10kb of registers
    # < 1 ms per init()
    # < 1 ms per update()
    # < 1 ms per count() for single set
    # ~30 ms per set for count() of union of sets


I haven't tried (so take this with a grain of salt), but I would guess it's certainly slower than the native implementation for various reasons. One reason that comes to mind is that Lua does not yet support an efficient integer type (AFAIK 5.3 will support it).


Elasticsearch started using HyperLogLog recently for aggregating data. No April Fools there!


How does HyperLogLog compare to Count-Min-Sketch?


One basic difference between Count-Min Sketches and HyperLogLogs is that a Count-Min Sketch estimates the multiplicity of individual items in a multiset, while a HyperLogLog estimates the cardinality of a set.

For example, put 3 copies of item 'A' (where each 'A' is indistinguishable from another 'A') and similarly 5 copies of item 'B' into both a Count-Min Sketch and a HyperLogLog.

When you query the Count-Min Sketch for 'A', you will probably get back 3. You can also query for 'B' and get back 5.

When you query the HyperLogLog, you ask how many unique items are in the set, in this case 2, and the HyperLogLog will return a number near 2. You could query the HyperLogLog for whether or not 'A' is in the set, but HyperLogLogs are not designed to be accurate for that type of query. It would return true in this case, and in general there are no false negatives for set membership queries in a HyperLogLog, but I believe it would be less accurate than a Bloom filter for set membership queries.
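
A toy example of that difference, with a minimal Count-Min Sketch (the exact distinct count at the end stands in for what a HyperLogLog would approximate):

    import hashlib

    class CountMinSketch:
        def __init__(self, width=2048, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _idx(self, item, row):
            h = hashlib.sha256(("%d:%s" % (row, item)).encode()).digest()
            return int.from_bytes(h[:8], 'big') % self.width

        def add(self, item, count=1):
            for row in range(self.depth):
                self.table[row][self._idx(item, row)] += count

        def query(self, item):
            # May overestimate (collisions), never underestimates.
            return min(self.table[row][self._idx(item, row)]
                       for row in range(self.depth))

    cms = CountMinSketch()
    stream = ["A"] * 3 + ["B"] * 5
    for item in stream:
        cms.add(item)

    print(cms.query("A"))     # 3: multiplicity of 'A'
    print(cms.query("B"))     # 5: multiplicity of 'B'
    print(len(set(stream)))   # 2: the distinct count a HyperLogLog would estimate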


If you made the number of registers (m) extremely large and the register width 1, then a bloom filter and an HLL would be almost exactly the same (except that bloom filters use multiple hashes of the same input, while an HLL splits one hash into two parts).

You can also use a bloom filter as a cardinality counter, assuming it isn't too full yet: http://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_...

HLLs are designed to be useful for cardinality counting past the point where a bloom filter would be over capacity. An HLL could be used for set membership, but after some threshold the false positive rate approaches 100% at a super-linear rate.
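
The estimator from that Wikipedia section is n* = -(m/k) * ln(1 - X/m), where X is the number of set bits, m the filter size in bits and k the number of hashes; for example:

    import math

    def bloom_cardinality(set_bits, m, k):
        # Estimated number of distinct insertions, from the filter's fill level.
        return -(m / k) * math.log(1 - set_bits / m)

    print(round(bloom_cardinality(set_bits=120000, m=1000000, k=7)))   # ~18k items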


I have realized how mediocre I am.



