Hacker News new | past | comments | ask | show | jobs | submit login
When Bloom filters don't bloom (cloudflare.com)
227 points by jgrahamc on March 2, 2020 | hide | past | favorite | 73 comments



In the final program `mmuniq` I did a couple of, I think, interesting hacks.

https://github.com/cloudflare/cloudflare-blog/blob/master/20...

First, I used a hash function using aesni (aesenc) instruction set. See this:

https://gist.github.com/majek/96dd615ed6c8aa64f60aac14e3f6ab...

While I have little proof it's a good hash, it seems enough, and is _slightly_ (5-10%?) faster than siphash24 in this context.

Then I mixed counting hash with finding new lines \n. This allows me to do only one user-data load into XMM registers.

Most importantly, to offset the RAM latency cost, I'm doing 64 prefetches as I parse the input, and only after this I actually touch the hash table. The memory latency is still the biggest time sync, but at least this seem to speed up the program 2x or more. Hash table without this batching+prefetch is 6-8 seconds. With batching goes down below 3s.

I suspect linear probing / open addressing of the hash table may have some penalty. While it plays nicely with the cache prefetch, it generally leads to longer chains. This means we need to keep the hash table sparse, not loaded above 0.6-0.75. See this

https://en.wikipedia.org/wiki/File:Hash_table_average_insert...

from https://en.wikipedia.org/wiki/Hash_table


Thanks for the excellent write-up.

Probably not as cache-friendly, but did you, at any point, evaluate Radix Tree [0] or Patricia Trie / CritBit Tree [1][2] as an underlying data-structure for the hash-table? These also compact nicely into one of the many succinct data-structures [3][4].

Radix Tree, in particular, seems to work really well for IPs from what I've read [5].

---

[0] https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-lin...

[1] https://news.ycombinator.com/item?id=6920862

[2] https://news.ycombinator.com/item?id=3015246

[3] https://news.ycombinator.com/item?id=2348619

[4] https://news.ycombinator.com/item?id=3650657

[5] https://news.ycombinator.com/item?id=18921058


I'm a bit confused, you are now storing the IPv4 addresses in a hash table using a 64-bit hash?

Why not just use the 32-bit address as a key, and grow the 'blocks' so if two addresses are just a couple of digits apart, promote it to a /24 block etc.


Apologies, maybe I oversimplified the original problem. I'm dealing with IP's (both v4 and v6), subnets, ranges (which may or may not align to subnets). These map to one or more datacenter numbers.

I could indeed define data model, parse the data thoroughly, optimize in-memory data structure, and so on. That requires rigid data structure, knowing access pattern and understanding the problem space. I'm not there yet. Instead, I created this generic tool which works with any text files, and fell into a rabbit hole of over-optimizing it. That's it.


FWIW, you should be able to represent individual IPs, ranges, and subsets all in CIDR notation, tho for ranges you may need multiple CIRD entries to reflect the whole range.

CIDR for ipv4 consists of the 32 bit address and a 32 bit mask, so with some bit packing you can uniquely represent them in 64 bits without hashing.

The problem you’ll run into there is doing a “contains” check on an origin IP for a list of CIDRs, but you’ll need to do that currently since you’re dealing with subnets, I assume.


32 bit mask is way too generous, you only need 5 bit masklen. It all doesn't matter though since they have v6 addresses and ranges.


Would save a lot of space to just have separate lists for each mask length etc.


May I suggest you to have a look at Swiss Tables? There's a good talk from CppCon [0] that explains its design decisions, and implementations available as part of Abseil [1] and Rust stdlib [2].

[0] https://www.youtube.com/watch?v=ncHmEUmJZf4 [1] https://github.com/abseil/abseil-cpp [2] https://github.com/rust-lang/hashbrown


I'll check your new aesni hash, but would recommend to try a simple fast 32bit hash instead, like the builtin crc32c. It needs half as much space and the cache misses will kill you essentially. Twice lesser cache misses.

And a simple linear hash table instead of cookoo will also help in less cache misses. There are no deletions. Should be 20% faster, I think.

Or even gperf.


So I tested your aesnihash with smhasher. It's really bad. A much better 64bit aesni variant would be falkhash, which is 4x faster, supports a seed and passes most tests.

https://github.com/gamozolabs/falkhash nasm -f elf64 -o falkhash-elf64.o falkhash.asm


<update> Crap. That gist contained some draft version of aesnihash. SORRY. See updated version https://gist.github.com/majek/96dd615ed6c8aa64f60aac14e3f6ab... </update>

Thanks for spending time on this. I would like to understand what "really bad" and "fails most of the tests" means.

For the record, the commit: https://github.com/rurban/smhasher/commit/10f56385f3e9abb018...

The main point of this hash, in this context, is to do streaming hash and find \n at in one loop. The intention is to reduce data loads _mm_loadu_si128 (I already have user data in xmm0, so why not do some aesni already?). Because it's streaming I can't for example derive the initial seed based on the chunk length, since it's unknown at the time of calling hash. See:

https://github.com/cloudflare/cloudflare-blog/blob/master/20...

I don't need full aes hash, but maybe that could be an option as well.

In other words, in my case I don't care just about hash() speed. I care about memchr() + hash() speed. I would like to understand/measure the hash quality itself. Maybe adding another aesenc round would be sufficient to fix it.


Even falkhash is not that great, and feature an abnormal amount of collisions as the nb of hashes increase. Basically, all "naive" AES implementations share this design weakness.


What about rewriting one of these[0] hash functions for SIMD?

0: https://nullprogram.com/blog/2018/07/31/


Have you looked at XXH64 from https://cyan4973.github.io/xxHash/ ?

For denser hash table, you need Robin Hood hashing.


Do you have the source data available? Or can you make an anonymou version available?


A blocked bloom filter works on cache blocks <https://www.tutorialspoint.com/blocked-bloom-filter>. It takes more memory but far fewer hits to RAM, so much better caching behaviour. It should solve your problem.

Edit: it will use more RAM than cuckoo filters.


Thanks to both of you for pointing it out, and providing a non-broken link. I wish HN would respect the angle bracket conventions of RFC 2396, but it's not a major failing.


FYI, that closing angle bracket breaks the link.



>time (cat logs.txt | sort | uniq > /dev/null)

If I skip two useless pipes, and use "sort -u logs.txt > /dev/null" instead, I'm already twice as fast as original (it seems that piping to sort effectively prevents parallelization).


For me sort|uniq is nine times slower than

  LANG=C sort -u
but `sort -u` on its own is only marginally faster than sort|uniq.

I can't remember exactly what LANG=C does, but I think the it makes sort not need to do some fancy unicode stuff? If the person writing the article just needs to uniqify IP addresses they should use it.


Oh boy. This is drastic. Thanks for great advice:

  marek:~$ time (cat logs-popcount-org.txt | sort -u | wc -l)
  39057531

  real 2m37.387s
  user 2m35.626s
  sys 0m2.937s

  marek:~$ time (cat logs-popcount-org.txt | LANG=C sort -u -S6G | wc -l)
  39057531
  
  real 0m12.908s
  user 0m42.826s
  sys 0m3.586s


I think I used to use "env LC_ALL=C". I like to set "all" the things and I'm also fond of the useless use of env...

This does run faster, but it's also important if you want a predictable order and to distinguish all strings like you would get if you implemented your own text sort naively.

From Googling just now: "sort -u doesn't report unique lines, but one of each group of lines that have equal sorting order. So if you do want unique lines, you need a locale where characters are byte and all characters have different sorting order (which the C locale guarantees)."


`sort -u` is slower for me.

Doing cat from left to right helps readability - it's important.

Performance-wise cat gives you 64KiB blocks of data, while direct pipe can give more. My programs (mmuniq-*) use 512KiB input buffer, so indeed with redirection you can reduce the number of read/write syscalls 8x, but it doesn't change much of the timing frankly.

Parallelization is an interesting aspect, which we didn't discuss really.


> Doing cat from left to right helps readability - it's important.

People may balk because it's unfamiliar, but this is syntactically legal:

    < logs.txt sort | uniq > /dev/null
That is, the redirection customarily goes at the end, but it doesn't have to.

EDIT: Also, in this specific case, the "sort" command can take a file argument, so you can also do this:

    sort logs.txt | uniq > /dev/null


Even though I agree with another commenter that it's surprising that the author used a bloom filter instead of the hash map as a baseline, this article still is an excellent small walkthrough of quick ad-hoc low-level performance profiling.


I went down the Bloom Filter rabbit hole for a project a while back and then wondered if I could actually fit the entire set space in memory. The set was IDs up to 2^32, so basically a giant bit array. I believe the IPV4 universe of this article is actually the same size as in my project. I coded up my project using Java bit arrays and got things working decently well, using a big heap. Then I found out there are a bunch of compressed bit array libraries such as EWAH and Roaring Bitmap. When I substituted Roaring for the stock Java Bitset implementation, I saw space and computation improve by many orders of magnitude. Roaring uses a bunch of tricks to achieve this, but it mostly comes down to encoding large runs of 1s or 0s. Obviously, not as tiny as bloom filters, but still pretty small for most modern machines if you have a sparse set. https://www.roaringbitmap.org/


And it's not probabilistic, it gives strict answer whether an element is in the set or not.


+1 to roaring bitmap. I've actually implemented a PoC in golang with roaring bitmap to keep IPv4 maps (unfortunately its closed-source, at least for now) and it seems to be quite efficient, both performance-wise and memory-wise, at least in the "millions of entries" range.


Bloom filters are ideal candidates for answering the question 'Is x _not_ in the set?'. If the answer is yes, nothing, including the item you're asking about, ever hashed to that location. If the answer is no, all you know is that something hashed there, possibly your item or not.


Which works perfectly for this case: if the answer is yes, add it to output, otherwise skip.


It's not that simple. If the correctness of your program relies on the "not in set" test being accurate, you're going to need to make the filter huge, and slow.

Probabilistic data structures are about trading off correctness and performance. If you try to push the correctness up to near perfect, they'll quickly stop making sense and you should just use an actual perfect algorithm instead, as the author did.

Bloom filters are great for early outs, where you can save a chunk of computation on a definite negative, but still be correct in case of false positive.


Not sure I follow. For detecting uniques all you care about is the definite negatives? Either way, the author addresses precision in his post.


False positives mean you're throwing away genuinely unique items that hash the same. If you use a 256 bit Bloom filter to process a billion items, you'll get exactly 256 results that are indeed unique.

The rate of false positives you require out of the data structure is key. If your program is correct with a 20% false positive rate, you're golden. If the goal is more or less 0, look elsewhere.

The author addresses precision, but not in a way that questions whether a Bloom filter is indeed the right tool for the job.


But you don’t care about false positives, only the definite negatives! I was trying to address the fact that the top comment on the utility of a bloom filter is covered in the post.


If you don't care about the false positives, you don't need a Bloom filter. Just reject everything.


I'm not sure why the author started with bloom filters instead of a hash table, to be honest. The workload seems ideal for a hash table. Interesting to see how big the performance gap was though, I wouldn't have expected such a difference. It probably comes down to the fact that the bloom filter has to spread its data across many locations so if the set is particularly large it's always going to lose to a hash table due to hitting main memory more times (in scenarios where you can use either one, at least).

I suppose they started with a bloom filter because they intended to use one when consuming the data, i.e. checking incoming requests against a 'malicious IP' set?


The secret is that I used a bloom filter anyway, as the next stage in the pipeline. I needed to load the IP whitelist/blacklist data quickly and rapidly compare packets against it. I had tooling to do bloom filters already.

I also had an image in my mind, that bloom filters are perfect for such a use case - "set" data structure with some adjustable loss (probabilistic) parameters. I thought that Bloom filters are underappreciated. I was wrong. As we learned "set" is better done with good old hash table.


Radix tree seems like a good fit, was bored one day and implemented one as a CPython module in a couple hours--think I had some idea around IP addresses to use it for.

--edit--

Actually, it was a hextree (radix tree specialized for the hexes). Think someone posted a link to the paper on it a while back.


I don't know why the author didn't stick to their shell script, which took all of 2 minutes for a manual data cleanup step. Like, go grab a coffee.


2 minutes was for 40M items. I had 1B items to sort/uniq.


My thought was that, plus "use `make` with an intermediate file". No need to re-run it if the sources haven't changed, right?


> For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter.

This is an assumption based on a very naive understanding how packets get delivered on the Internet. I for one wouldn't enjoy being blocked from CloudFlare sites just because of poor routing or peering.

https://en.wikipedia.org/wiki/Hot-potato_and_cold-potato_rou...


Do you understand how rude you're being? This is an article about Bloom filters, not Internet routing.


Why not just use an array of 2^32 bits -- a half gigabyte -- and leave off hashing altogether?

All it would cost is the excess runtime, which we should not mind giving up unless we smoke.

If necessary, you could have two or more. 256 of them would fit in 128G, which lots of servers have without even needing it all.


The author elaborated in another post: it's also ipv6 addresses, ranges, cidr notation, and the data center id number. It's not as simple as just an ipv4 address.

IMHO this was critical information that muddled the article. I think most people who think critically about their data (ie, everyone who would be interested in the article) should have the same thought. I sure did.


Exactly what I thought as well. I don't know what you mean by excess runtime though, I don't think there would be any (compared to proposed solution).


You give up all the excess runtime you would have devoted to hashing, leaving only the time to fault in the page the bit indexes to.

The implication is that the original poster used hashing out of a preference for wasting time, speculated as an excuse to go out for a smoke.


Even better: use mmap to allocate a virtual huge array, which initially will use no memory at all, and let the OS allocate physical pages to it as needed.


It takes active effort to get all the pages faulted in at startup, and not lazily.

Some of us do that, to take the latency hit just the one time: MAP_POPULATE.


Excellent point that I'd completely missed.


2^32 == 4GiB


2^32 bytes is 4 GiB, but 2^32 bits is 512 MiB. In the GP's hypothetical array, I believe you would index (array[idx >> 3] >> (idx & 7)) != 0


Good point: they could store a whole byte, or a whole word, for each address, not just a bit, and still have lots of RAM left over.


>2^32 == 4GiB

That's bytes, not bits.


> Check out this excellent visualization by Thomas Hurst showing how parameters influence each other:

> https://hur.st/bloomfilter/

Blush. However will I cope with all this extra traffic?

One common way to improve bloom filter cache performance is to divide them into blocks - elsewhere is mentioned doing it to the cache line, but it would be interesting to see how much performance would be gained with a more naive approach, for instance, splitting the filter into 4KiB pages.

I've done this for disk-backed filters, but never looked to see if it improved performance generally.


Another solution is to change the order things are checked in. With optimum k, each hash takes about 1/2 of the possibilities out. If he'd batch it so the memory accesses are roughly ordered, it would be much faster.

EDIT:

So, for instance, if k = 19, that means there are 19 hash functions. If he goes through all the inputs, and checks: are there any hashes here falling within the first 1/19 of the memory space? If not, keep it. If so, check whether any are unset. If not, keep it. If they are, zero the pointer to the input in the array. After this is done, he should be rid of roughly 32% (0.5 * (1-(18/19)^19)) of candidates. The second pass throws out 33% of candidates, and so on and so forth.

He could even keep an absurdly large value for k and n: if it is 128G, then k = 19'053, meaning he can use an even finer increment. He'd have to spill the filter to disk, but the access patterns will be great.


seems like it would be worth comparing to the old "awk '!a[$0] { a[$0]=1; print }'". I would assume that such arrays are implemented internally using hash tables. probably not as efficient as a C implementation, but the used parts of the interpreter should fit in I-cache, so it should be within a few times as fast.


Here you go:

  marek:~$ time (cat logs-popcount-org.txt | awk '!a[$0] {a[$0]=1; print }'|wc -l)
  39057531

  real 0m41.236s
  user 0m38.179s
  sys 0m5.447s
So: sort: 2m, awk 41 seconds. Also, awk used 6.1G of RAM at peak.


yeah, a generic, probably pointer-heavy hash table is definitely gonna be worse on memory. I'm surprised that it's that much worse on time though, I expected it to be closer. I guess probably the cache misses are worse with such a large table though.


Ok, I'll bite again:

  marek:~$ cat logs-popcount-org.txt | perf stat -d awk '!a[$0] { a[$0]=1; print }' > /dev/null 
  
   Performance counter stats for 'awk !a[$0] { a[$0]=1; print }':

           40,318.47 msec task-clock:u             
                   0      context-switches:u       
                   0      cpu-migrations:u         
           1,670,649      page-faults:u            
     112,979,634,215      cycles:u                 
      93,441,976,758      instructions:u           
      18,990,099,679      branches:u               
         208,386,137      branch-misses:u          
      26,093,832,363      L1-dcache-loads:u        
         708,880,979      L1-dcache-load-misses:u  
         464,332,790      LLC-loads:u              
         245,913,835      LLC-load-misses:u        

        40.337768657 seconds time elapsed
  
        36.851718000 seconds user
         3.468126000 seconds sys

Compare this to the optimized approach which has 57M LLC-load-misses, and 7M instructions.


I would welcome seeing a comparison in your environment to using the simple 1/2 GB array of bits, with no hashing or storage of IP addresses. (Extra points for hugetlb mapping.)


The latency problem must be due to (unnecessary data) dependencies and not taking advantage of superscalar architecture. Modern CPUs support 32 or more in-flight memory operations.

After hiding latency, the next bottleneck would be instructions so vector scatter/gather can alleviate this problem.


Indeed. I actually think a few small modification are all that is required to significantly improve the performance of this code. I disagree with gather scatter as currently those implementations are not faster unless in the same cache line.


32 per core? Most modern Intel only supports 10 or 12 outstanding (line) accesses per core. AMD is a bit better. Ice Lake is significantly better.


I think it's per memory controller. Probably per socket.


On a bit unrelated note, if you want to handle very large Bloom filters (billions of entries with low false positive rates) there is an open source Java library that can help you to do that: https://github.com/nixer-io/nixer-spring-plugin/tree/master/....

There is also a command line utility that accompanies the library: https://github.com/nixer-io/nixer-spring-plugin/tree/master/....


Another potentially useful data structure that has good asymptotic complexity, but probably also poor cache locality is the Van Emde Boas tree [0]. I've never seen one in practice, but they sure make for excellent p-set problems!

[0] https://en.wikipedia.org/wiki/Van_Emde_Boas_tree


There is a very interesting talk [1] by Chandler Carruth that I stumbled upon last weekend that is very much related to this

[1] https://www.youtube.com/watch?v=nXaxk27zwlk&t=4678s


The generic sort command is easy, but in this case maybe a custom radix sort would have been faster?



I am surprised, isn't cloudflare a Go shop?


It would be ridiculous to do everything in one language. Different tools for different situations. We use Go, Rust, C++, Python, Lua, JavaScript, ...




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: