https://github.com/cloudflare/cloudflare-blog/blob/master/20...
First, I used a hash function built on the aesni (aesenc) instruction. See this:
https://gist.github.com/majek/96dd615ed6c8aa64f60aac14e3f6ab...
While I have little proof that it's a good hash, it seems good enough, and is _slightly_ (5-10%?) faster than siphash24 in this context.
Then I mixed computing the hash with finding newlines \n. This allows me to do only one load of user data into XMM registers.
Most importantly, to offset the RAM latency cost, I'm doing 64 prefetches as I parse the input, and only after that do I actually touch the hash table. The memory latency is still the biggest time sink, but at least this seems to speed up the program 2x or more. Without this batching+prefetch the hash table takes 6-8 seconds; with batching it goes down below 3s.
I suspect linear probing / open addressing of the hash table may carry some penalty. While it plays nicely with the cache prefetch, it generally leads to longer chains. This means we need to keep the hash table sparse, not loaded above 0.6-0.75. See this:
https://en.wikipedia.org/wiki/File:Hash_table_average_insert...
from https://en.wikipedia.org/wiki/Hash_table
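To make the batching+prefetch idea concrete, here is a minimal sketch (not the actual mmuniq code; hash_line, table_probe and the table layout are illustrative placeholders): compute a batch of hashes first, issue a prefetch for each target bucket, and only then probe the table.

    #include <stdint.h>
    #include <stddef.h>

    #define BATCH 64

    extern uint64_t hash_line(const char *line, size_t len); /* e.g. the aesni hash */
    extern int table_probe(uint64_t hash);                   /* linear-probing lookup/insert */
    extern uint64_t *buckets;                                 /* table storage */
    extern size_t table_mask;                                 /* table size - 1, power of two */

    void process_batch(const char **lines, const size_t *lens, int n)
    {
        uint64_t h[BATCH];

        /* Pass 1: compute hashes and issue prefetches, no table access yet. */
        for (int i = 0; i < n; i++) {
            h[i] = hash_line(lines[i], lens[i]);
            __builtin_prefetch(&buckets[h[i] & table_mask]);
        }

        /* Pass 2: by now most target buckets should already be in cache. */
        for (int i = 0; i < n; i++)
            table_probe(h[i]);
    }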
Probably not as cache-friendly, but did you, at any point, evaluate a Radix Tree [0] or a Patricia Trie / CritBit Tree [1][2] as the underlying data structure in place of the hash table? These also compact nicely into one of the many succinct data structures [3][4].
Radix Tree, in particular, seems to work really well for IPs from what I've read [5].
I'm a bit confused: you are now storing the IPv4 addresses in a hash table using a 64-bit hash?
Why not just use the 32-bit address as a key, and grow the 'blocks': if two addresses are just a couple of digits apart, promote them to a /24 block, etc.?
Apologies, maybe I oversimplified the original problem. I'm dealing with IPs (both v4 and v6), subnets, and ranges (which may or may not align to subnets). These map to one or more datacenter numbers.
I could indeed define a data model, parse the data thoroughly, optimize the in-memory data structure, and so on. That requires a rigid data structure, knowing the access patterns, and understanding the problem space. I'm not there yet. Instead, I created this generic tool which works with any text files, and fell into a rabbit hole of over-optimizing it. That's it.
FWIW, you should be able to represent individual IPs, ranges, and subnets all in CIDR notation, though for ranges you may need multiple CIDR entries to cover the whole range.
An IPv4 CIDR consists of the 32-bit address and a 32-bit mask, so with some bit packing you can uniquely represent one in 64 bits without hashing.
The problem you'll run into there is doing a "contains" check on an origin IP against a list of CIDRs, but presumably you already need to do that, since you're dealing with subnets.
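A rough sketch of both ideas, assuming the prefix length is packed into the low bits of the key (names are illustrative):

    #include <stdint.h>

    /* Pack an IPv4 CIDR entry into a single 64-bit key: address high, prefix low. */
    static inline uint64_t cidr_key(uint32_t addr, uint8_t prefix_len)
    {
        return ((uint64_t)addr << 32) | prefix_len;
    }

    /* "Contains" check of one CIDR entry against an origin IP. */
    static inline int cidr_contains(uint64_t key, uint32_t ip)
    {
        uint32_t addr = (uint32_t)(key >> 32);
        unsigned len  = (unsigned)(key & 0xff);
        uint32_t mask = len ? ~0u << (32 - len) : 0;
        return (ip & mask) == (addr & mask);
    }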
May I suggest you have a look at Swiss Tables? There's a good talk from CppCon [0] that explains its design decisions, and implementations are available as part of Abseil [1] and the Rust stdlib [2].
I'll check your new aesni hash, but I'd recommend trying a simple, fast 32-bit hash instead, like the builtin crc32c. It needs half as much space, and cache misses are what will kill you here: half the entry size means roughly half the cache misses.
And a simple linear-probing hash table instead of cuckoo will also mean fewer cache misses. There are no deletions. Should be about 20% faster, I think.
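For reference, a minimal sketch of what I mean by the builtin crc32c, assuming the SSE4.2 CRC32 instruction via intrinsics (compile with -msse4.2):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <nmmintrin.h>

    /* 32-bit CRC32C hash over a buffer, 8 bytes at a time. */
    static uint32_t crc32c_hash(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint64_t crc = 0xffffffff;
        while (len >= 8) {
            uint64_t v;
            memcpy(&v, p, 8);
            crc = _mm_crc32_u64(crc, v);
            p += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8((uint32_t)crc, *p++);
        return (uint32_t)crc ^ 0xffffffff;
    }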
So I tested your aesnihash with smhasher. It's really bad. A much better 64-bit aesni variant would be falkhash, which is 4x faster, supports a seed, and passes most tests.
The main point of this hash, in this context, is to do a streaming hash and find \n in one loop. The intention is to reduce the number of data loads (_mm_loadu_si128); I already have the user data in xmm0, so why not do some aesni on it already? Because it's streaming, I can't, for example, derive the initial seed from the chunk length, since it's unknown at the time the hash is called. See:
I don't need a full AES hash, but maybe that could be an option as well.
In other words, in my case I don't care just about hash() speed. I care about memchr() + hash() speed. I would like to understand/measure the hash quality itself. Maybe adding another aesenc round would be sufficient to fix it.
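Roughly, the single-load idea looks like this (an illustrative sketch, not the actual gist code; the mixing step here is purely schematic and says nothing about hash quality):

    #include <stdint.h>
    #include <immintrin.h>   /* compile with -msse2 -maes */

    /* Consume one 16-byte chunk: mix it into the running hash state and
       return a bitmask of '\n' positions so the caller can split lines. */
    static inline unsigned scan_chunk(const uint8_t *p, __m128i *state)
    {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);        /* the single data load */
        __m128i nl    = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\n'));

        /* The chunk is already in an XMM register, so run an AES round on it. */
        *state = _mm_aesenc_si128(_mm_xor_si128(*state, chunk), *state);

        return (unsigned)_mm_movemask_epi8(nl);   /* bit i set => p[i] == '\n' */
    }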
Even falkhash is not that great, and features an abnormal number of collisions as the number of hashes increases. Basically, all "naive" AES implementations share this design weakness.
A blocked bloom filter works on cache blocks <https://www.tutorialspoint.com/blocked-bloom-filter>. It takes more memory but far fewer hits to RAM, so much better caching behaviour. It should solve your problem.
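A minimal sketch of the blocked layout, assuming 64-byte cache lines and double hashing within a block (parameters are illustrative):

    #include <stdint.h>

    #define BLOCK_BITS 512              /* one 64-byte cache line */
    #define K 8                         /* bits per key, illustrative */

    struct blocked_bloom {
        uint64_t *blocks;               /* nblocks * (BLOCK_BITS/64) words, ideally 64-byte aligned */
        uint64_t  nblocks;
    };

    /* Returns 1 if the key was probably seen before, 0 if definitely new (and marks it).
       The first hash picks a single block; all K bits live inside that block,
       so a lookup touches at most one cache line. */
    static int bb_test_and_set(struct blocked_bloom *bf, uint64_t h1, uint64_t h2)
    {
        uint64_t *block = bf->blocks + (h1 % bf->nblocks) * (BLOCK_BITS / 64);
        int seen = 1;
        for (int i = 0; i < K; i++) {
            uint64_t bit  = (h2 + (uint64_t)i * h1) % BLOCK_BITS;  /* double hashing */
            uint64_t mask = 1ull << (bit & 63);
            if (!(block[bit >> 6] & mask)) {
                seen = 0;
                block[bit >> 6] |= mask;
            }
        }
        return seen;
    }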
Thanks to both of you for pointing it out, and providing a non-broken link. I wish HN would respect the angle bracket conventions of RFC 2396, but it's not a major failing.
If I skip two useless pipes, and use "sort -u logs.txt > /dev/null" instead, I'm already twice as fast as the original (it seems that piping into sort effectively prevents parallelization).
but `sort -u` on its own is only marginally faster than sort|uniq.
I can't remember exactly what LANG=C does, but I think it makes sort not need to do some fancy Unicode stuff? If the person writing the article just needs to uniquify IP addresses, they should use it.
I think I used to use "env LC_ALL=C". I like to set "all" the things and I'm also fond of the useless use of env...
This does run faster, but it's also important if you want a predictable order and to distinguish all distinct strings, i.e. the behaviour you would get if you implemented your own text sort naively.
From Googling just now:
"sort -u doesn't report unique lines, but one of each group of lines that have equal sorting order. So if you do want unique lines, you need a locale where characters are byte and all characters have different sorting order (which the C locale guarantees)."
Using cat keeps the pipeline reading left to right, which helps readability - that's important.
Performance-wise, cat gives you 64KiB blocks of data (the pipe buffer), while reading the file directly can give more. My programs (mmuniq-*) use a 512KiB input buffer, so indeed with redirection you can reduce the number of read/write syscalls 8x, but frankly it doesn't change the timing much.
Parallelization is an interesting aspect, which we didn't discuss really.
Even though I agree with another commenter that it's surprising the author used a bloom filter instead of a hash map as a baseline, this article is still an excellent small walkthrough of quick ad-hoc low-level performance profiling.
I went down the Bloom Filter rabbit hole for a project a while back and then wondered if I could actually fit the entire set space in memory. The set was IDs up to 2^32, so basically a giant bit array. I believe the IPv4 universe of this article is actually the same size as in my project. I coded up my project using Java bit arrays and got things working decently well, using a big heap. Then I found out there are a bunch of compressed bit array libraries such as EWAH and Roaring Bitmap. When I substituted Roaring for the stock Java BitSet implementation, I saw space and computation improve by many orders of magnitude. Roaring uses a bunch of tricks to achieve this, but it mostly comes down to encoding large runs of 1s or 0s. Obviously not as tiny as bloom filters, but still pretty small for most modern machines if you have a sparse set. https://www.roaringbitmap.org/
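For anyone who wants to try it from C, a small sketch using the CRoaring port (I'm going from memory on the API, so double-check the exact names):

    #include <stdio.h>
    #include <roaring/roaring.h>

    int main(void)
    {
        roaring_bitmap_t *seen = roaring_bitmap_create();

        /* Mark a few IPv4 addresses (as 32-bit integers) as seen. */
        roaring_bitmap_add(seen, 0x01020304u);   /* 1.2.3.4 */
        roaring_bitmap_add(seen, 0x01020305u);   /* 1.2.3.5 */

        /* Runs of consecutive addresses compress into run containers. */
        roaring_bitmap_run_optimize(seen);

        printf("cardinality: %llu, bytes: %zu\n",
               (unsigned long long)roaring_bitmap_get_cardinality(seen),
               roaring_bitmap_size_in_bytes(seen));
        printf("contains 1.2.3.4: %d\n", roaring_bitmap_contains(seen, 0x01020304u));

        roaring_bitmap_free(seen);
        return 0;
    }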
+1 to roaring bitmap. I've actually implemented a PoC in golang with roaring bitmap to keep IPv4 maps (unfortunately it's closed-source, at least for now) and it seems to be quite efficient, both performance-wise and memory-wise, at least in the "millions of entries" range.
Bloom filters are ideal candidates for answering the question 'Is x _not_ in the set?'. If the answer is yes, nothing, including the item you're asking about, ever hashed to those locations. If the answer is no, all you know is that something hashed there, possibly your item, possibly not.
It's not that simple. If the correctness of your program relies on the "not in set" test being accurate, you're going to need to make the filter huge, and slow.
Probabilistic data structures are about trading off correctness and performance. If you try to push the correctness up to near perfect, they'll quickly stop making sense and you should just use an actual perfect algorithm instead, as the author did.
Bloom filters are great for early outs, where you can save a chunk of computation on a definite negative, but still be correct in case of false positive.
False positives mean you're throwing away genuinely unique items that hash the same. If you use a 256-bit Bloom filter to process a billion items, you'll get at most 256 results that are indeed unique.
The rate of false positives you require out of the data structure is key. If your program is correct with a 20% false positive rate, you're golden. If the goal is more or less 0, look elsewhere.
The author addresses precision, but not in a way that questions whether a Bloom filter is indeed the right tool for the job.
But you don’t care about false positives, only the definite negatives! I was trying to point out that the top comment's concern about the utility of a bloom filter is already covered in the post.
I'm not sure why the author started with bloom filters instead of a hash table, to be honest. The workload seems ideal for a hash table. Interesting to see how big the performance gap was though, I wouldn't have expected such a difference. It probably comes down to the fact that the bloom filter has to spread its data across many locations so if the set is particularly large it's always going to lose to a hash table due to hitting main memory more times (in scenarios where you can use either one, at least).
I suppose they started with a bloom filter because they intended to use one when consuming the data, i.e. checking incoming requests against a 'malicious IP' set?
The secret is that I used a bloom filter anyway, as the next stage in the pipeline. I needed to load the IP whitelist/blacklist data quickly and rapidly compare packets against it. I had tooling to do bloom filters already.
I also had an image in my mind that bloom filters are perfect for such a use case: a "set" data structure with some adjustable loss (probabilistic) parameters. I thought Bloom filters were underappreciated. I was wrong. As we learned, a "set" is better done with a good old hash table.
A radix tree seems like a good fit; I was bored one day and implemented one as a CPython module in a couple of hours. I think I had some idea around IP addresses to use it for.
--edit--
Actually, it was a hex tree (a radix tree specialized for hex digits). I think someone posted a link to the paper on it a while back.
> For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter.
This is an assumption based on a very naive understanding of how packets get delivered on the Internet. I for one wouldn't enjoy being blocked from CloudFlare sites just because of poor routing or peering.
The author elaborated in another post: it's also ipv6 addresses, ranges, cidr notation, and the data center id number. It's not as simple as just an ipv4 address.
IMHO this was critical information, and leaving it out muddled the article. I think most people who think critically about their data (i.e., everyone who would be interested in the article) would have the same thought. I sure did.
Even better: use mmap to allocate a huge virtual array, which initially will use no memory at all, and let the OS allocate physical pages to it as needed.
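A minimal sketch of that on Linux (sizes assume the full IPv4 space as a bit array):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Reserve 512 MiB of virtual address space for 2^32 bits; physical pages
       are only faulted in (zero-filled) for the parts that actually get touched. */
    static uint8_t *alloc_ip_bitmap(void)
    {
        size_t size = 1ull << 29;   /* 2^32 bits / 8 = 512 MiB */
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        return p == MAP_FAILED ? NULL : (uint8_t *)p;
    }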
Blush. However will I cope with all this extra traffic?
One common way to improve bloom filter cache performance is to divide the filter into blocks; elsewhere it's mentioned doing it per cache line, but it would be interesting to see how much performance would be gained with a more naive approach, for instance splitting the filter into 4KiB pages.
I've done this for disk-backed filters, but never looked to see if it improved performance generally.
Another solution is to change the order things are checked in. With optimum k, each hash takes about 1/2 of the possibilities out. If he'd batch it so the memory accesses are roughly ordered, it would be much faster.
EDIT:
So, for instance, if k = 19, that means there are 19 hash functions. If he goes through all the inputs, and checks: are there any hashes here falling within the first 1/19 of the memory space? If not, keep it. If so, check whether any are unset. If not, keep it. If they are, zero the pointer to the input in the array. After this is done, he should be rid of roughly 32% (0.5 * (1-(18/19)^19)) of candidates. The second pass throws out 33% of candidates, and so on and so forth.
He could even keep an absurdly large value for k and n: if it is 128G, then k = 19'053, meaning he can use an even finer increment. He'd have to spill the filter to disk, but the access patterns will be great.
Seems like it would be worth comparing to the old "awk '!a[$0] { a[$0]=1; print }'". I would assume that such arrays are implemented internally using hash tables. Probably not as efficient as a C implementation, but the used parts of the interpreter should fit in I-cache, so it should be within a few times as fast.
Yeah, a generic, probably pointer-heavy hash table is definitely going to be worse on memory. I'm surprised that it's that much worse on time, though; I expected it to be closer. I guess the cache misses are worse with such a large table.
I would welcome seeing a comparison in your environment to using the simple 1/2 GB array of bits, with no hashing or storage of IP addresses. (Extra points for hugetlb mapping.)
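To make the suggestion concrete, a sketch of the bit-array membership test (the parsed IPv4 address itself is the index, no hashing; the array could come from the mmap idea mentioned elsewhere in the thread):

    #include <stdint.h>

    /* Returns 1 if this IP was already seen, 0 if it is new (and marks it). */
    static inline int seen_before(uint8_t *bitmap, uint32_t ip)
    {
        uint8_t mask = 1u << (ip & 7);
        int seen = (bitmap[ip >> 3] & mask) != 0;
        bitmap[ip >> 3] |= mask;
        return seen;
    }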
The latency problem must be due to (unnecessary data) dependencies and not taking advantage of superscalar architecture. Modern CPUs support 32 or more in-flight memory operations.
After hiding the latency, the next bottleneck would be instruction throughput, so vector scatter/gather could alleviate that.
Indeed. I actually think a few small modifications are all that is required to significantly improve the performance of this code. I disagree about gather/scatter though, as current implementations are not faster unless the accesses fall in the same cache line.
On a somewhat unrelated note, if you want to handle very large Bloom filters (billions of entries with low false positive rates), there is an open source Java library that can help you do that: https://github.com/nixer-io/nixer-spring-plugin/tree/master/....
Another potentially useful data structure that has good asymptotic complexity, but probably also poor cache locality, is the Van Emde Boas tree [0]. I've never seen one in practice, but they sure make for excellent p-set problems!