I find these optimizations fascinating. Anyone familiar with Lemire likely knows about this, but I listened to a podcast episode[1] with him a few days ago and learned about `simdjson`, the tool he authored that parses JSON at 25x the speed of the standard C++ library.[2][3] It's worth looking at if you're into this sort of thing.
yep, great fan of his work here. thanks for sharing that podcast.
that kind of optimization requires you to know your machine architecture quite well. SIMD optimizations aren't new. but it's always amazing to see these performance increases on a single machine!
our CPUs and GPUs are quite amazing. we have decided, as a field, that we can get enough virtual CPUs, GPUs, or RAM on-demand. and that we shouldn't concern ourselves with things happening at that level.
it's actually a situation that saddens me sometimes.
SIMD convinced me to take college courses in algorithms and to learn higher maths. Things like image decoding rely heavily on doing transformations, and you have to know the math behind them precisely to be able to turn scalar math into vector math effectively.
On top of this, you have to identify what can and cannot be vectorized and how it can be integrated.
Working in SIMD isn't too hard in itself once you get down to the assembly and toss out all the extra stuff compilers add. If you look at how ffmpeg does it, they just assemble the hand-written assembly as the C function itself. Arm64 is very nice to do this in because it has an elegant way of defining vectors and instructions for them.
After listening to the JSON episode, I've been spending the last few days coding up a SIMD implementation of LZ77 substring matching, for use in DEFLATE.
I used Zig, which has first-class SIMD support. As in, no need to go down to the assembly or use intrinsics, or even use a library. I just got it working and haven't had time to profile it, however (I'm new to profiling code).
If you like this stuff, data oriented design[1] is a nice framework that isn't arch specific (but can be extended to be if you need to make more gains that are arch aware). Back when I worked in the PS3/X360 days, engines written for the PS3 had better cache locality (SPUs forced hard limits vs hitting a cache miss) and as a result ran faster on the X360 when ported.
You can do fun things like using radix sort with bucket sizes that line up well with L2/L3 caches (saw a multi-order-of-magnitude speedup for that one) and data-aware layouts that net 10-30x speedups for the same data. Many RDBMSs use similar approaches from what I understand as well.
I was recently disappointed and frustrated to learn that GCC will be enabling vectorization at -O2 soon. The realization that you have to specify both -O3 AND an architecture to get it to use AVX basically invalidated a bunch of benchmarking and testing I had done.
What's the point of building with x86-64-v3 if all your code is built at -O2 without vectorization enabled? Doh!
It's a tricky situation. CPUs reduce their clock speed when you use vector instructions, so you can't enable everything by default. You have to use LTO+PGO or the compiler won't know if your code needs to be vectorised at the expense of slowing down everything else. That's one reason why -O3 is sometimes slower than -O2 or -Os.
If so, I've been following it for a couple years, but I put it out of my mind recently after moving to AMD. I could swear it was an Intel-only project, but a quick scan of that git repo suggests I'm wrong. So either I'm totally misremembering, or AMD support was added later.
Anyway, I can't wait to try that out again. I wonder why most projects don't just use this as their default JSON parser now?
For some reason, writing a less-than-completely-naive JSON parser for C++ seems to be a rite of passage for all C++ developers who eventually go on to great things.
I'm really suspicious about whether this would really work out in most workloads.
On modern architectures, it's the random memory reference that kills performance. Predictive pre-fetching in the pipeline has an impossible time figuring out what to get ready in the caches with random lookups; it's not like a linear sweep. This style of bloom filter needs to be quite large, and it pulls only one cache line in per key lookup; both of these properties make any opportunistic (non-prefetched) caching less effective.
I wonder if real-world benchmarks would prefer a more memory dense lookup that pulls in multiple lines; just because it keeps the cache-lines hotter and less likely to be evicted. On the other hand, writes to this kind of bloom filter invalidate only one cache-line, so that's good. So I'd guess it's all very very workload dependent. Ultimately it depends on cache pressure and how over-subscribed the cache is and the ratio of writes to reads, and the distribution of keys; an implementation might also fare very differently on a system like Graviton2 (which has huge caches) than on a small Xeon.
> I'm really suspicious about whether this would really work out in most workloads.
What kinds of workloads are those? Workloads where the Bloom filter itself fits in cache?
> On modern architectures, it's the random memory reference that kills performance
Yeah, the whole point of the article is that moving from a traditional Bloom filter to a block Bloom filter allows you to improve from N random accesses per query/update to 1 random access per query/update.
> Ultimately it depends on cache pressure and how over-subscribed the cache is
The unspoken assumption in the article is that the Bloom filter is much bigger than cache.
If your Bloom filter fits in cache, or you're writing for something with superfast RAM like a GPU or a Xeon Phi or something, block Bloom filters are wasteful for you.
But I'm pretty sure the common case is writing normal applications running on reasonable general-purpose CPUs with Bloom filter sizes in the hundreds of MB or more. [1] [2]
[1] If the Bloom filter's smaller than hundreds of MB, your data's probably small enough that you decide Bloom filters aren't worth the hassle and just use a HashMap or language equivalent.
[2] If you're using a Bloom filter to do some sort of network stuff, like figuring out the set difference of two peers' objects in some p2p protocol, you probably don't care too much about block Bloom filters, because in this application memory access time is probably going to be dominated by network latency.
> to improve from N random accesses per query/update to 1 random access per query/update.
I think their concern is that N (nondependent) random accesses can happen in parallel not much more slowly than 1 random access (assuming good speculation and sufficient memory bandwidth).
The accesses don't need to happen in parallel for the blocked filter to win; its lookups are faster whenever they take place, because they need only one block in cache, rather than the equivalent of k>1 blocks together, to be fast (executed from cache), and that is unconditionally much better, even without assumptions about hot and cold keys.
If you had many lookups to do, then an approach with a coroutine that issues a prefetch instruction then yields to other lookups while the data is fetched in memory might be effective.
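Even without coroutines, a crude two-pass batch captures much of that benefit. A minimal sketch (illustrative names; __builtin_prefetch is the GCC/Clang builtin):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Batch a set of random-access lookups: first issue a prefetch for every
// cache line the batch will touch, then do the actual probes, so the
// memory fetches overlap instead of serialising one after another.
void batched_lookups(const std::vector<uint64_t>& table,
                     const std::vector<std::size_t>& indices,
                     std::vector<uint64_t>& out) {
    out.resize(indices.size());
    for (std::size_t i : indices)
        __builtin_prefetch(&table[i]);       // ask for the line early
    for (std::size_t j = 0; j < indices.size(); ++j)
        out[j] = table[indices[j]];          // probes now mostly hit cache
}
```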
> whether this would really work out in most workloads
> just because it keeps the cache-lines hotter and less likely to be evicted.
Okay, so keeping the bloom filter in cache is a real problem - but the real force evicting memory out of the cache is the next row-group you read + all the other stuff you have to do when you implement this in a database product. Memory bandwidth is contested heavily and the caches will get knocked out every 1024 rows.
So the two things I work with, Apache Hive and Apache Impala, switched to a blocked bloom filter at different points in time.
We ran a lot of trivial benchmarks and several benchmarks where the shuffle-join+network (not sort-merge, this is just a partitioned hash join) generates a bloom filter (a semijoin) before sending rows out, and the 1-cache-line version won out when the bloom filter went slightly over the 1 Million + 5% rate [1].
The regular bloom filter went from (38ns -> 108ns for 1k -> 1m items), while the BloomK stuck at (27ns) despite making room for a million times more items in the bloom. The bloom-1 (which is the 64bit version) underperformed on accuracy (was ~2x faster at 16ns per op, but worse at filtering out items).
If I math right, setting 5 bits in a 64-bit word extends the effective length of the hash function by approximately (5-1) * log_2(64) = (5-1) * 6 = 24 bits. In a simple bloom filter of storage size m bits, the hash functions have length log_2(m). All else being equal, the number of hash functions used -- hence the number of memory reads per lookup -- gets reduced by a factor of log_2(m) / [24 + log_2(m)].
(But it's not exactly equal; the storage size m has to go up by a factor of 1.2 (in the OP parameter set), for constant false-positive rate).
Each word in this design is a little bloom filter. It has number of bits (m) = 64 and number of hashes (k) = 5, and it's going for a 1% false positive rate, which per the formula in [1] (with n being the number of items in the filter) is:
fpr = (1 - e^(-k*n/m))^k
OP solved for 1% and got n ≈ 4.
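For reference, the same formula written as code (just a sketch; the function name is made up):

```cpp
#include <cmath>

// Textbook bloom filter false-positive formula for a single word:
// m bits, k probe bits per item, n items hashed into it.
double word_fpr(double n, double m = 64.0, double k = 5.0) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}
```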
There is exactly 1 hash function (to compute the 64 bit key). 5 bits are pseudo-randomly selected (giving us k=5) and written to an array element indexed off of the 64 bit key itself.
So sizing this [naively] is a matter of taking your n (say 4 million items), dividing by 4, and getting an array of 1 million words.
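Roughly, as a sketch (not OP's actual code; how the 5 bits get derived from the hash here is just one plausible choice):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One 64-bit hash selects both the word (low bits, via modulo) and the
// 5 pseudo-random bit positions within that word (upper bits).
struct WordBloom {
    std::vector<uint64_t> words;   // e.g. roughly n/4 words for n items

    explicit WordBloom(std::size_t num_words) : words(num_words, 0) {}

    // Pull five 6-bit slices out of the upper half of the hash to pick
    // 5 (possibly overlapping) bit positions.
    static uint64_t mask(uint64_t h) {
        uint64_t m = 0;
        for (int i = 0; i < 5; ++i)
            m |= uint64_t(1) << ((h >> (32 + 6 * i)) & 63);
        return m;
    }

    void add(uint64_t h) { words[h % words.size()] |= mask(h); }

    bool maybe_has(uint64_t h) const {
        uint64_t m = mask(h);
        return (words[h % words.size()] & m) == m;
    }
};
```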
What is not discussed in detail is block selection. Given the segmented nature of the array-based data structure, distributing exactly 4 items per word is impossible for an on-line filter (as that requires perfect hashing). So, the actual 'load' capacity of this filter is, in array form, a fraction of 4 x array-len. Precisely, given an array of size S words, with capacity of Sx4, well before we reach Sx4, our keys will resolve to array elements, the little micro filters, that already have 4 registrations. Inserting the full allocated capacity (Sx4) will certainly result in a significant number of words having n greater than 4, which worsens the false positive rate. At 5 items we have fpr at 3.5%, and at 6 fpr is 7.3%. Equally, some array elements (words) will have fewer than 4 elements, getting exceptional fprs. For example, at 2 entries, fpr is 0.0063%! If we load the full capacity, we can no longer make statements about "1%" false positive rates, as OP does, it should be noted.
So, you either have to keep track of elements assigned to words (3 bits / word to count 0..4), or you compute a probable value to minimize overloaded array elements, keeping max writes to a word to 4 items. My guess would be ~70%ish utilization range of allocated memory. In this case, we get our desired 1% (max) but get bonuses of fortunate keys that land in underloaded words and get super high assurances at the cost of eating a bit of space inefficiency. /g
So this is the classic balls-into-bins question [2]. For hashes, the surprising but powerful two-choice approach gets us load factors in the high 90%s. I'd recommend that. Yes, we'll double the hashing and cache-line costs, but the load factor will be in the high-90s range. Here OP should examine two candidate words, and pick the one with the fewest items. [On lookups, we have 2 filters (words), one of which asserts it doesn't have it, and another that is 99% sure it does. Or both disclaim having it.]
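Something like this sketch, keeping the small per-word item count mentioned above (illustrative names only, not OP's code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-choice insertion: hash to two candidate words, insert into the one
// holding fewer items, and check both words on lookup.
struct TwoChoiceBloom {
    std::vector<uint64_t> words;
    std::vector<uint8_t>  counts;   // items per word (3 bits would suffice)

    explicit TwoChoiceBloom(std::size_t n) : words(n, 0), counts(n, 0) {}

    void add(uint64_t h1, uint64_t h2, uint64_t bits) {
        std::size_t i = h1 % words.size(), j = h2 % words.size();
        std::size_t pick = counts[i] <= counts[j] ? i : j;  // emptier word wins
        words[pick] |= bits;
        counts[pick]++;
    }

    bool maybe_has(uint64_t h1, uint64_t h2, uint64_t bits) const {
        return (words[h1 % words.size()] & bits) == bits ||
               (words[h2 % words.size()] & bits) == bits;
    }
};
```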
p.s. to OP: you could also return the fpr if you keep track of items per word so the call site has insight into the assurances. That would be useful for a probabilistic data structure.
> Each word in this design is a little bloom filter.
It's effectively a hash table of bloom filters.
> So, you either have to keep track of elements assigned to words (3 bits / word to count 0..4), or you compute a probable value to minimize overloaded array elements
One option would be to just directly limit the number of marked bits in each bucket, which seems to correlate better with false positive rate than the number of elements per se. (E.g. an element with all k indexes equal (marking only one bit) causes fewer false positives than one with k distinct indexes.) You get 6 bits * 4 elements = 24 bits marked, so just call a bin full if it has popcount >= 24.
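In code that check is tiny (a sketch, using the 24-bit threshold above):

```cpp
#include <bit>       // std::popcount (C++20)
#include <cstdint>

// Treat a 64-bit bucket as "full" once it has 24 or more marked bits,
// instead of tracking how many elements were inserted into it.
bool bucket_full(uint64_t word) {
    return std::popcount(word) >= 24;
}
```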
There is some p probability that a given filter with less than 24 set bits has 4 elements (since some of their bits overlapped). So this approach falls under "compute a probable value" as there is a non-zero probability that some filters have > 4, or even higher. So you will have fpr > 1%. When you count, you have the assurance of fpr < 1.0% at the cost of 4 additional bits / array element.
For this design, it's just a step or two away from venturing into 3.5% or 7.3% error rates. And if you simulate it, you'll see a Gaussian distribution with mean at the theoretical capacity (len x 4) and some standard deviation. By capping bits, we cut a lot of the possible cases that exceed the mean, but will not eliminate them. It's pretty straightforward to simulate this and get empirical data to see if the impact (of false positives) is acceptable.
> There is some p probability that a given filter with less than 24 set bits has 4 elements (since some of their bits overlapped).
> there is a non-zero probability that some filters have > 4, or even higher.
Overlapped bits don't contribute to the false positive rate. As an unlikely-but-simple example, consider a filter with 3 items, marking 15 distinct indexes, and a new 'fourth' item whose 5 indexes all align with one of the existing 15 indexes. Adding this new item does not change the filter at all, and so cannot change the false positive rate.
In general the false positive rate (assuming our hash function is good, i.e. H(X) is independently uniformly random for each X) is the chance that a uniformly random hash passes the filter. For an N-bit filter with P bits marked, and K bits per element, that's (P/N)^K (select K random indexes, and check if they're all marked, which they will be with probability P out of N each). Since N and K are fixed, the FPR only depends on the popcount P. A straightforward bloom filter can't depend on the number of items inserted (separately from the number of marked bits), because that information isn't even present.
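Or, as a sketch in code (function name made up):

```cpp
#include <bit>       // std::popcount (C++20)
#include <cmath>
#include <cstdint>

// False-positive probability of a single 64-bit word given only its
// popcount P: (P/N)^K, with N = 64 bits and K probe bits per element.
double word_fpr_from_popcount(uint64_t word, int k = 5) {
    return std::pow(std::popcount(word) / 64.0, k);
}
```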
You want to check if an element is probably in a dataset. You want speed, and the possibility of a false positive doesn’t scare you that much.
Say you want to look up if a username is in a database. That could take a while if you have a really large db of usernames, but you want to tell a user quickly if a name is taken.
So what you can do is make a bloom filter, which is basically just an array of bits of length n. Let’s say 10 for our example.
Then gather some hash functions (h1, h2, and h3).
When someone inputs a username, you run it through each hash function, modulo by the number of bits, and get an integer result from each (say 3, 2, and 7).
Then you check the bits at positions 3, 2, and 7 in the array.
If any of the bits are set to 0, we haven’t seen the username before.
If they’re all set to 1, we have possibly seen the username before.
Let’s say they’re all set to 0, so we probably haven’t seen it. We can present the message that the username is available, and allow registration. During registration, we can do a slower more expensive verification that the username isn’t actually in the dataset.
Now that the username is in the dataset, we can flip the bits on our bloom filter at positions 3, 2, and 7 to 1. So the next time we look up that username we can tell quickly that it’s probably taken.
There is no way to delete from the bloom filter. The more values you store relative to its size, the more false positives you get.
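Here’s the walkthrough above as a toy sketch (illustrative only; a real implementation would use proper independent hash functions rather than a salted std::hash):

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Tiny 10-bit bloom filter with "three hash functions" faked by salting
// one std::hash three different ways.
struct TinyBloom {
    std::bitset<10> bits;

    static std::size_t h(const std::string& s, std::size_t salt) {
        return std::hash<std::string>{}(s + static_cast<char>('0' + salt)) % 10;
    }

    void add(const std::string& name) {
        for (std::size_t salt = 0; salt < 3; ++salt) bits.set(h(name, salt));
    }

    // true  -> possibly seen before (may be a false positive)
    // false -> definitely never seen
    bool maybe_seen(const std::string& name) const {
        for (std::size_t salt = 0; salt < 3; ++salt)
            if (!bits.test(h(name, salt))) return false;
        return true;
    }
};
```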
Not exactly correct. If all of the bits are 1, then the value is possibly in the set (not "probably", but "possibly", with the likelihood being affected by the sizes of the set and the array).
If any of the bits is zero, the value definitely isn't in the set (also not "probably haven't", but definitely not).
Was trying to keep the initial explanation simple by leaving out the false positives scaling relative to the size of the array and set, and then clarifying at the end.
The second part (any 0 bit means it hasn’t been seen) is true; that was a goof-up on my part.
Usernames is a nice example. Just as a demo, we could make things simpler:
Look at all the usernames in the input, keep track of all the letters you’ve seen. When you get a new name, if you’ve seen all the letters before, you might have seen the name before. If there’s a new letter, it must be a new name.
Obviously that isn’t a great hash function, some letters are more common than others, so real Bloom filters use better ones, but it works basically the same way.
Think of them as asking a bouncer if they have seen a suspect enter a club. They don't know folks by name, but can answer a series of questions about the people that entered.
Then, the "hashes" are effectively your new questions. Did someone with green hair enter? Did someone limping enter? Wearing a hat? Etc
To that end, it is clear that you won't get a definitive yes to entry. You may get a sufficient no, though.
say through numerology you reduce the character values of a name to a number 0-10. Lots of names will reduce to the same number. We take 11 bits and for "jim" we set the first bit. We have only one name in our data set so all other bits are 0. Now if someone types "joe" in the search box and it reduces to 2, we look at the second bit, see it is a zero, and know 100% *for sure* that this name is not in the data set. If "jack" reduces to 1 and we look him up, we see 1 is set so this name *might be* there.
I did say I'd butcher it. :-) imho one shouldn't be overly attached to an exact implementation but learn to reason about it. You could for example make an array and count how many times each bit was set. A textbook bloom filter doesn't allow you to remove things, but if you count how often a bit was set you obviously can.
Excellent thought-provoking article, but all the times I’ve brought in a dumb cache or bloom filter it has been to save calling a microservice or hitting a DB or something so fantastically expensive (in computer time) that even a staggeringly inefficient bloom filter, or just keeping lists of recently-seen values etc., is a massive win. So there’s no big pressing need to micro-optimise the bloom filter or whatever, and all the real profile-guided optimizations point at other low-hanging fruit.
Same, I was thinking about implementing a bloom filter myself, but when I put in the hit ratio and the frequencies, a simple LRU cache was going to be close enough. It's hard to justify the complexity sometimes, which is a pity because they are so cool.
wild tangent, focusing on the stated example, not the filter implementation:
> Suppose that you are given a database of users where only a small percentage are ‘paying customers’ (say 5% or less). You can write an SQL query to check whether a given user is indeed a paying customer, but it might require a round trip to your database engine. Instead, you would like to hold a small ‘filter’ in memory to quickly check whether the given user is a paying customer.
another way to avoid needing to solve this example filtering problem is to store the data in the database in a way that doesn't require any filtering. e.g. if it is often useful to distinguish paying customers from other kinds of users (sounds plausible), maybe paying customers could be segregated into their own separate table from non-paying users. then there's no need to filter.
this is a rule of thumb that is written about a bit in 'data-oriented design'. instead of operating on collections of heterogeneous entities that are obfuscated behind some facade of commonality, figure out which are the most frequent types of entities and put them in their own collections, where they can be computed upon in homogeneous batches -- with better performance.
> maybe paying customers could be segregated into their own separate table from non-paying users.
If you know that, for example, you just need the customer ID and name of paying customers, you can also make an index on (is_paying, customer_id, customer_name).
Any database engine worth its salt will then be able to satisfy such a query entirely from the index without reading any pages from the actual table, making this basically equivalent to the "separate table" approach but without the "administrative" overhead.
This seems a bit like a solution in search of a use case. There are simpler solutions for this particular use case.
Let's take an extreme case: You have 250 million paying customers you want to cache access to. Yeah, that's way more than you probably have, but the point is that you can keep 250 million integers in memory no problem whatsoever. It fits in a gigabyte. There are raspberry pis with eight times that amount of RAM.
You can just put the ids in a sorted array and you can check for existence with a binary search without having to deal with false positives. Adding a bunch more is as easy as a merge of sorted lists, and removing items is a great deal easier than doing the same in a bloom filter. You're looking at log_2(250 million) ≈ 28 iterations. Cache-wise it's not great but it's faster than a database access by a mile and you don't have to worry about hash functions.
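As a sketch of how little code that is (illustrative, assuming 32-bit customer ids):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Membership test: one binary search, no false positives.
bool is_paying_customer(const std::vector<uint32_t>& sorted_ids, uint32_t id) {
    return std::binary_search(sorted_ids.begin(), sorted_ids.end(), id);
}

// Adding a batch of new ids is just a sort + merge of sorted lists.
void add_customers(std::vector<uint32_t>& sorted_ids, std::vector<uint32_t> new_ids) {
    std::sort(new_ids.begin(), new_ids.end());
    std::vector<uint32_t> merged(sorted_ids.size() + new_ids.size());
    std::merge(sorted_ids.begin(), sorted_ids.end(),
               new_ids.begin(), new_ids.end(), merged.begin());
    sorted_ids.swap(merged);
}
```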
Maybe if you have several billion customers it starts to become a bit much to keep in memory, but that is a weird corner case indeed.
The main use case I've always seen is to use bloom filters on the client side to reduce traffic to the server looking things up. As you said, you could cache 250 million integers in a gigabyte - but you don't want to bloat your client side implementation by a gigabyte for every bloom filter you use. Also, many times the items are a lot larger than an integer. For example, storing a list of malicious URLs, a common use case.
I'm so confused by this use case (the traffic-saving one, not the malicious URL classifier). Why not store the "is-paying-customer" bit in a cookie?
What are we using as the user identifier? Where does it come from, if not a cookie?
Also, this client-side bloom filter kind of leaks your user database, supposing it's keyed on email addresses and your adversary has a gigantic list of email addresses, or is patient enough to enumerate them.
> Why not store the "is-paying-customer" bit in a cookie?
You shouldn't trust the client. You probably don't want people to get access to paid features with a relatively easy tweak of cookies.
The client-side filter is more suitable for listing, say, malware URLs, as mostly the response is "no" (go ahead) instead of "maybe", which would require a bit more work (like, network requests) to check if it's blocked or not.
I think you missed the point of the example. Lemire's example isn't "the use-case" for this optimization, it's just a trivial motivating example. The use-case is more like, "assume you're currently using a Bloom filter in an appropriate context, here's a way that might be more efficient".
The point of the post is the optimization technique, not trying to describe when a Bloom filter would be appropriate.
Obviously you don't need bloom filters if lookups are cheap, but you're ignoring all of the cases where lookups aren't cheap.
You can't always put the entire database of everything in memory. Consider cases where the check is done client-side and requires a network request. A bloom filter could allow you to optimize away a network request in some, though not all, cases.
I'm curious if you have a concrete example of a case where lookups aren't cheap and a bloom filter can solve the problem.
In general, bloom filters are mainly useful for the sort of data where lookups are cheap, that is, testing for membership of a set, and that is typically one of the operations a database does really well.
In most cases where you just want to save a network roundtrip, a LRU cache is often a lot easier to get right and is much easier to keep consistent against mutable data.
I use them in astronomical data indexing when doing cross-match searches. That is, “could there be a star or other moving object within 0.5 arcsec of this point in the sky?” Sky catalogs have many billions of objects but the density per solid angle of the sky is extremely variable, and moving objects also make lookups really complicated. Sky catalogs are big enough that they are not storable in memory, and may be in big Parquet tables on the network, so the full lookup is quite expensive.
Bloom filters are great for this because the backing dataset changes extremely slowly, like a few times a year when a survey publishes a new data release.
Bloom filters generally are much smaller than the data they represent. So if your set fits on disk, the bf might fit in ram; if your data fits in ram, your bf might fit in L3, and so on.
Edit: in particular you might want to use bloom filters for early rejects, when most of the set ownership queries are expected to fail.
Imagine you're making a web browser plugin that blocks ads, or malicious sites. Let's assume the blocklist is a hundred megs (easily fits in ram) and your millions of users need to get the latest data at least hourly in order to keep up with the latest URLs that you want to block.
Rather than distributing the entire blocklist to your userbase, you can instead send a bloom filter + an allowlist of the small handful of sites which have a hash collision with one of the blocked sites.
As a bonus, computing the hashes will have great branch prediction characteristics and you'll have fewer cache misses because the bloom filter is tiny and frequently accessed, so your plugin will not add any perceptible slowdown to the user's experience.
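Sketched out, the client-side check is tiny (illustrative names; the Bloom filter itself is abstracted as a callable here):

```cpp
#include <functional>
#include <string>
#include <unordered_set>

// The client holds a Bloom filter over the blocked URLs plus a small exact
// allowlist of the known false positives shipped alongside it.
bool should_block(const std::function<bool(const std::string&)>& bloom_maybe_blocked,
                  const std::unordered_set<std::string>& allowlist,
                  const std::string& url) {
    if (!bloom_maybe_blocked(url)) return false;  // definite miss: allow through
    if (allowlist.count(url) > 0)  return false;  // known false positive: allow
    return true;                                  // otherwise treat as blocked
}
```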
Couldn't you just send deltas if that is the case? Surely the hourly updates wouldn't be hundreds of megabytes? I just tested extracting 5 million URLs from my web crawler and it was like 150 Mb in plain text. That's ignoring how easy it is to create compression schemes for URLs that slash their memory footprint by something like 80%.
Yes, you could do it with deltas instead. The tradeoffs are that it won't be as fast and will cost you a LOT more in bandwidth. (and cost your users more bandwidth as well. They might be on a very slow/limited data plan) Maybe you don't care about bandwidth and would prefer to avoid the complexity and maintenance overhead of adding a bloom filter.
As with anything, there are tradeoffs and your requirements can change over time. Maybe the ad networks or malware creators start using new domains every 10 minutes to counter your blocking system so now you have to store more data and disseminate it more frequently.
As engineers, it's our job to weigh the tradeoffs between different solutions given the resources and constraints of the situation. For the situation I've outlined above, I'd at least strongly consider a bloom filter but it's certainly not the only way to do it.
> As engineers, it's our job to weigh the tradeoffs between different solutions given the resources and constraints of the situation. For the situation I've outlined above, I'd at least strongly consider a bloom filter but it's certainly not the only way to do it.
We should also not gloss over that a bloom filter can't rule out false positives, and it's really not feasible to figure out which they are. Due to the nature of hashing, you won't be able to easily come up with a list of false positives. Finding just one hash collision in a wide hash is computationally stupidly hard.
> As with anything, there are tradeoffs and your requirements can change over time. Maybe the ad networks or malware creators start using new domains every 10 minutes to counter your blocking system so now you have to store more data and disseminate it more frequently.
Domains cost quite a lot of money so that is still pretty unrealistic. Sure you can have CN wildcards, but you can also do wildcard matching.
Actually this whole scenario is unrealistic, since you can just serve ads off a random URL. The way you would create a decent ad filter is to look for characteristics in the script itself (a bit like an antivirus program), not base it off the URL.
> We should also not gloss over that a bloom filter can't rule out false positives, and it's really not feasible to figure out which they are.
Nothing says a bloom filter has to be the only data structure you use. It's a performance optimization; even if 1% of the time you have to consult a more expensive data structure to confirm your result, it can still save you a lot of computation in the long run.
> Finding just one hash collision in a wide hash is computationally stupidly hard.
But bloom filters don't use wide hashes; the domain of the hash function is the number of bits in the filter.
> Domains cost quite a lot of money so that is still pretty unrealistic.
In the case of malware, the cost of buying domains isn't that relevant, because you can compromise existing domains using automated attacks.
Why wouldn't "isPayingCustomer" not be part of the Customer object graph immediately after logging in to begin with? (Maybe I didn't understand the use case properly)
This is a very bad example, but the underlying concepts are useful.
It's more realistic if you are storing a set of revoked access keys or altered files in your cache, which you subscribe to from some channel that is not on your main request - database - response loop.
Normally you won't ever need to store many of those until they reach their expiration time or you update your cache, and you will want a lot of performance for the common case (which the small memory requirements help ensure).
I think fundamentally bloom filters are mainly useful for astronomical data sizes. If you have a petabyte of data partitioned between several datacenters, and you want to know which shards have entries with certain properties, bloom filers are amazing. A bloom filter that is just a few dozen megabytes can save you an enormous amount of computation in such a scenario.
To cache network requests it really isn't that great. A basic LRU cache typically performs just as well, and is a lot easier to get right.
a good use case is malicious URL detection for browsers. they can look up locally and get a sense of whether the URL is bad and then double-check against the webserver.
I feel like he skipped over something. Which part of his algorithm requires “about 12 bits” instead of about 10? He’s storing 5 entries per word which is 12.8 bits per entry (which is quite a rounding error). Why not 6 entries at 10.67 bits per entry?
I suspect a bloom filter should use a separate array of bits for each hash function. This came up in a previous discussion where the stated probability of collisions was "wrong", but not if it were actually implemented with separate bit arrays.
The real reason why people don’t use the blocked bloom filter is that its FPR is much higher + it requires more storage. It has the worst of both worlds.
You're right, a hash function doesn't map to multiple values.
Bloom filters and other probabilistic set membership DS's like Cuckoo filters instead use multiple hash functions to map one input to multiple output values. [1]
The number of hash functions to use depends on the expected false positive rate, the number of bits in the filter, and the number of elements which will be inserted.
Telling the GoogleBot to not index your site doesn't necessarily remove your site from Google. They can and often do still show your site, but they replace your description with a generic statement that they can't show it because you have decided not to include your site in their index. It is crazy but true.
1. https://corecursive.com/frontiers-of-performance-with-daniel...
2. https://www.youtube.com/watch?v=wlvKAT7SZIQ
3. https://arxiv.org/pdf/1902.08318.pdf