If I math right, setting 5 bits in a 64-bit word extends the effective length of the hash function by approximately (5-1) * log_2(64) = (5-1) * 6 = 24 bits. In a simple bloom filter of storage size m bits, the hash functions have length log_2(m). All else being equal, the number of hash functions used -- hence the number of memory reads per lookup -- gets reduced by a factor of log_2(m) / [24 + log_2(m)].
(But all else isn't exactly equal; the storage size m has to go up by a factor of about 1.2 (in the OP parameter set) to keep the false-positive rate constant.)
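A quick back-of-the-envelope helper, if you want to see what that factor works out to for a given m (the m values below are just examples, not OP's):

    #include <math.h>
    #include <stdio.h>

    /* Reduction factor log2(m) / (24 + log2(m)) for a few storage sizes m (bits).
       The 24 comes from (5-1) * log2(64) as argued above. */
    int main(void) {
        double extra = (5 - 1) * log2(64);      /* = 24 bits */
        double sizes[] = { 1e6, 1e8, 1e10 };    /* example m values, in bits */
        for (int i = 0; i < 3; i++) {
            double lg = log2(sizes[i]);
            printf("m = %.0e bits: log2(m) = %.1f, factor = %.2f\n",
                   sizes[i], lg, lg / (extra + lg));
        }
        return 0;
    }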
Each word in this design is a little bloom filter. It has number of bits m = 64 and number of hashes k = 5, and it goes for a 1% false positive rate, which per the formula in [1] (with n being the number of items in the filter) is:
fpr = (1 - e^(-k*n/m))^k
OP solved for 1% and got n =~ 4.
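For anyone who wants to plug in their own numbers, here is that formula from [1] as a tiny program (m = 64, k = 5 as above; OP's exact figures may come from a slightly different variant of the formula or different rounding):

    #include <math.h>
    #include <stdio.h>

    /* Classic bloom-filter estimate from [1]: fpr = (1 - e^(-k*n/m))^k */
    static double fpr(double n, double m, double k) {
        return pow(1.0 - exp(-k * n / m), k);
    }

    int main(void) {
        for (int n = 1; n <= 8; n++)
            printf("n = %d  fpr = %.4f%%\n", n, 100.0 * fpr(n, 64, 5));
        return 0;
    }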
There is exactly 1 hash function (it computes the 64-bit key). 5 bits are pseudo-randomly selected from that key (giving us k=5) and set in an array element that is itself indexed off of the 64-bit key.
So sizing this [naively] is a matter of taking your n (say 4 million items), dividing by 4, and getting a 1-million-word array.
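To make that concrete, here's a minimal sketch of how I read the insert/lookup path. The hash mixer (a splitmix64-style finalizer), the 6-bit slices used to pick the 5 bits, and the slot derivation are all my guesses at an implementation, not OP's actual code:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <stdio.h>

    /* One 64-bit hash of the item gives us everything:
       - the array slot (the "little bloom filter" word)
       - five 6-bit slices, each selecting one of the 64 bits to set (k = 5). */

    static uint64_t hash64(uint64_t x) {          /* stand-in mixer (splitmix64 finalizer) */
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    static uint64_t mask_from_key(uint64_t key) { /* k = 5 bits, from 6-bit slices of the key */
        uint64_t mask = 0;
        for (int i = 0; i < 5; i++)
            mask |= 1ULL << ((key >> (6 * i)) & 63);
        return mask;
    }

    typedef struct { uint64_t *words; size_t len; } filter;

    static void insert(filter *f, uint64_t item) {
        uint64_t key = hash64(item);
        f->words[(key >> 30) % f->len] |= mask_from_key(key);
    }

    static bool maybe_contains(const filter *f, uint64_t item) {
        uint64_t key = hash64(item);
        uint64_t m = mask_from_key(key);
        return (f->words[(key >> 30) % f->len] & m) == m;
    }

    int main(void) {
        filter f = { calloc(1000000, sizeof(uint64_t)), 1000000 }; /* ~4M items at 4/word */
        for (uint64_t i = 0; i < 4000000; i++) insert(&f, i);
        printf("present: %d, absent (probably): %d\n",
               maybe_contains(&f, 123), maybe_contains(&f, 123456789012345ULL));
        free(f.words);
        return 0;
    }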
What is not discussed in detail is block selection. Given the segmented nature of the array-based data structure, distributing exactly 4 items per word is impossible for an on-line filter (that would require perfect hashing). So the actual 'load' capacity of this filter, in array form, is a fraction of 4 x array-len. Precisely: given an array of S words, with nominal capacity S x 4, well before we reach S x 4 some of our keys will resolve to array elements -- the little micro filters -- that already have 4 registrations. Inserting the full allocated capacity (S x 4) will certainly leave a significant number of words with n greater than 4, which increases the false positive rate: at 5 items the fpr is 3.5%, and at 6 it is 7.3%. Equally, some array elements (words) will end up with fewer than 4 items and get exceptional fprs -- at 2 entries, for example, the fpr is 0.0063%! It should be noted that if we load to full capacity, we can no longer make statements about "1%" false positive rates, as OP does.
So, you either have to keep track of elements assigned to words (3 bits / word to count 0..4), or you compute a probable value to minimize overloaded array elements, the goal either way being to keep writes to a word to a max of 4 items. My guess would be a utilization somewhere in the ~70% range of allocated memory. In that case we get our desired 1% (max), plus the bonus that fortunate keys landing in underloaded words get super high assurances, at the cost of eating a bit of space inefficiency. /g
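If you want to sanity-check a utilization guess before writing the real thing: treating single-choice placement as balls into bins, the per-word load is approximately Poisson with mean 4 x utilization, so you can estimate how often a word would blow past 4 items (my approximation, not an exact analysis):

    #include <math.h>
    #include <stdio.h>

    /* P(load > 4) for a Poisson(lambda) word, where lambda = 4 * utilization. */
    static double p_overloaded(double lambda) {
        double p = exp(-lambda), cdf = 0.0;   /* p starts at P(X = 0) */
        for (int i = 0; i <= 4; i++) {        /* accumulate P(X = 0..4) */
            cdf += p;
            p *= lambda / (i + 1);
        }
        return 1.0 - cdf;
    }

    int main(void) {
        for (double u = 0.5; u <= 1.001; u += 0.1)
            printf("utilization %.1f -> P(word gets > 4 items) = %.3f\n",
                   u, p_overloaded(4.0 * u));
        return 0;
    }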
So this is the classic balls-into-bins question [2]. For hashing, the surprising but powerful two-choice approach gets us load factors in the high 90s, and I'd recommend it. Yes, we double the hashing and cache-line costs, but we keep that load factor. Here OP should examine two candidate words and pick the one with the fewest items. [On lookups we have 2 filters (words): one asserts it doesn't have the item and the other is 99% sure it does, or both disclaim having it.]
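Roughly, the two-choice variant would look like this. Again a sketch: the way the second candidate slot is derived, the hash mixer, and the byte-per-word load counters (standing in for the packed 3-bit counters) are my own choices, not OP's:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Two-choice placement: hash the item once, derive two candidate words,
       insert into whichever currently holds fewer items.  Lookups must check both. */

    #define LEN 1000000u

    static uint64_t words[LEN];
    static uint8_t  load[LEN];                    /* items per word (a byte here for simplicity) */

    static uint64_t hash64(uint64_t x) {          /* splitmix64 finalizer as a stand-in */
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    static uint64_t mask_from_key(uint64_t key) { /* same 5-bit mask as before */
        uint64_t mask = 0;
        for (int i = 0; i < 5; i++)
            mask |= 1ULL << ((key >> (6 * i)) & 63);
        return mask;
    }

    static void insert(uint64_t item) {
        uint64_t key = hash64(item);
        size_t a = (key >> 30) % LEN;             /* candidate 1 */
        size_t b = hash64(key) % LEN;             /* candidate 2, from a re-mixed key */
        size_t pick = (load[a] <= load[b]) ? a : b;
        words[pick] |= mask_from_key(key);
        load[pick]++;
    }

    static bool maybe_contains(uint64_t item) {
        uint64_t key = hash64(item);
        uint64_t m = mask_from_key(key);
        size_t a = (key >> 30) % LEN;
        size_t b = hash64(key) % LEN;
        return ((words[a] & m) == m) || ((words[b] & m) == m);
    }

    int main(void) {
        for (uint64_t i = 0; i < 4u * LEN; i++) insert(i);   /* load toward nominal capacity */
        printf("present: %d\n", maybe_contains(42));
        return 0;
    }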
p.s. to OP: you could also return the fpr if you keep track of items per word so the call site has insight into the assurances. That would be useful for a probabilistic data structure.
> Each word in this design is a little bloom filter.
It's effectively a hash table of bloom filters.
> So, you either have to keep track of elements assigned to words (3 bits / word to count 0..4), or you compute a probable value to minimize overloaded array elements
One option would be to just directly limit the number of marked bits in each bucket, which seems to correlate better with false positive rate than the number of elements per se. (E.g. an element with all k indexes equal (marking only one bit) causes fewer false positives than one with k distinct indexes.) You get at most 5 bits * 4 elements = 20 bits marked, so just call a bin full once it has popcount >= 20.
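i.e. something along these lines (assuming the GCC/Clang popcount builtin; the 20-bit threshold is "what 4 elements with fully distinct bits would mark"):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Treat a word as "full" once it has as many set bits as 4 fully-distinct
       elements would mark (k = 5 bits each -> 20).  Swap in your compiler's
       popcount intrinsic or a portable fallback as needed. */
    static bool word_full(uint64_t word) {
        return __builtin_popcountll(word) >= 20;
    }

    int main(void) {
        uint64_t w = (1ULL << 20) - 1;            /* 20 low bits set */
        printf("%d\n", word_full(w));             /* prints 1: treated as full */
        return 0;
    }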
There is some probability p that a given filter with fewer than 20 set bits already has 4 elements in it (since some of their bits overlapped). So this approach still falls under "compute a probable value", as there is a non-zero probability that some filters end up with more than 4 elements. So you will have fpr > 1%. When you count, you have the assurance of fpr < 1.0% at the cost of 4 additional bits / array element.
For this design, that's just a step or two away from venturing into 3.5% or 7% error rates. And if you simulate it, you'll see a roughly gaussian distribution of per-word loads, with the mean at 4 items per word once you've inserted the theoretical capacity (len x 4) and some standard deviation around it. By capping bits we cut a lot of the cases that exceed the mean, but we won't eliminate them. It's pretty straightforward to simulate this and get empirical data to see whether the impact on false positives is acceptable.
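Something like this is what I mean -- a throwaway single-choice simulation that histograms per-word occupancy (rand() and the parameters are just placeholders):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Throw LEN*4 keys into LEN words with single-choice placement and
       histogram how many items each word ends up holding.  rand() is crude
       but fine here; we only care about the shape of the occupancy curve. */

    #define LEN 1000000u

    int main(void) {
        uint8_t *count = calloc(LEN, 1);
        unsigned hist[16] = {0};

        srand(1);
        for (unsigned i = 0; i < 4u * LEN; i++) {
            uint64_t r = ((uint64_t)rand() << 16) ^ (uint64_t)rand();
            unsigned slot = (unsigned)(r % LEN);
            if (count[slot] < 255) count[slot]++;
        }
        for (unsigned i = 0; i < LEN; i++)
            hist[count[i] > 15 ? 15 : count[i]]++;
        for (int n = 0; n < 16; n++)
            if (hist[n])
                printf("words with %2d item(s): %u\n", n, hist[n]);

        free(count);
        return 0;
    }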
> There is some probability p that a given filter with fewer than 20 set bits already has 4 elements in it (since some of their bits overlapped).
> there is a non-zero probability that some filters end up with more than 4 elements.
Overlapped bits don't contribute to the false positive rate. As an unlikely-but-simple example, consider a filter with 3 items marking 15 distinct indexes, and a new 'fourth' item whose 5 indexes all land on already-marked positions. Adding this new item does not change the filter at all, and so cannot change the false positive rate.
In general the false positive rate (assuming our hash function is good, i.e. H(X) is independently uniformly random for each X) is the chance that a uniformly random hash passes the filter. For an N-bit filter with P bits marked and K bits per element, that's (P/N)^K (select K random indexes and check whether they're all marked, which each one is with probability P/N). Since N and K are fixed, the FPR depends only on the popcount P. A straightforward bloom filter can't depend on the number of items inserted (separately from the number of marked bits), because that information isn't even present.
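In code that's a one-liner, and (per the p.s. upthread) it's also the number you could hand back to the call site -- assuming a GCC/Clang popcount builtin here:

    #include <stdint.h>
    #include <math.h>
    #include <stdio.h>

    /* Per-word false positive rate depends only on the popcount P:
       fpr = (P/64)^5 for a 64-bit word with k = 5 probes. */
    static double word_fpr(uint64_t word) {
        double p = (double)__builtin_popcountll(word) / 64.0;
        return pow(p, 5.0);
    }

    int main(void) {
        for (int bits = 4; bits <= 32; bits += 4) {
            uint64_t w = (1ULL << bits) - 1;      /* word with 'bits' low bits set */
            printf("popcount %2d -> fpr = %.4f%%\n", bits, 100.0 * word_fpr(w));
        }
        return 0;
    }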