Ahaa, I didn't take AWStats into account. I was thinking more in terms of "one bloom filter per unique thing you want the distinct count out of", for example 1 per day & IP & User agent. I guess your approach would work fine for quite "empty" bloom filters, but if your filter is too large it becomes again more plausible to reverse the IPs from it.
You get more accurate counts from "https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the... formula. And I'm not even sure what to do if you have under-sized your filter and suddenly you get lots of unique visitors, I guess you'd need to create a larger filter on-the-fly.
I'm looking to also rely on k-anonymity a bit for this.
If I have a bloom filter that is 128 KiB (1,048,575 bits) in size and I use the first 5 characters of a SHA256 hash as the identifier, then there's only 1,048,575 possible unique values that can be stored.
The total number of publicly routable IPv4 addresses is 3,706,452,992 - so that means that for each bit in the bloom filter, there are an average of 3,535 possible IPs that it could relate to.
In other words, if you were to brute force the bloom filter with the hashes of every publicly routable IPv4 address (which wouldn't take very long since it's only 3.7 billion), the average accuracy would be 1 in ~3500.
This means that IPs that share the same 5 starting SHA256 characters wouldn't be counted, but that would only be 1 in ~3500, so not a big program. At worst it would result in a lowered unique visitors count - which is much better than a bloated one.
For IPv6, the same applies but even better, as there are magnitudes more v6's than there are v4's.
> But if your filter is too large it becomes again more plausible to reverse the IPs from it
Yes, I've been considering this. If I were to use a larger filter, I'd need to take more characters of the SHA256 hash in order to match the total size of the filter. If I used the first 10 characters of a SHA256 hash rather than just 5, then that is easily enough possible values to uniquely identify each IPv4 address. If I used 10 then the bloom filter would be trivially brute-forceable. With only 5 characters though, it's a 1 in ~3500 accuracy rate as I mentioned above.
> I guess your approach would work fine for quite empty bloom filters
By this did you mean 'small' filters (i.e. by file size), or filters that just aren't that populated? If the latter it should still be fine as long as the size of the filter allows for a good k-anonymity ratio (1 in 3500, etc).
You get more accurate counts from "https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the... formula. And I'm not even sure what to do if you have under-sized your filter and suddenly you get lots of unique visitors, I guess you'd need to create a larger filter on-the-fly.
Btw Redis supports bloom filters, so you don't have to worry about the in-memory implementation ;) https://redislabs.com/blog/rebloom-bloom-filter-datatype-red...