Ask HN: Privacy-centric and ethical analytics solutions?
1 point by jamieweb on Nov 25, 2018 | 7 comments
I would like to gain some high-level insight into the traffic accessing my website. For example:

  - Unique visitor counts
  - Most viewed pages
  - Referring sites
  - Activity per time of day/week/month
I do not want to be able to track individual users - I want to keep this strictly to statistics rather than intrusive tracking. That throws out pretty much anything that involves JavaScript or stuff done on the client-side.

I've been trying to put together a solution using the AWStats log analyser, however this requires me to collect IP addresses. If I remove or obfuscate IP addresses, then the 'Unique Visitors' count doesn't work. Unfortunately it seems that AWStats uses IPs as the primary method for identifying unique visitors.

What other solutions are out there? My site is PHP so doing something myself would also be acceptable.




I have built a platform that does exactly that. It does require JavaScript, for a few reasons:

1. It allows single-page apps to be analyzed.

2. Page caching has no effect on whether the JS is executed. Most back-end tracking can't tell that a page was visited when it is served from a cache.

So I would recommend using JavaScript if the above reasons apply to you. As far as I know, you can't really obfuscate the IP address in a way that makes it impossible to track a visitor. That's why I decided to drop IP addresses from our logs and not use them at all.

Regarding your last point: unique visitors are hard to measure if you don't use an IP address or a cookie. A cookie is tracking, but I'm not sure how intrusive you consider it. It could be a cookie with just a value of 'visited=1' or something, so you know it's a non-unique visitor when the cookie is present. That way I don't think you're really tracking anyone.
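
A minimal sketch of the 'visited=1' idea, in Python for illustration. The function name `classify_visit` and the one-day `Max-Age` are assumptions made up for this example, not something any particular platform does:

```python
from http.cookies import SimpleCookie

def classify_visit(cookie_header: str):
    """Return (is_unique, set_cookie_value) for one incoming request.

    The cookie carries no identifier, only the fact that the browser
    has been here before.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    if "visited" in cookies:
        # Returning browser: do not count again, nothing to set.
        return False, None
    # First visit: count it and ask the browser to remember that fact.
    # The one-day lifetime is an arbitrary choice for this sketch.
    return True, "visited=1; Path=/; Max-Age=86400"

first = classify_visit("")
repeat = classify_visit("visited=1")
```

Here `first[0]` is `True` and `repeat[0]` is `False`. Every returning browser sends the same value, so the cookie can't tell two visitors apart.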

You can see demo stats of my platform here https://simpleanalytics.io/simpleanalytics.io


It was actually your Show HN from a few months ago that prompted me to add 'analytics' to my to-do list.

For the past years of running my site, I've had basically no analytics. I have the site in Google Webmaster Tools which shows stats on the Google clicks, but other than that there isn't much. At the moment traffic from everywhere but Google is completely unaccounted for.

I also disabled web server logging completely in May when GDPR came in.

Simple Analytics looks really good, and it's on my personal list of 'cool startups to use in the future', however for my particular site, there are a couple of issues:

- Adblock users

- JavaScript

The reason I'm so interested in log analysers is that the data is closer to network/traffic statistics than it is analytics/tracking. I think that a massive portion of the traffic to my site will be with Adblock, so if all of it is unaccounted for then the numbers will be way off.

Also, my site is strictly JavaScript free. It does look like you offer a <noscript> version though. My site is locked down with a super-tight Content Security Policy, and including an external JS file would be too big of a risk in my opinion.

As far as I can see, Simple Analytics isn't able to count unique visitors. This is what makes it so good when it comes to privacy and security, but it's a stat that I'd really like to see. I've started putting together my own solution using a bloom filter for this, as nikonyrh suggested in this thread.

So overall I think that Simple Analytics is great and I would love for more sites to adopt it, rather than going down the guy with a camera and notebook route (as shown in your promo video!). However for my particular project, it isn't suitable at the current time. As I said though, it's on my list and I can definitely see it being useful for other projects that I may be involved in.

Thank you :)


Thank you for your kind words! It sounds like you'd like to build something cool here for your website.

To reply on your points:

- Adblock users are definitely a portion of the traffic, and some of them block the major trackers. Unfortunately that includes Simple Analytics. But I’m building a proxy so that people can point a CNAME of their domain to Simple Analytics.

- Simple Analytics does indeed have a noscript version. It removes the ability to get the referrer though.

- Uniques without tracking. Yes. It’s possible. Will keep HN posted

Good luck with the bloom filter!


You could hash the IPs before storing them, but as there aren't that many IPv4 addresses it would be trivial to reverse.
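
To illustrate why, here is a minimal sketch, assuming an unsalted SHA-256 over the dotted-quad string. The names `hash_ip` and `reverse_hash` are made up for this example. It brute-forces a single /24 from the documentation range; covering all of IPv4 is the same idea with a longer loop.

```python
import hashlib
import ipaddress
from typing import Optional

def hash_ip(ip: str) -> str:
    """Unsalted SHA-256 of the textual IP address (assumed scheme)."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()

def reverse_hash(target: str, network: str) -> Optional[str]:
    """Try every address in `network` until one hashes to `target`."""
    for addr in ipaddress.ip_network(network):
        if hash_ip(str(addr)) == target:
            return str(addr)
    return None

# What a log would contain instead of the real address.
stored = hash_ip("203.0.113.77")

# An attacker who guesses the right range recovers the address.
recovered = reverse_hash(stored, "203.0.113.0/24")
```

After this runs, `recovered` equals "203.0.113.77".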

However if you use bloom filters to calculate "distinct counts" then I think you cannot reliably reconstruct visitors' IPs. You gotta do some planning in advance on implementation details so that you can extract the stats you are looking for.
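
A rough sketch of what I mean by a distinct count, assuming a hand-rolled filter rather than any particular library. The class name `BloomFilter`, the filter size and the number of hash functions are arbitrary choices for illustration:

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; stores set membership, not the items themselves."""

    def __init__(self, size_bits: int, num_hashes: int):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by prefixing the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> bool:
        """Set the item's bits; return True if at least one bit was unset,
        meaning the item had definitely not been added before."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new

bloom = BloomFilter(size_bits=1 << 20, num_hashes=3)
unique_visitors = 0
for ip in ["192.0.2.1", "192.0.2.2", "192.0.2.1", "198.51.100.7"]:
    if bloom.add(ip):
        unique_visitors += 1
```

With a filter this empty the count should come out as 3. False positives can only ever make the count too low, never too high.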


Thanks for the suggestion, this has got me thinking...

I had thought about hashing IPs and user agents combined, but even that would be quite reversible since user agents aren't that unique really.

I've done some research on bloom filters and this looks like a good lead. It could be challenging to implement with AWStats though, as I guess I'd have to perform the bloom filter logic before writing the log, and then write a fake IP address accordingly in order to keep the Unique Visitors count consistent.

For example, if the bloom filter says that the IP is not known, then I could write 10.0.0.2 to the log and increment a count. Next time a not known IP comes in, 10.0.0.3 is written, then 10.0.0.4, and so on.

If the IP is already known, then the unique visitor has already been counted, so just write 10.0.0.1 to the log.

The result of this is that all log entries would be for 10.0.0.1, except for those where a new visitor had connected for the first time, which will be an arbitrary IP from the 10.0.0.0/8 range.
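
Roughly, as a sketch in Python rather than PHP. The class name `PlaceholderAllocator` is made up, and the `is_new_visitor` flag is assumed to come from whatever bloom filter check runs before the log line is written:

```python
import ipaddress

class PlaceholderAllocator:
    """Hands out fake 10.0.0.0/8 addresses so a log analyser still sees
    one distinct IP per unique visitor."""

    KNOWN = "10.0.0.1"  # written for every visitor already counted

    def __init__(self):
        self._next = int(ipaddress.IPv4Address("10.0.0.2"))
        self._last = int(ipaddress.IPv4Address("10.255.255.254"))
        self.unique_count = 0

    def address_for(self, is_new_visitor: bool) -> str:
        if not is_new_visitor:
            return self.KNOWN
        if self._next > self._last:
            raise RuntimeError("placeholder range 10.0.0.0/8 exhausted")
        addr = str(ipaddress.IPv4Address(self._next))
        self._next += 1
        self.unique_count += 1
        return addr

alloc = PlaceholderAllocator()
# new, new, already known, new
written = [alloc.address_for(flag) for flag in (True, True, False, True)]
```

Here `written` is `["10.0.0.2", "10.0.0.3", "10.0.0.1", "10.0.0.4"]` and `alloc.unique_count` is 3. The counter would need to be persisted, and reset whenever the filter is reset.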

Have I got the right idea here, or is there a better way? This is my first rough concept.


Ahaa, I didn't take AWStats into account. I was thinking more in terms of "one bloom filter per unique thing you want the distinct count out of", for example 1 per day & IP & User agent. I guess your approach would work fine for quite "empty" bloom filters, but if your filter is too large it again becomes more plausible to reverse the IPs from it.

You get more accurate counts from the https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the... formula. And I'm not even sure what to do if you have under-sized your filter and suddenly you get lots of unique visitors. I guess you'd need to create a larger filter on-the-fly.
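
As far as I understand it, that formula estimates the number of items n from the filter size m (bits), the number of hash functions k and the number of set bits X as n ≈ -(m/k) · ln(1 - X/m). A small sketch, with the function name `estimate_distinct` made up for illustration:

```python
import math

def estimate_distinct(bits_set: int, size_bits: int, num_hashes: int) -> float:
    """Estimate how many distinct items were added to a bloom filter
    from how many of its bits are set."""
    if bits_set >= size_bits:
        # Every bit is set: the logarithm is undefined and the filter
        # can no longer say anything useful about the count.
        raise ValueError("filter is saturated; the estimate is undefined")
    return -(size_bits / num_hashes) * math.log(1 - bits_set / size_bits)

# 1,000 bits set in a 2**20-bit filter with a single hash function.
estimate = estimate_distinct(bits_set=1000, size_bits=1 << 20, num_hashes=1)
```

The estimate comes out slightly above 1000, because a few of the added items will have landed on a bit that was already set. It grows without bound as the filter fills up, which is one way to notice that the filter is under-sized.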

Btw Redis supports bloom filters, so you don't have to worry about the in-memory implementation ;) https://redislabs.com/blog/rebloom-bloom-filter-datatype-red...


I'm looking to also rely on k-anonymity a bit for this.

If I have a bloom filter that is 128 KiB (1,048,576 bits) in size and I use the first 5 characters of a SHA256 hash as the identifier, then there are only 1,048,576 possible unique values that can be stored.

The total number of publicly routable IPv4 addresses is 3,706,452,992 - so that means that for each bit in the bloom filter, there are an average of 3,535 possible IPs that it could relate to.

In other words, if you were to brute force the bloom filter with the hashes of every publicly routable IPv4 address (which wouldn't take very long since it's only 3.7 billion), the average accuracy would be 1 in ~3500.
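
The arithmetic, as a quick sketch. The helper name `bit_index` is made up, and the routable IPv4 total is simply the figure quoted above:

```python
import hashlib

FILTER_BITS = 128 * 1024 * 8      # 128 KiB expressed in bits
HASH_PREFIX_VALUES = 16 ** 5      # 5 hex characters = 20 bits
ROUTABLE_IPV4 = 3_706_452_992     # figure quoted in this comment

def bit_index(ip: str) -> int:
    """Map an IP to one bit via the first 5 hex characters of its SHA256."""
    return int(hashlib.sha256(ip.encode("ascii")).hexdigest()[:5], 16)

# Average number of routable IPv4 addresses that share each bit.
ips_per_bit = ROUTABLE_IPV4 / HASH_PREFIX_VALUES

example_index = bit_index("192.0.2.1")
```

`FILTER_BITS` and `HASH_PREFIX_VALUES` are both 1,048,576, so every 5-character prefix maps to exactly one bit, and `ips_per_bit` works out to roughly 3,535.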

This means that IPs that share the same 5 starting SHA256 characters wouldn't be counted, but that would only be 1 in ~3500, so not a big problem. At worst it would result in a lowered unique visitors count - which is much better than a bloated one.

For IPv6, the same applies but even better, as there are orders of magnitude more IPv6 addresses than there are IPv4 addresses.

> But if your filter is too large it becomes again more plausible to reverse the IPs from it

Yes, I've been considering this. If I were to use a larger filter, I'd need to take more characters of the SHA256 hash in order to match the total size of the filter. If I used the first 10 characters of a SHA256 hash rather than just 5, then that is easily enough possible values to uniquely identify each IPv4 address. If I used 10 then the bloom filter would be trivially brute-forceable. With only 5 characters though, it's a 1 in ~3500 accuracy rate as I mentioned above.

> I guess your approach would work fine for quite empty bloom filters

By this did you mean 'small' filters (i.e. by file size), or filters that just aren't that populated? If the latter it should still be fine as long as the size of the filter allows for a good k-anonymity ratio (1 in 3500, etc).



