This is a good question with plenty of good answers, but those answers are often poorly received on HN, so you’re unlikely to get a solid one.

I’ll take the hit though.

It’s not just pageview logs, but GA has great tools to analyze those logs, do reporting on some decent set of actions and to bring it all together in a simple to use interface.

You can take your server logs and then what will a non technical person do with them? Not much.

That said, you can deploy GA while opting out of behavioral data and ad network features, and even fuzz ip addresses.

Analytics has the stigma of ad networks because they historically existed to validate ad spend. We’re past that point and they are often used with strict first-party intent.

There’s nothing preventing us from imagining all the malicious things any analytics tool could do, and imaginations run wild.

Disclosure: I work for an analytics company that doesn’t want to own your data, but I understand why folks have a knee-jerk reaction to analytics of any kind.




> That said, you can deploy GA while opting out of behavioral data and ad network features, and even fuzz ip addresses.

How useful is the information without this? If they aren't tracking you, then they don't have your profile data. ASL is usually the most useful data but only the L is sort of available.

> You can take your server logs and then what will a non technical person do with them? Not much.

IME this is exactly what usually happens with analytics. It's one of those things that management is convinced they just have to have for its pretty charts and feeling of empowerment, but when you ask them what changes they've made based on the data, they won't have a lot of examples.

I'm sure they're valuable in the right hands, but for the vast majority it's just a waste of time, similar to most reporting.


> ASL is usually the most useful data but only the L is sort of available

you are thinking in terms of ads.

If instead what you want to know is, what parts of the sites do visitors stop navigating. Or which pages are seen by recurring visitors vs other pages seen mainly by new visitors. What pages are almost never visited.

That information doesn’t need ASL; the goal is not to target individuals but to profile the site and see what brings value and what might not.

> management is convinced

I think analytics are not a tool for management though, except perhaps in very broad strokes. I see it more as a feedback tool for product owners who need to see the impact of what they do, or to get a picture of how users use their product.

It’s like asking management what decisions they make based on NewRelic. None surely. That’s not their job.


> you are thinking in terms of ads.

I'm thinking in terms of demographics of visitors. Who's visiting, who's not visiting, why are we so big in japan, that sort of thing. My objection to analytics isn't just ads, it's sending information to a third party behind the backs of users.

> If instead what you want to know is, what parts of the sites do visitors stop navigating. Or which pages are seen by recurring visitors vs other pages seen mainly by new visitors. What pages are almost never visited.

Those examples can be handled just fine by some trivial processing on server logs. Only the first needs a way to identify users (which analytics will also need), and the other two don't even need a user ID in the logs. I'll give you the benefit of the doubt and assume they were just simple examples and that you want a lot more detailed information, in real time, with output prettier than graphviz. In that case, why aren't you setting up a locally hosted alternative? If the data collected is truly worthwhile, then it's surely worth this minimal time investment.
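To make "trivial processing" a bit more concrete, here is a minimal sketch of the no-user-ID case (hit counts and pages that are almost never visited). It assumes Common/Combined Log Format lines and a hand-maintained list of known pages; the function names and sample lines are made up for illustration, and real logs would need more care around bots, redirects, and static assets.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common/Combined Log Format line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def page_hits(lines):
    """Count successful (2xx) page requests per path, ignoring query strings."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status").startswith("2"):
            path = match.group("path").split("?", 1)[0]
            hits[path] += 1
    return hits


def rarely_visited(hits, known_pages, threshold=1):
    """Return known pages with at most `threshold` hits, least visited first."""
    quiet = (page for page in known_pages if hits[page] <= threshold)
    return sorted(quiet, key=lambda page: (hits[page], page))


# Hypothetical sample lines; the IPs come from documentation address ranges.
sample = [
    '203.0.113.7 - - [10/Oct/2019:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Oct/2019:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Oct/2019:13:56:40 +0000] "GET /about.html?ref=x HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Oct/2019:13:57:12 +0000] "GET /missing HTTP/1.1" 404 128 "-" "Mozilla/5.0"',
]

hits = page_hits(sample)
quiet_pages = rarely_visited(hits, ["/index.html", "/about.html", "/contact.html"])
```

With the sample above, `/contact.html` (never requested) and `/about.html` (requested once) come out as the quiet pages, and none of it needs a visitor identifier.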

> That’s not their job.

That's kind of what I'm getting at: it will be mandated by someone in management or marketing, but IME it's usually no one's job and nothing happens with it. Google is the only winner.


> Who's visiting, who's not visiting, why are we so big in japan, that sort of thing.

I misread your focus on ads. Is it more about user acquisition, perhaps?

Government sites, for instance, have fewer of these issues IMO, as they’ll have other means, usually in-person or mail surveys, to directly ask why people are not coming to the site (do they know about it? Do they have a computer? Can they read the language? etc.).

More than anything, these sites have a captive audience so the focus can really be on improving the access to the relevant information.

> processing on server logs

I think that overestimates the budget and technical competency of the agencies handling these websites, and underestimates the time it would take to reimplement a log parser that surfaces all this information user session by user session.

It definitely can be done; it’s not trivial in any way, though. Compared to what some of the government websites do (they’re basically glorified WordPress sites), building a log analyser plus the associated dashboard would cost more than the site itself.
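For a sense of what the per-session part alone involves, here is a rough sketch. It assumes the log lines are already parsed into (visitor key, timestamp, path) tuples, that the visitor key is something like IP plus user agent, and that 30 minutes of inactivity ends a session. All three are assumptions for illustration, not something a log format gives you for free.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff


def sessionize(events, gap=SESSION_GAP):
    """Group (visitor_key, timestamp, path) events into sessions.

    A new session starts whenever a visitor is idle for longer than `gap`.
    Returns a list of sessions, each a list of paths in visit order.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, path in events:
        by_visitor[visitor].append((ts, path))

    sessions = []
    for visits in by_visitor.values():
        visits.sort()  # chronological order per visitor
        current = [visits[0][1]]
        for (prev_ts, _), (ts, path) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(path)
        sessions.append(current)
    return sessions


def exit_pages(sessions):
    """Count how often each path is the last page of a session."""
    return Counter(session[-1] for session in sessions)


# Hypothetical, already-parsed events for two visitors.
t0 = datetime(2019, 10, 10, 13, 0)
events = [
    ("visitor-a", t0, "/"),
    ("visitor-a", t0 + timedelta(minutes=5), "/pricing"),
    ("visitor-a", t0 + timedelta(hours=2), "/"),
    ("visitor-b", t0 + timedelta(minutes=1), "/"),
    ("visitor-b", t0 + timedelta(minutes=3), "/signup"),
]

sessions = sessionize(events)
exits = exit_pages(sessions)
```

Even then, this only gets you exit pages. Identifying visitors reliably, filtering bots, and the dashboard around it are still to be built, which is where most of the cost would go.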


Sorry you’re getting downvoted. These are good questions. I’m in transit and will answer before tomorrow :)


> You can take your server logs and then what will a non technical person do with them? Not much.

Can this analysis be done offline? Data collection can be done without third-party access, and any analysis on that data can be done offline using separate tools, can't it? That removes the third-party script attack surface.


Nobody has built a useful tool to do this. So, no, not in practice.


There are tools for analyzing server logs. I played with a CLI tool for that, but here's the first result for regular folks from a Google search on "server logs analytics":

https://matomo.org/feature-overview/


Matomo is the new name of Piwik; it's self-hosted, but JavaScript based, not just a server log processor.


Apparently you can use it for just server logs; see: https://matomo.org/faq/log-analytics-tool/#faq_16303.


All of this could just be a post-processing step on offline logs, which any non-technical person could do.

Then you wouldn't need to imagine what else is done with the data. To imagine that your users' data is exploited behind your back should not be a stretch of anyone's imagination.


Server logs are simply not powerful enough for most types of analysis, even basic analysis. Also, tools don’t exist to do the basic analysis in a meaningful way. It’s a catch-22. I’d love to see useful tools that work on server logs, but it’s not 1999 any more.


I keep repeating myself, but there is goaccess [1], which is powerful enough for most use cases. It creates a nice real-time HTML report of your Apache or nginx logs. It can even show you a report inside the terminal.

[1] https://goaccess.io/


That’s a nice overview for a sysadmin but doesn’t provide tools for making business decisions for non-sysadmins.


That's actually a much more reasonable answer than the one I had imagined. The unfortunate part is that I don't trust anyone not to misuse the data, especially not government employees.


Interesting, the 'especially from government employees' bit feels like a very US-centric reaction. In many other places in the world people just kinda trust their governments. I have no problem with any European government collecting analytics data.


I disagree, as always the US is just 10 years ahead (so it will come to us as well, just later) and there are numerous legislative initiatives that show how little your privacy and freedom of expression concern politicians.


>I disagree, as always the US is just 10 years ahead

At the same time the US is 10 years behind. GDPR is far ahead of anything the US has. Gun control is 50+ years ahead, and let's not try to count years in social security nets or healthcare for the non-rich (or health in itself, for that matter). On average I'd say the US is behind the curve and falling further by the day, especially now that China has become a semi-great, soon-to-be superpower.



Yeah it’s just analytics data. I trust every shady website I’ve ever visited with what I click on, why should a government service (provided by the government to me in the public good) be any different? What could they possibly get from that the census bureau doesn’t have? The tax authority?


Europe has a long and recent history of not being worthy of trust. Francoist Spain just ended in 1975. We had communist Europe fairly recently. The Nazis weren’t that long ago. We’ve had actual genocide in Europe within the past 30 years. It’s a mistake to “trust” any government beyond what can be immediately audited and verified. However to your point about analytics: that’s pretty benign unless those analytics are personally identifiable.


But it is a government website, the government employees would have access to the data regardless of using GA


Yeah, and they can't cross-reference it with which newspaper stories you read, for example, like Google can, will, and does, before selling information that you never agreed to provide to Google or to Google's customers.


Sure, I don't trust government bureaucracies either.

However, they can easily get lots of compromising data from their own servers. Both from standard web server logs, and from their own scripts and tools.

Once you involve third-party analytics, though, there's another party to worry about. And not just about what they do with the data, but also about how carefully they manage it. That's arguably a key thing in GDPR.


> GA has great tools to analyze those logs, do reporting on some decent set of actions and to bring it all together in a simple to use interface

All of this for free. Why?


GA is a tie in to validate AdWords spend primarily. If you scale up it costs $120k/year. It costs more if you pay someone to implement it custom for your use case.


>.. even fuzz ip addresses

Fuzz how? The IP is known to Google from the very connection ...

> Analytics has the stigma of ad networks

Well... Google with GA is an ad company, isn't it?


They zero out the last octet. It’s a token effort tbh but it’s a decent first step.
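GA's documentation describes IP anonymization as setting the last octet of an IPv4 address (and the last 80 bits of an IPv6 address) to zero. A rough sketch of that masking, using Python's standard `ipaddress` module; the function name is made up, and this illustrates the idea rather than being Google's code:

```python
import ipaddress


def anonymize_ip(addr):
    """Zero the last octet of an IPv4 address, or the last 80 bits of an IPv6 address."""
    ip = ipaddress.ip_address(addr)
    # Keep the first 24 bits of IPv4 (drop 8) or the first 48 bits of IPv6 (drop 80).
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


# Example addresses are taken from documentation ranges.
masked_v4 = anonymize_ip("203.0.113.77")  # "203.0.113.0"
masked_v6 = anonymize_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348")  # "2001:db8:85a3::"
```

Note that up to 256 IPv4 addresses collapse into one value, so this reduces precision rather than hiding who connected.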


This is disabled by default though, no?


Yep probably.


Where and when exactly is this happening? If it's on Google's side, then Google already knows the real IP, which is exactly what is to be avoided, especially on government and health websites.


On a tangential note, what is it with health data that many, in US at least, find so sensitive seemingly more than their bank account login information.

I genuinely don't understand. Is it that the majority have some secret pre-existing conditions and are afraid the insurance companies might realize?

Every time I visit a new doctor I need to spend ~ 10 minutes to fill out a long multi page form on paper listing all my medical history which could've been loaded from some database. I want my data to be analyzed and used to derive insights and help future patients.


Job applicant screening services would love knowing your health. Credit rating companies would love that too. Google is doing anything it can, free email, free photos/docs storage, free GA to get all possible data. They are capable of ML processing it all together, with their resources.


> what is it with health data that many, in US at least, find so sensitive seemingly more than their bank account login information.

It's not US specific, the privacy of health information goes back to at least the Hippocratic oath.

> Is it that the majority have some secret pre-existing conditions and are afraid the insurance companies might realize?

A lot of people do. And it's not just embarrassing conditions: medical records also include notes on mental health, drinking habits, illicit drug use, etc. If that information leaked out, you could expect everyone from future employers to dates to be taking a look.


As far as I’m aware it’s totally on the Google side. It’s not possible to do before that, as the request to the server is made directly from the client. Government and health sites can likely be OK with it if they trust that the data isn’t being stored. This is a similar concern to one my health vertical customers have tackled.



