This is a good question to which there are a lot of good answers that are often ...

flukus · on March 19, 2019

> That said, you can deploy GA while opting out of behavioral data and ad network features, and even fuzz ip addresses.

How useful is the information without this? If they aren't tracking you then they don't have your profile data. ASL is usually the most useful data but only the L is sort of available.

> You can take your server logs and then what will a non technical person do with them? Not much.

IME this is exactly what usually happens with analytics. It's one of those things that management is convinced they just have to have for its pretty charts and feeling of empowerment, but when you ask them what changes they've made based on the data they won't have a lot of examples.

I'm sure they're valuable in the right hands, but for the vast majority it's just a waste of time, similar to most reporting.

hrktb · on March 19, 2019

> ASL is usually the most useful data but only the L is sort of available

You are thinking in terms of ads.

If instead what you want to know is, what parts of the sites do visitors stop navigating. Or which pages are seen by recurring visitors vs other pages seen mainly by new visitors. What pages are almost never visited.

That information doesn’t need ASL; the goal is not to target individuals but to profile the site and see what brings value and what might not.

> management is convinced

I think analytics are not a tool for management though, except perhaps in very broad strokes. I see it more for product owners who need a feedback tool to see the impact of what they do or to get a picture of how users use their product.

It’s like asking management what decisions they make based on NewRelic. None surely. That’s not their job.

flukus · on March 19, 2019

> You are thinking in terms of ads.

I'm thinking in terms of demographics of visitors. Who's visiting, who's not visiting, why are we so big in japan, that sort of thing. My objection to analytics isn't just ads, it's sending information to a third party behind the backs of users.

> If instead what you want to know is, what parts of the sites do visitors stop navigating. Or which pages are seen by recurring visitors vs other pages seen mainly by new visitors. What pages are almost never visited.

Those examples can be handled just fine by some trivial processing on server logs. Only the first needs a way to identify users (which analytics will also need), and the other two don't even need a user ID in the logs. I'll give you the benefit of the doubt and assume they were just simple examples, and that you want a lot more detailed information, in real time, with output prettier than Graphviz. In that case, why aren't you setting up a locally hosted alternative? If the data collected is truly worthwhile, then it's surely worth this minimal time investment.
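To make "trivial processing" concrete, here is a rough Python sketch of the "pages almost never visited" case. It assumes the logs are in Common/Combined Log Format; the regex, the function names, and the list of known paths are all made up for illustration, and a real site would need to handle query strings, bots and so on.

```python
import re
from collections import Counter

# Assumed format: the start of a Common/Combined Log Format line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def page_hits(lines):
    """Count successful GET requests per path; unparseable lines are skipped."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("method") == "GET" and m.group("status") == "200":
            hits[m.group("path")] += 1
    return hits

def rarely_visited(hits, known_paths, threshold=0):
    """Return the known paths with at most `threshold` hits (hypothetical helper)."""
    return sorted(p for p in known_paths if hits.get(p, 0) <= threshold)

# Made-up sample lines using documentation IP ranges.
SAMPLE = [
    '203.0.113.5 - - [19/Mar/2019:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '203.0.113.5 - - [19/Mar/2019:10:01:00 +0000] "GET /about HTTP/1.1" 200 256',
    '198.51.100.7 - - [19/Mar/2019:10:02:00 +0000] "GET / HTTP/1.1" 200 512',
    '198.51.100.7 - - [19/Mar/2019:10:03:00 +0000] "GET /missing HTTP/1.1" 404 0',
]

if __name__ == "__main__":
    hits = page_hits(SAMPLE)
    print(hits.most_common())
    print(rarely_visited(hits, ["/", "/about", "/contact"]))
```

Note that nothing here needs a user ID: it only counts requests per path.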

> That’s not their job.

That's kind of what I'm getting at, it will be mandated by someone in management or marketing but IME it's usually no one's job and nothing happens with it. Google is the only winner.

hrktb · on March 19, 2019

> Who's visiting, who's not visiting, why are we so big in japan, that sort of thing.

I misread your focus on ads, is it more about user acquisition perhaps?

Government sites for instance have less of these issues IMO, as they’ll have other means, usually in person or by mail survey, to directly ask why people are not coming to the site (do they know about it? Do they have a computer? Can they read the language? etc.)

More than anything, these sites have a captive audience so the focus can really be on improving the access to the relevant information.

> processing on server logs

I think that overestimates the budget and technical competency of the agencies handling these websites, and underestimates the time it would take to reimplement a log parser that surfaces all this information, user session by user session.

It definitely can be done, but it’s not trivial in any way. Compared to what some of the government websites do (they’re basically glorified WordPress sites), building a log analyser plus the associated dashboard would cost more than the site itself.
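For a sense of what the "session by session" part involves, here is a minimal Python sketch of the sessionizing step alone. It assumes the log lines have already been parsed into (visitor key, timestamp, path) tuples, that the visitor key is something weak like IP plus user agent, and that 30 minutes of inactivity ends a session. All of those are assumptions for illustration, and the dashboard on top is where the real cost would be.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff

def sessionize(events, gap=SESSION_GAP):
    """Group (visitor_key, timestamp, path) events into per-visitor sessions.

    A new session starts when a visitor has been idle for longer than `gap`.
    Returns a list of sessions, each an ordered list of paths.
    """
    by_visitor = {}
    for key, ts, path in sorted(events, key=lambda e: (e[0], e[1])):
        sessions = by_visitor.setdefault(key, [])
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, path))  # continue the current session
        else:
            sessions.append([(ts, path)])    # first event, or idle too long
    return [[p for _, p in s] for v in by_visitor.values() for s in v]

def exit_pages(sessions):
    """Count the last page of each session, i.e. where visitors stopped."""
    return Counter(s[-1] for s in sessions)

if __name__ == "__main__":
    t0 = datetime(2019, 3, 19, 10, 0)
    events = [
        ("a", t0, "/"),
        ("a", t0 + timedelta(minutes=5), "/forms"),
        ("a", t0 + timedelta(hours=2), "/"),
        ("b", t0, "/"),
        ("b", t0 + timedelta(minutes=1), "/forms"),
    ]
    print(exit_pages(sessionize(events)))
```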

codezero · on March 19, 2019

Sorry you’re getting downvoted. These are good questions. I’m in transit and will answer before tomorrow :)

msravi · on March 19, 2019

> You can take your server logs and then what will a non technical person do with them? Not much.

Can this analysis be done offline? Data collection can be done without third-party access, and any analysis on that data can be done offline using separate tools, can't it? That removes the third-party script attack surface.

codezero · on March 19, 2019

Nobody has built a useful tool to do this. So, no, not in practice.

TeMPOraL · on March 19, 2019

There are tools for analyzing server logs. I played with a CLI tool for that, but here's the first result for regular folks from a Google search on "server logs analytics":

https://matomo.org/feature-overview/

icebraining · on March 19, 2019

Matomo is the new name of Piwik; it's self-hosted, but JavaScript based, not just a server log processor.

TeMPOraL · on March 19, 2019

Apparently you can use it for just server logs; see: https://matomo.org/faq/log-analytics-tool/#faq_16303.

tjoff · on March 19, 2019

All of this could just be a post-processing-step on offline logs which any non-technical person could do.

Then you wouldn't need to imagine what else is done with the data. To imagine that your users' data is exploited behind your back should not be a stretch of anyone's imagination.

codezero · on March 19, 2019

Server logs are simply not powerful enough for most types of analysis, even basic analysis. Also, tools don’t exist to do the basic analysis in a meaningful way. It’s a catch-22. I’d love to see useful tools that work on server logs, but it’s not 1999 any more.

zubspace · on March 19, 2019

I keep repeating myself, but there is GoAccess [1], which is powerful enough for most use cases. It creates a nice real-time HTML report of your Apache or nginx logs. It can even show you a report inside the terminal.

[1] https://goaccess.io/

codezero · on March 19, 2019

That’s a nice overview for a sysadmin but doesn’t provide tools for making business decisions for non-sysadmins.

sverige · on March 19, 2019

That's actually a much more reasonable answer than the one I had imagined. The unfortunate part is that I don't trust anyone not to misuse the data, especially not government employees.

arcticbull · on March 19, 2019

Interesting, the 'especially from government employees' bit feels like a very US-centric reaction. In many other places in the world people just kinda trust their governments. I have no problem with any European government collecting analytics data.

dmichulke · on March 19, 2019

I disagree, as always the US is just 10 years ahead (so it will come to us as well, just later) and there are numerous law initiatives that show to what extent your privacy and freedom of expression is a concern of politicians.

Dahoon · on March 19, 2019

>I disagree, as always the US is just 10 years ahead

At the same time the US is 10 years behind. GDPR is far ahead of anything the US has. Gun control is 50+ years ahead, and let's not try to count years in social security nets or healthcare for the non-rich (or health in itself for that matter). On average I'd say the US is behind the curve and falling further by the day, especially now that China has become a semi-great, soon-to-be superpower.

scarejunba · on March 19, 2019

Any European government?

https://rsf.org/en/news/romania-tries-use-gdpr-force-journal...

arcticbull · on March 19, 2019

Yeah it’s just analytics data. I trust every shady website I’ve ever visited with what I click on, why should a government service (provided by the government to me in the public good) be any different? What could they possibly get from that the census bureau doesn’t have? The tax authority?

briandear · on March 19, 2019

Europe has a long and recent history of not being worthy of trust. Francoist Spain just ended in 1975. We had communist Europe fairly recently. The Nazis weren’t that long ago. We’ve had actual genocide in Europe within the past 30 years. It’s a mistake to “trust” any government beyond what can be immediately audited and verified. However to your point about analytics: that’s pretty benign unless those analytics are personally identifiable.

tonmoy · on March 19, 2019

But it is a government website, the government employees would have access to the data regardless of using GA

harry8 · on March 19, 2019

Yeah, and they can't cross-reference it with which newspaper stories you read, for example, like Google can, will, and does, then selling information that you never agreed to provide to Google or to Google's customers.

mirimir · on March 19, 2019

Sure, I don't trust government bureaucracies either.

However, they can easily get lots of compromising data from their own servers. Both from standard web server logs, and from their own scripts and tools.

Once you involve third-party analytics, though, there's another party to worry about. And not just about what they do with the data, but also about how carefully they manage it. That's arguably a key thing in GDPR.

auslander · on March 19, 2019

> GA has great tools to analyze those logs, do reporting on some decent set of actions and to bring it all together in a simple to use interface

All of this for free. Why?

codezero · on March 19, 2019

GA is primarily a tie-in to validate AdWords spend. If you scale up it costs $120k/year. It costs more if you pay someone to implement it custom for your use case.

auslander · on March 19, 2019

>.. even fuzz ip addresses

Fuzz how? The IP is known to Google from the very connection ...

> Analytics has the stigma of ad networks

Well... Google with GA is an ad company, isn't it?

codezero · on March 19, 2019

They randomize the last octet. It’s a token effort tbh but it’s a decent first step.
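For reference, as I understand Google's documentation, the IP anonymization feature zeroes the last octet of an IPv4 address (and the last 80 bits of an IPv6 address) rather than randomizing it. A minimal Python sketch of that kind of truncation, using the standard `ipaddress` module, looks like the following. The function name and the default prefix lengths are my own choices for illustration; this is not Google's code.

```python
import ipaddress

def mask_ip(addr, v4_prefix=24, v6_prefix=48):
    """Zero the host bits of an address: the last octet for IPv4 (/24),
    the last 80 bits for IPv6 (/48). Prefix lengths are assumed defaults."""
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us build the network from a host address.
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)

if __name__ == "__main__":
    print(mask_ip("203.0.113.77"))
    print(mask_ip("2001:db8:1234:5678::1"))
```

Every address in the same /24 collapses to the same value, so this coarsens the data rather than hiding who connected from the party doing the masking.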

p49k · on March 19, 2019

This is disabled by default though, no?

codezero · on March 19, 2019

Yep probably.

auslander · on March 19, 2019

Where and when exactly is this happening? If it's on Google's side, then Google already knows the real IP, which is exactly what is to be avoided, especially on government and health websites.

gerash · on March 19, 2019

On a tangential note, what is it with health data that many, in US at least, find so sensitive seemingly more than their bank account login information.

I genuinely don't understand. Is it that the majority have some secret pre-existing conditions and are afraid the insurance companies might realize?

Every time I visit a new doctor I need to spend ~10 minutes filling out a long multi-page paper form listing all my medical history, which could've been loaded from some database. I want my data to be analyzed and used to derive insights and help future patients.

auslander · on March 19, 2019

Job applicant screening services would love knowing your health. Credit rating companies would love that too. Google is doing anything it can, free email, free photos/docs storage, free GA to get all possible data. They are capable of ML processing it all together, with their resources.

flukus · on March 19, 2019

> what is it with health data that many, in US at least, find so sensitive seemingly more than their bank account login information.

It's not US specific, the privacy of health information goes back to at least the Hippocratic oath.

> Is it that the majority have some secret pre-existing conditions and are afraid the insurance companies might realize?

A lot of people do. Not just embarrassing conditions but they keep notes on mental health, drinking habits, illicit drug use, etc. If that information leaked out you could expect everyone from future employers to dates to be taking a look.

codezero · on March 19, 2019

As far as I’m aware it’s totally on the Google side. It’s not possible to do before that, as the request to the server is made directly from the client. Gov and Health can likely be OK if they trust the data isn’t being stored. This is a similar concern my health vertical customers have tackled.