Hi, I'm the product manager for Cloudflare Analytics. Thanks for this thorough and thoughtful review.
We are totally serious about building a world-class, privacy-first, free analytics product. At the risk of an HN cliche, this is our "early work". We are actively working to fix many of the rough edges mentioned here; if we had waited to fix all of them before shipping, we would never have shipped!
For folks who haven't seen it, I suggest checking out our launch blog post[0] which gives some more context around edge vs browser analytics (spoiler: we do both!), why we count visits the way we do, and how we handle bot traffic.
We know we have work to do on the "jagged lines" problem. For some low-traffic websites, we might show noisier, low-resolution data than is ideal. (We've artificially constrained our analytics to query a maximum of 7 days at a time because this problem is exacerbated with longer time ranges.)
My colleague Jamie wrote a nice blog post about how and why we sample data [1]. In short: we have an existing customer base of 25 million+ Internet properties, whose traffic volume spans 9 orders of magnitude! Sampling data is an elegant approach that allows us to serve fast, flexible analytics for all our customers. Sampling shouldn't be feared, but we know we can do better in some cases. We've recently merged some deep-in-the-weeds improvements to ClickHouse [2] that should result in improved resolution. And we're currently working to store full-resolution data for the smallest websites.
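For intuition, here is a rough sketch of the general idea behind sampled analytics (illustrative only, not our actual pipeline; the sample rate and traffic numbers below are made up): you keep a fraction of events, scale counts back up by the inverse of the sample rate, and the relative error shrinks as traffic grows, which is exactly why low-traffic sites see jaggier lines.

```typescript
// Rough sketch of how sampled analytics estimates totals.
// `sampleRate` here is a made-up constant; real systems pick it adaptively per site.

interface SampledWindow {
  sampledCount: number; // events actually stored
  sampleRate: number;   // fraction of events kept, e.g. 0.01 = 1%
}

// Scale the stored count back up to an estimate of the true total.
function estimateTotal({ sampledCount, sampleRate }: SampledWindow): number {
  return Math.round(sampledCount / sampleRate);
}

// Rough relative error for a sampled count: shrinks as traffic grows,
// which is why small sites see much "jaggier" lines than big ones.
function approxRelativeError({ sampledCount }: SampledWindow): number {
  return sampledCount > 0 ? 1 / Math.sqrt(sampledCount) : 1;
}

// A site with 1,000,000 daily views sampled at 1% keeps ~10,000 events (~1% error);
// a site with 1,000 daily views at the same rate keeps ~10 events (~30% error).
console.log(estimateTotal({ sampledCount: 10_000, sampleRate: 0.01 })); // ~1,000,000
console.log(approxRelativeError({ sampledCount: 10 }));                 // ~0.32
```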
Happy to address any other specific points that folks have questions about.
Well, I would say the opposite: sampling should absolutely be feared. In a lot of cases sampling is not an issue (the home page, or a popular page), but in others, including checkout pages, product pages, and low-visibility pages, sampling can make a massive difference. When working with sampled data you should always keep that in mind.
But would that happen on a per-page level or a per-site level? I think probably the latter, in which case the data is going to be a lot less useful where arguably it's most important.
> In short: we have an existing customer base of 25 million+ Internet properties, whose traffic volume spans 9 orders of magnitude! Sampling data is an elegant approach that allows us to serve fast, flexible analytics for all our customers.
I've been reflecting recently on how problems like this only exist for companies with extreme scale (similar to how microservices came about to solve FAANG-sized problems). This is a non-issue if you go with a product like plausible (or my personal choice: GoatCounter) for your analytics, because in that case you're essentially just paying them to manage an instance of their open source software for you on a multi-tenant server (I'm guessing here). And if it does eventually become a scale problem for plausible to the point where they start complicating their architecture to solve it, you can self-host or switch to another plausible provider.
If you set out to solve a simpler problem, you can use a simpler solution.
Two HN-marketing-powered companies fighting each other.
Cloudflare Analytics is server side, and we all know that server-side analytics is good for counting GET requests but almost unusable for anything else. But in their case it seems they are not even trying to filter bots
Page views are always the biggest complaint people have when they change analytics tools: "why do you have 100k more/fewer page views than GA or Adobe Analytics?" -> "We don't count page views the same way." The reality is that everyone filters bots a different way and nobody really does it well, so everyone's numbers are wrong. The raw number of page views usually doesn't mean much on its own; it's better to look at the differences and trends.
> But in their case it seems they are not even trying to filter bots
We try very hard in fact :) We just have some work ahead to surface this better in our analytics UI.
For more detail: we actually assign a "bot score" to each request. Customers of our Bot Management product can use this information to block bots at the edge. We are working right now to show the distribution of bot traffic in the UI so that anyone can see how we are classifying traffic.
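For anyone wondering what acting on that score looks like, here is a hedged sketch of a Cloudflare Worker that drops or tags requests by bot score. The `botManagement.score` field on `request.cf` is only populated for zones with Bot Management enabled, and the cutoff of 30 below is an arbitrary assumption for illustration, not a recommendation.

```typescript
// Sketch only: filter traffic by bot score inside a Cloudflare Worker.
// `cf.botManagement.score` (1-99, lower = more likely automated) is only
// available on zones with Bot Management; the cutoff of 30 is an assumption.

export default {
  async fetch(request: Request): Promise<Response> {
    const score = (request.cf as any)?.botManagement?.score;

    // If the score is present and clearly automated, short-circuit at the edge
    // so the origin (and its server-side analytics) never sees the request.
    if (typeof score === "number" && score < 30) {
      return new Response("Automated traffic filtered", { status: 403 });
    }

    // Otherwise pass through, optionally tagging the request for the origin's logs.
    const headers = new Headers(request.headers);
    if (typeof score === "number") {
      headers.set("X-Bot-Score", String(score));
    }
    return fetch(new Request(request, { headers }));
  },
};
```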
In addition to that, it took me a while to figure out that bounce rate, page views, and duration aren't calculated very well (if at all) if the user only goes to one page and then leaves. I finally figured out that setting up Google Tag Manager to trigger updates to GA on scroll not only gives you insight into whether the content is being read (and to what extent), but also forces updates of page views, duration, and even bounce rate (I am guessing because a single-page visit isn't considered a bounce if they spend a certain amount of time on the page?).
It's not time based; it's how you configure the events to send.
GA events have an optional field called "non-interaction", which defaults to false if not present[1].
Pageviews are considered interactive hits. GA's technical definition of a bounce is any visit with only a single interactive hit, leading to the layman definition of bounce rate being "people who viewed two or more pages".
Events, though, are also interactive by default (non-interaction = false being a double negative). So the moment you implement event tracking, if you don't explicitly control that parameter, the definition of "bounce rate" expands to include "the number of people that triggered my event tracking", and your bounce rate will become next to worthless the more liberal or trigger-happy your event tracking becomes.
A general rule of thumb is to make "passive" event tracking non-interactive, so that it doesn't impact bounce rate. Things like scroll tracking, timers, etc. Those are all subject to passive behaviors such as backgrounded tabs, impulse scrolling when you first get to a site, etc., and not necessarily representative of active engagement with your site/content.
Then make "active" event tracking (such as clicks tracking on interactive elements or outbound links) interactive. That way bounce rate becomes more representative of active engagement from users. Then optionally, and deliberately, make certain passive tracking events interactive for the express purpose of impacting/influencing bounce rate. For example, a website with lots of long form content but few interactive elements may want to make their passive "30 second" timer event an interactive event, based on business logic that someone spending 30 seconds on an article is considered engaged with it and, having hit that threshold of passive engagement with the page, should. no longer be considered a bounced visit. But every other timer would be set to non-interactive, and scroll tracking would be set to non-interactive (so you don't get false positives from people that scroll all the way to the bottom immediately).
Sending non-interactive events continues to increment the duration, so you still get that value from sending the non-interactive events. But without it skewing your bounce metric.
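To make that concrete, here is a small gtag.js sketch of the split described above (Universal Analytics-era syntax; the event names and the 30-second threshold are examples, not required values):

```typescript
// Sketch of the "passive vs active" event split (gtag.js / Universal Analytics).
// Event names and the 30s threshold are examples, not required values.

declare function gtag(...args: unknown[]): void;

// Passive signal: scroll depth. Marked non-interaction so it never
// rescues a visit from being counted as a bounce.
function trackScrollDepth(percent: number): void {
  gtag("event", "scroll_depth", {
    event_category: "engagement",
    event_label: `${percent}%`,
    non_interaction: true,
  });
}

// Active signal: outbound link clicks. Left interactive (the default),
// so a visit with a click is no longer a bounce.
function trackOutboundClick(url: string): void {
  gtag("event", "outbound_click", {
    event_category: "engagement",
    event_label: url,
  });
}

// Deliberate exception: a single 30-second engagement timer sent as an
// interactive hit, encoding the business rule "30s on an article is not a bounce".
setTimeout(() => {
  gtag("event", "engaged_30s", { event_category: "engagement" });
}, 30_000);
```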
That said, you rarely see that much care (or understanding) put into it in practice. Familiarity with how implementation impacts interpretation tends to be relatively rare, and a "bad" implementation leads to juiced metrics. So few analytics/marketing teams are even aware they can actively define business rules around what is a bounced visit and what isn't, so most don't and take whatever the system spits out as gospel, implicitly leaving it as "undefined behavior" subject to the whims of however it's implemented.
And even if you educate those teams on such things, no one wants to eat the hit to their vanity metrics that result from truing up an existing implementation that has such issues. So even broaching the subject tends to just lead to consternation and continuation of the status quo[2].
[2] The bitter grumblings of someone who pokes this hornet's nest on a frequent basis, and potentially a skewed perspective rather than objective fact.
Very cool! I really appreciate you taking the time to write out the detailed response. I truly learned a lot and am going to rework some of our configuration in tag manager. Thank you!!
My pleasure! Feel free to reach out if there's anything else you have questions about (email in profile).
The biggest thing to realize is that you can control that particular calculated metric in the first place. At which point, you can decide subjectively what constitutes a bounce for your business/site, and explicitly match your implementation to that business definition. Which may even fluctuate between sections of a site, but rarely aligns with what you'll get out of the box leaving everything with Google's defaults.
That makes perfect sense, but at least for me, sometimes it can be challenging to figure out what metrics are important and what is acceptable vs. not. For example with bounce rate - what is realistically a good bounce rate for a given sector? Even if you have a decent bounce rate, is it converting to actual sales? If it's not, then is that bounce rate really "good". The struggles of trying to manage SEO as a fourth or fifth hat =). Thank you again and will definitely reach out!
Wouldn't server-side analytics be better, though, because you could filter IP addresses and the like? For most people using the web, it's unlikely that they will have randomized IP addresses beyond a certain range, if I recall correctly; however, it's been almost a decade since I had to think about this problem.
> Cloudflare should be serious about this or they’ll only help Google Analytics in its dominance
> The main danger I see after paying for and trying Cloudflare Analytics is that they may not believe in this at all.
This is so true! Hopefully the issues mentioned in the post will be fixed soon, considering this is quite a new product. I can see high value for small personal websites, as they would be able to get something for free and focus on user privacy.
Meanwhile, the mentioned issues might be quite beneficial for other companies, such as Plausible or Fathom. They are quite mature in the privacy-focused analytics market and offer things that Cloudflare Analytics doesn't. People might start on Cloudflare Analytics (free) and eventually migrate to something more full-featured.
>> Cloudflare should be serious about this or they’ll only help Google Analytics in its dominance
>> The main danger I see after paying for and trying Cloudflare Analytics is that they may not believe in this at all.
I disagree with this part of the post because developers who opted for Cloudflare Analytics simply because it was "privacy first" are likely to explore other alternatives before selling their souls to GA.
The results of Cloudflare Analytics look like the old AWStats results: you know instantly that something is wrong because the numbers are much too high.
Interestingly, that was the exact reason Google Analytics got big: they had basically no bots in their statistics (in the old days, running JS was so complex that no crawler or bot did it). And now we've come full circle and are back where we started, with numbers as inflated as the old server logs.
Is there any good analytics solution that is only server side? Getting some good statistics without exposing to visitors that they are tracked (even in a friendly mode) would be nice.
Those numbers are definitely high, and the "unknown" section suggests a lot of that is bot activity, but I would be careful about assuming GA is accurate rather than low. I've seen it falling off for years as tracker blocking became common and then the default, and even before then, the need to load a short-TTL script meant that a measurable fraction of users wouldn't be captured.
> ...Is there any good analytics solution which is only server side? Getting some good statistics but not exposing to visitors they are tracked (even in a friendly mode) would be nice...
I'm in the same boat. About a year ago I migrated away from Google Analytics over to Matomo (fka Piwik) for all the websites I manage (mine, family's, friends'). While I really don't have many big complaints about Matomo, I really wish I didn't have to use JavaScript for this. I started looking at, but did NOT implement, GoAccess [http://goaccess.io], though I don't think it fills what I need. Honestly, for the websites I manage, and in these days where I want to preserve both my own privacy and the privacy of my visitors, I'm wondering if I should just shut off all tracking on the front end and try using GoAccess or even AWStats and process stuff offline. I'm not running any e-commerce site, after all. And removing yet another little element (JavaScript) from websites helps with performance for visitors. Sorry, not helping much; maybe give GoAccess a look...? Good luck!
Also interested in any CDN that has decent server-side analytics. It seems like it wouldn't be hard to at least make an attempt to exclude bot requests from the results; filtering out the main AWS-owned (and other cloud providers') CIDRs alone would go a long way, I'm sure.
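As a rough illustration of that kind of first-pass filter: AWS publishes its ranges at https://ip-ranges.amazonaws.com/ip-ranges.json, and a log processor could drop anything inside them before counting. The sketch below is IPv4-only and the example ranges are placeholders, not a vetted blocklist.

```typescript
// Sketch: drop log lines whose client IP falls inside a cloud provider's
// published CIDR blocks (IPv4 only).

function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bits] = cidr.split("/");
  const mask = bits === "0" ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(base) & mask);
}

// Example ranges only; in practice load them from the provider's published feed.
const cloudRanges = ["3.0.0.0/9", "13.52.0.0/16"];

function looksLikeCloudTraffic(clientIp: string): boolean {
  return cloudRanges.some((range) => inCidr(clientIp, range));
}

console.log(looksLikeCloudTraffic("13.52.10.1")); // true  -> likely bot/scanner
console.log(looksLikeCloudTraffic("8.8.8.8"));    // false -> keep in the stats
```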
I love how people on HN recommend using a private VPN with Algo deployed on AWS/DigitalOcean, and then other people recommend straight-up discarding all of that traffic as bots. I know those are different people, but I'm kind of frustrated that people are so trigger-happy to ban hosters/Tor.
Yeah, I totally get this, but percentage-wise I figure the amount of real natural human traffic coming from an AWS-owned IP is pretty small, so if something like that is what it takes to get decently accurate analytics from server-side request logs, then at least it's something. I could be wrong though
You may want to look at Netlify and what their server-side logging is doing. I've heard good things about it from others, but I am no expert in this domain.
I'm not sure it is bots; I think it is just the difficulty of counting hits vs. page views, and page views within a "session", which has no meaning whatsoever on the web.
>Google Analytics has a big issue with adblockers and on this site alone more than 25% of visitors block it.
>Plausible avoid it by not collecting any personal data in the first place.
Seems like this is wrong. The default filters in uBlock seem to block Plausible's script just as much as Google's.
Could likely be argued that the greater ability to filter out bots with a script makes it preferable as compared to server-side analytics. But you are definitely under-counting legitimate visitors if you go that route.
> I could talk about free never being free or the venture-funded model that allows them to lose hundreds of millions of dollars every year centralizing a big part of the internet but all that is a different conversation.
But that's not what cloudflare is doing. You can't separate analytics from other parts of its business. Cloudflare collects a lot of data to determine which traffic is legitimate, and offering analytics simply means that they build a nice UI for site owners to access that data.
I'm really enjoying this series of blog posts from @markosaric. On some sites I work with we've been really struggling to understand the differences in visitor counts from various analytics solutions, and some of those sites depend on those numbers for funding. I was quite excited when Cloudflare announced an analytics product, as they should know, better than anything we can install and run ourselves, exactly how much human traffic is hitting our URLs. But what we got seems to be almost useless.
"In my Cloudflare Analytics versus Plausible Analytics comparison, Cloudflare Analytics is inaccurate to the point that the data is pretty much useless as web analytics... Seems for now that the brand new Cloudflare Analytics is simply a server log tool with server log accuracy."
Okay, this is a nice distinction to make, but I would kinda like to know both numbers: what's the raw load coming through CF to my website, what are the "real human users", and what tools can CF provide to block the former while allowing the latter? If CF can provide easy-to-configure, easy-to-understand WAF rules to filter more of the bots and other raw traffic that I don't want, then I'm going to save on my AWS/Azure/hosting bill.
Cloudflare Analytics is not there yet, then. I guess it currently resembles their standard analytics overview page, which looks more like a summary of a server access log.
I actually use Clicky and self-hosted Matomo for most of my properties. I would love to use something like Plausible, but for me (an SEO pro) the most important feature is missing: landing pages and the average number of actions and/or time spent per visitor per landing page.
I'm working on an open-source analytics lib [0] and I fully agree with the article that Cloudflare should either do it right or not do it at all. Filtering bots is probably the hardest and most time-consuming part. There is no simple approach to the problem, but not doing it at all makes it worthless, especially for the price.
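For a sense of why it's hard: even a crude first pass like the sketch below (not any particular tool's implementation) only catches crawlers that identify themselves, which is the easy part; headless browsers and spoofed user agents are where the real work is.

```typescript
// Crude first-pass bot filter: reject self-identifying crawlers by user agent
// before counting a page view. This is a sketch; it misses headless browsers
// and anything that lies about its UA, which is the genuinely hard part.

const BOT_UA_PATTERN =
  /bot|crawl|spider|slurp|curl|wget|python-requests|headless/i;

interface PageView {
  userAgent: string;
  path: string;
}

function shouldCount(view: PageView): boolean {
  if (!view.userAgent) return false;              // empty UA is almost never a browser
  if (BOT_UA_PATTERN.test(view.userAgent)) return false;
  return true;
}

console.log(shouldCount({ userAgent: "Googlebot/2.1", path: "/" }));                  // false
console.log(shouldCount({ userAgent: "Mozilla/5.0 (Windows NT 10.0)", path: "/" }));  // true
```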
Hopefully their push will make the issue more prominent to website owners and help Google Analytics alternatives to grow.
Agreed, but I think Plausible differentiates between sessions and visitors? The term should be sessions: non-unique visits across a defined time range. At least that is what I do with Pirsch [0] and on my website [1].
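For clarity, the usual sessionization rule looks something like the sketch below. The 30-minute inactivity timeout is the common GA-style convention and an assumption here, not necessarily what Pirsch or Plausible use.

```typescript
// Sketch of the usual sessionization rule: consecutive hits from the same
// visitor count as one session until there is a gap longer than the timeout.
// The 30-minute window is an assumed convention, not any specific tool's setting.

const SESSION_TIMEOUT_MS = 30 * 60 * 1000;

interface Hit {
  visitorId: string; // however the tool identifies a visitor (hash, fingerprint, ...)
  timestamp: number; // ms since epoch
}

function countSessions(hits: Hit[]): number {
  const lastSeen = new Map<string, number>();
  let sessions = 0;
  for (const hit of [...hits].sort((a, b) => a.timestamp - b.timestamp)) {
    const prev = lastSeen.get(hit.visitorId);
    if (prev === undefined || hit.timestamp - prev > SESSION_TIMEOUT_MS) {
      sessions += 1; // gap exceeded the timeout: a new session (a new "visit")
    }
    lastSeen.set(hit.visitorId, hit.timestamp);
  }
  return sessions;
}
```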
We don't show the number of sessions/visits at the moment. Most people seem happy with the 'unique visitors' number, and we don't want to add extra data to the UI unless it's truly useful.
If you read the site, especially the comparison to Matomo, it makes clear that the emphasis is on ease of use and a stripped-down, clean UX, not the whole kitchen sink.
If you want the kitchen sink, I suppose GA or Matomo are for you? There has to be something to differentiate them; it's pointless to expect or want them all to be the same.
If anyone is curious, that howtotomakemyblog.com referral site in the CF dashboard seems to be a domain that redirects back to the same domain that's hosting this blog post.
It didn't appear in his analytics tool's referrals. I guess that's because his tool follows redirects and ignores same-site referrals in the listing, while CF does not?
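That self-referral filtering is cheap to do client-side, which may be all the other tool is doing. Here is a sketch; following the redirect chain so an alias domain resolves to its final host would have to happen server-side and isn't shown.

```typescript
// Sketch of client-side self-referral filtering: report a referrer only when
// it points at a different host than the page itself.

function externalReferrer(): string | null {
  if (!document.referrer) return null;
  try {
    const ref = new URL(document.referrer);
    // Same host (or a www. variant of it) counts as internal navigation.
    const normalize = (h: string) => h.replace(/^www\./, "");
    if (normalize(ref.hostname) === normalize(location.hostname)) return null;
    return ref.origin;
  } catch {
    return null; // malformed referrer: ignore rather than miscount
  }
}
```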
I hope this doesn't turn into what their domain service became, which feels like a complete afterthought. I was excited about it at first, but it's been a year now and it feels the same as on day 1: incomplete and abandoned.
[0] https://blog.cloudflare.com/free-privacy-first-analytics-for... [1] https://blog.cloudflare.com/explaining-cloudflares-abr-analy... [2] https://github.com/ClickHouse/ClickHouse/pull/14221