I'm not sure the author really understands the math (from his comments) but he's right that it's significant, where significant means "I am at least 95% confident that the observed improvement is caused by this change". The test is fine (assuming he didn't just check the results every day until he found one that would make for a fun blog post).
The reason it feels intuitively broken is because the conversion rate is so low. But there were about 10k impressions in his test.
> The reason it feels intuitively broken is because the conversion rate is so low. But there were about 10k impressions in his test.
You can't be 95% confident in something if natural variability drowns out the effect. That kind of conversion rate is one bot RNG away from returning to the mean (noise); it just takes some domain knowledge to recognize that this kind of apparent significance is the illusion p-hacking produces.
I think the objection is about phrasing that can be interpreted as "I will be wrong in at most 5% of my significant results", while what we actually mean is "if there is no effect, we would see a difference this large in at most 5% of cases".
It's not uncommon for people to miss that nuance, and it's quite important for correctly interpreting results.
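To make the distinction concrete, here is a rough Python sketch of that second reading, using the click and impression counts quoted elsewhere in the thread (12/5563 vs 24/4403); it simulates a world with no real difference and asks how often noise alone produces a gap this large:

    # Simulate the null hypothesis: both ads share one click-through rate.
    # Count how often random noise alone produces a click-rate gap at least
    # as large as the observed one (12/5563 vs 24/4403).
    import numpy as np

    rng = np.random.default_rng(0)
    n_a, n_b = 5563, 4403          # impressions per ad
    clicks_a, clicks_b = 12, 24    # observed clicks
    pooled_rate = (clicks_a + clicks_b) / (n_a + n_b)

    sims = 100_000
    sim_a = rng.binomial(n_a, pooled_rate, sims) / n_a
    sim_b = rng.binomial(n_b, pooled_rate, sims) / n_b
    observed_gap = abs(clicks_b / n_b - clicks_a / n_a)

    p_null = np.mean(np.abs(sim_b - sim_a) >= observed_gap)
    print(f"share of no-effect worlds with a gap this large: {p_null:.4f}")

That share is what the 5% threshold refers to, not the probability that the improvement is real.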
Well, with low-probability events you can go a bit further with your back-of-the-envelope calculations, because that means the click counts are more or less Poisson distributed. The average of 12 and 24 is around 18, which gives a standard deviation of a bit over 4.
So there are about 3 standard deviations between the two. This sounds like quite a bit, but it really means they're each about 1.5 standard deviations from the supposed common mean. Which is not great, though it might pass some of the weaker statistical tests.
Now you should actually weight the values by the total number of impressions, in which case you might get a slightly higher significance, since the ad with 12 clicks was seen by more people.
So on balance you should be wondering what you're paying the graphic designer for, but perhaps not start a new career designing low-budget ads.
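For anyone who wants to redo that envelope, a rough sketch (the impression-weighted step is my guess at what "weight the values" means here, using the counts quoted elsewhere in the thread):

    # Back-of-the-envelope Poisson check of 12 vs 24 clicks.
    from math import sqrt

    clicks = [12, 24]
    mean = sum(clicks) / 2        # ~18 if both ads share one true rate
    sd = sqrt(mean)               # Poisson: variance ~= mean, so sd ~= 4.2

    gap_in_sds = (clicks[1] - clicks[0]) / sd
    print(f"gap between the two ads: {gap_in_sds:.1f} standard deviations")
    print(f"each ad sits about {gap_in_sds / 2:.1f} sd from the shared mean")

    # Weighting by impressions (5563 vs 4403): under a shared click rate the
    # expected counts aren't 18 and 18 but roughly 20 and 16, which pushes
    # the observed 12 and 24 a little further into the tails.
    impressions = [5563, 4403]
    rate = sum(clicks) / sum(impressions)
    expected = [n * rate for n in impressions]
    z = [(c - e) / sqrt(e) for c, e in zip(clicks, expected)]
    print("expected counts:", [round(e, 1) for e in expected])
    print("per-ad deviations in Poisson sds:", [round(x, 1) for x in z])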
I assume that one takes the uneven sampling into account? That should boost the significance a bit, though perhaps a bit more than I'd originally assumed. It remains a back-of-the-envelope calculation, after all.
Their argument consists of linking an Optimizely screenshot.
12 vs 24 clicks is not significant; it could’ve gone either way. Also, with a sample this minuscule, it’s easy to p-hack your way to the desired outcome.
Could you explain the calculations that lead to the claim that the result is not significant? From what I can tell, if we assume that clicking the ad is a weighted binary variable (e.g. what is modelled as a "proportional distribution" there), there's a statistically significant difference between the two results. It's even pretty strong at P=0.006 (per https://epitools.ausvet.com.au/ztesttwo), but I might be missing something?
In other words, if I'm doing the math right, for him to p-hack this by rerunning the experiment when there's actually no difference, he'd have to run it more than 100 times to have a 50% chance of getting significance this good or better.
There's of course plenty of other things that could be wrong outside of the simple statistical test: he could be making the numbers up, the groups might not be properly randomized, etc.
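For reference, a rough sketch of both calculations (a pooled two-proportion z-test, then the rerun estimate, assuming independent repeats); an online calculator may apply slightly different corrections, but this lands in the same ballpark as P=0.006:

    # Two-proportion z-test for 12/5563 vs 24/4403, plus a rough estimate of
    # how many independent reruns a p-hacker would need if there were no
    # real effect.
    from math import sqrt, erf, log

    clicks_a, n_a = 12, 5563
    clicks_b, n_b = 24, 4403

    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")

    # If each rerun has probability p_value of looking at least this good by
    # chance, how many reruns give a 50% shot at one such false positive?
    runs_needed = log(0.5) / log(1 - p_value)
    print(f"reruns for a 50% chance of a result this 'significant': {runs_needed:.0f}")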
The author is using a tool designed for landing pages. The quality of the samples is going to be wildly different, and that needs to be taken into account.
Linkedin calculates an ad impression (last I checked) as 50% of the ad is on screen for at least 300ms. I can be scrolling as fast as my thumb will flick down my LinkedIn newsfeed and it would probably count.
Then the KPI being used is clicks. I don't know of any business owner who would take that as valid. It should be some kind of conversion event (newsletter signup, contact request, purchase, etc.).
If it were up to me, I'd want 4,400 clicks and a few dozen conversion events to do my calculations on statistically significant effectiveness. Especially since the author is paying CPC (cost-per-click)... who cares about impressions at all?
As long as there's no bias favoring one group in the collection of samples, statistically significant is statistically significant. There's no such statistical thing as a "tool designed for landing pages", but instead tools that compare occurrences in different populations.
It’s statistically safest to live on Mars to avoid bear attacks.... the data matters. There’s limited real world application, and that’s what matters most. Semantics and arguing definitions isn’t useful.
You're handwaving about why there's some special definition of "significance" here, and when called on it, it comes down to "I don't feel like this is true".
Valid arguments, better formed, are: A) you're not sure it's representative of a real campaign, B) you're not sure it predicts the end outcomes we really care about instead of some intermediate measure. Neither of these improves with more n, so they're not significance-related issues: 4,000 clicks doesn't improve either A or B.
I'm not off in the semantic weeds. And I think it's kind of embarrassing you're trying for a witty rejoinder instead of giving -some- kind of cogent argument.
Sibling comments are right on the technical side, but to help build intuition: statistical significance isn't solely about 12 vs 24, you have to take the total population into account as well (5563 vs 4403 impressions). In particular, the smaller population had more than double the clicks - this is a hint that you have to do the math and can't just say "it's not significant", because it very well could be.
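A quick sketch of doing that math on the rates themselves (same counts as above):

    # Per-impression click rates: the ad with more clicks also had fewer
    # impressions, so its rate is roughly 2.5x the other's.
    rate_a = 12 / 5563
    rate_b = 24 / 4403
    print(f"{rate_a:.4%} vs {rate_b:.4%} (ratio {rate_b / rate_a:.1f}x)")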
You can have a sample size of 5 and have it be statistically significant. Or maybe you are using some non-statistical measure of statistical significance?
The fact that it has a significant p-value is interesting, but the lack of information about how the author decided when to stop is suggestive of p-hacking (i.e. we don't know how many screenshots were taken, but we understand that the author posted only the most favourable one)
Also when running campaigns, I’ve seen an ad suddenly get way more clicks than usual in a short timespan. Depending on how clicks are counted, just one user who gets click-happy could skew the result.
But in this case I suspect people were just curious why the ad looked like that and clicked to find out. Those may not be the people likely to convert.
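The "how did the author decide when to stop" worry above is easy to put numbers on. A rough simulation sketch, with made-up traffic figures and assuming one peek at a two-proportion z-test per day on two identical ads:

    # Optional stopping under the null: two ads with the same true rate,
    # peek at the p-value after every simulated "day" and declare victory
    # as soon as it dips below 0.05. Traffic numbers are invented.
    import numpy as np
    from math import sqrt, erf

    rng = np.random.default_rng(1)

    def two_prop_p(c_a, n_a, c_b, n_b):
        p_pool = (c_a + c_b) / (n_a + n_b)
        if p_pool in (0, 1):
            return 1.0
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = abs(c_a / n_a - c_b / n_b) / se
        return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

    days, daily_impressions, true_rate = 30, 300, 0.004
    trials, false_alarms = 2000, 0

    for _ in range(trials):
        c_a = c_b = n_a = n_b = 0
        for _ in range(days):
            c_a += rng.binomial(daily_impressions, true_rate)
            c_b += rng.binomial(daily_impressions, true_rate)
            n_a += daily_impressions
            n_b += daily_impressions
            if two_prop_p(c_a, n_a, c_b, n_b) < 0.05:
                false_alarms += 1
                break

    print(f"no-effect tests that 'hit significance' at some peek: "
          f"{false_alarms / trials:.1%}")

The exact share depends on the invented traffic numbers, but with daily peeking it typically ends up well above the nominal 5%.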
In the real world, we often have to work on a "preponderance of evidence" standard to actually get things done.
Especially if the second option is cheaper and faster, there's IMO no Bayesian prior reason to believe that the professional ad is better (the null hypothesis).
I think there can be a prior that a professional-looking ad generates more clicks. Your argument shows a lack of statistical understanding: conditional on this data, the Bayesian approach would be to update the prior (whether A is better, or they’re equally good) with the data collected. With such a small dataset, you might end up with a belief that there’s a 60% probability that B is better than A, but that’s not strong enough to conclude that B is in fact better than A, as you still have a lot of uncertainty.
With a prior that A is superior, you may still end up believing that A > B after updating, because there’s just so little data.
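A minimal sketch of that update, assuming flat Beta(1,1) priors on each ad's click rate and the counts quoted in the thread (an informative prior favouring A, as suggested above, would pull the posterior back toward A):

    # Beta-Binomial update: posterior probability that ad B's click rate
    # beats ad A's, starting from flat Beta(1,1) priors. With data this
    # thin the prior matters a lot; a prior favouring A shifts the answer.
    import numpy as np

    rng = np.random.default_rng(2)
    clicks_a, n_a = 12, 5563   # ad A: clicks, impressions
    clicks_b, n_b = 24, 4403   # ad B

    post_a = rng.beta(1 + clicks_a, 1 + n_a - clicks_a, 200_000)
    post_b = rng.beta(1 + clicks_b, 1 + n_b - clicks_b, 200_000)
    print(f"P(rate_B > rate_A | data, flat prior) ~ {np.mean(post_b > post_a):.3f}")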
I addressed in my second sentence that I disagree with that prior. I understand the statistics perfectly well.
And my main point is that a 60% probability is in fact actionable in the real world, in a situation where you are forced to take action with incomplete information. Assuming you are running an ad campaign, you have to choose one of the two.
A 95% confidence level (p < .05) is still an arbitrary threshold, even if it's a commonly used one.