
There were 12 (vs. 24) clicks total; you don't even have to use a p-value or any other calculation to know that this result means pretty much nothing ...



I'm not sure the author really understands the math (from his comments) but he's right that it's significant, where significant means "I am at least 95% confident that the observed improvement is caused by this change". The test is fine (assuming he didn't just check the results every day until he found one that would make for a fun blog post).

The reason it feels intuitively broken is because the conversion rate is so low. But there were about 10k impressions in his test.


> The reason it feels intuitively broken is because the conversion rate is so low. But there were about 10k impressions in his test.

You can't be 95% confident in something if natural variability drowns out the effect. That kind of conversion rate is one bot RNG away from reverting to the mean (noise); it just takes some domain knowledge to recognize why the apparent effect is an illusion.


That’s not what significance means.


I mean... yes it is? 95% confidence in an effect is what most people mean when they say "statistically significant".


I think the objection is about the phrasing, which can be interpreted as "I will only be wrong in at most 5% of significant results", while what it actually means is "if there is no effect, we will only see a difference this large in at most 5% of cases".

It's not uncommon for people to miss this nuance, and it's quite important for correctly interpreting results.
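
To make that concrete, here's a rough simulation sketch in R (assuming the impression counts from the post, 5563 vs 4403, and no real difference between the ads); the number it prints is roughly the two-sided p-value:

  set.seed(1)
  pooled <- (12 + 24) / (5563 + 4403)   # common click rate if the ads were identical
  observed_gap <- 24/4403 - 12/5563     # the difference we actually saw
  hits <- replicate(100000, {
    a <- rbinom(1, 5563, pooled) / 5563
    b <- rbinom(1, 4403, pooled) / 4403
    abs(b - a) >= observed_gap
  })
  mean(hits)  # share of "no effect" worlds that still show a gap this large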


Well, with low-probability events you can go a bit further with your back-of-the-envelope calculations, because the counts are more or less Poisson distributed. The average is somewhere around 16-18, so that gives a standard deviation of about 4.

So there are about 3 standard deviations between the two. That sounds like quite a bit, but it really means they're each about 1.5 standard deviations from the supposed common mean. Which is not great, though it might pass some of the weaker statistical tests.
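
A quick sketch of that envelope calculation in R (using the exact average of 18 rather than the rounded 16):

  lambda <- (12 + 24) / 2       # supposed common mean, 18
  (24 - 12) / sqrt(lambda)      # gap between the counts: ~2.8 standard deviations
  (24 - lambda) / sqrt(lambda)  # each count's distance from that mean: ~1.4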

Now you should actually weight the values by the total number of impressions, in which case you might get a slightly higher significance, since the ad with 12 clicks was seen by more people.

So on balance you should be wondering what you're paying the graphic designer for, but perhaps not start a new career designing low-budget ads.


Fisher's exact test p-value of 0.007 is pretty decent, not "not great."
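
For reference, that's this calculation in R, taking the impression counts quoted elsewhere in the thread (rows are clicked vs. not clicked, columns are the two ads):

  R> fisher.test(matrix(c(12, 5563 - 12, 24, 4403 - 24), nrow = 2))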


I assume that one takes the uneven sampling into account? That should boost the significance a bit, though perhaps a bit more than I'd originally assumed. It remains a back-of-the-envelope calculation, after all.


The author argues in the comments that it's statistically significant: https://www.gkogan.co/blog/looks-vs-results/?r=2#comment-520...


Their argument consists of linking an Optimizely screenshot.

12 vs 24 clicks is not significant; it could've gone either way. Also, given this minuscule sample, it's easy to p-hack your way to the desired outcome.


Could you explain the calculations that lead to the claim that the result is not significant? From what I can tell, if we assume that clicking the ad is a weighted binary variable (what's modelled as a "proportional distribution" there), there's a statistically significant difference between the two results. It's even pretty strong at P=0.006 (per https://epitools.ausvet.com.au/ztesttwo), but I might be missing something?

In other words, if I'm doing the math right, for him to p-hack this by rerunning the experiment when there's actually no difference, he'd have to run it more than 100 times to get a 50% chance of getting this good or better significance.
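
Rough sketch of that arithmetic in R, assuming independent reruns and taking p = 0.006 from the z-test above:

  p <- 0.006
  log(0.5) / log(1 - p)  # reruns needed for a 50% chance of one "hit": ~115
  1 - (1 - p)^100        # chance of at least one such result in 100 reruns: ~0.45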

There are of course plenty of other things that could be wrong outside the simple statistical test: he could be making the numbers up, the groups might not be properly randomized, etc.


The author is using a tool designed for landing pages. The quality of the samples is going to be wildly different, and that needs to be taken into account.

LinkedIn counts an ad impression (last I checked) as 50% of the ad being on screen for at least 300ms. I can be scrolling as fast as my thumb will flick down my LinkedIn newsfeed and it would probably still count.

Then the KPI being used is clicks. I don't know of any business owner who would take that as valid. It should be some kind of conversion event (newsletter signup, contact request, purchase, etc.).

If it were up to me, I'd want 4,400 clicks and a few dozen conversion events to do my calculations on statistically significant effectiveness. Especially since the author is paying CPC (cost-per-click)... who cares about impressions at all?


As long as there's no bias favoring one group in the collection of samples, statistically significant is statistically significant. There's no such statistical thing as a "tool designed for landing pages"; there are just tools that compare occurrences in different populations.


It's statistically safest to live on Mars to avoid bear attacks... the data matters. There's limited real-world application, and that's what matters most. Semantics and arguing over definitions aren't useful.


You're handwaving about why there's some special definition of "significance" here, and when called on it, it comes down to "I don't feel like this is true".

Valid arguments, better formed, are: A) you're not sure it's representative of a real campaign, B) you're not sure it predicts the end outcomes we really care about instead of some intermediate measure. Neither of these improves with more n, so they're not significance-related issues; 4,000 clicks doesn't improve either A or B.

I'm not off in the semantic weeds. And I think it's kind of embarrassing you're trying for a witty rejoinder instead of giving -some- kind of cogent argument.


I agree with you, but I made points A and B in my original comment that you replied to.


Neither of which improves with more n, and both are completely unrelated to the discussion of "significance".


Sibling comments are right on the technical side, but to help build intuition: statistical significance isn't solely about 12 vs 24; you have to take the total populations into account as well (5563 vs 4403 impressions). In particular, the smaller population had more than double the clicks - this is a hint that you have to do the math and can't just say "it's not significant", because it very well could be.


It is absolutely 'significant' for the usual statistical meaning, p<0.05.


You can have a sample size of 5 and have it be statistically significant. Or maybe you are using some non-statistical measure of statistical significance?


Ah yes, the brilliant "reject a priori" inference strategy in the wild.

It's a statistically significant difference.


When I calculate a p value, it looks like p<0.01.

That seems like a highly significant result to me...


It depends.

The fact that it has a significant p-value is interesting, but the lack of information about how the author decided when to stop is suggestive of p-hacking (i.e. we don't know how many screenshots were taken along the way, and for all we know the author posted only the most favourable one).


I stopped when the campaign reached its $500 budget.


The effect may be statistically significant depending on your tolerance:

  R> prop.test(c(12, 24), c(5563, 4403))

   2-sample test for equality of proportions with continuity correction

  data:  c(12, 24) out of c(5563, 4403)
  X-squared = 6.5211, df = 1, p-value = 0.01066
  alternative hypothesis: two.sided
  95 percent confidence interval:
   -0.0059903650 -0.0005970741
  sample estimates:
       prop 1      prop 2 
  0.002157109 0.005450829


Not that it harms the argument, but do you really need it two sided here?


No, a one-sided test is also appropriate. I tend to default to two-sided tests because the acceptance of one-sided tests varies by field.
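
For what it's worth, the one-sided version would just be the following, which roughly halves the two-sided p-value above:

  R> prop.test(c(12, 24), c(5563, 4403), alternative = "less")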


The p-value by Fisher's exact test is 0.007. It's a significant difference.


Also when running campaigns, I’ve seen an ad suddenly get way more clicks than usual in a short timespan. Depending on how clicks are counted, just one user who gets click-happy could skew the result.

But in this case I suspect people were just curious why the ad looked like that and clicked to find out. Those may not be the people likely to convert.


In the real world, we often have to work on a "preponderance of evidence" standard to actually get things done.

Especially if the second option is cheaper and faster, there's IMO no reason to hold a Bayesian prior that the professional ad is better (the null hypothesis).

So... seems like useful data to me.


I think there can be a prior that a professional-looking ad will generate more clicks. Your argument shows a lack of statistical understanding - conditional on this data, the Bayesian approach would be to update the prior (whether A is better, or they're equally good) with the data collected. With such a small dataset, you might end up with a belief that there's a 60% probability that B is better than A, but that's not strong enough to conclude that B is in fact better than A, as you still have a lot of uncertainty.

With a prior that A is superior, you may still end up believing that A > B after updating, because there’s just so little data.
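
A minimal sketch of that update in R, assuming conjugate Beta priors on each ad's click rate; the prior parameters below are placeholders, and the posterior probability you get out depends heavily on how strongly the prior favours A:

  set.seed(1)
  prior_a <- c(1, 1)  # placeholder prior for A's rate; something like c(20, 4000) would encode "A is better"
  prior_b <- c(1, 1)  # placeholder prior for B's rate
  a <- rbeta(100000, prior_a[1] + 12, prior_a[2] + 5563 - 12)
  b <- rbeta(100000, prior_b[1] + 24, prior_b[2] + 4403 - 24)
  mean(b > a)         # posterior probability that B's click rate beats A's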


I addressed in my second sentence that I disagree with that prior. I understand the statistics perfectly well.

And my main point is that a 60% probability is in fact actionable in the real world, in a situation where you are forced to take action with incomplete information. Assuming you are running an ad campaign, you have to choose one of the two.

A 95% confidence threshold (p = .05) is still arbitrary, even if it's a commonly used one.


Yes, but the significance is high here; it's a pretty (un)lucky outcome to get if A and B are equivalent, let alone if A is better than B.


Thank you for commenting; I'm rather annoyed I spent the time to read this... The author wasted everyone's time, including his own.





