It will only differ about 5% of the time if you have an adequate sample size and check for significance only once, when that sample size is reached.
If you end the test the moment the data reaches 95% significance, it will show a difference about 50% of the time for the same email. Many people make this mistake.
A 95% confidence interval doesn't mean much if you don't follow good statistical practices.
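To make this concrete, here's a rough simulation of the "same email to both groups" case: it compares checking significance once at the planned sample size against peeking after every batch and stopping at the first p < 0.05. The click rate, batch size, number of simulations, and the plain two-proportion z-test are all made-up stand-ins, so treat it as a sketch of the effect rather than a model of any particular tool:

    # A/A simulation: both "variants" are the same email, so every
    # significant result is a false positive.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    BASE_RATE = 0.10    # same click rate for both arms (the null is true)
    N_PER_ARM = 10_000  # planned sample size per arm
    BATCH = 200         # peek after every batch of this many users per arm
    N_SIMS = 2_000

    def p_value(hits_a, n_a, hits_b, n_b):
        """Two-sided two-proportion z-test."""
        p_pool = (hits_a + hits_b) / (n_a + n_b)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        if se == 0:
            return 1.0
        z = (hits_a / n_a - hits_b / n_b) / se
        return 2 * norm.sf(abs(z))

    fp_once = fp_peek = 0
    for _ in range(N_SIMS):
        a = rng.random(N_PER_ARM) < BASE_RATE
        b = rng.random(N_PER_ARM) < BASE_RATE

        # Discipline: check once, at the planned sample size.
        if p_value(a.sum(), N_PER_ARM, b.sum(), N_PER_ARM) < 0.05:
            fp_once += 1

        # Peeking: test after every batch, stop at the first "significant" result.
        for n in range(BATCH, N_PER_ARM + 1, BATCH):
            if p_value(a[:n].sum(), n, b[:n].sum(), n) < 0.05:
                fp_peek += 1
                break

    print(f"false positives, single check: {fp_once / N_SIMS:.1%}")  # ~5%
    print(f"false positives, peeking:      {fp_peek / N_SIMS:.1%}")  # several times higher

The exact inflation depends on how often and for how long you peek, but it is always well above the 5% you thought you were getting.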
I don't understand why you would need a sample of a certain size. Setting a significance threshold at 5% already takes sample size into account. For example, if I ran a permutation test with a sample size of 5 in each group, it could never be significant at that threshold, and "never" is < 5%!
A small sample size would lower your power to detect meaningful differences, which the original scenario doesn't have (by definition).
(If distributional assumptions, etc, are violated, then that's a different story!)
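A quick simulation bears this out. I've used a t-test here rather than an exact permutation test just to keep it short, and the effect size and noise level are invented numbers; the point is only that with n = 5 per group the Type I error stays at (or below) the 5% you chose, while power is poor:

    # Rejection rates at n = 5 per group: Type I error under the null vs.
    # power against a modest real difference. Illustrative numbers only.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    N_SIMS = 10_000
    N = 5  # per group

    def rejection_rate(effect):
        rejections = 0
        for _ in range(N_SIMS):
            a = rng.normal(0.0, 1.0, N)
            b = rng.normal(effect, 1.0, N)
            if ttest_ind(a, b).pvalue < 0.05:
                rejections += 1
        return rejections / N_SIMS

    print(f"Type I error (no real difference): {rejection_rate(0.0):.1%}")  # about 5%
    print(f"Power against a 0.5 sd difference: {rejection_rate(0.5):.1%}")  # low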
You need to make sure you have enough samples in order to know whether you rejected the null hypothesis by chance. Stopping your test early is a form of p-hacking. See:
Peeking at your data and calculating the sample size you need for a test are separate statistical issues. I agree that peeking messes up significance levels :).
The point I was trying to make was that you can decide to run a test with a very small sample (e.g. n = 5), and it will still have the Type I error rate you set if you chose a significance level of .05.
> You need to make sure you have enough samples in order to know if you rejected the null hypothesis by chance.
You do this when you decide the significance level (e.g. .05). The critical value needed to reject, given a significance level, is a function of sample size.
The definition of Type I error on Wikipedia has a good explanation of this:
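To make "a function of sample size" concrete, here's a toy example with a sign/binomial test of a 50/50 null (say, counting which of two versions each conversion came from). The test choice is just for illustration; the pattern is the same for other tests: the cutoff needed to reject grows with n, and the realised Type I error never exceeds the 5% you chose (and at n = 5 this particular test can't reject at all):

    # Smallest winning count needed to reject a 50/50 null at alpha = 0.05,
    # as a function of sample size, for a two-sided binomial (sign) test.
    from scipy.stats import binom

    ALPHA = 0.05
    for n in (5, 10, 20, 50, 100, 1000):
        k = next((k for k in range(n // 2 + 1, n + 1)
                  if 2 * binom.sf(k - 1, n, 0.5) <= ALPHA), None)
        if k is None:
            print(f"n = {n:5d}: no split of {n} can reject at the 5% level")
        else:
            realised = 2 * binom.sf(k - 1, n, 0.5)  # actual two-sided Type I error
            print(f"n = {n:5d}: need at least {k}/{n} on one side "
                  f"(realised Type I error = {realised:.3f})")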
It should be possible to adjust the significance estimate down based on an assumption of continuous sampling before the sample-size target is reached. E.g. if you're aiming for a sample size of 100 and after 20 samples the split is 19/1, the current ratio is 95%, but that's not accurate yet; it could, for example, be adjusted to the average of the two possible outcome extremes (99/1 and 19/81) and shown as 59% with low confidence. I don't know the statistics, or whether that specific method actually makes sense, but it should be possible.
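Something like this, as a sketch. It's not a standard method as far as I know (sequential tests and alpha-spending schemes are the principled versions of this idea), and the function is just made up to reproduce the 19/1-out-of-100 example:

    def adjusted_ratio(wins, losses, target_n):
        """Average of the best and worst final splits still reachable."""
        seen = wins + losses
        remaining = target_n - seen
        current = wins / seen
        best = (wins + remaining) / target_n   # every remaining sample is a win
        worst = wins / target_n                # every remaining sample is a loss
        return current, (best + worst) / 2

    current, adjusted = adjusted_ratio(wins=19, losses=1, target_n=100)
    print(f"raw split so far:  {current:.0%}")   # 95%
    print(f"adjusted estimate: {adjusted:.0%}")  # 59%, as in the example above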
Only bigcos or companies whose core business is DTC will have the resources to do proper A/B tests, and even then they might not have the statistical knowledge to run them rigorously.
What you're saying is perhaps correct, but it's not how A/B testing is done in the real world. I think many tests show a difference for purely random reasons.
a) 5% is still one in 20. Are you doing 20 trials or more a month? (Of course, it won't be _exactly_ 5% of the time... there are other statistics we could calculate to say how likely it is to differ from 5% by more than some specified delta, depending on how many trials you run... oh my.) Rough numbers below.
b) That's assuming you are using statistics right. It's quite hard to use statistics right.
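Rough numbers for (a), assuming the trials are independent (itself an idealisation):

    # Chance of at least one spurious "winner" across repeated A/A-equivalent
    # trials, each with a 5% false-positive rate.
    for trials in (1, 5, 10, 20, 50):
        p_any = 1 - 0.95 ** trials
        print(f"{trials:3d} trials -> {p_any:.0%} chance of at least one false positive")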
That's way better odds than most people's gut feel. Everyone thinks they don't need to run experiments because THEY already know what works.
Did you know that in experimentation programs run by Microsoft, Google and Amazon, roughly two-thirds of their ideas have no impact on, or actually hurt, the metrics they were designed to improve? And yet rookie web devs or marketing assistants "know" better.
It's actually more like 80% for Google. I spent a lot of time running various experiments on Search.
I'll point out a major difference, though: Microsoft, Google, and Amazon are already highly optimized. They've had millions of man-hours put into optimizing everything from product design to UI to UI element positioning to wording to colors to fonts. When you get a new, beneficial result in a change to Google, it's usually because the environment has changed while you weren't looking, and the user is different.
That doesn't apply to a new business that's just learning how to sell their product. In a new business, by definition, you've spent zero time micro-optimizing your message & product. You can get really big wins almost by accident, if you happen to have stumbled into a product category where users actually want what you're building.