It will only differ 5% of the time if you use an adequate sample size and check for significance only once, after that sample size is reached.
If you end the test the moment the data reaches 95% significance, it will show a difference about 50% of the time for the same email. Many people make this mistake.
A 95% confidence interval doesn't mean much if you don't follow good statistical practices.
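For what it's worth, a rough simulation makes the size of that effect visible. This is just an illustrative sketch (the conversion rate, batch size, and number of peeks are made-up parameters): an A/A test where both variants share one true conversion rate, analyzed two ways, a single check at the planned sample size versus stopping at the first peek where |z| > 1.96. The fixed-sample check stays near 5% false positives, while the peeking version climbs far above that as the number of looks grows.

    import math
    import random

    def z_stat(conv_a, n_a, conv_b, n_b):
        # Two-proportion z statistic with a pooled rate; 0 if the pooled rate is degenerate.
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        return 0.0 if se == 0 else (conv_a / n_a - conv_b / n_b) / se

    random.seed(1)
    RATE = 0.10    # same true conversion rate for both variants (an A/A test)
    BATCH = 100    # observations added per variant between looks at the data
    CHECKS = 100   # number of looks
    TRIALS = 500
    fixed_fp = peeking_fp = 0

    for _ in range(TRIALS):
        conv_a = conv_b = n_a = n_b = 0
        stopped_early = False
        for _ in range(CHECKS):
            conv_a += sum(random.random() < RATE for _ in range(BATCH))
            conv_b += sum(random.random() < RATE for _ in range(BATCH))
            n_a += BATCH
            n_b += BATCH
            # Peeking analysis: declare a winner the first time |z| crosses 1.96.
            if not stopped_early and abs(z_stat(conv_a, n_a, conv_b, n_b)) > 1.96:
                stopped_early = True
        peeking_fp += stopped_early
        # Fixed-sample analysis: a single check at the planned sample size.
        fixed_fp += abs(z_stat(conv_a, n_a, conv_b, n_b)) > 1.96

    print(f"check once at the planned n: {fixed_fp / TRIALS:.1%} false positives")
    print(f"stop at first 'significant' peek: {peeking_fp / TRIALS:.1%} false positives")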
I don't understand why you would need a sample of a certain size. Setting a significance threshold at 5% takes sample size into account. For example, if I ran a permutation test with a sample size of 5 in each group, it could never be significant at that threshold, and "never" is < 5%!
A small sample size would lower your power to detect meaningful differences, which the original scenario doesn't have (by definition).
(If distributional assumptions, etc., are violated, then that's a different story!)
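Here's a minimal sketch of that point, with a difference-in-means statistic chosen for simplicity (my own illustration, not a reference implementation): two groups of n = 5 drawn from the same distribution, tested with an exact two-sided permutation test. Under the null, the rejection rate at .05 stays at or below 5%, so the tiny sample keeps the Type I error rate you chose; it just has almost no power to detect a real effect. (With even smaller groups, the smallest attainable p-value can exceed .05, and then the test literally can never reject.)

    import itertools
    import random

    def perm_p(a, b):
        # Exact two-sided permutation p-value for the difference in means.
        pooled = a + b
        observed = abs(sum(a) / len(a) - sum(b) / len(b))
        hits = total = 0
        for idx in itertools.combinations(range(len(pooled)), len(a)):
            chosen = set(idx)
            ga = [pooled[i] for i in chosen]
            gb = [pooled[i] for i in range(len(pooled)) if i not in chosen]
            stat = abs(sum(ga) / len(ga) - sum(gb) / len(gb))
            hits += stat >= observed - 1e-12   # splits at least as extreme as observed
            total += 1
        return hits / total

    random.seed(0)
    TRIALS = 1000
    rejections = 0
    for _ in range(TRIALS):
        a = [random.gauss(0, 1) for _ in range(5)]
        b = [random.gauss(0, 1) for _ in range(5)]   # same distribution: the null is true
        rejections += perm_p(a, b) <= 0.05

    print(f"rejection rate under the null with n=5 per group: {rejections / TRIALS:.3f}")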
You need to make sure you have enough samples in order to know if you rejected the null hypothesis by chance. Stopping your test early is a form of p-hacking. See:
Peeking at your data and calculating the sample size you need for a test are separate statistical issues. I agree that peeking messes up significance levels :).
The point I was trying to make was that you can decide to run a test with a very small sample (e.g. n = 5), and it will still have the Type I error rate you set if you chose a significance level of .05.
> You need to make sure you have enough samples in order to know if you rejected the null hypothesis by chance.
You do this when you decide the significance level (e.g. .05). The critical value needed to reject, given a significance level, is a function of sample size.
The definition of Type I error on Wikipedia has a good explanation of this:
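To make that concrete, here's a small sketch using a one-sided exact binomial test of H0: p = 0.5 (my example, not from the thread): the number of successes needed to reject at alpha = .05 is derived from n, so the rejection threshold moves with the sample size while the Type I error rate stays at or below 5% at every n.

    from math import comb

    def successes_needed(n, alpha=0.05):
        # Smallest k with P(X >= k) <= alpha when X ~ Binomial(n, 0.5), or None if unreachable.
        for k in range(n + 1):
            tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
            if tail <= alpha:
                return k, tail
        return None

    for n in (5, 10, 20, 50, 100):
        result = successes_needed(n)
        if result is None:
            print(f"n={n:3d}: no outcome can reach p <= .05")
        else:
            k, tail = result
            print(f"n={n:3d}: reject at {k} or more successes (exact p = {tail:.4f})")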
It should be possible to adjust the significance estimate down while the sample is still below the target size, on the assumption that sampling will continue. E.g. if you're aiming for a sample size of 100 and after 20 samples the split is 19/1, the current ratio is 95%, but that isn't reliable yet. It could, for example, be adjusted to the average of the two possible outcome extremes (99/1 and 19/81) and shown as 59% with low confidence. I don't know the statistics, or whether that specific method actually makes sense, but it should be possible.
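Taking that idea at face value, the arithmetic behind the quoted numbers looks like this (just reproducing the proposal, not claiming it's a sound estimator):

    planned = 100
    seen_a, seen_b = 19, 1
    remaining = planned - (seen_a + seen_b)

    current = seen_a / (seen_a + seen_b)        # 19/1 so far -> 95%
    best_case = (seen_a + remaining) / planned  # all remaining go to A: 99/1 -> 99%
    worst_case = seen_a / planned               # all remaining go to B: 19/81 -> 19%
    adjusted = (best_case + worst_case) / 2     # midpoint of the two extremes

    print(f"current ratio: {current:.0%}, adjusted estimate: {adjusted:.0%}")   # 95%, 59%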