
Disclaimer: I'm the director of data science at VWO, an Optimizely competitor.

In my view, the issue is not one-tail vs two-tail tests, or sequential vs one-look tests at all. The issue is a failure to quantify uncertainty.

Optimizely (last time I looked), our old reports, and most other tools all give you improvement as a single number. Unfortunately that's BS. It's simply a lie to say "Variation is 18% better than Control" unless you had Facebook levels of traffic. An honest statement will quantify the uncertainty: "Variation is between -4.5% and +36.4% better than Control".

When phrased this way, it's hardly surprising that deploying this variation failed to achieve an 18% lift - 18% is just one possible value in a wide range of possible values.
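
To make that concrete, here's a toy sketch (made-up posterior samples, not output from any real experiment or from our product) showing that the single number and the interval are just two summaries of the same posterior distribution over lift:

    # Toy illustration: a point estimate and a credible interval both summarize
    # the same posterior over "lift"; the interval just keeps the uncertainty.
    # The numbers below are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    lift = rng.normal(loc=0.18, scale=0.10, size=100_000)  # hypothetical posterior draws

    point_estimate = np.median(lift)                   # "18% better": hides the spread
    lo, hi = np.percentile(lift, [2.5, 97.5])          # central 95% credible interval
    print(f"point estimate: {point_estimate:+.1%}")
    print(f"95% interval:   {lo:+.1%} to {hi:+.1%}")   # roughly -2% to +38%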

The big problem with this is that customers (particularly agencies who are selling A/B test results to clients) hate it. If we were VC funded, we might even have someone pushing us to tell customers the lie they want rather than the truth they need.

Note that to provide uncertainty bounds like this, one needs to use a Bayesian method (only us, AB Tasty and Qubit do this, unless I forgot about someone).

(Frequentist methods can provide confidence intervals, but these are NOT the same thing. Unfortunately p-values and confidence intervals are completely unsuitable for reporting to non-statisticians; they are misinterpreted by almost 100% of laypeople. http://myweb.brooklyn.liu.edu/cortiz/PDF%20Files/Misinterpre... http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf )




I pushed for this really hard at one company. I was just a dev, not one of our Harvard MBAs, but I know math.

I got lots of pushback because it's harder to remember two numbers than one, and it makes reporting confusing because executives are also used to just the one. Throw in the business guys who did statistical significance by (literal quote) "experience and gut feel" instead of reliable, rigorous quantification of uncertainty, i.e. statistics, and I really couldn't do anything.


Could you point to the math/protocol that goes from a sample set to a range like "-4 to +34"? And, of course, how does one go from a range to a single number?

I feel like the discussion on this thread is missing the underlying "source code".


Here are slides on the topic and the full math paper. But we don't go from a credible interval to a single number - we just report credible intervals. We find that to be the only honest choice.

https://www.chrisstucchio.com/pubs/slides/gilt_bayesian_ab_2...

https://www.chrisstucchio.com/blog/2013/bayesian_analysis_co...

https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technic...
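
If you want something runnable to go with those, here's a minimal sketch of the core computation (independent Beta-Binomial model with uniform priors and plain Monte Carlo; a simplification for illustration, not our production implementation, so the papers above remain the reference):

    # Sketch: from raw counts to a credible interval on relative improvement.
    # Assumes independent Beta(1, 1) (uniform) priors on each conversion rate.
    import numpy as np

    def lift_credible_interval(conv_a, n_a, conv_b, n_b, level=0.95, draws=200_000, seed=0):
        rng = np.random.default_rng(seed)
        # Conjugate Beta-Binomial update gives the posterior of each conversion rate.
        rate_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=draws)
        rate_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=draws)
        lift = (rate_b - rate_a) / rate_a             # relative improvement of B over A
        tail = 100 * (1 - level) / 2
        lo, hi = np.percentile(lift, [tail, 100 - tail])
        prob_b_better = np.mean(rate_b > rate_a)      # a handy extra summary
        return lo, hi, prob_b_better

    # Hypothetical data: 10,000 visitors per arm, 200 vs 230 conversions.
    lo, hi, p = lift_credible_interval(200, 10_000, 230, 10_000)
    print(f"95% credible interval for lift: {lo:+.1%} to {hi:+.1%}, P(B beats A) = {p:.0%}")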


thank you - bedside reading :-)


I don't know about the part that goes from a sample set to a range, but the range provided is likely just the 90% or 95% confidence interval, which means you can get the single number by taking the average of the two endpoints. Don't do that, though: it tells you a lot less than the two numbers. A range of -50 to +60 means you have a lot of uncertainty and should probably not deploy, whereas +2 to +6 means you have little uncertainty and should deploy, since it is almost certainly a win, if a small one.
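
In code, the "single number" is just the midpoint, and the sign of the endpoints is what actually carries the decision. A deliberately literal sketch (real decisions should also weigh the cost of being wrong):

    # Collapse an interval to a single number, plus a crude deploy rule based on
    # where the interval sits relative to zero.
    def summarize(lo, hi):
        midpoint = (lo + hi) / 2
        if lo > 0:
            verdict = "almost certainly a win, deploy"
        elif hi < 0:
            verdict = "almost certainly a loss, do not deploy"
        else:
            verdict = "too uncertain, keep testing"
        return midpoint, verdict

    print(summarize(-0.50, 0.60))  # midpoint ~0.05: too uncertain, keep testing
    print(summarize(0.02, 0.06))   # midpoint ~0.04: almost certainly a win, deploy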


It's not a confidence interval, it's a credible interval. It's derived from a Bayesian posterior.

Confidence intervals are confusing and should rarely be used to communicate statistics to laypeople. (See the links in my post above.) Most of the time, even if you tell people the definition of a confidence interval, they will misinterpret it as a credible interval.

This is why we went Bayesian at VWO - we felt it would be easier to change the statistics to match people's thinking than to change people's thinking about statistics.
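
For the curious: on simple conversion data the two kinds of interval are often numerically similar even though they answer different questions, which is exactly why people conflate them. A quick sketch with made-up numbers, comparing a textbook normal-approximation confidence interval to a uniform-prior Beta credible interval:

    # Same data, two intervals. They come out numerically close here, but only the
    # credible interval means "95% probability the true rate is in this range".
    import numpy as np
    from scipy.stats import beta, norm

    conversions, visitors = 120, 5_000
    p_hat = conversions / visitors

    # Frequentist: normal-approximation (Wald) 95% confidence interval.
    se = np.sqrt(p_hat * (1 - p_hat) / visitors)
    ci = (p_hat - norm.ppf(0.975) * se, p_hat + norm.ppf(0.975) * se)

    # Bayesian: 95% credible interval from a Beta(1, 1) (uniform) prior.
    posterior = beta(1 + conversions, 1 + visitors - conversions)
    cred = (posterior.ppf(0.025), posterior.ppf(0.975))

    print(f"confidence interval: {ci[0]:.4f} to {ci[1]:.4f}")
    print(f"credible interval:   {cred[0]:.4f} to {cred[1]:.4f}")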



