Most Winning A/B Test Results Are Illusory [pdf] (qubit.com)
163 points by maverick_iceman on Jan 19, 2017 | 105 comments



This is yet another article that ignores the fact that there is a MUCH better approach to this problem.

Thompson sampling avoids the problems of multiple testing, power, early stopping and so on by starting with a proper Bayesian approach. The idea is that the question we want to answer is more "Which alternative is nearly as good as the best with pretty high probability?". This is very different from the question being answered by a classical test of significance. Moreover, it would be good if we could answer the question partially by decreasing the number of times we sample options that are clearly worse than the best. What we want to solve is the multi-armed bandit problem, not the retrospective analysis of experimental results problem.

The really good news is that Thompson sampling is both much simpler than hypothesis testing and can be used in far more complex situations. It is known to be an asymptotically optimal solution to the multi-armed bandit problem and often takes only a few lines of very simple code to implement.

See http://tdunning.blogspot.com/2012/02/bayesian-bandits.html for an essay and see https://github.com/tdunning/bandit-ranking for an example applied to ranking.
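For concreteness, here is a minimal Beta-Bernoulli Thompson sampling sketch in Python. This is my own illustration of the "few lines of simple code" point, not the linked bandit-ranking code; the variant names and uniform Beta(1, 1) priors are hypothetical.

    import random

    # One Beta(1, 1) prior per variant; these counts are the posterior parameters.
    successes = {"A": 1, "B": 1}
    failures = {"A": 1, "B": 1}

    def choose_variant():
        # Sample a plausible conversion rate for each variant from its posterior
        # and serve whichever variant drew the highest rate.
        draws = {v: random.betavariate(successes[v], failures[v]) for v in successes}
        return max(draws, key=draws.get)

    def record(variant, converted):
        # Update the posterior with the observed outcome.
        if converted:
            successes[variant] += 1
        else:
            failures[variant] += 1

Variants that keep losing get sampled less and less often, which is the regret-minimising behaviour described above.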


Thompson sampling is a great tool. I've used it to make reasonably large amounts of money. But it does not solve the same problem as A/B testing.

Thompson Sampling (at least the standard approach) assumes that conversion rates do not change. In reality they vary significantly over a week, and this fundamentally breaks bandit algorithms.

https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...

Furthermore, you do not need to use Thompson Sampling to have a proper Bayesian approach. At VWO we also use a proper Bayesian approach, but we use A/B testing to avoid the various pitfalls that Thompson Sampling has. Google Optimize uses an approach very similar to ours (although it may be flawed [1]), and so does A/B Tasty (probably not flawed).

https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technic...

Note: I'm the Director of Data Science at VWO. Obviously I'm biased, etc. However my post critiquing bandits was published before I took on this role. It was a followup to a previous post of mine which led people to accidentally misuse bandits: https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs...

[1] The head of data science at A/B Tasty suggests Google Optimize counts sessions rather than visitors, which would break the IID assumption. https://www.abtasty.com/uk/blog/data-scientist-hubert-google...


No, no, no, no. https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm... needs a rebuttal so very, very badly.

> Depending on what your website is selling, people will have a different propensity to purchase on Saturday than they have on Tuesday.

Affects multi-armed bandit and fixed tests. If you do a fixed A/B test on Tuesday, your results will also be wrong. Either way, you have to decide what kind of seasonality your data has, and don't make any adjustments until the period is complete.

If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.

> Delayed response is a big problem when A/B testing the response to an email campaign.

Affects multi-armed bandit and fixed tests. If you include immature data in your p-test, your results will be wrong. Either way, you have to decide how long it takes to declare an individual success or failure.

> You don't get samples for free by counting visits instead of users

Affects multi-armed bandit and fixed tests. Focusing on relevant data increases the power of your experiment.

---

For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-bandit tests are harder because of these design decisions, despite the fact they affect any experiment process.


> If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.

It can, but the time it takes is exp(# of samples already passed).

You can improve this by using a non-stationary Bayesian model (i.e. one that assumes conversion rates change over time) but this usually involves solving PDEs or something equally difficult.
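For what it's worth, a far cruder heuristic than the principled non-stationary models mentioned here is to exponentially discount the posterior counts so old data fades and the bandit can "forget" a rate that has drifted. A rough sketch, with hypothetical decay rate and variant names, and much weaker guarantees than the approaches above:

    import random

    DECAY = 0.999                   # per observation; closer to 1 means longer memory
    alpha = {"A": 1.0, "B": 1.0}    # Beta posterior parameters per variant
    beta = {"A": 1.0, "B": 1.0}

    def choose_variant():
        draws = {v: random.betavariate(alpha[v], beta[v]) for v in alpha}
        return max(draws, key=draws.get)

    def record(variant, converted):
        # Shrink every posterior toward the Beta(1, 1) prior, then add the new outcome.
        for v in alpha:
            alpha[v] = 1 + (alpha[v] - 1) * DECAY
            beta[v] = 1 + (beta[v] - 1) * DECAY
        if converted:
            alpha[variant] += 1
        else:
            beta[variant] += 1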

> For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-bandit tests are harder because of these design decisions, despite the fact they affect any experiment process.

The point the author (me) is trying to make is not that bandits are fundamentally flawed. The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.

For bandits, the fixes are not nearly as simple. It usually involves non-simple math, or at the very least non-intuitive things (for instance not actually running a bandit until 1 week has passed).

At VWO we realized that most of our customers are not sophisticated enough to get all this stuff right, which is why we didn't switch to bandits.


> The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.

Multi-bandit has the same fix: make sure the test has run for long enough before adjusting sampling proportions.


So what I'm proposing to do is run A/B with a 50/50 split for a full week, then when B wins shift to 0/100 in favor of B.

You seem to be proposing to run A/B with a 50/50 split for a full week, then when B does a lot better shift to 10/90 in favor of B and maybe a few weeks later shift to 1/99.

What practical benefit do you see to this approach? From my perspective this just slows down the experimental process and keeps losing variations (and associated code complexity) around for a lot longer.


First, Google Analytics (for example) runs content experiments for a minimum of two weeks regardless of results. It's hardly an unrealistic timeframe for reliable conclusions.

> What practical benefit do you see to this approach?

Statistically rigorous results, with minimal regret.

In your example, you reach the end of the week, and your 50/50 split has a one-sided p=0.10, double the usual p<0.05 criterion. What do you do?

(a) Call it in favor of B, despite being uncertain about the outcome.

(b) Keep running the test. This compromises the statistical rigor of your test.

(c) Keep running the test, but use sequential hypothesis testing, e.g. http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigo... This significantly increases the time to reach a conclusion, and costs you conversions in the meantime.

(a) and (b) are the most popular choices, despite them being statistically unjustifiable. http://www.evanmiller.org/how-not-to-run-an-ab-test.html

---

The essential difference when choosing the approach is that 50/50 split optimizes for shortest time to conclusion, and multi-bandit optimizes for fewest failures.

In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how diabolically clever your scheme. https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of...

There are times when the former is more important, e.g. marketing wants to know how to brand a product that is being released next month. These are the clinical-like experiments that frequentist approaches were formulated for.


> Statistically rigorous results, with minimal regret.

The results are only statistically rigorous provided your bandit obeys relatively strong assumptions.

As another example, suppose you ran a 2-week test. Suppose that from week 1 to week 2, both conversion rates changed, but the delta between them remained roughly the same. A 50/50 A/B split doesn't mind this, and in fact still returns the right answer. Bandits do mind.

I don't do p-values. I do Bayesian testing, same as you. I just recognize that in the real world, weaker assumptions are more robust to experimenter or model error, both of which are generally the dominant error mode.

> In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how clever your scheme.

This is simply not true. The Gittins Index beats Thompson sampling, subject again to the same strong assumptions.

Look, I know the theoretical advantages of bandits and I advocate their use under some limited circumstances. I just find the stronger assumptions they require (or alternately the much heavier math requirements) mean they aren't a great replacement for A/B tests which are much simpler and easier to get right.


Thompson sampling does not need to assume stability. You can inject time features into the model if you want to model seasonality (or, more accurately, ignorance of seasonality) and you can also have a hidden random-walk variable.

Yes, if you assume stability and things vary, you will not have good results. That seems like any statistics.
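As a toy illustration of "injecting time features", one could keep a separate posterior per (variant, day-of-week) cell instead of a single stationary rate per variant. A real model would share strength across days (hierarchically or via regression); this sketch, with hypothetical names, only shows the shape of the idea:

    import random
    from datetime import date

    VARIANTS = ["A", "B"]
    # One [alpha, beta] Beta posterior per (variant, weekday) cell.
    counts = {(v, dow): [1, 1] for v in VARIANTS for dow in range(7)}

    def choose_variant(today=None):
        dow = (today or date.today()).weekday()
        draws = {v: random.betavariate(*counts[(v, dow)]) for v in VARIANTS}
        return max(draws, key=draws.get)

    def record(variant, converted, today=None):
        dow = (today or date.today()).weekday()
        counts[(variant, dow)][0 if converted else 1] += 1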


I agree. I've even done similar things a few years back:

https://www.chrisstucchio.com/blog/2013/time_varying_convers...

However, with an A/B test you don't need to change the math or eliminate the stability assumption from it. You just need to choose a good test duration.

As I pointed out to pauldraper in the other thread, when you start fixing bandits by only changing the split every season, suddenly bandits start to look a lot like A/B testing.


Actually, Chris, I think you misunderstand my comment.

Thompson sampling (and Bayesian Bandits in general) can be applied with a model for estimating conversion that is more complex than P(conversion|A). It can include parameters for time of day, day of week and even be non-parametric.

If you do this, the standard Thompson sampling framework, with no changes whatsoever, will still kill losers quickly (if the loss is big enough that seasonality cannot save it) and will also wait until it has enough seasonal data to make more refined decisions. This is very different from simply waiting until a full season has passed to make decisions.

You do need more data to understand a more complex world, but having an optimal inference engine to help is better than having a severely sub-optimal engine.


I understand that. The blog post I linked to describes doing exactly that.

But the point I'm making is different. This is a lot of stuff to get right and most people aren't that sophisticated. Getting A/B tests right is a lot easier, mainly because they are significantly more robust to model error.


VWO, you mean Vanguard FTSE Emerging Markets ETF?


Visual website optimizer, vwo.com. We are an A/B testing vendor and analytics tool.


Ah. Namespace collision.


This ignores what I have seen in my experience, which is that marketing teams - composed of the people who dictate what A/B tests the business should run - have little to no background in statistics, let alone any interest whatsoever in actually performing legitimate A/B tests.

It's often the case that the decision maker has already decided to move ahead with option A, but performs a minimal "fake" A/B test to put in their report as a way to justify their choice. I've seen A/B tests deployed at 10am in the morning, and taken down at 1pm with less than a dozen data points collected. The A/B test "owner" is happy to see that option A resulted in 7 conversions, with option B only having 5. Not statistically significant whatsoever, but hey let's waste developers' time and energy for two days implementing an A/B test in order to help someone else try to nab their quarterly marketing bonus.


Join us, comrade, in the fight against the statistical blight!

Move your decision process to multi-armed bandit and you never have to decide when to end an A/B test -- math does it for you, in a provably optimal way.


I'm not sure this solves it because you have to have a really strong sense of the loss function to pull it off. That's much easier to intuit and use to guide experiments than actually build into the bandit algo.


> That's much easier to intuit and use to guide experiments than actually build into the bandit algo.

IDK about your intuition, but for most other people, it gets in the way of statistics.

The "loss function" is just as easy to calculate for A/B tests as for multi-armed bandit. The value of user doing A is $X, the value of B is $Y, and the value of C is $Z.


You gave me some reading to do. :) Thank you.


But it's DATA SCIENCE. You know it's SCIENCE because they called it SCIENCE.


This is yet another comment claiming that Thompson sampling is the answer to all of our statistical problems!

Naive Thompson sampling (like the code you linked to) will result in problems equally disastrous to those I wrote about in the Qubit whitepaper. Other comments have highlighted a key problem with simple bandit algorithms - reward distributions which change over time will render their results worthless. You can model these dynamics but not in 'a few lines of very simple code'. It is verging on the irresponsible to suggest otherwise.

I personally favour a bayesian state-space model to elegantly take care of these things - but that's outside the remit of the whitepaper and outside the skill set of most non-statisticians. Frequentist testing, when done properly, is simple to implement and has statistical guarantees that are very attractive in practice.


+1

I wrote a blog post focusing on the #2 problem mentioned: multiple testing. I simulated the typical mitigating approaches and compared them against Thompson sampling (code on GitHub). https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of...

I'm not a Bayesian fanatic, but given how perfectly A/B test optimization fits in the Bayesian approach, it's a shame it's not yet the de facto standard.

I think the primary reasons are (1) it's not as intuitive (especially for the uninitiated), (2) it's harder to implement an automated feedback mechanism, (3) FUD. E.g. https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm... lists devastating complications with correct multi-armed bandit tests, and then in fine print admits that traditional tests have all the same complications.


Is it really less intuitive? I would've said less familiar. Null-hypothesis significance testing is unintuitive: nonexperts seem to explain it wrongly more often than not. (Like "the p-value is the probability the result was random chance".) Probably both approaches are unintuitive to humans unless they're explained really well.


You're right. True understanding of the null hypothesis and p-values isn't easier than the Bayesian approach. Heck, just try to figure out if you need a one-sided or two-sided test.

But it's easier to try to black box it and say this number cruncher tells me if these two things are really different. Of course, if you don't understand what's going on, you start p-hacking.

Anyway, I think you're right.


Actually, telling you whether things are different isn't the goal.

The goal is to make practical business decisions with reasonably high likelihood of being right and, if wrong, having only limited impact.

The output of a well done Bayesian analysis can be very easy for business people to understand. Statements like "you have a 60% chance of making just the right decision and a 35% chance of making a decision that is within 3% of right" are easy for most business stakeholders to understand because that is close to how they frame their own decision-making process.

Laboring to reject a null hypothesis is an unnatural act for most people.


I'd guess the majority of p-hacking is not intentionally fraudulent, yeah. The first few intros to stats I tried to read had trouble sinking in -- it felt opaque and authoritarian somehow. Easy to treat it as a cookbook until I learned of the bayesian approach.


Given that most people don't get A/B testing, it's a stretch for me to believe that someone who does would know about more complex approaches that require more skill.

I firmly believe that there's way more to be gained by more people understanding how to use A/B testing than by more complex solutions.


This is a great read, but it left me thinking I'm missing something in the fundamentals. Can you recommend a book (or other posts) on statistics fundamentals for programmers?


If no one actually knowledgeable answers, I'll say I liked what I read of Doing Bayesian Data Analysis -- early partial draft here: https://faculty.washington.edu/jmiyamot/p548/kruschkejk%20ba...

I didn't persevere because the R language rubs me the wrong way and I'd be comfier with Python or Haskell or something. http://camdavidsonpilon.github.io/Probabilistic-Programming-... looks promising in that vein, but I haven't read it.


I agree with you (and love your blog, btw), but I think you're skipping over at least a few benefits you can get out of a mature / well built a/b framework that are hard to build into a bandit approach. The biggest one I've found personally useful is days-in analysis; for example, quantifying the impact of a signup-time experiment on one-week retention. This doesn't really apply to learning ranking functions or other transactional (short-feedback loop) optimization.

That being said, building a "proper" a/b harness is really hard and will be a constant source of bugs / FUD around decision-making (don't believe me? try running an a/a experiment and see how many false positives you get). I've personally built a dead-simple bandit system when starting greenfield and would recommend the same to anyone else.


Speaking of mature, well-built A/B test frameworks, Google Analytics uses multi-armed bandit.

https://support.google.com/analytics/answer/2844870?hl=en


Probably worth mentioning that the Google Content Experiments framework is in the process of being replaced with Google Optimize (currently in a private beta), which does NOT make use of multi-armed bandits, much to my confusion and disappointment.


Huh. So do you know if they do anything to help with repeat testing/peeking?

Optimizely takes an interesting approach: they apply repeat testing methods, segmenting the tests by user views of the results. Like 30x more complicated than multi-bandit, but they don't need a feedback mechanism.


Thank you Ted for bringing sanity to this conversation. Terrific point and post.

By the way, I doubt you remember me, but thank you for inviting me on a tour of Veoh ten years ago when I was a young college sophomore. I enjoyed the opportunity as well as our brief chat about Bayesianism.


My calendar agrees with you ... we apparently had lunch in December, 2007. Sadly, I don't remember it off-hand.

On the other hand, one reason is that I invited a lot of students to come visit (and a number to intern with us).


I think the entire approach discussed in this pdf is flawed. (Edit: not saying PDF itself is flawed or wrong, just the hypothesis testing approach to A/B testing.)

The right question to ask is: What is the difference between A and B, and what is our uncertainty on that estimate?

The wrong question to ask is: Is A different/better than B, given some confidence threshold?

The reason this is the wrong question is that it's unnecessarily binary. It is a non-linear transformation of information that undervalues confidence away from the arbitrary threshold and overvalues confidence right at the arbitrary threshold.

A test with only 10 or 100 samples still gives you information. It gives you weak information, sure, but information nonetheless. If you approach the problem from a continuous perspective (asking how big the difference is), you can straightforwardly use the information. But if you approach the problem from a binary hypothesis-testing perspective (asking is there a difference), you'll be throwing away lots of weak information when it could be providing real (yet uncertain) value.

Once you switch away from the binary hypothesis-testing framework, you no longer have to worry about silly issues like stopping too early or false positives or false negatives. You simply have a distribution of probabilities over possible effect sizes.


That's putting the cart before the horse.

Before you can quantify a difference, you have to determine whether one exists in the first place. That is the purpose of binary testing; without it, you're just looking at noise without any means to decide what is real and what is not.

As a corollary, if you can meaningfully quantify the difference between A and B, then you should have no trouble establishing that they are different. Obviously business decisions are not generally going to uphold the rigor of good science, but what is the purpose of quantifying things when you're as likely to be wrong as you are to be right?


Asking whether A and B are different is not a useful question. There will be a difference 100% of the time. (Though of course sampling may not reliably detect the difference for feasible samples.)

The superior question is which is probably better, and by how much. If all you know is that A is 75% likely to be better than B, then go with A. It's useful information, even if it doesn't cross an arbitrarily preset threshold of 95% or whatever you use.

You don't need to wait for certainty to act. In fact, all actions are taken under uncertainty. So it feels incredibly artificial and counterproductive to frame these questions in such a binary, nonlinear way. It's clinging to certainty when certainty does not exist.
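As a concrete illustration of a statement like "75% likely to be better", here is one way to estimate that probability by Monte Carlo from Beta posteriors. The conversion counts and the uniform prior are hypothetical:

    import random

    conv_a, n_a = 120, 2000  # hypothetical conversions / visitors for A
    conv_b, n_b = 100, 2000  # hypothetical conversions / visitors for B

    draws = 100_000
    wins = sum(
        random.betavariate(conv_a + 1, n_a - conv_a + 1)
        > random.betavariate(conv_b + 1, n_b - conv_b + 1)
        for _ in range(draws)
    )
    print(f"P(A's true rate > B's true rate) ~ {wins / draws:.2f}")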


Actually even that isn't the best question.

The best question is what's the expected reward from choosing A.

Consider the scenario of 10% chance B > A, 90% chance they are equal. To me that sounds like a winning bet.

In contrast, 70% chance B is a lot better than A, 30% chance B is a lot worse than A sounds like a scary bet. I'll wait and gather more info.

Proper decision theory is based on a loss function which incorporates the magnitude of gains/losses rather than just their existence.

https://en.wikipedia.org/wiki/Loss_function
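A rough sketch of that decision-theoretic framing, using posterior samples to estimate the expected loss of committing to each variant. The posterior parameters are hypothetical; the loss here is the conversion rate given up in the scenarios where the other variant is actually better:

    import random

    draws = 100_000
    # Hypothetical Beta posteriors, e.g. from 120/2000 and 100/2000 conversions.
    samples = [
        (random.betavariate(121, 1881), random.betavariate(101, 1901))
        for _ in range(draws)
    ]

    # Expected loss of shipping a variant: the average shortfall, counted only
    # in the scenarios where the other variant turns out to be better.
    loss_pick_a = sum(max(b - a, 0) for a, b in samples) / draws
    loss_pick_b = sum(max(a - b, 0) for a, b in samples) / draws
    print(f"expected loss if we ship A: {loss_pick_a:.5f}")
    print(f"expected loss if we ship B: {loss_pick_b:.5f}")

You would then commit to whichever variant has an expected loss below a threshold you can live with.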


I don't know about 'flawed'. Classical hypothesis testing has its place, when used correctly. I guess the point of the article is that it falls apart dramatically when you do certain things wrong. And that these things turn out to be very common in practice.

Disclaimer: I wrote that article (a long time ago...)


If you're running an A/B test, by which I mean precisely that you already have two versions available and are simply trying to decide which is most likely to give better results in the future by comparing how successfully they achieve the desired result with all else being equal, then I'm still waiting for someone to explain to me why the correct question is not simply "Which performs better, A or B?"

The statistical rigour of all the hypothesis tests and Bayesian methods and so on is laudable, but fundamentally any result and any conclusion you draw from those tests can only ever be as good as the underlying assumptions from which the result was derived. For example, if you choose to perform any statistical test that designates one of A and B the default case and tests for evidence in favour of the other, you have made a determination that the situation is not symmetric. You can choose a level of power or a significance level you want to achieve, but the 5% convention is again arbitrary. The same goes for any prior you choose in a Bayesian method.

Fundamentally, if you are truly starting from a neutral situation where you just want to know which of A and B is better, and you truly have no other information to go on other than how they have performed since the start of your measurements with all else being equal, then it remains the case that you still have nothing else to go on no matter what statistical calculations you perform.


> The wrong question to ask is: Is A better than B, given some confidence threshold?

That's a perfectly fine question so long as there are three acceptable answers: Yes, No and Unclear.


Suppose A is significantly better than the control. And B is not significantly better than the control. Does that imply A is significantly better than B? No. Statistical differences do not follow the same logic as normal differences.

It feels so pointless to bin probabilities into yes, no, and unclear. We're throwing away information with value when we compress the problem this way. And I think it leads to more misinterpretations.


In A/B testing, either A or B is the control.


> A test with only 10 or 100 samples still gives you information. It gives you weak information, sure, but information nonetheless.

If your market consists of 1000 people.


Could you elaborate? Even if your market consists of infinite people, it gives you real information.


Maybe I'm misunderstanding but isn't the whole point of A/B testing to answer the question "is A better than B?"


This is very well explained, even if you don't understand statistics. Apparently not many vendors of A/B testing software do.


I suspect the people building it do, the people selling it probably do not.


Having worked between customers and engineering at a place that did this, I can confirm that "our tests are Bayesian!!" was a refrain everyone was taught to repeat, but few if any were taught what it meant.

This video is helpful: https://www.youtube.com/watch?v=Dy_LRK2Pkig

I still don't know a great way to describe it in the 6 or 7 seconds you have before the potential customer's attention starts to flag.


> I still don't know a great way to describe it in the 6 or 7 seconds you have before the potential customer's attention starts to flag.

I don't think there is a 6-7 second way to describe it. I mean, imagine someone that knows nothing of frequentist statistics, could you explain the meaning/significance of a confidence interval to them? It's a pretty high bar to clear, I've found for CI's it's typically 1-5 minutes depending on background.


For most businesses, focusing on the math and making incremental improvements to the statistical methodology is a waste of time. There is a "good enough" approach to using tools like Optimizely and VWO.

Instead, they should be focusing on the quality of the tests they run. Quality hypotheses, preferably backed by data-driven insights into behavior that test very clear, if not dramatic, changes to the user flow will be what leads to bottom-line improvements, not increasing the mathematical merit and rigor of poorly-conceived tests. Of course, doing both is ideal but I put more fault on the testing software than the companies using it.

Keep track of your historical conversion rate and adjust/account for noise in it. Conceive a testing program that focuses on quality testing and you'll likely see an upward trend in that historical conversion rate.


Hmmh, this is interesting. Most A/B software will let you set a level of statistical confidence that needs to be attained before a winner can be declared. For example in Google Analytics two common ones are 95% and 99%. We stop our tests when they reach at least 95% confidence. Is the author saying one must wait for 6000 events even if the difference between A/B is large? The larger the relative difference, the fewer events needed.


Stopping when you hit 95% confidence is a classic failure mode. Yes, if you are doing classic t-test based A/B testing, you have to wait until a pre-determined sample size is reached; otherwise, effectively, by looking at the p-value and stopping when it hits 95% confidence, what you are doing is ignoring all negative results and accepting the first positive result you see -- clearly, that's bad science. I'm simplifying the math here, but this is the general notion.

You can see a demonstration of this in practice: http://www.gigamonkeys.com/interruptus/ (sit back and watch your false positive rate, aka "bogus a/b testing results", grow).
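In the same spirit as that demonstration, here is a rough simulation of the "stop as soon as the z-test crosses 95%" rule applied to an A/A test. All parameters are my own hypothetical choices:

    import random

    def peeking_trial(n=6000, rate=0.05, check_every=100, z_crit=1.96):
        # Both arms have the identical true conversion rate, so any declared
        # winner is a false positive.
        a_conv = b_conv = 0
        for i in range(1, n + 1):
            a_conv += random.random() < rate
            b_conv += random.random() < rate
            if i % check_every == 0:
                pooled = (a_conv + b_conv) / (2 * i)
                se = (2 * pooled * (1 - pooled) / i) ** 0.5
                if se > 0 and abs(a_conv - b_conv) / i / se > z_crit:
                    return True  # stopped early and declared a bogus winner
        return False

    trials = 200
    fp = sum(peeking_trial() for _ in range(trials))
    print(f"{fp / trials:.0%} of A/A tests produced a 'winner' (nominal rate: 5%)")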


Is it really a failure mode? I think it should be fine to stop whenever you want during a test. You just have to be smart enough to not use a t-test and misinterpret it to mean that A is 95% likelier than B.


Yes. It really is a failure mode. No, it is not fine to stop whenever you want during a test. It doesn't matter what statistical test you are using -- you need to gather enough data to have sufficient statistical power.

Also, the more tests you do, the lower the p-value threshold should be.

If you stop when you reach your p-value then you are misinterpreting the results if you claim the results mean anything.


This seems terribly dogmatic to me. Take the example of clinical trials. Suppose you're testing a new cancer drug. You design an experiment to test the new drug, named B, versus an established chemotherapy treatment named A. You expect B's performance to be similar to A's performance in controlling the cancer, so to make sure your trial has high power, you plan to test the drugs on 2,000 patients (with each drug administered to 1,000).

Now consider the following two scenarios:

(1) After giving drug B to 100 patients, all 100 patients are dead. Do you continue the trial, giving the (apparently) deadly drug B to 900 more patients?

(2) After giving drug B to 100 patients, all 100 patients are totally cured (vs A curing 3 in 100). Do you continue the trial, withholding the (apparent) cure for cancer from 900 more patients?

In either case, since you have a strong effect, it seems to me there is logical justification to end the trial early.

Obviously the stakes are higher in clinical trials than website design, but in both cases, data acquisition has costs and intermediate results may inform changes to your experiment design.

I honestly cannot see how anyone could blankly assert that stopping a test is always wrong. There are certainly circumstances where you do want to stop early. You just have to make sure you aren't misinterpreting a statistic when you do so.


You can arrange a trial (either a clinical one or an A/B test) so that it stops early. However, the analysis plan needs to take that into account: you absolutely cannot just peek at the p-values willy-nilly and use that information to make go/no-go decisions; doing this makes them uninterpretable.

One of the simplest ways to end a trial early is through curtailment--you stop the trial when additional data won't change its outcome. Imagine you have a big box of 100 items, which can either be Item A or Item B. You want to test whether the box contains an equal proportion of As and Bs, so you pull an item from the box, unwrap it, and record what is inside. Naively, one might think that it is necessary to unwrap all 100 items, but you can actually stop after you find 58 of the same type because the 58:42 split--and all more extreme imbalances--allows you to reject the 50:50 hypothesis.

Curtailment is "exact" and fairly easy to compute if you have a finite sample size, each of which contributes a bounded amount to your result. This would certainly happen in your extreme examples. There are also more complicated approaches that allow you to stop even earlier if either a) you're willing to sometimes make a different decision than if you ran to completion and/or b) you're willing to stop earlier on average, even if it means running for longer in some cases.


Obviously it depends on how strong the effect is. The point is you identify the effect and the power. If you get to 1600 people and you are seeing a > 10% effect then sure -- you can stop. As long as you have sufficient power. The point is you must know what the statistical power is, and know where your break points are. You absolutely can not stop just at seeing a 10% effect -- which could happen if you happen to get one in the first 10 samples. That is not dogmatic, it is good statistics.

Take your example. 100 patients, all dead or 100 patients all alive -- you have demonstrated an infinite effect (edit: ok no need to be hyperbolic, 99+% effect), and probably have covered statistical rigor. If your drug is that effective you are probably criminally liable a lot sooner than 100 dead patients. Unfortunately actual medical (and A/B) studies do not mimic make believe scenarios.


Totally agree. Stopping at 10% and then claiming the effect size is 10% would be silly. But seeing a giant effect and stopping is totally cool in my book. The bigger the effect difference, the fewer samples you need to judge it. So I think it can be fine to peek and halt. Nothing forces us to use a static number of samples other than an old statistics formula.


The point is, you don't know what that number is unless you do the math. It's not a matter of "judging it". It is a matter of calculating it.

If you "peek and halt" without doing the math. You might as well have a random good result in the first 10 and say "look! Positive results!". You recognize that is ridiculous. So when is peeking and stopping not ridiculous?

A: when the statistical power is sufficient for the observed effect. In the examples -- 1600 or 6000 for a 10% or 5% effect, respectively. And much less for a 20% or 40% effect! -- but you don't know the number required unless you do the math.


Again, this limitation doesn't apply to good bandit approaches. You see big effects quickly and smaller effects more slowly and don't need to do any pre-computation about power at all.

You can even get an economic estimate of the expected amount of value you are leaving on the table by stopping at any point.


"Wrong" isn't the word we're looking for here, I don't think. But your above example is bullshit -- nobody puts 1000 patients at risk in a phase I (safety) trial, and if the dose isn't reasonably well calibrated by the phase III study you're describing above, someone's going to jail. In Phase II we will often have stopping rules for exactly this reason, just in case the sampling was biased in the small Phase I sample.

Above there are a number of things to notice:

1) The phasing approximates Thompson sampling to a degree, in that large late-phase trials MUST follow smaller early phase trials. Nobody is going to waste patients on SuperMab (look it up).

2) The endpoints are hard, fast, and pre-specified:

IFF we have N adverse events in M patients, we shut down the trial for toxicity.

IFF we have X or more complete responses in Y patients, we shut down the trial because it would be unethical to deprive the control arm.

IFF we have Z or fewer responses in the treatment arm, given our ultimate accrual goal (total sample size), it will be impossible to conclude (using the test we have selected and preregistered) that the new drug isn't WORSE than the standard, so we'll shut it down for futility. Those patients will be better served by another trial.

You are massively oversimplifying a well-understood problem. Decision theory is a thing, and it's been a thing for 100 years. Instead of lighting your strawman on fire, how about reframing it?

Stopping isn't "always" wrong, but stopping because you've managed to hit some extremal value is pretty much always biased. The "Winner's curse", regression to the mean, all of these things happen because people forget about sampling variability. It's also why point estimates (even test statistics) rather than posterior distributions are misleading. If you're going to stop at an uncertain time or for unspecified reasons, you need to include the "slop" in your estimates.

"We estimate that the new page is 2x (95% CI, 1.0001x-10x) more likely to result in a conversion"... hey, you stopped early and at least you're being honest about it... but if we leave out the uncertainty then it's just misleading.

All of the above is taken into account when designing trials because not only do we not like killing people, we don't like going to jail for stupid avoidable mistakes.


> But your above example is bullshit — nobody puts 1000 patients at risk in a phase I (safety) trial

It isn't bullshit - at least, not for that reason. It was a thought experiment, using an extreme example to test your assertions.


It is an example that tends to bring in lots of irrelevant detail and ethics that distracts from the actual point


My point is that you can still extract useful information when your stop is dynamic rather than static. One typical scenario is when your effect size ends up being larger than you originally guessed. There's little reason to continue if the difference becomes obvious.

In the future, I would appreciate it if you steelmanned my comments or asked for clarification instead of insulting me. It hurt my feelings. I wish I had written a better comment that hadn't incited such a reaction from you. Best wishes.


You are right, I shot from the hip. Sorry about that.

I also have noprocrast set in my profile so I couldn't go back and edit it (something I thought about doing). I probably would have toned it down if I hadn't requested that Hacker News kick me off after 15 minutes.

Your line of discussion is productive. It's just important that people understand the difference between degrees of belief and degrees of evidence from a specific study and never confuse the two. Trouble is, lots of folks confuse them, and lots of other folks prey on that confusion.

Anyways, sorry for being a jerk.


No worries and thanks for the apology. I apologize for my own comments in this thread, which were lower than the quality I aspire to. I had pulled an all-nighter for work and was sitting grumpily with my phone at an airport.


Your intuition about effects bigger than you expected is just on target.

But it applies at all scales of effect. Stop when you have a big enough effect or have high confidence that you won't care.


P-values are just the wrong metric here. With good sequential testing, you would stop

a) when the challenger strategy is much worse than the best alternative with reasonable probability,

or

b) when it is clear that some strategy is very close to the best with high probability (remember, you may still not know how good the best is)

Note that waiting for significance kind of handles the first case, but in the case of ties, it completely fails on the second case. That is, when there are nearly tied options, you will wait nearly forever. It is better to pick one of the tied options and say that it is very likely as good as the best.


I mean.

(1) would absolutely not be continued, obviously.

(2) would be continued to completion.

In science one is maximally conservative with positive results. Negative results can be accepted rather quickly but positive results need to be maximally investigated. False positives are unacceptable, false negatives are often less damaging.

I'd question how scenario #1 made it through animal trials preceding experiments in humans.


Actually, trials are sometimes stopped early for "efficacy." This is done on the grounds that--past some point--incremental gains in statistical certainty are not worth depriving the subjects in the control arms of the drugs' benefits. For example, this happened in the PARADIGM-HF trial in 2014: http://www.forbes.com/sites/larryhusten/2014/03/31/novartis-...

This isn't totally uncontroversial (e.g., https://www.ncbi.nlm.nih.gov/pubmed/18226746 ) but it's not obvious to me how to balance statistical and ethical concerns.

As for the animal studies, it sometimes happens. Someone up-thread mentioned SuperMAB, a potential drug which seemed safe in monkeys. However, when first tested in humans (using a dose 500x lower than the one administered to the monkeys), it generated a cytokine storm that put all six test subjects in the ICU. Here's a fairly decent summary: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2964774/


Using Bayesian methods and some relatively rarely used Frequentist methods you can achieve what you suggest. Maybe you'll take it from a statistician that you're being grossly naive about experimentation.


You are advocating for something called sequential testing. I'm not going to go into the details, but you don't use the same metrics with sequential testing if you want to control your final error rates. That is, you can't look at the p-value on your t-test and say that's your false positive rate for the sequential test. It's important for reasons you describe, but not exactly idiot proof.


Math does tend to be a bit dogmatic, yes.


If you decide to stop when a test says the error rate (FPR) is 5%, your error rate will be higher than 5%. If you don't want to call that a failure mode, it's at least a misuse of the metric.


Yes, stopping at 95% is BAD. First of all, you need to reach a large enough sample size, otherwise you are just lying to yourself.

Source: I have a degree in analytics.


The problem is that people often repeatedly check the significance, despite the fact that this test only guarantees a certain false positive rate if you use it once.

If you're planning to stop as soon as you find a positive result you'll need to modify the tests to ensure that the total chance that any test results in a positive is low enough. In general you'll need to keep tightening the significance threshold as you accumulate more events (if you only plan to test a finite number of times you can keep the threshold fixed, but I think this will lead to more false negatives than an 'adaptive' threshold).

To illustrate, if you have 6000 events and check for 99% significance after every one, then you'd expect about 60 false positives on average. Of course these false positives aren't distributed uniformly, so it's not like you'll always find 60 false positives, however it's not like you'll only find 0 or 6000 either, meaning that (significantly) more than 1% of the time you'll have at least 1 false positive.


Yep, this is precisely the multiple testing point covered.


Checking and stopping is only a problem if you use the inappropriate formula. It's not generally a problem as far as I'm aware. What do you think?


You can probably come up with a proper formula, but you do actually need to put some effort into it to get it right.

Of course there's the generic Bonferroni correction where you just divide the significance threshold by the number of tests, but you'll end up with a lot more false negatives (where you just miss the result entirely), and if you want to keep running the test for arbitrary lengths of time it becomes even more difficult (you'll need to keep lowering the threshold). Then again this does give strong guarantees and might work well enough if you don't check the results often.

Basically to get the best results you'll need to balance the false negative rate against the false positive rate and the expected length of the test, which is a rather complicated trade off. But I expect someone will have done at least some of the calculations for simple A/B testing.
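For a sense of what the generic correction looks like in practice, a tiny sketch of the per-look threshold under Bonferroni (numbers are hypothetical, and as noted above this is conservative):

    from statistics import NormalDist

    alpha, looks = 0.05, 20              # overall false positive budget, planned peeks
    per_look_alpha = alpha / looks       # Bonferroni: 0.0025 per look
    z_per_look = NormalDist().inv_cdf(1 - per_look_alpha / 2)   # two-sided threshold
    print(per_look_alpha, round(z_per_look, 2))  # roughly 3.02 instead of the usual 1.96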


It is a problem. You are "spending" some of your error "budget" whenever you peek prior to an endpoint. This is why if you're going to do interim analyses in clinical trials, you have to include it in the design.

People got tired of trialists playing games with statistics and patients dying or ineffective drugs making it to market. Now if you want to run a trial as evidence for approval, you need to specify an endpoint, how it will be tested, when, with what alpha (false positive) threshold, and what's the minimum effect size required for this.

If you are doing interim analysis, futility, or non-inferiority, you have to write that into the design, too.

People can jerk around with subgroup analyses in publications but the FDA won't accept that sort of horse shit for actual approval. And thank heavens for that.


Just to expand on this, David Robinson quantified this a while back: http://varianceexplained.org/r/bayesian-ab-testing/

He ran some simulated A/B tests using my Bayesian A/B testing technique (which is now powering VWO). He showed that while peeking does ensure that loss < specified threshold, if you don't peek the loss will be even lower.

So although peeking is still valid, that validity does come at a cost.


In a world with rational actors and free computation, there shouldn't ever be a penalty for having more information about reality. Therefore, the only reason not to peek is that actors are irrational and/or computation is expensive.

Honestly, if the first 100 patients die into a 1,000-patient clinical trial, I have zero qualms about making the judgment to stop early, even if it wasn't written into the design. I'm not going to kill 900 people by religiously following bad statistical principles.

I think we should be open-minded and understand that sometimes peeking is ok and sometimes it isn't.

When the effect is large, you can end earlier. There's no reason to cling to a formula and procedure that requires a fixed number of samples when other methods exist that lack that drawback.


These comments generally read to me along the lines of:

"It's totally reasonable to roll your own cryptography, as long as you're an expert in the field and do it correctly".

The rules of "Don't Stop And Peek" is general advice that is given out because the vast majority of people that conduct these kinds of trials are not statisticians, and will not do the math. Those that attempt the math will likely do the math wrong.

So, you're totally correct that if you know exactly what you're doing as a statistician and apply the math correctly, it can be possible to peek at results while still drawing a conclusion. This ignores the fact that, for most humans who are receiving the advice, this is terrible advice.

So, general rule: "Don't Stop and Peek". Advanced rule: "Unless you are a trained statistician, Don't Stop and Peek".


There's no problem with aborting a test early, for whatever reason. However that doesn't mean you can still draw conclusions from such a test. If you plan to do a trial with 1,000 patients and you stop midway because you've reached statistical significance you run a big risk of claiming a treatment works when it doesn't.

Similarly, every test you do has a small probability of giving a false positive; the more tests you do, the bigger the total chance that you'll be jumping to conclusions.

Also, the size of the effect is irrelevant since that should already be accounted for by whatever test you do.


Any Bayesian analysis should still be valid, despite stopping conditions. You can still draw conclusions from an aborted test. You just have to use valid formulas, and not formulas that assume incorrect counterfactual scenarios. I think it's dogmatic to say stopping is bad and you can't do analysis. In my mind, you totally can. It just needs to be the appropriate analysis.

Anyway, I don't even think false positives are the right way to think about this. The framing should be continuous, not binary. The goal is to maximize success, not maximize the odds of picking the better of A or B.


Your last sentence is telling.

A Bayesian analysis tells you Bayesian things: specifically, this is the most reasonable conclusion one can draw from this data, right now. A frequentist analysis also tells you things, but they are different things. Specifically, frequentists are concerned with...frequencies: if we were to run this procedure many times, here's what we can say about the possible outcomes.

There's this persistent meme that all Bayesian methods protect you from multiple comparisons/multiple looks problems. That's not really true. Bayesian methods don't offer any Type I/Type II error control--and why would they, when the notion of these errors doesn't make much sense in a Bayesian framework?

You can certainly use Bayesian methods to estimate a parameter's value. However, you cannot repeatedly test your estimates--Bayesian or otherwise--in an NHST-like framework and expect things to work out correctly.


From the article:

"You can perform power calculations by using an online calculator or a statistical package like R. If time is short, a simple rule of thumb is to use 6000 conversion events in each group if you want to detect a 5% uplift. Use 1600 if you only want to detect uplifts of over 10%. These numbers will give you around 80% power to detect a true effect."

These numbers assume approximately 10% of changes give true uplift (aka actual improvement). If the incidence of improvement is rarer you would need more than 6000 events to identify a 5% increase.

(OT: that term sounds like cult language -- you have three chances to achieve true uplift)
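For a rough sense of where figures like 6000 and 1600 come from, the standard normal-approximation sample-size formula for comparing two small conversion rates gives numbers in the same ballpark. The whitepaper's exact assumptions may differ; a two-sided 5% alpha and 80% power are assumed here:

    from statistics import NormalDist

    def conversions_per_arm(relative_uplift, alpha=0.05, power=0.80):
        z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for a two-sided 5% test
        z_b = NormalDist().inv_cdf(power)          # 0.84 for 80% power
        # Approximate conversions needed per arm when the baseline rate is small.
        return 2 * (z_a + z_b) ** 2 / relative_uplift ** 2

    print(round(conversions_per_arm(0.05)))  # ~6280, close to the quoted 6000
    print(round(conversions_per_arm(0.10)))  # ~1570, close to the quoted 1600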


You can take an adaptive approach, but almost universally it is done wrong. https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of...

FYI, the difference with Google Analytics and most other A/B platforms is that GA uses multi-armed bandit. It adjusts the proportions according to the results. This is nice because it effectively ends the test for you, and you don't have to stress about suboptimal success rates just for the purpose of experimentation. https://support.google.com/analytics/answer/2844870


> Is the author saying one must wait for 6000 events even if the difference between A/B is large?

Yes.

> The larger the relative difference, the fewer events needed.

Wrong. That's the common mistake that is being addressed.

The relative difference could be anything at any point in time. You always need a lot of events to be somewhat confident that there is really a difference.

Short version: if you want X% confidence on a test, there is a maths formula that says you need to monitor at least N events. The formula doesn't depend on how large the difference is (which you don't know anyway).


It's absolutely true that if the difference is larger you need fewer samples to reach statistical significance. This is why small effects require studies with large sample sizes.


Well, in this case the distribution is (assumed to be) binomial, which does allow you to assign a higher significance for cases where there is a strong effect.


Nice article. One question though:

> Perform a second ‘validation’ test repeating your original test to check that the effect is real

Isn't this just the same as taking a larger sample size?


Almost.

The difference is time, so the populations involved are 'the people that used the site last month' and 'the people that used the site last week'. Usually your assumption is that these are comparable, but that's not necessarily true. Furthermore, the effect is only meaningful if that assumption is true, for most business cases (i.e. you want this effect to hold in the future).

In practice, a lot of effects do disappear when you repeat a test because there was some unaccounted for factor that varied between them. It's a good sanity check.

But you're right, much of the purpose is to discover mean regression. This is something that happens more often than you'd expect because you tend to be focusing on large effects, many of which will simply be due to chance.


So basically, it means waiting a while before performing the test again? Since "larger sample size" mostly just means "keep the test running for longer", so the difference is that in that case, the "second" test is very close in time to the "first" test?


Not asking me, but yes. I think the underlying issue is whether you assume the process that generated the data yesterday is similar enough to the process that generated data today. In truth, all data points are unique and all data is high dimensional. It only becomes workable in low dimensions once we make assumptions like "the process that produced this point yesterday is the same as the process that produced this second day's point today." Separating the tests makes it slightly less difficult to test for violations of these assumptions (e.g., model drift). But otherwise it seems silly and arbitrary to commit yourself to two medium tests instead of one big test. At least two medium tests let you filter out likely failures if you allow yourself to quit midway. (This whole business of tests and halting is silly anyhow.)


Right, thanks. And am I correct in thinking that your final sentence before the parentheses means that two medium tests are worse, because you don't want to filter out likely failures?


Not entirely. Even if run right after one another, running two tests buckets the results into early/late, which simply using a larger sample set does not.

Assume your first samples show the effect and the later ones don't. The first test will show a strong correlation and the second not. The larger test will simply show a weaker correlation.


So what would the conclusions be in the two-test and large-test case? My assumption would be that if one of the two tests would fail, then the correlation in a large test would also have been so low as to mean no effect?


The issue is that you probably wouldn't do two tests if you'd planned a short one. So you'd randomly get an overly strong, or overly weak, result.

The actual effect depends on your data and how noisy it is, and if that correlates to external factors, etc.

You need to know the characteristics of your data then you can decide on your desired confidence and only then pick a test designed to give you that based on your data.


I don't get this: "But remember: if you limit yourself to detecting uplifts of over 10% you will also miss negative effects of less than 10%. Can you afford to reduce your conversion rate by 5% without realizing it?"

You'd potentially lose 5% CR if you ship the variant even when it doesn't show a detectable uplift. Why would you do that?


Any other statisticians want to champion this? https://news.ycombinator.com/item?id=13434410


Pity they didn't add a date to the paper!



