Thompson sampling is a great tool. I've used it to make reasonably large amounts of money. But it does not solve the same problem as A/B testing.
Thompson Sampling (at least the standard approach) assumes that conversion rates do not change. In reality they vary significantly over a week, and this fundamentally breaks bandit algorithms.
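For reference, here is a minimal sketch of what "the standard approach" usually means: Beta-Bernoulli Thompson sampling with a uniform prior. The class and function names are hypothetical, and the simulation assumes exactly the thing under dispute, namely that each variant has one fixed conversion rate.

```python
import random


class BetaBernoulliThompson:
    """Thompson sampling that assumes one fixed conversion rate per variant."""

    def __init__(self, n_variants, seed=None):
        self.successes = [0] * n_variants
        self.failures = [0] * n_variants
        self.rng = random.Random(seed)

    def choose(self):
        # Draw one plausible conversion rate per variant from its Beta posterior
        # (uniform Beta(1, 1) prior) and show the variant with the highest draw.
        draws = [
            self.rng.betavariate(1 + s, 1 + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, variant, converted):
        # All history is pooled into two counters, so old and new data
        # are weighted equally. This is the stationarity assumption.
        if converted:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1


def simulate(true_rates, n_visitors, seed=0):
    """Run the bandit against fixed (stationary) rates; return visitors per variant."""
    rng = random.Random(seed)
    bandit = BetaBernoulliThompson(len(true_rates), seed=seed + 1)
    shown = [0] * len(true_rates)
    for _ in range(n_visitors):
        variant = bandit.choose()
        shown[variant] += 1
        bandit.update(variant, rng.random() < true_rates[variant])
    return shown
```

Because `update` never forgets or reweights anything, a rate that moves during the week is silently averaged into the same posterior.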
Furthermore, you do not need to use Thompson Sampling to have a proper Bayesian approach. At VWO we also use a proper Bayesian approach, but we use A/B testing to avoid the various pitfalls that Thompson Sampling has. Google Optimize uses an approach very similar to ours (although it may be flawed [1]), and so does A/B Tasty (probably not flawed).
Note: I'm the Director of Data Science at VWO. Obviously I'm biased, etc. However my post critiquing bandits was published before I took on this role. It was a followup to a previous post of mine which led people to accidentally misuse bandits: https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs...
> Depending on what your website is selling, people will have a different propensity to purchase on Saturday than they have on Tuesday.
Affects multi-armed bandit and fixed tests. If you run a fixed A/B test only on Tuesday, your results will also be wrong. Either way, you have to decide what kind of seasonality your data has, and not make any adjustments until the period is complete.
If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.
> Delayed response is a big problem when A/B testing the response to an email campaign.
Affects multi-armed bandit and fixed tests. If you include immature data in your significance test, your results will be wrong. Either way, you have to decide how long to wait before declaring an individual success or failure.
> You don't get samples for free by counting visits instead of users
Affects multi-armed bandit and fixed tests. Focusing on relevant data increases the power of your experiment.
---
For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-armed bandit tests are harder because of these design decisions, despite the fact that they affect any experiment process.
> If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.
It can, but the time it takes to adapt grows like exp(# of samples already seen).
You can improve this by using a non-stationary Bayesian model (i.e. one that assumes conversion rates change over time) but this usually involves solving PDEs or something equally difficult.
> For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-armed bandit tests are harder because of these design decisions, despite the fact that they affect any experiment process.
The point the author (me) is trying to make is not that bandits are fundamentally flawed. The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.
For bandits, the fixes are not nearly as simple. They usually involve non-trivial math, or at the very least non-intuitive steps (for instance, not actually running the bandit until 1 week has passed).
At VWO we realized that most of our customers are not sophisticated enough to get all this stuff right, which is why we didn't switch to bandits.
So what I'm proposing to do is run A/B with a 50/50 split for a full week, then when B wins shift to 0/100 in favor of B.
You seem to be proposing to run A/B with a 50/50 split for a full week, then when B does a lot better shift to 10/90 in favor of B and maybe a few weeks later shift to 1/99.
What practical benefit do you see to this approach? From my perspective this just slows down the experimental process and keeps losing variations (and associated code complexity) around for a lot longer.
First, Google Analytics (for example) runs content experiments for a minimum of two weeks regardless of results. It's hardly an unrealistic timeframe for reliable conclusions.
> What practical benefit do you see to this approach?
Statistically rigorous results, with minimal regret.
In your example, you reach the end of the week, and your 50/50 split has one-sided p=0.10, double the usual p<0.05 criterion. What do you do?
(a) Call it in favor of B, despite being uncertain about the outcome.
(b) Keep running the test. This compromises the statistical rigor of your test.
(c) Keep running the test, but use sequential hypothesis testing, e.g. http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigo... This significantly increases the time to reach a conclusion, and costs you conversions in the meantime.
The essential difference when choosing the approach is that a 50/50 split optimizes for shortest time to conclusion, and a multi-armed bandit optimizes for fewest failures.
There are times when the former is more important, e.g. marketing wants to know how to brand a product that is being released next month. These are the clinical-like experiments that frequentist approaches were formulated for.
> Statistically rigorous results, with minimal regret.
The results are only statistically rigorous provided your bandit obeys relatively strong assumptions.
As another example, suppose you ran a 2-week test. Suppose that from week 1 to week 2, both conversion rates changed, but the delta between them remained roughly the same. A 50/50 A/B split doesn't mind this, and in fact still returns the right answer. Bandits do mind.
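A small worked example may make this concrete. The numbers below are hypothetical, and the arithmetic uses expected conversions (no sampling noise) so the effect is not an artifact of randomness. B beats A by 0.02 in both weeks, both rates double in week 2, and the "bandit-like" allocation is assumed to have drifted toward A after week 1 (say, because A got lucky early).

```python
def pooled_rate(visitors_per_week, rate_per_week):
    """Overall conversion rate when expected conversions are pooled across weeks."""
    conversions = sum(n * r for n, r in zip(visitors_per_week, rate_per_week))
    return conversions / sum(visitors_per_week)


# Hypothetical true rates per week: both variants shift, the delta stays at 0.02.
rate_a = [0.10, 0.20]
rate_b = [0.12, 0.22]

# Fixed 50/50 split: each variant sees the same traffic mix in both weeks.
fixed_a = pooled_rate([1000, 1000], rate_a)  # 300 / 2000 = 0.15
fixed_b = pooled_rate([1000, 1000], rate_b)  # 340 / 2000 = 0.17

# Assumed adaptive allocation: 50/50 in week 1, then 90/10 in favor of A in week 2.
skew_a = pooled_rate([1000, 1800], rate_a)   # 460 / 2800, about 0.164
skew_b = pooled_rate([1000, 200], rate_b)    # 164 / 1200, about 0.137

print(f"50/50 split: A={fixed_a:.3f}  B={fixed_b:.3f}")
print(f"skewed     : A={skew_a:.3f}  B={skew_b:.3f}")
```

With the even split the pooled estimates recover the 0.02 delta. With the skewed allocation, A's data is dominated by the high-converting week, so the pooled numbers rank A ahead of B even though B is better in every week. A standard Beta-Bernoulli bandit pools its counts in exactly this naive way.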
I don't do p-values. I do Bayesian testing, same as you. I just recognize that in the real world, weaker assumptions are more robust to experimenter or model error, both of which are generally the dominant error mode.
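To illustrate what Bayesian testing on a fixed split can look like, here is a generic textbook sketch, not VWO's actual method: put a uniform Beta(1, 1) prior on each variant's conversion rate and estimate the probability that B beats A by Monte Carlo. The function name and the example counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(1 + conversions, 1 + non-conversions).
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
    return wins / draws


# Hypothetical counts collected over whole weeks of a 50/50 split.
print(prob_b_beats_a(conv_a=100, n_a=1000, conv_b=130, n_b=1000))
```

The only modelling assumption beyond the prior is that each variant's visitors are exchangeable over the test window, which is why running for whole weeks matters more than the choice of inference method.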
> In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how clever your scheme.
This is simply not true. The Gittins Index beats Thompson sampling, subject again to the same strong assumptions.
Look, I know the theoretical advantages of bandits and I advocate their use under some limited circumstances. I just find the stronger assumptions they require (or alternately the much heavier math requirements) mean they aren't a great replacement for A/B tests which are much simpler and easier to get right.
Thompson sampling does not need to assume stability. You can inject time features into the model if you want to model seasonality (or, more accurately, ignorance of seasonality) and you can also have a hidden random-walk variable.
Yes, if you assume stability and things vary, you will not have good results. That seems like any statistics.
However, with an A/B test you don't need to change the math or eliminate the stability assumption from it. You just need to choose a good test duration.
As I pointed out to pauldraper in the other thread, when you start fixing bandits by only changing the split every season, suddenly bandits start to look a lot like A/B testing.
Actually, Chris, I think you misunderstand my comment.
Thompson sampling (and Bayesian Bandits in general) can be applied with a model for estimating conversion that is more complex than P(conversion|A). It can include parameters for time of day, day of week and even be non-parametric.
If you do this, the standard Thompson sampling framework, with no changes whatsoever, will still kill losers quickly (if the loss is big enough that seasonality cannot save it) and will also wait until it has enough data seasonally to make more refined decisions. This is very different from simply waiting for an even season point to make decisions.
You do need more data to understand a more complex world, but having an optimal inference engine to help is better than having a severely sub-optimal engine.
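As a rough sketch of the idea in this comment, the simplest possible version models P(conversion | variant, weekday) with an independent Beta posterior per (variant, weekday) pair, and leaves the Thompson sampling step itself unchanged. This is an assumption made for illustration: the class name is hypothetical, and a model that shares information across days (or adds the hidden random-walk variable mentioned earlier) would be closer to what is described but needs heavier machinery.

```python
import random
from collections import defaultdict


class WeekdayThompson:
    """Thompson sampling over independent Beta posteriors per (variant, weekday)."""

    def __init__(self, n_variants, seed=None):
        self.n_variants = n_variants
        self.rng = random.Random(seed)
        # counts[(variant, weekday)] = [conversions, non_conversions]
        self.counts = defaultdict(lambda: [0, 0])

    def choose(self, weekday):
        # Same sampling step as the standard algorithm, but conditioned on
        # the weekday of the current visit (0 = Monday ... 6 = Sunday).
        draws = []
        for variant in range(self.n_variants):
            conversions, misses = self.counts[(variant, weekday)]
            draws.append(self.rng.betavariate(1 + conversions, 1 + misses))
        return max(range(self.n_variants), key=draws.__getitem__)

    def update(self, variant, weekday, converted):
        self.counts[(variant, weekday)][0 if converted else 1] += 1
```

The cost is visible in the data structure: each weekday learns only from its own visits, so this version needs roughly seven times the traffic to reach the same certainty.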
I understand that. The blog post I linked to describes doing exactly that.
But the point I'm making is different. This is a lot of stuff to get right and most people aren't that sophisticated. Getting A/B tests right is a lot easier, mainly because they are significantly more robust to model error.
https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...
https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technic...
[1] The head of data science at A/B Tasty suggests Google Optimize counts sessions rather than visitors, which would break the IID assumption. https://www.abtasty.com/uk/blog/data-scientist-hubert-google...