Wow, you convinced me... Sarcasm aside, I've also experienced all of these issue...

pbreit · on May 30, 2012

Sorry, was on an iPad.

Most or all of the points suffer from:

* is that actually true?
* does regular A/B testing not also face that issue?
* was it suggested that you must "set it and forget it"?
* are there no mechanisms for mitigating these issues?
* would using 20% or 30% mitigate the issues?
* are you not allowed to follow the data closely with the bandit approach?

The whole list struck me as a supposed expert in the status quo pooh-poohing an easier approach.

btilly · on May 30, 2012

Most or all of the points suffer from:

Let's address them one by one.

is that actually true?

In every case, yes.

does regular a/b testing not also face that issue?

For the big ones, regular A/B testing does not face that issue. For the more complicated ones, A/B testing does face that issue and I know how to work around it. With a bandit approach I'm not sure I'd have noticed the issue.

was it suggested that you must "set it and forget it"?

Not "must", but it was highly recommended. See paragraph 4 of http://stevehanov.ca/blog/index.php?id=132 - look for the words in bold.

are there no mechanisms for mitigating these issues?

There are mechanisms for mitigating some of these issues. The blog does not address them. As soon as you go into them, things get more complicated, and it stops being the "20 lines that always beats A/B testing" that the blog promised.

I was doing some back-of-the-envelope calculations on different methods of mitigating these problems. What I found was that in the best case you turn into

would using 20% or 30% mitigate the issues?

That would lessen the issue that I gave, at the cost of permanently worse performance.

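The permanent performance penalty can benefit from an example. Suppose that there is a real 5% improvement. The blog's suggested approach explores 10% of the time, split evenly between the two versions, so it permanently assigns 5% of traffic to the worse version. That gives you 0.25% less improvement than you found (5% of traffic times a 5% difference).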

Now suppose you tried a dozen things. 1/3 of them were 5% better, 1/3 were 5% worse, and 1/3 did not matter. The 10% bandit approach causes you to lose 0.25% conversion for each of the 8 tests with a difference, for a permanent drop of roughly 2% in your conversion rate compared with actually making your decisions.
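To make the arithmetic concrete, here is a minimal Python sketch of that back-of-the-envelope calculation. The function name `epsilon_penalty` and its parameters are made up for illustration, and it assumes exploration traffic is split evenly across versions and that the per-test penalties simply add up (they really compound, but at this size the difference is negligible).

```python
def epsilon_penalty(epsilon, relative_lift, n_versions=2):
    """Relative conversion lost by exploring forever (hypothetical helper).

    Assumes exploration traffic is spread evenly over all versions, so a
    fraction epsilon * (n_versions - 1) / n_versions of visitors keeps
    seeing a version that is worse by `relative_lift`.
    """
    share_on_worse = epsilon * (n_versions - 1) / n_versions
    return share_on_worse * relative_lift


# One test with a real 5% difference and 10% exploration.
per_test = epsilon_penalty(epsilon=0.10, relative_lift=0.05)

# A dozen tests: 4 better, 4 worse, 4 with no difference.
# Only the 8 with a real difference cost anything.
tests_with_difference = 8
total = tests_with_difference * per_test

print(f"per test: {per_test:.2%}")
print(f"total over a dozen tests: {total:.2%}")
```

The per-test figure comes out at about 0.25% and the total at about 2%, matching the numbers above.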

(Note, this is not a problem with all bandit strategies. There are known optimal approaches where the total testing penalty decreases over time. If the assumptions of a k-armed bandit hold, the average returns of the epsilon strategy will lose to "A/B test, then go with the winner", which in turn loses to more sophisticated bandit approaches. The question of interest is whether the assumptions of the bandit strategy really hold.)
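A rough way to see why the epsilon strategy eventually loses to "A/B test, then go with the winner" is to compare expected cumulative losses. The sketch below is a simplification, not a simulation: it assumes the bandit has already found the better version, that the A/B test picks the right winner, and the test size, base rate, and function names are all hypothetical.

```python
def epsilon_loss(visitors, epsilon, lift, base_rate):
    """Expected conversions lost by a two-version epsilon strategy.

    Assumes the better version is already known, so the only loss is the
    epsilon / 2 of traffic that keeps going to the worse version.
    """
    return visitors * (epsilon / 2) * base_rate * lift


def ab_then_commit_loss(visitors, test_size, lift, base_rate):
    """Expected conversions lost by a 50/50 A/B test followed by a switch.

    Assumes the test picks the right winner, so the loss stops growing
    once `test_size` visitors have been seen.
    """
    return min(visitors, test_size) * 0.5 * base_rate * lift


# Hypothetical numbers: 2% base conversion, 5% relative lift,
# 10% exploration, and an A/B test that runs for 10,000 visitors.
for n in (50_000, 100_000, 1_000_000):
    eps = epsilon_loss(n, epsilon=0.10, lift=0.05, base_rate=0.02)
    ab = ab_then_commit_loss(n, test_size=10_000, lift=0.05, base_rate=0.02)
    print(f"{n:>9} visitors: epsilon loses {eps:.1f}, A/B loses {ab:.1f}")
```

Under these assumptions the A/B loss is capped once the test ends, while the epsilon loss keeps growing linearly. The two cross at test_size / epsilon visitors (100,000 here), and past that point the epsilon strategy is behind for good.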

Whichever form of testing you use, you're doing better than not testing. Most of the benefit just comes from actually doing testing. But the A/B testing approach here is not better by hundredths of a percent. It is better by a permanent margin of about 2%. That's not insignificant to a business.

If you move from 10% to 20%, that permanent penalty doubles. You're trading off certain types of short-term errors for long-term errors.

(Again, this is just an artifact of the fact that an epsilon strategy is far from an optimal solution to the bandit problem.)

are you not allowed to follow the data closely with the bandit approach?

I am not sure what you mean here.