The really scary drawback is what happens if the bandit prefers a suboptimal choice at the same time that you make an independent improvement in your website. Then the bandit is going to add a lot of data for that variation, all of which looks really good for reasons that have nothing to do with what it is supposed to be testing.

This type of error (which can happen very easily on a website going through continuous improvement) can take a very long time to recover from.

A/B tests do not have this issue because all versions get a similar mix of data from before and after the improvement.
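To make that concrete, here is a minimal simulation sketch (made-up numbers and a hand-rolled allocation, not any particular bandit implementation): arm A truly converts better, but an unrelated site-wide improvement lands just as the bandit happens to be favouring B, so B's running average is dominated by post-improvement traffic and can look as good as or better than A's, while a fixed 50/50 split keeps the comparison fair.

    import random

    random.seed(1)

    TRUE = {"A": 0.06, "B": 0.05}   # A really is the better variation
    LIFT = 0.03                     # unrelated site-wide improvement
    N_PRE, N_POST = 50000, 50000

    def simulate(share_of_b_pre, share_of_b_post):
        counts = {"A": 0, "B": 0}
        wins = {"A": 0, "B": 0}
        for lift, n, share_of_b in [(0.0, N_PRE, share_of_b_pre),
                                    (LIFT, N_POST, share_of_b_post)]:
            for _ in range(n):
                arm = "B" if random.random() < share_of_b else "A"
                counts[arm] += 1
                wins[arm] += random.random() < TRUE[arm] + lift
        return {a: round(wins[a] / counts[a], 4) for a in counts}

    # Bandit that happens to prefer B just as the improvement lands:
    print("skewed allocation:", simulate(0.5, 0.9))
    # Classic A/B test, 50/50 before and after:
    print("50/50 A/B test:   ", simulate(0.5, 0.5))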




This seems like a trivial thing to fix by presenting the optimal choice with limited noise. Say the bandit picks the optimal choice x% of the time (some really high number), and when additional changes are made or automatically detected, that percentage drops. If you spread the remaining traffic over the next-best options down the line, and make x proportional to the time since the last change, that should make it reasonably resistant to this kind of bias in the first place, and x can ramp back up at a reasonable rate.

Better yet, make x depend both on the time since the last change and on the relative change in performance of all options before and after the change.
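Something along those lines could look like the rough sketch below: an epsilon-greedy-style bandit where the exploit fraction x drops to zero when a site change is recorded and ramps back up with time since that change. The constants and the linear ramp are placeholders I made up, and a real version would also fold in the performance-delta term.

    import random
    import time

    X_MAX = 0.95                   # long-run share of traffic given to the leader
    RAMP_SECONDS = 7 * 24 * 3600   # how long until x is back near X_MAX

    class AdaptiveBandit:
        def __init__(self, arms):
            self.counts = {a: 0 for a in arms}
            self.wins = {a: 0 for a in arms}
            self.last_change = time.time()

        def note_site_change(self):
            # Call when an unrelated change ships or is auto-detected;
            # this drops x back to zero so all arms collect fresh data.
            self.last_change = time.time()

        def exploit_fraction(self):
            elapsed = time.time() - self.last_change
            return X_MAX * min(1.0, elapsed / RAMP_SECONDS)

        def choose(self):
            untried = [a for a, c in self.counts.items() if c == 0]
            if untried:
                return random.choice(untried)
            if random.random() < self.exploit_fraction():
                return max(self.counts, key=lambda a: self.wins[a] / self.counts[a])
            return random.choice(list(self.counts))

        def record(self, arm, converted):
            self.counts[arm] += 1
            self.wins[arm] += 1 if converted else 0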


I might not be understanding you correctly, but wouldn't the independent improvement also help the random bandit choices? If you are using a forgetting factor this shouldn't be a real issue.

My problem with the bandit method is that I want to show the same test choice to the same person every time they see the page, so you can hide the fact that there is a test running. If I do this with the bandit algo then it warps the results, because different cohorts get different weightings of the choices, and different cohorts behave very differently for lots of reasons.
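For what it's worth, the usual way to get that stickiness is to freeze each user's assignment at first exposure (rough sketch below, with a plain dict standing in for whatever persistent store you actually use). The warping you describe falls straight out of it: each user's variant is drawn from whatever the bandit's weights happened to be when they were first seen.

    import random

    assignments = {}   # user_id -> variant, frozen at first exposure

    def variant_for(user_id, current_weights):
        # current_weights: list of (variant, probability) pairs from the bandit
        if user_id not in assignments:
            variants, probs = zip(*current_weights)
            assignments[user_id] = random.choices(variants, weights=probs)[0]
        return assignments[user_id]

    # A user first seen while the bandit favours A keeps A forever,
    # even after the bandit has swung towards B for newer users.
    print(variant_for("u1", [("A", 0.9), ("B", 0.1)]))
    print(variant_for("u1", [("A", 0.1), ("B", 0.9)]))   # same variant as above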


The independent improvement also helps the random bandit choices. The problem is that you are comparing As drawn from a largely new population with Bs that are mostly from an old population. It takes a long time to accumulate enough new Bs to resolve this issue.

A forgetting factor will help.
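For example, one simple forgetting factor is an exponentially weighted estimate per arm, so observations from before a site change decay away instead of dominating forever (GAMMA here is an arbitrary placeholder):

    GAMMA = 0.999   # per-observation decay; closer to 1 means longer memory

    class DecayingEstimate:
        def __init__(self):
            self.weighted_wins = 0.0
            self.weighted_count = 0.0

        def update(self, converted):
            # Old data is multiplied by GAMMA on every new observation,
            # so pre-change data fades out geometrically.
            self.weighted_wins = GAMMA * self.weighted_wins + (1.0 if converted else 0.0)
            self.weighted_count = GAMMA * self.weighted_count + 1.0

        @property
        def rate(self):
            return self.weighted_wins / self.weighted_count if self.weighted_count else 0.0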

That bias is a variant of the cohort issue that you're talking about.

The cohort issue that you're talking about raises another interesting problem. If you have a population of active users and you want to test per user, you will often find that your test population ramps up very quickly until most active users are in, and then slows down. The window in which most users are assigned is a period where you have poor data (you have not been collecting for long, and users have not necessarily had time to reach a final sale).

It seems to me that if you want to use a bandit method in this scenario, you'd be strongly advised to make your fundamental unit the impression, and not the user. But then you can't hide the fact that the test is going on. Whether or not this is acceptable is a business problem, and the answer is not always going to be yes.



