20 lines of code that beat A/B testing every time (stevehanov.ca)
989 points by spiffytech on May 29, 2012 | 147 comments



A stevehanov.ca link? Wow, HN is getting classy again. Please, more articles with code, equations, and well-made visualizations, and less upvoting of badly thought out infographics (i.e. pretty numbers which would lose nothing by just being presented in a table), and far fewer self-help pseudo-business articles, please.

+1 on an article does not mean "I agree". It means "I learnt something".


Bandit optimization has been discussed previously on HN:

http://news.ycombinator.com/item?id=2831455

The problem with this article is that it's a) not very detailed, and b) the conclusion is linkbait. Bandit optimization is a useful tool, but it has drawbacks, and it's not always better than A/B testing. In particular, bandit approaches take longer to converge (on average), and don't give you reliable ways to know when to stop testing (when all you know is that you're using an approach that's optimal in the limit of a large N, your only guarantee is that things get better as N gets large). These techniques also make assumptions that aren't valid for a lot of web experiments: identical "bandit" distributions that are constant over time. Throw a few choices that are optimal at different times of day/week/month/year at a bandit optimizer, and it'll just happily fluctuate between them.

Also, there's a lot of variation in performance depending on the parameters of your test -- some of which are completely unknowable. So if you want to really learn about this method, you need to read more than a blog post where the author has concluded that bandit optimization is the new pink. For example, here's a pretty readable paper that does an empirical analysis of the various popular bandit algorithms in different parameterizations:

https://docs.google.com/viewer?a=v&q=cache:KgmC8CnPhxwJ:...

This is just one article, but there's tons of literature on this problem. (In fact, if you use the 'softmax' criterion mentioned in that article, you're doing something very similar to simulated annealing, which is a rather elderly optimization technique.)


The really scary drawback is what happens if the bandit prefers a suboptimal choice at the same time that you make an independent improvement in your website. Then the bandit is going to add a lot of data for that variation, all of which looks really good for reasons that have nothing to do with what it is supposed to be testing.

This type of error (which can happen very easily on a website going through continuous improvement) can take a very long time to recover from.

A/B tests do not have an issue with this because all versions will have similar mixes of data from before and after the improvement.


This seems like a trivial thing to fix by presenting the optimal choice with limited noise. Let's say it picks the optimal choice x% of the time (some really high number), and when additional changes are made or automatically detected, this percentage drops. If you pick the next most optimal down the line through all of your options, and make x proportional to the period of time since the last change, it should be reasonably resistant to this kind of biasing in the first place, and can ramp back up at a reasonable rate.

Better yet, make x dependent in some way on time since the last change, and relative change in performance of all options from before and after the change.


I might not be understanding you correctly, but wouldn't the independent improvement also help the random bandit choices? If you are using the forgetting factor this shouldn't be a real issue.

My problem with the bandit method is that I want to show the same test choice to the same person every time he sees the page so you can hide that there is a test. If I do this with the bandit algo then it warps the results because different cohorts have different weightings of the choices and differing cohorts behave very differently for lots of reasons.


The independent improvement also helps the random bandit choices. The problem is that you are comparing A from a largely new population with Bs that are mostly from an old population. It takes a long time to accumulate enough new Bs to resolve this issue.

A forgetting factor will help.

This is a variant of the cohort issue that you're talking about.

The cohort issue that you're talking about raises another interesting problem. If you have a population of active users, and you want to test per user, you often will find that your test population ramps up very quickly until most active users are in, and then slows down. The window where most users are assigned is a period where you have poor data (you have not collected for long, users have not necessarily had time to go to final sale).

It seems to me that if you want to use a bandit method in this scenario, you'd be strongly advised to make your fundamental unit the impression, and not the user. But then you can't hide the fact that the test is going on. Whether or not this is acceptable is a business problem, and the answer is not always going to be yes.


Agreed. And I'd add that for me, the point of A/B testing is to learn something. We're not just interested in whether A or B is better; we're interested in getting better at doing what we do. Studying the A/B results is half the fun.


"We're not just interested in whether A or B is better; we're interesting in getting better at doing what we do."

Absolutely. That's my biggest objection to bandit methods, but it's also the fuzziest objection, and the one least likely to appeal to hyper-analytical people. There's a strong temptation (as we can see from this article) to treat bandit optimization as a black box that just spits out an infallible answer (i.e. as a Lazy Button).

It's the same human tendency that has led "you should follow me on twitter" to become one of the more common n-grams on the interwebs (even though it probably never worked for more than Dustin Curtis, and likely causes a counter-intuitive backlash now).


Well put. I get why they don't get it, but from my perspective it looks like technical gold-plating a lot of the time. I generally aim to amplify the power of the human mind, not eliminate it from the system.

Of course, these algorithms are cool in certain contexts where you never want to think about it again. For example, ad placement for a giant ad network running against a lot of UGC content. In that way they remind me of neural networks: the post office doesn't care how the machine recognizes hand-written zip codes, just as long as it does so reliably.

But startups are all about learning, and interface design is where you have direct contact with users. I want to soak in the data. I want to hang out in people's living rooms when they're using my site. A fancy Lazy Button (great phrase) is the last thing I need.


Why can't you do the same study from the bandit results? From what I understand, this is the same as A/B testing, except it will only show a suboptimal result to 10% of the users instead of 50% (or more). After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?

The A/B test just gives you the conversion rate of each option. And so does the bandit. As I understand, the only difference is that the bandit will be bugging your users with bad results less often.


Sure, but wouldn't that take longer to get us our answer, and therefore keep us from moving on to the next experiment sooner?

Of course, the real answer to why not is "we have an A/B system and I'm not going to add bandit stuff for no benefit". But even if I were doing it from scratch, it seems more complex. The benefit of these approaches seems to be that one no longer has to think. We want to think.


"After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?"

No, because the assumptions that underpin many statistical techniques are violated when you're not assigning people to cohorts consistently, and at random.


If you are using an epsilon-greedy approach (or something similar), then I believe that the data collected during the exploration portion (the random calls) are open, albeit with less power due to reduced sample size, to standard hypothesis testing. Think of it this way: you might normally run your experiment on a subset of your traffic (population) - so only 20%, with the rest (80%) getting the current experience. With the e-greedy type of approach you are just swapping the 'current experience' with the maximum estimated experience, but that other 20% is still a random draw.
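For illustration, here is a minimal sketch of that setup (the names and the 20% figure are just taken from the example above, not from the article). Each draw is tagged, so that only the random "explore" observations feed the hypothesis test:

  import random

  def choose(options, successes, trials, epsilon=0.2):
      # Hypothetical epsilon-greedy chooser that labels each draw, so the
      # uniformly random "explore" draws can be analysed like a normal A/B test.
      if random.random() < epsilon:
          return random.choice(options), "explore"
      # Exploit: play the option with the best observed conversion rate so far.
      rates = {o: successes[o] / trials[o] if trials[o] else 0.0 for o in options}
      return max(options, key=rates.get), "exploit"

Only the rows tagged "explore" would then go into the significance test, at the cost of the reduced sample size mentioned above.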


The draws aren't independent. At any given time, the probability of assigning a user to a cohort is dependent upon a function of the previous observations (in other words, it's a Markov model).

The standard confidence tests -- t-tests, G-tests, chi-squared tests, etc. -- are based on distributions of independent, identically distributed (iid) data.

I'd have to think about it more, but I believe that btilly's examples are also the most intuitive reasons why independence matters. If your data is time-dependent, then assigning users to cohorts based on past performance lets the time dependency dominate. There may be other good examples.


Is that true in the e-greedy case? Sure, during the exploit call, they are not independent, but during the explore portion I would assume they are, since they have been randomly assigned into the exploration pool (epsilon) and then drawn from a uniform random draw. There is no information that I can see from prior draws being used.


"and don't give you reliable ways to know when to stop testing"

Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing. In both multi-armed bandit and A/B testing you need to set ahead of time the number of users you are going to run your test against and stop the test at that point regardless of if your result is significant or not.


In theory, yes. In practice, no.

See http://elem.com/~btilly/effective-ab-testing/index.html#asli... for part of a presentation that I did where I actually set up some reasonable fake tests, and ran simulations. What I found is that if there is a significant difference, the probability of coming to the wrong conclusion was (as you would expect) higher, but not that high before the underlying difference made mistakes incredibly unlikely. Conversely, if there is only a small real difference, the amount of data needed before you have a significant chance of having accidentally come to an erroneous conclusion is very, very large.

So avoid accepting any result where you don't have at least a few hundred successes and set your thresholds reasonably high. You will make fewer mistakes than you probably fear, and the ones that you make will almost always be very minor. (Oops, I had a 3% chance of accepting the 1% worse solution as probably better.)

Of course if you're aiming to publish academic research, your standards need to be higher. But if you're more interested in getting useful results than publishable ones, you can relax your standards. A lot.


That link crashes Safari on my iPad.


Great presentation.


"Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing."

Nobody said that it was. But when you do regular split testing, you can use power analysis to estimate the length of time you need to run an experiment to get a significant result at a certain precision:

http://en.wikipedia.org/wiki/Statistical_power

You can't do this (at least, not easily) when you're using bandit models, because none of the assumptions are valid.


I upvoted you because I agree.

But I usually upvote stories based on how interesting they are and whether I feel like they were worth attention. Comments are a mixed bag of agreeing, interesting and just well thought out arguments that make me think.

This is completely meta and kind of offtopic, so it should probably be downvoted, but I felt like saying it anyway.


I also +1 a comment where I learn something.

As in meat-space voting, you should always reward good behavior... be part of the fitness function.


Sadly, an upvote on stories really means "more of that".


And I really did, so +1 from me.


This is thought-provoking, which is good. However there are significant issues with the approach.

1. Real world performance varies over time. For instance there are typically daily, weekly and monthly conversion rate fluctuations. Not an issue for A/B testing, but a big issue for this approach if a random switch in direction happens at the same time that conversion fluctuations happen to head in a good direction.

2. (This is really a special case of #1 - but a very, very important special case.) This approach creates long-lasting interaction effects between tests and independent changes. That requires explanation. Suppose you're running a test. Version A is somewhat better. But version B is temporarily looking slightly better when you make a significant improvement to your website (maybe you started another test that works). Now you're adding a lot of good traffic to version B (good because of the other change) and very little of the new good traffic to version A. This new version B traffic soundly beats your old version A traffic. This correlation between time and website performance will continue until the old version A traffic is completely swamped by new version A traffic. With only 5% of your traffic going to version A, this can easily take 100x as long as your test has been running - or more. (Properly constructed A/B tests do not suffer this statistical anomaly.)

3. Code over time gets messy. One of the most important characteristics of A/B testing is that you can delete the mess and move on. With this approach you can't - it just hangs around adding to your technical debt.

4. Businesses are complex, and often have multiple measures they would like to balance. For instance in a recent test, conversion to click was hurt, conversion to a person who clicked 5x was helped. A/B testing let us notice that something weird was going on and think about what we really cared about. This automated approach would make a decision and could have hidden a real problem.

5. Many tests perform differently on existing users and new users. A/B testing with proper cohort analysis can let you tease this out and decide accordingly. This approach doesn't give you that kind of sophistication.


These also don't make much sense to me, but to avoid getting d/ved to oblivion I'll say why.

Firstly, neither you nor the author is really citing any maths or real world papers or anything. He's backing his up by saying that all the advertisers are using this over A/B though, which is a pretty strong argument. But it occurs to me that for most of your points to stand you need to tackle this particular paragraph:

Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method.

So to tackle your points:

1. Only stands if you show his paragraph above to be wrong. Does epsilon-greedy only work on consistent payouts, or does it work on fluctuating payouts too? It would seem to me that this would be a common occurrence in advertising on websites. I imagine there is some research out there to settle this!

2. He addresses this directly in the post: This won't adapt to change. (Your visitors probably don't change. But if you really want to, in the reward function, multiply the old reward value by a forgetting factor)

3. There is no difference between this and A/B testing, the mock code he shows is supposed to go in your A/B testing framework, the code in the controllers is supposed to be the same (and you can remove it the same way).

4. Isn't A/B testing just as bad at testing multiple factors? Why wouldn't you 'notice'? You should theoretically see the same percentages for each stage, and would be able to notice the oddity.

5. Again this only stands if you show his paragraph above to be wrong. You are suggesting that a complicated strategy will win, which he says isn't true.


I showed that paragraph to be wrong. Not slightly, but wildly in the common scenario where, while a test is running, you make another change to your website that lifts conversion.

The difference in this case is not a few hundredths of a point of conversion. It is a question of potentially drawing the wrong conclusion, and favoring the wrong version, for 100x as long as you need to.


What you're forgetting is that it's adaptive. That 10% random factor means it's constantly adding in new information. Also, you can graph trends over time, so if you make a significant change you could reset the historical data to zero, or simply let it run and it will adapt to the change.

If you're really concerned about rapidly changing events, just add a diminishing return. AKA multiply both the success and failure numbers by, say, .9999 after each test, so 34/2760 becomes 34.9966/2760.724 on a success or 33.9966/2760.724 on a failure.
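A rough sketch of that bookkeeping (the .9999 factor is just the number suggested above):

  def update(successes, failures, converted, decay=0.9999):
      # Decay both counts so older observations gradually lose weight,
      # then record the new result. Displayed as successes / (successes + failures),
      # this reproduces the example above: 34/2760 becomes 34.9966/2760.724 on a success.
      successes *= decay
      failures *= decay
      if converted:
          successes += 1
      else:
          failures += 1
      return successes, failures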


I am not forgetting that it is adaptive. I'm pointing out that the new information that is added will cause it to mis-adapt for a surprisingly long time. Adding in a diminishing return is possible. But by what factor do we diminish? We could easily get a disturbing amount of customization required.


The time it takes to adapt is directly related to the magnitude of the difference: if it takes six months to go from a 1.006% efficient strategy to a 1.007% efficient strategy, it's not that important. The goal is to find significant wins quickly, and any strategy that focuses on micro optimization will tend to find local maxima, not global ones. If the top two strategies are close enough this greedy algorithm will tend to bounce between them, and that's ok.

As to the diminishing factor, you diminish both the numerator and the denominator for a bucket every time you test that bucket. If you want something next to perfect try http://en.wikipedia.org/wiki/Bayesian_statistics, but that eats a lot of CPU and is harder to code for minimal gain.


I got a very different impression of this method than you did because I saw it as, just as with an A/B test, when you make a major change to the site, you reset your counters. This makes your point #2 mute. As for number 5, why couldn't you use the same cohort analysis with this method?


moot.

mute == "does not talk"

moot == "of little or no practical value or meaning; purely academic."

moot/mute is a tricky word combo because smart people can justify using "MUTE" instead of moot, and English is such a terrible language w.r.t. spelling that there are no good clues for using one versus the other (and "moot" is far less commonly used).


Off-topic: Most of the time, such grammar/spelling corrections get downvoted because people think they are nitpicking and don't add anything to the conversation, or are rude somehow, but as a non-native English speaker, I appreciate them a lot!


What is a major change?

I've seen minor tweaks to a form raise conversions by 20%. The person running a particular test may not even know that a particular change was significant, or even that it happened. The change could be as subtle as another running test concluding that version x is better, interfering with existing tests.

As for #5, with an A/B test, when you run into these situations you're able to break down and crunch the numbers in multiple ways, and then have a discussion about how you want to proceed. But with a multi-armed bandit approach, whatever complexities you have not thought of and baked into your approach are not going to be noticed.


It seems to me that A/B (or A/B/C/...) testing as described by btilly, and epsilon-greedy multi-armed bandit optimisation as described by Steve Hanov are two points on a continuum. In A/B/... testing, you're exploring 100% of your traffic, and eventually you declare the test "done", and start exploiting 100% of your traffic (sending it to the "best" slice). You have the advantage that during the exploration phase, your assignment of users to test slices is completely random, uncorrelated with anything else. In the epsilon-greedy approach, you're exploring with 10% of your traffic, and exploiting the current best-looking choice with 90% of your traffic. The trouble is that now 90% of your data is coming from one slice, and which slice it's coming from could vary over time. This means, as Ben points out, that your choice of slice could be 90% correlated with anything else that varies over time.

One approach would be to ignore the data from the 90% exploitation; that way, you only get 10% of the data, but its slice assignment is completely random and uncorrelated with anything else that might be happening. The trouble is that now you're running an A/B/... test on only 10% of your traffic, which means that it will converge 10x slower than if you were running it on 100% of your traffic.

However, it seems to me that the extra 90% of data that I've proposed ignoring isn't that useful, because it's only coming from one slice at a time. What you really want is to get more data from the slices you know least about. I suspect there are reinforcement learning algorithms that take into account not just the reward rate for each slice, but the current level of certainty with which the algorithm knows the reward rate, so it can collect more data about the slices it knows the least about, and stop collecting data about the slices for which it already has a fairly accurate reward estimate. The question is, are there such algorithms that can also handle non-stationary reward distributions? And how much tuning and tweaking do they require?
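For what it's worth, the UCB ("upper confidence bound") family of bandit algorithms does roughly this: each arm is scored by its observed rate plus an optimism bonus that shrinks as that arm accumulates data, so under-sampled arms automatically receive more traffic. A minimal sketch of UCB1 (not something proposed in the article):

  import math

  def ucb1_choose(options, successes, trials):
      # Play any arm that has never been tried.
      for o in options:
          if trials[o] == 0:
              return o
      total = sum(trials.values())
      def score(o):
          mean = successes[o] / trials[o]
          # Bonus is large for rarely tried arms and shrinks as data accumulates.
          return mean + math.sqrt(2 * math.log(total) / trials[o])
      return max(options, key=score)

Classic UCB1 still assumes stationary rewards, though; handling drift requires variants such as discounted or sliding-window UCB, which do bring back some tuning.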


That data from the 90% on slice X is valuable because you're trying to confirm, or falsify, that X is better, quickly. The strategy is "look closely at this pointy thing to see if it's a needle or a sharp piece of straw."


But... "better" than the other choices, which are only getting 1/10th the traffic. The amount of traffic sent to a choice should be a function of how good you currently think it is, and how much more data you need to be sufficiently certain about that choice. So choices that have insufficient data should get more traffic, and choices that already have sufficient data to be sure they're worse than some other choice should get very little traffic. How much traffic they should get depends on how certain you are of stationarity (is that a word?).


re: #3... you can also remove items at will. If you look at the counters and find option C is consistently the lowest performer, then you can remove it from the code and remove the counter. It doesn't mean cruft needs to pile up.


But how do you know when you can do that? Particularly with the slower convergence issue, and the potential presence of confounding factors like #2?


So Ben, looks like it's time to cross swords again :)

1. Real world performance varies over time...

I don't really understand this complaint. With A/B testing, you have to collect enough data before making a decision that the conversion rate fluctuations average out. If you're worried about monthly fluctuations (which is reasonable), and you're doing A/B testing, you need to collect data for a few months. This means months where you aren't optimising. With a bandit algorithm conversion rate fluctuations will cause changes in the bandit behaviour, but they will average out over time just like A/B testing. We can prove this, so I don't know what justification there is for the statement that it is "a big issue for this approach". Indeed I think this points out a weakness of A/B approaches. Firstly, it's arguable that tracking fluctuations in conversion rate is a good thing. They may be temporary, but why shouldn't the algorithm react to temporary changes? Secondly, unlike A/B testing you're not spending months collecting data -- you're actually optimising while you collect data. Finally, what site can afford to spend months not changing, while data is collected?

2. (This is really a special case of #1 - but a very, very important special case...

I see two issues here: 1) the fact the traffic has changed and 2) how long does it take the system to recognise this change?

Addressing 1: Getting technical for a moment, A/B tests assume the distribution generating the data is stationary. This situation violates this assumption, so one shouldn't be using A/B testing here. I'd like to know what is meant by a properly constructed A/B test in this case. Does it mean throwing out all the data before the significant improvement? If you say that a non-stationary distribution is fine, then you can develop bandit algorithms for non-stationary situations (e.g. http://arxiv.org/abs/0805.3415). Myna handles non-stationary situations, so long as they change quite slowly.

Point 2 is really about convergence rates. There is some work here (http://imagine.enpc.fr/~audibert/Mes%20articles/ALT11.pdf) and the upshot is that convergence rate can be a problem in theory though you can design for it. Notably we haven't observed this issue in practice.

3. Code over time gets messy...

This is a non-issue in my experience. At some point you decide that an experiment is no longer worthwhile. Either you're redesigning things so it doesn't make sense anymore, or the experiment has been running long enough that you are confident of no more changes. When you hit this point you remove the code. It can be a year or an hour after starting the experiment. Either way, you know with a bandit algorithm you will have optimised what traffic you did receive.

4. Businesses are complex, and often have multiple measures they would like to balance...

The glib answer to this is to set your reward criteria correctly, but more below.

5. Many tests perform differently on existing users and new users...

Both of these are really about performing extra analysis on the data. Myna doesn't help here, and we could make it better in this regard by, for example, exporting data. Some of this extra analysis, such as cohort analysis, we can and will automate in the future. There will always be analyses that we can't or don't perform, so there will always be jobs for people who perform these analyses :-) On the other hand, bandit algorithms are used enough in practice (e.g. Google uses them for AdWords) that their utility has been validated. It's a trade-off between time and return. We want to automate the common cases so the analysts can concentrate on the interesting parts.


Do you attempt to control for seasonal effects at all in your models? I'm wondering if allowing for a dependence on day-of-week, time-of-day and other relevant covariates, has any benefits over tracking fluctuations in a generic way?


In order to do anything in practice, you have to make assumptions. With A/B testing your key assumption is that differences in version behavior are relatively stable over time, even if absolute behavior changes. This is not always true - for instance a version that makes references to Christmas will perform differently before and after Dec 25. But it tends to be true often enough that people can rely on that assumption. (With no simplifying assumptions of any kind you get analysis paralysis - that would be bad.)

Now to reply point by point.

1. Real world performance varies over time...

With the assumption that I provided above, A/B testing merely needs to collect enough data to detect real differences in user behavior. After a few days you may not know what the average long-term rate is, but you have strong evidence that one is better.

I have absolutely no idea where you got the impression that A/B tests typically run for months, or that you can't touch anything else while it is running. How long they need to run varies wildly (depends on traffic volume, conversion rates, etc), but for many organizations a week is a fairly long test.

2. (This is really a special case of #1 - but a very, very important special case...

Contrary to your point, A/B tests DO NOT assume that the distribution is stationary. They make the much weaker assumption that if a relative difference is found, that difference will continue to be true later. (If this assumption is only true 95% of the time, it is still useful...)

When you're running multiple tests the time-dependence of performance is often extreme. For instance while running test 1, you introduce test 2, it wins, and you adopt it. Now the time period you're running test 1 has 2 conversion bumps of around 5% in the middle. This is not a problem for properly constructed A/B tests. But it is a serious violation of "Myna handles non-stationary situations, so long as they change quite slowly."

Now what is a properly constructed A/B test? A properly constructed A/B test is one where membership in any particular test version has only random correlations with any other factor that could materially affect performance. Such factors include time, and other running tests. If each user is randomly placed in a test version upon first encountering them, independently of anything else that happens, then you have a well-constructed A/B test. Your versions will be statistically apples/apples even though early and late users are apples/oranges.
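As a concrete illustration of that construction (a sketch, not anything specific from this thread): hash a stable user id together with the test name, so assignment is random across users, fixed per user, and independent of time and of any other running test.

  import hashlib

  def assign_version(user_id, test_name, versions):
      # Hashing (test_name, user_id) keeps each test's split independent of every
      # other test's split, and a given user always lands in the same version.
      digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
      return versions[int(digest, 16) % len(versions)]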

3. Code over time gets messy...

The degree to which this will be seen to be an issue depends on personal taste, and how much you're the person who has to dive through templates with a maze of if conditions. If you talk to management, it is almost never a (visible) problem. Programmers on the ground often disagree.

4. Businesses are complex, and often have multiple measures they would like to balance...

Getting the business to have the necessary discussions at an abstract level is a non-starter in my experience. When you have a concrete example, decisions are easier to make.

Also (as happened in the test that I saw last week) often the discrepancy is a sign of another problem. A better reward criteria would have resulted in making a good pick, but looking at the discrepancy found an issue that wouldn't have been found otherwise, and we'll hopefully wind up with an even better option that we would not otherwise have found.

5. Many tests perform differently on existing users and new users...

My point is that the extra analyses are useful. However this is really a secondary or tertiary point since most companies doing A/B testing are not going this extra mile.

As for AdWords, Google has sufficient volume for a traditional A/B test to converge in minutes. They have so much volume that continuous adaptation can just happen for them. Most businesses are not in such a fortunate place.


1 is definitely an issue, but has a solution. I posted an article about this a while back with a simple example of the problem and a possible solution. The example is a staged rollout but it illustrates the same point. If you change proportions of tests over time as there are independent changes, you can get skewed results: http://blog.avidlifemedia.com/2011/12/23/advanced-ab-testing...


2. If you change the thing you're testing, you need to restart the test. Otherwise, you'll have a meaningless jumble of results from two separate things.


re #1,

could this be solved with an exponential decay?

aka, saying that a click from 1 month ago is less valuable than someone clicking today.

By tweaking the decay you could change how quickly the algorithm will sway when the conversion rate changes.

EDIT: I just saw that rauljara suggested this below: https://news.ycombinator.com/item?id=4040230


I don't think any of those points are very true.


I've been involved with A/B testing for nearly a decade. I assure you that none of these points are in the slightest bit hypothetical.

1. Every kind of lead gen that I have been involved with and thought to measure has large periodic fluctuations in user behavior. Measure it, people behave differently on Friday night and Monday morning.

2. If you're regularly running multiple tests at once, this should be a potential issue fairly frequently.

3. If you really fire and forget, then crud will accumulate. To get rid of that you have to do the same kind of manual evaluation that was supposed to be the downside of A/B testing.

4. Most people do not track multiple metrics on every A/B test. If you don't, you'll never see how it matters. I make that a standard practice, and regularly see it. (Most recently, last week. I am not at liberty to discuss details.)

5. I first noticed this with email tests. When you change the subject line, you give an artificial boost to existing users who are curious what this new email is. New users do not see the subject line as a change. This boost can easily last long enough for an A/B test to reach significance. I've seen enough bad changes look good because of this effect that I routinely look at cohort analysis.


What do you think of Myna, in these respects? Does it suffer from the same disadvantages as other bandit optimization approaches?

http://mynaweb.com/docs/


Does it suffer from the same disadvantages as other bandit optimization approaches?

Yes.

That said, the people there are very smart and are doing something good. But I would be very cautious about time-dependent automatic optimization on a website that is undergoing rapid improvement at the same time.


#1 certainly is, particularly for businesses prone to seasonal fluctuations. In local lead-gen (for instance) you see big changes in conversion based on the time of year.


Wow, you convinced me...

Sarcasm aside, I've also experienced all of these issues with real world testing and would be interested in hearing your argument as to why you think this is not the case.


Sorry, was on an iPad.

Most or all of the points suffer from:

* is that actually true?

* does regular a/b testing not also face that issue?

* was it suggested that you must "set it and forget it"?

* are there no mechanisms for mitigating these issues?

* would using 20% or 30% mitigate the issues?

* are you not allowed to follow the data closely with the bandit approach?

The whole list struck me as a supposed expert in the status quo pooh-poohing an easier approach.


Most or all of the points suffer from:

Let's address them one by one.

is that actually true?

In every case, yes.

does regular a/b testing not also face that issue?

For the big ones, regular A/B testing does not face that issue. For the more complicated ones, A/B testing does face that issue and I know how to work around it. With a bandit approach I'm not sure I'd have noticed the issue.

was it suggested that you must "set it and forget it"?

Not "must", but it was highly recommended. See paragraph 4 of http://stevehanov.ca/blog/index.php?id=132 - look for the words in bold.

are there no mechanisms for mitigating these issues?

There are mechanisms for mitigating some of these issues. The blog does not address those. As soon as you go into them, you get more complicated. It stops being the "20 lines that always beats A/B testing" that the blog promised.

I was doing some back of the envelopes on different methods of mitigating these problems. What I found was that in the best case you turn into

would using 20% or 30% mitigate the issues?

That would lessen the issue that I gave, at the cost of permanently worse performance.

The permanent performance bit can benefit from an example. Suppose that there is a real 5% improvement. The blog's suggested approach would permanently assign 5% of traffic to the worse version, for 0.25% less improvement than you found.

Now suppose you tried a dozen things. 1/3 of them were 5% better, 1/3 were 5% worse, and 1/3 did not matter. The 10% bandit approach causes you to lose 0.25% conversion for each test with a difference, for a permanent roughly 2% drop in your conversion rate over actually making your decisions.
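For reference, the arithmetic behind those figures, under the assumptions stated above:

  traffic_to_loser = 0.10 / 2                        # 10% exploration split over two versions
  lift_per_test = 0.05                               # each real difference is 5%
  loss_per_test = traffic_to_loser * lift_per_test   # 0.0025, i.e. 0.25% per decided test
  tests_with_difference = 8                          # 2/3 of a dozen tests
  total_loss = 1 - (1 - loss_per_test) ** tests_with_difference   # about 0.02, i.e. roughly 2%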

(Note, this is not a problem with all bandit strategies. There are known optimal approaches where the total testing penalty decreases over time. If the assumptions of a k-armed bandit hold, the average returns of the epsilon strategy will lose to "A/B test, then go with the winner", which in turn loses to more sophisticated bandit approaches. The question of interest is whether the assumptions of the bandit strategy really hold.)

Whichever form of testing you use, you're doing better than not testing. Most of the benefit just comes from actually doing testing. But the A/B testing approach here is not better by hundredths of a percent, it is about a permanent 2% margin. That's not insignificant to a business.

If you move from 10% to 20%, that permanent penalty doubles. You're trading off certain types of short-term errors for long-term errors.

(Again, this is just an artifact of the fact that an epsilon strategy is far from an optimal solution to the bandit problem.)

are you not allowed to follow the data closely with the bandit approach?

I am not sure what you mean here.


This is an interesting technique, but it too has flaws. If there is a period of buzz and excitement surrounding your app, whatever design was most popular at that time will be rewarded accordingly, and accrue a high click-through rate with tens of thousands of cases. If you introduce a new superior design after the period of buzz has gone away, the new design may take a very long time to catch up. Even though it is currently outperforming the old, the few hundred cases that are being added won't be enough to offset the tens of thousands that came before.

With all metrics, it's important to understand what's actually going into the measure and where it might get tripped up.

A potential solution might be to add a decay factor, so that the older data carries less weight.


Better than a forgetting factor, add a Kalman filter (http://en.wikipedia.org/wiki/Kalman_filter). This way you can trust your "new" data more than really "old" data, etc. The beauty of it is that it only adds three attributes to each data sample.


Could you expound on this a bit? What attributes would you have to add? How would you calculate scores?


You would add a variance (P), an estimate of the value, and the timestamp of the last measurement. Using the last timestamp you can calculate Q. Generally, the older the last measurement, the higher Q.

The calculation is straightforward once you let some terms be the identity:

  P1 = P0 + Q
  K = P1 / (P1 + R)
  x1 = x0 + K * (z - x0)
  P1 = (1 - K) * P1
Now you have the new score for your data (x1) and a new variance to store (P1). Other values are:

x0, P0 - previous score, previous covariance
Q - roughly related to the age of the last measurement; goes up with age
R - measurement error; set it close to 0 if you are sure your measurements are always error-free
z - the most recent measured value

Let's say you measure number of clicks per 1000 impressions. Now you can estimate the expectation value (x1) for the next 1000. After the second 1000 re-estimate again.
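A sketch of that update as code, following the formulas above (how to pick Q from the age of the last measurement and R from your measurement noise is left open):

  def kalman_update(x0, p0, z, q, r):
      # One-dimensional Kalman update: z could be, say, the click count
      # observed over the latest 1000 impressions.
      p_pred = p0 + q             # predicted variance grows with the age of the estimate
      k = p_pred / (p_pred + r)   # Kalman gain
      x1 = x0 + k * (z - x0)      # new score, pulled toward the measurement
      p1 = (1 - k) * p_pred       # new variance to store
      return x1, p1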


Thanks for explaining that!


How does a decay factor not "trust your 'new' data more than really 'old' data"?


The Kalman filter is much more sophisticated. Typical re-estimation will be:

  x1 = x0 + alpha * (z - x0)
where alpha is static. The Kalman filter will make it dynamic, taking into account how you obtained the measurements, how old the last re-estimation was, how noisy the process is, etc. Want to do multi-variate analysis? Make alpha a matrix transform.


I don't think this flaw is real.

If "whatever design was most popular at that time" has a billion trials and 90% successes, and "a new superior design" has 100 trials and 97% successes (97 out of 100), than the new design is favored by the algorithm. No need to "catch up" to the absolute number of successes.


Exactly what I was thinking. What are we missing?


What was meant was: if there is a period of time when you get a lot of visits, lots of clicks, and an abnormally high CTR. This could happen due to external factors, for example if you make the front page of HN. Over time, this effect will vanish, but you will be stuck, for a long time, with high CTR estimates for the design that was in place when it happened.


If anyone wants to search the literature, the terminology for this is the non-stationary bandit problem. The classical bandit problem is stationary but there has been plenty of research done into non-stationary variants as well.


Nice post. One hypothetical case it could end up serving the worse design more often is if you had times of the day where your users behave very differently due to time zones, i.e. Europe vs North America.

Say you have a sports site and test a new soccer oriented layout vs an old baseball heavy one. In the day, the old baseball version wins easily. When NA goes to sleep it would serve up baseball to the Europeans until it loses, then after several hours soccer is the winner. But then it is too late and the Europeans go to sleep and on and on. This is an odd example and assumes equal balance, and the site should really be localized, but you get the point.


This is actually pretty straightforward to overcome (and one of the real strengths of the Bayesian approach). Rather than using the direct counts of success for each group, add a prior belief that Americans will favor baseball and non-Americans will favor soccer (you can experiment to determine this number).

You can now evaluate the results conditioned on each group (American / non-American).


Or if you don't want to add in a prior specific belief about americans vs. non-americans, just keep one counter per option per continent-of-source-IP (or whatever) and the learning algorithm should work it out on its own. Of course, if you use too many bins then learning is going to take far too long.
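In other words, something like keying the counters on (context, option) pairs (a sketch; the binning scheme is whatever you choose):

  from collections import defaultdict

  successes = defaultdict(int)   # keyed by (context, option), e.g. context = continent of source IP
  trials = defaultdict(int)

  def record(context, option, converted):
      trials[(context, option)] += 1
      if converted:
          successes[(context, option)] += 1

  def best_option(context, options):
      # The bandit (or the analysis) then runs separately within each context bin.
      return max(options, key=lambda o: successes[(context, o)] / max(trials[(context, o)], 1))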


At this point you could flush out the inferior options and start anew with just the winning design and the new one.


Instead of selecting according to the largest expectation, you can select according to the largest value within a certain deviation (based on the variance) of the calculated expectation. The law of large numbers will apply and you will have tighter and tighter bounds around the expectations. That is a superior and more efficient method, used in Monte Carlo simulations, e.g. for AI playing the game of Go.


He mentions this flaw and proposes the same solution.


This is the most important critique of A/B testing. It far outweighs the traditional hoopla about simultaneous inference and Bonferroni corrections.

Epsilon greedy does well on k-armed bandit problems, but in most applications you likely can do significantly better by customizing the strategy to individual users. That's a contextual bandit, and there are simple strategies that do pretty well here too. For instance:

http://hunch.net/?p=298

http://hunch.net/~exploration_learning/main.pdf

http://web.mit.edu/hauser/www/Papers/Hauser_Urban_Liberali_B...


First up, the sales pitch: we provide bandit optimisation SaaS at Myna: http://mynaweb.com Now that that's out of the way, let's discuss the article.

I like the epsilon-greedy algorithm because it's simple to understand and implement, and easy to extend. However, to claim "The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method" is false. The standard measure of performance is called regret. You can think of it as the number of times you choose the sub-optimal choice. It is clear that this grows linearly in e-greedy, as there is a constant probability of exploring. The same is true in A/B testing (you show 1/2 the people the suboptimal choice in the data gathering phase and then make a decision that you have some probability of getting wrong.) A good bandit algorithm has regret that grows logarithmically with time, which is a huge difference! This result holds out in practice as well. If you look at some of Yahoo's papers (John Langford, for example; sorry no links as writing this while getting the kids ready!) you'll see comparisons to e-greedy where they significantly out-perform it. We've had the same results in our testing.
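For reference, one widely used algorithm in that logarithmic-regret family is Thompson sampling (the comment doesn't say which algorithm Myna uses, so this is only an illustrative sketch):

  import random

  def thompson_choose(options, successes, trials):
      # Sample a plausible conversion rate for each option from its Beta posterior
      # and play the highest sample; uncertain options get explored automatically,
      # and exploration fades as the posteriors sharpen.
      def sample(o):
          return random.betavariate(successes[o] + 1, trials[o] - successes[o] + 1)
      return max(options, key=sample)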


Yeah, I think the problem here is that trying to be a little bit smart kind of gets you into the space where really you should be doing things a LOT smarter. A/B testing provides data that doesn't require much in the way of brains to interpret and is hard to draw poor conclusions from (beyond treating something as statistically significant that is not). Once you step off into epsilon-greedy, you fall into the whole reinforcement learning space.

To that end, btw, I think a service like yours is potentially quite valuable!


Actually, you kind of are already in the RL space when using AB testing to make online decisions, you just may not be thinking of it that way. From Sutton & Barto: "Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal." That is exactly what you are doing when applying A/B style hypothesis testing to inform decisions in an online application. Plus, personally, I think A/B testing is, in a way, much harder to interpret; at least most folks interpret it wrong, which isn't a knock, since it provides a non-intuitive - at least to me ;) - result.


Maximizing a numerical reward signal is definitely not what we're doing when we do an A/B test.

We collect a variety of metrics. When we do an A/B test, we look at all of them as a way of understanding what effect our change has on user behavior and long-term outcomes.

A particular change may be intended to affect just one metric, but that's in an all-else-equal way. It's not often the case that our changes affect only one metric. And that's great, because that gives us hints as to what our next test should be.


Well I guess you could be running a MANOVA or something to test over joint outcomes, but the AB test is over some sort of metric. I mean, when you set up an experiment, you need to have defined the dependent variable first. Now, after you have randomly split your treatment groups you can do post hoc analysis, which I think is what you are referring to. But if you are optimizing, there needs to be some metric to optimize over. Of course at the end of the day the hypothesis test just tells you prob(data or greater | null=true), which I am not sure provides a direct path to decision making.


As far as I understand it, an advantage of the epsilon greedy algorithm is that it will relearn the best choice if it changes over time. Now, you could do that with a logarithmically-regretful algorithm as well, but it would take more time to relearn.


John Langford's page is here: http://hunch.net/~jl/


I actually noticed that Google was doing this with my Adwords account a couple of weeks ago.

I have 2 ads in an Ad Group, and the wording between the two is different by a single word. One ad had over double the clickthru rate than the other one, just because of that single word difference.

I noticed that Google was serving the two ads about 50% of the time, and was going to shut off the one ad that had the lower CTR, but then I let it go, and the next day, I saw that the more successful ad had almost all the views, and the less successful one was barely being served.


Yeah, this is an option in the AdWords interface; it is enabled by default but can be turned off. Google probably discovered that most of their advertisers don't check on their ad results very often and added the auto optimization so that they could show the better ads and make more profit even without the user's interaction.


The 'set and forget' aspect of this is appealing. I've sometimes wondered if you could automate the whole thing, including option generation. If you can define good enough mutation functions you could have your features literally evolve over time, without developer input. You'd need a lot of throughput to get reasonable evolution rates though. Jacking up the mutation rate won't help because really big mutations will break the layout.

It's almost certainly impracticable, but fun to think about.


I'd love to see a website designed entirely by statistical machine learning :-)


Google leans pretty heavily on machine learning for even trivial design choices: http://stopdesign.com/archive/2009/03/20/goodbye-google.html mentions the infamous testing of 41 shades of blue to decide which was the correct one.

But truly someone has to design the learning system; at some level you will always have design, even if it is the design of how the design adapts to its environment.


What's to say those spam websites aren't exactly that?


Well, this website modifies a lot of variables, though it doesn't make everything variable: http://www.metamorphosite.com/about


This is part of a larger class of problems known as reinforcement learning problems. A/B testing when used for decision optimization can be thought of (sort of) as just a form of bandit using an epsilon-first approach. You play random until some threshold (using some sort of arbitrary hypothesis test), which is the learning period, then you exploit your knowledge and play the estimated best option. Epsilon-greedy is nice because it tends to work well regardless, and isn't completely affected by drift (nonstationarity of the environment). One heuristic for deciding whether to use a bandit approach is to ask: is the information I will glean perishable or not perishable? For perishable problems the opportunity cost to learn is quite high, since you have less time to recoup your investment in learning (reducing the uncertainty in your estimates). Also, finding the optimal answer in these situations may be less important than just ensuring that you are playing from the set of high performing actions. We have a couple of blog posts on related issues: http://www.conductrics.com/blog/


There are appropriate solutions to the multi-armed bandit problem, and a wealth of literature out there, however this is not one of those solutions.

Here's a simple thought experiment to show that this will not 'beat A/B testing every time.' Imagine you have two designs, one has a 100% conversion rate, one has a 0% conversion rate. Simple A/B testing will allow you to pick the winning example. Whereas this solution is still picking the 0% design 10% of the time.

For some other implementations check out the following links:

For Dynamic Resampling:

http://jmlr.csail.mit.edu/papers/volume3/auer02a/auer02a.pdf

http://www.mynaweb.com

For Optimal Termination Time:

http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-te...


OK, in AB testing that same 0% design, you're showing it 50% of the time.

You seem to be saying "I'll A/B test it just for a little, then weed out the 0% one, but in the case of this new algorithm, I'll let it run for a long time." That's not exactly fair. Not to mention, both algorithms would allow you to clearly see that the 0% option sucks.


But the only way this testing method is superior (at least as explained in the article) is that it automatically adjusts itself. If you're going in and adjusting manually, it sounds like this is — at best — precisely as reliable as A/B testing and subject to the same critique the OP levels at A/B testing.


"But the only way this testing method is superior (at least as explained in the article) is that it automatically adjusts itself."

That's actually very useful for me though. Especially if a site has a lot of tests, or I'm running tests for a multitude of clients. It means I have to babysit the tests less frequently.


It will adjust itself in the sense that while it is still enabled and testing (a suboptimal state for your product), it will let your product perform better than plain old A/B.

At some point you still step in and decide based on the data, especially if you detect degenerate cases, and move on to the next experiment.


There's no point in continuing to run A/B testing indefinitely once a winner has been chosen, and like-wise there is no point in running this indefinitely either.

Sure the OP advocates "set and forget", but it seems simply silly to do this indefinitely. Obviously, after a while, you're going to want to check in on the results and prune out low performers.

Is that not the entire point of all these different testing methodologies to begin with?

Also, if you were to follow the same "set and forget" strategy with A/B, you'd have a 50% chance of showing the 0% CR design, whereas it's only 10% with this method.

Seems like it wins out in both cases to me.


Possibly pedantic - but it would only show the loser 5% of the time. 10% of the time it picks a random design, including the winner.


Obviously epsilon-greedy is not optimal. But I think the point is it's surprisingly effective and probably better than A/B testing in most applications. Most alternatives get very complicated very quickly.

Even your thought experiment is not conclusive. First, it will take some time to be convinced which variation is better. The lost revenue during the 50/50 experimentation phase may outweigh the long-run benefit if your discount rate is high. This can happen in markets where customer acquisition is slow and expensive (i.e. a niche targeted via AdWords). Second, except in the unrealistic 0% vs 100% case, you could get stuck on the wrong choice forever if you pass a statistical significance test by chance. Epsilon-greedy will correct itself over time and is asymptotically guaranteed to make the correct choice most of the time. A/B testing only has this asymptotic guarantee if you spend asymptotically long experimenting, which defeats the whole exercise.


Or, get complicated and vary the epsilon based on the recent change in conversions.

For example, record the conversions from the exploration path vs each individual exploitation attempt. Then, adjust epsilon accordingly, so that it is within a certain percentage of one of your best choices.


I like this.

See, we've spent the last year doing an Omniture implementation, using T&T, and frankly, it's been a waste of money and time.

This isn't a criticism of the T&T tools; it's a criticism of our own internal analytics handling and analysis team, and possibly the implementation guidance we received from a few places.

You can guess at generalizations from the A/B/N changes you've made and try them again, but practically speaking? Meh. It seems like the learnings from one page are very hard to transfer to another page.

That's why you don't see posts like "Top ten generalizations about how to make your website better!" Instead you pay "SEO Experts" and "Analytics Ninjas" as consultants and they tell you things like "pages with images convert more, generally, but that may not be the case for your specific website because of blah". Handy. Do you have any more generic and obvious advice that doesn't tangibly translate into practical changes to my website?

Here's the thing: Running an A/B is easy, but analyzing the results is really hard. Generating some kind of general analytics rules is very hard.

The reason I like this approach is simple: it's easy. It's easy to implement. It's easy to explain. It's easy to convey to the designers that you're doing a 'throw it at the wall and see what sticks' approach to pages. It's easy to explain to managers why the best page has been picked. You don't need a self-important analytics ninja who will go on endlessly about user segments and how if you segment the data differently you can see kittens in the page view statistics.

Just make lots of pages, and see what happens.

It's not perfect, but it's enough to get started~ (...and honestly, that's the most important thing when you're working with analytics; it beats the hell out of a 3 month implementation cycle that ends up with... page views <-- to be fair, personal opinion about uselessness of implementation not presently shared by marketing department, who likes the pretty graphs. Whatever they mean.)


Add the UCT Monte Carlo tree search algorithm, and you have a strong, generic AI for two-player games such as chess, go, othello, etc.

Here's a C++ implementation: https://github.com/joelthelion/uct


Maybe I'm missing it (it's late), but nowhere in the article does it explain why 10% of the time it picks a choice at random ("explores"). In fact, the article basically argues why it's not needed (it self-rights if the wrong choice becomes temporarily dominant). It also doesn't explain why specifically it should be a 10% randomization.


A random choice is needed to allow people to give rewards to options other than the dominant one. I'm sure this doesn't have to be random -- and I'd be curious to see the logic behind the 10% choice -- but you have to have something that gives the other options a chance.

Makes me wonder if the 10% number couldn't be changed to something that's a function of the total number of rewards; the longer it runs, the less variation there is and the more confident you are in the choice made.


That would be the epsilon-decreasing strategy or VDBE:

http://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform...
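
A toy epsilon-decreasing sketch (my own numbers, not from those sources): epsilon shrinks as the total trial count grows, so early traffic explores a lot and later traffic mostly exploits.

  def decreasing_epsilon(total_trials, start=0.5, halflife=200.0, floor=0.01):
      # Explore heavily at first, then decay toward a small floor; at
      # total_trials == halflife the exploration rate has halved.
      return max(floor, start * halflife / (halflife + total_trials))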


Like many things in machine learning, this is a simplified version of Metropolis-Hastings.


The problem is that, depending on your initial settings and early results, certain settings could get such low payoff estimates that they're never tried again (let's say one gets to 50% after one trial, and the rest all stay above 75%). You want to make sure that your solver adequately explores all choices.


Presumably you would want the randomness to give choices that got to 0% another chance. It seems like a better way could be to use the previous success ratios as a probability density function.
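
Something like this, perhaps (a sketch; the add-one smoothing is only there so options at 0% keep a nonzero chance):

  import random

  def pick_proportional(stats):
      # stats: {name: (successes, trials)}. Weight each option by its smoothed
      # success ratio and draw one at random with those weights.
      names = list(stats)
      weights = [(s + 1) / (n + 2) for s, n in stats.values()]
      return random.choices(names, weights=weights)[0]

  # e.g. pick_proportional({"A": (200, 500), "B": (150, 500), "C": (0, 3)})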


I think rather than get hung up on e-greedy vs. A/B testing vs. UCB (Bayesian vs. non-Bayesian), it is helpful to first step back and think about the larger problem of online learning as a form of the prediction/control problem. The joint problem is to 1) LEARN (estimate) the values of possible courses of action in order to predict outcomes, and 2) CONTROL the application by selecting the best action for a particular situation.

I noted elsewhere that A/B can be thought of as an epsilon-first learning approach: play randomly 100% of the time until p-value < alpha, then play greedy (play the 'winner'). As an aside, it is unclear to me how using p-values is a clearer, easier, or more efficient decision rule for these types of problems. It is almost always misinterpreted as Prob(B>A|Data), the choice of alpha determines the threshold but is arbitrary, and it is often a straw-man default - implicitly biasing your confusion matrix. Not saying that you won't get good results, just that it is not clear that it is a dominant approach.

This simple post I wrote on agents and online learning might be informative http://mgershoff.wordpress.com/2011/10/30/intelligent-agents...

But don't take my word for it (disclaimer: I work for www.conductrics.com, which provides decision optimization as a service) take a look at a great intro text on the topic by Sutton & Barto http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.ht...


Seems like what Myna is doing: http://mynaweb.com/


Conductrics as well: http://conductrics.com


Any more tool suggestions that implement more complex forms of testing?


I created a tool a few years ago built on a similar strategy, but instead of only showing the best-performing variation, the chance of each variation showing was based on how well it was converting (so in an A/B/C test with conversion rates of 3%/2%/1%, version A would show 1/2 of the time, version B would show 1/3 of the time and version C would show 1/6 of the time).

There was one major flaw with this strategy though:

Let's say you're testing a landing page and have had 1000 visitors, and version A is converting at 40% while version B is converting at 30%. So it looks like this:

  Version A - 200 / 500 - 40%
  Version B - 150 / 500 - 30%

A new affiliate comes on board and decides to send 200 visitors to your page from some "Buy 200 visitors for $1" domain redirection service. These visitors are such low quality that they will never ever buy anything and will probably just close the window immediately (or are bots). Now your results look something like this:

  Version A - 200 / 680 - 29.4%
  Version B - 150 / 520 - 28.8%

And with just 200 visitors, some random affiliate has killed all your results. Now you could add filtering and reports based on the affiliate or traffic source, but this is more code and more attention you have to pay to the test.

If you were running a traditional A/B test your results would look like this:

  Version A - 200 / 600 - 33%
  Version B - 150 / 600 - 25%

And even though the overall conversion rate is lower, you can still see version A is better than B.

The idea is good and I love the idea of auto optimization, but it does have its flaws, which require more than 20 lines of code to overcome.


You might want to look at Boltzmann/softmax if you want to weight the probability of selection as a function of the current estimated value. One tricky bit is figuring out a good setting for the temperature parameter. Another poster alluded to softmax. In my experience it doesn't really perform better than a simple e-greedy approach, but maybe it has worked well for others?
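
For the curious, a bare-bones softmax pick in Python (tau is the temperature knob mentioned above; 0.1 is an arbitrary starting point):

  import math
  import random

  def softmax_pick(estimates, tau=0.1):
      # estimates: {name: estimated conversion rate}. Lower tau means greedier;
      # higher tau means closer to uniform random.
      weights = [math.exp(v / tau) for v in estimates.values()]
      return random.choices(list(estimates), weights=weights)[0]

  # e.g. softmax_pick({"A": 0.40, "B": 0.30}, tau=0.05)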


There are better approaches for tackling this problem (with zero regret asymptotically). You can take a look at the UCB (Upper Confidence Bound) algorithm, and you can do even more if you assume some continuity, e.g. what is commonly done is to assume that the whole distribution comes from a Gaussian process. Many interesting ideas in the literature indeed :)
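
A minimal UCB1 sketch, assuming rewards (conversions) are in [0, 1]; just to show the shape of it, not a tuned implementation:

  import math

  def ucb1_pick(stats):
      # stats: {name: (total_reward, trials)}. Play every arm once first,
      # then pick the arm with the best mean plus confidence bonus.
      total = sum(n for _, n in stats.values())
      best, best_score = None, float("-inf")
      for name, (reward, n) in stats.items():
          if n == 0:
              return name
          score = reward / n + math.sqrt(2 * math.log(total) / n)
          if score > best_score:
              best, best_score = name, score
      return best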


ASIDE: I love the idea of combining this kind of approach with a random site generator. You can then totally automate the business, from inception onwards. If incorporated as a company, it's an artificially intelligent artificial person.

A problem with this (well one of them) is that it could home in on the worst of spammy techniques, like masquerading as a legitimate message, shaking animation, illegal claims and "simple" lies/promises. A solution is to filter these out - one could hand-code specific cases, but a general solution seems as difficult as AI in the first place, and requires external knowledge (such as of infrastructure messages, human visual physiology, laws and their interpretation, the concept of lying). It's hard to gather data (e.g. each lawsuit); but maybe something along the lines of a simple "complaint" button would do (use statistics to discount accidental/abusive clicks).


Something along the lines of giving the bad behavior a huge negative reward (penalty), and you end up automating that as well :).


A lot of the reasons for which this algo can optimize in the wrong direction have been covered here. In general terms, I agree with all of it.

I am a big fan of two techniques that I feel would enhance an approach like the one suggested: FIR filters and decay.

Simply put: I believe it is important to have a mechanism through which decisions are not made too quickly. A finite impulse response filter would take care of this very well.

In addition to that, older measurements should not carry the same importance as nice fresh data. Who cares what people thought about the buttons (or whatever) a month or two ago? The crowd visiting the site this week could have been affected by solar flares. Maybe they prefer something else.
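
For the decay part specifically, a constant-step-size (exponentially weighted) update is probably the simplest sketch; the 0.01 step is made up:

  def decayed_update(estimate, reward, step=0.01):
      # Move the estimate a fixed fraction toward each new reward, so old
      # observations fade out geometrically instead of counting forever.
      return estimate + step * (reward - estimate)

  # per event: estimate_a = decayed_update(estimate_a, 1.0 if converted else 0.0)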

Obviously you need enough traffic to be able to use such techniques.

Not sure it's worth the complexity in all cases.


I can imagine this works very well for ecommerce websites and other things where there is a very obvious single measure of success, for example:

  * the user clicked the button
  * the user signed up
  * the user put something in their cart
  * the user paid X
In this case, it's easy to create the feedback loop that is required for this testing method.

However, in the real world, things are not always that simple. What if you want to optimise:

  * the percentage of users that return to the site
  * the time that users spend on your site
  * the number of pages that they view in a session
I'm sure some of these metrics can also be plugged back into the bandit algorithm, but it's a lot more complicated.
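
One way to plug them in (very much a sketch, and it glosses over the hard parts like attribution and delayed feedback) is to collapse the metrics into a single bounded reward before handing it to the bandit; the caps and weights below are invented for illustration:

  def composite_reward(returned, minutes_on_site, pages_viewed):
      # Normalize each signal into [0, 1], then combine with guessed weights.
      r_return = 1.0 if returned else 0.0
      r_time = min(minutes_on_site / 10.0, 1.0)   # cap at 10 minutes
      r_pages = min(pages_viewed / 5.0, 1.0)      # cap at 5 page views
      return 0.5 * r_return + 0.25 * r_time + 0.25 * r_pages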


Next step: close the loop.

Option 1: Let the next set of values (colors of the button in the example) themselves be generated after n runs.

Option 2: let users modify the CSS as part of a "customization" feature. I remember they did this before the new BBC site was officially launched a few years ago.


This epsilon-greedy [1] thing looks just like what my old boss used to tell me: 'trust but verify'. Ohh, I got so sick of those words, but at least now I have an algorithm for it.

Just set epsilon = 0.1 (10%) higher or lower depending on your initial (personal) confidence. If your guess was right and your epsilon is low, then the overall impact on the 'optimal' solution is negligible, but you have built in a fail-safe in case you were human after all.

[1] https://en.wikipedia.org/wiki/Multi-armed_bandit


Bayesian bandits is also a very interesting approach: http://tdunning.blogspot.co.uk/2012/02/bayesian-bandits.html
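
The core of that idea (Thompson sampling with Beta posteriors) really is tiny; a rough Python sketch:

  import random

  def thompson_pick(stats):
      # stats: {name: (successes, failures)}. Sample a plausible conversion rate
      # from each arm's Beta posterior and play the arm with the highest draw.
      samples = {name: random.betavariate(s + 1, f + 1)
                 for name, (s, f) in stats.items()}
      return max(samples, key=samples.get)

  # e.g. thompson_pick({"A": (200, 300), "B": (150, 350)})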


Of course, if the environment is truly stationary, then the easiest, simplest hack for exploration is just to seed the initial value for each option (A/B/.../Z) with an optimistic guess (something you know is higher than the true value). Then just make decisions based on the current best estimate. The estimates will be driven down over time to their true values. Not claiming you should do this or that it is optimal or anything, but keep it in mind as a quick hack to solve the problem.
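
i.e. something along these lines (numbers invented; only reasonable if the conversion rates really don't drift):

  # Optimistic initialization: pretend every option already converted perfectly,
  # then always play the current best estimate; no explicit exploration step.
  estimates = {"A": 1.0, "B": 1.0, "C": 1.0}  # far above any real conversion rate
  counts = {name: 1 for name in estimates}

  def pick():
      return max(estimates, key=estimates.get)

  def update(arm, reward):
      counts[arm] += 1
      estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean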


It's funny to see the startup crowd rediscovering age-old techniques... Seen the Amazon.com home page recently? Does it look the same every day for a month?


I can draw many parallels between this and genetic algorithms. There is a probability of choosing the next choice (in GA, the children), and the most profitable (in GA, fittest) choices have the highest probability. The solutions evolve, and the most profitable (in GA, fittest) solutions remain.

How is this different from Genetic Algorithms?


A GA is a zeroth-order optimization method. A bandit is a type of decision problem. So a bandit is a single-state RL problem where one is trying to make decisions in an environment in order to minimize regret. A GA is a general optimization approach for when there is no gradient or second-order info about the problem to use. Take a look at XCS classifiers for an approach that can solve bandit-type problems, but uses GAs to estimate the mappings between features and rewards.


Thank you.


what?!? No patio11 comment?


I upvoted btilly's at the top of the thread, considered saying "FWIW: my opinion is this, particularly #2", and then decided that added very little value.


I actually jumped straight to the comments on this article to see if he responded. Not that he's required to, but I do find his thoughts valuable on this topic.


This data isn't fully valuable unless you know what alternate behaviour the users are performing (instead of clicking the alternate-coloured buttons). They still could be staying on your site, just not following the same funnel. They could be return visitors as well.


The example highlights the reward of button pressing, but most likely you would be measuring your "success" differently in your own application. You could add different behavior categories and weight each according to how much it is worth to your product.


Right. If you totaled the importance values then you could see a nice funnel.


There's one thing that keeps me concerned: time (actually the number of responses required to pick the best option). Please correct me if I'm wrong, but I've got a feeling that this process requires more displays to effectively determine 'the best' option.


That's actually easy enough to fix. You trade off between certainty and time by setting the initial confidence values.

For instance, if you're testing A, B, and C; you can start off with success/total values of 1/1 for each or 100/100 (for extreme values). If you start off with 1/1, a single hit or a single miss will swing the algorithm quickly and heavily in that particular direction; e.g. 1 miss for C results in 1/2, and brings its success rate down from 100% to 50% immediately, giving instant precedence to A and B. Whereas if you used 100/100 to start, a single miss for C would only bring it down to 100/101, letting the algorithm take much longer to "settle," but with far more confidence.

The trick is in picking a number that suits your needs, e.g. for expensive traffic sources (AdWords) pick smaller numbers to minimize the cost of the experiment, and for cheaper, higher-volume sources use larger numbers because you can afford the extra time to be sure.
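
In code the knob is literally just the starting counts; a sketch using the parent's two extremes:

  # Pseudo-counts act as a prior: small ones settle fast, big ones settle
  # slowly but with more confidence.
  def init_arms(names, prior_successes, prior_trials):
      return {name: [prior_successes, prior_trials] for name in names}

  jumpy  = init_arms(["A", "B", "C"], 1, 1)      # one miss: 1/2, rate drops to 50%
  steady = init_arms(["A", "B", "C"], 100, 100)  # one miss: 100/101, barely moves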



This sounds really interesting; I might have been close to building this before without realizing it. :) Seems really easy to set up as well; it will be interesting to hear any counter-arguments in the comments.


Seconded. As a non-mathsy marketing/biz guy, my response was "whoa, sounds great.", but with the caveat that my response to the actual statistical math side was "Derp."

I'm really looking forward to hearing comments from people With Actual Maths!


I've made it a habit to read his articles on a regular basis. There's always something thought-provoking that sticks for a long time.


There's a lot out there beyond A/B testing; most of what people are looking for tends to be multivariate.


This does require a significantly larger sample size for accuracy though.


A larger sample size than A/B? Why?


I don't know for sure, but it (intuitively) seems to me that you'd require a larger sample size because only 10% of the time will it explore other options. With A/B testing, you've got equal odds to hit either option, so each step is independent of the last; with this, each step depends on the previous results, and thus if one option jumps out into the lead (for whatever reason), the other is less likely to get a good sample.


Does anyone know of any split test products on the market that do this?


Does anyone have a JS implementation of the multi-armed bandit algorithm?


Can I get this in Ruby, please? :)



Great! Now how about COBOL?


I always wondered how the great people I know can do so much better without having to do A/B testing in their own businesses. Sure they try new things, but they are most certainly not applying an A/B-type algorithm.

It seems the article mentions something quite important in their algorithm: mostly do what has the most expected value. The people who are great just have a much better way to judge that. They can make a poster with 20 design choices (or 50 or 100) and estimate what would probably work and what probably wouldn't on each one, from the size of the poster to the font sizes and types, whitespace, where to place different elements, graphics choices, etc. They certainly include a random aspect, but this is the exception rather than the norm. Mostly what dictates choices is your expected return on them, and the random aspects are compared with that.

They do pay exquisite attention to their random choices: "This week I decided to see what would happen if I mixed up the day, date, location, and description rather than have it be in logical order, to see if this engaged people any more" (or: to change any other choice randomly). But it is the exception rather than the norm, and done rarely rather than often. They still pay attention to the results, which helps inform their "expected value" function.

(I didn't spend much time on the article or the linked papers, so please feel free to correct me if I'm misinterpreting. A rigorous algorithm doesn't have much to do with real-world choices, and we are simply nowhere near having an automated web service to write your copy, regardless of how many users get to give feedback on it. So the whole thing isn't very interesting to me, and the above is just my impression of 'why'.)


Nothing new here. This is a common technique. Any email newsletter service worth its salt has been doing this for years, and I think many A/B testing tools have as well. A/B testing doesn't mean you'll assign 50% of users to each version.


There is no evidence in this article that

1) this approach is better than A/B testing.

2) it doesn't suffer from the Heisenberg principle.

Also, the call out box at the top is obviously an ad but it's not marked "Sponsored link", which is really scammy.


>Also, the call out box at the top is obviously an ad but it's not marked "Sponsored link", which is really scammy.

That's because it's a link to a webapp the author built.



