<i>Any A/B testing article will generally explain that you can split users using the modulo operator on some user id</i>
The good ones will tell you that you can do that, but then explain that doing so gives you problems if you want to run more than one A/B test at the same time. They should also explain that there can be large chance fluctuations, so you need to be able to tell the difference between significant results and chance ones. Otherwise people will draw lots of wrong, premature conclusions.
I must say I liked the article itself, but the presentation you linked a lot better. It's definitely a good, detailed explanation of A/B testing, including all its gotchas, metrics, etc. :)
Rather than using the modulo on some user id, you'd generally want to use the modulo on md5(test_name + user_id). You can't be sure that user ids are evenly distributed -- for example, if they're monotonically increasing, then there might be subtle interactions between signup order and your A vs. B variations which could poison the validity of your results.
Bonus points: if you modulo on the user ID then Bob will always get whatever alternative you write first, which could be suboptimal unless you write your code in random order (which you don't). If you modulo the above MD5 thing, Bob will sometimes get A and sometimes get B for different tests, but always get consistent results within the same test.
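For the curious, here's a minimal sketch of that hashing scheme in Python (the function name and the alternatives tuple are just placeholders for illustration):

    import hashlib

    def variant(test_name, user_id, alternatives=("A", "B")):
        # Hash the test name together with the user id: the same user gets a
        # stable assignment within one test, but independent assignments
        # across different tests.
        digest = hashlib.md5(f"{test_name}{user_id}".encode()).hexdigest()
        return alternatives[int(digest, 16) % len(alternatives)]

    # Bob always sees the same alternative within a given test...
    print(variant("button_color", 12345))
    # ...but is bucketed independently for a different test.
    print(variant("checkout_flow", 12345))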
You recommended the same thing at http://news.ycombinator.com/item?id=812421. My question to you then was what you'd do if you started with 4 versions then proved that one was bad. Your response was to start over.
If you instead keep a table of which people are in which version, you can progressively remove versions without throwing away existing data. This also allows you to set up tests where A and B get unequal splits of the data. The drawback is that you need a new table that grows fast and is heavily accessed. However, I've found that cost to be quite reasonable.
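As a rough illustration of the table-based approach (the schema and helper here are hypothetical, with SQLite standing in for whatever storage you actually use):

    import random
    import sqlite3

    conn = sqlite3.connect("ab_tests.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS assignments (
        test_name TEXT NOT NULL,
        user_id   TEXT NOT NULL,
        variant   TEXT NOT NULL,
        PRIMARY KEY (test_name, user_id))""")

    def assign(test_name, user_id, variants):
        # Reuse the stored assignment if this user was already bucketed.
        row = conn.execute(
            "SELECT variant FROM assignments WHERE test_name=? AND user_id=?",
            (test_name, user_id)).fetchone()
        if row:
            return row[0]
        # Otherwise pick a variant (repetition in the list gives unequal
        # weights), record it, and return it.
        choice = random.choice(variants)
        conn.execute("INSERT INTO assignments VALUES (?, ?, ?)",
                     (test_name, user_id, choice))
        conn.commit()
        return choice

    # Unequal split: B gets twice A's traffic. A version can later be removed
    # from the list without disturbing users who were already assigned to it.
    print(assign("pricing_page", "user-42", ["A", "B", "B"]))

This is the read-on-every-render, write-per-new-visitor cost being traded off below.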
I think we're largely on the same page vis-a-vis the performance vs. ease of mutating tests tradeoff. In my personal workflow, tests are cheap, so throwing them away doesn't really bother me. The performance hit bothers me a little, as it would potentially cause problems for some people using my A/B testing library. If your performance needs tolerate a read on every render of an A/B tested element and a write for every unique visitor seeing an A/B tested element, though, mazel tov, your way should work swimmingly.
For folks interested in weighting without the database: you can do unequal weights with the modulo method, too. Imagine if you had five choices as follows: [A, A, B, B, B]. This array can get computed once from an arbitrarily simple (or complex) representation in your program, and then cached somewhere where accessing it is very fast. (I stick it in Memcached.) Thus you get all the weights with none of the calories. You don't get trivial re-weighting, though.
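A sketch of that, again in Python for illustration (the weight spec and the cache step are stand-ins; the expanded array is what you'd park in Memcached):

    import hashlib

    def expand(weights):
        # Expand a weight spec like {"A": 2, "B": 3} into [A, A, B, B, B].
        return [name for name, count in sorted(weights.items())
                for _ in range(count)]

    WEIGHTED = expand({"A": 2, "B": 3})  # compute once, then cache it

    def weighted_variant(test_name, user_id, expanded=WEIGHTED):
        # Same hash-and-modulo trick, just indexing into the weighted array.
        digest = hashlib.md5(f"{test_name}{user_id}".encode()).hexdigest()
        return expanded[int(digest, 16) % len(expanded)]

    # Roughly 40% of users land in A and 60% in B, deterministically per user.
    print(weighted_variant("landing_headline", "user-42"))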
The problem isn't how expensive tests are, it is that someone is waiting for the test answer. Throwing away several days of data causes them to wait longer, which makes for unhappy product managers.
Agreed. Even if the tests aren't run at the same time, it still seems dangerous to split by modulo. Why? Suppose some test ends up with the result that group A visits the site 5% less in the future, while B is unchanged. After this result is discovered, the next test is run, which is to test whether the background should be white (A) or puke green (B).
Now even if in actuality people don't care one way or the other about the color, it may well be discovered that group A does worse, because there are still people among this group that were exposed to the previous test and are now coming back less (or have stopped coming at all).
I have been trying to decide what app to make next, and you have given me the answer (or at least reminded me of the IMVU interview on Mixergy a while back: http://mixergy.com/ries-lean/).
Following the IMVU method, I plan to add a few made-up product links at the top of my site (with made-up screenshots) and see how many clickthroughs I get for each one. This will help me decide what people are looking for.
Awesome.
Edit: I wonder how I can use this analytical, A/B testing approach with everything in life. I'll have to keep a notebook.
For a better (albeit much longer and more detailed) explanation of how to A/B test, read the one I did last year at http://elem.com/~btilly/effective-ab-testing/.