How Optimizely Almost Got Me Fired (2014) (sumall.com)
163 points by yanowitz on Jan 9, 2016 | 78 comments



A/B testing is high on hype and promise but low on actual results if you follow it through to actual metrics. I've done various forms of A/B testing throughout most of my career and found them to be consistent with the OP's results.

A much better approach is to install significant instrumentation and actually talk to users about what's wrong with your sign up form.

That, or actually build a product that users want instead of chasing after pointless metrics. I mean, really, you think changing the color of text or a call-out is going to make up for huge deficiencies in your product or make people buy it? The entire premise seems illogical and just doesn't work. The only time I've seen a/b tests truly help was when it accidentally fixed some cross browser issue or moved a button within reach of a user.

Most of the A/B website optimization industry is an elaborate scam, pushed on people who don't know any better and are looking for a magic bullet.


How is this the top comment? It's recommending changing your website without knowing what the effect is, which is terrible advice.

I've personally run tests and seen tests achieve statistically significant (and very large) results that weren't merely fixing bugs or moving buttons within reach of a user. It could be things like adding a modal, changing copy (yes, literally just adding or changing words on the page), removing content, changing an image, or reordering funnel steps.

Think about that: by merely changing some sentences, your website could be 10% more successful next month. But you won't know whether it's working without a/b testing.

You don't just create tests randomly, but based on an understanding of the product, goals, signup funnel, etc., which may involve talking to users; ultimately you verify the change is for the better through a/b testing.

I would agree that there are a number of people running a/b tests that don't understand the math or know what they are doing, but that isn't an indictment of a/b testing.


> I've personally run tests and seen tests achieve statistically significant (and very large) results that weren't merely fixing bugs or moving buttons within reach of a user. It could be things like adding a modal, changing copy (yes, literally just adding or changing words on the page), removing content, changing an image, or reordering funnel steps.

You should read the article. It explained very well why a) statistical significance is meaningless when split testing websites (you are guaranteed to achieve significance if you run the test long enough; the math behind significance testing doesn't work the way Optimizely et al. assume) and b) most tools on the market are inherently unable to prove effect size by the way tests are designed.

Just do this short experiment yourself: Do an a/b test with identical versions of your site. Spoiler: With 100% certainty, you'll see both significant results and a sizable effect size difference.


    statistical significance is meaningless when split
    testing websites (you are guaranteed to achieve
    significance if you run the test long enough
Wait, what? Not if you use the right test to evaluate your change. This article isn't showing that A/B testing is bunk, it's showing that Optimizely is using an inappropriate test.

    Do an a/b test with identical versions of your site.
    Spoiler: With 100% certainty, you'll see both significant
    results and a sizable effect size difference.
The whole point of significance testing is that, done properly, it tells you how likely you would be to get these results by chance if the two sides of the test were actually identical. If you run your A vs A experiment and evaluate it correctly, you'll find results significant at the p < N level N% of the time. And the longer you run it, the lower your estimate of the effect size will be.
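
To make that concrete, here's a minimal simulation sketch (Python with numpy/scipy; the conversion rate and sample sizes are made up for illustration) of an A vs A setup evaluated with a single, pre-planned look:

    # Simulate many A/A "experiments" (identical conversion rates on both arms)
    # and count how often a pooled two-proportion z-test reports p < 0.05.
    # With exactly one look at the data, the false-positive rate stays near 5%.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments, n_per_arm, true_rate = 10_000, 5_000, 0.10

    false_positives = 0
    for _ in range(n_experiments):
        conv_a = rng.binomial(n_per_arm, true_rate)
        conv_b = rng.binomial(n_per_arm, true_rate)
        p_pool = (conv_a + conv_b) / (2 * n_per_arm)
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / n_per_arm))
        z = (conv_a - conv_b) / (n_per_arm * se)
        p_value = 2 * stats.norm.sf(abs(z))      # two-tailed
        false_positives += p_value < 0.05

    print(false_positives / n_experiments)       # ~0.05, as advertised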


I think we agree. I should have elaborated more.

> Wait, what? Not if you use the right test to evaluate your change.

Sure, but how often do people actually choose the right combination of test, testing tool and estimated effect size for their significance to have actual meaning?

The problem with experiments like your typical website split test is the potentially infinite population. With a big enough N, the probability of reaching significance approaches 1.

> The whole point of significance testing is that, done properly, it tells you how likely you would be to get these results by chance if the two sides of the test were actually identical.

Yes, I know. But the magic lies in that "done properly". Have you ever approached a tool in the field that would fail my proposed test? I haven't.


    With a big enough N, the probability of reaching
    significance approaches 1.
Well, not with an A vs A experiment. Then the probability of significance at the p < 0.05 level should stay 5%.

On the other hand, in an A vs B experiment where B is very slightly different from A you're right that with a large enough population you become very likely to reach a p < 0.05 significance level, just with an extremely tiny effect size. The experiment has told you something about the difference between A and B, and you now have high certainty, but it's just such a small difference that you probably don't care.
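
A quick sketch of that "significant but tiny" scenario (Python; the 10.0% vs 10.1% rates and the sample size are invented for illustration):

    # A converts at 10.0%, B at 10.1%: a real but tiny difference.
    # With a huge sample the z-test will usually reach p < 0.05, yet the
    # estimated effect size stays tiny.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 2_000_000                                # visitors per arm
    conv_a = rng.binomial(n, 0.100)
    conv_b = rng.binomial(n, 0.101)

    p_a, p_b = conv_a / n, conv_b / n
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    p_value = 2 * stats.norm.sf(abs(p_b - p_a) / se)

    print(f"p-value: {p_value:.3g}")             # usually well below 0.05
    print(f"estimated relative lift: {(p_b - p_a) / p_a:+.2%}")  # around +1%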

    Have you ever approached a tool in the field that
    would fail my proposed test?
Your proposed test being, run an A vs A experiment and a tool is faulty if it concludes there's a difference significant at the 0.05 level more than 5% of the time? All the tools I've used for A/B testing would do fine on this test. This includes a custom tool at a computational advertising startup I worked at, Google's internal experiment platform, and Google Analytics' Content Experiments.

(If I didn't trust a tool to pass this test I'd just use it to get the counts for both sides of the experiment and do the stats on my own.)


A/B (or champion/challenger as we called it in the 90s) testing is fine. The problem with these tools is that they do not alleviate the need to understand statistics.

A common place where people get stuck when A/B testing is in a local optimum. For example, instead of tweaking button colors someone should instead redesign the whole form. Talking to users can help with this, but users are notoriously bad at knowing what they really want.


The most common place people get stuck in a/b testing is proper design of experiments, i.e. absolutely no methodology (e.g. proper blocking) at all. It's ridiculous how a/b testing is promoted as being of paramount importance. A/B testing would be one of the last things I would do if I worked at a small ecommerce company, for example. What led to the current state is venture-backed companies and "growth hackers" and "conversion optimization experts" selling false promises to people who haven't read a marketing textbook in their life.


I'm very surprised this is the top comment too; where's patio11 and his ridiculously large collection of positive and useful split-testing examples when you need him?

Anyway, on to your comment:

> A much better approach is to install significant instrumentation and actually talk to users about what's wrong with your sign up form.

I don't know what this means. Can you elaborate? What exactly is "significant instrumentation"?

> That, or actually build a product that users want instead of chasing after pointless metrics.

Pointless metrics? I'm not sure what metrics you measure but visitor engagement and visitor to trial conversion rates are the lifeblood of a software sales and marketing funnel.

When somebody in marketing considers A/B testing as a means to increasing trial and paying customer conversions, they're operating on the basis that their product already provides value. The A/B tests are a means of establishing the best method for communicating that value. Better communication = more trials. More trials = a larger pool of potential customers to convert to paying customers. It's pretty basic.

> Most of the A/B website optimization industry is an elaborate scam

What?? Had you said a number of people in marketing could be in danger of blindly relying on their tools (e.g. Optimizely) and not applying common sense or checking the maths I would have wholeheartedly agreed. In my personal experience with the various websites I market, and given the hundreds of split testing examples I've read from people I do not feel are in on the "scam", I've got to strongly disagree with this point.


Disclaimer: I work for an A/B testing platform.

There is nothing inherently wrong with A/B testing, and it's basically the only way to verify that anything you do has a positive impact. What's created a hype bubble is articles and platforms saying that minor UI changes, colour of buttons etc. make a real difference to conversion rates. Good tests, based on well-thought-out hypotheses relevant to your actual customers, can and sometimes do help websites find measurable gains. It is not a silver bullet; no matter how careful you are there is always statistical uncertainty, and it won't always help.


>>A much better approach is to install significant instrumentation and actually talk to users about what's wrong with your sign up form.

In my experience this only works for obvious, glaring flaws, such as "I hate that the phone number field enforces a specific format" or "Why the hell are your password requirements so stringent?"

But a lot of UX is about what happens on the unconscious level. A user will never tell you, "Your bright red background feels too aggressive/hostile" because they simply don't recognize it at the conscious level to be able to verbalize it. The signup or checkout form might make them feel uneasy, but they won't know why.

I find that the best way to optimize websites is to do A/B testing first, and then talk to your users about the subset of the results (which might be the majority) that you find confusing or surprising. Basically, never rely on just one method, because then you will be hamstrung by the shortcomings of that method.


One underused but valuable use of A/B testing is to safely roll out new features--even if you aren't really interested in optimization. I've seen many cases where A/B testing has caught bugs that would have cost the company a significant amount of money and customer trust.

The overall utility of A/B tests is a complex topic and largely depends on the skill of the experimenter. Well-designed experiments are extraordinarily useful. A/B tests are one tool for product development, but they can't replace it.


I wouldn't call that A/B testing. It looks pretty similar under the hood but you'd normally call that a phased or gradual rollout and have different goals and metrics for it.


As you said, what cle described is a gradual rollout. You can do both at the same time though -- simply start your test on X% of the population and then roll it out to more people as you run your test. This is also a great way to run long-term holdouts.


A/B Testing is a way to conduct an experiment. Instrumentation and talking to users is another good way to gain insights, but it is not an experiment. They are two different (and often complementary) activities.

Many, many people have successfully used A/B Testing. I've personally used it to great effect several times. I certainly don't make decisions purely based on the statistical results, but I find it to be an extremely useful input to the decision making process. All models are flawed; some are useful.


Perhaps an A/B test on A/B testing vs. not is in order.


At small scales it's probably better to spend your time working on the core product than to make small tweaks, but once you're operating at a large scale a 0.1% conversion bump from a small change (like making the cents display in the price smaller) can mean millions of dollars in increased conversion.


There is a big difference between conducting an experiment and understanding the results.

Even talking to users would not help much if one continues to build a faster horse.


Each page could contain a small feedback area in which the actual product user can specify what feature he/she would like to see improved.

Changing the color of text is not much of an improvement, unless the text was unreadable or the same color as the background.

No amount of data-driven optimization can replace what the actual product user says.


The problem is that actual product users would probably say they want faster horses.


> A much better approach is to install significant instrumentation and actually talk to users about what's wrong with your sign up form.

But but but telemetry is evil!


Instrumentation can be activated with customer consent, if support is requested by the customer.


Telemetry in a specific product? Fine.

Telemetry in my OS* that has access to everything? No.

* That I cannot turn off.


Exactly. This is why Win10 is so bad. You simply cannot deactivate the telemetry at all (as an end user)!


A modern OS is nothing but a bunch of apps and the kernel. Most of the apps run in sandboxes.


What if the OS is the product?


An OS is a product of course, but to me it's on a completely different level to "just" an application product. It has access to everything, I want to be able to trust it and know exactly what it's doing. Telemetry I can't turn off ruins that. Having it is fine, just let me turn it off...

I like Windows and really like VS, I was often literally the only (non-VM) Windows user in a sea of OS X at several offices but Windows 10, the flip-flopping UI, the ads, and the telemetry pushed me to Linux and building my own desktops.


You described something with small benefits but limited downside, especially if your company has a team of a dozen or so people.


[I'm a PM @ Optimizely]

We were asked about this article before, on our community forums, and one of our statisticians, David, wrote a detailed reply to this article's concerns about one- vs two-tailed testing, which might be of interest [3rd from the top]:

https://community.optimizely.com/t5/Strategy-Culture/Let-s-t...

Additionally, since then, as other commenters have mentioned, we've completely overhauled how we do our A/B-testing calculations, which, theoretically and empirically, now have an accurate false-positive rate even when monitored continuously. Details:

https://blog.optimizely.com/2015/01/20/statistics-for-the-in...


Disclaimer: I'm the director of data science at VWO, an Optimizely competitor.

In my view, the issue is not one-tail vs two-tail tests, or sequential vs one-look tests at all. The issue is a failure to quantify uncertainty.

Optimizely (last time I looked), our old reports, and most other tools all give you improvement as a single number. Unfortunately that's BS. It's simply a lie to say "Variation is 18% better than Control" unless you had Facebook levels of traffic. An honest statement will quantify the uncertainty: "Variation is between -4.5% and +36.4% better than Control".

When phrased this way, it's hardly surprising that deploying this variation failed to achieve an 18% lift - 18% is just one possible value in a wide range of possible values.

The big problem with this is that customers (particularly agencies who are selling A/B test results to clients) hate it. If we were VC funded, we might even have someone pushing us to tell customers the lie they want rather than the truth they need.

Note that to provide uncertainty bounds like this, one needs to use a Bayesian method (only us, AB Tasty and Qubit do this, unless I forgot about someone).

(Frequentist methods can provide confidence intervals, but these are NOT the same thing. Unfortunately p-values and confidence intervals are completely unsuitable for reporting to non-statisticians; they are completely misinterpreted by almost 100% of laypeople. http://myweb.brooklyn.liu.edu/cortiz/PDF%20Files/Misinterpre... http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf )
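
To make the interval-style statement above concrete, here's a minimal Beta-Binomial sketch (Python; this illustrates the idea only, not VWO's actual SmartStats method, and the counts passed in at the bottom are invented):

    # Report relative improvement as a credible interval instead of a single
    # number, using independent Beta(1, 1) priors on each conversion rate.
    import numpy as np

    rng = np.random.default_rng(2)

    def lift_interval(visitors_c, conv_c, visitors_v, conv_v,
                      interval=0.95, draws=200_000):
        rate_c = rng.beta(1 + conv_c, 1 + visitors_c - conv_c, draws)
        rate_v = rng.beta(1 + conv_v, 1 + visitors_v - conv_v, draws)
        lift = rate_v / rate_c - 1        # relative improvement of V over C
        lo, hi = np.percentile(lift, [(1 - interval) / 2 * 100,
                                      (1 + interval) / 2 * 100])
        return lo, hi

    lo, hi = lift_interval(4000, 400, 4000, 472)
    print(f"Variation is between {lo:+.1%} and {hi:+.1%} better than Control")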


I pushed for this really hard at one company. I was just a dev, not one of our Harvard MBAs, but I know math.

I got lots of pushback because it's harder to remember two numbers instead of one, and it makes reporting confusing because executives are also used to just the one. Throw in the business guys who did statistical significance by (literal quote) "experience and gut feel" instead of reliable, rigorous quantification of uncertainty, i.e. statistics, and I really couldn't do anything.


Could you point to the math/protocol that goes from a sample set to a range "-4 to +34", and of course how one goes from a range to giving a single number?

I feel like the discussion on this thread is missing the underlying "source code"


Here are slides on the topic and the full math paper. But we don't go from a credible interval to a single number - we just report credible intervals. We find that to be the only honest choice.

https://www.chrisstucchio.com/pubs/slides/gilt_bayesian_ab_2...

https://www.chrisstucchio.com/blog/2013/bayesian_analysis_co...

https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technic...


thank you - bedside reading :-)


I don't know about the one that goes from a sample set, but the range provided is likely just the 90 or 95% confidence interval, which means you can get the single number by taking the average of the two numbers. Don't do it though, it will tell you a lot less than the two numbers; a range that goes from -50 to +60 means that you have a lot of uncertainty and should probably not deploy, whereas +2 to +6 means you have little uncertainty and should deploy since it is almost certainly a win, if a small one.


It's not a confidence interval, it's a credible interval. It's derived from a Bayesian posterior.

Confidence intervals are confusing and should rarely be used to communicate statistics to laypeople. (See the links in my above post.) Most of the time, you can tell people the definition of a confidence interval, and they will misinterpret this and think it's a credible interval.

This is why we went Bayesian at VWO - we felt it would be easier to change the statistics to match people's thinking than to change people's thinking about statistics.


One thing I noticed that I haven't seen commented on yet:

The solution mentioned of running a two tailed test would not have solved the problem of a false result the author demonstrated through conducting an A/A test.

According to the image in the article: http://blog.sumall.com/wp-content/uploads/2014/06/optimizely...

The A/A test had:

    A1: Population 3920, Conversions 721
    A2: Population 3999, Conversions 623

    Z-Score: 3.3
    2-tailed test significance: 99.92%
Looks like one-tailed vs. two-tailed testing doesn't make a huge difference in this case.
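
For reference, the arithmetic behind those numbers (a pooled two-proportion z-test in Python; this reproduces the quoted z-score but isn't necessarily the exact formula Optimizely used):

    import math

    n1, c1 = 3920, 721      # A1 population, conversions
    n2, c2 = 3999, 623      # A2 population, conversions

    p1, p2 = c1 / n1, c2 / n2
    p_pool = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se                               # comes out around 3.3
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))  # roughly 0.0008
    print(z, p_two_tailed)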

So, maybe a larger sample size would have seen a reversion to the mean, but given the size and high significance that would be unlikely (interesting exercise to try different assumptions to calculate how unlikely, with the most overly generous obviously just being the stated significance).

Yes, the test was only conducted over one day, but if it was the exact same thing being served for both, that shouldn't matter.

If there was a reversion to the mean due to an early spike, we would expect to see the % difference between the two cells narrow as the test kept running. You can see in the chart that the % difference (relative gap between the lines) stays about the same after 8pm on the 9th.

So if it's not the one-tailed test at fault, and it's not the short duration of the test at fault, what is?

Don't know.

I have seen in the past that setup problems are incredibly easy to make with a/b testing tools when implementing the tool on your site. In other tools I've seen things like automated traffic from Akamai only going to the default control, or subsets of traffic, such as returning visitors, excluded from some cells but not others.

Based on those results, I'd be suspicious of something in the tool setup being amiss.


> This usually happens because someone runs a one-tailed test that ends up being overpowered.

It always pains me a little when people doing research describe statistical power as a type of curse. Overpowered? Should we reduce it? The risk isn't having too much power; the risk is that someone will incorrectly interpret their Null Hypothesis Significance Test (NHST). They need to shift their focus to measuring something (and quantifying the uncertainty of their measurements), rather than thinking "how likely was this result given a null hypothesis", whether that hypothesis is:

something is not greater than 0 (one-tail), or

something is not 0 (two-tail).

> You’ll often see statistical power conveyed as P90 or 90%. In other words, if there’s a 90% chance A is better than B, there’s a 10% chance B is better than A and you’ll actually get worse results.

This isn't necessarily true. A could be the same as B. Also, these tests are being done from the frequentist perspective, so saying "there's X chance B is better than A" is inappropriate, unless you're talking about the conclusions of your significance test (e.g. 90% chance you correctly detect a difference between them--a difference you assume is fixed to some true underlying value). Overall, being aware that a one-tail test is taking the position that nothing can happen in the other direction is useful, but a good next step is understanding what NHST can and cannot say.

This isn't even a frequentist vs. Bayesian problem, since you could create situations where a person felt a study was overpowered in either framework.


Don't just run tests longer - run tests for a pre-defined amount of time instead of "until you see a result you like".


This is necessary with textbook statistical methods, which are explicitly based on the assumption of looking at the data exactly once. However, modern tools (Optimizely, VWO, Qubit) do not require this. [1]

If you use one of these tools it's completely safe to run tests until the test says win.

There are still caveats to account for the real world - e.g., only stop the test after an integer number of weeks - but statistically this is a solved problem.

[1] The blog post under discussion was written before Optimizely's StatsEngine and VWO's SmartStats.


Your comment hits the nail on the head here.

Standard statistical tests used in a/b testing are based on one check. If someone is checking repeatedly on a test until they get a 'significant' result, their chance of getting a false positive is many times the stated significance level.

Best practice: set a pre-defined end, plus one or two defined early check-in points where you only make an early call if the result is overwhelmingly significant or if the business has fallen off a cliff.
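
A small simulation of how badly repeated checking inflates the false-positive rate (Python sketch; the batch sizes and number of peeks are arbitrary):

    # An A/A test that is "peeked at" after every batch of visitors and
    # stopped the first time |z| > 1.96. The nominal 5% false-positive rate
    # gets inflated several-fold.
    import numpy as np

    rng = np.random.default_rng(3)
    n_experiments, batches, batch_size, true_rate = 2_000, 50, 200, 0.10

    stopped_early = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n = 0
        for _ in range(batches):
            conv_a += rng.binomial(batch_size, true_rate)
            conv_b += rng.binomial(batch_size, true_rate)
            n += batch_size
            p_pool = (conv_a + conv_b) / (2 * n)
            se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
            if se > 0 and abs(conv_a / n - conv_b / n) / se > 1.96:
                stopped_early += 1
                break

    print(stopped_early / n_experiments)   # typically several times 0.05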


That would be best practice if you insist on using null hypothesis significance testing and only wanted to use classical frequentist statistics. We really can do much better these days with multi-armed bandits, and by focusing on effect sizes and credible intervals rather than a yes/no answer to a hypothesis.
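
For readers who haven't seen one, here's what the bandit idea looks like in its simplest form (a Thompson-sampling sketch in Python with invented conversion rates; it illustrates the mechanism only, and the replies below explain why it isn't a drop-in replacement for an A/B test):

    import numpy as np

    rng = np.random.default_rng(4)
    true_rates = [0.10, 0.12]            # unknown to the algorithm
    successes = np.zeros(2)
    failures = np.zeros(2)

    for _ in range(10_000):              # one iteration per visitor
        # Draw a plausible conversion rate for each arm from its Beta
        # posterior and show the visitor the arm with the highest draw.
        sampled = rng.beta(1 + successes, 1 + failures)
        arm = int(np.argmax(sampled))
        converted = rng.random() < true_rates[arm]
        successes[arm] += converted
        failures[arm] += 1 - converted

    print(successes + failures)          # traffic concentrates on the better arm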


Sounds like I need to update my stats knowledge! Do you happen to know of a good place to start learning about today's state of the art?


Exactly. And for whole business cycles (for us, that's weeks) to accommodate changes in behaviour and traffic throughout the cycle.

We try to run as many experiments as our traffic can handle, but we always estimate the sample size upfront when planning the test.


I agree with the main point of the article, but I'm somewhat disturbed by the statistical errors and misconceptions.

> Few websites actually get enough traffic for their audiences to even out into a nice pretty bell curve. If you get less than a million visitors a month your audience won’t be identically distributed and, even then, it can be unlikely.

What is the author trying to say here? Has he thought hard about what it means for "an audience" to be identically distributed?

> Likewise, the things that matter to you on your website, like order values, are not normally distributed

Why do they need to be?

> Statistical power is simply the likelihood that the difference you’ve detected during your experiment actually reflects a difference in the real world.

Simply googling the term would reveal this is incorrect.


> In most organizations, if someone wants to make a change to the website, they’ll want data to support that change.

So true and sad. In all the so-called data-driven groups I have worked for, the tyranny of data makes metrics and numbers the justification for or counter to anything, however they have been put together.

> The sad truth is that most people aren’t being rigorous about their A/B testing and, in fact, one could argue that they’re not A/B testing at all, they’re just confirming their own hypotheses.

The sad truth is that most people aren’t being rigorous about anything.


Here's a great article about how Optimizely gets it wrong: http://dataorigami.net/blogs/napkin-folding/17543303-the-bin...

There are _many_ offenders. I've yet to see a commercial tool that gets it right.

Tragically, the revamp by Optimizely neglects the straightforward Bayesian solution and uses a more fragile and complex sequential technique.


The article you mention names RichRelevance, but there are others who implement Thompson Sampling or other forms of Bayesian Bandits, such as SigOpt [0] and Dynamic Yield [1]. Various adtech companies also use it underneath the hood.

[0] https://sigopt.com/

[1] https://www.dynamicyield.com/


Thompson Sampling is not a replacement for A/B tests. Unfortunately, the real world violates the assumptions of Thompson sampling virtually all the time.

https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...

Bandit algorithms do have some important use cases (optimizing yield from a short lived advertisement, e.g. "Valentines Day Sale"), but they are not suitable for use as an A/B test replacement.

Also, I'd steer away from dynamic yield - I've found their descriptions of their statistics to take dangerous (i.e. totally wrong) shortcuts. For example, counting sessions instead of visitors as a way to avoid the delayed reaction problem and increase sample size (as well as completely breaking the IID assumption).


I love your posts, and completely agree that Bayesian Bandits are not replacements for A/B tests.

To be fair though, real-world issues like nonstationarity and delayed feedback are also concerns for A/B tests (which you also bring up in your great post), and you can tweak the Bayesian bandits to handle these cases decently.

How does counting sessions instead of visitors avoid delayed feedback? I read your post [0] but don't remember anything about that. Is it just that they say that after a session is completed (which is somewhat nebulous to measure in many cases), then you have all the data you need from the visit?

[0] https://www.chrisstucchio.com/blog/2015/no_free_samples.html


Absolutely agree you can tweak Thompson sampling to handle nonstationarity, periodicity and delayed feedback. I even published the math for one variation of nonstationarity: https://www.chrisstucchio.com/blog/2013/time_varying_convers...

(I've also dealt with delayed reactions, but I've never published it, and probably won't publish until I launch it.)

Dynamic yield has the delayed feedback problem because users might see a variation in session 1 but convert in session 2 (days later). They "solve" this by doing session level tracking instead of visitor level tracking - the delayed feedback is now only 20 minutes (same session) instead of days.

The problem is that session A and session B are now correlated since they are the same visitor. IID is now broken.


Makes sense, thanks for the explanation!


Optimizely rolled out a huge update to the way it handles statistics called Stats Engine last year. That update resolves the issues discussed in this article. You can read more about Stats Engine here: https://www.optimizely.com/statistics/


Their Stats Engine does resolve the issues, but I have to laugh at their marketing materials calling it 21st century statistics, because even the new approach is comically behind the times. Looks like the poor schmucks accidentally outsourced the project to some really old-school one trick pony "let's throw some maximum likelihood theory at this" statisticians. I would've hoped even die-hard frequentists would see the value of a multi-armed bandit and the irrelevance of declaring a winner rather than quantifying the effect of each variant and the uncertainty around it.

Cf. the technical paper: http://pages.optimizely.com/rs/optimizely/images/stats_engin...


[flagged]


I can see why my comment could come across as arrogant, and in hindsight I shouldn't have written it with that tone. However, I wrote it immediately after reading https://www.optimizely.com/statistics/ which makes a number of incredibly boastful claims about how Optimizely's approach is better than anything you've ever seen, and I couldn't help but point out the hubris.

If you ever come up with a statistical method for finding the global maximum among an infinite number of unspecified alternatives, let me know.


> some testing vendors use one-tailed tests (Visual Website Optimizer, Optimizely)

> Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off.

People pay brisk money for this?


After reading the blog post and reading through the comments, it looks like people are drawing the wrong conclusion from this. The problem is not that AB-testing is overrated, doesn't work, is bullshit etc. but that Optimizely used to do it wrong.


Exactly. A/B testing is very useful in some situations (i.e. where traffic to the page is great enough, product-market fit is achieved, and the product is well-developed enough that any changes to it after the campaign is run will be minimal). But many companies, usually startups, scale too early and use the products poorly. Companies that sell these products have very little incentive to discourage use of their products by ill-informed users, so the misuse continues.


Optimizely did (and does) it wrong because if everybody had to follow proper procedures they wouldn't have a viable business model.


> Statistical power is simply the likelihood that the difference you’ve detected during your experiment actually reflects a difference in the real world.

This seems incorrect to me. Isn't statistical power the likelihood that the null hypothesis would generate an outcome at least as extreme as what you observed?

I'm guessing the issue has a lot more to do with peeking at the outcome and not correcting for it (and similarly running many tests)

http://www.stat.columbia.edu/~gelman/research/unpublished/p_...


Yes, this is incorrect.

Also the next sentence is wrong:

>You’ll often see statistical power conveyed as P90 or 90%. In other words, if there’s a 90% chance A is better than B, there’s a 10% chance B is better than A and you’ll actually get worse results.

Having a test w/ 90% power means that if A is truly better than B (for one sided) or A is truly different than B (two sided), then you'll detect it 90% of the time you run the test (on independent data).
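
A simulation sketch of that operational definition (Python; the rates and per-arm sample size are chosen so the example lands near 90% power, not taken from the article):

    # If B truly converts better than A, how often does a two-tailed z-test
    # at alpha = 0.05 detect it for this sample size?
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    rate_a, rate_b = 0.100, 0.110        # a real 10% relative lift
    n = 20_000                           # visitors per arm

    runs, detections = 5_000, 0
    for _ in range(runs):
        conv_a = rng.binomial(n, rate_a)
        conv_b = rng.binomial(n, rate_b)
        p_a, p_b = conv_a / n, conv_b / n
        p_pool = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
        detections += 2 * stats.norm.sf(abs(p_b - p_a) / se) < 0.05

    print(detections / runs)             # roughly 0.9 for this setup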


This post is almost entirely inaccurate re: actual statistics. Power, as you say, is the likelihood of a given experiment rejecting the null hypothesis given some a priori sample size and effect size. And one- vs. two-tailed tests do not change effect sizes, or even your estimate of the variability surrounding an effect, but a p-value related to the effect (should you choose to calculate one).

You would think, given their team of "analysts" and "statisticians", that they might have known these basic pieces of statistics.


In my experience, if you are seeing huge effects like 60% difference in conversion, etc, you probably did something wrong (i.e. too small sample, didn't wait long enough, etc) - I've never seen something this large by simply moving things around, changing colors, changing messages, etc.

Also, in general, the more drastic the changes are, the more of an effect you could have (up to some percentage). E.g. a small change would be changing a message or color; don't expect conversion to change by much. A large change would be changing from a flash site to an html one with a full redesign that loads twice as fast...


One of the better explanations of a/b testing I've read is that there are three types of users on your site: those who will convert no matter what, those who will not convert no matter what, and those who might convert. Testing is for that final group. Thinking in those terms, you have to consider if such a large percentage fits into that "might" group. I've never personally witnessed a 60% improvement at any volume I'd bother testing on, and I've been doing tests for many years.

I also think a lot of the "change the button" tests or "increased email sign up by x times" results are dubious. Unless the conversion is someone giving you money, you still have steps to go before you see real business improvement. There are lots of ways I can manipulate my traffic to make certain funnel steps look better and more optimized, but the only thing I really care about is what's coming out of the funnel. So all those extra email signups or button pushes mean nothing if the group who perform those actions on your b version still aren't interested in an actual purchase.


I'm increasingly convinced all statistical analysis performed by non-PhDs is no better than a coin flip. Maybe even worse.

My favorite example is still the quite popular Page Weight Matters post. I wonder how close they were to abandoning a 90% reduction in size. I wonder how many improvements the world at large has thrown away due to faulty analysis.

http://blog.chriszacharias.com/page-weight-matters


This is a huge problem with paid marketing as well. Many folks will look at a conversion rate completely ignorant of sample size and allocate thousands of dollars in budget to something which they have no idea performs better or worse.

The real problem (as you allude to in the article) is that the demand for accurate tools is not really there. Vendors don't build in accurate stats because only a tiny portion of their client base understand/demands them.


There is much more to running experiments properly than it seems. While I'm not an expert on the statistics side, there are a few things I've learned over the years which come to mind...

1) Run the experiment in whole business cycles (for us, 1 week = one cycle), based on a sample size you've calculated upfront (I use http://www.evanmiller.org/ab-testing/sample-size.html; a rough sketch of that calculation appears after this list). Accept that some changes are just not testable in any sensible amount of time (I wonder what effect changing a font will have on e-commerce conversion rate).

2) Use more than one set of metrics for analysis to discover unexpected effects. We use the Optimizely results screen for general steer, but do final analysis in either Google Analytics or our own databases. Sometimes tests can positively affect the primary metric but negatively affect another.

3) Get qualitative feedback either before or during the test. We use a combination of user testing (remote or moderated) and session recording (we use Hotjar, and send tags so we can view sessions in that experiment).
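
The upfront sample-size calculation mentioned in 1) looks roughly like this (Python sketch using the standard two-proportion approximation; the linked evanmiller.org calculator is more careful, and the baseline rate and minimum lift below are just examples):

    from scipy import stats

    def sample_size_per_arm(base_rate, min_relative_lift,
                            alpha=0.05, power=0.80):
        p1 = base_rate
        p2 = base_rate * (1 + min_relative_lift)   # smallest lift worth detecting
        z_alpha = stats.norm.ppf(1 - alpha / 2)    # two-tailed
        z_beta = stats.norm.ppf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

    # e.g. 3% baseline conversion, want to detect a 10% relative lift:
    print(round(sample_size_per_arm(0.03, 0.10)))  # ~53,000 visitors per arm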


Interesting read, might be worth adding (2014) to the title though.


Why the downvote? The year is relevant, as Optimizely has made various adjustments in the past 1-2 years to the way they handle statistics; one other user even asserts that these address the very issues this article mentions.


One might wonder if a company which doesn't understand statistics, despite statistics being their sole reason for existence, is actually capable of fixing these problems. If they don't understand what they don't understand, how can they address it effectively?


Almost didn't click on this because the title seems (and is) clickbait-y, but it was actually a very useful read for me.

As a founder, I'm constantly hearing about A/B testing and how great these tools are. I'm not enough of a statistician to know whether everything in this article is true/valid (and would welcome a rebuttal), but the part about regression to the mean really hits home. Encouraging users to cut off testing too early means that you make them feel good ("Look, we had this huge difference!"), when in reality the difference is smaller/negligible.

I'll still do some A/B testing, but given our engineering/time constraints—and my inability to accurately vet the claims/conclusions of the testing software—I won't spend too much time on this.


Deciding not to AB test because of this article would be a huge mistake.

Learning from mistakes made by others, and avoiding them is what I would suggest as the take away.

I work for a successful ($50M+ revenue) bootstrapped startup. And one of the reasons for the success is that AB testing became part of the company's culture, as soon as there was enough data coming in for the tests to become useful.

AB testing is so important that we have built our own in-house framework that automatically gives results for our company specific KPIs.


AB testing is too simplistic. Even on my marketing team we have designed more complex metrics to look at a factor's impact on multiple outcomes. The testing is still a straightforward chi-square, but with a bit more depth.
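
For anyone unfamiliar, the chi-square test mentioned here is just a test of independence on a converted/not-converted table (Python sketch; the counts are invented):

    from scipy.stats import chi2_contingency

    observed = [[480, 9520],   # variation A: converted, not converted
                [540, 9460]]   # variation B: converted, not converted
    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(chi2, p_value)       # the p-value says whether conversion plausibly differs by arm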


Wow, did not see that coming. This article actually confirms the cynical hypothesis I entertain - that most of the "data-driven" marketing and analytics is basically marketers bullshitting each other, their bosses, their customers and themselves, because nobody knows much statistics and everyone wants to believe that if they're spending money and doing something, it must be bringing results.

Some quotes from the article supporting the cynical worldview:

--

"Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off. But most tests should run longer and in many cases it’s likely that the results would be less impressive if they did. Again, this is a great example of the default settings in these platforms being used to increase excitement and keep the users coming back for more."

This basically stops short of implying that Optimizely is doing this totally on purpose.

--

"In most organizations, if someone wants to make a change to the website, they’ll want data to support that change. Instead of going into their experiments being open to the unexpected, open to being wrong, open to being surprised, they’re actively rooting for one of the variations. Illusory results don’t matter as long as they have fodder for the next meeting with their boss. And since most organizations aren’t tracking the results of their winning A/B tests against the bottom line, no one notices."

In other words, everybody is bullshitting everybody, but it doesn't matter as long as everyone plays along and money keeps flowing.

--

"Over the years, I’ve spoken to a lot of marketers about A/B testing and conversion optimization, and, if one thing has become clear, it’s how unconcerned with statistics most marketers are. Remarkably few marketers understand statistics, sample size, or what it takes to run a valid A/B test."

"Companies that provide conversion testing know this. Many of those vendors are more than happy to provide an interface with a simple mechanic that tells the user if a test has been won or lost, and some numeric value indicating by how much. These aren’t unbiased experiments; they’re a way of providing a fast report with great looking results that are ideal for a PowerPoint presentation. Most conversion testing is a marketing toy, essentially." (emphasis mine)

Thank you for admitting it publicly.

--

Like whales, whose cancers grow so big that the tumors catch their own cancers and die[0], it seems that the marketing industry, a well-known paragon of honesty and teacher of truth, is actually being held down by its own utility makers applying their honourable strategies within their own industry.

I know it's not a very appropriate thing to do, but I really want to laugh out loud at this. Karma is a bitch. :).

[0] - http://www.nature.com/news/2007/070730/full/news070730-3.htm...


>most of the "data-driven" marketing and analytics is basically marketers bullshitting each other

Most of the ones you hear about. You know, the ones who were SEOs or content writers or programmers before waking up one day and deciding to be marketers.

Marketing degrees have mandatory statistics courses. The good marketing programs take them very seriously. However, a lot of schools focus on the communication side of marketing, which leads a very significant chunk of the marketing analyst positions to be filled by people with economics and accounting degrees.

The Internet is actually quite new and has just begun maturing. A lot of people working in digital marketing do not really understand what marketing is. When you meet somebody on the street and ask him what marketing is, he will describe advertising, or more precisely marketing communications. So when the internet became this giant medium for doing all sorts of commerce, big companies and schools couldn't fill the skill gap fast enough, so the gap was filled by self-learners coming from all sorts of backgrounds. When they wanted to build up their "marketing skills", naturally they defaulted to learning about marketing communication instead of the economics-oriented part of marketing. This is why digital marketers obsess over their site and drool over a/b tests and such. The web site is a communication medium. They've put themselves in this box equating marketing and communications.

Too bad for them, because graduates nowadays are digital native too, so they have no problem navigating the internet and learning html/css.


(as others have mentioned) Optimizely's newer engine uses ideas like Wald's sequential analysis. Here is my article on the topic: http://www.win-vector.com/blog/2015/12/walds-sequential-anal... .


Tangent: When I first heard of A/B testing, I thought of combining it with genetic algorithms to evolve the entire site. Just run it till the money rolls in.

Unfortunately, if it did work, it would probably be through something misleading or scammy. Therefore, you need some kind of automatic legality checking... which would be hard.


Is anybody out there taking an approach of gradually driving more traffic to whichever option is winning out and never running 100% with anything specific?



