37signals: A/B testing part 3: Finalé (37signals.com)
181 points by wlll on Aug 23, 2011 | 58 comments



I love it when smart companies with great brands double business results with a few hours of A/B testing. The more this happens, the more you have to think: wait a second, if ten years of obsessive development and marketing and branding gets us X sales, and 10 hours of A/B testing added X sales to that... what the heck was I doing last week again? Answer: something that did not double sales.

I bang this drum frequently, but let me bang it one more time: you should be A/B testing.


> you should be A/B testing.

37signals should be A/B testing, because they have the volume to do it.

Many of us don't, or aren't sure. Anyone want to comment on what sort of numbers you should be looking at, roughly, in order to get statistically significant results?


It's sensitive to your volume, conversion rates, and the magnitude of the difference in conversion rates between A and B. In general, volume is good, higher conversion rates (or easier intermediate conversions, like to email submit rather than to purchase) are good, and large differences in conversion rates are good. It would take a lot of time to statistically significantly discriminate between .001% and .0012% on volumes of 5 visitors a day.

People like easy answers, though, so my easy answer is "A/B testing is worthwhile if you have 100 visitors a day. Otherwise, spend the bulk of your time finding 100 visitors a day first."
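
To put a rough number on that rule of thumb, here is a quick simulation sketch (my own, with made-up "true" conversion rates of 3% vs 4%, and assuming scipy/numpy are handy) of how often a month at 100 visitors a day, split evenly between A and B, would actually reach significance on a real improvement:

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)
    visitors = 30 * 50          # one month, 50 visitors per variant per day
    p_a, p_b = 0.03, 0.04       # hypothetical "true" conversion rates

    trials, wins = 2000, 0
    for _ in range(trials):
        conv_a = rng.binomial(visitors, p_a)
        conv_b = rng.binomial(visitors, p_b)
        table = [[conv_a, visitors - conv_a],
                 [conv_b, visitors - conv_b]]
        _, p_value, _, _ = chi2_contingency(table)
        wins += p_value < 0.05

    print(wins / trials)        # roughly 0.3: a real one-point lift is caught only about a third of the time

Which is roughly why the honest answer at low volume is "go find more traffic first" rather than a neat sample-size table.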


Is 100 a day enough? Wouldn't it take forever to get statistical significance? (unless the results are really dramatic)


It depends on how much variance you normally have, and the size of the change you hope to measure.

http://en.wikipedia.org/wiki/Statistical_power


Perhaps. But say you run the test for 5 days (500 users) and 80% of conversions happen on variant A; that does give you something to work with and experiment on further.


One simple and clever way to test for significant results is to run an A/A/B test. When your two A buckets converge, you've probably done enough testing.


or just calculate statistical significance?
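
For instance, with made-up counts and statsmodels' proportions_ztest (a minimal sketch, not tied to any tool mentioned in the article):

    from statsmodels.stats.proportion import proportions_ztest

    # hypothetical counts: conversions and visitors for variants A and B
    conversions = [40, 55]
    visitors = [1000, 1000]

    z_stat, p_value = proportions_ztest(conversions, visitors)
    print(p_value)   # about 0.11: a 4.0% vs 5.5% split on 1,000 visitors each is not yet significant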


> Anyone want to comment on what sort of numbers you should be looking at, roughly, in order to get statistically significant results?

This is a very important point. We've found you have to go with your gut up to a point.

We've just started doing lots of A/B testing for Buffer (http://bufferapp.com). We are now getting 300-500 signups per day, and we currently have around a 10% conversion rate, so you can work out how much traffic that is. We've tried A/B testing in the past without success, and I think now is the right time for us, with the right volume, for A/B testing to have a significant impact.

One important thing: we've started A/B testing aggressively (a new test almost daily; we're now onto our 4th test). We've tested 3 different variations and had no significant results. I expect this to take a long time, and posts like this make it seem easy to get significant changes in conversion. A/B testing is hard.


I have found you need about 200 conversions, and you need to let the experiments run for about 2 months to get rid of the novelty effect (regardless of whether a winner has been found).

A conversion must also be a real sale. By that I mean people make the mistake of counting visits or free signups as conversions, and then the landing page ends up optimized for clicks and free accounts, not revenue. A perfect example of a failure here is Adwords.com. The AdWords landing page is designed to get people to sign up for an account rather than to generate revenue. 99% of AdWords signups lead to dead accounts, because Google does a bad job of setting customer expectations on the marketing site.


Check out Ron Kohavi's publications for a way to calculate this: http://robotics.stanford.edu/~ronnyk/ronnyk-bib.html (He has a bunch of publications that are just the same thing repackaged, so it probably doesn't matter which one you read.) For example, if you have a conversion rate of 5% and want to detect an increase of 5% you need about 120'000 hits.
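
For what it's worth, the standard two-proportion sample-size formula lands in the same ballpark if the 5% increase is read as relative (5% → 5.25%) with 80% power and a two-sided 5% test. A quick back-of-the-envelope check (my arithmetic, not Kohavi's exact method):

    from scipy.stats import norm

    p1, p2 = 0.05, 0.05 * 1.05                 # 5% baseline, 5% relative lift
    z = norm.ppf(0.975) + norm.ppf(0.80)       # alpha = 0.05 two-sided, 80% power
    n = z**2 * (p1*(1 - p1) + p2*(1 - p2)) / (p1 - p2)**2
    print(round(n))                            # ~122,000 hits per variant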


I'm interested in this too. It sounds like it would be hard to do A/B testing pre-launch with our tiny volume, and unless we get a significant amount of volume immediately post-launch we won't have much data. However, I do plan to get a framework in place for A/B testing immediately after launch as we go into a refinement phase.


I don't think this particular case is just a "few hours of A/B testing". They were testing pretty drastic design changes, and it takes a lot more than a few hours to come up with design concepts and copy and then do the actual implementation. These were not typical "test a different landing page title or call-to-action button color" tests. These were brand new designs. So for established companies, this is absolutely the right approach: your product is already ready, now just increase the sales. But for many early-stage companies, when product development itself is constantly evolving and demanding, spending so much time on different design iterations is not always possible. I agree their results are phenomenal, but I'm not sure how many companies understand this before they try.


I would love to see the data sets that support the lifts - those little "up 3.49 / down 3.38" can get notoriously tricky in terms of statistical relevancy using GWO or otherwise.

Also to note, I often find that radical new variant designs for well established clients have rapidly diminishing returns.

Sometimes the sheer "newness" can skew the data as folks sitting on the fence (returning visitors) push the conversion data up.

All that said, I'm a fan of the new look and feel, and it's fun to watch 37signals' influence turn the market; expect to see lots of similar-looking stuff in the future.


There are a couple of important biases where just changing something produces a positive result:

* http://en.wikipedia.org/wiki/Novelty_effect (everyone loves seeing something new!)
* http://en.wikipedia.org/wiki/Hawthorne_effect (being a subject of an experiment is motivational)

There are a couple of ways to deal with these biases. One is to try a large number of different designs, so the bias is spread across a large number of options. Thus the comparative winner is more likely to itself be intrinsically better, at least amongst the new designs. Note that this still penalizes the control.

The other way is to wait longer until the new design is no longer 'new'. In practice, that is impractical! Really, all you can do is pick the winner and keep monitoring it after the fact to see if there is a regression in the result.

I know others have said this, but you can harness these biases. Simply change the design frequently for no reason other than to stay 'fresh'. It keeps people interested and it shows you're alive and kicking.


The Hawthorne effect probably wouldn't apply since it requires the subject to know he's the subject of an experiment. The novelty of the new designs might be a factor, but since we're measuring new sign-ups, it seems kind of unlikely that a lot of people were familiar enough with the design of the page to realize it was new.


It is highly likely that people know they are looking at a new design. People don't just adopt a CRM on a whim. For my site (foreign language learning), the customer visits an average of 6 times before buying. New designs temporarily drop the visit count, a clear sign the novelty effect is strong.


> Also to note, I often find that radical new variant designs for well established clients have rapidly diminishing returns.

I have seen this a number of times. I change the design and it is a statistically significant winner; then I continue to let the experiment run, and the conversion rate falls back to the original level.

I have also found that rotating the design is statistically better than never changing it.


There is also the problem of repeated hypothesis testing: every test has some probability of a false positive. If you test often enough, you will eventually make an error.
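
A quick illustration (my numbers, assuming independent tests at the usual 5% level) of how fast that compounds:

    alpha = 0.05
    for k in (1, 5, 10, 20):
        # chance of at least one false positive across k independent tests
        print(k, round(1 - (1 - alpha) ** k, 2))
    # 1 -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64

Which is why people either correct for multiple comparisons or accept that some "winners" will be flukes.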


There's an xkcd strip for that: http://xkcd.com/882/


That strip is less about errors and more about the science news cycle. Another good reference for that: http://www.phdcomics.com/comics.php?f=1174


The article actually notes that those up/down 2% results are statistically insignificant ("a specific person didn’t quite matter among the set of people we tested"). But the lift from no-person to big-smiling-face was very significant: more than 100%.


Heck yes. It is so easy to do A/B testing incorrectly. Without knowing at least the number of hits the results are almost meaningless.


"Also to note, I often find that radical new variant designs for well established clients have rapidly diminishing returns."

I totally agree. However, this is a marketing site that "established clients" probably won't be visiting often. Once they've signed up, the site has done its job, right?



"Finalé" is a funny aberration. It is not a French word but rather the incorrect pronunciation spelled out in French characters.


Finale originates from an Italian word meaning "final"; not all Latinate words in English derive from French.

Italian: http://www.wordreference.com/enit/final

English etymology: http://www.etymonline.com/index.php?search=finale&search...

Also, there is no such thing as "French characters". The grave and acute accents, for example, are used in multiple languages written in the Latin alphabet.


The accent on the final (hehe) e would be wrong in Italian. The Italian word is pronounced something like:

fee-NAH-lay, not fee-nah-LAY.

IIRC, it is much more common in French to place an accent on the final syllable than in Italian, where the second to last usually gets the emphasis.

In any case the point stands that the word looks wrong.


The accent isn't supposed to be on the last syllable in English. It isn't pronounced "fee-nah-LAY"; it's "fin-NALL-ee." In my experience, people sometimes correct the last sound to a long A rather than a long E, but I've never heard the accent on the last syllable. I think either you know some odd people or it's some regional variant.

Source for British English: http://dictionary.cambridge.org/dictionary/british/finale?q=...

Source for American English: http://www.merriam-webster.com/dictionary/finale


> I think either you know some odd people or it's some regional variant.

I think you've read the thread a bit quickly and are a tad confused. I was talking about the Italian pronunciation. I know how it's pronounced in English as well, as that is my native language.


Oh, my apologies. I read "accent" as referring to the pronunciation (since you followed it with a contrast of "fee-NAH-lay" and "fee-nah-LAY"), but you meant the diacritic. Yeah, you're right, that's not the correct spelling of the word.


I'm not arguing that it isn't misspelled, just that it isn't of French origin. It wouldn't be the first time an English speaker misspelled a word of foreign origin. Naive comes to mind as a classic example. Maybe nee as well.


I don't think the GP claimed it is a French word. Actually he explicitly said that it's not, and I don't see him claiming anything on its origins.


My point was mainly that a word spelled with, say, a grave or acute accent doesn't necessarily originate from French.

And the origin was brought up to demonstrate that this word is both misspelled and has no direct relation to French at all, since it originates from Italian.


But "Finale" also exists in french.


"The whole A/B testing concept probably came from from “strategy analysts” or “MBAsses”. Anyway, now I’m a believer in A/B testing."

Is he also a new believer in not making irrational judgements of people based on their job title or education?


Touché! I was doing some high-and-mighty, too-cool-for-school Designer role-playing.


It's fine, but I didn't really read it that way. It probably doesn't help that this place leans to a strong "business guy = moronic jackass" mentality. I'm sure that led to some assumptions on my part.

edit:

Let me add that, other than the comment I commented on, I really enjoyed the series. There were a lot of creative ideas to test that extended well beyond the usual published A/B test results, which cover fairly mundane changes like the color of a button.


That's the correct way to use an accent


Lawyer?


This was a great demo, but if I could make one suggestion: I think it would be a more useful exercise to group layouts by category and then test based on educated assumptions from there.

What does that mean? Well, I believe strongly in the "less is more" strategy, and I have a hunch that the real difference in results lies in the lack of options for visitors to the Basecamp intro page. With the two previous long-form versions, users had the option to spend time reading and eventually navigating away from the page. With the photo page, there is really only one option: go to the sign-up page now. The images certainly make the first impression very engaging, and I think it is well done, but I would be interested to see whether a meaningful change would occur if, instead of the smiling people, you had a beautiful cityscape of the Chicago skyline, or no image at all. I suspect it wouldn't be huge. My point is that it would be useful to gain actionable insights that could be repeated, such as "always make a sign-up button the only option and focal point of a splash page" as opposed to "always use smiling people." Otherwise, the real value of the lesson may have been overlooked.


The gender of the "big smiley customer" didn't seem to matter too much. This makes me smile.


It's good that they're using real people in the photos in those designs. Few things bug me more on marketing sites than when people use those happy stock photo people, who are usually jumping up and down or looking at a pie chart, or staring at me in their hands-free headset, ready to take my call.


Our customers at Visual Website Optimizer also arrived at a similar conclusion: human faces do indeed increase conversion rates. We compiled two A/B tests of human faces vs. other images into one case study. Here it is: http://visualwebsiteoptimizer.com/split-testing-blog/human-l...

This corroborates the result found by 37signals. (Although the caveat that it is not always true still applies, so you should A/B test it before implementing it on your website.) It looks like there is indeed something special (on a subconscious level) about human faces.


This is interesting because I think what these tests prove is that people coming to the Highrise website are not searching Google for CRM terms; rather, they are searching for the brand, or are coming from targeted ads... As in, they don't need to be "sold" on anything or given more information to read, they just need a button to click (they are already determined to buy; now the goal is just to remove barriers for them).

What works here will NOT work somewhere else. It certainly will not work on my website (http://www.devside.net/server/webdeveloper), where I have to sell the benefits from the start... Even my buy button is at the very bottom of the page, because that forces the visitor to read the page... "Above the fold" is nothing but bounces and lost sales for me.


"Jason Fried’s mantra while testing was: We need to test radically different things. We don’t know what works. Destroy all assumptions. We need to find what works and keep iterating—keep learning."

The specific learnings are irrelevant; only the process matters.


I'm anticipating a new wave of SaaS homepages with photographs in the background.


Starting with me. I know it is cheesy to copy, but if it works, it works.


And the only way to know if it works for you is to test it.


Great series.

There are plenty of reasons to be skeptical of small lifts. The statistical significance probably won't be there except for high-volume sites. There is a paradox: A/B testing is better for big changes, but those changes could also be the most disruptive to existing users.

Even with these concerns, look at how efficient this is compared to traditional retail. Imagine all the black magic required to figure out which display for Crest works the best. It was worthwhile to do with very blunt tools. We may not be in the realm of scalpels, but we are well beyond chainsaws.

Thanks for sharing!


"finale", a music term, is an italian word (not french) and has no accent.


I'm glad they shared their findings, but I would have loved to see more actual information in this 3-part series.

What about:

* actual signup rate (not only compared to before)
* retention (that's the thing that really counts, right?)

Does anyone know of good articles on A/B testing that cover the important things, like statistical significance and how to make sure that the novelty of the change isn't skewing the results, and so on? I fear that too many people fall into the trap of drawing premature conclusions and wasting time and money :(


The background colors were very different in the big photos.


The design with the big customer photo also has the most obvious "call to action" button.

Not only that, it has the fewest links that go elsewhere.


37signals are known and celebrated for their design style, and I think it's awesome that they are willing to experiment with new ideas and discard old ones. Particularly because the new designs in the post look nothing like their usual style.


There is also the issue of ethnicity. If you are targeting a global market, what pictures do you use? Asian? African? Caucasian? Does that affect your conversion per region? That would be interesting.


I like that 37signals tested really different designs and then refined the winner. This is, in my opinion, the best way to do things to avoid getting trapped in a local maximum.


They should've also tried the GoDaddy girl :) I wouldn't be surprised if the numbers were way higher, even though she wouldn't be relevant to the context at all.




