Why small tweaks and split testing don't work, and what to do instead (cortes.design)
82 points by pedrocortes on Nov 29, 2018 | 46 comments



I don't buy this article at all.

I am no longer with the company, but around 10 years ago I turned a failing consultancy into an SEO SaaS company. We took revenues from -40k/month to 90+k/month after laying off more than half the staff when we lost our last client, deciding to take our destiny into our own hands and building a suite of SEO products.

We ran split tests on almost every design change. They weren't always tiny; some were entire page layouts. But even things like changing the header text showed measurable differences.

I built the ability to split test into every facet of our framework. We could split test pages, transactional emails, and marketing emails, and we would tweak everything all the time. If we noticed group A wasn't performing as well as group B, we would drop A.

There were some tiny things that made major differences:

- Email "From" being a person's name and not just the service bumpped onboarding but nearly 20%.

- The onboarding subject "Almost Done..." performed about 10% better than "Activate your account"

- Having the video auto play converted visitors almost 50% more often than them having to click the play button (it may be evil, but it makes money).

- Having the video be an animation vs just a spokesperson talking into a webcam converted 5% better with the same script (+2% better with just reading the text from the page).

We had a large audience after we launched our first product free of charge for a limited time (linklicious.co -- formerly linklicious.me -- formerly lts.me), then used that momentum to build out a new product every month or so for the next couple of years. It was a great job in a terrible industry. 80% of my job was inventing new ideas.


- Take your email list
- Split it into two equal halves
- Send each half the same email
- Open and click rates will differ between the two lists


But if you apply statistics, you will see that the open rates only differ significantly 5% of the time, if you use a 95% confidence interval.


It will only differ 5% of the time if you have an adequate sample size and only check for significance once the sample size is reached.

If you end the test the moment the data reaches 95% significance, it will show a difference about 50% of the time for the same email. Many people make this mistake.

A 95% confidence interval doesn't mean much if you don't follow good statistical practices.
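
A quick simulation makes the peeking problem concrete. This is a rough sketch with made-up numbers (a 20% open rate for both halves, a check after every batch of 100 sends, a simple two-proportion z-test); the exact false-positive rate depends on how often and how long you peek, but it lands well above the nominal 5%:

    import random
    from statistics import NormalDist

    def p_value(opens_a, n_a, opens_b, n_b):
        # Two-sided p-value from a two-proportion z-test.
        p_pool = (opens_a + opens_b) / (n_a + n_b)
        se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
        if se == 0:
            return 1.0
        z = (opens_a / n_a - opens_b / n_b) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    def run_test(true_rate=0.2, batch=100, max_n=5_000, peek=True):
        # Send the same email to two halves; return True if we declare a "winner".
        opens_a = opens_b = n = 0
        while n < max_n:
            n += batch
            opens_a += sum(random.random() < true_rate for _ in range(batch))
            opens_b += sum(random.random() < true_rate for _ in range(batch))
            if peek and p_value(opens_a, n, opens_b, n) < 0.05:
                return True  # stopped the moment it looked significant
        return p_value(opens_a, n, opens_b, n) < 0.05

    random.seed(1)
    trials = 500
    print("peeking:", sum(run_test(peek=True) for _ in range(trials)) / trials)
    print("fixed n:", sum(run_test(peek=False) for _ in range(trials)) / trials)

With these made-up numbers the peeking version flags a "difference" several times more often than 5%, and the more often you peek the worse it gets; the fixed-sample version stays around 5%.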


I don't understand why you would need a sample of a certain size. Setting a significance threshold at 5% takes sample size into account. For example, if I ran a permutation test with a sample size of 5 in each group it could never be significant at that threshold, and "never" is < 5%!

A small sample size would lower your power to detect meaningful differences, which the original scenario doesn't have (by definition).

(If distributional assumptions, etc, are violated, then that's a different story!)
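
Whether n = 5 can ever clear the threshold depends on the test statistic, but the underlying point (that a .05 threshold caps the false-positive rate at any sample size) is easy to check with a toy simulation, here using made-up normal data and an exact permutation test on the difference in means:

    import random
    from itertools import combinations

    def permutation_p(a, b):
        # Exact two-sided permutation test on the difference in group means.
        pooled = a + b
        observed = abs(sum(a) / len(a) - sum(b) / len(b))
        extreme = total = 0
        for idx in combinations(range(len(pooled)), len(a)):
            chosen = set(idx)
            group_a = [pooled[i] for i in chosen]
            group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
            diff = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
            extreme += diff >= observed
            total += 1
        return extreme / total

    # Both groups come from the same distribution (the null is true),
    # so any rejection at p < .05 is a false positive.
    random.seed(0)
    trials, rejections = 1000, 0
    for _ in range(trials):
        a = [random.gauss(0, 1) for _ in range(5)]
        b = [random.gauss(0, 1) for _ in range(5)]
        rejections += permutation_p(a, b) < 0.05
    print(rejections / trials)

In this sketch the false-positive rate comes out around 4-5% with n = 5 per group, so the threshold is doing its job even at that size.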


You need to make sure you have enough samples in order to know whether you rejected the null hypothesis by chance. Stopping your test early is a form of p-hacking. See:

https://heapanalytics.com/blog/data-stories/dont-stop-your-a...


Peeking at your data, and calculating the sample size you need for a test are separate statistical issues. I agree that peeking messes up significance levels :).

The point I was trying to make was that you can decide to run a test with a very small sample (e.g. n = 5), and it will still have the Type I error rate you set if you chose a significance level of .05.

> You need to make sure you have enough samples in order to know if you rejected the null hypothesis by chance.

You do this when you decide the significance level (e.g. .05). The value needed to reject, given a significance level, is a function of sample size.

The definition of Type I error on Wikipedia has a good explanation of this:

https://en.wikipedia.org/wiki/Type_I_and_type_II_errors
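
To put numbers on "the value needed to reject is a function of sample size", here is a rough sketch under assumed conditions (20% baseline conversion, two-sided two-proportion z-test, normal approximation) of how big a gap each sample size needs before it clears alpha = .05:

    from statistics import NormalDist

    def min_rejectable_gap(n_per_arm, baseline=0.2, alpha=0.05):
        # Smallest absolute difference in conversion rate that a two-sided
        # two-proportion z-test would call significant at `alpha`,
        # with n_per_arm visitors in each group (normal approximation).
        z = NormalDist().inv_cdf(1 - alpha / 2)
        se = (2 * baseline * (1 - baseline) / n_per_arm) ** 0.5
        return z * se

    for n in (5, 50, 500, 5_000, 50_000):
        print(f"n = {n:>6} per arm -> gap of about {min_rejectable_gap(n):.1%} needed")

With 5 visitors per arm you would need roughly a 50-point gap to reject; with 50,000 per arm, about half a point.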


It should be possible to adjust the significance estimate down based on an assumption of continued sampling up to the sample-size target. E.g. if you're aiming for a sample size of 100 and after 20 samples the split is 19/1, the current ratio is 95%, but that's not accurate yet, so it could, for example, be adjusted to the average of the two possible outcome extremes (99/1 and 19/81) and show 59% with low confidence. I don't know the statistics, or whether that specific method actually makes sense, but it should be possible.
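
For what it's worth, the specific adjustment described above is easy to write down, though (as noted) it isn't a standard statistical correction; sequential methods such as alpha spending are the textbook way to handle looking at a test before it finishes. A toy sketch of the midpoint-of-extremes idea:

    def adjusted_split_estimate(a, b, target):
        # Average the split you'd get if every remaining sample went to A
        # with the split you'd get if every remaining sample went to B.
        # Not a standard statistical correction, just the idea described above.
        remaining = target - (a + b)
        high = (a + remaining) / target  # e.g. 99/100
        low = a / target                 # e.g. 19/100
        return (high + low) / 2

    print(adjusted_split_estimate(19, 1, 100))  # 0.59, matching the example above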


Only bigcos or companies whose core business is DTC will have the resources to do proper A/B tests, and even then they might not have the knowledge to make them statistically significant.

What you're saying is perhaps correct, but it's not how A/B testing is done in the real world. I think many tests show a difference for purely random reasons.


a) 5% is still one in 20. Are you doing 20 trials or more a month? (Of course, it won't be _exactly_ 5% of the time... there are other statistics we could calculate to say how likely it is to differ by more than some specified delta from 5%, depending on how many trials you run... oh my.) A quick sketch below makes the one-in-20 arithmetic concrete.

b) That's assuming you are using statistics right. It's quite hard to use statistics right.
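
To make point (a) concrete: if each test has a 5% false-positive rate and the tests are independent, the chance that at least one do-nothing change comes up "significant" grows quickly with the number of tests you run:

    # Chance of at least one false positive across k independent tests at alpha = 0.05.
    for k in (1, 5, 10, 20, 50):
        print(f"{k:>2} tests: {1 - 0.95 ** k:.0%}")

At 20 tests a month that is already about a 64% chance of at least one fluke "winner".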


That's way better odds than most people's gut feel. Everyone thinks they don't need to run experiments because THEY already know what works.

Did you know that in experimentation programs run by Microsoft, Google and Amazon, roughly two thirds of ideas have no impact on, or actually hurt, the metrics they were designed to improve? And yet rookie web devs or marketing assistants "know" better.

Source: https://www.google.com.au/url?sa=t&source=web&rct=j&url=http... (Read section three of this paper by a Microsoft distinguished engineer)


It's actually more like 80% for Google. I spent a lot of time running various experiments on Search.

I'll point out a major difference, though: Microsoft, Google, and Amazon are already highly optimized. They've had millions of man-hours put into optimizing everything from product design to UI to UI element positioning to wording to colors to fonts. When you get a new, beneficial result in a change to Google, it's usually because the environment has changed while you weren't looking, and the user is different.

That doesn't apply to a new business that's just learning how to sell their product. In a new business, by definition, you've spent zero time micro-optimizing your message & product. You can get really big wins almost by accident, if you happen to have stumbled into a product category where users actually want what you're building.


Assuming you completely randomize which half goes into each split, it won't vary significantly over time.

Obviously some segments are going to open/click/spam at higher rates, but that's fine and controllable with a customer rating. So when you split your users, ensure an even makeup of customers, and your tests will be fine.
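
A small sketch of that kind of split, assuming a made-up per-customer engagement rating: shuffle within each rating bucket so both halves end up with the same makeup:

    import random
    from collections import defaultdict

    def stratified_split(users, key=lambda u: u["rating"], seed=42):
        # Split users into two halves with roughly the same makeup per rating bucket.
        rng = random.Random(seed)
        buckets = defaultdict(list)
        for u in users:
            buckets[key(u)].append(u)
        half_a, half_b = [], []
        for bucket in buckets.values():
            rng.shuffle(bucket)
            mid = len(bucket) // 2
            half_a.extend(bucket[:mid])
            half_b.extend(bucket[mid:])
        return half_a, half_b

    # Toy usage with an invented 1-5 engagement rating.
    users = [{"id": i, "rating": random.randint(1, 5)} for i in range(1000)]
    a, b = stratified_split(users)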


Also known as an A/A test.


> Having the video auto play converted visitors almost 50% more often than them having to click the play button (it may be evil, but it makes money).

Because the ones who weren't going to convert never came back


It may or may not be true that they never came back. A good split test tool will tell you that as well.


I'm increasingly of the opinion that split-testing tools are the hidden way Satan influences the world. If your A/B tests keep telling you that you should make the product more user-hostile and less useful, that you should manipulate the users and abuse the commons for short-term gains, something is very wrong - with the tools, the tests, or the world.


Split testing can be used by companies who care about their users and companies who don't. It just tells you how your users react. It's not a substitute for user empathy or regulations.


How many impressions were you running if you were able to measure a 2% difference?


A surprisingly small amount, actually. Say your baseline is 20%; to detect a 2% difference you just need 2 x 6,347 views (6,347 for each variant).

Check this tool [0] and this article [1]

[0] https://www.evanmiller.org/ab-testing/sample-size.html#!20;8...

[1] https://www.evanmiller.org/how-not-to-run-an-ab-test.html
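
A rough way to reproduce that kind of number yourself (two-sided test at 95% confidence, 80% power, normal approximation; the 20% baseline and 2-point lift are the figures from the comment above). It won't match the calculator exactly, since the exact formulas differ slightly, but it lands in the same ballpark:

    from statistics import NormalDist

    def sample_size_per_variant(p_base, delta, alpha=0.05, power=0.8):
        # Visitors needed per variant to detect an absolute lift of `delta`
        # over a baseline conversion rate of `p_base` (two-sided z-test).
        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)
        p1, p2 = p_base, p_base + delta
        var = p1 * (1 - p1) + p2 * (1 - p2)
        return int((z_a + z_b) ** 2 * var / delta ** 2) + 1

    print(sample_size_per_variant(0.20, 0.02))  # roughly 6,500 per variant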


When we implemented split-testing-by-default baked into the software itself, we saw maybe 1000 visitors a week from various sources. Our free conversion rate was at about 3% and once we were "fully optimized" we were closer to 10-15%, and our paid conversions were about 10-30% of the free users depending on the month (December and January always sucked).

Most of these numbers were measured after at least a week of leaving the experiments running.

We never tested more than one experiment at a time per visitor, but our framework had the ability to run multiple tests simultaneously while cookie-ing a visitor to a single test.
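
A hedged sketch of how that kind of assignment is commonly implemented (the experiment names below are invented): hash the visitor's cookie ID so the same visitor always lands in the same variant, and pin each visitor to a single active experiment:

    import hashlib

    EXPERIMENTS = ["homepage_video", "onboarding_subject"]  # invented names

    def variant(visitor_id: str, experiment: str, variants=("A", "B")) -> str:
        # Deterministic bucketing: same visitor + experiment always maps to the same variant.
        digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    def active_experiment(visitor_id: str) -> str:
        # Pin each visitor to exactly one experiment, as described above.
        digest = hashlib.sha256(visitor_id.encode()).hexdigest()
        return EXPERIMENTS[int(digest, 16) % len(EXPERIMENTS)]

    exp = active_experiment("cookie-1234")
    print(exp, variant("cookie-1234", exp))

Deterministic hashing means nothing has to be stored server-side; the cookie only needs to carry a stable visitor ID.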


Sorry to be so harsh, but this article is mostly BS. Booking.com built their entire corporate culture around A/B testing, and they (i.e. their parent, Priceline) were one of the top performing stocks of the 00s.

If anything, the biggest issue I've seen with A/B testing is that it biases organizations toward things that are easy to measure (e.g. shopping conversion), sometimes at the expense of things that are longer term and harder to measure (like brand reputation). I'd be most interested to hear how some companies have dealt with that shortcoming.

But all that said, one of the biggest benefits of a culture around A/B testing is that it gets companies away from the endless back-and-forth around opinions (like this article) and builds a culture of "OK, we'll try your suggestion and see if it succeeds or fails".


No problem ;)

If you noticed, I included a section explaining when a split test makes sense, and you can tell that Booking.com fit that criteria, as they were already starting to grow a lot.


Hi, I highly doubt booking.com's success came from CTA button colors - or any testing on the consumer side of the platform, for that matter. Their success is rooted in how they are able to take advantage of hotels and their distribution power.


This is absolutely false, and anyone who works at booking.com will tell you this. https://taplytics.com/blog/how-booking-comss-tests-like-nobo...


Just because they test a lot doesn't mean testing is critical to their success. That's a cargo cult fallacy.

Your article points to 2-3x industry average conversion rates (for existing traffic), but says nothing about the more important factors of acquisition or inventory.


Except that Booking actually has data to back this all up. It's very common to do "holdbacks": you give an old version of your website to a very small portion of your users, then compare its conversion rate to a version of your site that has all the latest A/B test winners. Booking also tracks religiously how conversion rates change over time.

The whole point being that the way Booking (and other) companies act is the exact opposite of a cargo cult. Everything needs to be backed up with data and challenging conclusions is part of the culture.

All that said, a common complaint about Booking is that it has a ton of "dark UI" patterns, so it will be interesting to see whether there is any long-term blowback against short-term A/B winners that erode goodwill toward the brand over time.


Every few years, I read an article critical of testing. It's always written by a designer. It's always the same arguments, and the "solutions" are always qualitative. They rely a lot on rationalizing after the fact. For example, "this redesign failed because you didn't build your website around the EXACT words your customers might describe their problems and their solution" (quoting the article). Saying these things doesn't actually help you arrive at a better process.

At this point, all these articles get me thinking about is why they keep getting written. Were these designers abused by bad product managers? Are they ego-driven and don't like having their creativity reduced to quantitative values? I don't know, but I do know that designers who talk this way tend to be toxic to a productive culture. I've experienced that firsthand.


Nah, it's because they're consultants, which means they need to constantly drum up business, which means that they need potential customers to believe they have unique knowledge that'll help improve their business and all the alternatives are shit. It's a sales pitch. And because they're consultants, they don't need 80% or even 10% of website viewers to convert: they just need a handful of customers who will each pay them tens of thousands of dollars for services.

Controversy is great advertising: they get all the folks who hate them to help spread the word about their services.

It's the same thing with software methodologists, gurus, and architects. Roughly daily there's a new blog post about how [common practice] is now considered harmful, and you need [proprietary expertise held by consultant] to implement some other replacement instead. These posts may or may not be helpful to your software engineering efforts, but they are certainly helpful to the poster's bottom line.


It's the same principle as programmers who want to build a 16 layer microservice vortex using alpha-version frameworks when the product is a simple CRUD form. Let's call these people "architecture astronauts".

In the same way, some designers want to sculpt an ultra-minimalist flat typographic fantasyland with no user cues when the product is a straightforward ecommerce shopfront. Let's call these people "design wankers".

Fortunately, split testing can prove that the "wankers" are wrong. This makes them write angry blog posts about how testing is bullshit and Steve Jobs and skeuomorphism and building a better horse.

Can someone now please solve the "astronaut" problem for me?


Testing is not always the right tool for the job.

For many types of businesses, once a website design is good enough, it is really hard to gain meaningful improvements from further changes. Chances are that changing the color or position of a button is not going to make a difference.

If you have 50 customers a month, you are not going to have enough data to measure a 5% improvement in conversion even if you let the test run for a year.

If a business has 50,000 customers a month, then you can measure a 5% improvement in conversion in a few days, and testing is a valuable tool.

Until you are at the scale where testing becomes feasible, you are better off just following best practices in design and focusing on other areas to drive more business. If you are going to test something, it has to be big, bold changes so you are able to measure significance with a small sample size.
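
Rough arithmetic behind that scale argument, with an assumed 2% baseline conversion rate and the 5% relative lift from above (two-sided test, 95% confidence, 80% power, normal approximation):

    from statistics import NormalDist

    def visitors_per_variant(p_base, relative_lift, alpha=0.05, power=0.8):
        # Visitors needed per variant to detect a relative lift in conversion rate.
        z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
        p2 = p_base * (1 + relative_lift)
        var = p_base * (1 - p_base) + p2 * (1 - p2)
        return int(z ** 2 * var / (p2 - p_base) ** 2) + 1

    needed = 2 * visitors_per_variant(0.02, 0.05)  # both variants together
    for customers_per_month in (50, 50_000):
        visitors_per_month = customers_per_month / 0.02
        months = needed / visitors_per_month
        print(f"{customers_per_month:>6} customers/month -> ~{months:.1f} months of traffic needed")

With these assumed numbers the small site would need around two decades of traffic, while the large one finishes in a week or so.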


There are plenty of things not worth A/B testing, and even if you did test them, it's unclear what mechanism made one succeed over the other. (It's probably just chance.) Marissa Mayer and her 40 shades of blue [0] is a perfect example. In that case, the A/B test wasn't science, but rather scientism.

[0] https://www.theguardian.com/technology/2012/jul/16/marissa-m...


The #1 most straightforward piece of CRO advice, which works basically every time, remains: fix your website's speed (especially on mobile)!

In a previous life I ran a Conversion Rate Optimization as a Service business and optimized $100s of millions of transactions for customers across categories.

The amount of research on this topic is staggering, and a lot of it dates back years, like Walmart's seminal review in 2012 (1) showing the devastating impact of speed on conversion rate.

In case that's not enough for you, Google set up not one but TWO separate page speed tools to help webmasters fix this chronic problem before they threw in the towel and took over the job for you with their AMP effort. (2)

Oh and by the way did you know that if your site speed is slow Google will penalize your SEO? And drop your Page Quality Score index resulting in higher PPC charges?

The best technical guide on the internet to addressing this problem is here - http://httpfast.com/ - and a comprehensive monitoring service built by the author here - https://www.machmetrics.com/.

(1) http://www.webperformancetoday.com/2012/02/28/4-awesome-slid...

(2) https://testmysite.thinkwithgoogle.com/ and https://developers.google.com/speed/pagespeed/insights/


Fun story: I pushed back on the implementation of an A/B testing platform and instead said we should focus on improving performance and my Marketing VP literally walked out of the meeting in anger. Infrastructure work is never sexy, but it's so important to everything — even sales performance.

A/B testing small changes is something that really big sites like Amazon can do because they have enough volume to justify it. It's kind of like blood doping in sports. If you're already at the top of your game it will make enough of a difference to be significant, but if you're just some average person who can't run a mile... blood doping is the last thing you should do.


Just yesterday I was watching a video about a triathlon bike which uses a frame setup that's illegal in normal bike racing. The question was, was it a faster bike? The conclusion was that the bike was quite heavy and not very stiff, so not really a great bike for normal bike racing. However, as a time trial bike where drafting is not allowed (the normal situation in triathlons), this bike could shave off 40 seconds over 40km. Which is a pretty huge amount, the presenter said with a grin.

He said it with a grin because it is a huge amount in the context of a competitive time trial -- like the difference between first place and 10th or 20th place perhaps. But as a percentage difference it's roughly a 1.4% increase in speed (assuming you can maintain 50 km/h). It's practically nothing in real terms.

In competitive cycling, though, the margins are super small. You might win the Tour de France by 2 minutes, which seems like a pretty big lead, until you realise that's 2 minutes in 80 or 90 hours of cycling. This is why Team Sky's approach of "marginal gains" is so successful -- the difference between first and second place is something like 0.04% performance.

We've got this idea that we need to optimise performance (in terms of SEO, etc, etc), but I've never seen anybody quantify the margin of "victory" required. How much better do you need to get to push you over the edge? Because that's what's going to dictate what strategy you need to pursue.


I was once at a start-up where results were deliberately slowed down to give the impression that some unique business process was taking place.

The start-up failed for many reasons and this was one.


So basically: Be bold to break out of or to stop wasting your time on tiny local maxima. Further: Use reason to make leaps in the right direction.


I'm confused. The author starts by stating:

"When you focus too much on increasing the conversion of a certain page you'll either not get any positive results or just push the problem to the rest of the funnel, none of which, will increase your revenue btw..."

And then they launch into their "How to redesign a SaaS website in 3 weeks" solution with:

"#1 - Focus on the money pages. Why would you focus on redesigning pages that barely no one visits or that is not related to a conversion goal?!

You need to focus on the pages that are part of the buyer's journey from landing on your website the first time to completing a goal..."

Am I reading this wrong or is the author contradicting himself?


What if you have done 1, 2 and 3, and now small tweaks are all you have left?

At what point do you consider an experience "optimized" and say we just can't squeeze anything else out of this?


Or to put it another way, one that will strike a chord with a different audience: at what point do I stop refactoring my code and say, "This is good enough"? When do I say, "I've achieved my goal, and as imperfect as this is, I'm going to start concentrating on new things"?

Optimising revenue for its own sake is not desirable. Making enough revenue to achieve your goal is what is desirable. Once you've done that, you should concentrate on new goals. However, just like refactoring code, you may find that you need to go back and revisit something in the future. "Not now" does not necessarily mean "not ever". It just means that I'm putting this aside until it makes business sense to do it.

To complicate things, though, just like programmers have to be the ones calling the shots about whether or not something needs to be refactored (they are the ones who have the required vision and experience to know), marketing people need to be the ones calling the shots about these kinds of things. If you don't trust your programmers to make appropriate decisions, then you have a huge problem. The same is true of marketing people.

Fixing that trust issue (either by placing your trust in the people, or replacing people who are not worthy of trust) is hyper difficult, though. It's one of the reasons organisations run around doing ineffective things instead of fixing this kind of root problem.


Considering something "optimized" is always going to be a struggle against perfectionism (even for myself and my own website, haha).

I think you should always be critical of what would be better use of your time: optimizing vs getting more (targeted) traffic ;)

If you think small tweaks are all you have left, I think you need to consider how you can make your website specific to each persona you have, or even dig into which types of people could benefit from your product outside of who you're currently targeting.


IMO, never.

Rule #1: Perfection is unattainable

Rule #2: The universe is not constant: Screen sizes change, browsers, screen types, technology, new competitors come up changing people's expectations, people's tastes change, you expand into new demographics/demos/sales channels which react differently than existing demographics/sales channels, etc.

Ideally this work is done by a combined marketing/programming/operations team, so all aspects and their trade-offs are considered, building a collective understanding of how to solve user pain points.


I am not speaking of optimized in terms of perfection; I am talking about optimized in terms of revenue generation.

At a certain point your offering is what it is, people have found the maximum value in it, and making it easier to use or testing different ways of delivering that functionality is just adding lipstick.

At what point does it make sense to pursue other revenue models or strategies rather than squeezing out another 1-3%?

I am bringing this up for a reason -- my company is going through this very thing.


I think the section on customer acquisition cost really resonates with me.

The corollary, I suppose, is that once you have some customer traction, looking at where those costs sit is where you want to spend your optimization time. However, this raises some questions:

If I'm spending $500 for a $5000 customer - should I change anything at all to drive down that $500?

...which leads to: at what CAC:Customer Value ratio should I start trying to drive down that cost relative to other costs in the biz?

Anyone have any real experiences from their biz' they can share on this subject?


Thanks, glad you liked it. This is something I've noticed people need to come back to when they focus too much on conversions.

To answer your question, that totally depends, because in the end what you'll need is cash flow to maintain that reinvestment process, and that's something (finance/management) I'm not an expert in :P


Thanks for chiming in here. I've been reading some of your other articles. I like the very practical style and results focus.



