A/B testing is much more difficult and complex than most vendors and blog posts would have you think.
There are literally hundreds of ways to screw up your A/B tests and if you aren’t careful, testing can easily do more harm than good to your online business.
In this article, 10 CRO experts with extensive hands-on experience will reveal their biggest A/B testing mistakes so you can learn from them and avoid these pitfalls yourself.
1. Peep Laja, ConversionXL – Stopping Tests Too Soon
When getting started with split testing, my most common test mistake was definitely ending tests too soon. I would either “see the trend” and end it in a few days, or perhaps call it as soon as it hit a 95% confidence level. Of course, you shouldn’t do any of this.
Ending tests too early is the #1 mistake I see. You can’t “spot a trend”, that’s total bullshit. The correct way to go about it is to pre-calculate your minimum needed sample size (there is no penalty for a bigger sample) and to test for 1 (better still 2) full business cycles – so the sample includes all weekdays, weekends, various sources of traffic, your blog publishing schedules, newsletters, phases of the moon, weather and everything else that might influence the outcome. Read this article for more on statistical significance and test validity.
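To make the pre-calculation concrete, here is a minimal sketch in Python. It assumes a two-sided z-test on two conversion rates with equal traffic per variant. The function names, the default significance level and power, and the 7-day business cycle are illustrative assumptions rather than anything prescribed above, and your testing tool may use a different formula.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Visitors needed in EACH variant to detect the given relative lift.

    Uses the standard normal-approximation formula for comparing two
    proportions with a two-sided test.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def days_to_run(n_per_variant, daily_visitors, variants=2, cycle_days=7, min_cycles=2):
    """Round the run time up to whole business cycles (assumed weekly here)."""
    raw_days = ceil(n_per_variant * variants / daily_visitors)
    cycles = max(min_cycles, ceil(raw_days / cycle_days))
    return cycles * cycle_days


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline conversion rate, hoping to detect a 20% relative lift.
    n = sample_size_per_variant(baseline_rate=0.05, relative_lift=0.20)
    print("visitors per variant:", n)
    print("days to run at 1,000 visitors/day:", days_to_run(n, daily_visitors=1000))
```

The point of fixing both numbers before launch is that the stopping rule no longer depends on what the interim results look like.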
2. Craig Sullivan, Optimise or Die – Not Integrating Test Data With Analytics
When it comes to split testing, the most dangerous mistakes are the ones you don’t realise you’re making. I know this, because I learned the hard way – by screwing up the validity of my tests. Let me explain what I mean with a real example:
The double test whammy
I was watching a test run and although the early stage results showed some promise, the test simply settled into a depressing pattern of showing no real difference between the two creatives.
The strange thing was that this was a real ‘rip it up and start again’ test – where we had heavily changed the design between A and B versions. We should be seeing a major reaction (negative or positive) in terms of the new creative. Why, with all the huge differences in design, copy and layout, was this test coming out at about ‘just the same’ as the other tests? I checked all the data in the split testing software and this all looked correct.
I asked the developers and team for ideas and found nothing that would explain the test results going this way. We rechecked the QA for all the browsers and devices and again, found absolutely nothing. At this point, you’d be forgiven for thinking we should have just accepted the result and moved on. However, the result was so counter intuitive that I had to investigate further.
Luckily, we had set up some custom variables in Google Analytics to collect which creative people were exposed to during the test. This basic form of analytics integration was about to save us, big time – because I noticed something odd. In an A/B test – you’d expect to see roughly a 50/50 split between people seeing A and B over the test period. I was stunned to find that over 80% of people exposed to the tests saw BOTH versions. WTF? It was a big warning flag and when we investigated, we found the developers had coded the test in an unusual way.
They’d set a session (browser) cookie which expired after each visit when the browser was closed. This meant that visitors returning on the same device & browser pair would simply get version A or B randomly and it never ‘remembered’ which creative was picked when they came back to the site. This is just one of over 60 ways to break your AB test that I’ve discovered. I really must write all these down sometime so please offer your encouragement! What are the main lessons I learned then?
– Always integrate your AB testing with analytics. If you don’t do this, you cannot ‘cross check’ the figures from one with the other. If there is a large difference, then something is broken and you need to investigate. In the past, I’ve had issues with AB tests not recording goals/revenue/other data correctly and being able to verify with two collection engines is vital for picking up bugs.
– Get your developers to understand the session model (cookies, expiry, sessions, timeouts) employed by both your analytics package and the AB testing software you are using. If there is a misalignment between the three session models (your site, the AB testing and the analytics package) then the numbers WILL be out. Asking developers about this area will ensure you don’t repeat what happened to me above.
– Don’t believe the numbers. Assume that your testing data is suspect until you have run through checking against analytics. If this is your first test, you can run an A/A test or have a short period where you check the figures before you’re happy. Running checks that what the AB test says is reflected both in your back-end figures and in your analytics package will give you a safety net against instrumentation issues.
– Check your data even if it looks good. If version B is what you want to win, and the data looks this way, still distrust it until you’ve checked. Don’t be a victim of confirmation bias – make sure that what you’re seeing is confirmed by more than one data source.
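The cross-check described above can be sketched in a few lines of Python. This is only an illustration: it assumes you can export one row per exposure from your analytics package as a (visitor_id, variant) pair, and the function name and report keys are hypothetical.

```python
from collections import defaultdict


def exposure_report(exposures):
    """Summarise which variants each visitor was exposed to.

    exposures: iterable of (visitor_id, variant) rows exported from analytics.
    """
    seen = defaultdict(set)
    for visitor_id, variant in exposures:
        seen[visitor_id].add(variant)

    total = len(seen)
    # Visitors who were shown more than one variant contaminate the test.
    contaminated = sum(1 for variants in seen.values() if len(variants) > 1)

    # Count visitors who only ever saw a single variant, per variant.
    clean_counts = defaultdict(int)
    for variants in seen.values():
        if len(variants) == 1:
            clean_counts[next(iter(variants))] += 1

    return {
        "visitors": total,
        "saw_multiple_share": contaminated / total if total else 0.0,
        "clean_counts": dict(clean_counts),
    }


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u3", "B"), ("u4", "A"), ("u4", "B")]
    print(exposure_report(rows))
```

A healthy test should show a share of multi-variant visitors close to zero and clean counts in roughly the split you configured. Anything else is a reason to look at how assignment is stored between visits.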
3. Alhan Keser, Wider Funnel – Focusing on Design Rather than Solving Conversion Problems
Back when I first started a/b testing when I worked on marketing at a web design company, I used to make the mistake of focusing entirely on design/usability-related tests. I didn’t give copy and its effect on user motivation enough credit.
One time, I invested weeks into planning and designing two completely new designs of a lead-gen landing page. I had been allocated a designer and developer to get the job done, with the expectation of delivering at least a 20% increase in leads. With these resources and time, the new designs would surely outperform the existing page! Alas, the test went terribly and I was left with few insights. As a result, I lost support internally. This forced me to make tests that were simple and focused on copy.
It was for the best: I discovered how much impact a good headline and body copy could have. Almost immediately, I saw big improvements by testing various approaches to our landing page copy. It required little more than a few brainstorming sessions to find themes. Also, I was able to get through many more tests, resulting in quicker revenue gains and more insights that could be tested through other marketing channels.
Looking back, it was a rookie mistake. I should have mitigated risk and set up a/b tests that had a low chance of failure, requiring little effort, and with re-usable insights. Being at a web design company at the time, I was looking at everything through a myopic lens, analyzing all things design-related, rather than solving the conversion problem at hand.
4. Joel Harvey, Conversion Sciences – Rendering a Site Unable to Produce Revenue
Aside from science, what do Conversion Optimizers and Doctors have in common? We’re both subject to the same oath – First do no harm. For us, that means our tests can never (EVER) impede a site from making money.
Naturally we’ll have some variations that under-perform the control, thus leading to some opportunity cost. That’s fine – it’s just a cost of testing. Plus we know that in the medium to long term, testing will produce incremental revenue exponentially exceeding the testing cost.
On the other hand, launching a variation that literally renders a site unable to produce revenue. Well, that’s not fine. Not. Fine. At. All. Notice I said ‘renders a site unable to produce revenue’ instead of ‘breaks a site.’ The distinction is important because it reminds us that the costliest mistakes are also some of the most easily overlooked. For the past several years, we’ve designed, QA’d and launched dozens of tests a month across a huge variety of websites. To this day, the one mistake that makes me cringe more than any other we’ve made didn’t even “break” a client’s site, per se. The site looked and worked fine in every test variation, including the variation that still haunts me.
Nope, there wasn’t a break, just two transposed numbers. The primary goal for this particular website is generating phone leads. Every test variation we launched had a phone number assigned to it, meaning that any place on the site where there was a phone number, the number displayed to visitors would be the number associated with the specific variation they were in.
To make a long story short, because of a QA breakdown we didn’t notice that the last four digits of one of the variation phone numbers displayed to visitors were 3576 when they should have been 3567. What was the difference between 76 and 67? Well, not 9. In the short time that the offending variation was live, we lost at least 100 phone calls. Visitors dialed the number and were greeted by “duh-duh-duh, we’re sorry, the number you have dialed is no longer in service.”
Gut wrenching. Embarrassing. Money-burning.
We calculated the lost revenue, made our client whole on the small amount of lost revenue, added two extra layers of QA to make sure it would never happen again and continued testing. To this day, we’ve never repeated the mistake. But just thinking of it still makes me cringe.
5. Andre Morys, WebArts.com – Tracking the Wrong Ecommerce KPIs
Around two years ago, somebody asked me a very simple question: “What is the primary KPI for a/b testing success when it is about ecommerce optimization?” This guy wrote a whole blog post about this question and asked for my opinion.
While I was thinking about it, I slowly realized that we had maybe been measuring the wrong stuff for the last 3-4 years. I asked myself: “What really defines ecommerce success?” You may argue that it is about conversions. But maybe that is the only thing you see when you are testing, because it is the only thing besides revenue that we can measure. I thought: Success is about growth, and for growth you need good bottom line results.
So the primary KPI that you can measure and influence directly is bottom line, which means revenue minus VAT minus returns minus cost of goods. Sounds complex? Mostly it is not. It is easy to analyse – you can directly move the data about the test variation into the ecommerce system and run a cohort analysis later. We did this and we saw an extreme influence in the area of returns.
In a case study we did, we got the following data:
Variation 1 – focus on discount: 13% uplift, -14% bottom line
Variation 2 – focus on value: 41% uplift, +22% bottom line
So variation 1 lost although the testing tool reported an uplift. The uplift in conversions for variation 2 was much bigger than the “real uplift” in bottom line. This is why I recommend that everybody run a cohort analysis after testing high-contrast changes in ecommerce – there could be some differences…
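As a rough illustration of the bottom-line KPI described above (revenue minus VAT minus returns minus cost of goods), here is a small Python sketch. The Order fields, the function name and the sample numbers are all hypothetical. In practice the rows would come from your ecommerce system, tagged with the test variation each customer saw.

```python
from dataclasses import dataclass


@dataclass
class Order:
    variation: str         # test variation the customer was exposed to
    gross_revenue: float   # amount charged, VAT included
    vat: float             # VAT portion of the gross revenue
    returned_value: float  # net value of items the customer sent back
    cost_of_goods: float   # cost of the items the customer kept


def bottom_line_by_variation(orders):
    """Sum revenue - VAT - returns - cost of goods for each variation."""
    totals = {}
    for order in orders:
        contribution = (
            order.gross_revenue
            - order.vat
            - order.returned_value
            - order.cost_of_goods
        )
        totals[order.variation] = totals.get(order.variation, 0.0) + contribution
    return totals


if __name__ == "__main__":
    # Made-up orders: same revenue per variation, very different returns.
    orders = [
        Order("discount", 120.0, 20.0, 50.0, 40.0),
        Order("value", 120.0, 20.0, 0.0, 40.0),
    ]
    print(bottom_line_by_variation(orders))
```

Because returns arrive days or weeks after the purchase, this calculation only makes sense once the cohort has had time to send items back, which is why it is run after the test rather than read off the testing tool.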
6. Ton Wesseling, Online Dialogue – Basing Conclusions on a Low Power Level
I love mistakes, they make you an expert – and I’ve made them all! Calling tests too early, letting them run until significant…
I’ve made all the 20 mistakes that can make A/B-testing a big lie. I shared them in this presentation. But, there is one very important lesson that’s not in this slide set. A mistake I’ve made in the past and I still see many people in the industry get this wrong…. You tested some great idea, but there was no significant effect found. People tend to say: I’ve tested that idea – and it had no effect. YOU CAN NOT SAY THAT! You can only say – we were not able to tell if the variation was better. BUT in reality it can still be better!
Many people tend to test with too low a power level – so you will only be able to find real winners for a certain percentage of tests. No significant effect DOES NOT MEAN that there is no difference between the original and the variation; the variation may well be better. Test again – or better, test with more traffic & conversions. But please, with a power level lower than 90% – NEVER SAY: tested, didn’t work!
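To see what power level a finished test actually had, you can run a quick approximation like the Python sketch below. It assumes a two-sided z-test on two conversion rates with equal sample sizes, and it ignores the negligible chance of a significant result in the wrong direction. The function name and the example inputs are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(baseline_rate, relative_lift, n_per_variant, alpha=0.05):
    """Approximate chance of a significant result if the lift is real."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    # Standard error of the difference between the two observed rates.
    se = sqrt(p1 * (1 - p1) / n_per_variant + p2 * (1 - p2) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_effect = abs(p2 - p1) / se                   # true effect in standard errors
    return NormalDist().cdf(z_effect - z_alpha)


if __name__ == "__main__":
    # Hypothetical test: 5% baseline, a true 20% relative lift, varying traffic.
    for n in (2000, 8000, 12000):
        print(n, "visitors per variant ->", round(achieved_power(0.05, 0.20, n), 2))
```

If the number that comes back is well below 0.9, a "no significant effect" result says more about the traffic you had than about the idea you tested.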
7. John Ekman, Conversionista – Choosing a Bad Testing Period
AB-testing is not a game for nervous business people (maybe that’s why so few people do it?!). You will come up with bad hypotheses that reduce conversions!! And you will mess up the testing software and tracking. That is more or less guaranteed. So, here are two of our biggest screw-ups. We hope you can learn from them and avoid making the same mistakes.
1: Choosing a bad testing period. The pros will tell you that it is important that you set the test duration to span over a sufficiently long period. Two full “Business cycles” is usually the recommendation. We did a test for a company that sells office supplies online. We set up and ran the test for well over a month, covering at least that time span. And we had sufficient traffic to get significant results. The period was long enough. It just wasn’t the right period!
The test period spanned the whole Christmas and New Year season. We had plenty of visitors and conversions, but it later turned out that their shopping behavior in that period was not representative. So we couldn’t reproduce the test results later. Maybe they were stocking up before the end of the year? Or the other way around? We don’t know. We just know that we did not gain any actionable customer insights in that period.
2: Getting template and individual pages mixed up. One common problem with E-commerce testing is that the site has just a few template pages, but several thousand individual pages based on those templates. These are product pages, category pages and so on. So when you test, you want to test on the template level NOT on an individual page instance.
If you test on the individual page you will not get the traffic leverage of a test that runs across several thousand pages. For one E-commerce site, we deployed a template, site-wide test, but we messed up the test. The test script would fetch the price from the example/template page we had based the code on. So all of a sudden every single item in the store cost 95 kr. The horror!!! We had to shut that one down quickly.
8. Paul Rouke, PRWD – Not Engaging The Technical Team
One of the biggest lessons I have learnt is making sure we fully engage, and build relationships with the people responsible for the technical delivery of a website, right from the start of any project.
Nearly all technical teams have a never ending backlog of fixes and tweaks to work through, as well as delivering a long-term roadmap of features and functionality, meaning there is often little compassion or resource available to support a new set of demands on their workload. This can not only create resentment for your project from internal teams but can cause serious issues in restricting innovative tests likely to deliver more dramatic results, keeping you confined to the limits of what’s possible through testing tools.
When you do engage the technical team
Turning this on its head, now that we always engage the technical team from the start, we see and hear from them how much more they enjoy developing new page designs, features and functionality – knowing that these have been designed through a rigorous research and evaluation process that produces strong hypotheses with a high likelihood of improving website performance. The lessons we have learnt from this are:
1. Always engage with the people responsible for the website experience from the start. Get them involved in the kick off meetings. Show them how some of their existing workload could be incorporated into future tests. More than anything we see how much development teams get out of actually seeing the impact their work has, in a role where usually you develop something and then for no apparent reason you’re told to re-develop the same area 6 months down the line. Development teams should be as much a part of your optimisation team as the person responsible for CRO.
2. Working in silos will hinder your project. Get the right people and expertise in the room so you can make decisions, plans and gain momentum without having to go away and consult others.
3. Don’t believe the hype when the testing tools say “no need for any technical involvement”.
4. Do whatever is necessary to remove bottlenecks stopping you from developing innovative, radical tests – typically these are the tests with a high potential to alter user behaviour and dramatically improve site performance.
9. Matt Gershoff, Conductrics – Assessing the Past Instead of Getting Insight to Move Forward
One of the best learning mistakes I made was way back in the mid 90’s, when I was working in database marketing. What may not be apparent to digital folks today is that there was a fair amount of sophisticated data driven marketing going on in direct mail decades ago.
One of my clients was a large US retailer, looking to run a direct mail campaign to target the most likely households to buy or upgrade their existing home heating system (HVAC). For the targeting piece, I built a predictive model that identified which households looked most like past HVAC buyers. I then created two mailing lists. The first was just a list randomly drawn from the client’s customer database (almost every household in the US had been a customer at one time or another so their database had good national coverage).
This would be the control. After pulling this random control list, I applied the predictive model to the database and selected all of the households with scores in the top 10%. After mailing the HVAC offer to the two files, we were then able to determine the conversion rates of each of the mail files based on the downstream sales. Not too surprisingly, the targeted list significantly outperformed the random list – Success!
The client was happy, we were happy. But the joy was short lived. A few months later, my new boss, who had seen the test results of how effective the predictive model was, wanted to use the model to run another targeted campaign for the client. This is how the conversation went:
Boss: ‘Hey, could you pull a fresh targeted list? We want to run another HVAC campaign.’
Me: ‘Well, we can’t really pull a fresh list since we already mailed all the customers in the top 10%.’
Boss: ‘What do you mean you mailed all of them?! If you already mailed all of the top scoring households, what good is the model going forward? Why even bother spending all of the time testing if you were only going to be able to use it once anyway?’
Now technically, I did everything correctly. However, one of the traps of testing is that if you aren’t careful, you can get hung up on just seeing what you DID in the past, but not finding out anything useful about what you can DO in the future. What I could have done was make my list selection less selective, maybe setting the cut-off at the top 35%.
For the test, I could have then pulled a random sample from that top 35%. Then, we could have mailed several lists going forward. So the main takeaway is: if you are spending time testing, make sure you are testing to learn something you can use going forward, not just to assess what you did in the past.
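The alternative selection described above could look something like this Python sketch. It is only an illustration of the idea: the function name, the fixed random seed and the choice of three mailings are assumptions, and the scores would come from whatever predictive model you use.

```python
import random


def split_target_pool(scored_households, top_fraction=0.35, n_mailings=3, seed=0):
    """Split the top-scoring households into random, non-overlapping mailing lists.

    scored_households: list of (household_id, score) pairs, higher score = better prospect.
    """
    ranked = sorted(scored_households, key=lambda row: row[1], reverse=True)
    pool_size = round(len(ranked) * top_fraction)
    pool = [household_id for household_id, _ in ranked[:pool_size]]

    # Shuffle so that every list is a random sample of the whole top segment,
    # not just the very best scores first.
    random.Random(seed).shuffle(pool)

    # Deal the shuffled pool into lists of near-equal size.
    return [pool[i::n_mailings] for i in range(n_mailings)]


if __name__ == "__main__":
    # Made-up scores for 100 households.
    households = [(household_id, household_id / 100) for household_id in range(100)]
    first, *later = split_target_pool(households)
    print("mail now:", len(first), "households")
    print("held back for future campaigns:", sum(len(batch) for batch in later))
```

The first list is used for the test, and because the remaining lists are random draws from the same segment, the test result should carry over to them in later campaigns.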
10. Michael Aagaard, michaelaagaard.com – Running A/B Tests with no Underlying Test Hypothesis
When I started A/B testing years ago, there was very little material on the subject available. So for me it’s been a learning-by-doing process that led me to make just about every possible mistake you can think of.
I’ve felt like an idiot many, many times, but looking back I’m grateful for all those mistakes because they’ve taught me so much that I wouldn’t have learned or understood otherwise. Let me introduce my biggest mistake with a quote from Abraham Lincoln:
“Give me six hours to chop down a tree, and I’ll spend the first four sharpening the ax.”
Jumping headfirst into a series of A/B tests with no data and insight is like chopping blindly away at a tree for hours with a dull ax, hoping that the tree will eventually give way to the blade and fall over. In CRO, collecting data and getting insight is like sharpening the ax. We’re trying to increase our chances of being able to take down the tree in the first attempt – maybe even in the first swing.
Your A/B test is only as good as your hypothesis
For a long time, I thought that CRO was all about conducting as many tests as possible – boy, was I wrong about that one! After years of trial and error, it finally dawned on me that the most successful tests were the ones based on data, insight and solid hypotheses – not impulse, personal preference or pure guesswork.
In my experience, most A/B tests fail because of the underlying test hypothesis – either the hypothesis was fundamentally flawed, or there was no hypothesis to begin with. Your test hypothesis is the basic assumption that you base your optimized test variant on. It encapsulates what you want to change on the landing page and what impact you expect to see from making that change. Moreover, it forces you to scrutinize your test ideas and helps you keep your eyes on the goal.
If we stick to the Lincoln analogy, formulating a test hypothesis is like doing a test to check whether your blade really is sharp enough to dig into the trunk and effectively cut down the tree. You can do an on-the-fly test of your optimization idea by filling in the blanks in this template: If the expected outcome seems way too good to be true (or simply stupid), it’s a clear sign that your test hypothesis is too weak to have an impact in the minds of your potential customers.
Working with test hypotheses provides you with a much more solid optimization framework than simply running with guesses and ideas that come about on a whim. But remember that a solid test hypothesis is an informed solution to a real problem – not an arbitrary guess. The more research and data you have to base your hypothesis on, the better it will be.