A/B testing is a method of comparing two versions of a web page (or an app) against each other in order to determine which one performs better. Statistical analysis is used to determine which version is more effective for a given conversion goal.
A/B testing is widespread in the marketing departments of companies of all sizes as a technique for Conversion Rate Optimization (CRO). The problem is that the limits of the statistical analyses involved translate into limits on marketing practice. Let’s dive into the subtleties of A/B testing.
The Grail for Marketers: A/B Testing Business Decisions
Marketing directors are responsible for a number of decisions to increase revenue. A majority of them rack their brains to answer these questions:
- Should you lower prices to sell more?
- Or raise them to increase the average basket, at the risk of a lower conversion rate?
- Should products be sorted by increasing price? Or decreasing?
- Should you broaden your range of products upward or downward? Or both? Or neither?
- Is an offer such as “3 for the price of 2” a good way to increase your average basket?
- Should you offer free shipping? Or only from a certain purchase basket value?
Wouldn’t it be great if you could run business experiments in order to test these hypotheses and make the right decisions? Unfortunately, the statistical analyses used today are very limiting in terms of interpreting the results.
Let me explain…
The Basic Principle of A/B Testing
A/B testing consists of exposing two variations (called A and B) of the same web page to two homogeneous populations by randomly splitting the website visitors. For each variation, we collect:
- The number of visitors
- The number of purchases
- The value of the purchase basket
On paper, it should be quite simple to determine which variation generated the most revenue, and hence which is the better variation. However, like any experiment on human behavior, the data is subject to chance. If variation B generates a higher average basket than variation A during the test, it doesn’t necessarily mean that B will always be better than A.
Indeed, it is difficult to assert that a difference observed during a test will recur in the future. That’s the reason why A/B testing tools use statistical analyses in order to qualify the observed differences and identify the winning variation. Their aim is to help sort out the significant data from random and unpredictable fluctuations that are not correlated to the differences between the variations.
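To see why those fluctuations matter, here is a minimal sketch (with made-up numbers) of two variations that share the exact same underlying conversion rate: any difference we observe between them is pure chance, which is precisely what the statistical analysis has to filter out.

```python
import random

random.seed(42)

def observed_rate(n_visitors, true_rate):
    """Simulate a variation: each visitor buys with probability true_rate."""
    purchases = sum(1 for _ in range(n_visitors) if random.random() < true_rate)
    return purchases / n_visitors

# Both variations have the SAME true conversion rate of 3%: any observed
# difference between them is random fluctuation, not a real effect.
rate_a = observed_rate(5_000, 0.03)
rate_b = observed_rate(5_000, 0.03)
print(f"Observed conversion A: {rate_a:.2%}  B: {rate_b:.2%}")
```

Run it a few times with different seeds and the "winner" flips back and forth, even though the two variations are identical by construction.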
“The Problem is Choice”
In e-commerce, variation B can be considered a “winner” if it generates:
- a conversion gain: more sales are concluded with this variation
- a gain in the average shopping basket: the average shopping basket of variation B is higher than A
- a mixed gain: variation B generates both a conversion gain and a gain in the average shopping basket
The conversion gain is the simplest data to analyze in A/B testing. The statistical tool used is a Bayesian test (no need to master its internals to follow the point). The most important functional characteristic of this test is that it produces an interval estimate of the measured conversion gain.
For example, it can say that variation B produces a gain of 5 to 10% – meaning that variation B would generate between 5 and 10% more purchases than variation A. In this example, it’s easy to determine that variation B is more effective. You can validate it as the winning variation, and display it for all your traffic.
…but is this really all you need to make the final call on a winning variation? We’ll see later.
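As a rough illustration of how such an interval can be produced, here is a sketch of a standard Bayesian conversion analysis (Beta posterior plus Monte Carlo sampling). The visitor and purchase counts are hypothetical, and this is only one possible way to implement the kind of test the tools use, not a reconstruction of any specific vendor's method.

```python
import random

random.seed(0)

# Hypothetical test results: visitors and purchases per variation.
visitors_a, purchases_a = 10_000, 300
visitors_b, purchases_b = 10_000, 345

def posterior_sample(purchases, visitors):
    """Draw a plausible conversion rate from the Beta(1, 1)-prior posterior."""
    return random.betavariate(purchases + 1, visitors - purchases + 1)

# Monte Carlo estimate of the distribution of B's relative gain over A.
gains = sorted(
    posterior_sample(purchases_b, visitors_b)
    / posterior_sample(purchases_a, visitors_a) - 1
    for _ in range(20_000)
)
low = gains[int(0.025 * len(gains))]
high = gains[int(0.975 * len(gains))]
prob_b_wins = sum(g > 0 for g in gains) / len(gains)

print(f"95% interval for B's relative gain: [{low:+.1%}, {high:+.1%}]")
print(f"Probability that B beats A: {prob_b_wins:.1%}")
```

The key output is the interval itself: it tells you not just *whether* B wins, but roughly *by how much* – which, as we’ll see, is exactly what the average-basket analysis fails to provide.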
Gain in average shopping basket size
The gain in the average shopping basket is much more complex to analyze. A/B testing tools use the Mann-Whitney U test, also known as the Wilcoxon rank-sum test. Unlike the Bayesian test, this analysis only provides a probability of gain, without specifying the size of the gain.
For example, if you measure a difference of +5€ in variation B’s average shopping basket, it can say that variation B’s probability of gain is 98%. But in reality, you may only have a gain of +0.1€. The statistical analysis is still right: it’s a gain. But the Mann-Whitney test never predicted the size of the gain.
What is even worse is that a winning variation in average basket size, according to the Mann-Whitney test, may actually yield less revenue, due to the presence of extreme values that the rank-based analysis largely ignores. To avoid that, an option could be to remove these extreme values before analyzing the results. However, this solution is inevitably biased: the winner will depend entirely on where you artificially draw the “extreme values” line.
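Both problems can be shown in one sketch with fabricated basket data. Below, variation B’s typical basket is slightly higher (about 52€ vs 50€), but variation A also has a handful of very large orders. A textbook Mann-Whitney U test (implemented here with the usual normal approximation, ignoring ties) confidently declares B the winner on ranks, while A actually brings in more revenue – and the test’s output says nothing about the size of the gain either way.

```python
import math
import random

random.seed(7)

# Hypothetical baskets (in €). B's typical basket is slightly higher, but A
# also contains a few very large orders that ranks barely account for.
baskets_a = [random.gauss(50, 5) for _ in range(500)] + [2000.0] * 5
baskets_b = [random.gauss(52, 5) for _ in range(505)]

def mann_whitney_one_sided_p(xs, ys):
    """Mann-Whitney U test via the normal approximation (no tie correction).
    A small p-value means ys tend to be larger than xs."""
    n, m = len(xs), len(ys)
    u = sum(1 for x in xs for y in ys if y > x)  # pairs where B beats A
    mean_u = n * m / 2
    sd_u = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)

p = mann_whitney_one_sided_p(baskets_a, baskets_b)
print(f"p-value (B's baskets tend to be larger): {p:.2e}")
print(f"Total revenue  A: {sum(baskets_a):,.0f}€   B: {sum(baskets_b):,.0f}€")
# The test confidently crowns B, yet it never quantifies the gain,
# and A actually generated more revenue thanks to its extreme orders.
```

Note that the only number the test returns is a probability-like score: nowhere does a “+0.1€” or “+5€” magnitude appear, which is exactly the limitation described above.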
The ideal case to identify a winning variation is to determine a significant gain in both conversion and average basket. In fact, it’s the only situation when a decision can be made without a doubt.
- Certain conversion gain and certain loss in average basket → impossible to decide, as you don’t know how much you’ll lose, and whether the gain and loss will cancel each other out.
- Certain conversion loss and certain gain in average basket → ditto.
- Undefined loss or gain in the average basket → if you don’t know the evolution of the average basket, it’s impossible to be sure.
This last scenario is the most common situation. Indeed, the average basket analysis usually needs far more data than the conversion analysis before it reaches significance.
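A back-of-the-envelope sample-size calculation makes the imbalance concrete. All the numbers below are assumptions chosen for illustration (3% conversion rate, 60€ mean basket, a heavy-tailed 120€ basket standard deviation, a 5% relative lift on each metric), using the standard two-sample formulas for ~80% power at 5% significance.

```python
# Sample sizes (visitors per variation) to detect a 5% relative lift.
# Assumed scenario: 3% conversion, 60€ mean basket, 120€ basket std
# (e-commerce baskets are often heavy-tailed, with std well above the mean).
Z = 1.96 + 0.84  # z_{alpha/2} + z_{beta} for 5% significance, ~80% power

p, mu, sigma, lift = 0.03, 60.0, 120.0, 0.05

# Conversion: two-proportion comparison, effect d = p * lift.
d_conv = p * lift
visitors_conversion = 2 * Z**2 * p * (1 - p) / d_conv**2

# Average basket: two-mean comparison, but only buyers (a fraction p of
# visitors) contribute a basket value; effect d = mu * lift.
d_basket = mu * lift
buyers_needed = 2 * Z**2 * sigma**2 / d_basket**2
visitors_basket = buyers_needed / p

print(f"Visitors per variation (conversion): {visitors_conversion:,.0f}")
print(f"Visitors per variation (avg basket): {visitors_basket:,.0f}")
```

Under these assumptions, the basket comparison needs several times more visitors than the conversion comparison – which is why so many tests finish with a conversion verdict but an inconclusive basket reading.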
As you can see, the majority of A/B tests can only conclude with certainty on a conversion gain. But without information on the evolution of the average basket, these conclusions are questionable. One could argue that there’s a reason this discipline is called “Conversion Rate Optimization” rather than “Business Optimization”. 😏
Does it mean that A/B testing is completely rubbish? Luckily no. Today, most A/B tests focus on user experience, user interface, and design: colors, wording, pictures, the layout of a product page… In marketing, we talk about “reducing the friction of the buying funnel” – in other words, limiting the number of people who are frustrated and leave the website without buying.
But to be able to go further than ergonomics-based tests and tackle true marketing questions, we need to invent a successor to the Mann-Whitney test that could estimate the size of the gain or loss generated by the experiment. This would definitely give A/B testing a second wind.