Article

6min read

Mastering Revenue Metrics: Understand the Power and Practical Use of RevenueIQ

Revenue is the cornerstone of any e-commerce business, yet most optimization efforts focus only on improving conversion rates.

Average Order Value (AOV), an equally important driver of revenue, is often overlooked because it’s difficult to measure accurately with standard statistical tools. This gap can lead to missed opportunities and slow decision-making.

RevenueIQ addresses this challenge by providing a robust, reliable way to measure and optimize revenue directly—combining conversion and AOV into a single, actionable metric.

Here’s how RevenueIQ changes the way you approach experimentation and business growth.

Discover how to accurately measure and optimize revenue in your experiments with our patented feature.

The most important KPI in e-commerce is revenue. In an optimization context, this means focusing on two key areas:

  • Conversion Rate (CR): Turning as many visitors as possible into customers.
  • Average Order Value (AOV): Generating as much value as possible per customer.

However, Conversion Rate Optimization (CRO) often remains focused on conversion, while AOV is frequently neglected due to its statistical complexity. Accurately estimating AOV with classic tests (such as the t-test or Mann-Whitney) is challenging because purchase distributions are highly skewed and have no upper bound.

RevenueIQ offers a robust test that directly estimates the distribution of the effect on revenue (through a refined estimation of AOV), providing both the probability of gain (“chance to win”) and consistent confidence intervals.

In benchmarks, RevenueIQ maintains a correct false positive rate, has power close to Mann-Whitney, and produces confidence intervals four times narrower than the t-test. By combining the effects of AOV and CR, it delivers an RPV (Revenue Per Visitor) impact and then an actionable revenue projection.

Curious to learn more details? Please read our RevenueIQ Whitepaper for a full scientific explanation written by our Data Scientist, Hubert Wassner.

Context & Problem

In CRO, we often optimize CR due to a lack of suitable tools for revenue. Yet, Revenue = Visitors × CR × AOV; ignoring AOV distorts the view.

AOV is misleading because:

  • It is unbounded (someone can buy many items).
  • It is highly right-skewed (many small orders, a few very large ones).
  • A few “large and rare” values can dominate the average.
  • In random A/B splits, these large orders can be unevenly distributed, leading to huge variance in observed AOV.

Limitations of Classic Tests

t-test

Assumes normality (or relies on the Central Limit Theorem for the mean). On highly skewed e-commerce data, the CLT variance formula is unreliable at realistic volumes. The result: very low power (detects ~15% of true winners in the benchmark) and very wide confidence intervals, leading to slow and imprecise decisions.

Mann-Whitney (MW)

Robust to non-normality (works on ranks), so much more powerful (~80% detection in the benchmark). But it only provides a p-value (thus only trend information), not an estimate of effect size (no confidence interval), making it impossible to quantify the business case.

RevenueIQ: Principle

It uses and combines two innovative approaches:

  1. Bootstrap Technique: Studies the variability of a measure with unknown statistical behavior.
  2. Basket Difference Measurement: Instead of measuring the difference in average baskets, it measures the average of basket differences. It compares sorted order differences between variants A and B, weighted by density (approximately log-normal) to favor “comparable” pairs, which sidesteps the problem of the very large value differences observed in this kind of data. A rough, illustrative sketch of the underlying bootstrap idea follows this list.
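
The exact estimator and weighting scheme are part of AB Tasty’s patented method and are not reproduced here. Purely as an illustration of the bootstrap idea, the sketch below (with made-up log-normal data and arbitrary parameters) resamples two groups of order values and builds a distribution of the AOV difference, from which a chance to win and a percentile interval can be read.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_aov_effect(orders_a, orders_b, n_boot=5000):
    """Rough illustration only: bootstrap the distribution of the
    AOV difference between two variants (not RevenueIQ's estimator)."""
    orders_a, orders_b = np.asarray(orders_a), np.asarray(orders_b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and record the AOV difference
        sample_a = rng.choice(orders_a, size=orders_a.size, replace=True)
        sample_b = rng.choice(orders_b, size=orders_b.size, replace=True)
        diffs[i] = sample_b.mean() - sample_a.mean()
    chance_to_win = (diffs > 0).mean()                    # P(effect > 0)
    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])   # 95% interval
    return chance_to_win, (ci_low, ci_high)

# Synthetic, heavily skewed order values (log-normal), with a +5 € shift on B
orders_a = rng.lognormal(mean=3.6, sigma=1.0, size=4000)
orders_b = rng.lognormal(mean=3.6, sigma=1.0, size=4000) + 5
print(bootstrap_aov_effect(orders_a, orders_b))
```

The actual method described in the whitepaper resamples weighted, sorted basket differences rather than plain group means.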

RevenueIQ then provides:

  • The Chance to Win (probability that the effect is > 0), which is easy for decision-makers to interpret.
  • Narrow and reliable confidence intervals on the AOV effect as well as on revenue.

Benchmarks (AOV)

  • Alpha validity (on AA tests): Good control of false positives. Using a typical 95% threshold exposes only a 5% false positive risk.
  • Statistical power measurement: 1000 AB tests with a known effect of +€5
    • MW Test: 796/1000 winners, ~80% power.
    • t-test: 146/1000 winners, only ~15% power.
    • RevenueIQ: 793/1000 winners, ~80% power (≈ equivalent to MW).
  • Confidence interval (CI): RevenueIQ produces CIs of €8 width, which is reasonable and functional in the context of a real effect of €5. With an average CI width of €34, the t-test is totally ineffective.
  • CI coverage: The validity of the confidence intervals was verified. A 95% CI indeed has a 95% chance of containing the true effect value (i.e., €0 for AA tests and €5 for AB tests).

From AOV KPI to Revenue

Beyond techniques and formulas, the key point is that RevenueIQ uses a Bayesian method for AOV analysis, allowing this metric to be merged with conversion. Competitors use frequentist methods, at least for AOV, making any combination of results impossible. Under the hood, RevenueIQ combines conversion and AOV results into a central metric: visitor value (RPV). With precise knowledge of RPV, revenue (in € or other currency) is then projected by multiplying by the targeted traffic for a given period.
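
To picture how a Bayesian AOV estimate can be merged with conversion, here is a minimal Monte Carlo sketch. It is an illustration of the principle only, not AB Tasty’s model: the Beta posterior for conversion, the bootstrap-style AOV draws, the counts and the projected traffic are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 10_000

# Conversion: Beta posterior draws (flat prior assumed), made-up counts
visitors_a, conv_a = 50_000, 1_500
visitors_b, conv_b = 50_000, 1_590
cr_a = rng.beta(conv_a + 1, visitors_a - conv_a + 1, n_draws)
cr_b = rng.beta(conv_b + 1, visitors_b - conv_b + 1, n_draws)

# AOV: posterior-like draws, here simple bootstrap means of synthetic orders
orders_a = rng.lognormal(3.6, 1.0, conv_a)
orders_b = rng.lognormal(3.6, 1.0, conv_b) + 4
aov_a = np.array([rng.choice(orders_a, orders_a.size).mean() for _ in range(n_draws)])
aov_b = np.array([rng.choice(orders_b, orders_b.size).mean() for _ in range(n_draws)])

# Combine into revenue per visitor (RPV), then project onto future traffic
rpv_lift = cr_b * aov_b - cr_a * aov_a
future_traffic = 1_000_000                    # assumed traffic for the projection
revenue_impact = rpv_lift * future_traffic
print("Chance to win (revenue):", (rpv_lift > 0).mean())
print("95% interval on projected revenue (€):",
      np.percentile(revenue_impact, [2.5, 97.5]).round(0))
```

The chance to win on revenue and the projected interval are exactly the kind of outputs described in the real case below.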

Real Case (excerpt)

Here is a textbook case for RevenueIQ:

  • The conversion gain has a 92% chance to win (CTW): encouraging, but not “significant” by the standard threshold.
  • AOV gain is at 80% CTW. Similarly, taken separately, this is not enough to declare a winner.
  • The combination of these two metrics gives a CTW of 95.9% for revenue, enabling a simple and immediate decision, where a classic approach would have required additional data collection while waiting for one of the two KPIs (CR or AOV) to become significant.
  • For an advanced business decision, RevenueIQ provides an estimated average gain of +€50k, with a confidence interval [-€6,514; +€107,027], allowing identification of minimal risk and substantial gain.

What This Changes for Experimentation

  • Without RevenueIQ: “inconclusive” results (or endless tests) lead to missed opportunities.
  • With RevenueIQ: Faster, quantified decisions (probability, effect, CI), at the revenue level (RPV then projected revenue).

Practical Recommendations

  • Stop interpreting observed AOV without safeguards: it is highly volatile.
  • Avoid filtering/Winsorizing “extreme values”: arbitrary thresholds ⇒ bias.
  • Measure CR & AOV jointly and reason in RPV to reflect business reality.
  • Use RevenueIQ to obtain chance to win + CI on AOV, RPV, and revenue projection.
  • Decide via projected revenue (average gain, lower CI bound) rather than isolated p-values.

Curious to learn more details? Please read our RevenueIQ Whitepaper for a full scientific explanation written by our Data Scientist, Hubert Wassner.

Conclusion

RevenueIQ brings a robust and quantitative statistical test to monetary metrics (AOV, RPV, revenue), where:

  • t-test is weak and imprecise on e-commerce data,
  • Mann-Whitney is powerful but not quantitative.

RevenueIQ enables faster detection, quantification of business impact, and prioritization of deployments with explicit confidence levels.

Original information can be found by following this link to AB Tasty’s documentation, “Understanding the practical use of RevenueIQ.”

Article

7min read

Is Your Average Order Value (AOV) Misleading You?

Average Order Value (AOV) is a widely used metric in Conversion Rate Optimization (CRO), but it can be surprisingly deceptive. While the formula itself is simple—summing all order values and dividing by the number of orders—the real challenge lies within the data itself.

The problem with averaging

AOV is not a “democratic” measure. A single high-spending customer can easily spend 10 or even 100 times more than your average customer. These few extreme buyers can heavily skew the average, giving a limited number of visitors disproportionate impact compared to hundreds or thousands of others. This is problematic because you can’t truly trust the significance of an observed AOV effect if it’s tied to just a tiny fraction of your audience.

Let’s look at a real dataset to see just how strong this effect can be. Consider the order value distribution:

  • The horizontal axis represents the order value.
  • The vertical axis represents the frequency of that order value.
  • The blue surface is a histogram, while the orange outline is a log-normal distribution approximation.

This graph shows that the most frequent order values are small, around €20. As the order value increases, the frequency of such orders decreases. This is a “long/heavy tail distribution,” meaning very large values can occur, albeit rarely.

A single strong buyer with an €800 order value is worth 40 times more than a frequent buyer when looking at AOV. This is an issue because a slight change in the behavior of 40 visitors is a stronger indicator than a large change from one unique visitor. While not fully visible on this scale, even more extreme buyers exist. 

The next graph, using the same dataset, illustrates this better:

  • The horizontal axis represents the size of the growing dataset of order values (roughly indicating time).
  • The vertical axis represents the maximum order value (in €) observed in the growing dataset.

At the beginning of data collection, the maximum order value is quite small (close to the most frequent value of ~€20). However, we see that it grows larger as time passes and the dataset expands. With a dataset of 10,000 orders, the maximum order value can exceed €5,000. This means any buyer with an order above €5,000 (they might have multiple) holds 250 times the power of a frequent buyer at €20. At the maximum dataset size, a single customer with an order over €20,000 can influence the AOV more than 2,000 other customers combined.
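
A quick simulation makes this concrete. The sketch below (the log-normal parameters are arbitrary assumptions, not fitted to the dataset above) draws synthetic order values and tracks how the maximum grows as the dataset expands.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic long-tailed order values (log-normal); parameters are arbitrary
orders = rng.lognormal(mean=3.2, sigma=1.1, size=50_000)
running_max = np.maximum.accumulate(orders)

for n in (100, 1_000, 10_000, 50_000):
    print(f"after {n:>6} orders: max = {running_max[n - 1]:8.0f} €, "
          f"median = {np.median(orders[:n]):5.0f} €")
```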

When looking at your e-commerce metrics, AOV should not be used as a standalone decision-making metric.

The challenge of AB Test splitting

The problem intensifies when considering the random splits used in A/B tests.

Imagine you have only 10 very large spenders whose collective impact equals that of 10,000 medium buyers. There’s a high probability that the random split for such a small group of users will be uneven. While the overall dataset split is statistically even, the disproportionate impact of these high spenders on AOV requires specific consideration for this small segment. Since you can’t predict which visitor will become a customer or how much they will spend, you cannot guarantee an even split of these high-value users.

This phenomenon can artificially inflate or deflate AOV in either direction, even without a true underlying effect, simply depending on which variation these few high spenders land on.
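
The same kind of simulation illustrates the splitting problem: even with no true difference at all, the observed AOV gap between two random halves swings widely, driven by where the few very large orders land. Everything below is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

gaps = []
for _ in range(1_000):                                        # 1,000 simulated A/A splits
    orders = rng.lognormal(mean=3.2, sigma=1.1, size=4_000)   # no true effect at all
    assignment = rng.integers(0, 2, size=orders.size)         # random 50/50 split
    gaps.append(orders[assignment == 1].mean() - orders[assignment == 0].mean())

gaps = np.array(gaps)
print("Observed AOV gap with zero true effect:")
print("  typical size (std):", gaps.std().round(2), "€")
print("  extreme splits (1st/99th pct):", np.percentile(gaps, [1, 99]).round(2), "€")
```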

What’s the solution?

If AOV is such an unreliable metric, how can we work with it effectively? The answer is similar to how you approach conversion rates and experimentation.

You don’t trust raw conversion data—one more conversion on variation B doesn’t automatically make it a winner, nor do 10 or 100. Instead, you rely on a statistical test to determine when a difference is significant. The same principle applies to AOV. Tools like AB Tasty offer the Mann-Whitney test, a statistical method robust against extreme values and well-suited for long-tail distributions.
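
If you have access to the raw order values, the same test is available off the shelf in standard libraries. The sketch below runs SciPy’s Mann-Whitney U test on two synthetic log-normal samples; the data and the +€5 shift are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)

orders_a = rng.lognormal(mean=3.2, sigma=1.1, size=3_000)       # control
orders_b = rng.lognormal(mean=3.2, sigma=1.1, size=3_000) + 5   # assumed +5 € shift

stat, p_value = mannwhitneyu(orders_a, orders_b, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.0f}, p-value = {p_value:.4f}")
# A small p-value signals a shift between the two distributions,
# but says nothing about the size of the AOV effect.
```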

AOV behavior can be confusing because you’re likely accustomed to the more intuitive statistics of conversion rates. Conversion data and their corresponding statistics usually align; a statistically significant increase in conversion rate typically means a visibly large difference in the number of conversions, consistent with the statistical test. However, this isn’t always the case with AOV. It’s not uncommon to see the AOV trend and the statistical results pointing in different directions. Your trust should always be placed in the statistical test.

The root cause: Heavy tail distributions

You now understand that the core issue stems from the unique shape of order value distributions: long-tail distributions that produce rare, extreme values.

It’s important to note that the problem isn’t just the existence of extreme values. If these extreme values were frequent, the AOV would naturally be higher, and their impact would be less dramatic because the difference between the AOV and these values would be smaller. Similarly, for the splitting problem, a larger number of extreme values would ensure a more even split.

At this point, you might think your business has a different order distribution shape and isn’t affected. However, this shape emerges whenever these two conditions are met:

  • You have a price list with more than several dozen different values.
  • Visitors can purchase multiple products at once.

Needless to say, these conditions are ubiquitous and apply to nearly every e-commerce business. The e-commerce revolution itself was fueled by the ability to offer vast catalogues.

Furthermore, the presence of shipping costs naturally encourages users to group their purchases to minimize those costs, reinforcing the effect. The only exceptions are subscription-based businesses with limited pricing options, where most purchases are for a single service.

Here’s a glimpse into the order value distribution across various industries, demonstrating the pervasive nature of the “long tail distribution”:

  • Cosmetics
  • Transportation
  • B2B packaging (selling packaging for e-commerce)
  • Fashion
  • Online flash sales

AOV, despite its simple definition and apparent ease of understanding, is a misleading metric. Its magnitude is easy to grasp, leading people to confidently make intuitive decisions based on its fluctuations. However, the reality is far more complex; AOV can show dramatic changes even when there’s no real underlying effect.

Conversely, significant changes can go unnoticed. A strong negative effect could be masked by just a few high-spending customers landing in a poorly performing variation. So, now you know: just as you do for conversion rates, rely on statistical tests for your AOV decisions.

Article

6min read

Minimal Detectable Effect: The Essential Ally for Your A/B Tests

In CRO (Conversion Rate Optimization), a common dilemma is not knowing what to do with a test that shows a small and non-significant gain. 

Should we declare it a “loser” and move on? Or should we collect more data in the hope that it will reach the set significance threshold? 

Unfortunately, we often make the wrong choice, influenced by what is called the “sunk cost fallacy.” We have already put so much energy into creating this test and waited so long for the results that we don’t want to stop without getting something out of this work. 

However, CRO’s very essence is experimentation, which means accepting that some experiments will yield nothing. Yet, some of these failures could be avoided before even starting, thanks to a statistical concept: the MDE (Minimal Detectable Effect), which we will explore together.

MDE: The Minimal Detectable Threshold

In statistical testing, samples have always been valuable, perhaps even more so in surveys than in CRO. Indeed, conducting interviews to survey people is much more complex and costly than setting up an A/B test on a website. Statisticians have therefore created formulas that link the main parameters of an experiment for planning purposes:

  • The number of samples (or visitors) per variation
  • The baseline conversion rate
  • The magnitude of the effect we hope to observe

This allows us to estimate the cost of collecting samples. The problem is that, among these three parameters, only one is known: the baseline conversion rate.

We don’t really know the number of visitors we’ll send per variation. It depends on how much time we allocate to data collection for this test, and ideally, we want it to be as short as possible. 

Finally, the conversion gain we will observe at the end of the experiment is certainly the biggest unknown, since that’s precisely what we’re trying to determine.

So, how do we proceed with so many unknowns? The solution is to estimate what we can using historical data. For the others, we create several possible scenarios:

  • The number of visitors can be estimated from past traffic, and we can make projections in weekly blocks.
  • The conversion rate can also be estimated from past data.
  • For each scenario configuration from the previous parameters, we can calculate the minimal conversion gains (MDE) needed to reach the significance threshold.

For example, with traffic of 50,000 visitors and a conversion rate of 3% (measured over 14 days), here’s what we get:

MDE Uplift
  • The horizontal axis indicates the number of days.
  • The vertical axis indicates the MDE corresponding to the number of days.

The leftmost point of the curve tells us that if we achieve a 10% conversion gain after 14 days, then this test will be a winner, as this gain can be considered significant. Typically, it will have a 95% chance of being better than the original. If we think the change we made in the variation has a chance of improving conversion by ~10% (or more), then this test is worth running, and we can hope for a significant result in 14 days.

On the other hand, if the change is minor and the expected gain is less than 10%, then 14 days will not be enough. To find out more, we move the curve’s slider to the right. This corresponds to adding days to the experiment’s duration, and we then see how the MDE evolves. Naturally, the MDE curve decreases: the more data we collect, the more sensitive the test becomes to smaller effects.

For example, by adding another week, making it a 21-day experiment, we see that the MDE drops to 8.31%. Is that sufficient? If so, we can validate the decision to create this experiment.

MDE Graph

If not, we continue to explore the curve until we find a value that matches our objective. Continuing along the curve, we see that a gain of about 5.44% would require waiting 49 days.

Minimum Detectable Uplift Graph

That’s the time needed to collect enough data to declare this gain significant. If that’s too long for your planning, you’ll probably decide to run a more ambitious test to hope for a bigger gain, or simply not do this test and use the traffic for another experiment. This will prevent you from ending up in the situation described at the beginning of this article, where you waste time and energy on an experiment doomed to fail.
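
For readers who want to reproduce this kind of curve, the classic two-proportion approximation below gives figures close to the ones quoted above, assuming the 50,000 visitors are counted per variation over 14 days, a two-sided 5% significance level and 80% power. AB Tasty’s MDE calculator may use slightly different conventions, so treat this as a sketch rather than a reference implementation.

```python
from math import sqrt
from scipy.stats import norm

def relative_mde(visitors_per_variation, baseline_cr, alpha=0.05, power=0.80):
    """Approximate relative MDE for a two-proportion test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    se = sqrt(2 * baseline_cr * (1 - baseline_cr) / visitors_per_variation)
    return (z_alpha + z_power) * se / baseline_cr

daily_visitors = 50_000 / 14          # assumed: 50k visitors per variation per 14 days
for days in (14, 21, 49):
    print(f"{days:>2} days -> MDE ≈ {relative_mde(daily_visitors * days, 0.03):.1%}")
```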

From MDE to MCE

Another approach to MDE is to see it as MCE: Minimum Caring Effect. 

This doesn’t change the methodology except for the meaning you give to the definition of your test’s minimal sensitivity threshold. So far, we’ve considered it as an estimate of the effect the variation could produce. But it can also be interesting to consider the minimal sensitivity based on its operational relevance: the MCE. 

For example, imagine you can quantify the development and deployment costs of the variation and compare it to the conversion gain over a year. You could then say that an increase in the conversion rate of less than 6% would take more than a year to cover the implementation costs. So, even if you have enough traffic for a 6% gain to be significant, it may not have operational value, in which case it’s pointless to run the experiment beyond the duration corresponding to that 6%.

MDE graph

In our case, we can therefore conclude that it’s pointless to go beyond 42 days of experimentation because beyond that duration, if the measured gain isn’t significant, it means the real gain is necessarily less than 6% and thus has no operational value for you.

Conclusion

AB Tasty’s MDE calculator feature will allow you to know the sensitivity of your experimental protocol based on its duration. It’s a valuable aid when planning your test roadmap. This will allow you to make the best use of your traffic and resources.

Looking for a free and minimalistic MDE calculator to try? Check out our free Minimal Detectable Effect calculator here.

Article

4min read

Transaction Testing With AB Tasty’s Report Copilot

Transaction testing, which focuses on increasing the rate of purchases, is a crucial strategy for boosting your website’s revenue. 

To begin, it’s essential to differentiate between conversion rate (CR) and average order value (AOV), as they provide distinct insights into customer behavior. Understanding these metrics helps you implement meaningful changes to improve transactions.

In this article, we’ll delve into the complexities of transaction metrics analysis and introduce our new tool, the “Report Copilot,” designed to simplify report analysis. Read on to learn more.

Transaction Testing

To understand how test variations impact total revenue, focus on two key metrics:

  • Conversion Rate (CR): This metric indicates whether sales are increasing or decreasing. Tactics to improve CR include simplifying the buying process, adding a “one-click checkout” feature, using social proof, or creating urgency through limited inventory.
  • Average Order Value (AOV): This measures how much each customer is buying. Strategies to enhance AOV include cross-selling or promoting higher-priced products.

By analyzing CR and AOV separately, you can pinpoint which metrics your variations impact and make informed decisions before implementation. For example, creating urgency through low inventory may boost CR but could reduce AOV by limiting the time users spend browsing additional products. After analyzing these metrics individually, evaluate their combined effect on your overall revenue.

Revenue Calculation

The following formula illustrates how CR and AOV influence revenue:

Revenue = Number of Visitors × Conversion Rate × AOV

In the first part of the equation (Number of Visitors × Conversion Rate), you determine how many visitors become customers. The second part (× AOV) calculates the total revenue from these customers.

Consider these scenarios:

  • If both CR and AOV increase, revenue will rise.
  • If both CR and AOV decrease, revenue will fall.
  • If either CR or AOV increases while the other remains stable, revenue will increase.
  • If either CR or AOV decreases while the other remains stable, revenue will decrease.
  • Mixed changes in CR and AOV result in unpredictable revenue outcomes.

The last scenario, where CR and AOV move in opposite directions, is particularly complex due to the variability of AOV. Current statistical tools struggle to provide precise insights on AOV’s overall impact, as it can experience significant random fluctuations. For more on this, read our article “Beyond Conversion Rate.”
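
To make the mixed scenario concrete, here is a tiny worked sketch; the traffic and metric values are made up. Computing revenue with the formula above shows that a CR gain combined with an AOV drop can net out either way, depending on the relative size of the two changes.

```python
def revenue(visitors, conversion_rate, aov):
    """Revenue = Number of Visitors × Conversion Rate × AOV."""
    return visitors * conversion_rate * aov

visitors = 100_000
baseline = revenue(visitors, 0.030, 80.0)       # CR 3.0%, AOV €80

variation_1 = revenue(visitors, 0.033, 76.0)    # CR +10%, AOV -5%  -> net positive
variation_2 = revenue(visitors, 0.033, 68.0)    # CR +10%, AOV -15% -> net negative

for name, value in [("baseline", baseline),
                    ("CR +10% / AOV -5%", variation_1),
                    ("CR +10% / AOV -15%", variation_2)]:
    print(f"{name:<20} {value:>10,.0f} €")
```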

While these concepts may seem intricate, our goal is to simplify them for you. Recognizing that this analysis can be challenging, we’ve created the “Report Copilot” to automatically gather and interpret data from variations, offering valuable insights.

Report Copilot

The “Report Copilot” from AB Tasty automates data processing, eliminating the need for manual calculations. This tool empowers you to decide which tests are most beneficial for increasing revenue.

Here are a few examples from real use cases.

Winning Variation:

The left screenshot provides a detailed analysis, helping users draw conclusions about their experiment results. Experienced users may prefer the summarized view on the right, also available through the Report Copilot.

Complex Use Case:


The screenshot above demonstrates a case where CR and OAV have opposite trends and need a deeper understanding of the context.

It’s important to note that the Report Copilot doesn’t make decisions for you; it highlights the most critical parts of your analysis, allowing you to make informed choices.

Conclusion

Transaction analysis is complex, requiring a breakdown of components like conversion rate and average order value to better understand their overall effect on revenue. 

We’ve developed the Report Copilot to assist AB Tasty users in this process. This feature leverages AB Tasty’s extensive experimentation dashboard to provide comprehensive, summarized analyses, simplifying decision-making and enhancing revenue strategies.

Article

5min read

Mutually Exclusive Experiments: Preventing the Interaction Effect

What is the interaction effect?

If you’re running multiple experiments at the same time, you may find their interpretation to be more difficult because you’re not sure which variation caused the observed effect. Worse still, you may fear that the combination of multiple variations could lead to a bad user experience.

It’s easy to imagine a negative cumulative effect of two visual variations. For example, if one variation changes the background color, and another modifies the font color, it may lead to illegibility. While this result seems quite obvious, there may be other negative combinations that are harder to spot.

Imagine launching an experiment that offers a price reduction for loyal customers, whilst in parallel running another that aims to test a promotion on a given product. This may seem like a non-issue until you realize that there’s a general rule applied to all visitors, which prohibits cumulative price reductions – leading to a glitch in the purchase process. When the visitor expects two promotional offers but only receives one, they may feel frustrated, which could negatively impact their behavior.

What is the level of risk?

With the previous examples in mind, you may think that such issues could be easily avoided. But it’s not that simple. Building several experiments on the same page becomes trickier when you consider code interaction, as well as interactions across different pages. So, if you’re interested in running 10 experiments simultaneously, you may need to plan ahead.

A simple solution would be to run these tests one after the other. However, this strategy is very time consuming, as your typical experiment requires two weeks to be performed properly in order to sample each day of the week twice.

It’s not uncommon for a large company to have 10 experiments in the pipeline, and running them sequentially would take at least 20 weeks. A better solution is to handle the traffic allocated to each test in a way that renders the experiments mutually exclusive.

This may sound similar to a multivariate test (MVT), except the goal of an MVT is almost the opposite: to find the best interaction between unitary variations.

Let’s say you want to explore the effect of two variation ideas: text color and background color. The MVT will compose all combinations of the two and expose them simultaneously to isolated chunks of the traffic. The isolation part sounds promising, but “all combinations” is exactly what we’re trying to avoid: the problematic combination of matching text and background colors will necessarily occur. So an MVT is not the solution here.

Instead, we need a specific feature: A Mutually Exclusive Experiment.

What is a Mutually Exclusive Experiment (M2E)?

AB Tasty’s Mutually Exclusive Experiment (M2E) feature enacts an allocation rule that blocks visitors from entering selected experiments depending on the previous experiments already displayed. The goal is to ensure that no interaction effect can occur when a risk is identified.
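
Purely as an illustration of the concept (this is not AB Tasty’s implementation), mutually exclusive allocation can be pictured as deterministic bucketing: each visitor is hashed into one slot of an exclusion group, and only the experiment owning that slot is eligible for that visitor. The function and experiment names below are hypothetical.

```python
import hashlib

def assign_exclusive(visitor_id: str, group_name: str, experiments: list) -> str:
    """Illustrative only: deterministically map a visitor to exactly one
    experiment within a mutually exclusive group."""
    digest = hashlib.sha256(f"{group_name}:{visitor_id}".encode()).hexdigest()
    return experiments[int(digest, 16) % len(experiments)]

promo_group = ["loyalty_discount_test", "product_promo_test"]   # hypothetical names
for visitor in ("v-001", "v-002", "v-003", "v-004"):
    print(visitor, "->", assign_exclusive(visitor, "promotions", promo_group))
```

Because the assignment is a pure function of the visitor ID and the group name, a returning visitor always lands in the same experiment, which is what keeps the experiments mutually exclusive.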

How and when should we use Mutually Exclusive Experiments?

We don’t recommend setting up all experiments to be mutually exclusive because it reduces the number of visitors for each experiment. This means it will take longer to achieve significant results and the detection power may be less effective.

The best process is to identify the different kinds of interactions you may have and compile them in a list. If we continue with the cumulative promotion example from earlier, we could create two M2E lists: one for user interface experiments and another for customer loyalty programs. This strategy will avoid negative interactions between experiments that are likely to overlap, but doesn’t waste traffic on hypothetical interactions that don’t actually exist between the two lists.

What about data quality?

With the help of an M2E, we have prevented any functional issues that may arise due to interactions, but you might still have concerns that the data could be compromised by subtle interactions between tests.

Would an upstream winning experiment induce false discoveries in downstream experiments? Alternatively, would a bad upstream experiment make you miss an otherwise winning downstream experiment? Here are some points to keep in mind:

  • Remember that roughly eight tests out of 10 are neutral (show no effect), so most of the time you can’t expect an interaction effect – if no effect exists in the first place.
  • In the case where an upstream test has an effect, the affected visitors will still be randomly assigned to the downstream variations. This evens out the effect, allowing the downstream experiment to correctly measure its potential lift. It’s interesting to note that the average conversion rate following an impactful upstream test will be different, but this does not prevent the downstream experiment from correctly measuring its own impact.
  • Remember that the statistical test is there to account for any drift in the random split process. The drift we’re referring to here is the possibility that more of the visitors impacted by the upstream test end up in a given variation, creating the illusion of an effect on the downstream test. The gain probability estimate and the confidence interval around the measured effect already inform you that there is some randomness in the process. In fact, the upstream test is just one example among a long list of possible interfering factors – such as visitors using different computers, different connection quality, etc.

All of these theoretical explanations are supported by an empirical study from the Microsoft Experiment Platform team. This study reviewed hundreds of tests on millions of visitors and saw no significant difference between effects measured on visitors that saw just one test and visitors that saw an additional upstream test.

Conclusion

While experiment interaction is possible in a specific context, there are preventative measures that you may take to avoid functional loss. The most efficient solution is the Mutually Exclusive Experiment, allowing you to eliminate the functional risks of simultaneous experiments, make the most of your traffic and expedite your experimentation process.

References:

https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/a-b-interactions-a-call-to-relax/

 

Article

13min read

Frequentist vs Bayesian Methods in A/B Testing

In A/B testing, there are two main ways of interpreting test results: Frequentist vs Bayesian.

These terms refer to two different inferential statistical methods. Debates over which is ‘better’ are fierce – and at AB Tasty, we know which method we’ve come to prefer.

If you’re shopping for an A/B testing vendor, new to A/B testing or just trying to better interpret your experiment’s results, it’s important to understand the logic behind each method. This will help you make better business decisions and/or choose the best experimentation platform.

In this article, we discuss these two statistical methods under the inferential statistics umbrella, compare and contrast their strong points and explain our preferred method of measurement.

What is inferential statistics?

Both Frequentist and Bayesian methods are under the umbrella of inferential statistics.

As opposed to descriptive statistics (which describes purely past events), inferential statistics try to infer or forecast future events.

Would version A or version B have a better impact on X KPI?

Side note: If we want to geek out, technically inferential statistics isn’t really forecasting in a temporal sense, but extrapolating what will happen when we apply results to a larger pool of participants. 

What happens if we apply winning version B to my entire website audience? There’s a notion of ‘future’ events in that we need to actually implement version B tomorrow, but in the strictest sense, we’re not using statistics to ‘predict the future.’

For example, let’s say you were really into Olympic sports, and you wanted to learn more about the men’s swimming team. Specifically, how tall are they? Using descriptive statistics, you could determine some interesting facts about ‘the sample’ (aka the team):

  • The average height of the sample
  • The spread of the sample (variance)
  • How many people are below or above the average
  • Etc.

This might fit your immediate needs, but the scope is pretty limited.

What inferential statistics allows you to do is to draw conclusions about populations that are too big to study in a descriptive way, based on smaller samples.

If you were interested in knowing the average height of all men on the planet, it wouldn’t be possible to go and collect all that data. Instead, you can use inferential statistics to infer that average from different, smaller samples.

Two ways of inferring this kind of information through statistical analysis are the Frequentist and Bayesian methods.

What is the Frequentist statistics method in A/B testing? 

The Frequentist approach is perhaps more familiar to you since it’s more frequently used by A/B testing software (pardon the pun). This method also makes an appearance in college-level stats classes.

This approach is designed to make a decision about a unique experiment.

With the Frequentist approach, you start with the hypothesis that there is no difference between test versions A and B. And at the end of your experiment, you’ll end up with something called a P-Value (probability value).

The P-Value is the probability of obtaining results at least as extreme as the observed results assuming that there is no (real) difference between the experiments.

In practice, the P-Value is often read as “the probability that there is no difference between your two versions.” Strictly speaking, that is a shorthand rather than the formal definition above, but it explains why the value is often “inverted” with the basic formula p = 1-pValue, in order to express a probability that there is a difference.

The smaller the P-Value, the stronger the evidence against the starting hypothesis that there is no difference between your two versions.
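
As a concrete illustration of the frequentist workflow, the sketch below runs a standard two-proportion z-test on made-up conversion counts; the numbers are assumptions, and real platforms may rely on different variants of the test.

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up results: conversions and visitors for versions A and B
conversions = [1_200, 1_290]
visitors = [40_000, 40_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors,
                                    alternative="two-sided")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
# Convention: declare the difference significant if p_value < 0.05.
```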

Frequentist pros:

  • Frequentist models are available in any statistic library for any programming language.
  • The computation of frequentist tests is blazing fast.

Frequentist cons:

  • You only estimate the P-Value at the end of a test, not during. ‘Data peeking’ before a test has ended generates misleading results because it actually becomes several experiments (one experiment each time you peek at the data), whereas the test is designed for one unique experiment.
  • You can’t know the actual gain interval of a winning variation – just that it won.

What is the Bayesian statistics method in A/B testing?

The Bayesian approach looks at things a little differently.

We can trace it back to a charming British mathematician, Thomas Bayes, and his eponymous Bayes’ Theorem.

Bayes Theorem

The Bayesian approach allows for the inclusion of prior information (‘a prior’) into your current analysis. The method involves three overlapping concepts:

  • Prior – the information you have from a previous experiment. At the beginning of the experiment, we use a ‘non-informative’ prior (think ‘empty’).
  • Evidence – the data of the current experiment.
  • Posterior – the updated information you obtain by combining the prior and the evidence. This is what the Bayesian analysis produces.

By design, this test can be used for an ongoing experiment. When data peeking, the ‘peeked at data’ can be seen as a prior, and the future incoming data will be the evidence, and so on.

This means ‘data peeking’ naturally fits in the test design. So at each ‘data peeking,’ the posterior computed by the Bayesian analysis is valid.

Crucially for A/B testing in a business setting, the Bayesian approach allows the CRO practitioner to estimate the gain of a winning variation – more on that later.

Bayesian pros:

  • Allows you to ‘peek’ at the data during a test, so you can either stop sending traffic if a variation is tanking or switch earlier to a clear winner.
  • Allows you to see the actual gain of a winning test.
  • By its nature, it often prevents you from implementing false positives.

Bayesian cons:

  • Needs a sampling loop, which takes a non-negligible CPU load.  This is not a concern at the user level, but could potentially gum things up at scale.

Bayesian vs Frequentist: which is better?

So, which method is the ‘better’ method?

Let’s start with the caveat that both are perfectly legitimate statistical methods. But at AB Tasty, our customer experience optimization and feature management software, we have a clear preference for the Bayesian A/B testing approach. Why?

Gain size

One very strong reason is because with Bayesian statistics, you can estimate a range of the actual gain of a winning variation, instead of only knowing that it was the winner, full stop.

In a business setting, this distinction is crucial. When you’re running your A/B test, you’re really deciding whether to switch from variation A to variation B, not whether you choose A or B from a blank slate. You therefore need to consider:

  • The implementation cost of switching to variation B (time, resources, budget)
  • Additional associated costs of variation B (vendor costs, licenses…)

As an example, let’s say you’re a B2B software vendor, and you ran an A/B test on your pricing page. Variation B included a chatbot, whereas version A didn’t. Variation B outperformed variation A, but to implement variation B, you’ll need 2 weeks of developer time to integrate your chatbot into your lead workflow, plus allocate X dollars of marketing budget to pay for the monthly chatbot license.

You need to be sure the math adds up, and that it’s more cost-effective to switch to version B when these costs are weighed against the size of the test gain. A Bayesian A/B testing approach will let you do that.

Let’s take a look at an example from the AB Tasty reporting dashboard.

In this fictional test, we’re measuring three variations against an original, with ‘CTA clicks’ as our KPI.

AB Tasty reporting

We can see that variation 2 looks like the clear winner, with a conversion rate of 34.5%, compared to the original of 25%. But by looking to the right, we also get the confidence interval of this gain. In other words, a best and worst-case scenario.

The median gain for version 2 is 36.4%, with the lowest possible gain being +2.25% and the highest being +48.40%.

These are the lowest and the highest gain markers you can achieve in 95% of cases.

If we break it down even further (a minimal sketch of how such figures can be computed follows this breakdown):

  • There’s a 50% chance of the gain percentage lying above 36.4% (the median)
  • There’s a 50% chance of it lying below 36.4%.
  • In 95% of cases, the gain will lie between +2.25% and +48.40%.
  • There remains a 2.5% chance of the gain lying below 2.25% (our famous false positive) and a 2.5% chance of it lying above 48.40%.
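
Figures like these can be reproduced in spirit with a simple Monte Carlo over Beta posteriors. The sketch below uses made-up visitor and conversion counts (so the numbers will not match the screenshot) and illustrates the Bayesian logic rather than AB Tasty’s exact model.

```python
import numpy as np

rng = np.random.default_rng(7)
n_draws = 100_000

# Made-up counts: roughly 25% vs 34.5% conversion on 2,000 visitors per variation
visitors_o, conv_o = 2_000, 500     # original
visitors_2, conv_2 = 2_000, 690     # variation 2

# Beta posteriors with a flat Beta(1, 1) prior
cr_o = rng.beta(conv_o + 1, visitors_o - conv_o + 1, n_draws)
cr_2 = rng.beta(conv_2 + 1, visitors_2 - conv_2 + 1, n_draws)

relative_gain = (cr_2 - cr_o) / cr_o
print("Chance to win:", (relative_gain > 0).mean())
print("Median gain:", round(float(np.median(relative_gain)), 3))
print("95% credible interval:", np.percentile(relative_gain, [2.5, 97.5]).round(3))
```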

This level of granularity can help you decide whether to roll out a winning test variation across your site.

Are both the lowest and highest ends of your gain markers positive? Great!

Is the interval small, i.e. you’re quite sure of this high positive gain? It’s probably the right decision to implement the winning version.

Is your interval wide but implementation costs are low? No harm in going ahead there, too.

However, if your interval is large and the cost of implementation is significant, it’s probably best to wait until you have more data to shrink that interval. At AB Tasty we generally recommend that you:

  • Wait until you have recorded at least 5,000 unique visitors per variation
  • Let the test run for at least 14 days (two business cycles)
  • Wait until you have reached 300 conversions on the main goal.

Data peeking

Another advantage of Bayesian statistics is that it’s ok for you to ‘peek’ at your data’s results during a test (but be sure not to overdo it…).

Let’s say you’re working for a giant e-commerce platform and you’re running an A/B test involving a new promotional offer. If you notice that version B is performing abysmally – losing you big money – you can stop it immediately!

Conversely, if your test is outperforming, you can switch all of your website traffic to the winning version earlier than if you were relying on the Frequentist method.

This is precisely the logic behind our Dynamic Traffic Allocation feature – and it wouldn’t be possible without Mr. Thomas Bayes.

Dynamic Traffic Allocation

If we pause quickly on the topic of Dynamic Traffic Allocation, we’ll see that it’s particularly useful in business settings or contexts that are volatile or time-limited.

Dynamic Traffic Allocation option in the AB Tasty Interface.

Essentially, (automated) Dynamic Traffic Allocation strikes the balance between data exploitation and exploration.

The test data is ‘explored’ rigorously enough to be confident in the conclusion, and ‘exploited’ early enough so as to not lose out on conversions (or whatever your primary KPI is) unnecessarily. Note that this isn’t manual – a real live person is not interpreting these results and deciding to go or not to go.

Instead, an algorithm is going to make the choice for you, automatically.

In practice, for AB Tasty clients, this means checking the associated box and picking your primary KPI. The platform’s algorithm will then make the determination of if or when to send the majority of your traffic to a winning variation, once it’s determined.

This kind of approach is particularly useful:

  • When optimizing micro-conversions over a short time period
  • When the time span of the test is short (for example, during a holiday sales promotion)
  • When your target page doesn't get a lot of traffic
  • When you're testing 6+ variations

Though you’ll want to pick and choose when to go for this option, it’s certainly a handy one to have in your back pocket.

Want to start A/B testing on your website with a platform that leverages the Bayesian method? AB Tasty is a great example of an A/B testing tool that allows you to quickly set up tests with low code implementation of front-end or UX changes on your web pages, gather insights via an ROI dashboard, and determine which route will increase your revenue.

False Positives

In Bayesian statistics, like with Frequentist methods, there is a risk of what’s called a false positive.

A false positive, as you might guess, is when a test result indicates a variation shows an improvement when in reality it doesn’t.

It’s often the case with false positives that version B gives the same results as version A (not that it performs inadequately compared to version A).

While by no means innocuous, false positives certainly aren’t a reason to abandon A/B testing. Instead, you can adjust your confidence interval to fit the risk associated with a potential false positive.

Gain probability using Bayesian statistics

You’ve probably heard of the 95% gain probability rule of thumb.

In other words, you consider that your test is statistically significant when you’ve reached a 95% certainty level. You’re 95% sure your version B is performing as indicated, but there’s still a 5% risk that it isn’t.

For many marketing campaigns, this 95% threshold is probably sufficient. But if you’re running a particularly important campaign with a lot at stake, you can adjust your gain probability threshold to be even more exact – 97%, 98% or even 99%, practically ruling out the potential for a false positive.

While this seems like a safe bet – and it is the right choice for high-stakes campaigns – it’s not something to apply across the board.

This is because:

  • In order to attain this higher threshold, you’ll have to wait longer for results, therefore leaving you less time to reap the rewards of a positive outcome.
  • You will implicitly only get a winner with a bigger gain (which is rarer), and you will let go of smaller improvements that still could be impactful.
  • If you have a smaller amount of traffic on your web page, you may want to consider a different approach.

Bayesian tests limit false positives

Another thing to keep in mind is that because the Bayesian approach provides a gain interval – and because false positives virtually always show only a slight apparent improvement – you’re unlikely to implement a false positive in the first place.

A common scenario would be that you run an A/B test to test whether a new promotional banner design increases CTA click-through rates.

Your result says version B performs better with a 95% gain probability but that the gain is minuscule (1% median improvement). Were this a false positive, you’re unlikely to deploy the version B promotional banner across your website, since the resources needed to implement it wouldn’t be worth such a minimal gain.

But, since a Frequentist approach doesn’t provide the gain interval, you might be more tempted to put in place the false positive. While this wouldn’t be the end of the world – version B likely performs the same as version A – you would be spending time and energy on a modification that won’t bring you any added return.

Bottom line? If you play it too safe and wait for a confidence level that’s too high, you’ll miss out on a series of smaller gains, which is also a mistake.

Wrapping up: Frequentist vs Bayesian

So, which is better, Frequentist or Bayesian?

As we mentioned earlier, both approaches are perfectly sound statistical methods.

But at AB Tasty, we’ve opted for the Bayesian approach, since we think it helps our clients make even better business decisions on their web experiments.

It also allows for more flexibility and helps maximize returns (Dynamic Traffic Allocation). As for false positives, these can occur whether you go with a Frequentist or Bayesian approach – though you’re less likely to fall for one with the Bayesian approach.

At the end of the day, if you’re shopping for an A/B testing platform, you’ll want to find one that gives you easily interpretable results that you can rely on.

Article

6min read

The Truth Behind the 14-Day A/B Test Period

The A/B testing method involves a simple process: create two variations, expose them to your customer, collect data, and analyze the results with a statistical formula. 

But, how long should you wait before collecting data? With 14 days being standard practice, let’s find out why as well as any exceptions to this rule.

Why 14 days?

To answer this question we need to understand what we are fundamentally doing. We are collecting current data within a short window, in order to forecast what could happen in the future during a more extended period. To simplify this article, we will only focus on explaining the rules that relate to this principle. Other rules do exist, which mostly correlate to the number of visitors, but this can be addressed in a future article.

The forecasting strategy relies on the collected data containing samples of all event types that may be encountered in the future. This is impossible to fulfill in practice, as periods like Christmas or Black Friday are exceptional events relative to the rest of the year. So let’s focus on the most common period and set aside these special events that merit their own testing strategies.

If the future we are considering relates to “normal” times, our constraint is to sample each day of the week uniformly, since people do not behave the same on different days. Simply look at how your mood and needs shift between weekdays and weekends. This is why a data sampling period must include entire weeks, to account for fluctuations between the days of the week. Likewise, if you sample eight days for example, one day of the week will have a doubled impact, which doesn’t realistically represent the future either.

This partially explains the two-week sampling rule, but why not a longer or shorter period? Since one week covers all the days of the week, why isn’t it enough? To understand, let’s dig a little deeper into the nature of conversion data, which has two dimensions: visits and conversions.

  • Visits: as soon as an experiment is live, every new visitor increments the number of visits.
  • Conversions: as soon as an experiment is live, every new conversion increments the number of conversions.

It sounds pretty straightforward, but there is a twist: statistical formulas work with the concept of success and failure. The definition is quite easy at first: 

  • Success: the number of visitors that did convert.
  • Failures: the number of visitors that didn’t convert.

At any given time a visitor may be counted as a failure, but this could change a few days later if they convert, or the visit may remain a failure if the conversion didn’t occur. 

So consider these two opposing scenarios: 

  • A visitor begins his buying journey before the experiment starts. During the first days of the experiment he comes back and converts. This would be counted as a “success”, but in fact he may not have had time to be impacted by the variation because the buying decision was made before he saw it. The problem is that we are potentially counting a false success: a conversion that could have happened without the variation.
  • A visitor begins his buying journey during the experiment, so he sees the variation from the beginning, but doesn’t make a final decision before the end of the experiment – finally converting after it finishes. We missed this conversion from a visitor who saw the variation and was potentially influenced by it.

These two scenarios may cancel each other out since they have opposite results, but that is only true if the sample period exceeds the usual buying journey time. Consider a naturally long conversion journey, like buying a house, measured within a very short experiment period of one week. Clearly, no visitors beginning the buying journey during the experiment period would have time to convert. The conversion rates of these visitors would be artificially close to zero – no proper measurements could be done in this context. In fact, the only conversions you would see are the ones from visitors that began their journey before the variation even existed. Therefore, the experiment would not be measuring the impact of the variation.

The delay between exposure to the variation and the conversion therefore distorts the measured conversion rate. In order to mitigate this problem, the experiment period has to be twice as long as the standard conversion journey. Doing so ensures that visitors entering the experiment during the first half will have time to convert. You can expect that people who began their journey before the experiment and people entering during the second half of the experiment period will cancel each other out: the first group will contain conversions that should not be counted, and some of the second group’s conversions will be missing. However, a majority of genuine conversions will be counted.

That’s why a typical buying journey of one week results in a two-week experiment, offering the right balance in terms of speed and accuracy of the measurements.

Exceptions to this rule

A 14-day experiment period doesn’t apply to all cases. If the delay between the exposed variation and the conversion is 1.5 weeks for instance, then your experiment period should be three weeks, in order to cover the usual conversion delay twice. 

On the other hand, if you know that the delay is close to zero (such as on a media website, where you are trying to optimize the placement of an advertisement frame on a page where visitors only stay a few minutes), you may think that one day would be enough based on this logic, but it’s not.

The reason is that you would not sample every day of the week, and we know from experience that people do not behave the same way throughout the week. So even in a zero-delay context, you still need to run the experiment for an entire week.

Takeaways: 

  1. Your test period should mirror the conditions of your expected implementation period.
  2. Sample each day of the week in the same way.
  3. Wait an integer number of weeks before closing an A/B test.

Respecting these rules will ensure that you have clean measurements. The accuracy of the measurement is defined by another parameter of the experiment: the total number of visitors. We’ll address this topic in another article – stay tuned.

Article

8min read

Optimizing Revenue Beyond Conversion Rate

When it comes to CRO, or Conversion Rate Optimization, it would be natural to assume that conversion is all that matters. At least, we can argue that conversion rate is at the heart of most experiments. However, the ultimate goal is to raise revenue, so why does the CRO world put so much emphasis on conversion rates?

In this article, we’ll shed some light on the reason why conversion rate is important and why it’s not just conversions that should be considered.

Why is conversion rate so important?

Let’s start off with the three technical reasons why CRO places such importance on conversion rates:

  1. Conversion is a generic term. It covers an e-commerce visitor becoming a customer by buying something, or simply going further than the homepage, clicking on a product page, or adding a product to the cart. In that sense, it’s the Swiss Army Knife of CRO.
  2. Conversion statistics are far easier than other KPI statistics, and they’re the simplest from a maths point of view. In terms of measurement, it’s pretty straightforward: success or failure.
    This means off-the-shelf code or simple spreadsheet formulas can compute the statistical indices used for decisions, like the chance to win or confidence intervals around the expected gain. This is not as easy for other metrics, as we will see later with Average Order Value (AOV).
  3. Conversion analysis is also the simplest when it comes to decision-making. There’s (almost) no scenario where raising the number of conversions is a bad thing. Therefore, deciding whether or not to put a variation in production is an easy task when you know that the conversion rate will rise. The same can’t be said about the “multiple conversions” metric: unlike the conversion rate, which counts at most one conversion per visitor even if that visitor made two purchases, it counts every conversion and is therefore often more complex to analyze. For example, the number of product pages seen by an e-commerce visitor is harder to interpret. A variation increasing this number could have several meanings: the catalog may be more engaging, or visitors may be struggling to find what they’re looking for.

Due to the aforementioned reasons, the conversion rate is the starting point of all CRO journeys. However, conversion rate on its own is not enough. It’s also important to pay attention to other factors other than conversions to optimize revenue. 

Beyond conversion rate

Before we delve into a more complex analysis, we’ll take a look at some simpler metrics. This includes ones that are not directly linked to transactions such as “add to cart” or “viewed at least one product page”.

If such a metric is statistically assured to win, then putting the variation into production is a good choice, with one exception: if the variation is very costly, you will need to dig deeper to ensure that the gains will cover the costs. This can occur, for example, if the variation relies on a product recommender system that comes with its own cost.

The bounce rate is also simple and straightforward; the only difference is that, unlike the conversion rate, the aim is to push the figure down. The main idea remains the same: if you change your homepage image and see the bounce rate statistically drop, then it’s a good idea to put the change in production.

We will now move onto a more complex metric, the transaction rate, which is directly linked to the revenue. 

Let’s start with a scenario where the transaction rate goes up. You assume that you will get more transactions with the same traffic, so the only way this could be a bad thing is if you earn less in the end, which means your average order value (AOV) has plummeted. The basic revenue formula shows it explicitly:

Total revenue = traffic × transaction rate × AOV

Since we consider traffic an external factor, the only way to get a higher total revenue is for both transaction rate and AOV to increase, or for at least one of them to increase while the other remains stable. This means we also need to check the AOV evolution, which is much more complicated.

On the surface, it looks simple: divide the sum of all transaction values by the number of transactions and you have the AOV. But while the formula is basic, the data isn’t. Here it’s no longer just success or failure; it’s a set of values that can vary widely.

Below is a histogram of transaction values from a retail e-commerce website. The horizontal axis represents the transaction value (in €); the vertical axis is the proportion of transactions at that value. Here we can see that most values are spread between 0 and €200, with a peak at ~€50.


The right part of this curve shows a “long/fat tail”. Now let’s try to see how the difference within this kind of data is hard to spot. See the same graph below but with higher values, from €400 to €1000. You will also notice another histogram (in orange) of the same values but offset by €10.

We see that the €10 offset, which corresponds to a 10-unit shift to the right, is hard to distinguish. And since it affects the highest values, this part of the distribution has a huge influence when averaging samples. Because of the shape of the transaction value distribution, any measure of the average is somewhat blurred, which makes it very difficult to obtain clear statistical indices. Changes in AOV therefore need to be very drastic, or measured over a huge dataset, to be statistically asserted, which makes AOV difficult to use in CRO.

Another important feature is hidden even further on the right of the horizontal axis. Here’s another zoom on the same graph, with the horizontal axis ranging from €1000 to €4500. This time only one curve is shown.

From the previous graph, we could easily have assumed that €1000 was the end of the distribution, but it’s not. Even though the most common transaction value is around €50, there are still some transactions above €1000, and even some over €3000. We call these extreme values.

As a result, whether or not these high values land in a given sample makes a big difference. Since they exist but are scarce, they will not be evenly spread across variations, which can artificially create a difference when computing AOV. By artificially, we mean the difference comes from a small number of visitors and so is unlikely to be statistically significant. Also keep in mind that customer behavior is not the same when buying for €50 as when making a purchase of more than €3000.
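
A small simulation gives a feel for this effect. The sketch below uses made-up, log-normal-shaped order values (not real data) and repeatedly splits the very same orders into two random halves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed order values: most around €50, a few in the thousands of euros
transactions = rng.lognormal(mean=np.log(50), sigma=1.2, size=5_000)

# Repeatedly split the SAME orders into two random halves and compare their AOV
diffs = []
for _ in range(1_000):
    rng.shuffle(transactions)
    half = len(transactions) // 2
    diffs.append(transactions[:half].mean() - transactions[half:].mean())

print(f"typical swing of the observed AOV difference: ±€{np.std(diffs):.2f}")
```

Even with zero real difference, the observed AOV gap routinely moves by several euros in either direction, purely because of where the biggest baskets happen to land.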

There’s not much you can do about this except know that it exists. One good practice, though, is to separate B2B and B2C visitors if you can, since B2B transaction values are typically larger and less frequent. Setting them apart will limit these problems.

What does this mean for AOV?

There are three important things to keep in mind when it comes to AOV:

  1. Don’t trust the basic AOV calculation; the difference you are seeing probably does not exist, and quite often is not even in the observed direction! It is only displayed to give an order of magnitude for interpreting changes in conversion rates and shouldn’t be used to claim a difference between the variations’ AOV. That’s why we use a specific test, the Mann-Whitney U test, which is adapted to this kind of data (see the sketch after this list).
  2. You should only trust the statistical index on AOV, and it is only valid for assessing the direction of the AOV difference, not its size. For example, if you notice a +€5 AOV difference and the statistical index is 95%, this only means you can be 95% sure that you will have an AOV gain, not that it will be €5.
  3. Since transaction data is far wilder than conversion data, stronger differences or bigger datasets are needed to reach statistical significance. And since there are always fewer transactions than visitors, reaching significance on the conversion rate does not imply significance on AOV.
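
Here is a minimal sketch of how such a rank-based test can be run with SciPy on simulated order values; the data are hypothetical and this is not the exact procedure behind the platform’s reporting:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

# Simulated, heavily right-skewed order values for two variations (hypothetical data)
orders_a = rng.lognormal(mean=np.log(50), sigma=1.2, size=2_000)
orders_b = rng.lognormal(mean=np.log(53), sigma=1.2, size=2_000)  # slightly higher typical basket

# Mann-Whitney works on ranks, so it is robust to the skew and to extreme baskets
stat, p_value = mannwhitneyu(orders_a, orders_b, alternative="two-sided")
print(f"p-value = {p_value:.3f}")

# A small p-value supports a directional difference in AOV,
# but the test says nothing about how large that difference is.
```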

This means that deciding on a variation that shows a conversion rate gain can still be complex, because we rarely have a clear answer about the variation’s effect on AOV.

This is yet another reason to have a clear experimentation protocol including an explicit hypothesis. 

For example, if the test shows an alternate product page layout based on the hypothesis that visitors have trouble reading the product page, then the AOV should not be impacted. In that case, if the conversion rate rises, we can validate the winner as long as the AOV shows no strong statistical downward trend. However, if the change concerns the product recommender system, which might affect the AOV, then one should require stronger statistical evidence that the AOV is unharmed before calling a winner. The recommender might, for instance, bias visitors toward cheaper products, boosting the number of sales but not the overall revenue.

The real driving force behind CRO

We’ve seen that the conversion rate is at the base of CRO practice because of its simplicity and versatility compared to all other KPIs. Nonetheless, this simplicity must not be taken for granted. It sometimes hides more complexity that needs to be understood in order to make profitable business decisions, which is why it’s a good idea to have expert resources during your CRO journey. 

That’s why at AB Tasty, our philosophy is not only about providing top-notch software but also Customer Success accompaniment.

Article

10min read

A/A Testing: What is it and When Should You Use it?

A/A tests are a legacy from the early days of A/B testing. It’s basically creating an A/B test where two identical versions of a web page or element are tested against each other. Variation B is just a copy of A without any modification.

One of the goals of A/A tests is to check the effectiveness and accuracy of testing tools. The expectation is that, if no winner is declared, the test is a success, whereas detecting a statistical difference would mean a failure, indicating a problem somewhere in the pipeline.

But it’s not always that simple. We’ll dive into this type of testing and the statistics and tech behind the scenes. We’ll look at why a failed A/A test is not proof of a pipeline failure, and why a successful A/A test isn’t a foolproof sanity check.

What is tested during an A/A test?

Why is there so much buzz around A/A testing? An A/A test can be a way to verify two components of an experimentation platform: 

  1. The statistical tool: It may be possible that the formulas chosen don’t fit the real nature of the data, or may contain bugs.
  2. The traffic allocation: The split between variations must be random and respect the proportions it has been given. When a problem occurs, we talk about Sample Ratio Mismatch (SRM); that is, the observed traffic does not match the allocation setting. This means that the split has some bias impacting the analysis quality.
    Let’s explore this in more detail.

Statistical tool test

Let’s talk about a “failed” A/A test

The most common idea behind A/A tests is that the statistical tool should yield no significant difference. Detecting a difference in performance is therefore considered a “failed” A/A test.

However, to understand how weak this conclusion is, you need to understand how statistical tests work. Let’s say that your significance threshold is 95%. This means that there is still a 5% chance that the difference you see is a statistical fluke and no real difference exists between the variations. So even with a perfectly working statistical tool, you still have one chance in twenty (1/20=5%) that you will have a “failed” A/A test and you might start looking for a problem that may not exist.

With that in mind, a more reasonable procedure would be to perform 20 A/A tests and expect 19 of them to yield no statistical difference and one to detect a significant difference. Even then, with only 20 tests the margin is thin: seeing more than one significant result could hint at a real problem, but it could also happen by chance. In other words, one successful A/A test is not enough to validate a statistical tool; to validate it properly, you need to show that such tests are successful about 95% of the time (19/20).

Therefore, a meaningful approach would be to perform hundreds of A/A tests and expect ~5% of them to “fail”. It’s worth noting that if they “fail” noticeably less than 5% of the time, that’s also a problem: it may indicate that the statistical test simply says “no” too often, leading to a strategy that never detects any winning variation. So a single “failed” A/A test doesn’t tell you much in reality.
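
If you want to see this 5% behaviour without spending real traffic, it is straightforward to simulate; the sketch below runs a thousand simulated A/A tests through a plain two-proportion z-test with illustrative volumes:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

n_tests, visitors, cr = 1_000, 30_000, 0.05   # hypothetical volumes and conversion rate
z_crit = norm.ppf(0.975)                      # 95% significance threshold (two-sided)
false_positives = 0

for _ in range(n_tests):
    # Both "variations" share the exact same 5% conversion rate: an A/A test
    conv_a = rng.binomial(visitors, cr)
    conv_b = rng.binomial(visitors, cr)
    pooled = (conv_a + conv_b) / (2 * visitors)
    se = np.sqrt(pooled * (1 - pooled) * 2 / visitors)
    z = (conv_b - conv_a) / visitors / se
    if abs(z) > z_crit:
        false_positives += 1

# Roughly 5% of these A/A tests should "fail", even though nothing is broken
print(f"'failed' A/A tests: {false_positives / n_tests:.1%}")
```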

What if it’s a “successful A/A test”? 

A “successful” A/A test (yielding no difference) is not proof that everything is working as it should. To understand why, you need to check another important tool in an A/B test: the sample size calculator.

In the following example, we see that with a 5% baseline conversion rate, you need around 30k visitors per variation to reach the 95% significance level for a 10% MDE (Minimal Detectable Effect).

But in the context of an A/A test, the Minimal Detectable Effect (MDE) is in fact 0%. Using the same formula, we’ll plug 0% as MDE.

At this point, you will discover that the form does not let you put a 0% here, so let’s try a very small number then. In this case, you get almost 300M visitors, as seen below.

In fact, to be confident that there is exactly no difference between two variations, you need an infinite number of visitors, which is why the form does not let you set 0% as MDE.
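
These orders of magnitude can be reproduced with the standard two-proportion sample-size approximation. The sketch below is a generic textbook formula applied to the example numbers above, not necessarily the exact formula behind any particular calculator:

```python
from scipy.stats import norm

def visitors_per_variation(baseline_cr: float, mde_relative: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Standard two-proportion sample-size approximation (visitors per variation)."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_relative)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(visitors_per_variation(0.05, 0.10))    # ~31,000 visitors for a 10% MDE
print(visitors_per_variation(0.05, 0.001))   # ~300 million visitors for a 0.1% MDE
```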

Therefore, a successful A/A test only tells you that the difference between the two variations is smaller than a given number but not that the two variations perform exactly the same.

This problem comes from another principle in statistical tests: the power. 

The power of a test is the chance that you discover a difference if there is any. In the context of an A/A test, this refers to the chance you discover a statistically significant discrepancy between the two variations’ performance. 

The more power a test has, the more likely you are to discover a difference. To raise the power of a test, you simply raise the number of visitors.

You may have noticed in the previous screenshots that tests are usually powered at 80%. This means that even if a real difference in performance exists between the variations, you will miss it 20% of the time. So one “successful” A/A test (yielding no statistical difference) may simply be an occurrence of this 20%. In other words, a single successful A/A test doesn’t ensure that your experimentation tool is working: a problem may exist, and there is a 20% chance you missed it. Additionally, reaching 100% power would require an infinite number of visitors, which is impractical.
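
The same normal approximation can be turned around to show how power grows with traffic; again, this is a generic sketch with made-up numbers rather than the exact formula used by any given tool:

```python
import numpy as np
from scipy.stats import norm

def power(n_per_variation: int, baseline_cr: float, mde_relative: float,
          alpha: float = 0.05) -> float:
    """Approximate power of a two-proportion test (normal approximation)."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_relative)
    se = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    return float(norm.cdf(abs(p2 - p1) / se - norm.ppf(1 - alpha / 2)))

# Power for a 5% baseline and a 10% relative effect at various traffic levels
for n in (10_000, 30_000, 60_000):
    print(f"{n:>7} visitors per variation -> power ≈ {power(n, 0.05, 0.10):.0%}")
# Power creeps toward 100% as traffic grows, but only reaches it with infinite visitors.
```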

How do we make sure we can trust the statistical tool then? If you are using a platform that is used by thousands of other customers, chances are that the problem would have already been discovered. 

Because statistical software does not change very often and is not affected by the variation content (whereas the traffic allocation might be, as we will see later), the best option is to trust your provider, or to double-check the results with an independent calculator. You can find many of these on the web: they only need the number of visitors and the number of conversions for each variation, which makes the check quick to run.
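
As an example of such a cross-check, the exported counts are enough to recompute a standard significance test independently. Here is a sketch using statsmodels, with hypothetical figures standing in for your own export:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical exported figures: conversions and visitors for each variation
conversions = [1_500, 1_650]
visitors = [30_000, 30_000]

# Two-sided two-proportion z-test on the exported counts
stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"independent check, p-value: {p_value:.4f}")  # compare with what your platform reports
```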

Traffic allocation test

In this part, we only focus on traffic, not conversions. 

The question is: does the splitting operation work as it should? We call this kind of failure an SRM, or Sample Ratio Mismatch. You may wonder how a simple random choice could fail; in fact, the failure happens either before or after the random choice itself.

Here are two examples of how that can happen:

  • The variation contains a bug that crashes some browsers. In this case, the corresponding variation will lose visitors. Since the bug may depend on the browser, you will end up with a bias in your data.
  • If the variation gives a discount coupon (or any other advantage) and some users find a way to force their browser to run that variation (to get the coupon), then the variation will receive an excess of visitors that is not due to random chance, which results in biased data.


An SRM is hard to detect with the naked eye because the allocation is random, so you never get exact round numbers.

For instance, a 50/50 allocation never splits the traffic into groups of exactly the same size. As a result, we need statistical tools to check whether the observed split corresponds to the desired allocation.

SRM tests exist for this. They work more or less like an A/B test, except that the SRM formula indicates whether the observed traffic differs from the desired allocation by more than pure randomness can explain. If an SRM is detected, it means that some data was lost or that a bias occurred during the experiment, which undermines trust in future (real) experiments.
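
Under the hood, an SRM check boils down to a goodness-of-fit test between the observed visitor counts and the configured allocation. Here is a minimal sketch with hypothetical counts; a platform’s built-in check may be more elaborate:

```python
from scipy.stats import chisquare

# Hypothetical visitor counts observed for a 50/50 allocation
observed = [50_400, 49_600]
total = sum(observed)
expected = [total * 0.5, total * 0.5]

# Chi-square goodness-of-fit test between observed counts and the intended split
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"SRM check p-value: {p_value:.4f}")
# A very small p-value (a threshold such as 0.001 is commonly used) flags a likely SRM.
```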

On the one hand, detecting an SRM during an A/A test sounds like a good idea. On the other hand, from an operational point of view it is not that useful, because the chance of an SRM is low.

Even if some reports say that SRMs are more frequent than you may think, most of the time they happen on complex tests. In that sense, checking for an SRM within an A/A test will not help you prevent one on a more complex experiment later.

Whether you find a Sample Ratio Mismatch in a real experiment or in an A/A test, the actions to take are the same: find the cause, fix it, and restart the experiment. So why waste time and traffic on an A/A test that gives you no extra information? A real experiment would have given you real insights if it worked fine on the first try, and if a problem does occur, it would be detected in a real experiment too, since the SRM check only considers traffic, not conversions.

A/A tests are also unnecessary since most trustworthy A/B testing platforms (like AB Tasty) do SRM checks on an automated basis. So if an SRM occurs, you will be notified anyway. 

So where does this “habit” of practicing A/A tests come from?

Over the years, it’s something that engineers building A/B testing platforms have done. It makes sense in their case because they can run a lot of automated experiments, and even simulate users if they don’t have enough at hand, which allows a statistically sound approach to A/A testing.

They have reasons to doubt the platform in the works and they have the programming skills to automatically create hundreds of A/A tests to test it properly. Since these people can be seen as pioneers, their voice on the web is loud when they explain what an A/A test is and why it’s important (from an engineering perspective).

However, for a platform user or customer, the context is different: they’ve paid for a ready-to-use, trusted platform and want to start real experiments as soon as possible to get a return on investment. Therefore, it makes little sense to waste time and traffic on an A/A test that won’t provide any valuable information.

Why sometimes it might be better to skip A/A tests

We can conclude that a failed A/A test is not necessarily a problem and that a successful one is not proof of sanity.

In order to gain valuable insights from A/A tests, you would need to perform hundreds of them with an infinite number of visitors. Moreover, an efficient platform like AB Tasty does the corresponding checks for you.

That’s why, unless you are developing your own A/B testing platform, running an A/A test may not give you the insights you’re looking for. A/A tests require a considerable amount of time and traffic that could otherwise be used to conduct A/B tests that could give you valuable insights on how to optimize your user experience and increase conversions. 

When it makes sense to run an A/A test

It may seem that running A/A tests isn’t the right call after all. However, there are a couple of situations where performing one can still be useful.

The first is when you want to check the data you are collecting and compare it to data already collected with other analytics tools. Keep in mind that you will never get exactly the same results, because metric definitions vary from one tool to another. Nonetheless, this comparison is an important onboarding step to ensure that the data is properly collected.

The other reason to perform an A/A test is to know the reference value for your main metrics so you can establish a baseline to analyze your future campaigns more accurately. For example, what is your base conversion rate and/or bounce rate? Which of these metrics need to be improved and are therefore good candidates for your first real A/B test?

This is why AB Tasty has a feature that helps users build A/A tests dedicated to reaching these goals while avoiding the pitfalls of “old school” methods that are no longer useful. With our new A/A test feature, data is collected in one variant (not two); let’s call this an “A test”.

This gives you a more accurate estimation of these important metrics, since the more data you have, the more accurate the measurements are. In a classic A/A test, by contrast, data is split across two variants, which provides less accurate estimates because each variant receives less data.

With this approach, AB Tasty enables users to automatically set up A/A tests, which gives better insights than classic “handmade” A/A tests.

Article

8min read

How to Better Handle Collateral Effects of Experimentation: Dynamic Allocation vs Sequential Testing

When talking about web experimentation, the topics that often come up are learning and earning. However, it’s important to remember that a big part of experimentation is encountering risks and losses. Although losses can be a touchy topic, it’s important to talk about and destigmatize failed tests in experimentation because it encourages problem-solving, thinking outside of your comfort zone and finding ways to mitigate risk. 

Therefore, we will examine the shortcomings of classic hypothesis testing and look into other options. Basic hypothesis testing follows a rigid protocol:

  • Creating the variation according to the hypothesis
  • Waiting a given amount of time 
  • Analyzing the result
  • Decision-making (implementing the variant, keeping the original, or proposing a new variant)

This rigid protocol and simple approach to testing says nothing about how to handle losses, which raises the question: what happens if something goes wrong? Additionally, the classic statistical tools used for analysis are not meant to be used before the end of the experiment.

If we consider a very general rule of thumb, let’s say that out of every 10 experiments, 8 will be neutral (show no real difference), one will be positive, and one will be negative. Classic hypothesis testing suggests that you simply accept this as a collateral effect of the optimization process, hoping it evens out in the long term. It may feel like crossing a street blindfolded.

For many, that may not cut it. Let’s take a look at two approaches that try to better handle this problem: 

  • Dynamic allocation – also known as “Multi-Armed Bandit” (MAB). Here the traffic allocated to each variation changes according to its performance, implicitly lowering the losses.
  • Sequential testing – a method that allows you to stop a test as soon as possible, given a risk aversion threshold.

These approaches are statistically sound, but they come with their own assumptions. We will go through their pros and cons within the context of web optimization.

First, we’ll look into the classic version of these two techniques and their properties and give tips on how to mitigate some of their problems and risks. Then, we’ll finish this article with some general advice on which techniques to use depending on the context of the experiment.

Dynamic allocation (DA)

Dynamic allocation’s main idea is to use statistical formulas that modify the number of visitors exposed to each variation depending on its performance.

This means a poorly performing variation will end up with little traffic, which can be seen as a way to save conversions while still searching for the best-performing variation. The formulas ensure the best compromise between avoiding losses and finding the true best performer. However, this relies on a number of assumptions that are not always met, which makes DA a risky option.
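
To give an idea of the mechanics, here is a minimal Thompson-sampling sketch, one common MAB formula among several, with made-up conversion rates and simulated conversions that are assumed to be known immediately:

```python
import numpy as np

rng = np.random.default_rng(3)

true_cr = [0.05, 0.03]        # hypothetical true conversion rates for variations A and B
successes = np.ones(2)        # Beta(1, 1) priors for each variation
failures = np.ones(2)
exposures = np.zeros(2, dtype=int)

for _ in range(20_000):       # each iteration allocates one visitor
    # Sample a plausible conversion rate per variation and send the visitor to the best draw
    draws = rng.beta(successes, failures)
    arm = int(np.argmax(draws))
    exposures[arm] += 1
    # NOTE: the outcome is assumed to be known instantly, which is exactly
    # the assumption that breaks when there is a conversion delay.
    if rng.random() < true_cr[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

print(exposures)              # the weaker variation ends up receiving much less traffic
```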

There are two main concerns, both of which are linked to the time aspect of the experimentation process: 

  • The DA formula does not take time into account 

If there is a noticeable delay between exposure to the variation and the conversion, the algorithm can go wrong: a visitor is considered a ‘failure’ until they convert, so during the time between visit and conversion they are falsely counted as a failure.

The DA formula therefore works with incorrect conversion information: any variation gaining traffic will automatically show a (false) performance drop, because the algorithm sees a growing number of not-yet-converted visitors, and traffic to that variation will be reduced.

The reverse may also be true: a variation with decreasing traffic will no longer have any new visitors while existing visitors of this variation could eventually convert. In that sense, results would indicate a (false) rise in conversions even when there are no new visitors, which would be highly misleading.

DA gained popularity in the advertising industry, where the delay between an ad exposure and its potential conversion (a click) is short, which is why it works perfectly well in that context. The use of Dynamic Allocation in CRO must therefore be restricted to contexts with a short conversion delay.

In other words, DA should only be used in scenarios where visitors convert quickly. It’s not recommended for e-commerce except for short-term campaigns such as flash sales or when there’s not enough traffic for a classic AB test. It can also be used if the conversion goal is clicking on an ad on a media website.

  • DA and the different days of the week 

It’s very common to see different visitor behavior depending on the day of the week. Typically, customers may behave differently on weekends than during weekdays.  

With DA, you may be sampling days unevenly, implicitly giving more weight to some days for some variations. However, each day should carry the same weight because, in reality, every day of the week occurs equally often. You should only use Dynamic Allocation if you know that the optimized KPI is not sensitive to fluctuations across the week.

The conclusion is that DA should be considered only when you expect too few total visitors for classic A/B testing. Another requirement is that the KPI under experimentation needs a very short conversion time and no dependence on the day of the week. Taking all this into account: Dynamic Allocation should not be used as a way to secure conversions.

Sequential Testing (ST)

Sequential Testing uses a specific statistical formula that lets you stop an experiment early, depending on the performance of the variations, with given guarantees on the risk of false positives.

The Sequential Testing approach is designed to secure conversions by stopping a variation as soon as its underperformance is statistically proven. 
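
As a rough illustration of the principle, and not the specific formula used by any particular platform, here is a Wald-style sequential probability ratio test that monitors a variation against a known baseline conversion rate and stops it once underperformance is statistically supported; the rates and risk levels are assumptions for the example:

```python
import numpy as np

def sprt_stop_loser(outcomes, p_baseline=0.05, p_bad=0.04, alpha=0.05, beta=0.20):
    """Wald SPRT: stop the variation as soon as the 'underperforming' hypothesis is supported.

    `outcomes` is the visitor-by-visitor stream of 0/1 conversions for the variation.
    H0: the variation converts like the 5% baseline; H1: it has dropped to 4%.
    """
    upper = np.log((1 - beta) / alpha)   # crossing this boundary -> declare underperformance
    lower = np.log(beta / (1 - alpha))   # crossing this boundary -> stop suspecting it
    llr = 0.0
    for i, converted in enumerate(outcomes, start=1):
        llr += np.log(p_bad / p_baseline) if converted else \
               np.log((1 - p_bad) / (1 - p_baseline))
        if llr >= upper:
            return f"stop the variation as a loser after {i} visitors"
        if llr <= lower:
            return f"no evidence of underperformance after {i} visitors"
    return "keep collecting data"

rng = np.random.default_rng(4)
# A variation whose true conversion rate really dropped to 4%; the SPRT usually flags it
# after a few thousand visitors instead of waiting for the planned end of the test.
print(sprt_stop_loser(rng.random(100_000) < 0.04))
```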

However, it still has some limitations when it comes to effect size estimation, which may be wrong in two ways:

  • Bad variations will be seen as worse than they really are. This is not a problem in CRO because the false positive risk is still controlled: in the worst case, you discard not a strictly losing variation but merely an even one, which still makes sense in CRO.
  • Good variations will be seen as better than they really are. It may be a problem in CRO since not all winning variations are useful for business. The effect size estimation is key to business decision-making. This can easily be mitigated by using sequential testing to stop losing variations only. Winning variations, for their part, should be continued until the planned end of the experiment, ensuring both correct effect size estimation and an even sampling for each day of the week.
    It’s important to note that not all CRO software uses this hybrid approach. Most tools use ST to stop both winning and losing variations, which, as we’ve just seen, is a mistake.

As we’ve seen, by stopping a losing variation in the middle of the week, there’s a risk you may be discarding a possible winning variation. 

However, for a variation to turn into a winner after ST has shown that it is underperforming, it would first have to perform well enough to catch up with the reference, and then well enough to outperform it, all within a few days. This scenario is highly unlikely.

Therefore, it’s safe to stop a losing variation with Sequential Testing, even if all weekdays haven’t been evenly sampled.

The best of both worlds in CRO 

Dynamic Allocation is a better approach than static allocation when you expect a small volume of traffic. It should be used only with ‘short delay’ KPIs and when there is no known weekday effect (for example, flash sales). However, it is not a way to mitigate risk in a CRO strategy.

To run experiments with all the needed guarantees, you need a hybrid system that uses Sequential Testing to stop losing variations and a classic method, run until the planned end of the experiment, to validate winning variations. This hybrid approach gives you the best of both worlds.