Is Your Average Order Value (AOV) Misleading You?

Average Order Value (AOV) is a widely used metric in Conversion Rate Optimization (CRO), but it can be surprisingly deceptive. While the formula itself is simple—summing all order values and dividing by the number of orders—the real challenge lies within the data itself.

The problem with averaging

AOV is not a “democratic” measure. A single high-spending customer can easily spend 10 or even 100 times more than your average customer. These few extreme buyers can heavily skew the average, giving a limited number of visitors disproportionate impact compared to hundreds or thousands of others. This is problematic because you can’t truly trust the significance of an observed AOV effect if it’s tied to just a tiny fraction of your audience.

Let’s look at a real dataset to see just how strong this effect can be. Consider the order value distribution:

  • The horizontal axis represents the order value.
  • The vertical axis represents the frequency of that order value.
  • The blue surface is a histogram, while the orange outline is a log-normal distribution approximation.

This graph shows that the most frequent order values are small, around €20, and that orders become rarer as their value increases. This is a long-tail (heavy-tail) distribution: very large values can occur, albeit rarely.

A single strong buyer with an €800 order carries 40 times the weight of a typical €20 buyer in the AOV calculation. This is an issue because a slight, consistent change in the behavior of 40 visitors is stronger evidence than a large change from a single visitor. And while they are not fully visible at this scale, even more extreme buyers exist.
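
To make the leverage concrete, here is a minimal Python sketch with made-up numbers (not the dataset behind the graph) showing how a single extreme order shifts the AOV:

```python
# Minimal sketch with made-up numbers (not the dataset behind the graph):
# how much one extreme order shifts the AOV.
orders = [20] * 100                     # 100 typical €20 orders
aov = sum(orders) / len(orders)

orders_with_outlier = orders + [800]    # one strong buyer at €800
aov_with_outlier = sum(orders_with_outlier) / len(orders_with_outlier)

print(f"AOV without the outlier: €{aov:.2f}")                # €20.00
print(f"AOV with one €800 order: €{aov_with_outlier:.2f}")   # ≈ €27.72
```

One order out of 101 is enough to move the average by almost 40%.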

The next graph, using the same dataset, illustrates this better:

  • The horizontal axis represents the size of the growing dataset of order values (roughly indicating time).
  • The vertical axis represents the maximum order value in the growing dataset, in €.

At the beginning of data collection, the maximum order value is quite small (close to the most frequent value of ~€20). However, it grows larger as time passes and the dataset expands. With a dataset of 10,000 orders, the maximum order value can exceed €5,000. This means that any buyer with an order above €5,000 (and there may be several of them) carries 250 times the weight of a frequent buyer at €20. At the maximum dataset size, a single customer with an order over €20,000 can influence the AOV more than 2,000 other customers combined.
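
You can reproduce this behaviour with a quick simulation. The sketch below uses an arbitrary log-normal parameterisation (so the euro amounts are illustrative, not the real dataset) and prints the running maximum as the dataset grows:

```python
import numpy as np

# Illustrative simulation of the "growing maximum" effect with an arbitrary
# log-normal parameterisation (not the real dataset behind the graph).
rng = np.random.default_rng(42)
orders = rng.lognormal(mean=3.3, sigma=1.0, size=100_000)  # median order ≈ €27

for n in (100, 1_000, 10_000, 100_000):
    sample = orders[:n]
    print(f"n = {n:>7}   max order ≈ €{sample.max():>8,.0f}   AOV ≈ €{sample.mean():.2f}")
```

The exact figures depend on the chosen parameters, but the pattern is the same: the maximum keeps climbing as data accumulates, while the typical order barely moves.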

When looking at your e-commerce metrics, AOV should not be used as a standalone basis for decision-making.


The challenge of A/B test splitting

The problem intensifies when considering the random splits used in A/B tests.

Imagine you have only 10 very large spenders whose collective impact on AOV equals that of 10,000 medium buyers. With such a small group, there is a high probability that the random split will be uneven. The overall traffic split may be statistically balanced, but because these few high spenders dominate the AOV, their allocation deserves specific attention. And since you can't predict which visitor will become a customer or how much they will spend, you cannot guarantee an even split of these high-value users.

This can artificially inflate or deflate AOV even when there is no true underlying effect, simply depending on which variation these few high spenders happen to land on.
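
A quick simulation shows how unlikely a balanced allocation actually is. The sketch below assumes 10 high spenders and a 50/50 random split:

```python
import numpy as np

# Sketch: how often does a 50/50 random split allocate 10 high spenders evenly?
rng = np.random.default_rng(0)
n_simulations = 100_000

# Each of the 10 high spenders is independently assigned to variation A (1) or B (0).
assignments = rng.integers(0, 2, size=(n_simulations, 10))
high_spenders_in_a = assignments.sum(axis=1)

even_split = np.mean(high_spenders_in_a == 5)
lopsided = np.mean(np.abs(high_spenders_in_a - 5) >= 2)   # 7/3 split or worse

print(f"Perfect 5/5 split:  {even_split:.1%}")   # ≈ 25%
print(f"7/3 split or worse: {lopsided:.1%}")     # ≈ 34%
```

Roughly three splits out of four are uneven, and about a third are lopsided enough (7/3 or worse) to visibly distort the AOV of one variation.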

What’s the solution?

If AOV is such an unreliable metric, how can we work with it effectively? The answer is similar to how you approach conversion rates and experimentation.

You don’t trust raw conversion data—one more conversion on variation B doesn’t automatically make it a winner, nor do 10 or 100. Instead, you rely on a statistical test to determine when a difference is significant. The same principle applies to AOV. Tools like AB Tasty offer the Mann-Whitney test, a statistical method robust against extreme values and well-suited for long-tail distributions.
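
As an illustration (using SciPy rather than AB Tasty's internal implementation, whose exact settings are not documented here), comparing the order values of two variations with a Mann-Whitney U test looks like this:

```python
import numpy as np
from scipy import stats

# Illustration with SciPy (not AB Tasty's internal implementation):
# comparing the order values of two variations with a Mann-Whitney U test.
rng = np.random.default_rng(1)
orders_a = rng.lognormal(mean=3.30, sigma=1.0, size=2_000)
orders_b = rng.lognormal(mean=3.35, sigma=1.0, size=2_000)  # slightly higher spend

print(f"AOV A: €{orders_a.mean():.2f}   AOV B: €{orders_b.mean():.2f}")

u_stat, p_value = stats.mannwhitneyu(orders_a, orders_b, alternative="two-sided")
print(f"Mann-Whitney p-value: {p_value:.4f}")
```

Because the test works on ranks rather than raw euro amounts, a handful of extreme orders cannot single-handedly flip its verdict, which is exactly the robustness you need for long-tail data.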

AOV behavior can be confusing because you’re likely accustomed to the more intuitive statistics of conversion rates. Conversion data and their corresponding statistics usually align; a statistically significant increase in conversion rate typically means a visibly large difference in the number of conversions, consistent with the statistical test. However, this isn’t always the case with AOV. It’s not uncommon to see the AOV trend and the statistical results pointing in different directions. Your trust should always be placed in the statistical test.

The root cause: Heavy tail distributions

You now understand that the core issue stems from the unique shape of order value distributions: long-tail distributions that produce rare, extreme values.

It’s important to note that the problem isn’t just the existence of extreme values. If these extreme values were frequent, the AOV would naturally be higher, and their impact would be less dramatic because the difference between the AOV and these values would be smaller. Similarly, for the splitting problem, a larger number of extreme values would ensure a more even split.

At this point, you might think your business has a different order distribution shape and isn’t affected. However, this shape emerges whenever these two conditions are met:

  • You have a price list with at least a few dozen distinct values.
  • Visitors can purchase multiple products at once.

Needless to say, these conditions are ubiquitous and apply to nearly every e-commerce business. The e-commerce revolution itself was fueled by the ability to offer vast catalogues.

Furthermore, shipping costs naturally encourage users to group their purchases to minimize those costs. As a result, nearly all e-commerce businesses are affected; the only exceptions are subscription-based businesses with limited pricing options, where most purchases are for a single service.

Here’s a glimpse into the order value distribution across various industries, demonstrating the pervasive nature of the “long tail distribution”:

  • Cosmetics
  • Transportation
  • B2B packaging (selling packaging for e-commerce)
  • Fashion
  • Online flash sales

AOV, despite its simple definition and apparent ease of understanding, is a misleading metric. Its magnitude is easy to grasp, leading people to confidently make intuitive decisions based on its fluctuations. However, the reality is far more complex; AOV can show dramatic changes even when there’s no real underlying effect.

Conversely, significant changes can go unnoticed. A strong negative effect could be masked by just a few high-spending customers landing in a poorly performing variation. So, now you know: just as you do for conversion rates, rely on statistical tests for your AOV decisions.



Minimal Detectable Effect: The Essential Ally for Your A/B Tests

In CRO (Conversion Rate Optimization), a common dilemma is not knowing what to do with a test that shows a small and non-significant gain. 

Should we declare it a “loser” and move on? Or should we collect more data in the hope that it will reach the set significance threshold? 

Unfortunately, we often make the wrong choice, influenced by what is called the “sunk cost fallacy.” We have already put so much energy into creating this test and waited so long for the results that we don’t want to stop without getting something out of this work. 

However, CRO’s very essence is experimentation, which means accepting that some experiments will yield nothing. Yet, some of these failures could be avoided before even starting, thanks to a statistical concept: the MDE (Minimal Detectable Effect), which we will explore together.

MDE: The Minimal Detectable Threshold

In statistical testing, samples have always been valuable, perhaps even more so in surveys than in CRO. Indeed, conducting interviews to survey people is much more complex and costly than setting up an A/B test on a website. Statisticians have therefore created formulas that link the main parameters of an experiment for planning purposes:

  • The number of samples (or visitors) per variation
  • The baseline conversion rate
  • The magnitude of the effect we hope to observe
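
For illustration, one common normal-approximation form of such a formula (not necessarily the exact one used by any particular testing tool) is:

$$ n \;\approx\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\; p\,(1-p)}{(p\,\Delta)^{2}} $$

where n is the number of visitors per variation, p the baseline conversion rate, Δ the relative effect to detect (the MDE), α the significance level, and 1−β the statistical power. Knowing any two of traffic, baseline rate, and effect size pins down the third.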

This allows us to estimate the cost of collecting samples. The problem is that, among these three parameters, only one is known: the baseline conversion rate.

We don’t really know the number of visitors we’ll send per variation. It depends on how much time we allocate to data collection for this test, and ideally, we want it to be as short as possible. 

Finally, the conversion gain we will observe at the end of the experiment is certainly the biggest unknown, since that’s precisely what we’re trying to determine.

So, how do we proceed with so many unknowns? The solution is to estimate what we can using historical data. For the others, we create several possible scenarios:

  • The number of visitors can be estimated from past traffic, and we can make projections in weekly blocks.
  • The conversion rate can also be estimated from past data.
  • For each scenario configuration from the previous parameters, we can calculate the minimal conversion gains (MDE) needed to reach the significance threshold.

For example, with traffic of 50,000 visitors and a conversion rate of 3% (measured over 14 days), here’s what we get:

MDE Uplift
  • The horizontal axis indicates the number of days.
  • The vertical axis indicates the MDE corresponding to the number of days.

The leftmost point of the curve tells us that if we achieve a 10% conversion gain after 14 days, then this test will be a winner, as this gain can be considered significant. Typically, it will have a 95% chance of being better than the original. If we think the change we made in the variation has a chance of improving conversion by ~10% (or more), then this test is worth running, and we can hope for a significant result in 14 days.

On the other hand, if the change is minor and the expected gain is less than 10%, then 14 days will not be enough. To find out more, we move the curve’s slider to the right. This corresponds to adding days to the experiment’s duration, and we then see how the MDE evolves. Naturally, the MDE curve decreases: the more data we collect, the more sensitive the test becomes to smaller effects.

For example, by adding another week, making it a 21-day experiment, we see that the MDE drops to 8.31%. Is that sufficient? If so, we can validate the decision to create this experiment.

MDE Graph

If not, we continue to explore the curve until we find a value that matches our objective. Continuing along the curve, we see that a gain of about 5.44% would require waiting 49 days.

Minimum Detectable Uplift Graph

That's the time needed to collect enough data to declare this gain significant. If that's too long for your planning, you'll probably decide to run a more ambitious test in the hope of a bigger gain, or simply drop this test and use the traffic for another experiment. This will prevent you from ending up in the situation described at the beginning of this article, where you waste time and energy on an experiment doomed to fail.
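
If you want to reproduce this kind of curve yourself, here is a minimal Python sketch using a standard normal-approximation formula (two-sided α = 5%, 80% power) and assuming the 50,000 visitors per 14 days are counted per variation; AB Tasty's calculator may rely on different statistical assumptions, so the values will not match its output exactly:

```python
import numpy as np
from scipy.stats import norm

# Sketch of an MDE-versus-duration curve with a normal-approximation formula.
# Assumption: 50,000 visitors per 14 days, counted per variation.
baseline_rate = 0.03
visitors_per_variation_per_day = 50_000 / 14
alpha, power = 0.05, 0.80
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)

for days in (14, 21, 28, 35, 42, 49):
    n = visitors_per_variation_per_day * days
    mde_absolute = z * np.sqrt(2 * baseline_rate * (1 - baseline_rate) / n)
    mde_relative = mde_absolute / baseline_rate
    print(f"{days:>2} days  ->  MDE ≈ {mde_relative:.1%}")
```

The shape is what matters: the MDE shrinks roughly with the square root of the number of visitors, so halving it requires about four times the traffic.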

From MDE to MCE

Another approach to MDE is to see it as MCE: Minimum Caring Effect. 

This doesn't change the methodology; only the meaning you give to your test's minimal sensitivity threshold changes. So far, we've treated it as an estimate of the effect the variation could plausibly produce. But it can also be interesting to define that minimal sensitivity by its operational relevance: the MCE.

For example, imagine you can quantify the development and deployment costs of the variation and compare them to the conversion gain over a year. You could then say that an increase in the conversion rate of less than 6% would take more than a year to cover the implementation costs. So, even if you have enough traffic for a 6% gain to be significant, it may not have operational value, in which case it's pointless to run the experiment beyond the duration corresponding to that 6%.

MDE graph

In our case, we can therefore conclude that it's pointless to go beyond 42 days of experimentation: beyond that duration, if the measured gain still isn't significant, the real gain is almost certainly below 6% and therefore has no operational value for you.
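
As a back-of-the-envelope illustration, here is a tiny sketch with made-up figures, chosen only so that the break-even lands near the 6% used in the example above:

```python
# Back-of-the-envelope MCE sketch. All figures are made up and chosen only so
# that the break-even lands near the 6% used as an example in the text.
implementation_cost = 9_000      # € to build, QA and deploy the winning variation
annual_conversions = 3_000       # conversions per year on the page being tested
margin_per_conversion = 50       # € of margin per conversion

# Smallest relative uplift whose first-year gain covers the implementation cost.
break_even_uplift = implementation_cost / (annual_conversions * margin_per_conversion)
print(f"Minimum Caring Effect ≈ {break_even_uplift:.0%}")   # 6%
```

Any uplift below that threshold, significant or not, would not pay for itself within the year.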

Conclusion

AB Tasty’s MDE calculator feature will allow you to know the sensitivity of your experimental protocol based on its duration. It’s a valuable aid when planning your test roadmap. This will allow you to make the best use of your traffic and resources.

Looking for a minimalist MDE calculator to try? Check out our free Minimal Detectable Effect calculator here.