
Which Statistical Model is Best for A/B Testing: Bayesian, Frequentist, CUPED, or Sequential?

If you’ve ever run an A/B test, you know the thrill of watching those numbers tick up and down, hoping your new idea will be the next big winner. But behind every successful experiment is a secret ingredient: the statistical model that turns your data into decisions.

With so many options – Bayesian, Frequentist, CUPED, Sequential – it’s easy to feel like you’re picking a flavor at an ice cream shop you’ve never visited before. Which one is right for you? Let’s dig in!

The Scoop on Statistical Models

Statistical models are the brains behind your A/B tests. They help you figure out if your shiny new button color is actually better, or if you’re just seeing random noise. But not all models are created equal, and each has its own personality – some are straightforward, some are a little quirky, and some are best left to the pros.

Bayesian Testing Model: The Friendly Guide

Imagine you’re asking a friend, “Do you think this new homepage is better?” The Bayesian model is that friend who gives you a straight answer: “There’s a 92% chance it is!” Bayesian statistics use probability to tell you, in plain language, how likely it is that your new idea is actually an improvement.

Bayesian analysis works by updating what you believe as new data comes in. It’s like keeping a running tally of who’s winning the race, and it’s not shy about giving you the odds. This approach is especially handy for marketers, product managers, and anyone who wants to make decisions without a PhD in statistics. It’s clear, actionable, and – dare we say – fun to use.
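
To make that "running tally" concrete, here is a minimal sketch of the Bayesian approach using a Beta-Binomial model. The visitor and conversion counts are made up for illustration, and a flat prior is assumed; real tools layer more on top of this idea.

```python
# Minimal Bayesian A/B sketch: Beta-Binomial model with Monte Carlo sampling.
# The visitor/conversion counts below are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)

# Observed data: (conversions, visitors) for control A and variation B
conv_a, n_a = 480, 10_000
conv_b, n_b = 530, 10_000

# Start from a flat Beta(1, 1) prior and update it with the observed data;
# the posterior of each conversion rate is again a Beta distribution.
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

# "Chance to win": probability that B's conversion rate beats A's
print(f"P(B > A) = {(post_b > post_a).mean():.1%}")
```

The output reads exactly the way the friendly guide talks: a single probability that the variation is better.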

At AB Tasty, we love Bayesian. It’s our go-to because it helps teams make confident decisions without getting tangled up in statistical spaghetti. Most of our clients use it by default, and for good reason: it’s easy to understand, hard to misuse, and perfect for fast-paced digital teams.

Pros of Bayesian Testing:

  • Results are easy to interpret (“There’s a 92.55% chance to win!”).
  • Great for business decisions (and no need to decode cryptic p-values).
  • Reduces the risk of making mistakes from peeking at your data.

Cons of Bayesian Testing:

  • Some data scientists may prefer more traditional methods.
  • Can require a bit more computing power for complex tests.

Frequentist Testing Model: The Classic Statistician

If Bayesian is your friendly guide, Frequentist is the wise professor. This is the classic approach you probably learned about in school. Frequentist models use p-values to answer questions like, “If there’s really no difference, what are the chances I’d see results like this?”

Frequentist analysis is all about statistical significance. If your p-value is below 0.05, you’ve got a winner. This method is tried and true, and it’s the backbone of academic research and many data teams.

But here’s the catch: p-values can be tricky. They don’t tell you the probability that your new idea is better; they tell you the probability of seeing your data if nothing is actually different. It’s a subtle distinction, but it trips up even seasoned pros. If you’re comfortable with statistical lingo and want to stick with tradition, the Frequentist model is a good choice. Otherwise, it can feel a bit like reading tea leaves.
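
For comparison, here is a sketch of the classic two-proportion z-test behind that p-value. The numbers are the same illustrative counts as above; the comment spells out what the p-value does (and does not) mean.

```python
# Classic frequentist check: two-proportion z-test (pooled), illustrative numbers.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 480, 10_000   # control
conv_b, n_b = 530, 10_000   # variation

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value: probability of seeing a gap at least this large
# *if the two rates were actually identical* -- not the chance that B is better.
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```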

Pros of Frequentist Testing:

  • Familiar to statisticians and data scientists.
  • Matches legacy processes in many organizations.

Cons of Frequentist Testing:

  • Results can be confusing for non-experts.
  • Easy to misinterpret, leading to “false positives” if you peek at results too often.

CUPED Testing Model: The Speedster (But Only for the Right Crowd)

CUPED (Controlled Experiment Using Pre-Experiment Data) is designed to go fast by using data from before your experiment even started. By comparing your test results to users’ past behavior, CUPED can reduce the noise and help you reach conclusions quicker.

But here’s the twist: CUPED only shines when your users come back again and again, like on streaming platforms (Netflix) or big SaaS products (Microsoft). If you have an e-commerce site, CUPED can actually steer you wrong, leading to misleading results.

For most e-commerce teams, CUPED is a bit like putting racing tires on a city bike – not the best fit. But if you’re running experiments on a platform with high user recurrence, it can be a powerful tool in your kit.

Pros of CUPED Testing:

  • Can deliver faster, more precise results for high-recurrence platforms.
  • Makes the most of your existing data.

Cons of CUPED Testing:

  • Not suitable for most e-commerce or low-frequency sites.
  • Can lead to errors if used in the wrong context.
  • More complex to set up and explain.

Sequential Testing Model: The Early Warning System

Sequential testing is your experiment’s smoke alarm. Instead of waiting for a set number of visitors, it keeps an eye on your results as they come in. If things are going south – say, your new checkout flow is tanking conversions – it can sound the alarm early, letting you stop the test and save precious traffic.

But don’t get too trigger-happy. Sequential testing is fantastic for spotting losers early, but it’s not meant for declaring winners ahead of schedule. If you use it to crown champions too soon, you risk falling for false positives – those pesky results that look great at first but don’t hold up over time.
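
To give a feel for the mechanics, here is a deliberately simplified futility-monitoring sketch: check the data at a few interim looks and stop early only if the variation looks convincingly worse. Real sequential procedures use properly calibrated boundaries (alpha-spending, SPRT and friends); the threshold and the cumulative counts below are illustrative assumptions, not a production recipe.

```python
# Simplified futility monitoring: at each interim look, check whether
# variation B is credibly losing and stop the test early if so.
# The threshold and the made-up cumulative counts are illustrative only.
import numpy as np

def losing_badly(conv_a, n_a, conv_b, n_b, threshold=-3.0):
    """Return True if B looks convincingly worse than A (one-sided z below threshold)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return z < threshold

# Interim looks as traffic accrues: cumulative (conv_a, n_a, conv_b, n_b)
looks = [(50, 1_000, 20, 1_000), (110, 2_000, 52, 2_000), (170, 3_000, 90, 3_000)]
for i, (ca, na, cb, nb) in enumerate(looks, start=1):
    if losing_badly(ca, na, cb, nb):
        print(f"Look {i}: variation B is tanking -- stop the test early.")
        break
else:
    print("No early-stop signal; let the test run to its planned end.")
```

Note what the sketch does not do: it never declares a winner early. That asymmetry is the whole point of using sequential checks as a smoke alarm rather than a finish line.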

At AB Tasty, we use sequential testing as an early warning system. It helps our clients avoid wasting time and money on underperforming ideas, but we always recommend waiting for the full story before popping the champagne.

Pros of Sequential Testing:

  • Helps you spot and stop losing tests quickly.
  • Saves resources by not running doomed experiments longer than necessary.

Cons of Sequential Testing:

  • Not designed for picking winners early.
  • Can lead to mistakes if used without proper guidance.

Which Statistical Model is Best for A/B Testing?

If you’re looking for a model that’s easy to use, hard to misuse, and perfect for making fast, confident decisions, Bayesian is your best bet – especially if you’re in e-commerce or digital marketing. It’s the model we recommend for most teams, and it’s the default for a reason.

If you have a team of data scientists who love their p-values, or you’re working in a highly regulated environment, Frequentist might be the way to go. Just be sure everyone’s on the same page about what those numbers really mean.

Running a streaming service or a platform where users log in daily? CUPED could help you speed things up – just make sure you’ve got the right data and expertise.

And if you want to keep your experiments safe from disasters, Sequential is the perfect early warning system.

Conclusion: The Right A/B Testing Model for the Right Job

Choosing a statistical model for A/B testing doesn’t have to be a headache. Think about your team, your users, and your goals. For most, Bayesian is the friendly, reliable choice that keeps things simple and actionable. But whichever model you choose, remember: the best results come from understanding your tools and using them wisely.

Ready to run smarter, safer, and more successful experiments? Pick the model that fits your needs—and don’t be afraid to ask for help if you need it. After all, even the best chefs need a good recipe now and then.

Hungry for more?
Check out our guides on Bayesian vs. Frequentist A/B Testing and When to Use CUPED. Happy testing!


Is Your Average Order Value (AOV) Misleading You?

Average Order Value (AOV) is a widely used metric in Conversion Rate Optimization (CRO), but it can be surprisingly deceptive. While the formula itself is simple—summing all order values and dividing by the number of orders—the real challenge lies within the data itself.

The problem with averaging

AOV is not a “democratic” measure. A single high-spending customer can easily spend 10 or even 100 times more than your average customer. These few extreme buyers can heavily skew the average, giving a limited number of visitors disproportionate impact compared to hundreds or thousands of others. This is problematic because you can’t truly trust the significance of an observed AOV effect if it’s tied to just a tiny fraction of your audience.
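
A tiny worked example (with made-up numbers) shows how little it takes:

```python
# Made-up example: one extreme order moves the AOV far more than many typical ones.
typical_orders = [20.0] * 99          # 99 typical €20 orders
aov_without = sum(typical_orders) / len(typical_orders)

orders = typical_orders + [2_000.0]   # add a single €2,000 buyer
aov_with = sum(orders) / len(orders)

print(f"AOV without the big order: €{aov_without:.2f}")   # €20.00
print(f"AOV with one big order:    €{aov_with:.2f}")      # €39.80
```

One buyer out of a hundred nearly doubles the AOV, even though nothing changed for the other ninety-nine.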

Let’s look at a real dataset to see just how strong this effect can be. Consider the order value distribution:

  • The horizontal axis represents the order value.
  • The vertical axis represents the frequency of that order value.
  • The blue surface is a histogram, while the orange outline is a log-normal distribution approximation.

This graph shows that the most frequent order values are small, around €20. As the order value increases, the frequency of such orders decreases. This is a “long/heavy tail distribution,” meaning very large values can occur, albeit rarely.
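
You can reproduce the shape with a quick simulation. The log-normal parameters below are arbitrary assumptions, tuned only so the most frequent order sits around €20 like in the graph above.

```python
# Simulate a long-tail (log-normal) order value distribution.
# Parameters are arbitrary, chosen so the most frequent order is around €20.
import numpy as np

rng = np.random.default_rng(7)
orders = rng.lognormal(mean=4.0, sigma=1.0, size=10_000)  # order values in €

print(f"most frequent (mode) ~ €{np.exp(4.0 - 1.0**2):.0f}")
print(f"median order value     €{np.median(orders):.0f}")
print(f"average order value    €{orders.mean():.0f}")
print(f"largest single order   €{orders.max():.0f}")
# The average sits well above the median, and the maximum dwarfs both:
# a handful of orders carries a disproportionate share of the total.
```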

A single strong buyer with an €800 order value is worth 40 times more than a frequent buyer when looking at AOV. This is an issue because a slight change in the behavior of 40 visitors is a stronger indicator than a large change from one unique visitor. While not fully visible on this scale, even more extreme buyers exist. 

The next graph, using the same dataset, illustrates this better:

  • The horizontal axis represents the size of the growing dataset of order values (roughly indicating time).
  • The vertical axis represents the maximum order value in the growing dataset, in €.

At the beginning of data collection, the maximum order value is quite small (close to the most frequent value of ~€20), but it grows as time passes and the dataset expands. With a dataset of 10,000 orders, the maximum order value can exceed €5,000. This means any buyer with an order above €5,000 (and there may be more than one) carries 250 times the weight of a frequent buyer at €20. At the maximum dataset size, a single customer with an order over €20,000 can influence the AOV more than 2,000 other customers combined.
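
The same effect shows up in simulation: keep sampling from a heavy-tail distribution and track the largest order seen so far. The distribution and its parameters are the same illustrative assumptions as in the earlier sketch.

```python
# The maximum order value keeps growing as the dataset expands.
import numpy as np

rng = np.random.default_rng(7)
orders = rng.lognormal(mean=4.0, sigma=1.0, size=50_000)  # synthetic order values in €
running_max = np.maximum.accumulate(orders)

for n in (100, 1_000, 10_000, 50_000):
    print(f"after {n:>6} orders, largest order so far: €{running_max[n - 1]:.0f}")
```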

When looking at your e-commerce metrics, AOV should not be used as a standalone basis for decision-making.


The challenge of A/B test splitting

The problem intensifies when considering the random splits used in A/B tests.

Imagine you have only 10 very large spenders whose collective impact equals that of 10,000 medium buyers. There’s a high probability that the random split for such a small group of users will be uneven. While the overall dataset split is statistically even, the disproportionate impact of these high spenders on AOV requires specific consideration for this small segment. Since you can’t predict which visitor will become a customer or how much they will spend, you cannot guarantee an even split of these high-value users.

This phenomenon can artificially inflate or deflate AOV in either direction, even without a true underlying effect, simply depending on which variation these few high spenders land on.
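
An A/A simulation makes this visible: split identical, heavy-tail order data at random many times and measure the AOV gap between the two halves. There is no effect at all, yet the gaps are far from zero. The distribution parameters are the same illustrative assumptions as before.

```python
# A/A simulation: split identical heavy-tail orders at random and compare AOVs.
# There is no real effect, yet the split alone creates sizeable AOV gaps.
import numpy as np

rng = np.random.default_rng(7)
orders = rng.lognormal(mean=4.0, sigma=1.0, size=20_000)  # synthetic order values

gaps = []
for _ in range(1_000):
    assignment = rng.random(orders.size) < 0.5            # random 50/50 split
    gap = orders[assignment].mean() - orders[~assignment].mean()
    gaps.append(gap)

gaps = np.abs(gaps)
print(f"median spurious AOV gap: €{np.median(gaps):.2f}")
print(f"95th percentile gap:     €{np.percentile(gaps, 95):.2f}")
```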

What’s the solution?

If AOV is such an unreliable metric, how can we work with it effectively? The answer is similar to how you approach conversion rates and experimentation.

You don’t trust raw conversion data—one more conversion on variation B doesn’t automatically make it a winner, nor do 10 or 100. Instead, you rely on a statistical test to determine when a difference is significant. The same principle applies to AOV. Tools like AB Tasty offer the Mann-Whitney test, a statistical method robust against extreme values and well-suited for long-tail distributions.
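
For the curious, here is a minimal example of the Mann-Whitney U test on two groups of synthetic order values, using SciPy's standard implementation (a stand-in for whatever your testing platform runs under the hood). Because the test is rank-based, one extreme order barely moves the result.

```python
# Mann-Whitney U test on two groups of order values (synthetic data).
# Rank-based, so a single extreme order barely moves the result.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
orders_a = rng.lognormal(mean=4.00, sigma=1.0, size=5_000)  # control
orders_b = rng.lognormal(mean=4.05, sigma=1.0, size=5_000)  # variation, slightly higher

stat, p_value = mannwhitneyu(orders_a, orders_b, alternative="two-sided")
print(f"AOV A: €{orders_a.mean():.2f}, AOV B: €{orders_b.mean():.2f}")
print(f"Mann-Whitney p-value: {p_value:.4f}")
```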

AOV behavior can be confusing because you’re likely accustomed to the more intuitive statistics of conversion rates. Conversion data and their corresponding statistics usually align; a statistically significant increase in conversion rate typically means a visibly large difference in the number of conversions, consistent with the statistical test. However, this isn’t always the case with AOV. It’s not uncommon to see the AOV trend and the statistical results pointing in different directions. Your trust should always be placed in the statistical test.

The root cause: Heavy tail distributions

You now understand that the core issue stems from the unique shape of order value distributions: long-tail distributions that produce rare, extreme values.

It’s important to note that the problem isn’t just the existence of extreme values. If these extreme values were frequent, the AOV would naturally be higher, and their impact would be less dramatic because the difference between the AOV and these values would be smaller. Similarly, for the splitting problem, a larger number of extreme values would ensure a more even split.

At this point, you might think your business has a different order distribution shape and isn’t affected. However, this shape emerges whenever these two conditions are met:

  • You have a price list with more than several dozen different values.
  • Visitors can purchase multiple products at once.

Needless to say, these conditions are ubiquitous and apply to nearly every e-commerce business. The e-commerce revolution itself was fueled by the ability to offer vast catalogues.

Furthermore, the presence of shipping costs naturally encourages users to group their purchases to minimize those costs. It means that nearly all e-commerce businesses are affected. The only exceptions are subscription-based businesses with limited pricing options, where most purchases are for a single service.

Here’s a glimpse into the order value distribution across various industries, demonstrating the pervasive nature of the “long tail distribution”:

  • Cosmetics
  • Transportation
  • B2B packaging (selling packaging for e-commerce)
  • Fashion
  • Online flash sales

AOV, despite its simple definition and apparent ease of understanding, is a misleading metric. Its magnitude is easy to grasp, leading people to confidently make intuitive decisions based on its fluctuations. However, the reality is far more complex; AOV can show dramatic changes even when there’s no real underlying effect.

Conversely, significant changes can go unnoticed. A strong negative effect could be masked by just a few high-spending customers landing in a poorly performing variation. So, now you know: just as you do for conversion rates, rely on statistical tests for your AOV decisions.