Inconclusive A/B Test Results – What’s Next?

Have you ever had an experiment leave you with an unexpected result and found yourself unsure of what to do next? This is the case for many practitioners when they receive neutral, flat, or inconclusive A/B test results, and it’s the question we aim to answer here.

In this article, we are going to discuss what an inconclusive experimentation result is, what you can learn from it, and what the next step is when you receive this type of result.

What is an inconclusive experiment result?

There are two ways to define an inconclusive experiment: a quick practitioner’s definition and a more nuanced one. The practitioner’s definition is purely numerical and depends on the statistics your platform reports:

  • The probability of a winner is less than 90-95%
  • The p-value is greater than 0.05
  • The lift confidence interval includes 0

In other words, an inconclusive result happens when the outcome of an experiment is not statistically significant or the uplift is too small to be measured.
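
As a quick illustration, here is a minimal sketch in Python of such a check, assuming a two-sided test at the usual 5% level and a platform that reports both a p-value and a lift confidence interval:

```python
# Minimal sketch: flag an inconclusive A/B test result.
# Assumes a two-sided test at the 5% level and a platform that reports
# a p-value plus a confidence interval on the lift.

def is_inconclusive(p_value, ci_low, ci_high, alpha=0.05):
    """Return True when the result should be treated as inconclusive."""
    not_significant = p_value > alpha          # p-value above the threshold
    ci_contains_zero = ci_low <= 0 <= ci_high  # lift CI includes 0
    return not_significant or ci_contains_zero

# Example: p = 0.20 and a lift CI of [-1%; +1%] -> inconclusive
print(is_inconclusive(0.20, -0.01, 0.01))  # True
```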

However, let’s take note of what “significance” really means here: significance is simply a threshold you have set in advance on a metric or a statistic. If that threshold is crossed, an action is taken, usually implementing the winning variation.

Setting thresholds for experimentation

It’s important to note that the user sets the threshold and there is no magic formula for calculating its value. The only mandatory requirement is that the threshold is set before the experiment begins. In doing so, the statistical hypothesis protocol protects you and mitigates the risk of making a poor decision or missing an opportunity during experimentation.

To set a proper threshold, you will need a mix of statistical and business knowledge considering the context.

There is no golden rule, but there is a widespread consensus around using a 95% significance threshold. However, it’s best to apply this convention cautiously, as a 95% threshold may be a poor choice in some contexts.

To make things simple, let’s consider that you’ve set a significance threshold that fits your experiment context. Then, having a “flat” result may have different meanings – we will dive into this more in the following sections.

The best tool: the confidence interval (CI)

The first thing to do after the planned end of an experiment is to check the confidence interval (CI), which provides useful information without any notion of significance. These intervals are usually built at a 95% confidence level, meaning there is a 95% chance that the real value lies between the boundaries. You can treat the boundaries as estimates of the best- and worst-case scenarios.
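
If your platform does not surface the lift CI directly, a rough version can be computed by hand. The sketch below uses a standard normal approximation for the difference of two proportions, expressed relative to the control; the function name and the numbers are purely illustrative, not any platform’s internal model.

```python
import math

def relative_lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% CI for the relative lift of B over A.

    Normal approximation on the difference of two proportions,
    expressed relative to A's conversion rate. Illustrative only.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    low = (p_b - p_a) - z * se
    high = (p_b - p_a) + z * se
    return low / p_a, high / p_a  # bounds as relative lift

# Invented numbers: 300/10,000 vs 315/10,000 conversions
# -> roughly [-11%; +21%], i.e. an inconclusive result
print(relative_lift_ci(300, 10_000, 315, 10_000))
```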

Let’s say that your experiment involves collaborating with a brand ambassador (or influencer) to attract more attention and sales, and you want to measure the impact the ambassador has on the conversion rate. There are several possible scenarios depending on the CI values:

Scenario 1:

The confidence interval of the lift is [-1% : +1%]. This means that in the best-case scenario the ambassador effect is a 1% gain, and in the worst-case scenario it is a 1% loss. If even this best-case 1% relative gain is worth less than the cost of the ambassador, then you know it’s safe to stop the collaboration.

A basic estimate can be made by taking 1% of your global revenue over an appropriate period. If that amount is smaller than the cost of the ambassador, then there is no need for “significance” to validate the decision – you are losing money.

Sometimes neutrality is a piece of actionable information.

Scenario 2: 

The confidence interval of the lift is [-1% : +10%]. Although this sounds promising, it’s important not to jump to conclusions. Since 0 is still in the confidence interval, you’re still unsure whether this collaboration has a real impact on conversion. In this case, it makes sense to extend the experiment period, because the gain is more likely to be positive than negative.

It’s best to extend the experimentation period until the left bound gets to a “comfortable” margin.

Let’s say that the cost of the collaboration is covered if the gain is as small as 3%; then any CI of the form [3%, XXX%] will do. With a CI like this, you know that even the worst-case scenario breaks even. And with more data, you will also have a better estimate of the best-case scenario, which will likely be lower than the initial 10%.

Important notice: do not repeat this too often, otherwise you may be waiting until your variant beats the original just by chance.

When extending a testing period, it’s safer to do it by looking at the CI rather than the “chances to win” or P-value, because the CI provides you with an estimate of the effect size. When the variant wins only by chance (which you increase when extending the testing period), it will yield a very small effect size.

Looking at the CI shows you the size of the gain, whereas a p-value (or any other statistical index) tells you nothing about size. Repeatedly extending an experiment and re-checking until you get the result you expect is a well-known statistical mistake called p-hacking.

The dangers of P-hacking in experimentation

It’s important to be cautious of p-hacking. Statistical tests are meant to be used once. Splitting the analysis into segments can, to some extent, be seen as running several different experiments. Therefore, if making a single decision at a 95% significance level means accepting a 5% risk of a false positive, then checking 2 segments implicitly doubles this risk to roughly 10%.
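
Where does that “roughly 10%” come from? Assuming the segments behave like independent tests, each run at a 5% risk (a simplification), the family-wise risk grows like this:

```python
# Rough family-wise false-positive risk when checking several segments,
# assuming independent tests at a 5% risk each (a simplification).
alpha = 0.05
for k in (1, 2, 5, 10):
    risk = 1 - (1 - alpha) ** k
    print(f"{k} segment(s): {risk:.1%} chance of at least one false positive")
# 1 -> 5.0%, 2 -> 9.8% (the "roughly 10%"), 5 -> 22.6%, 10 -> 40.1%
```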

The following advice may help mitigate this risk:

  • Limit the segments you study to those that have a reason to interact differently with the variation. For example: for a user interface modification, segments such as screen size or browser may affect how the change is displayed, but geolocation will not.
  • Use segments that convey strong information regarding the experiment. For example: changing the wording of a message likely has no link to the browser used; it may only affect the emotional needs of visitors, which is something you can capture with new AI technology when using AB Tasty.
  • Don’t check the smallest segments. The smallest segments will not greatly impact your business overall and are often the least statistically significant. Raising the significance threshold may also be useful to mitigate the risk of a false positive.

Should you extend the experiment period often?

If you notice that you often need to extend the experiment period, you might be skipping an important step in the test protocol: estimating the sample size you need for your experiment.

Unfortunately, many people are skipping this part of the experiment thinking that they can fix it later by extending the period. However, this is bad practice for several reasons:

  • This brings you close to P-hacking
  • You may lose time and traffic on tests that will never be significant

Sample size calculators ask a question you cannot know the answer to: what will be the size of the lift? It’s impossible to know – the very reason you test and experiment is that you do not know the outcome. This is one reason why experimenters don’t often use them.

A far more intuitive approach is to use a Minimal Detectable Effect (MDE) calculator. Based on the baseline conversion rate and the number of visitors you send to a given experiment, an MDE calculator helps you answer the question: what is the smallest effect you would be able to detect, if it exists?

For example, if the total traffic on a given page is 15k for 2 weeks, and the conversion rate is 3% – the calculator will tell you that the MDE is about 25% (relative). This means that what you are about to test must have a quite big impact: going from 3% to 3.75% (25% relative growth).
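
For reference, that ~25% figure can be reproduced with the standard two-proportion power approximation. This is only a sketch, assuming a 50/50 traffic split, 95% significance and 80% power; a real MDE calculator may use slightly different conventions.

```python
import math

def relative_mde(baseline_rate, visitors_per_variation, z_alpha=1.96, z_power=0.84):
    """Approximate relative MDE at 95% significance and 80% power (two-sided)."""
    p = baseline_rate
    se = math.sqrt(2 * p * (1 - p) / visitors_per_variation)
    return (z_alpha + z_power) * se / p

# 15k visitors over 2 weeks, split 50/50, with a 3% baseline conversion rate
print(f"{relative_mde(0.03, 7_500):.0%}")  # ~26% relative, close to the ~25% above
```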

If your variant only changes the colors of a small button, developing an entire experiment may not be worth the time. Even if the new colors are better and give you a small uplift, it will not be significant in the classic statistical sense (a “chance to win” above 95% or a p-value below 0.05).

On the other hand, if your variation tests a big change such as offering a coupon or a brand new product page format, then this test has a chance to give usable results in the given period.

Digging deeper into ‘flatness’

Some experiments may appear to be flat or inconclusive when in reality, they need a closer look.

For example, frequent visitors may be puzzled by your changes because they expect your website to remain the same, whereas new visitors may instantly prefer your variation. The effects of the two groups may cancel each other out when you only look at the overall results instead of investigating the data further. This is why it’s very important to take the time to dig into your visitor segments, as this can provide useful insights.

This can lead to very useful personalization, where only the segment that benefits from the variation is exposed to it.

What is the next step after receiving an inconclusive experimentation result?

Let’s consider that your variant has no effect at all, or at least not enough to have a business impact. This still means something. If you reach this point, it means that the ideas you tested fell short: you discovered no behavioral difference despite the changes you made in your variation.

What is the next step in this case? The next step is actually to go back to the previous step – the hypothesis. If you are correctly applying the testing protocol, you should have stated a clear hypothesis. It’s time to use it now.

There might be several meta-hypotheses about why your hypothesis has not been validated by your experiment:

  • The signal is too weak. You might have made a change, but perhaps it’s barely noticeable. If you offered free shipping, your visitors might not have seen the message if it sits too low on the page.
  • The change itself is too weak. In this case, try to make the change more substantial. If you increased the size of the product picture on the page by 5%, it’s time to try 10% or 15%.
  • The hypothesis might need revision. Maybe the trend is reversed. For instance, if the confidence interval of the gain leans toward the negative side, why not try implementing the opposite idea?
  • Think of your audience. Even if you have a strong belief in your hypothesis, it may simply be time to change your mind about what matters to your visitors and try something different.

It’s important to note that this change of direction is something you’ve learned thanks to your experiment. This is not a waste of time – it’s another step toward knowing your audience better.

When an experiment yields an inconclusive result

An experiment that does not yield a clear winner (or loser) is often called neutral, inconclusive, or flat. It still produces valuable information if you know how and where to look. It’s not an end; it’s another step forward in your understanding of who you’re targeting.

In other words, an inconclusive experiment result is always a valuable result.

CRO Metrics: Navigating Pitfalls and Counterintuitive KPIs

Metrics play an essential role in measuring performance and influencing decision-making.

However, relying on certain metrics alone can lead you to misguided conclusions and poor strategic choices. Potentially misleading metrics are often referred to as “pitfall metrics” in the world of Conversion Rate Optimization.

Pitfall metrics are data indicators that can give you a distorted version of reality or an incomplete view of your performance if analyzed in isolation. Pitfall metrics can even cause you to backtrack in your performance if you’re not careful about how you evaluate these metrics.

Metrics are typically split into two categories:

  • Session metrics: Any metrics that are measured on a session instead of a visitor basis
  • Count metrics: Metrics that count events (for instance number of pages viewed)

Some metrics can fall into both categories. Needless to say, that’s the worst option, for a few main reasons: no real statistical model exists for metrics that straddle both categories, there is no direct or simple link to business objectives, and these metrics may not lend themselves to standard optimization.

While metrics are very valuable for business decisions, it’s crucial to use them wisely and be mindful of potential pitfalls in your data collection and analysis. In this article, we will explore and explain why some metrics are unwise to use in practice in CRO.

Session-based metrics vs visitors

One problem with session-based metrics is that “power users” (i.e. users returning for multiple sessions during the experiment) will bias the results.

Let’s remember that during experimentation, the traffic split between the variations is a random process.

Typically, you picture a traffic split as producing random but evenly matched groups. When we talk about big groups of users, this is typically true. However, with a small group, it’s very unlikely that you will get an even split in terms of visitor behaviors, intentions and types.

Let’s say that you have 12 power users that need to be randomly divided between two variations. Let’s say that these power users have 10x more sessions than the average user. It’s quite likely that you will end up with a 4 and 8 split, a 2 and 10 split, or another uneven split. Having an even split randomly occur is very unlikely. You will then end up in one of two very likely situations:

  • Situation 1: A handful of power users may make you believe you have a winning variation when none actually exists
  • Situation 2: A genuinely winning variation is masked because it received too few of these power users
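
A quick simulation, purely illustrative and not part of any testing platform, shows how rarely 12 power users split evenly between two variations:

```python
import random
from collections import Counter

random.seed(42)

# Assign 12 power users at random to variations A and B, many times over,
# and count how often the split is exactly even (6 and 6).
splits = Counter()
trials = 100_000
for _ in range(trials):
    in_a = sum(random.random() < 0.5 for _ in range(12))
    splits[in_a] += 1

even = splits[6] / trials
print(f"Exactly 6/6 split: {even:.0%} of the time")      # about 23%
print(f"Uneven split:      {1 - even:.0%} of the time")  # about 77%
```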

Another problem with session-based metrics is that a session-based approach blurs the meaning of important metrics like transaction rates. The recurring problem here is that not all visitors display the same type of behavior. If average buyers need 3 sessions to make a purchase while some need 10, this is a difference in user behavior and does not have anything to do with your variation. If your slow buyers are not evenly split between the variations, then you will see a discrepancy in the transaction rate that doesn’t actually exist.

Moreover, the metric itself loses part of its intuitive meaning. If your real conversion rate per unique visitor is around 3%, counting by session instead of by unique visitor will likely show you a conversion rate closer to 1%.

This is not only disappointing but very confusing.

Imagine a variation urging visitors to buy sooner by using “stress marketing” techniques. Let’s say this leads to a one-session purchase instead of a three-session one. You will see a huge gain (3x) on the conversion rate per session. BUT this “gain” is not an actual gain, since the number of purchases, and therefore the revenue earned, is unchanged. It’s also good to keep in mind that visitors under pressure may not feel very happy or comfortable with such a rushed purchase and may not return.
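
The arithmetic behind this example is easy to reproduce; the figures below are invented purely for illustration:

```python
# Invented figures: the same 300 buyers out of 10,000 visitors, counted two ways.
visitors, buyers = 10_000, 300

sessions_original = visitors * 3   # buyers need ~3 sessions to purchase
sessions_variation = visitors * 1  # "stress marketing" pushes purchase to session 1

print(f"Per unique visitor:      {buyers / visitors:.1%}")            # 3.0% in both cases
print(f"Per session (original):  {buyers / sessions_original:.1%}")   # 1.0%
print(f"Per session (variation): {buyers / sessions_variation:.1%}")  # 3.0%, a "3x gain"
# The number of purchases, and therefore the revenue, is identical in both cases.
```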

It’s best practice to avoid using session-based metrics unless you don’t have another choice as they can be very misleading.

Understanding count metrics

We will come back to our comparison of these two types of metrics. But for now, let’s get on the same page about “count metrics.” To understand why count metrics are harder to assess, you need to have more context on how to measure accuracy and where exactly the measure comes from.

To model the accuracy of a rate measurement, we use a beta distribution. In the graph below, we see the measures of two conversion rates – one blue and one orange. The X-axis is the rate and the Y-axis is the likelihood. When we measure the probability that the two rates are different, we implicitly look at the part of the two curves that overlaps.

In this case, the two curves have very little overlap. Therefore, the probability that these two rates are actually different is quite high.

The narrower or more compact the distributions are, the easier it is to see that they are different.
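
In practice, this overlap can be quantified by sampling from the two distributions. The sketch below assumes a uniform Beta(1, 1) prior and invented traffic numbers; it is one common way to estimate the probability that B’s rate is higher than A’s, not necessarily the exact computation your platform performs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: conversions / visitors for each variation.
conv_a, n_a = 300, 10_000
conv_b, n_b = 360, 10_000

# Beta posteriors for the two conversion rates (uniform Beta(1, 1) prior assumed).
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

# The less the two curves overlap, the closer this probability gets to 100%.
print(f"P(rate B > rate A) ≈ {(samples_b > samples_a).mean():.1%}")
```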


The fundamental difference between conversion and count distributions

Conversion metrics are bounded in [0; 1] as a rate, or [0%; 100%] as a percentage. Count metrics, however, have an open range: counts live in [0; +infinity).

The following figure shows a gamma distribution (in orange) that may be used with this kind of data, along with a beta distribution (in blue).

These two distributions are based on the same data: 10 visitors and 5 successes. This is a 0.5 success rate (or 50%) when considering unique conversions. In the context of multiple conversions, it’s a process with an average of 0.5 conversions per visitor.

Notice that the orange curve (for the count metric) is non-zero above x = 1; this clearly shows that the model expects some visitors to convert more than once.
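
Here is a minimal sketch of that contrast on the same data (10 visitors, 5 conversions), using the shapes implied directly by the data so that both models have the 0.5 mean mentioned above; the exact parameterization behind the article’s figures may differ.

```python
from scipy import stats

visitors, conversions = 10, 5

# Rate model: a conversion rate bounded in [0; 1] (beta distribution).
beta_dist = stats.beta(conversions, visitors - conversions)

# Count model: conversions per visitor on [0; +inf) (gamma distribution).
gamma_dist = stats.gamma(a=conversions, scale=1 / visitors)

print(beta_dist.mean(), beta_dist.std())    # 0.50 mean, std ~0.15 (narrower)
print(gamma_dist.mean(), gamma_dist.std())  # 0.50 mean, std ~0.22 (more spread out)
print(1 - gamma_dist.cdf(1.0))              # a few % of mass above 1 conversion/visitor
```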

We will see that comparisons between this kind of metric depend on whether we consider it as a count metric or as a rate. There are two options:

  • Either we consider that the process is a conversion process, using a beta distribution (in blue), which is naturally bounded in [0;1].
  • Or we consider that the process is a count process, using gamma distribution (in orange), which is not bounded on the right side.

On the graph, we see an inherent property of count data distributions: they are asymmetric, with a right tail that goes to 0 more slowly than the left. This naturally makes them more spread out than the beta distribution.

Since both curves are distributions, the area under each curve must be 1.

As you can see, the beta distribution (in blue) has a higher peak than the gamma distribution (in orange), which shows that the gamma distribution is more spread out. This is a hint that count metrics are harder to measure accurately than conversion metrics, and it is also why we need more visitors to assess a difference with count metrics than with conversion metrics.

To understand this problem, imagine two gamma distribution curves, one for each variation of an experiment. Then gradually shift one to the right, showing an increasing difference between the two distributions (see figure below).

Since both curves are right-skewed, the overlap region will occur on at least one of the skewed parts of the distributions.

This means that differences will be harder to assess with count data than with conversion data. This comes from the fact that count data works on an open range, whereas conversion rates work on a closed range.

Do count metrics need more visitors to get accurate results?

No, it is more complex than that in the CRO context. Typical statistical tests for count metrics are not suited for CRO in practice.

Most of these tests come from the industrial world. A classic usage of count metrics is counting the number of failures of a machine in a given timeframe. In this context, the risk of failure doesn’t depend on previous events. If a machine already had one failure and has been repaired, the chance of a second failure is considered to be the same.

This hypothesis is not suited to the number of pages viewed by a visitor. In reality, a visitor who has already seen two pages is more likely to see a third page than a visitor who has only seen one (and who therefore has a high probability of “bouncing”).

The industrial model does not fit the CRO context, which deals with human behavior and is therefore much more complex.

Not all conversions have the same value

The next CRO struggle also comes from the direct exploitation of formulas from the industrial world.

If you run a plant that produces goods with machines, and you test a new kind of machine that produces more goods per day on average, you will conclude that these new machines are a good investment. Because the value of a machine is linear with its average production, each extra product adds the same value to the business.

But this is not the same in CRO.

Imagine this experiment result for a media company:

Variation B is yielding 1,000 more page views than the original A. Based on that data, you put variation B in production. Now, say that variation B lost 500 people who each viewed 2 pages, and won 20 people who each viewed 100 pages. That makes a net benefit of 1,000 page views for variation B.

But what about the value? These 20 people, even if they spend a lot of time on the media site, may not be worth as much as 500 people who come back regularly.

In CRO, each extra unit added to a count metric does not carry the same value, so you cannot treat a measured increment as direct added value.

In applied statistics, one adds an extra layer to the analysis: a utility function, which links extra counts to value. This utility function is very specific to the problem and is unknown for most CRO problems. Even if you gain some conversions on a count metric, you cannot be sure of the real value of this gain (if any).
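
As an illustration only, a hypothetical utility function for the media example above might discount later page views. Both the curve and the numbers are invented:

```python
# Hypothetical utility function: the value of each successive page view shrinks.
# Both the curve and the numbers are invented; a real utility function must come
# from business knowledge (ad revenue per view, engagement value, and so on).
def pageview_value(n_views):
    return sum(1.0 / (k + 1) for k in range(n_views))

lost = 500 * pageview_value(2)   # 500 regular readers at 2 pages each
won = 20 * pageview_value(100)   # 20 heavy readers at 100 pages each

print(f"Raw page views:   {20 * 100 - 500 * 2:+}")   # +1000 views for variation B
print(f"Utility-weighted: {won - lost:+.0f}")        # negative with this curve
```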

Some count metrics are not meant to be optimized

Let’s see some examples where raising the number of a count metric might not be a good thing:

  • Page views: If the count of page views rises, you might think it’s a good thing because people are seeing more of your products. However, it could also mean that people are getting lost and need to browse more pages to find what they need.
  • Items added to cart: The same idea applies to the number of products added to the cart. If you don’t check how many products remain in the cart at the checkout stage, you don’t know whether the variation helps sell more or simply makes product selection harder.
  • Products purchased: Even the number of products purchased may be misleading if used alone as a business objective in an optimization context. Visitors could be buying two cheaper products instead of one high-quality (and more expensive) product.

You can’t tell just by looking at these KPIs if your variation or change is good for your business or not. There is more that needs to be considered when looking at these numbers.

How do we use this count data then?

We have seen in this article how counterintuitive session-based optimization is and, even worse, how misleading count metrics are in CRO.

Unless you have both business and statistics expertise on hand, it’s best practice to avoid them, at least as your only KPI.

As a workaround, you can use several conversion metrics with specific triggers using business knowledge to set the thresholds. For instance:

  • Use one conversion metric for counts in the range [1; 5], called “light users.”
  • Use another conversion metric for the range [6; 10], called “medium users.”
  • Use another one for the range [11; +infinity), called “heavy users.”

Splitting up the conversion metrics in this way will give you a clearer signal about where you gain or lose conversions.
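
Here is a minimal sketch of this bucketing, using the thresholds from the list above; the per-visitor counts are invented:

```python
# Turn a raw count (e.g. page views per visitor) into three conversion metrics,
# using the thresholds from the list above. The per-visitor counts are invented.
def bucket(count):
    if 1 <= count <= 5:
        return "light"
    if 6 <= count <= 10:
        return "medium"
    if count >= 11:
        return "heavy"
    return None  # no event at all

page_views_per_visitor = [0, 2, 3, 7, 12, 1, 9, 25]
total = len(page_views_per_visitor)

for name in ("light", "medium", "heavy"):
    converted = sum(bucket(c) == name for c in page_views_per_visitor)
    print(f"{name:6} users: {converted / total:.0%} of visitors")
```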

Another piece of advice is to use several KPIs to have a broader view.

For instance, although analyzing product views alone is not a good idea, you can check the overall conversion rate and the average order value at the same time. If product views and conversion KPIs go up and the average order value is stable or rises, then you can conclude that your new product page layout is a success.

Counterintuitive Metrics in CRO

Now you see that, except for conversions counted on a unique-visitor basis, nearly all other metrics can be counterintuitive to use in CRO. Mistakes happen because the statistics work differently, and also because these metrics and their changes can be interpreted in several ways.

It’s important to understand that CRO skill is a mix of statistics, business and UX knowledge. Since it’s very rare to find all of this in one person, the key is to spread the needed skills across a team with good communication.