Inconclusive A/B Test Results – What’s Next?

Have you ever run an experiment that left you with an unexpected result and unsure of what to do next? Many people find themselves in this position when they receive neutral, flat, or inconclusive A/B test results, and this is the question we aim to answer.

In this article, we are going to discuss what an inconclusive experimentation result is, what you can learn from it, and what the next step is when you receive this type of result.

What is an inconclusive experiment result?

There are two ways to define an inconclusive experiment: a practitioner’s answer and a more detailed one. The practitioner’s answer is numerical and depends on the statistics your platform reports:

The probability of a winner is less than 90-95%
The p-value is greater than 0.05
The lift confidence interval includes 0

In other words, an inconclusive result occurs when the results of an experiment are not statistically significant or the uplift is too small to be measured.
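As an illustration, the three criteria above can be checked with a few lines of Python. This is a minimal sketch, not the computation any particular platform uses: it assumes a classic two-sided z-test on two conversion rates, reports the interval on the absolute difference rather than on the relative lift, and uses hypothetical function names and visitor counts.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_summary(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (p_value, ci_low, ci_high) for the absolute difference B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the z-test (null hypothesis: no difference).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_value, diff - z_crit * se, diff + z_crit * se


def is_inconclusive(p_value, ci_low, ci_high, alpha=0.05):
    """Inconclusive if the p-value is above alpha or the interval contains 0."""
    return p_value > alpha or ci_low <= 0 <= ci_high


# Hypothetical data: 3.00% vs 3.15% conversion on 10,000 visitors each.
p_value, low, high = two_proportion_summary(300, 10_000, 315, 10_000)
print(is_inconclusive(p_value, low, high))
```

With these made-up numbers the interval straddles 0, so the result is flagged as inconclusive.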

However, let’s take note of the true meaning of “significance” in this case: it is a threshold that you set in advance on a metric or statistic. If this threshold is crossed, then an action will be taken, usually implementing the winning variation.

Setting thresholds for experimentation

It’s important to note that the user sets the threshold and there is no magic formula for calculating its value. The only mandatory requirement is that the threshold be set before the experiment begins. Fixing it in advance is what allows the statistical hypothesis testing protocol to mitigate the risk of making a poor decision or missing an opportunity.

To set a proper threshold, you will need a mix of statistical and business knowledge considering the context.

There is no golden rule, but there is a widespread consensus for using a “95% significance threshold.” However, it’s best to use this generalization cautiously as using the 95% threshold may be a bad choice in some contexts.

To make things simple, let’s consider that you’ve set a significance threshold that fits your experiment context. Then, having a “flat” result may have different meanings – we will dive into this more in the following sections.

The best tool: the confidence interval (CI)

The first thing to do after the planned end of an experiment is to check the confidence interval (CI), which provides useful information without any notion of significance. These intervals are usually built at a 95% confidence level. Informally, this means that there is a 95% chance that the real value lies between its boundaries. You can consider the boundaries to be an estimate of the best and worst-case scenarios.

Let’s say that your experiment is collaborating with a brand ambassador (or influencer) to attract more attention and sales. You want to see the impact the brand ambassador has on the conversion rate. There are several possible scenarios depending on the CI values:

Scenario 1:

The confidence interval of the lift is [-1% : +1%]. This means that in the best-case scenario, the ambassador effect is a 1% gain and in the worst-case scenario, the effect is -1%. If the revenue from this 1% relative gain is less than the cost of the ambassador, then you know that it’s okay to stop this collaboration.

A basic estimation can be done by taking this 1% of your global revenue from an appropriate period. If this is smaller than the cost of the ambassador, then there is no need for “significance“ to validate the decision – you are losing money.
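The basic estimation described above can be sketched as follows. The revenue and cost figures are hypothetical, and the function names are ours; the only input taken from the scenario is the +1% upper bound of the confidence interval.

```python
def best_case_gain(period_revenue, ci_upper_lift):
    """Extra revenue if the true lift sits at the top of the confidence interval."""
    return period_revenue * ci_upper_lift


def collaboration_pays_off(period_revenue, ci_upper_lift, ambassador_cost):
    """True only if even the best-case gain covers the ambassador's cost."""
    return best_case_gain(period_revenue, ci_upper_lift) > ambassador_cost


# Hypothetical figures: 200,000 in revenue over the period, ambassador cost of 5,000.
print(best_case_gain(200_000, 0.01))                # about 2,000 in the best case
print(collaboration_pays_off(200_000, 0.01, 5_000))
```

Here the best case brings in less than the collaboration costs, so the decision to stop does not depend on reaching significance.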

Sometimes neutrality is a piece of actionable information.

Scenario 2:

The confidence interval of the lift is [-1% : +10%]. Although this sounds promising, it’s important not to make quick assumptions. Since 0 is still in the confidence interval, you’re still unsure whether this collaboration has a real impact on conversion. In this case, it would make sense to extend the experiment period because the gain is more likely to be positive than negative.

It’s best to extend the experimentation period until the lower bound reaches a “comfortable” margin.

Let’s say that the cost of the collaboration is covered if the gain is as small as 3%. Then any CI whose lower bound is at least 3% will be okay. With a CI like this, you are ensuring that the worst-case scenario is break-even. And with more data, you will also have a better estimate of the best-case scenario, which will most likely be lower than the initial 10%.
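The reasoning behind the two scenarios can be summarized as a small decision rule. This is only a sketch: the function name and the three labels are hypothetical, and the 3% break-even lift is the example value used above.

```python
def next_step(ci_low, ci_high, break_even_lift):
    """Suggest an action from the lift confidence interval and the break-even lift."""
    if ci_high < break_even_lift:
        # Even the best case does not cover the cost (Scenario 1).
        return "stop"
    if ci_low >= break_even_lift:
        # Even the worst case covers the cost.
        return "implement"
    # The interval straddles the break-even point (Scenario 2).
    return "extend"


print(next_step(-0.01, 0.01, 0.03))  # stop
print(next_step(-0.01, 0.10, 0.03))  # extend
print(next_step(0.03, 0.08, 0.03))   # implement
```

The rule only reads the bounds of the interval, which is why it can be applied without any reference to a significance threshold.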

Important notice: do not repeat this too often, otherwise you may be waiting until your variant beats the original just by chance.

When extending a testing period, it’s safer to do it by looking at the CI rather than the “chances to win” or p-value, because the CI provides you with an estimate of the effect size. When the variant wins only by chance (a risk that increases each time you extend the testing period), it will yield a very small effect size.

You will notice the size of the gain by looking at the CI, whereas a p-value (or any other single statistical index) will not inform you about the size. Extending a test until such an index finally crosses the threshold is a known statistical mistake called p-hacking. P-hacking is basically running an experiment until you get what you expect.

The dangers of P-hacking in experimentation

It’s important to be cautious of p-hacking. Statistical tests are meant to be used once. Splitting the analysis into segments can, to some extent, be seen as running several different experiments. Therefore, if making a single decision at a 95% significance level means accepting a 5% risk of a false positive, then checking 2 segments implicitly leads to roughly doubling this risk to about 10%.

The following advice may help to mitigate this risk:

Limit the number of segments you are studying to only segments that could have a reason to interact differently with the variation. For example: if it’s a user interface modification, the screen size or the browser used may have an impact on how the modification is displayed, but the geolocation will not.
Use segments that convey strong information regarding the experiment. For example: changing the wording of a message probably has no link to the browser used. It may only matter in relation to the emotional needs of the visitors, which is something you can capture with new AI technology when using AB Tasty.
Don’t check the smallest segments. The smallest segments will not greatly impact your business overall and are often the least statistically significant. Raising the significance threshold may also be useful to mitigate the risk of having a false positive.
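The “roughly 10%” figure mentioned in this section can be checked with a short calculation. The sketch below assumes the segment tests are independent, which is a simplification. It also shows the Bonferroni correction, one common and conservative way of raising the threshold; it is named here only as an example and is not a method prescribed by this article.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Stricter per-test risk that keeps the overall risk at or below alpha."""
    return alpha / n_tests


for k in (1, 2, 5, 10):
    print(k, round(family_wise_error_rate(0.05, k), 4), bonferroni_alpha(0.05, k))
```

For 2 segments the overall risk is about 9.75%, close to the doubling described above. With 10 segments it climbs to roughly 40%, which is why limiting the number of segments matters.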

Should you extend the experiment period often?

If you notice that you often need to extend the experiment period, you might be skipping an important step in the test protocol: estimating the sample size you need for your experiment.

Unfortunately, many people skip this part of the protocol, thinking that they can fix it later by extending the period. However, this is bad practice for several reasons:

This brings you close to p-hacking
You may lose time and traffic on tests that will never be significant

Estimating a sample size, however, requires answering a question you can’t know the answer to: what will be the size of the lift? It’s impossible to know. This is one reason why experimenters don’t often use sample size calculators. The reason you test and experiment is because you do not know the outcome.

A far more intuitive approach is to use a Minimal Detectable Effect (MDE) calculator. Based on the base conversion rate and the number of visitors you send to a given experiment, an MDE calculator can help you come up with the answer to the question: what is the smallest effect you may be able to detect? (if it exists).

For example, if the total traffic on a given page is 15k visitors over 2 weeks, and the conversion rate is 3%, the calculator will tell you that the MDE is about 25% (relative). This means that what you are about to test must have a fairly large impact: going from 3% to 3.75% (25% relative growth).
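To make the example concrete, here is a minimal sketch of an MDE calculation. It is not the formula behind any specific calculator: it assumes a two-sided test at a 5% risk level, 80% statistical power, traffic split evenly between two variations, and the usual normal approximation for conversion rates. All parameter names are ours.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(total_visitors, base_rate, alpha=0.05, power=0.80, variations=2):
    """Smallest relative lift detectable with the given traffic (approximation)."""
    n_per_variation = total_visitors / variations
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    absolute_mde = (z_alpha + z_beta) * sqrt(
        2 * base_rate * (1 - base_rate) / n_per_variation
    )
    return absolute_mde / base_rate


# The example above: 15k visitors over the test period, 3% base conversion rate.
print(round(relative_mde(15_000, 0.03), 2))
```

Under these assumptions the result is about 0.26, in line with the “about 25%” quoted above. Doubling the traffic does not halve the MDE: the detectable effect shrinks only with the square root of the number of visitors.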

If your variant only changes the colors of a small button, developing an entire experiment may not be worth the time. Even if the new colors are better and give you a small uplift, it will not be significant in the classic statistical way (having a “chance to win” >95% or a p-value < 0.05).

On the other hand, if your variation tests a big change such as offering a coupon or a brand new product page format, then this test has a chance to give usable results in the given period.

Digging deeper into ‘flatness’

Some experiments may appear to be flat or inconclusive when in reality, they need a closer look.

For example, frequent visitors may be puzzled by your changes because they expect your website to remain the same, whereas new visitors may instantly prefer your variation. The effects on these two groups may cancel each other out when you look only at the overall results instead of investigating the data further. This is why it’s very important to take the time to dig into your visitor segments, as doing so can provide useful insights.

This can lead to very useful personalization, where only the segment that benefits from the variation is exposed to it.
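The cancelling effect can be illustrated with made-up numbers. In the sketch below, every figure is hypothetical and chosen only to show the mechanism: returning visitors convert worse on the variant, new visitors convert better, and the overall lift comes out flat.

```python
# Hypothetical (conversions, visitors) per segment and per arm.
segments = {
    "returning": {"control": (200, 5000), "variant": (180, 5000)},
    "new": {"control": (100, 5000), "variant": (120, 5000)},
}


def relative_lift(control, variant):
    """Relative change in conversion rate from control to variant."""
    control_rate = control[0] / control[1]
    variant_rate = variant[0] / variant[1]
    return (variant_rate - control_rate) / control_rate


def overall(data, arm):
    """Sum conversions and visitors over all segments for one arm."""
    conversions = sum(seg[arm][0] for seg in data.values())
    visitors = sum(seg[arm][1] for seg in data.values())
    return conversions, visitors


for name, seg in segments.items():
    print(name, round(relative_lift(seg["control"], seg["variant"]), 2))
print("overall", relative_lift(overall(segments, "control"), overall(segments, "variant")))
```

The overall lift is zero even though the two segments move by about -10% and +20%. Keep in mind the earlier warning about p-hacking: a difference found in a segment still needs enough traffic to be trusted.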

What is the next step after receiving an inconclusive experimentation result?

Let’s consider that your variant has no effect at all, or at least not enough to have a business impact. This still means something. If you reach this point, it means that all previous ideas fell short; you discovered no behavioral difference despite the changes you made in your variation.

What is the next step in this case? The next step is actually to go back to the previous step – the hypothesis. If you are correctly applying the testing protocol, you should have stated a clear hypothesis. It’s time to use it now.

There might be several meta-hypotheses about why your hypothesis has not been validated by your experiment:

The signal is too weak. You might have made a change, but perhaps it’s barely noticeable. If you offered free shipping, your visitors might not have seen the message if it’s too low on the page.
The change itself is too weak. In this case, try to make the change more substantial. If you have enlarged the product picture on the page by 5%, it’s time to try 10% or 15%.
The hypothesis might need revision. Maybe the trend is reversed. For instance, if the confidence interval of the gain is more on the negative side, why not try implementing the opposite idea?
Think of your audience. Even if you strongly believe in your hypothesis, it may simply be time to change your mind about what is important for your visitors and try something different.

It’s important to note that this change of mind is something that you’ve learned thanks to your experiment. This is not a waste of time – it’s another step forward to better knowing your audience.

Yielding an inconclusive experiment

An experiment that does not yield a clear winner (or loser) is often called neutral, inconclusive, or flat. This still produces valuable information if you know how and where to search. It’s not an end, it’s just another step further in your understanding of who you’re targeting.

In other words, an inconclusive experiment result is always a valuable result.
