How to Better Handle Collateral Effects of Experimentation: Dynamic Allocation vs Sequential Testing

When talking about web experimentation, the topics that often come up are learning and earning. However, it’s important to remember that a big part of experimentation is encountering risks and losses. Although losses can be a touchy topic, it’s important to talk about and destigmatize failed tests in experimentation because it encourages problem-solving, thinking outside of your comfort zone and finding ways to mitigate risk.

Therefore, we will take a look at the shortcomings of classic hypothesis testing and look into other options. Basic hypothesis testing follows a rigid protocol:

Creating the variation according to the hypothesis
Waiting a given amount of time
Analyzing the result
Decision-making (implementing the variant, keeping the original, or proposing a new variant)

This rigid protocol and simple approach to testing doesn’t say anything about how to handle losses. This raises the question of what happens if something goes wrong? Additionally, the classic statistical tools used for analysis are not meant to be used before the end of the experiment.

If we consider a very general rule of thumb, let’s say that out of every 10 experiments, 8 will be neutral (show no real difference), one will be positive, and one will be negative. Practicing classic hypothesis testing suggests that you just accept that as a collateral effect of the optimization process hoping to even it out in the long term. It may feel like crossing a street blindfolded.

For many, that may not cut it. Let’s take a look at two approaches that try to better handle this problem:

Dynamic allocation – also known as “Multi Armed Bandit” (MAB). This is where traffic allocation changes for each variation according to their performance, implicitly lowering the losses.
Sequential testing – a method that allows you to stop a test as soon as possible, given a risk aversion threshold.

These approaches are statistically sound but they come with their assumptions. We will go through their pros and cons within the context of web optimization.

First, we’ll look into the classic version of these two techniques and their properties and give tips on how to mitigate some of their problems and risks. Then, we’ll finish this article with some general advice on which techniques to use depending on the context of the experiment.

Dynamic allocation (DA)

Dynamic allocation’s main idea is to use statistical formulas that modify the amount of visitors exposed to a variation depending on the variation’s performance.

This means a poor-performing variation will end up having little traffic which can be seen as a way to save conversions while still searching for the best-performing variation. Formulas ensure the best compromise between avoiding loss and finding the real best-performing variation. However, this implies a lot of assumptions that are not always met and that make DA a risky option.

There are two main concerns, both of which are linked to the time aspect of the experimentation process:

The DA formula does not take time into account

If there is a noticeable delay between the variation exposure and the conversion, the algorithm may go wrong resulting in a visitor being considered a ‘failure’ until they convert. This means that the time between a visit and a conversion will be falsely counted as a failure.

As a result, the DA will use the wrong conversion information in its formula so that any variation gaining traffic will automatically see a (false) performance drop because it will detect a growing number of non-converting visitors. As a result, traffic to that variation will be reduced.

The reverse may also be true: a variation with decreasing traffic will no longer have any new visitors while existing visitors of this variation could eventually convert. In that sense, results would indicate a (false) rise in conversions even when there are no new visitors, which would be highly misleading.

DA gained popularity within the advertising industry where the delay between an ad exposure and its potential conversion (a click) is short. That’s why it works perfectly well in this context. The use of Dynamic Allocation in CRO must be done in a low conversion delay context only.

In other words, DA should only be used in scenarios where visitors convert quickly. It’s not recommended for e-commerce except for short-term campaigns such as flash sales or when there’s not enough traffic for a classic AB test. It can also be used if the conversion goal is clicking on an ad on a media website.

DA and the different days of the week

It’s very common to see different visitor behavior depending on the day of the week. Typically, customers may behave differently on weekends than during weekdays.

With DA, you may be sampling days unevenly, implicitly giving more weight on some days for some variations. However, you should weigh each day the same because, in reality, you have the same amount of weekdays. You should only use Dynamic Allocation if you know that the optimized KPI is not sensitive to fluctuations during the week.

The conclusion is that DA should be considered only when you expect too few total visitors for classic A/B testing. Another requirement is that the KPI under experimentation needs a very short conversion time and no dependence on the day of the week. Taking all this into account: Dynamic Allocation should not be used as a way to secure conversions.

Sequential Testing (ST)

Sequential Testing is when a specific statistical formula is used enabling you to stop an experiment. This will depend on the performance of variations with given guarantees on the risk of false positives.

The Sequential Testing approach is designed to secure conversions by stopping a variation as soon as its underperformance is statistically proven.

However, it still has some limitations. When it comes to effect size estimation, the effect size may be wrong in two senses:

Bad variations will be seen as worse than they really are. It’s not a problem in CRO because the false positive risk is still guaranteed. This means that in the worst-case scenario, you will discard not a strictly losing variation but maybe just an even one, which still makes sense in CRO.
Good variations will be seen as better than they really are. It may be a problem in CRO since not all winning variations are useful for business. The effect size estimation is key to business decision-making. This can easily be mitigated by using sequential testing to stop losing variations only. Winning variations, for their part, should be continued until the planned end of the experiment, ensuring both correct effect size estimation and an even sampling for each day of the week.
It’s important to note that not all CRO software use this hybrid approach. Most of them use ST to stop both winning and losing variations, which is wrong as we’ve just seen.

As we’ve seen, by stopping a losing variation in the middle of the week, there’s a risk you may be discarding a possible winning variation.

However, to actually have a winning variation after ST has shown that it’s underperforming, this variation will need to perform so well that it becomes even with the reference. Then, it would also have to perform so well that it outperforms the reference and all that would need to happen in a few days. This scenario is highly unlikely.

Therefore, it’s safe to stop a losing variation with Sequential Testing, even if all weekdays haven’t been evenly sampled.

The best of both worlds in CRO

Dynamic Allocation is the best approach to experimentation instead of static allocation when you expect a small volume of traffic. It should be used only in the context of ‘short delay KPI’ and with no known weekday effect (for example: flash sales). However, it’s not a way to mitigate risk in a CRO strategy.

To be able to run experiments with all the needed guarantees, you need a hybrid system using Sequential Testing to stop losing variations and a classic method to stop a winning variation. This method will allow you to have the best of both worlds.

You might also like...

Which Statistical Model is Best for A/B Testing: Bayesian, Frequentist, CUPED, or Sequential?

Is Your Average Order Value (AOV) Misleading You?

Why AB Tasty Delivers 4x Faster