Multi-Armed Bandits: A/B Testing with Fewer Regrets

In an A/B test, traffic is usually split between the variations. Because you don’t know in advance which variation will win, a good share of that traffic goes to the ‘losing’ or ‘underperforming’ variation, which can cost you conversions and hence sales.

Multi-armed bandit tests and algorithms help to limit this problem, thereby reducing experiment ‘regret’.

The multi-armed bandit problem

The term multi-armed bandit comes from a hypothetical scenario in which a gambler facing a row of slot machines has to decide which machines to play, how many times to play each one and in which order.

This is referred to as the ‘multi-armed bandit problem’ where, as in the scenario above, you’re presented with a series of choices and must decide the best course of action to achieve the most profitable outcome possible.

In this scenario, the gambler might play every machine equally, collecting enough data to determine which one pays out the most, which is essentially what A/B testing boils down to. However, they risk wasting a lot of time, especially on low-paying machines. This is referred to as exploration.

Alternatively, the gambler may quickly test a few of the machines, identify the one that appears to have the highest pay-off and keep playing that machine. This is referred to as exploitation.

A/B testing and multi-armed bandits

In marketing, a solution to the multi-armed bandit problem comes in the form of a more complex type of A/B testing that uses machine learning algorithms to dynamically allocate more traffic to variations that are performing well and less traffic to variations that are underperforming.

Therefore, the underlying concept behind multi-armed bandits is dynamic traffic allocation, which uses an algorithm to modify the quantity of traffic sent to each live test variation.

In other words, the algorithm detects the highest-performing variation and sends more traffic to it in order to maximize the outcome, for example by capturing conversions that would otherwise have been lost to the underperforming variation.

Put simply, once you have chosen your primary KPI, traffic is re-assigned based on how each variation performs on that KPI. The goal behind multi-armed bandits is thus to find the action that has the highest expected reward.
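
One simple way to picture dynamic traffic allocation is an epsilon-greedy rule: most visitors see the variation with the best observed conversion rate so far, while a small share (epsilon) is still sent to a random variation. The sketch below is a minimal illustration under that assumption. Epsilon-greedy is just one possible allocation rule, not necessarily the one a given testing platform uses, and the function names, the simulated conversion rates and the default epsilon of 0.1 are hypothetical.

```python
import random


def choose_variation(visitors, conversions, epsilon, rng):
    """Pick the variation to show to the next visitor."""
    n = len(visitors)
    # Show every variation at least once before comparing rates.
    for arm in range(n):
        if visitors[arm] == 0:
            return arm
    # With probability epsilon, keep exploring a random variation.
    if rng.random() < epsilon:
        return rng.randrange(n)
    # Otherwise exploit the best observed conversion rate (the KPI here).
    rates = [conversions[arm] / visitors[arm] for arm in range(n)]
    return max(range(n), key=lambda arm: rates[arm])


def run_epsilon_greedy(true_rates, n_visitors, epsilon=0.1, seed=0):
    """Simulate a test; true_rates are made-up rates used only for simulation."""
    rng = random.Random(seed)
    visitors = [0] * len(true_rates)
    conversions = [0] * len(true_rates)
    for _ in range(n_visitors):
        arm = choose_variation(visitors, conversions, epsilon, rng)
        visitors[arm] += 1
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
    return visitors, conversions
```

Calling run_epsilon_greedy([0.05, 0.10], 5000) returns how many visitors each variation received and how many converted, so you can see how the split drifts away from 50/50 as data comes in.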

Exploration and exploitation

Previously, we mentioned the concepts of exploration and exploitation when it comes to resolving the multi-armed bandit problem.

Exploration means trying out the available options to find out which may produce the best results, while exploitation means choosing an action that has paid off before.

A/B tests focus on exploration. In other words, you test variations until the results are statistically significant in order to determine which one achieved the highest conversions, or whichever metric you set at the beginning of the experiment.

Thus, A/B testing lets you explore the performance of variations by allocating traffic equally between them. Only once a winning variation has been declared are all users directed to it (exploitation).
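
To make ‘statistically significant’ a little more concrete, here is a minimal sketch of one common way to compare two conversion rates at the end of a classic A/B test: a two-proportion z-test. The choice of this particular test and the function name are assumptions made for illustration, and your testing tool may well use a different method.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that both variations perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variance at all (e.g. zero conversions everywhere): no evidence.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, two_proportion_z_test(100, 1000, 150, 1000) compares a 10% and a 15% conversion rate on 1,000 visitors each. A p-value below a threshold you fixed in advance (0.05 is a common convention) is what is usually reported as a significant result.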

As previously mentioned, this could mean losing out on possible conversions since you’ve directed users to an underperforming variation in an effort to gather results and choose the variation that works best.

This is where dynamic allocation comes into play, which helps to gradually move traffic to the high-performing variation instead of having to wait till the end of the experiment to find the winner and only then direct all users to it.

Multi-armed bandit solutions: Striking a balance between exploration and exploitation

A number of multi-armed bandit algorithms aim to find the right balance between exploration and exploitation and so resolve the multi-armed bandit problem.

One such algorithm is known as Thompson Sampling, a Bayesian algorithm, which according to Wikipedia “consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.”

Therefore, using this algorithm, a variation that appears to be performing better will receive more traffic, while a poorly performing variation will receive fewer visits.
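
For conversion-style outcomes (a visitor either converts or doesn’t), Thompson Sampling is often illustrated with a Beta distribution per variation. The ‘randomly drawn belief’ is one sampled conversion rate per variation, and the variation with the highest sample is shown to the next visitor. The sketch below assumes a uniform Beta(1, 1) starting belief and made-up true conversion rates for the simulation; the function names are hypothetical.

```python
import random


def thompson_choose(successes, failures, rng):
    """Draw one plausible conversion rate per variation and pick the highest."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1) prior updated with the data
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=lambda arm: samples[arm])


def run_thompson(true_rates, n_visitors, seed=0):
    """Simulate a test; true_rates are made-up rates used only for simulation."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_visitors):
        arm = thompson_choose(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    visits = [s + f for s, f in zip(successes, failures)]
    return visits, successes
```

Early on, the beliefs are wide, so every variation still gets picked fairly often. As evidence accumulates, the samples for weak variations rarely come out on top, and their traffic shrinks without ever being cut off completely.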

Why is this important?

As already mentioned, this type of algorithm helps limit the loss of conversions associated with sending traffic to an underperforming variation. In other words, it minimizes regret: the difference between your actual payoff and the payoff you would have achieved had you played the optimal variation at every opportunity.
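
As a rough worked example, regret can be estimated in expectation when the true conversion rates are known, which in practice is only the case in a simulation. The sketch below uses that expected form; the function name and the example rates are assumptions made for illustration.

```python
def expected_regret(true_rates, visits):
    """Expected conversions lost versus sending every visitor to the best variation."""
    best = max(true_rates)
    return sum(n * (best - rate) for rate, n in zip(true_rates, visits))
```

With true rates of 5% and 10%, an equal split of 1,000 visitors (500 each) gives an expected regret of about 25 conversions, whereas a 100/900 split in favor of the better variation brings it down to about 5.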

In a classic A/B test, the better-performing, higher-earning variation is not exploited while the test is running, which leads to regret and wasted resources, since you keep exploring inferior variations in an attempt to gather sufficient data.

Bandit algorithms, on the other hand, seek to strike a balance between exploration and exploitation by sufficiently exploring variants to identify the winning one and then exploiting it to reach the maximum reward.

Multi-armed bandits are ideal for situations where you don’t have enough time to run a test until it reaches statistically significant results, since such tests let you obtain results faster. In these cases, the focus is on maximizing conversions rather than on reaching significance, for example when you’re looking to optimize pricing for a special, limited offer.

They are also useful in time-sensitive situations, such as testing short-lived content, for example headlines for a news article.

When are multi-armed bandit tests preferable over traditional A/B tests?

Multi-armed bandit testing is ideal in the short term, when your goal is to maximize conversions. However, if your objective is to collect data for a critical business decision or to run tests for long-term campaigns, then A/B testing may be more useful and relevant to your objectives.

Multi-armed bandit tests are also useful for targeting purposes, since they can find the best variation for a predefined user group that you specifically want to target.

Furthermore, this type of testing is more suitable when you have many variations to test, say more than six: through dynamic allocation, the worst-performing variations can be detected quickly and tests can be conducted on the relevant ones.

Finally, running multi-armed bandit tests is preferable when there are high opportunity costs associated with each lost conversion that might result from running a classic A/B test.

As a word of caution, keep in mind that multi-armed bandit experiments are more complex, so they require more resources and greater technical expertise to run.

Summing up

There is no clear winner between A/B and multi-armed bandit tests. Whatever you choose will depend on your objectives, resources and how pressed you are for time.

However, if you’re looking to maximize conversions in the short-term, then multi-armed bandit testing may be your best option.

The table below provides a simple way to compare A/B and bandit tests:

Criterion                                       A/B Testing   Multi-armed bandits
When time is limited                                                   ✔
Statistical significance                             ✔
Multiple variations                                                    ✔
Short-term promotions and campaigns                                    ✔
Low traffic                                                            ✔
Easier/less complex to run                           ✔
Adaptive over time                                                     ✔
Post-experiment analysis for long-term goals         ✔