A/B testing & Usain Bolt: different tools, different results?

Recently, a client of ours tried Kissmetrics’ significance calculator to see how their test results compared with those displayed in AB Tasty. To their surprise, the two tools gave completely different results for the exact same data!

Here’s a sample:

Variation | Visitors diverted to variation | Visitors that converted | Conversion rate


Given the above figures, it appears variation B is beating variation A – at first glance. The question is: is variation B actually outperforming A or is it due to chance?

This is precisely what the statistical algorithms behind A/B testing try to determine. How likely is it that the result is due to chance? Or, the other way around, how likely is it that the result reflects reality?

In this test, AB Tasty’s algorithm displays a confidence rate of 70.4%, which our client decided to compare with the figures given by other online tools, namely splittestcalculator.com and getdatadriven.com (the latter being powered by Kissmetrics – a rather reliable source of information!). In the following table, we have added a few more sources as well.

Tool | Confidence rate
AB Tasty | 70.02%
Split Test Calculator | 42.23%
Get Data Driven | 70%
Evan Miller’s calculator | 41%

Why is there a difference?

Well, we are talking about two different calculation methods: the chi-squared method gives 42%, whereas the Bayesian approach gives 70.43%.

Choosing one method over the other is arbitrary, so let’s dig a little deeper (do keep reading, there is no need for a degree in astrophysics!).

To make it simple, there are two things to consider when placing a bet:

  • The probability that there is a difference (A beats B)
  • The gain (A is 20% better than B)

Calculation methods take these parameters into account but give them different weights, which leads to different results. The chi-squared method only takes the probability of a difference into account, whereas the Bayesian method is based on both probability and gain (or loss).

Bottom line: both are correct, although different.
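To see how the same data can yield two such different figures, here is a minimal Python sketch. It is not AB Tasty's or any other tool's actual implementation. The visitor counts are hypothetical (chosen only so that the conversion rates land near 13.98% and 14.24%). The chi-squared figure is assumed to be 1 minus the p-value of a plain Pearson test, and the Bayesian figure is assumed to be the probability that B's rate exceeds A's under uniform priors.

```python
import math
import random


def chi_squared_confidence(visitors_a, conv_a, visitors_b, conv_b):
    """1 - p-value of Pearson's chi-squared test on the 2x2 table
    (1 degree of freedom, no continuity correction)."""
    observed = [
        [conv_a, visitors_a - conv_a],
        [conv_b, visitors_b - conv_b],
    ]
    total = visitors_a + visitors_b
    row_totals = [visitors_a, visitors_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return 1 - p_value


def bayesian_prob_b_beats_a(visitors_a, conv_a, visitors_b, conv_b,
                            draws=50_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A), assuming independent
    Beta(1, 1) priors on both conversion rates."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + visitors_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + visitors_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical counts, NOT the client's real data.
data = (10_000, 1_398, 10_000, 1_424)
print(f"chi-squared confidence: {chi_squared_confidence(*data):.1%}")
print(f"Bayesian P(B beats A):  {bayesian_prob_b_beats_a(*data):.1%}")
```

With these made-up counts, the first figure comes out at roughly 40% and the second at roughly 70%, the same kind of gap as in the table above.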

How to make informed decisions?

“In most cases, focusing on a single source of information to qualify the difference between two variations leads to making blind decisions,” says AB Tasty’s Chief Data Scientist Hubert Wassner. “It is like saying Usain Bolt wins the race.”

Usain Bolt wins the competition
Usain Bolt smashed it

The first image shows Bolt taking a prize for having won the race; the second shows the margin by which he won, that is, the size of the difference. Given how far ahead he is when crossing the finish line, Bolt is very likely to win the next race as well.

The same goes for A/B testing: Bayesian statistics offer an estimate of the potential gain (or loss), whereas chi-squared statistics provide only a confidence rate.

The AB Tasty reporting gives upper and lower limits around the gain

Conversion rates (here 13.98% and 14.24%) and the gain (1.89%), as displayed in most testing tools, give the impression that they are related to the reliability rate. In fact, they only indicate the empirical conversion rates observed so far. The “real” conversion rates remain unknown.

The most valuable information here is the pair of limits around the gain (-4.8% and +8.85%). They should be read as follows: with a 95% confidence rate, the real value of the gain is between -4.8% and +8.85%. The higher the lower limit, the safer the decision.
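As an illustration of how limits like these can be obtained, here is a minimal Python sketch of a 95% interval around the relative gain. It is an assumption, not AB Tasty's documented method: it places uniform Beta(1, 1) priors on both conversion rates, samples from the resulting posteriors, and reads the limits off the sampled gains. The visitor counts are hypothetical, picked only to give conversion rates near 13.98% and 14.24%.

```python
import random


def gain_interval(visitors_a, conv_a, visitors_b, conv_b,
                  level=0.95, draws=50_000, seed=0):
    """Lower and upper limits of the relative gain of B over A,
    estimated from posterior samples (Beta(1, 1) priors assumed)."""
    rng = random.Random(seed)
    gains = []
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + visitors_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + visitors_b - conv_b)
        gains.append((rate_b - rate_a) / rate_a)
    gains.sort()

    # Cut off (1 - level) / 2 of the samples on each side.
    low_index = int(draws * (1 - level) / 2)
    high_index = int(draws * (1 + level) / 2) - 1
    return gains[low_index], gains[high_index]


# Hypothetical counts, NOT the client's real data.
low, high = gain_interval(10_000, 1_398, 10_000, 1_424)
print(f"95% interval for the gain: {low:+.1%} to {high:+.1%}")
```

With these made-up counts, the interval runs from roughly -5% to roughly +9%: it straddles zero, so the data cannot yet rule out that B is actually worse than A.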

Bottom line: the confidence rate only offers an indication of when it’s time to make a decision (there is a difference between the two and it’s not due to chance) and the limits indicate what decision should be made. You need a combination of both to have the best predictions of your test results and to spot variations worthy of Usain Bolt.
