Recently, a client of ours tried Kissmetrics’ significance calculator to see how their A/B test results compared to those displayed in AB Tasty’s. What a surprise when they found two completely different results for the exact same data!

Here’s a sample:

Variation | Visitors diverted to variation | Visitors that converted | Conversion rate
A         | 10,299                         | 1,439                   | 13.97%
B         | 10,505                         | 1,495                   | 14.23%


Given the above figures, variation B appears to be beating variation A at first glance. The question is: is variation B actually outperforming A, or is the difference due to chance?

This is precisely what the statistical algorithms behind A/B testing tools try to determine: what is the likelihood that the result is due to chance? Or, put the other way around, what is the likelihood that the result reflects reality?

In this test, AB Tasty’s algorithm displays a confidence rate of 70.4%, which our client decided to compare to those of other online tools, namely splittestcalculator.com and getdatadriven.com (the latter being powered by Kissmetrics – a rather reliable source of information!). In the following table, we brought a few more sources into the comparison as well.

Tool                     | Confidence rate
AB Tasty                 | 70.02%
Split Test Calculator    | 42.23%
Get Data Driven          | 70%
Hubspot                  | 70.43%
Evan Miller’s calculator | 41%

Why is there a difference?

Well, we are talking about two different calculation methods: the chi-squared method gives around 42%, whereas the Bayesian approach gives 70.43%.

Choosing one method over the other is arbitrary, so let’s dig a little deeper (do keep reading, there is no need for a degree in astrophysics!).

To make it simple, there are two things to consider when placing a bet:

  • The probability that there is a difference (A beats B)
  • The gain (A is 20% better than B)

Calculation methods take these parameters into account but give them different weights, which leads to different results. The chi-squared method only takes the probability of a difference into account, whereas the Bayesian method is based on both probability and gain (or loss).
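As an illustration, here is a minimal Python sketch (assuming numpy and scipy, a uniform Beta(1, 1) prior for the Bayesian side, and no Yates correction, so that it lines up with the classic calculators) that reproduces both kinds of figures from the sample data. The actual tools may differ in their priors and corrections:

```python
import numpy as np
from scipy import stats

# Sample data from the table above
visitors_a, conversions_a = 10_299, 1_439
visitors_b, conversions_b = 10_505, 1_495

# --- Chi-squared view: is there a difference at all? ---
table = np.array([
    [conversions_a, visitors_a - conversions_a],
    [conversions_b, visitors_b - conversions_b],
])
# correction=False (no Yates correction) matches the classic calculators
_, p_value, _, _ = stats.chi2_contingency(table, correction=False)
print(f"chi-squared confidence: {1 - p_value:.1%}")   # ~41%

# --- Bayesian view: how likely is B to really beat A? ---
# With a uniform Beta(1, 1) prior, the posterior for each conversion
# rate is Beta(conversions + 1, non-conversions + 1). Monte Carlo it.
rng = np.random.default_rng(42)
n = 200_000
post_a = rng.beta(conversions_a + 1, visitors_a - conversions_a + 1, n)
post_b = rng.beta(conversions_b + 1, visitors_b - conversions_b + 1, n)
print(f"P(B beats A): {(post_b > post_a).mean():.1%}")  # ~70%
```

The two numbers should land close to the 41–42% and 70% figures quoted above: same data, two different questions.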

Bottom line: both are correct, although different.

How can you make informed decisions?

“In most cases, focusing on a single source of information to qualify the difference between two variations leads to making blind decisions,” says AB Tasty’s Chief Data Scientist Hubert Wassner. “It is like saying Usain Bolt wins the race.”

[Image: Usain Bolt wins the competition]
[Image: Usain Bolt smashed it]

The first image shows Bolt taking a prize for having won the race; the second shows the margin by which he won, i.e. the size of the difference. Given how far ahead he is when crossing the finish line, Bolt is very likely to win the next race as well.

The same goes for A/B testing: Bayesian statistics offer an estimation of the potential gain (or loss), whereas chi-squared statistics provide a confidence rate only.

[Image: The AB Tasty reporting gives upper and lower limits around the gain]

Conversion rates (here 13.98% and 14.24%) and the gain (1.89%), as displayed in most A/B testing tools, give the impression of being related to the reliability rate. In fact, they only indicate the empirical conversion rates observed so far; the “real” conversion rates remain unknown.

The most valuable information here is the limits around the gain (-4.8% and +8.85%). They should be read as follows: with a 95% confidence rate, the real value of the gain lies between -4.8% and +8.85%. The higher the lower limit, the safer the decision.
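Such limits fall out of the same Beta-posterior sampling as in the earlier sketch. Reusing `post_a` and `post_b` from that snippet, this is one way to compute them; it illustrates the idea, not AB Tasty’s exact algorithm, and the bounds will be close to (not identical to) the reported -4.8% and +8.85%:

```python
# 95% credible interval on the relative gain of B over A,
# reusing the posterior samples post_a / post_b from the sketch above
gain = (post_b - post_a) / post_a
low, high = np.percentile(gain, [2.5, 97.5])
print(f"With 95% confidence, the real gain is between {low:+.2%} and {high:+.2%}")
```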

Bottom line: the confidence rate only offers an indication of when it’s time to make a decision (there is a difference between the two and it’s not due to chance) and the limits indicate what decision should be made. You need a combination of both to have the best predictions of your test results and to spot variations worthy of Usain Bolt.
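Put together, that combination amounts to a simple decision rule, sketched below with a hypothetical helper (`worth_deploying` and its thresholds are ours for illustration, not part of any tool):

```python
def worth_deploying(confidence: float, gain_lower: float,
                    min_confidence: float = 0.95) -> bool:
    """Hypothetical decision rule: act only when the test is conclusive
    (high enough confidence) AND the worst plausible outcome (the lower
    limit of the gain) is still a gain."""
    return confidence >= min_confidence and gain_lower > 0.0

# With the sample data above: the confidence is only ~70% and the lower
# limit is negative (-4.8%), so the sensible call is to keep testing.
print(worth_deploying(confidence=0.70, gain_lower=-0.048))  # False
```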