The Impact of Experimentation on Cumulative Layout Shift (CLS)

We teamed up with our friends at Creative CX to take a look at the impact of experimentation on Core Web Vitals. Read our guest blog from Creative CX’s CTO Nelson Sousa for insights into how CLS can affect your Google ranking, the pros and cons of running experiments server-side and client-side, and the organisational and technical considerations that can improve your site experience through testing, personalisation and experimentation.

What are Core Web Vitals?

Core Web Vitals (CWV) are a set of three primary metrics that affect your Google search ranking. According to StatCounter, Google accounts for around 92% of the global search engine market share. This change has the potential to reshape the way we look at optimising our websites, as more and more competing businesses seek to outdo one another for the top spots in search results.

One notable difference with CWV is that the changes are focused on the user experience. Google wants to ensure that users receive relevant content and are directed to optimised applications. The change aims to minimise items jumping around the screen or moving from their initial position, to let users quickly and successfully interact with an interface, and to ensure that the largest painted element appears on the screen in a reasonable amount of time.


What is CLS?

Let’s imagine the following scenario:

You navigate to a website and click on an element, but it immediately moves away from its position on the page. This is a common frustration: you end up clicking elsewhere on the page, or on a link, which navigates you somewhere else again, forcing you to go back and attempt to click your desired element again.

You have experienced what is known as Cumulative Layout Shift, or CLS for short: a metric used to measure visual stability during the entire lifespan of a webpage. It is measured as a score, and according to Core Web Vitals, webpages should not exceed a CLS score of 0.1.
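If you want to see these shifts as they happen, Chromium-based browsers expose them through the PerformanceObserver API. Below is a minimal sketch that keeps a simple running total of shift values; note that the official CLS metric groups shifts into session windows (handled for you by tools such as Lighthouse or Google's web-vitals library), so treat this as an approximation for debugging:

```js
// Watch layout-shift entries and keep a running total.
// Shifts caused by recent user input do not count towards CLS.
let clsTotal = 0;

const observer = new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries()) {
    if (!entry.hadRecentInput) {
      clsTotal += entry.value;
      console.log('Layout shift:', entry.value.toFixed(4), 'running total:', clsTotal.toFixed(4));
    }
  }
});

// buffered: true also reports shifts that happened before the observer was created
observer.observe({ type: 'layout-shift', buffered: true });
```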

CLS within Experimentation

A large percentage of A/B testing focuses on making changes on the client side (in the browser). This is a common pattern, which normally involves placing an HTML tag in your website so that the browser can make a request to the experimentation tool’s server. Such experimentation tools have become increasingly important as tech teams are no longer the sole entities making changes to a website.

For many, this is a great breakthrough.

It means marketing and other less technical teams can use friendly user interfaces to manipulate websites without the need for a developer. It also frees up time for programmers to concentrate on other, more technical aspects.

One drawback of client-side experimentation is that certain elements can be displayed to the user before the experimentation tool has had a chance to perform its changes. Once the tool finally executes and completes its changes, it may insert new elements in a position where other elements already exist, pushing those elements further down the page. This downward push is an example of CLS in action.

Bear in mind that this only affects experiments above the fold, that is, on elements initially visible on the page without the need to scroll.

So when should you check for CLS and its impact upon the application? The answer is up for debate. Some companies begin to consider it during the design phase, while others during the User Acceptance Testing phase. No matter what your approach is, however, it should always be considered before publishing an experiment live to your customer base.

Common CLS culprits

According to Google’s article on optimising CLS, the most common causes of CLS are:

  • Images without dimensions
  • Ads, embeds, and iframes without dimensions
  • Dynamically injected content (see the sketch after this list)
  • Web Fonts causing FOIT/FOUT
  • Actions waiting for a network response before updating DOM
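Dynamically injected content, for example, can often be tamed by reserving the space before the content arrives. The sketch below assumes a hypothetical #promo-banner-slot element and /api/promo-banner endpoint; the point is simply that the slot's height is fixed before anything is injected into it:

```js
// Reserve the banner's height up front so the later injection
// does not push the content below it further down the page.
const slot = document.querySelector('#promo-banner-slot'); // hypothetical element
if (slot) {
  slot.style.minHeight = '120px'; // match the banner's known rendered height

  fetch('/api/promo-banner') // hypothetical endpoint
    .then((response) => response.text())
    .then((html) => {
      slot.innerHTML = html; // the space is already reserved, so no layout shift
    });
}
```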

Overall CLS Considerations

Team awareness and communication

Each variation change produces its own CLS score. This score should be a primary input into your prioritisation mechanism: it shapes the way you approach an idea and helps determine whether or not a specific experiment will be carried out.

Including analysis from performance testing tools during your ideation and design phases can help you understand how your experiment will affect your CLS score. At Creative CX, we encourage weekly communication with our clients, and discuss CLS impact on a per-experiment basis.

Should we run experiments despite possible CLS impact?

Although in an ideal world you would look to keep the CLS score at 0, this isn’t always possible. Some experiment ideas may go over the threshold, but that doesn’t mean you cannot run the experiment.

If you have data-backed reasons to expect the experiment to generate an uplift in revenue or other metrics, the CLS impact can be ignored for the lifetime of the experiment. Don’t let the CLS score deter you from generating ideas and bringing them to life.

Constant monitoring of your web pages

Even after an experiment is live, it is vital to use performance testing tools and continuously monitor your pages to see if your experiments or changes cause unexpected harmful effects. These tools will help you analyse your CLS impact and other key metrics such as First Contentful Paint and Time to Interactive.

Be aware of everyone’s role and impact

Regarding the impact of experimentation on Core Web Vitals, you should be aware of two main things:

  • What is the impact of your provider?
  • What is the impact of modifications you make through this platform?

Experimentation platforms mainly impact two Web Vitals: Total Blocking Time and Speed Index. The way you use your platform, on the other hand, could potentially impact CLS and LCP (Largest Contentful Paint).

Vendors should do their best to minimise their technical footprint on TBT and Speed Index. There are also best practices you should follow to keep your CLS and LCP values in check; these are your responsibility rather than the vendor’s.

Here, we’ll cover both aspects:

Be aware of what’s downloaded when adding a tag to your site (TBT and Speed Index)

When you insert any snippet from an experimentation vendor onto your pages, you are basically making a network request to download a JavaScript file that will then execute a set of modifications on your page. This file is, by its nature, a moving piece: its size evolves based on your usage, that is, the number and nature of your experiments.

The bigger the file, the more impact it can have on loading time, so it’s important to always keep an eye on it, especially as more stakeholders in your company embrace experimentation and want to run tests.

To limit the impact of experimenting on metrics such as Total Blocking Time and Speed Index, you should download strictly the minimum to run your experiment. Providers like AB Tasty make this possible using a modular approach.

Dynamic Imports

Using dynamic imports, the user only downloads what is necessary. For instance, if a user is visiting the website from a desktop, the file won’t include modules required for tests that only affect mobile. If you have a campaign that targets only logged-in users, modifications won’t be included in the JavaScript file downloaded by anonymous users.
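The snippet below is a simplified sketch of this idea rather than AB Tasty's actual loader: it checks a couple of targeting conditions first and only then downloads the relevant campaign modules with dynamic import(). The module paths, the cookie check and the applyChanges() convention are illustrative assumptions:

```js
// Only fetch the campaign code that applies to the current visitor.
const isMobile = window.matchMedia('(max-width: 768px)').matches;
const isLoggedIn = document.cookie.includes('session_id='); // hypothetical session cookie

async function loadCampaigns() {
  const imports = [];

  // Device targeting: desktop visitors never download mobile-only campaigns.
  imports.push(
    isMobile
      ? import('./campaigns/mobile-banner.js')
      : import('./campaigns/desktop-hero.js')
  );

  // Audience targeting: anonymous visitors skip logged-in-only campaigns.
  if (isLoggedIn) {
    imports.push(import('./campaigns/loyalty-offer.js'));
  }

  const modules = await Promise.all(imports);
  modules.forEach((mod) => mod.applyChanges()); // assumes each module exports applyChanges()
}

loadCampaigns();
```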

Every import also uses a caching policy based on its purpose. For instance, consent management or analytics modules can be cached for a long time, while campaign modules (the ones that hold your modifications) have a much shorter lifespan, because you want the updates you make to be reflected as soon as possible. Some modules, such as the analytics modules used for tracking purposes, can also be loaded asynchronously, so they have no impact on rendering performance.

To make it easy to monitor the impact on performance, AB Tasty also includes a tool named “Performance Center”. The benefit of this is that you get a real-time preview of your file size. It also provides ongoing recommendations based on your account and campaign setup:

  • to stop campaigns that have been running for too long and that add unnecessary weight to the file,
  • to update features on running campaigns that have benefited from performance improvements since their introduction (e.g. widgets).

How are you loading your experimentation tool?

A common way to load an A/B testing platform is by inserting a script tag directly into your codebase, usually in the head tag of the HTML. This would normally require the help of a developer; therefore, some teams choose the route of using a tag manager as it is accessible by non-technical staff members.

This, however, is against best practice. Tag managers cannot guarantee when a specific tag will fire. Considering the tool will be making changes to your website, it is ideal for it to execute as soon as possible.

Normally, the tag is placed as high up in the head of the HTML as possible: right after any meta tags (as these provide metadata for the entire document), and before external libraries that deal with asynchronous tasks (e.g. tracking vendors such as ad networks). Even though some vendors provide asynchronous snippets that don’t block rendering, it’s better to load the tag synchronously to avoid flickering issues, also called FOOC (Flash of Original Content).

Best Practice for flickering issues

Other best practices to solve this flickering issue include:

  • Make sure your solution uses vanilla JavaScript to render modifications. Some solutions still rely on the jQuery library for DOM manipulation, adding one additional network request. If you are already using jQuery on your site, make sure that your provider relies on your version rather than downloading a second version.
  • Optimize your code. For a solution to modify an element on your site, it must first select it. You could simplify this targeting process by adding unique ids or classes to the element. This avoids unnecessary processing to spot the right DOM element to update. For instance, rather than having to resolve “body > header > div > ul > li:first-child > a > span”, a quicker way would be to just resolve “span.first-link-in-header” (see the sketch after this list).
  • Optimize the code auto-generated by your provider. When playing around with WYSIWYG editors, you may add several unnecessary JavaScript instructions. Quickly analyse the generated code and optimize it by rearranging it or removing needless parts.
  • Rely as much as possible on stylesheets. Adding a stylesheet class to apply a specific treatment is generally faster than adding the same treatment using a set of JavaScript instructions.
  • Ensure that your solution provides a cache mechanism for the script and relies on as many points of presence (CDN) as possible, so the script can be loaded as quickly as possible, wherever your user is located.
  • Be aware of how you insert the script from your vendor. As performance optimization gets more advanced, it’s easy to misuse concepts such as async or defer if you don’t fully understand them and their consequences.
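To illustrate the selector and stylesheet points above, here is a small sketch; the class names are only examples:

```js
// Resolving a long structural path forces the browser to walk the DOM
// and breaks as soon as the markup changes:
const slow = document.querySelector('body > header > div > ul > li:first-child > a > span');

// A dedicated class (or id) keeps the lookup simple and robust:
const target = document.querySelector('span.first-link-in-header');

// Prefer toggling a class defined in your stylesheet over a series of
// inline style mutations, so the treatment is applied in one go:
target.classList.add('variation-b-highlight'); // .variation-b-highlight lives in your CSS
```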

Be wary of imported fonts

Unless you are using a Web Safe font, which many businesses can’t due to their branding, the browser needs to fetch a copy of the font so that it can be applied to the text on the website. This new font may be larger or smaller than the original font, causing a reflow of the elements. Using the CSS font-display property, alongside preloading your primary webfonts, can increase the chance of a font being ready for the first paint, and helps specify how a font is displayed, potentially eliminating a layout shift.
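In practice this usually means a font-display rule on your @font-face declaration plus a preload hint for the font file. If you prefer to manage it from JavaScript, the CSS Font Loading API offers an equivalent; the font name and URL below are placeholders, and the display descriptor is not honoured by every browser:

```js
// Load the primary webfont early and register it once ready, so text stays
// visible with the fallback font and swaps in with minimal reflow.
const brandFont = new FontFace('BrandFont', 'url(/fonts/brand-font.woff2)', {
  display: 'swap', // descriptor support varies across browsers
});

brandFont.load().then((loadedFont) => {
  document.fonts.add(loadedFont);
  document.documentElement.classList.add('brand-font-loaded'); // optional styling hook
});
```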

Think carefully about the variation changes

When adding new HTML to the page, consider whether you can replace an existing element with an element of similar size, thus minimising the layout shifts. Likewise, if you are inserting a brand-new element, do preliminary testing to ensure that the shift does not exceed the CLS threshold.

Technical CLS considerations

Always use size attributes for the width and height of your images, videos and other embedded items, such as advertisements and iframes. We suggest using the CSS aspect-ratio property for images specifically. Unlike older responsive practices, it lets the browser determine the size of the image before the image is downloaded. The most common aspect ratios today are 4:3 and 16:9. In other words, for every 4 units across, the content is 3 units deep, and for every 16 units across, it is 9 units deep, respectively.

Knowing one dimension makes it possible to calculate the other. If you have a 4:3 element with a width of 1000px, its height would be 750px. This calculation is made as follows:

height = 1000 x (3 / 4)

When rendering elements to the browser, the initial layout often determines the width of an HTML element. With the aspect ratio provided, the corresponding height can be calculated and reserved. Handy tools such as Calculate Aspect Ratio can do the math for you.
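As a rough sketch, the same reservation can be done in JavaScript for an image whose ratio you already know; the selector, fallback width and 4:3 ratio here are just examples:

```js
// Reserve an image's height from a known aspect ratio before it downloads.
function heightFromRatio(width, ratioW, ratioH) {
  return Math.round(width * (ratioH / ratioW));
}

const img = document.querySelector('img.hero'); // hypothetical selector
if (img) {
  const width = img.clientWidth || 1000;     // fall back to the design width
  img.width = width;
  img.height = heightFromRatio(width, 4, 3); // 1000 -> 750

  // Modern browsers can also do this declaratively via CSS:
  img.style.aspectRatio = '4 / 3';
}
```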

Use CSS transform property

The CSS transform property does not trigger any geometry changes or painting, which allows you to change an element’s size without triggering any layout shifts. Animations and transitions, when done correctly with the user’s experience in mind, are a great way to guide the user from one state to another.
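For instance, scaling an element with transform leaves the surrounding layout untouched, whereas changing its width or height forces a reflow; the selector below is just an example:

```js
// Emphasise an element without shifting its neighbours: transform changes
// are composited and do not affect the layout of surrounding elements.
const badge = document.querySelector('.promo-badge'); // hypothetical selector
if (badge) {
  badge.style.transition = 'transform 200ms ease-out';
  badge.style.transform = 'scale(1.1)';

  // By contrast, the line below would change the element's geometry,
  // reflow the page and potentially shift content around it:
  // badge.style.width = '120%';
}
```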

Move experiment to the server-side

Experimenting with many changes at once is considered against best practice, as the weight of the tags used can affect the speed of the site. It may be worth moving these changes to the server side, so that they are brought in upon initial page load. We have seen a shift in many sectors where security is paramount, such as banking, towards experimenting server-side to avoid the use of tags altogether. This way, once a testing tool completes the changes, layout shift is minimised.

Working hand in hand with developers is the key to running server-side tests such as this. It requires good synchronisation between all stakeholders, from marketing to product to engineering teams. Some level of experience is necessary, and moving to server-side experiments just for the sake of performance must be properly evaluated.

Server-side testing shouldn’t be confused with server-side Tag Management. Some websites that implement a client-side experimentation platform through a tag manager (which is a bad idea, as described previously) may be under the impression that they can move their experimentation tag to the server side as well and get some of the benefits of server-side tag management, namely reducing the number of network requests to 3rd-party vendors. While this is applicable for some tracking vendors (Google Analytics, Facebook Conversions API…), it won’t work with experimentation tags that need to apply updates to DOM elements.

Summary

The above solutions give you an overview of real-life scenarios. Prioritising the work to be done in your tech stack is the key factor in improving the site experience in general. This could include moving requests to the server, using a front-end or server-side library that better meets your needs, or even rethinking your CDN provider and where its points of presence are located versus where most of your users are located.

One way to start is by using a free web tool such as Lighthouse to get reports about your website. This will give you the insight to begin testing elements and features that are directly or indirectly causing low scores.

For example, if you have a large banner image that is the cause of your Largest Contentful Paint appearing long after your page begins loading, you could experiment with different background images and test different designs against one another to understand which one loads the most efficiently. Repeat this process for all the CWV metrics, and if you’re feeling brave, dive into other metrics available in the Lighthouse tools.

While much thought has gone into the exact CWV levels to strive for, missing them does not mean Google will take you out of its search ranking: it will still prioritise overall information and relevant content over page experience. Not all companies will be able to hit these metrics, but they certainly set standards to aim for.

Written by Nelson Sousa, Chief Technology Officer, Creative CX

Nelson is an expert in the field of experimentation and website development with over 15 years’ experience, specialising in UX driven web design, development, and optimisation.


Bayesian vs. Frequentist: How AB Tasty Chose Our Statistical Model

The debate about the best way to interpret test results is becoming increasingly relevant in the world of conversion rate optimization.

The debate over which of the two inferential statistical methods (Bayesian vs. frequentist) is the “best” is fierce. At AB Tasty, we’ve carefully studied both of these approaches and there is only one winner for us.

There are a lot of discussions regarding the optimal statistical method: Bayesian vs. frequentist (Source)

 

But first, let’s dive in and explore the logic behind each method and the main differences and advantages that each one offers.

 

What is hypothesis testing?

The statistical hypothesis testing framework in digital experimentation can be expressed as two opposite hypotheses:

  • H0 states that there is no difference between the treatment and the original, meaning the treatment has no effect on the measured KPI.
  • H1 states that there is a difference between the treatment and the original, meaning that the treatment has an effect on the measured KPI.

 

The goal is to compute indicators that will help you make the decision of whether to keep or discard the treatment (a variation, in the context of AB Tasty) based on the experimental data. We first determine the number of visitors to test, collect the data, and then check whether the variation performed better than the original.

There are two hypotheses in the statistical hypothesis framework (Source)

 

Essentially, there are two approaches to statistical hypothesis testing:

  1. Frequentist approach: Comparing the data to a model.
  2. Bayesian approach: Comparing two models (that are built from data).

 

From the outset, AB Tasty chose the Bayesian approach for conducting our reporting and experimentation efforts.

 

What is the frequentist approach?

In this approach, we will build a model Ma for the original (A) that gives the probability p of seeing some data Da. It is a function of the data:

Ma(Da) = p

Then we can compute a p-value, Pv, from Ma(Db), which is the probability of seeing the data measured on variation B if it had been produced by the original (A).

Intuitively, if Pv is high, this means that the data measured on B could also have been produced by A (supporting hypothesis H0). On the other hand, if Pv is low, this means that there is very little chance that the data measured on B could have been produced by A (supporting hypothesis H1).

A widely used threshold for Pv is 0.05. This is equivalent to considering that, for the variation to have had an effect, there must be less than a 5% chance that the data measured on B could have been produced by A.
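As a concrete sketch (not AB Tasty's implementation), a p-value in this setting is commonly obtained with a two-proportion z-test. The example below reuses the traffic numbers from the Bayesian example later in this article and approximates the normal CDF with the classic Abramowitz-Stegun polynomial for erf:

```js
// Error function approximation (Abramowitz & Stegun 7.1.26, max error ~1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-x * x));
}

function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Two-proportion z-test: could B's data plausibly have been produced by A's process?
function twoProportionZTest(successA, totalA, successB, totalB) {
  const rateA = successA / totalA;
  const rateB = successB / totalB;
  const pooled = (successA + successB) / (totalA + totalB);
  const stdErr = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  const z = (rateB - rateA) / stdErr;
  return { z, pValueOneSided: 1 - normalCdf(z) };
}

console.log(twoProportionZTest(100, 1000, 130, 1000));
// z ≈ 2.10, one-sided p-value ≈ 0.018, under the usual 0.05 threshold
```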

This approach’s main advantage is that you only need to model A. This is interesting because it is the original variation, and the original exists for a longer time than B. So it would make sense to believe you could collect data on A for a long time in order to build an accurate model from this data. Sadly, the KPI we monitor is rarely stationary: Transactions or click rates are highly variable over time, which is why you need to build the model Ma and collect the data on B during the same period to produce a valid comparison. Clearly, this advantage doesn’t apply to a digital experimentation context.

This approach is called frequentist, as it measures how frequently specific data is likely to occur given a known model.

It is important to note that, as we have seen above, this approach does not compare the two processes.

Note: since p-values are not intuitive, they are often converted into a probability-like figure:

p = 1 - Pv

This is then wrongly presented as the probability that H1 is true (meaning a difference between A and B exists). In fact, it is the probability that the data collected on B was not produced by process A.

 

What is the Bayesian approach (used at AB Tasty)?

In this approach, we will build two models, Ma and Mb (one for each variation), and compare them. These models, which are built from experimental data, produce random samples corresponding to each process, A and B. We use these models to produce samples of possible rates and compute the difference between these rates in order to estimate the distribution of the difference between the two processes.

Contrary to the first approach, this one does compare two models. It is referred to as the Bayesian approach or method.

Now, we need to build a model for A and B.

Clicks can be represented as binomial distributions, whose parameters are the number of tries and a success rate. In the digital experimentation field, the number of tries is the number of visitors and the success rate is the click or transaction rate. In this case, it is important to note that the rates we are dealing with are only estimates on a limited number of visitors. To model this limited accuracy, we use beta distributions (which are the conjugate prior of binomial distributions).

These distributions model the likelihood of a success rate measured on a limited number of trials.

Let’s take an example:

  • 1,000 visitors on A with 100 successes
  • 1,000 visitors on B with 130 successes

 

We build the model Ma = beta(1+success_a, 1+failures_a) where success_a = 100 and failures_a = visitors_a - success_a = 900.

You may have noticed a +1 for success and failure parameters. This comes from what is called a “prior” in Bayesian analysis. A prior is something you know before the experiment; for example, something derived from another (previous) experiment. In digital experimentation, however, it is well documented that click rates are not stationary and may change depending on the time of the day or the season. As a consequence, this is not something we can use in practice; and the corresponding prior setting, +1, is simply a flat (or non-informative) prior, as you have no previous usable experiment data to draw from.

For the three following graphs, the horizontal axis is the click rate while the vertical axis is the likelihood of that rate knowing that we had an experiment with 100 successes in 1,000 trials.

(Source: AB Tasty)

 

What this shows is that a 10% rate is the most likely, 5% or 15% are very unlikely, and 11% is about half as likely as 10%.

The model Mb is built the same way with data from experiment B:

Mb = beta(1+130, 1+870)

 

(Source: AB Tasty)

 

For B, the most likely rate is 13%, and the width of the curve is close to that of the previous one.

Then we compare A and B rate distributions.

Blue is for A and orange is for B (Source: AB Tasty)

 

We see an overlapping area, around a 12% conversion rate, where both models have roughly the same likelihood. To estimate the overlapping region, we need to sample from both models to compare them.

We draw samples from distribution A and B:

  • s_a[i] is the i-th sample from A
  • s_b[i] is the i-th sample from B

 

Then we apply a comparison function to these samples:

  • the relative gain: g[i] = 100 * (s_b[i] - s_a[i]) / s_a[i] for all i.

 

It is the difference between the possible rates for A and B, relative to A (multiplied by 100 for readability in %).

We can now analyze the samples g[i] with a histogram:

The horizontal axis is the relative gain, and the vertical axis is the likelihood of this gain (Source: AB Tasty)

 

We see that the most likely value for the gain is around 30%.

The yellow line shows where the gain is 0, meaning no difference between A and B. Samples below this line correspond to cases where A > B; samples on the other side are cases where A < B.

We then define the gain probability as:

GP = (number of samples > 0) / total number of samples

 

With 1,000,000 (10^6) samples for g, we have 982,296 samples that are >0, making B>A ~98% probable.

We call this the “chances to win” or the “gain probability” (the probability that you will win something).

The gain probability is shown here (see the red rectangle) in the report:

(Source: AB Tasty)

 

Using the same sampling method, we can compute classic analysis metrics like the mean, the median, percentiles, etc.

Looking back at the previous chart, the vertical red lines indicate where most of the blue area is, intuitively which gain values are the most likely.

We have chosen to expose a best- and worst-case scenario with a 95% confidence interval. It excludes 2.5% of extreme best and worst cases, leaving out a total of 5% of what we consider rare events. This interval is delimited by the red lines on the graph. We consider that the real gain (as if we had an infinite number of visitors to measure it) lies somewhere in this interval 95% of the time.

In our example, this interval is [1.80%; 29.79%; 66.15%], meaning that it is quite unlikely that the real gain is below 1.8 %, and it is also quite unlikely that the gain is more than 66.15%. And there is an equal chance that the real rate is above or under the median, 29.79%.
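The whole pipeline described above fits in a short Monte Carlo simulation. The sketch below is not AB Tasty's production code: it samples each posterior with a hand-rolled beta sampler (two gamma draws using the Marsaglia-Tsang method) and then derives the gain probability and the 95% interval from the same 100/1000 vs. 130/1000 example:

```js
// Standard normal via the Box-Muller transform.
function randomNormal() {
  let u = 0;
  while (u === 0) u = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * Math.random());
}

// Gamma(shape, 1) sampler, Marsaglia & Tsang (2000); valid for shape >= 1,
// which always holds here (shapes are 101, 901, 131, 871).
function randomGamma(shape) {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x, v;
    do {
      x = randomNormal();
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * x ** 4) return d * v;
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v;
  }
}

// Beta(a, b) as a ratio of two gamma draws.
function randomBeta(a, b) {
  const x = randomGamma(a);
  return x / (x + randomGamma(b));
}

const N = 1_000_000;
const gains = new Float64Array(N);
let wins = 0;

for (let i = 0; i < N; i++) {
  const rateA = randomBeta(1 + 100, 1 + 900); // Ma: 100 successes, 900 failures
  const rateB = randomBeta(1 + 130, 1 + 870); // Mb: 130 successes, 870 failures
  gains[i] = (100 * (rateB - rateA)) / rateA; // relative gain in %
  if (gains[i] > 0) wins++;
}

gains.sort(); // TypedArray sort is numeric
const percentile = (p) => gains[Math.floor(p * (N - 1))];

console.log('Gain probability:', wins / N); // ~0.98
console.log('95% interval:', [percentile(0.025), percentile(0.5), percentile(0.975)]);
// roughly [1.8, 29.8, 66.2] (in %)
```

Up to Monte Carlo noise, this reproduces both the ~98% chances to win and the [1.80%; 29.79%; 66.15%] interval quoted above.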

The confidence interval is shown here (in the red rectangle) in the report (on another experiment):

(Source: AB Tasty)

 

What are “priors” for the Bayesian approach?

Bayesian frameworks use the term “prior” to refer to the information you have before the experiment. For instance, a common piece of knowledge tells us that e-commerce transaction rates are mostly under 10%.

It would have been very interesting to incorporate this, but these assumptions are hard to make in practice due to the seasonality of data having a huge impact on click rates. In fact, it is the main reason why we do data collection on A and B at the same time. Most of the time, we already have data from A before the experiment, but we know that click rates change over time, so we need to collect click rates at the same time on all variations for a valid comparison.

It follows that we have to use a flat prior, meaning that the only thing we know before the experiment is that rates are in [0%, 100%], and that we have no idea what the gain might be. This is the same assumption as the frequentist approach, even if it is not formulated.

 

Challenges in statistics testing

As with any testing approach, the goal is to eliminate errors. There are two types of errors that you should avoid:

  • False positive (FP): When you pick a winning variation that is not actually the best-performing variation.
  • False negative (FN): When you miss a winner. Either you declare no winner or declare the wrong winner at the end of the experiment.

Performance on both these measures depends on the threshold used (p-value or gain probability), which depends, in turn, on the context of the experiment. It’s up to the user to decide.

Another important parameter is the number of visitors used in the experiment, since this has a strong impact on the false negative errors.

From a business perspective, the false negative is an opportunity missed. Mitigating false negative errors is all about the size of the population allocated to the test: basically, throwing more visitors at the problem.

The main problem then is false positives, which mainly occur in two situations:

  • Very early in the experiment: Before reaching the targeted sample size, when the gain probability goes higher than 95%. Some users can be too impatient and draw conclusions too quickly without enough data, which is exactly when false positives occur.
  • Late in the experiment: When the targeted sample size is reached, but no significant winner is found. Some users believe in their hypothesis too much and want to give it another chance.

 

Both of these problems can be eliminated by strictly respecting the testing protocol: Setting a test period with a sample size calculator and sticking with it.
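For reference, a typical sample size calculator relies on the standard normal-approximation formula for comparing two proportions. The sketch below (95% confidence, 80% power by default) is one common version of that formula, not AB Tasty's exact calculator:

```js
// Visitors needed per variation to detect a given relative uplift
// over a baseline conversion rate.
function sampleSizePerVariation(baselineRate, minRelativeUplift, zAlpha = 1.96, zBeta = 0.84) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + minRelativeUplift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2);
}

// Example: 10% baseline conversion rate, 10% relative uplift (10% -> 11%)
console.log(sampleSizePerVariation(0.10, 0.10)); // 14732 visitors per variation
```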

At AB Tasty, we provide a visual checkmark called “readiness” that tells you whether you respect the protocol (a period that lasts a minimum of 2 weeks and has at least 5,000 visitors). Any decision outside these guidelines should respect the rules outlined in the next section to limit the risk of false positive results.

This screenshot shows how the user is informed as to whether they can take action.

(Source: AB Tasty)

 

Looking at the report during the data collection period (before the “readiness” checkmark appears) should be limited to checking that the collection is correct and to checking for extreme cases that require emergency action, not to making business decisions.

 

When should you finalize your experiment?

Early stopping

“Early stopping” is when a user wants to stop a test before reaching the allocated number of visitors.

A user should wait for the campaign to reach at least 1,000 visitors and only stop if a very big loss is observed.

If a user wants to stop early for a supposed winner, they should wait at least two weeks, and only use full weeks of data. This tactic is reasonable if the business cost of a false positive is acceptable, since the performance of the supposed winner is more likely to be close to the original than to be a loss.

Again, if this risk is acceptable from a business strategy perspective, then this tactic makes sense.

If a user sees a winner (with a high gain probability) at the beginning of a test, they should ensure a margin for the worst-case scenario. A lower bound on the gain that is near or below 0% has the potential to evolve and end up below or far below zero by the end of a test, undermining the perceived high gain probability at its beginning. Avoiding stopping early with a low left confidence bound will help rule out false positives at the beginning of a test.

For instance, a situation with a gain probability of 95% and a confidence interval like [-5.16%; 36.48%; 98.02%] is characteristic of early stopping. The gain probability is above the accepted standard, so one might be willing to push 100% of the traffic to the winning variation. However, the worst-case scenario (-5.16%) is relatively far below 0%. This indicates a possible false positive and, at any rate, is a risky bet whose worst-case scenario loses 5% of conversions. It is better to wait until the lower bound of the confidence interval is at least above 0%, and a little margin on top would be even safer.

 

Late stopping

“Late stopping” is when, at the end of a test, without finding a significant winner, a user decides to let the test run longer than planned. Their hypothesis is that the gain is smaller than expected and needs more visitors to reach significance.

When deciding whether to extend the life of a test beyond the planned protocol, one should consider the confidence interval more than the gain probability.

If the user wants to test for longer than planned, we advise only extending very promising tests. This means having a high best-case value (the right bound of the gain confidence interval should be high).

For instance, a gain probability of 99% with a confidence interval of [0.42%; 3.91%] is typical of a test that shouldn’t be extended past its planned duration: a great gain probability, but not a high best-case scenario (only 3.91%).

Consider that with more samples, the confidence interval will shrink. This means that if there is indeed a winner at the end, its best-case scenario will probably be smaller than 3.91%. So is it really worth it? Our advice is to go back to the sample size calculator and see how many visitors will be needed to achieve such accuracy.

Note: These numerical examples come from a simulation of A/A tests, selecting the failed ones.

 

Confidence intervals are the solution

Using the confidence interval instead of only looking at the gain probability will strongly improve decision-making. Even outside of the problem of false positives, it’s important for the business: every variation needs to cover the cost of its implementation in production. One should keep in mind that the original is already there and has no additional cost, so there is always an implicit and practical bias toward the original.

Any optimization strategy should have a minimal threshold on the size of the gain.

Another type of problem, known as the multiple comparison problem, may arise when testing more than two variations. In this case, a Holm-Bonferroni correction is applied.

 

Why AB Tasty chose the Bayesian approach

Wrapping up, which is better: the Bayesian vs. frequentist method?

As we’ve seen in the article, both are perfectly sound statistical methods. AB Tasty chose the Bayesian statistical model for the following reasons:

  • Using a probability index that corresponds better to what the users think, and not a p-value or a disguised one;
  • Providing confidence intervals for more informed business decisions (not all winners are really worth pushing to production). It’s also a means to mitigate false positive errors.

 

At the end of the day, it makes sense that the frequentist method was originally adopted by so many companies when it first came into play. After all, it’s an off-the-shelf solution that’s easy to code and can be easily found in any statistics library (this is a particularly relevant benefit, seeing as how most developers aren’t statisticians).

Nonetheless, even though it was a great resource when it was introduced into the experimentation field, there are better options now — namely, the Bayesian method. It all boils down to what each option offers you: While the frequentist method shows whether there’s a difference between A and B, the Bayesian one actually takes this a step further by calculating what the difference is.

To sum up, when you’re conducting an experiment, you already have the values for A and B. Now, you’re looking to find what you will gain if you change from A to B, something which is best answered by a Bayesian test.