CRO Metrics: Navigating Pitfalls and Counterintuitive KPIs

Metrics play an essential role in measuring performance and influencing decision-making.

However, relying on certain metrics alone can lead you to misguided conclusions and poor strategic choices. Potentially misleading metrics are often referred to as “pitfall metrics” in the world of Conversion Rate Optimization.

Pitfall metrics are data indicators that, analyzed in isolation, give you a distorted or incomplete view of your performance. If you're not careful about how you evaluate them, they can even cause your performance to regress.

Metrics are typically split into two categories:

  • Session metrics: Any metrics that are measured on a session instead of a visitor basis
  • Count metrics: Metrics that count events (for instance number of pages viewed)

Some metrics fall into both categories. Needless to say, that is the worst case, for a few main reasons: no well-defined statistical model applies to a metric that mixes both categories, there is no direct or simple link to business objectives, and such metrics may not even be suitable targets for standard optimization.

While metrics are very valuable for business decisions, it's crucial to use them wisely and be mindful of potential pitfalls in your data collection and analysis. In this article, we will explore and explain why some metrics are unwise to use in practice in CRO.

Session-based metrics vs. visitor-based metrics

One problem with session-based metrics is that "power users" (i.e. users who return for multiple sessions during the experiment) will bias the results.

Let’s remember that during experimentation, the traffic split between the variations is a random process.

We typically picture a traffic split as producing random but evenly matched groups. For large groups of users this is roughly true. For a small group, however, it's very unlikely that the split will be even in terms of visitor behaviors, intentions and types.

Let's say that you have 12 power users who need to be randomly divided between two variations, and that these power users generate 10x more sessions than the average user. It's quite likely that you will end up with a 4/8 split, a 2/10 split, or some other uneven split; a perfectly even split is the exception rather than the rule (the short simulation after the list below illustrates this). You will then end up in one of two very likely situations:

  • Situation 1: A handful of power users may make you believe you have a winning variation (when none actually exists)
  • Situation 2: A genuinely winning variation is masked because it received too few of these power users
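
To make the randomness concrete, here is a minimal simulation sketch using the hypothetical numbers above (12 power users, a fair 50/50 assignment). It only illustrates how often the split comes out uneven:

```python
import numpy as np

rng = np.random.default_rng(42)
n_power_users = 12
n_simulations = 100_000

# Each power user is independently assigned to variation A (0) or B (1).
assignments = rng.integers(0, 2, size=(n_simulations, n_power_users))
users_in_b = assignments.sum(axis=1)

print(f"P(exact 6/6 split):       {np.mean(users_in_b == 6):.2f}")               # about 0.23
print(f"P(split of 4/8 or worse): {np.mean(np.abs(users_in_b - 6) >= 2):.2f}")   # about 0.39
```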

Another problem with session-based metrics is that a session-based approach blurs the meaning of important metrics like transaction rates. The recurring problem here is that not all visitors display the same type of behavior. If average buyers need 3 sessions to make a purchase while some need 10, this is a difference in user behavior and does not have anything to do with your variation. If your slow buyers are not evenly split between the variations, then you will see a discrepancy in the transaction rate that doesn’t actually exist.

Moreover, the metric itself loses part of its intuitive meaning. If your real conversion rate per unique visitor is around 3%, counting by session instead of by unique visitor will likely show you a conversion rate of only about 1%.

This is not only disappointing but very confusing.
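
Here is a toy sketch of that denominator effect, with purely hypothetical numbers; the size of the distortion depends entirely on how sessions are distributed across visitors:

```python
# A toy example (hypothetical numbers): the same purchases counted per
# visitor vs. per session.
visitors = 1000
buyers = 30                      # real conversion rate per visitor: 3%
sessions_per_visitor = 3         # visitors average 3 sessions each

per_visitor_rate = buyers / visitors                           # 3.0%
per_session_rate = buyers / (visitors * sessions_per_visitor)  # 1.0%
print(f"per visitor: {per_visitor_rate:.1%}  per session: {per_session_rate:.1%}")

# If a variation merely changes how many sessions people make (e.g. buyers
# purchase in 1 session instead of 3 and stop returning), the per-session
# rate moves even though purchases and revenue are exactly the same.
sessions_after = visitors * sessions_per_visitor - buyers * 2
print(f"per session, with fewer buyer sessions: {buyers / sessions_after:.2%}")
```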

Imagine a variation urging visitors to buy sooner by using "stress marketing" techniques. Let's say this leads to a one-session purchase instead of three sessions. You will see a huge gain (3x) on the conversion per session. BUT this "gain" is not an actual gain: the number of conversions, and therefore the revenue earned, is unchanged. It's also worth keeping in mind that visitors pressured into such a quick purchase may not feel happy or comfortable with it and may not return.

It's best practice to avoid session-based metrics unless you have no other choice, as they can be very misleading.

Understanding count metrics

We will come back to our comparison of these two types of metrics. But for now, let's get on the same page about "count metrics." To understand why count metrics are harder to assess, you need more context on how measurement accuracy is modeled and where the measure comes from.

To model the accuracy of a rate measurement, we use the beta distribution. In the graph below, we see the measurement of two conversion rates – one blue and one orange. The X-axis is the rate and the Y-axis is the likelihood. When trying to measure the probability that the two rates are different, we implicitly look at the part of the two curves that overlaps.

In this case, the two curves have very little overlap. Therefore, the probability that these two rates are actually different is quite high.

The narrower and more compact the distributions are, the easier it is to see that they're different.
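
As a sketch of this idea (hypothetical counts, and a uniform Beta(1, 1) prior as a simplifying assumption, not any particular vendor's implementation), we can model each rate with a beta distribution and estimate the probability that one is really higher than the other:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, vis_a, conv_b, vis_b, n=200_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    a = stats.beta(1 + conv_a, 1 + vis_a - conv_a).rvs(n, random_state=rng)
    b = stats.beta(1 + conv_b, 1 + vis_b - conv_b).rvs(n, random_state=rng)
    return np.mean(b > a)

# The same 5% vs 6% rates measured on more and more visitors: the beta
# curves get narrower, their overlap shrinks, and the probability rises.
for visitors in (500, 2_000, 10_000):
    p = prob_b_beats_a(int(0.05 * visitors), visitors, int(0.06 * visitors), visitors)
    print(f"{visitors:>6} visitors per variation: P(B > A) ~ {p:.2f}")
```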


The fundamental difference between conversion and count distributions

Conversion metrics are bounded in [0; 1] as a rate, or [0%; 100%] as a percentage. For count metrics, however, the range is open on the right: counts live in [0; +infinity).

The following figure shows a gamma distribution (in orange) that may be used with this kind of data, along with a beta distribution (in blue).

These two distributions are based on the same data: 10 visitors and 5 successes. Counting unique conversions, this is a 0.5 success rate (or 50%). In the context of multiple conversions, it's a process with an average of 0.5 conversions per visitor.

Notice that the orange curve (for the count metric) is non-zero above x = 1; this clearly shows that it allows for more than one conversion per visitor.

We will see that comparisons of this kind of metric depend on whether we treat it as a count or as a rate. There are two options:

  • Either we consider that the process is a conversion process, using a beta distribution (in blue), which is naturally bounded in [0; 1].
  • Or we consider that the process is a count process, using a gamma distribution (in orange), which is not bounded on the right side.

On the graph, we see an inherent property of count data distributions: they are asymmetric. The right tail goes to 0 more slowly than the left part, which makes the distribution naturally more spread out than the beta distribution.

Since both curves are probability distributions, the area under each curve must be 1.

As you can see, the beta distribution (in blue) has a higher peak than the gamma distribution (in orange), which confirms that the gamma distribution is more spread out. This is a hint that count metrics are harder to estimate accurately than conversion metrics, and it is why we need more visitors to assess a difference when using count metrics than when using conversion metrics.
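
Here is a minimal sketch of this comparison on the same data (10 visitors, 5 conversions). The priors are assumptions made here for illustration (Beta(1, 1) for the conversion model and a vague gamma prior for the Poisson count model), since the article's figure does not specify which it uses:

```python
from scipy import stats

visitors, conversions = 10, 5

# Conversion model: the rate is bounded in [0; 1].
beta_post = stats.beta(1 + conversions, 1 + visitors - conversions)

# Count model: a Poisson rate, bounded only on the left (0 to +infinity).
gamma_post = stats.gamma(a=0.01 + conversions, scale=1 / (0.01 + visitors))

print(f"beta posterior:  mean={beta_post.mean():.2f}  sd={beta_post.std():.2f}")
print(f"gamma posterior: mean={gamma_post.mean():.2f}  sd={gamma_post.std():.2f}")
# The gamma posterior has the same mean but a larger standard deviation and a
# right-skewed shape, which is why count metrics need more data to separate
# two variations.
```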

To understand this problem, imagine two gamma distribution curves, one for each variation of an experiment, and gradually shift one to the right, creating an increasing difference between the two distributions (see figure below).

Since both curves are right-skewed, the overlap region will occur on at least one of the skewed parts of the distributions.

This means that differences will be harder to assess with count data than with conversion data. This comes from the fact that count data works on an open range, whereas conversion rates work on a closed range.

Do count metrics need more visitors to get accurate results?

It's not just a matter of more visitors – the situation is more complex in the CRO context. Typical statistical tests for count metrics are not suited to CRO in practice.

Most of these tests come from the industrial world. A classic usage of count metrics is counting the number of failures of a machine in a given timeframe. In this context, the risk of failure doesn’t depend on previous events. If a machine already had one failure and has been repaired, the chance of a second failure is considered to be the same.

This assumption does not hold for the number of pages viewed by a visitor. In reality, a visitor who has already seen two pages is more likely to view a third page than a visitor who has only seen one (the latter still has a high probability of "bouncing").

The industrial model does not fit the CRO context, which deals with human behavior and is therefore much more complex.
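
A small simulation sketch (with made-up browsing probabilities) shows how quickly the industrial assumption breaks: once visitors are "engaged", their chance of viewing another page rises, so page-view counts become far more dispersed than a memoryless, Poisson-like model would predict:

```python
import numpy as np

rng = np.random.default_rng(7)
n_visitors = 100_000

def simulate_pages(p_continue_first=0.4, p_continue_later=0.8):
    """Each visitor sees 1 page; the chance of continuing *rises* once they
    are engaged, unlike a memoryless industrial failure process."""
    pages = np.ones(n_visitors, dtype=int)
    active = rng.random(n_visitors) < p_continue_first   # survived the "bounce"
    while active.any():
        pages[active] += 1
        active &= rng.random(n_visitors) < p_continue_later
    return pages

pages = simulate_pages()
# For a Poisson-like process, variance ~ mean. Here the variance is much larger.
print(f"mean pages = {pages.mean():.2f}, variance = {pages.var():.2f}")
```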

Not all conversions have the same value

The next CRO pitfall also comes from directly reusing formulas from the industrial world.

If you run a plant that produces goods with machines, and you test a new kind of machine that produces more goods per day on average, you will conclude that these new machines are a good investment. Because the value of a machine is linear with its average production, each extra product adds the same value to the business.

But this is not the same in CRO.

Imagine this experiment result for a media company:

Variation B yields 1,000 more page views than the original A. Based on that data, you put variation B into production. Now suppose that, behind this number, variation B lost 500 people who each saw 2 pages and gained 20 people who each saw 100 pages. That still makes a net gain of 1,000 page views for variation B.

But what about the value? These 20 people, even if they spend a lot of time on the site, may not be worth as much as 500 people who come back regularly.

In CRO, each extra unit added to a count metric does not carry the same value, so you cannot treat a measured increment as a direct measure of added value.

In applied statistics, one adds an extra layer to the analysis: a utility function that links extra counts to value. This utility function is very specific to the problem and is unknown for most CRO problems. So even if you gain some conversions on a count metric, you cannot be sure of the real value of that gain (if any).
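
To make the idea tangible, here is a sketch of the media example above with a purely hypothetical utility function; the per-page value and loyalty bonus are invented for illustration, not taken from any real data:

```python
def value_of_visitor(page_views, value_per_page=0.01, loyalty_bonus=0.50):
    """Hypothetical utility: each page view is worth a little, and regular
    visitors with modest page counts get a loyalty bonus."""
    bonus = loyalty_bonus if page_views <= 5 else 0.0
    return page_views * value_per_page + bonus

# From the example: variation B loses 500 visitors who saw 2 pages each and
# gains 20 visitors who saw 100 pages each.
lost = 500 * value_of_visitor(2)
gained = 20 * value_of_visitor(100)

print(f"net page views: {20 * 100 - 500 * 2:+d}")   # +1000 page views
print(f"net value:      {gained - lost:+.2f}")      # negative with this utility
```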

Some count metrics are not meant to be optimized

Let's look at some examples where a rising count metric might not be a good thing:

  • Page views: If the number of page views rises, you might think it's a good thing because people are seeing more of your products. But it could just as well mean that people are getting lost and need to browse more pages to find what they need.
  • Items added to cart: The same idea applies to the number of products added to the cart. If you don't check how many products remain in the cart at the checkout stage, you don't know whether the variation helps sell more or simply makes product selection harder.
  • Products purchased: Even the number of products purchased can be misleading if used alone as a business objective in an optimization context. Visitors could be buying two cheaper products instead of one higher-quality (and more expensive) product.

You can’t tell just by looking at these KPIs if your variation or change is good for your business or not. There is more that needs to be considered when looking at these numbers.

How do we use this count data then?

We have seen in this article how counterintuitive session-based optimization is, and, even worse, how misleading count metrics are in CRO.

Unless you have both business and statistics expertise on hand, it's best practice to avoid them, at least as your only KPI.

As a workaround, you can split a count metric into several conversion metrics, using business knowledge to set the thresholds. For instance:

  • Use one conversion metric for counts in the range [1; 5], called "light users."
  • Use another conversion metric for the range [6; 10], called "medium users."
  • Use another one for the range [11; +infinity), called "heavy users."

Splitting up the conversion metrics in this way will give you a clearer signal about where you gain or lose conversions.
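
Here is a minimal sketch of this workaround, using the illustrative thresholds above and made-up page counts:

```python
def bucket_conversions(page_counts):
    """Share of visitors falling in each engagement bucket."""
    n = len(page_counts)
    light = sum(1 <= c <= 5 for c in page_counts) / n     # "light users"
    medium = sum(6 <= c <= 10 for c in page_counts) / n   # "medium users"
    heavy = sum(c >= 11 for c in page_counts) / n         # "heavy users"
    return {"light": light, "medium": medium, "heavy": heavy}

# Hypothetical per-visitor page counts for each variation:
variation_a = [1, 2, 2, 3, 7, 8, 1, 12, 2, 4]
variation_b = [1, 5, 6, 6, 9, 1, 15, 2, 30, 3]

print("A:", bucket_conversions(variation_a))
print("B:", bucket_conversions(variation_b))
# Each bucket rate can now be analyzed as an ordinary conversion rate,
# showing where the variation gains or loses engaged visitors.
```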

Another piece of advice is to use several KPIs to have a broader view.

For instance, although analyzing product views alone is not a good idea, you can check the overall conversion rate and average order value at the same time. If product views and the conversion KPI go up while the average order value stays stable or rises, then you can conclude that your new product page layout is a success.

Counterintuitive Metrics in CRO

Now you see that, except for conversions counted on a unique-visitor basis, nearly all other metrics can be very counterintuitive to use in CRO. Mistakes happen both because the statistics behave differently and because these metrics and their changes can be interpreted in several ways.

It's important to understand that CRO skill is a mix of statistics, business and UX knowledge. Since it's very rare to find all of this in one person, the key is to spread the needed skills across a team with good communication.



How to Deal with Low Traffic in CRO

If your website traffic numbers aren’t as high as you may hope for, that’s no reason to give up on your conversion rate optimization (CRO) goals.

By now you must have noticed that most CRO advice is tailored for high-traffic websites. Luckily, this doesn’t mean you can’t optimize your website even if you have lower traffic.

The truth is, any website can be optimized – you just need to tailor your optimization strategy to suit your unique situation.

In this article, we will cover:

  • Where the 95% significance threshold comes from and how to adapt it to low traffic
  • Whether the statistics are still valid with small numbers
  • Going "upstream" in the funnel
  • Whether the CUPED technique can really help

CRO analogy

In order to make this article easier to understand, let’s start with an analogy. Imagine that instead of measuring two variants and picking a winner, we are measuring the performance of two boxers and placing bets on who will win the next 10 rounds.

So, how will we place our bet on who will win?

Imagine that boxer A and boxer B are both newbies that no one knows. After the first round, you have to make your choice. You will most likely place your bet on the boxer who won the first round. It might be risky if the winning margin is small, but you have nothing else on which to base your decision.

Imagine now that boxer A is known to be a champion, and boxer B is a challenger you don't know. Your knowledge about boxer A is what we would call a prior: information you have beforehand that influences your decision.

Based on the prior, you will be more likely to bet on boxer A as the champion for the next few rounds, even if boxer B wins the first round with a very small margin.

Furthermore, you will only choose boxer B as your predicted champion if they win the first round by a large margin. The stronger your prior, the larger the margin needs to be in order to convince you to change your bet.

Are you following? If so, the following paragraphs will be easy to grasp and you will understand where this “95% threshold” comes from.

Now, let’s move on to tips for optimizing your website with low traffic.

1. Solving the problem: “I never reach the 95% significance”

This is the most common complaint about CRO for websites with lower traffic and for lower traffic pages on bigger websites.

Before we dig into this most common problem, let’s start by answering the question, where does this 95% “golden rule” come from?

The origin of the 95% threshold

Let’s start our explanation with a very simple idea: What if optimization strategies were applied from day one? If two variants with no previous history were created at the same time, there would be no “original” version challenged by a newcomer.

This would force you to choose the best one from the beginning.

In this setting, any small difference in performance could be used for decision-making. After a short test, you would simply choose the variant with the higher performance. It would make no sense to pick the variant with the lower performance, and it would be equally foolish to wait for a 95% threshold to pick a winner.

But in practice, optimization is done well after the launch of a business.

So, in most real-life situations, there is a version A that already exists and a new challenger (version B) that is created.

If the new challenger, version B, comes along and the performance difference between the two variants is not significant, you will have no issues declaring version B “not a winner.”

Statistical tests are symmetric. So if we reverse the roles, swapping A and B in the statistical test will tell you that the original is not significantly better than the challenger. The “inconclusiveness” of the test is symmetric.

So why, at the end of an inconclusive test, do you send 100% of traffic back to the original, implicitly declaring A the winner? Because you have three priors:

  1. Version A was the first choice. This choice was made by the initial creator of the page.
  2. Version A has already been implemented and technically trusted. Version B is typically a mockup.
  3. Version A has a lot of data to prove its value, whereas B is a challenger with limited data that is only collected during the test period.

Points 1 & 2 are the basis of any CRO strategy, so you will need to go beyond these two priors. Point 3 says that version A has more data to back its performance. This is why you trust version A more than version B: version A has data.

Now you understand that this 95% confidence rule is a way of formalizing a strong prior. And this prior mostly comes from historical data.

Therefore, when optimizing a page with low traffic, your decision threshold should be below 95%, because your prior on A is weaker given its lower traffic and shorter history.

Ideally, the threshold would be set according to the volume of traffic that has gone through the original since day one. The problem with this approach is that conversion rates are not stable and change over time. Think of seasonality – the Black Friday rush, vacation days, the Christmas increase in activity, and so on. Because of these seasonal changes, you can't compare performance across different periods.

This is why practitioners only compare data for version A and version B collected over the same period, and set a high threshold (95%) for accepting the challenger as a winner, in order to formalize a strong prior toward version A.

What is the appropriate threshold for low traffic?

It’s hard to suggest an exact number to focus on because it depends on your risk acceptance.

According to the hypothesis testing protocol, you should fix the duration of the data collection period in advance.

This means that the "stop" criterion of a test is not a statistical measure or a particular number being reached; it is the end of the predefined timeframe. Once the period is over, you look at the stats and make an appropriate decision.

AB Tasty, our customer experience optimization and feature management software, uses a Bayesian framework that produces a "chance to win" index. This index allows a direct interpretation, unlike a p-value, whose meaning is far less intuitive.

In other words, the “chances to win index” is the probability for a given variation to be better than the original.

Therefore, a 95% "chance to win" means that there is a 95% probability that the given variation is the winner, assuming we have no prior knowledge of, or particular trust in, the original.

The 95% threshold itself is also a default compromise between the prior you have on the original and a given level of risk acceptance (it could just as well have been a 98% threshold).
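
As a sketch of the general idea behind a "chance to win" index and a Bayesian interval on the gain (not AB Tasty's exact implementation; the counts and the uniform Beta(1, 1) priors below are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical test results:
orig_conv, orig_vis = 90, 3000     # original (A)
var_conv, var_vis = 115, 3000      # variation (B)

post_a = stats.beta(1 + orig_conv, 1 + orig_vis - orig_conv)
post_b = stats.beta(1 + var_conv, 1 + var_vis - var_conv)

a = post_a.rvs(200_000, random_state=rng)
b = post_b.rvs(200_000, random_state=rng)

chance_to_win = np.mean(b > a)                  # P(variation beats original)
gain = (b - a) / a                              # samples of the relative gain
low, high = np.percentile(gain, [2.5, 97.5])    # "worst" and "best" scenarios

print(f"chance to win: {chance_to_win:.1%}")
print(f"relative gain: {np.median(gain):.1%}  [{low:.1%}; {high:.1%}]")
```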

Although it is hard to give an exact number, let’s make a rough scale for your threshold:

  • New A & B variations: If variation A and variation B are both new, the threshold could be as low as 50%. If there is no past data on either variation's performance and you must choose one to implement, even a 51% chance to win is better than 49%.
  • New website, low traffic: If your website is new and has very low traffic, you have very little prior on variation A (the original, in this case). Setting 85% as a threshold is then reasonable: it means that, setting aside the little you know about the original, you have an 85% chance of picking the true winner, a 15% chance of picking a variation that is merely equivalent to the original, and an even smaller chance of picking one that performs worse. Depending on the context, such a bet can make sense.
  • Mature business, low traffic: If your business has a longer history but still low traffic, 90% is a reasonable threshold, because there is still relatively little prior on the original.
  • Mature business, high traffic: Having a lot of prior data on variation A justifies the 95% threshold.

The original 95% threshold is far too high if your business has low traffic because there’s little chance that you will reach it. Consequently, your CRO strategy will have no effect and data-driven decision-making becomes impossible.

By using AB Tasty as your experimentation platform, you get a report that includes the "chance to win" along with other statistical information about your web experiments, including an important indicator: the interval around the estimated gain. These boundaries are also computed in a Bayesian way, which means they can be interpreted as the best-case and worst-case scenarios.

The importance of Bayesian statistics

Now you understand the exact meaning of the well-known 95% "significance" level and can select the appropriate threshold for your particular case.

It's important to remember that this approach only works with Bayesian statistics, since frequentist approaches give statistical indices (such as p-values and confidence intervals) that have a totally different meaning and are not suited to the logic explained here.

2. Are the stats valid with small numbers?

Yes, they are valid, as long as you do not stop the test based on the result.

Remember, the testing protocol says that once you decide on a testing period, the only reason to stop the test is that the timeframe has ended. In this case, the statistical indices ("chance to win" and the confidence interval) are valid and usable.

You may be thinking: “Okay, but then I rarely reach the 95% significance level…”

Remember that the 95% threshold doesn't need to be the magic number for every case. If you have low traffic, chances are your website is also fairly new. Referring back to the previous point, you can use our suggested scale for different scenarios.

If you're dealing with lower traffic as a newer business, you can certainly switch to a lower threshold (like 90%). The threshold stays fairly high because it's normal to trust the original more than a variant, simply because it has been in use for longer.

If you're dealing with two completely new variants, at the end of your testing period it is reasonable to simply pick the variant with the higher conversions (without using a statistical test), since there is no prior knowledge of the performance of A or B.

3. Go “upstream”

Sometimes the traffic problem is not a low-traffic website but a low-traffic page. Typically, pages with lower traffic sit at the end of the funnel.

In this case, a great strategy is to work on optimizing the funnel closer to the user’s point of entry. There may be more to uncover with optimization in the digital customer journey before reaching the bottom of the funnel.

4. Is the CUPED technique real?

What is CUPED?

Controlled-experiment Using Pre-Experiment Data (CUPED) is a newer buzzword in the experimentation world. It's a technique that claims to produce results up to 50% faster. Clearly, this is very appealing to small-traffic websites.

Does CUPED really work that well?

Not exactly, for two reasons: one is organizational and the other is applicability.

The organizational constraint

What’s often forgotten is that CUPED means Controlled experiment Using Pre-Experiment Data.

In practice, the ideal period of “pre-experiment data” is two weeks in order to hope for a 50% time reduction.

So, for a 2-week classic test, CUPED claims that you can end the test in only 1 week.

However, in order to properly see your results, you will need two weeks of pre-experiment data. So in fact, you must have three weeks to implement CUPED in order to have the same accuracy as a classic 2-week test.

Yes, you are reading that correctly: in the end, you will need three weeks to run the experiment.

This means CUPED is only useful if you already have two weeks of traffic data unexposed to any experiment. And even if you can schedule two experiment-free weeks into your planning to collect this data, doing so blocks that traffic for other experiments.

The applicability constraint

In addition to the organizational, two-week time constraint, there are two other prerequisites for CUPED to be effective:

  1. CUPED is only applicable to visitors browsing the site during both the pre-experiment and experiment periods.
  2. These visitors need to have the same behavior regarding the KPI under optimization. Visitors’ data must be correlated between the two periods.

You will see in the following paragraphs that these two constraints make CUPED virtually impossible for e-commerce websites and only applicable to platforms.

Let’s go back to our experiment settings example:

  • Two weeks of pre-experiment data
  • Two weeks of experiment data (that we hope will only last one week as there is a supposed 50% time reduction)
  • The optimization goal is a transaction: raising the number of conversions.

Constraint number 1 states that we need the same visitors in both the pre-experiment and experiment periods, but the visitor journey in e-commerce usually lasts about a week.

In other words, there is very little chance of seeing the same visitors in both periods. In this context, only a very limited CUPED effect can be expected (proportional to the share of visitors seen in both periods).

Constraint number 2 states that the visitors must have the same behavior regarding the conversion (the KPI under optimization). Frankly, that constraint is simply never met in e-commerce.

An e-commerce conversion occurs either during the pre-experiment or during the experiment, but not both (unless your customers frequently purchase several times within the experiment window).

This means that there is no chance that the visitors’ conversions are correlated between the periods.
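
A small sketch shows why this matters so much: the CUPED adjustment removes variance roughly in proportion to the squared correlation between the pre-experiment covariate and the in-experiment metric, so zero correlation means zero speed-up. The data and correlation levels below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

def remaining_variance(correlation):
    """Simulate a pre-experiment covariate X and an in-experiment metric Y
    with a chosen correlation, apply the CUPED adjustment, and return the
    fraction of variance left (theoretically about 1 - correlation**2)."""
    x = rng.normal(size=n)
    y = correlation * x + np.sqrt(1 - correlation ** 2) * rng.normal(size=n)
    theta = np.cov(x, y)[0, 1] / np.var(x)
    y_cuped = y - theta * (x - x.mean())         # the CUPED-adjusted metric
    return y_cuped.var() / y.var()

for rho in (0.0, 0.3, 0.7):
    print(f"correlation {rho}: remaining variance ~ {remaining_variance(rho):.2f}")
# With zero correlation (e.g. one-off purchases that cannot show up in both
# periods), CUPED removes essentially nothing and the test runs no faster.
```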

In summary: CUPED is simply not applicable for e-commerce websites to optimize transactions.

This is clearly stated in the original scientific paper, but for the sake of popularity, this buzzword technique is being misrepresented in the testing industry.

In fact, as the scientific literature makes clear, CUPED only works on recurring conversions, for platforms whose visitors come back and perform the same actions repeatedly.

Great platforms for CUPED are search engines (like Bing, where the technique was invented) or streaming platforms, where users come daily and repeat the same actions (playing a video, clicking a link on a search results page, etc.).

Even if you try to find an application of CUPED for e-commerce, you’ll find out that it’s not possible.

  • One may say that you could try to optimize the number of products viewed, but the problem of constraint 1 still applies: very few visitors will be present in both datasets. And there is a more fundamental objection – this KPI should not be optimized on its own, otherwise you are potentially encouraging hesitation between products.
  • You cannot even try to optimize the number of products ordered per visitor with CUPED, because constraint number 2 still holds: the act of purchase can be considered instantaneous, so it can only happen in one period or the other, not both. If there is no correlation in visitor behavior to expect, then there is no CUPED effect to expect either.

Conclusion about CUPED

CUPED does not work for e-commerce websites where a transaction is the main optimization goal. Unless you are Bing, Google, or Netflix, CUPED won't be the secret ingredient that helps you optimize your business.

This technique is certainly a buzzword generating interest quickly. However, it's important to see the full picture before adding CUPED to your roadmap, and e-commerce brands should keep in mind that this testing technique is not suited to their business.

Optimization for low-traffic websites

Brands with lower traffic are still prime candidates for website optimization, even though they may need to adopt a less traditional approach.

Whether optimizing your web pages means choosing a page that’s higher up in the funnel or adopting a slightly lower threshold, continuous optimization is crucial.

Want to start optimizing your website? AB Tasty is the best-in-class experience optimization platform that empowers you to create a richer digital experience – fast. From experimentation to personalization, this solution can help you activate and engage your audience to boost your conversions.