
Frequentist vs Bayesian Methods in A/B Testing


When you’re running A/B tests, you’re making a choice—whether you know it or not.

Two statistical methods power how we interpret test results: Frequentist vs Bayesian A/B testing. The debates are fierce. The stakes are real. And at AB Tasty, we’ve picked our side.

If you’re shopping for an A/B testing platform, new to experimentation, or just trying to make sense of your results, understanding these methods matters. It’s the difference between guessing and knowing. Between implementing winners and chasing false positives.

Let’s break it down.


What is Inferential Statistics?

Both Frequentist and Bayesian methods live under the umbrella of inferential statistics.

Unlike descriptive statistics—which simply describes what already happened—inferential statistics help you forecast what’s coming. They let you extrapolate results from a sample to a larger population.

Here’s the question we’re answering: Would version A or version B perform better when rolled out to your entire audience?

A Quick Example

Let’s say you’re studying Olympic swimmers. With descriptive statistics, you could calculate:

  • Average height of the team
  • Height variance across athletes
  • Distribution above or below average

That’s useful, but limited.

Inferential statistics let you go further. Want to know the average height of all men on the planet? You can’t measure everyone. But you can infer that average from smaller, representative samples.

That’s where Frequentist vs Bayesian methods come in. Both help you make predictions from incomplete data—but they do it differently, especially when applied to A/B testing.

What is the Frequentist Statistics Method in A/B Testing?

The Frequentist approach is the classic. You’ve probably seen it in college stats classes or in most A/B testing tools.

This is one of the main Frequentist vs Bayesian A/B testing comparisons: Frequentist statistics focus on long-run frequencies and fixed hypotheses.

Here’s how it works:

The Hypothesis

You start by assuming there is no difference between version A and version B. This is called the null hypothesis.

At the end of your test, you get a P-Value (probability value). The P-Value tells you the probability of seeing your results—or more extreme results—if there really is no difference between your variations. In other words, how likely is it that your results happened by chance?

The smaller the P-Value, the more confident you can be that there’s a real difference between your A/B testing variations.
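To make the mechanics concrete, here is a minimal, illustrative sketch of the classic calculation behind many Frequentist tools: a two-proportion z-test. This is not AB Tasty’s exact implementation, and the visitor numbers are made up for the example.

```python
from math import sqrt, erfc

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the null hypothesis: A and B convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pooled = (conv_a + conv_b) / (n_a + n_b)            # best estimate under "no difference"
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                                   # standardized observed difference
    return erfc(abs(z) / sqrt(2))                          # probability of a difference at least this extreme

# Hypothetical experiment: 5,000 visitors per variation
p_value = two_proportion_p_value(conv_a=250, n_a=5000, conv_b=290, n_b=5000)
print(f"p-value = {p_value:.3f}")   # smaller = stronger evidence of a real difference
```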

What is the Bayesian Statistics Method in A/B Testing?

The Bayesian approach takes a different route—and we think it’s a smarter one for many A/B testing scenarios.

Bayes’ theorem formula: P(A | B) = P(B | A) × P(A) / P(B)

Named after British mathematician Thomas Bayes, this method allows you to incorporate prior information (a “prior”) into your analysis. It’s built around three overlapping concepts:

The Three Pillars of Bayesian Analysis

  • Prior: Information from previous experiments. At the start, we use a “non-informative” prior—essentially a blank slate.
  • Evidence: The data from your current experiment.
  • Posterior: Updated information combining the prior and evidence. This is your result.
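In terms of the formula above, the posterior is simply the prior reweighted by how well each hypothesis explains the evidence:

P(hypothesis | evidence) ∝ P(evidence | hypothesis) × P(hypothesis)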

Here’s the game-changer: Bayesian A/B testing is designed for ongoing experiments.  Every time you check your data, the previous results become the “prior,” and new incoming data becomes the “evidence.”

That means data peeking is built into the design. Each time you look, the analysis is valid.

Even better? Bayesian statistics let you estimate the actual gain of a winning variation—not just that it won—making Frequentist vs Bayesian methods in A/B testing very different from a decision-making perspective.
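To make the prior/evidence/posterior loop concrete, here is an illustrative sketch using a Beta-Binomial model. This is a common way to implement Bayesian A/B testing, not necessarily AB Tasty’s exact engine, and the visitor counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Non-informative prior: Beta(1, 1), i.e. a blank slate
prior_alpha, prior_beta = 1, 1

# Evidence from the current experiment (hypothetical numbers)
conv_a, n_a = 250, 5000   # original
conv_b, n_b = 290, 5000   # variation

# Posterior = prior updated with the evidence (sampled here for easy comparison)
post_a = rng.beta(prior_alpha + conv_a, prior_beta + n_a - conv_a, size=200_000)
post_b = rng.beta(prior_alpha + conv_b, prior_beta + n_b - conv_b, size=200_000)

prob_b_beats_a = (post_b > post_a).mean()
relative_gain = (post_b - post_a) / post_a          # gain of B over A, in relative terms

low, median, high = np.percentile(relative_gain, [2.5, 50, 97.5])
print(f"P(B > A) = {prob_b_beats_a:.1%}")
print(f"Median gain = {median:+.1%}, 95% credible interval = [{low:+.1%}, {high:+.1%}]")
```

Because the posterior from one look becomes the prior for the next, re-running this update whenever new data arrives is exactly the peeking workflow described above, and the credible interval on the relative gain is the kind of range shown in the reporting example below.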

Bayesian Pros

  • Peek freely: Check your data during a test without compromising accuracy. Stop losing variations early or switch to winners faster.
  • See the gain: Know the actual improvement range, not just which version won.
  • Fewer false positives: The method naturally rules out many misleading results in A/B testing.

Bayesian Cons

  • More computational power: Requires a sampling loop, which demands more CPU load at scale (though this doesn’t affect users).

Frequentist vs Bayesian A/B Testing: The Comparison

Let’s be clear: both methods are statistically valid. But when you compare Frequentist vs Bayesian A/B testing, the practical implications are very different.

At AB Tasty, we have a clear preference for the Bayesian A/B testing approach.

Here’s why.

Gain Size Matters

With Bayesian A/B testing, you don’t just know which version won—you know by how much.

This is critical in business. When you run an A/B test, you’re deciding whether to switch from version A to version B.

That decision involves:

  • Implementation costs (time, resources, budget)
  • Associated costs (vendor licenses, maintenance)

Example: You’re testing a chatbot on your pricing page. Version B (with chatbot) outperforms version A. But implementing version B requires two weeks of developer time plus a monthly chatbot license.

You need to know if the math adds up. Bayesian statistics give you that answer by quantifying the gain from your A/B testing experiment.

Real Example from AB Tasty Reporting

Let’s look at a test measuring three variations against an original, with “CTA clicks” as the KPI.

[Screenshot: AB Tasty reporting dashboard showing transaction rates and growth metrics across four variations, with a performance trend graph.]

Variation 3 wins with a 34.1% conversion rate (vs. 25% for the original).

But here’s where it gets interesting:

  • Median gain: +36.4%
  • Lowest possible gain: +2.25%
  • Highest possible gain: +48.40%

In 95% of cases, your gain will fall between +2.25% and +48.40%.
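As a quick sanity check, the median gain lines up with the conversion rates shown above: (34.1% − 25%) / 25% ≈ +36.4%.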

This granularity helps you decide whether to roll out the winner:

  • Both ends positive? Great sign.
  • Narrow interval? High confidence. Go for it.
  • Wide interval but low implementation cost? Probably safe to proceed.
  • Wide interval with high implementation cost? Wait for more data.

This is a concrete illustration of how Frequentist vs Bayesian methods in A/B testing lead to different levels of decision-making insight.

When to Trust Your Results?

At AB Tasty, we recommend waiting until you’ve hit these benchmarks:

  • At least 5,000 unique visitors per variation
  • Test runs for at least 14 days (two business cycles)
  • 300 conversions on your main goal

These thresholds apply regardless of whether you use a Frequentist or Bayesian method, but Bayesian A/B testing gives you more interpretable outputs once you reach them.
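If you want to automate these checks in your own reporting, a minimal sketch could look like this. The thresholds are the recommendations above; the function name and data are hypothetical.

```python
def test_is_ready(visitors_per_variation, days_running, conversions_on_main_goal):
    """Return True once all of the recommended reliability thresholds are met."""
    return (
        min(visitors_per_variation) >= 5_000   # at least 5,000 unique visitors per variation
        and days_running >= 14                 # at least two business cycles
        and conversions_on_main_goal >= 300    # at least 300 conversions on the main goal
    )

print(test_is_ready(visitors_per_variation=[5200, 5150], days_running=15,
                    conversions_on_main_goal=340))   # True
```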

Data Peeking: A Bayesian Advantage

Here’s a scenario: You’re running an A/B test for a major e-commerce promotion. Version B is tanking—losing you serious money.

With Bayesian A/B testing, you can stop it immediately. No need to wait until the end.

Conversely, if version B is crushing it, you can switch all traffic to the winner earlier than with Frequentist methods.

This is the logic behind our Dynamic Traffic Allocation feature—and it wouldn’t be possible without Bayesian statistics.

How Does Dynamic Traffic Allocation Work?

Dynamic Traffic Allocation balances exploration (gathering data) with exploitation (maximizing conversions).

[Screenshot: AB Tasty traffic allocation interface with slider controls and a pie chart showing the test split between the original and variations.]

In practice, you simply:

  • Check the Dynamic Traffic Allocation box.
  • Pick your primary KPI.
  • Let the algorithm decide when to send more traffic to the winner.

This approach shines when:

  • Testing micro-conversions over short periods
  • Running time-limited campaigns (holiday sales, flash promotions)
  • Working with low-traffic pages
  • Testing 6+ variations simultaneously

Again, this is where Frequentist vs Bayesian methods in A/B testing diverge: Frequentist statistics are not naturally designed for safe continuous monitoring and dynamic allocation in the same way.
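AB Tasty doesn’t publish the internals of Dynamic Traffic Allocation, but the exploration/exploitation trade-off it describes is typically handled by a Bayesian “bandit” algorithm. Here is a minimal Thompson-sampling sketch under that assumption; the variation names, conversion rates, and traffic numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonAllocator:
    """Bayesian bandit: route more traffic to variations that look like winners."""

    def __init__(self, n_variations):
        self.successes = np.ones(n_variations)   # Beta(1, 1) prior for each variation
        self.failures = np.ones(n_variations)

    def choose(self):
        # Sample a plausible conversion rate per variation, pick the best one
        samples = rng.beta(self.successes, self.failures)
        return int(np.argmax(samples))

    def update(self, variation, converted):
        if converted:
            self.successes[variation] += 1
        else:
            self.failures[variation] += 1

# Simulated campaign: variation 1 truly converts better (6% vs 4%)
true_rates = [0.04, 0.06]
allocator = ThompsonAllocator(n_variations=2)
for _ in range(20_000):
    v = allocator.choose()
    allocator.update(v, converted=rng.random() < true_rates[v])

traffic_share = (allocator.successes + allocator.failures - 2) / 20_000
print("Traffic share per variation:", np.round(traffic_share, 2))
```

In the simulation, most of the traffic ends up on the genuinely better variation, which is the behavior you want in a time-limited campaign.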

Bayesian False Positives Explained

A false positive occurs when test results suggest version B improves performance—but in reality, it doesn’t. Usually, version B simply performs the same as version A rather than worse.

False positives happen with both Frequentist and Bayesian methods in A/B testing. But here’s the difference:

How Does Bayesian Testing Limit False Positives?

Because Bayesian A/B testing provides a gain interval, you’re less likely to implement a false positive in the first place.

Example: Your test shows version B wins with 95% confidence, but the median improvement is only 1%. Even if this is a false positive, you probably won’t implement it—the resources needed don’t justify such a small gain.

With Frequentist methods, you don’t see the gain interval. You might implement that false positive, wasting time and energy on changes that bring zero return.

[Chart: Gain probability using Bayesian statistics]

The standard rule of thumb is 95% confidence—you’re 95% sure version B performs as indicated, with a 5% risk it doesn’t.

For most campaigns, 95% confidence works just fine. But when the stakes are high—think major product launches or business-critical tests—you can dial up your confidence threshold to 97%, 98%, or even 99%.

Just know this: whether you’re using Frequentist or Bayesian methods, higher confidence means you’ll need more time and traffic to reach statistical significance. It’s a trade-off worth making when precision matters most.

While this seems like a safe bet – and it is the right choice for high-stakes campaigns – it’s not something to apply across the board.

This is because:

  • To reach a higher threshold, you’ll have to wait longer for results, leaving you less time to reap the rewards of a positive outcome.
  • You will implicitly only declare winners with bigger gains (which are rarer), letting go of smaller improvements that could still be impactful.
  • If your web page receives relatively little traffic, you may want to consider a different approach.

Conclusion

So which is better—Frequentist or Bayesian?

Both are sound statistical methods. But when you look at Frequentist vs Bayesian methods in A/B testing, we’ve chosen the Bayesian approach because it helps teams make better business decisions.

Here’s what you get:

  • Flexibility: Peek at data without compromising accuracy.
  • Actionable insights: Know the gain size, not just the winner.
  • Maximized returns: Dynamic Traffic Allocation optimizes automatically.
  • Fewer false positives: Built-in safeguards against misleading results.

When you’re shopping for an A/B testing platform, find one that gives you results you can trust—and act on.

Want to see Bayesian A/B testing in action? AB Tasty makes it easy to set up tests, gather insights via an ROI dashboard, and determine which changes will increase your revenue. 

Ready to go further? Let’s build better experiences together →

FAQs

What’s the main difference between Bayesian and Frequentist A/B testing?

When you compare Frequentist vs Bayesian methods in A/B testing, Frequentist methods test whether there’s a difference between variations using a P-Value at the end of the experiment. Bayesian methods estimate the size of the gain and let you update results continuously as new data comes in.

Can I peek at my A/B test results early?

With Bayesian A/B testing, yes. The method is designed for ongoing analysis. With Frequentist methods, peeking early creates misleading results because it effectively turns one experiment into multiple experiments.

What is a false positive in A/B testing?

A false positive occurs when test results suggest version B improves performance, but in reality it doesn’t. Bayesian methods help limit false positives by showing the gain interval, making it less likely you’ll implement a variation with minimal or no real improvement.

What confidence level should I use for my A/B tests?

95% confidence is standard for most marketing campaigns. For high-stakes A/B testing, you can increase to 97%, 98%, or 99%—but this requires more time and traffic to reach statistical significance, regardless of whether you use Frequentist or Bayesian methods.

How long should I run my A/B test?

At AB Tasty, we recommend running tests for at least 14 days (two business cycles) and collecting at least 5,000 unique visitors per variation and 300 conversions on your main goal. These benchmarks help both Frequentist and Bayesian approaches produce reliable insights.

What is Dynamic Traffic Allocation?

Dynamic Traffic Allocation is an automated feature that balances data exploration with conversion maximization in A/B testing. Once the algorithm identifies a winning variation with confidence, it automatically sends more traffic to that version—helping you maximize returns while still gathering reliable data using Bayesian methods.


16 Experimentation Influencers You Should Follow

Building a culture of experimentation requires an appetite for iteration, a fearless approach to failure and a test-and-learn mindset. The 1000 Experiments Club podcast digs into all of that and more with some of the most influential voices in the industry. 

From CEOs and Founders to CRO Managers and more, these experts share the lessons they’ve learned throughout their careers in experimentation at top tech companies and insights on where the optimization industry is heading. 

Whether you’re an A/B testing novice or a seasoned pro, here are some of our favorite influencers in CRO and experimentation that you should follow:

Ronny Kohavi

Ronny Kohavi, a pioneer in the field of experimentation, brings over three decades of experience in machine learning, controlled experiments, AI, and personalization.

He was a Vice President and Technical Fellow at Airbnb. Prior to that, he was Technical Fellow and Corporate Vice President at Microsoft, where he led the analysis and experimentation team (ExP). Before that, he was Director of Personalization and Data Mining at Amazon.

Ronny teaches an online interactive course on Accelerating Innovation with A/B Testing, which has been attended by over 800 students.

Ronny’s work has helped lay the foundation for modern online experimentation, influencing how some of the world’s biggest companies approach testing and decision-making.

He advocates for a gradual rollout approach over the typical 50/50 split at launch:

“One thing that turns out to be really useful is to start with a small ramp-up. Even if you plan to go to 50% control and 50% treatment, start at 2%. If something egregious happens—like a metric dropping by 10% instead of the 0.5% you’re monitoring for—you can detect it in near real time.”

This slow ramp-up helps teams catch critical issues early and protect user experience.

Follow Ronny

Talia Wolf

Talia Wolf is a conversion optimization specialist and founder & CEO of Getuplift, where she helps businesses boost revenue, leads, engagement, and sales through emotional targeting, persuasive design, and behavioral data.

She began her career at a social media agency, where she was introduced to CRO, then served as Marketing Director at monday.com before launching her first agency, Conversioner, in 2013.

Talia teaches companies to optimize their online presence using emotionally-driven strategies. She emphasizes that copy and visuals should address customers’ needs rather than focusing solely on the product.

For Talia, emotional marketing is inherently customer-centric and research-based. From there, experiments can be built into A/B testing platforms using a clear North Star metric—whether checkouts, sign-ups, or add-to-carts—to validate hypotheses and drive growth.

Follow Talia

Elissa Quinby

Elissa Quinby is the Head of Product Marketing at e-commerce acceleration platform Pattern, with a career rooted in retail, marketing, and customer experience.

Before joining Pattern, she led retail marketing as Senior Director at Quantum Metric. She began her career as an Assistant Buyer at American Eagle Outfitters, then spent two years at Google as a Digital Marketing Strategist. Elissa went on to spend eight years at Amazon, holding roles across marketing, program management, and product.

Elissa emphasizes the importance of starting small to build trust with new customers. “The goal is to offer value in exchange for data,” she explains, pointing to first-party data as the “secret sauce” behind many successful companies.

She encourages brands to experiment with creative ways of gathering customer information—always with trust at the center—so they can personalize experiences and deepen customer understanding over time.

Follow Elissa

Lukas Vermeer

Lukas Vermeer, Director of Experimentation at Vista, is an expert in designing, implementing, and scaling experimentation programs. He previously spent over eight years at Booking.com, where he held roles as a product manager, data scientist, and ultimately Director of Experimentation.

With a background in machine learning and AI, Lukas specializes in building the infrastructure and processes needed to scale testing and drive business growth. He also consults with companies to help them launch and accelerate their experimentation efforts.

Given today’s fast-changing environment, Lukas believes that roadmaps should be treated as flexible guides rather than rigid plans:
“I think roadmaps aren’t necessarily bad, but they should acknowledge the fact that there is uncertainty. The deliverable should be clarifications of that uncertainty, rather than saying, ‘In two months, we’ll deliver feature XYZ.’”

Instead of promising final outcomes, Lukas emphasizes embracing uncertainty to make better, data-informed decisions.

Follow Lukas

Jonny Longden

Jonny Longden is the Chief Growth Officer at Speero, with over 17 years of experience improving websites through data and experimentation. He previously held senior roles at Boohoo Group, Journey Further, Sky, and Visa, where he led teams across experimentation, analytics, and digital product.

Jonny believes that smaller companies and startups—especially in their early, exploratory stages—stand to benefit the most from experimentation. Without testing, he argues, most ideas are unlikely to succeed.

“Without experimentation, your ideas are probably not going to work,” Jonny says. “The things that seem obvious often don’t deliver results, and the ideas that seem unlikely or even a bit silly can sometimes have the biggest impact.”

For Jonny, experimentation isn’t just a tactic—it’s the only reliable way to uncover what truly works and drive meaningful, data-backed progress.

Follow Jonny

Ruben de Boer

Ruben de Boer is a Lead CRO Manager at Online Dialogue and founder of Conversion Ideas, with over 14 years of experience in data and optimization.

At Online Dialogue, he leads the team of Conversion Managers—developing skills, maintaining quality, and setting strategy and goals. Through his company, Conversion Ideas, Ruben helps people launch their careers in CRO and experimentation by offering accessible, high-quality courses and resources.

Ruben believes experimentation shouldn’t be judged solely by outcomes. “Roughly 25% of A/B tests result in a winner, meaning 75% of what’s built doesn’t get released—and that can feel like failure if you’re only focused on output,” he explains.

Instead, he urges teams to shift their focus to customer-centric insights. When the goal becomes understanding the user—not just releasing features—the entire purpose of experimentation evolves.

Follow Ruben

David Mannheim

David Mannheim is a digital experience strategist with over 15 years of expertise helping brands like ASOS, Sports Direct, and Boots elevate their conversion strategies.

He is the CEO and founder of Made With Intent, focused on advancing innovative approaches to personalization through AI. Previously, he founded User Conversion, which became one of the UK’s largest independent CRO consultancies.

David recently authored a book exploring what he calls the missing element in modern personalization: the person. “Remember the first three syllables of personalization,” he says. “That often gets lost in data.”

He advocates for shifting focus from short-term gains to long-term customer value—emphasizing metrics like satisfaction, loyalty, and lifetime value over volume-based wins.

“More quality than quantity,” David explains, “and more recognition of the intangibles—not just the tangibles—puts brands in a much better place.”

Follow David

Marianne Stjernvall

Marianne Stjernvall has over a decade of experience in CRO and experimentation, having executed more than 500 A/B tests and helped over 30 organizations grow their testing programs.

Marianne is the founder of Queen of CRO and co-founder of ConversionHub, Sweden’s most senior CRO agency. As an established CRO consultant, she helps organizations build experimentation-led cultures grounded in data and continuous learning.

Marianne also teaches regularly, sharing her expertise on the full spectrum of CRO, A/B testing, and experimentation execution.

She stresses the importance of a centralized testing approach:

“If each department runs experiments in isolation, you risk making decisions based on three different data sets, since teams will be analyzing different types of data. Having clear ownership and a unified framework ensures the organization works cohesively with tests.”

Follow Marianne

Ben Labay

Ben Labay is the CEO of Speero, blending academic rigor in statistics with deep expertise in customer experience and UX.

Holding degrees in Evolutionary Behavior and Conservation Research Science, Ben began his career as a staff researcher at the University of Texas, specializing in data modeling and research.

This foundation informs his work at Speero, where he helps organizations leverage customer data to make better decisions.

Ben emphasizes that insights should lead to action and reveal meaningful patterns. “Every agency and in-house team collects data and tests based on insights, but you can’t stop there.”

Passionate about advancing experimentation, Ben focuses on developing new models, applying game theory, and embracing bold innovation to uncover bigger, disruptive insights.

Follow Ben

André Morys

André Morys, CEO and founder of konversionsKRAFT, has nearly three decades of experience in experimentation, digital growth, and e-commerce optimization.

Fueled by a deep fascination with user and customer experience, André guides clients through the experimentation process using a blend of data, behavioral economics, consumer psychology, and qualitative research.

He believes the most valuable insights lie beneath the surface. “Most people underestimate the value of experimentation because of the factors that are hard to measure,” André explains.

“You cannot measure the influence of experimentation on your company’s culture, yet that impact may be ten times more important than the immediate uplift you create.”

This philosophy is central to his “digital experimentation framework,” which features his signature “Iceberg Model” to capture both measurable and intangible effects of testing.

Follow André

Jeremy Epperson

Jeremy Epperson is the founder of Thetamark and has dedicated 14 years to conversion rate optimization and startup growth. He has worked with some of the fastest-growing unicorn startups in the world, researching, building, and implementing CRO programs for more than 150 growth-stage companies.

By gathering insights from diverse businesses, Jeremy has developed a data-driven approach to identify testing roadblocks, allowing him to optimize CRO processes and avoid the steep learning curves often associated with new launches.

In his interview, Jeremy emphasizes focusing on customer experience to drive growth. He explains, “We will do better as a business when we give the customer a better experience, make their life easier, simplify conversion, and eliminate the roadblocks that frustrate them and cause abandonment.”

His ultimate goal with experimentation is to create a seamless process from start to finish.

Follow Jeremy

Chad Sanderson

Chad Sanderson is the CEO and founder of Gable, a B2B data infrastructure SaaS company, and a renowned expert in digital experimentation and large-scale analysis.

He is also a product manager, public speaker, and writer who has lectured on topics such as the statistics of digital experimentation, advanced analysis techniques, and small-scale testing for small businesses.

Chad previously served as Senior Program Manager for Microsoft’s AI platform and was the Personalization Manager for Subway’s experimentation team.

He advises distinguishing between front-end (client-side) and back-end metrics before running experiments. Client-side metrics, such as revenue per transaction, are easier to track but may narrow focus to revenue growth alone.

“One set of metrics businesses mess up is relying only on client-side metrics like revenue per purchase,” Chad explains. “While revenue is important, focusing solely on it can drive decisions that overlook the overall impact of a feature.”

Follow Chad

Carlos Gonzalez de Villaumbrosia

Carlos Gonzalez de Villaumbrosia has spent the past 12 years building global companies and digital products.

With a background in Global Business Management and Marketing, Computer Science, and Industrial Engineering, Carlos founded Floqq—Latin America’s largest online education marketplace.

In 2014, he founded Product School, now the global leader in Product Management training.

Carlos believes experimentation has become more accessible and essential for product managers. “You no longer need a background in data science or engineering to be effective,” he says.

He views product managers as central figures at the intersection of business, design, engineering, customer success, data, and sales. Success in this role requires skills in experimentation, roadmapping, data analysis, and prototyping—making experimentation a core competency in today’s product landscape.

Follow Carlos

Bhavik Patel

Bhavik Patel is the Data Director at Huel, an AB Tasty customer, and the founder of CRAP Talks, a meetup series connecting CRO professionals across Conversion Rate, Analytics, and Product.

Previously, he served as Product Analytics & Experimentation Director at Lean Convert, where he led testing and optimization strategies for top brands. With deep expertise in experimentation, personalization, and data-driven decision-making, Bhavik helps teams evolve from basic A/B testing to strategic, high-impact programs and create better digital experiences.

His philosophy centers on disruptive testing—bold experiments aimed at breaking past local maximums to deliver statistically meaningful results. “Once you’ve nailed the fundamentals, it’s time to make bigger bets,” he says.

Bhavik also stresses the importance of identifying the right problem before jumping to solutions: “The best solution for the wrong problem isn’t going to have any impact.”

Follow Bhavik

Rand Fishkin

Rand Fishkin is the co-founder and CEO of SparkToro, creators of audience research software designed to make audience insights accessible to all.

He also founded Moz and co-founded Inbound.org with Dharmesh Shah, which was later acquired by HubSpot in 2014. Rand is a frequent global keynote speaker on marketing and entrepreneurship, dedicated to helping people improve their marketing efforts.

Rand highlights the untapped potential in niche markets:
“Many founders don’t consider the power of serving a small, focused group of people—maybe only a few thousand—who truly need their product. If you make it for them, they’ll love it. There’s tremendous opportunity there.”

A strong advocate for risk-taking and experimentation, Rand encourages marketers to identify where their audiences are and engage them directly there.

Follow Rand

Shiva Manjunath

Shiva Manjunath is the Senior Web Product Manager of CRO at Motive and host of the podcast From A to B. With experience at companies like Gartner, Norwegian Cruise Line, and Edible, he’s spent years digging into user behavior and driving real results through experimentation.

Shiva is known for challenging the myth of “best practices,” emphasizing that optimization requires context, not checklists. “If what you believe is this best practice checklist nonsense, all CRO is just a checklist of tasks to do on your site. And that’s so incorrect,” he says.

At Gartner, a simplified form (typically seen as a CRO win) led to a drop in conversions, reinforcing his belief that true experimentation is about understanding why users act, not just what they do.

Through his work and podcast, Shiva aims to demystify CRO and encourage practitioners to think deeper, test smarter, and never stop asking questions.

Follow Shiva


Heatmaps: Your Team’s Secret Weapon for Uncovering Website Gold

What are heatmaps? (and why your team needs them)

Think of heatmaps as your website’s truth-teller. They’re visual snapshots showing exactly where visitors click, scroll, and linger. No guesswork required.

Here’s how they work: Warm colors (reds, oranges) highlight the hotspots where users engage most. Cool colors (blues, greens) reveal the overlooked zones that might need attention.

The best part? Your visitors do all the heavy lifting. They show you what’s working and what’s not, so your team can make changes that actually move the needle.

Spot the signals: When to bring heatmaps into play

Heatmaps aren’t just pretty pictures—they’re your optimization toolkit’s MVP. Here’s how they deliver the biggest impact:

Measuring real engagement

Writing content that no one reads? Heatmaps show you exactly where readers drop off. If only 10% of visitors reach your CTA, it’s time to shake things up.

Tracking what matters: Actions

Are people clicking where you want them to? Heatmaps reveal if visitors complete your desired actions—or where they’re getting stuck instead.

Highlighting where attention sticks (and slips)

What grabs your attention first? What images distract from your main message? Heatmaps answer these questions so you can double down on what works.

Once you have these insights, bigger questions become easier to tackle:

  • Where should we place our most important content?
  • How can we use images and videos more effectively?
  • What’s pulling attention away from our goals?

The essential heatmap lineup every team needs

Most modern heatmap tools offer multiple views of user behavior. We partner closely with some of the major players already. Let’s break down the most common ones you’ll come across.

Click Heatmaps: The Action Tracker

These maps show every click on your page, with dense concentrations appearing as bright white areas surrounded by warm colors. Think of them as your conversion reality check.

What it tells you: Whether people click where you want them to—or if they’re trying to click non-clickable elements that look interactive.

How to use it: Look for clicks scattered around non-interactive text or images. These “frustrated clicks” signal design problems. If users are clicking on underlined text that isn’t a link, or images they expect to be clickable, you need to either make those elements functional or redesign them to look less interactive.

Pro tip: Compare click density on your primary CTA versus other page elements. If secondary elements are getting more clicks than your main conversion button, it’s time to redesign your visual hierarchy.

Scroll Heatmaps: The Attention Meter

See how far down visitors scroll and what percentage of users reach each section of your page. This is crucial for understanding whether your important content is actually being seen.

What it tells you: If users actually see your important content or bail before reaching your CTA. Most importantly, it shows you the “fold line”—where 50% of users stop scrolling.

How to use it: Identify the scroll percentage where you lose half your audience, then ensure all critical elements (value propositions, CTAs, key benefits) appear above that line. If your main CTA is only seen by 20% of visitors, move it higher or add secondary CTAs above the fold.

Pro tip: Use scroll maps to optimize content length. If 80% of users stop reading halfway through your blog post, either shorten the content or add more engaging elements (images, subheadings, interactive elements) to keep them scrolling.

Click Percentage Maps: The Element Analyzer

This view breaks down clicks by specific elements, showing exactly how many people clicked each button, image, or link as a percentage of total visitors.

What it tells you: Which elements deserve prime real estate and which ones are dead weight. You’ll see precise engagement rates for every clickable element on your page.

How to use it: Rank your page elements by click percentage to understand what’s actually driving engagement. If your newsletter signup gets 15% clicks but your main product CTA only gets 3%, you might need to redesign your primary call-to-action or reconsider your page goals.

Pro tip: Use this data to inform A/B tests. If one button consistently outperforms others, test applying its design (color, size, copy) to underperforming elements.

Confetti Maps: The Individual Click Tracker

Instead of showing click density, these maps display each individual click as a colored dot. Perfect for spotting users trying to click non-clickable areas or understanding click patterns in detail.

What it tells you: Where to add functionality or remove confusion. Each dot represents a real user’s intent to interact with something on your page.

How to use it: Look for clusters of dots over non-interactive elements—these represent frustrated users trying to click things that don’t work. Also watch for dots scattered far from any actual buttons or links, which might indicate responsive design issues or accidental clicks.

Pro tip: Filter confetti maps by traffic source or user segment. Mobile users might have different click patterns than desktop users, and organic traffic might behave differently than paid traffic.

Mobile-Specific Heatmaps: The Touch Tracker

Modern tools capture mobile-specific actions like taps, swipes, pinches, and multi-touch gestures—because mobile behavior is fundamentally different from desktop.

What it tells you: How to optimize for the majority of your traffic (since mobile often dominates). Mobile users have different interaction patterns, attention spans, and conversion behaviors.

How to use it: Create separate heatmaps for mobile and desktop traffic. Mobile users typically scroll faster, have shorter attention spans, and interact differently with buttons and forms. Use this data to optimize button sizes, reduce form fields, and adjust content layout for mobile-first experiences.

Pro tip: Pay special attention to thumb-reach zones on mobile heatmaps. Elements that are easy to tap with a thumb (bottom third of screen, right side for right-handed users) typically get higher engagement rates.

Learn more about best practices for designing for mobile experiences with our Mobile Optimization Guide.

Eyes vs. clicks: Understanding the key differences

While heatmaps track mouse movements and clicks, eye-tracking follows actual gaze patterns. Eye-tracking gives deeper insights but requires specialized equipment most teams don’t have.

The good news? AI-powered tools like Feng-Gui and EyeQuant now simulate eye-tracking through algorithms, making this technology more accessible.

Bottom line: Start with heatmaps. They’re easier to implement and give you actionable insights right away.

Features that make or break your heatmapping game

Not all heatmap tools are created equal. Here’s what your team should prioritize:

Must-have features:

  • Audience Segmentation: Create maps for specific user groups (new vs. returning visitors, mobile vs. desktop)
  • Map Comparison: Easily compare results across different segments
  • Page Templates: Aggregate data for similar page types (crucial for e-commerce sites)
  • Mobile Optimization: Track touch, scroll, and swipe behaviors
  • Export Capabilities: Share results with your team effortlessly
  • Dynamic Element Tracking: Capture interactions with dropdowns, sliders, and AJAX-loaded content
  • Historical Data: Preserve old heatmaps even after design changes

Test smarter with heatmap insights

Here’s where things get exciting. Heatmaps show you the problems, but how do you know if your fixes actually work?

Enter A/B testing.

This three-step approach turns insights into results:

  • Identify problems with heatmaps
  • Test potential solutions with A/B testing
  • Choose the highest-performing solution based on data

Real Example:

Nonprofit UNICEF France wanted to better understand how visitors perceived its homepage ahead of a major redesign.

Their move: UNICEF France combined on-site surveys with heatmapping to gather both qualitative feedback and visual behavioral data.

The result: Heatmaps showed strong engagement with the search bar, while surveys confirmed it was seen as the most useful element. Less-used features, like social share icons, were removed in the redesign—resulting in a cleaner, more user-focused homepage.

Continue reading this case study

Connect the dots and act with confidence

Ready to put heatmaps to work? Here’s your game plan:

Start small. Pick one high-traffic page and run your first heatmap analysis.

Look for patterns. Are users clicking where you expect? Scrolling to your key content? Getting stuck somewhere?

Test your hunches. Use A/B testing to validate any changes before rolling them out site-wide.

Iterate forward. Heatmaps aren’t a one-and-done tool but part of your ongoing optimization process.

Remember: every click tells a story. Every scroll reveals intent. Your visitors are already showing you how to improve—you just need to listen.


Ready to see what your visitors are really doing? Heatmaps give you the insights. A/B testing helps you act on them. Together, they’re your path to better conversions and happier users.



Transaction Testing With AB Tasty’s Report Copilot

Transaction testing, which focuses on increasing the rate of purchases, is a crucial strategy for boosting your website’s revenue. 

To begin, it’s essential to differentiate between conversion rate (CR) and average order value (AOV), as they provide distinct insights into customer behavior. Understanding these metrics helps you implement meaningful changes to improve transactions.

In this article, we’ll delve into the complexities of transaction metrics analysis and introduce our new tool, the “Report Copilot,” designed to simplify report analysis. Read on to learn more.

Transaction Testing

To understand how test variations impact total revenue, focus on two key metrics:

  • Conversion Rate (CR): This metric indicates whether sales are increasing or decreasing. Tactics to improve CR include simplifying the buying process, adding a “one-click checkout” feature, using social proof, or creating urgency through limited inventory.
  • Average Order Value (AOV): This measures how much each customer is buying. Strategies to enhance AOV include cross-selling or promoting higher-priced products.

By analyzing CR and AOV separately, you can pinpoint which metrics your variations impact and make informed decisions before implementation. For example, creating urgency through low inventory may boost CR but could reduce AOV by limiting the time users spend browsing additional products. After analyzing these metrics individually, evaluate their combined effect on your overall revenue.

Revenue Calculation

The following formula illustrates how CR and AOV influence revenue:

Revenue = Number of Visitors × Conversion Rate × AOV

In the first part of the equation (Number of Visitors × Conversion Rate), you determine how many visitors become customers. The second part (× AOV) calculates the total revenue from these customers.
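A quick worked example with hypothetical numbers shows how the two levers combine:

```python
def revenue(visitors, conversion_rate, aov):
    """Revenue = Number of Visitors × Conversion Rate × AOV."""
    return visitors * conversion_rate * aov

# Hypothetical scenario: the variation raises CR but slightly lowers AOV
original  = revenue(visitors=100_000, conversion_rate=0.030, aov=80.0)   # 240,000
variation = revenue(visitors=100_000, conversion_rate=0.033, aov=76.0)   # 250,800

print(f"Original:  {original:,.0f}")
print(f"Variation: {variation:,.0f}  ({variation / original - 1:+.1%})")
```

Here the CR gain outweighs the AOV drop, but the opposite can just as easily happen, which is why the mixed scenario in the list below is the tricky one.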

Consider these scenarios:

  • If both CR and AOV increase, revenue will rise.
  • If both CR and AOV decrease, revenue will fall.
  • If either CR or AOV increases while the other remains stable, revenue will increase.
  • If either CR or AOV decreases while the other remains stable, revenue will decrease.
  • If CR and AOV move in opposite directions, the impact on revenue depends on the relative size of each change.

The last scenario, where CR and AOV move in opposite directions, is particularly complex due to the variability of AOV. Current statistical tools struggle to provide precise insights on AOV’s overall impact, as it can experience significant random fluctuations. For more on this, read our article “Beyond Conversion Rate.”

While these concepts may seem intricate, our goal is to simplify them for you. Recognizing that this analysis can be challenging, we’ve created the “Report Copilot” to automatically gather and interpret data from variations, offering valuable insights.

Report Copilot

The “Report Copilot” from AB Tasty automates data processing, eliminating the need for manual calculations. This tool empowers you to decide which tests are most beneficial for increasing revenue.

Here are a few examples from real use cases.

Winning Variation:

The left screenshot provides a detailed analysis, helping users draw conclusions about their experiment results. Experienced users may prefer the summarized view on the right, also available through the Report Copilot.

Complex Use Case:


The screenshot above demonstrates a case where CR and AOV show opposite trends, which requires a deeper understanding of the context.

It’s important to note that the Report Copilot doesn’t make decisions for you; it highlights the most critical parts of your analysis, allowing you to make informed choices.

Conclusion

Transaction analysis is complex, requiring a breakdown of components like conversion rate and average order value to better understand their overall effect on revenue. 

We’ve developed the Report Copilot to assist AB Tasty users in this process. This feature leverages AB Tasty’s extensive experimentation dashboard to provide comprehensive, summarized analyses, simplifying decision-making and enhancing revenue strategies.


The Past, Present, and Future of Experimentation | Bhavik Patel

What is the future of experimentation? Bhavik Patel highlights the importance of strategic planning and innovation to achieve meaningful results.

A thought leader in the worlds of CRO and experimentation, Bhavik Patel founded the popular UK-based meetup community CRAP (Conversion Rate, Analytics, Product) Talks seven years ago to fill a gap in the event market, covering a broad range of optimization topics from CRO, data analysis, and product management to data science, marketing, and user experience.

After following his passion throughout the industry from acquisition growth marketing to experimentation and product analytics, Bhavik landed the role of Product Analytics & Experimentation Director at product measurement consultancy, Lean Convert, where his interests have converged. Here he is scaling a team and supporting their development in data and product thinking, as well as bringing analytical and experimentation excellence into the organization.

AB Tasty’s CMO Marylin Montoya spoke with Bhavik about the future of experimentation and how we might navigate the journey from the current mainstream approach to the potentialities of AI technology.

Here are some of the key takeaways from their conversation.

The evolution of experimentation: a scientific approach.

Delving straight to the heart of the conversation, Bhavik talks us through the evolution of A/B testing, from its roots in the scientific method, to recent and even current practices – which involve a lot of trial and error to test basic variables. When projecting into the future, we need to consider everything from people, to processes, and technology.

Until recently, conversion rate optimization has mostly been driven by marketing teams, with a focus on optimizing the basics such as headlines, buttons, and copy. Over the last few years, product development has started to become more data driven. Within the companies taking this approach, the product teams are the recipients of the A/B test results, but the people behind these tests are the analytical and data science teams, who are crafting new and advanced methods, from a statistical standpoint.

Rather than making a change on the homepage and trying to measure its impact on outcome metrics, such as sales or new customer acquisition, certain organizations are taking an alternative approach modeled by their data science teams: focusing on driving current user activity and then building new products based on that data.

The future of experimentation is born from an innovative mindset, but also requires critical thinking when it comes to planning experiments. Before a test goes live, we must consider the hypothesis that we’re testing, the outcome metric or leading indicators, how long we’re going to run it, and make sure that we have measurement capabilities in place. In short, the art of experimentation is transitioning from a marketing perspective to a science-based approach.

Why you need to level up your experiment design today.

While it may be a widespread challenge to shift the mindset around data and analyst teams from being cost centers to profit-enablement centers, the slowing economy might have a silver lining: people taking the experimentation process a lot more seriously. 

We know that with proper research and design, an experiment can achieve a great ROI, and even prevent major losses when it comes to investing in new developments. However, it can be difficult to convince leadership of the impact, efficiency and potential growth derived from experimentation.

Given the current market, demonstrating the value of experimentation is more important than ever, as product and marketing teams can no longer afford to make mistakes by rolling out tests without validating them first, explains Bhavik. 

Rather than watching your experiment fail slowly over time, it’s important to have a measurement framework in place: a baseline, a solid hypothesis, and a proper experiment design. With experimentation communities making up a small fraction of the overall industry, not everyone appreciates the ability to validate, quantify, and measure the impact of their work. However, Bhavik hopes this will evolve in the near future.

Disruptive testing: high risk, high reward.

On the spectrum of innovation, at the very lowest end is incremental innovation, such as small tests and continuous improvements, which hits a local maximum very quickly. In order to break through that local maximum, you need to try something bolder: disruptive innovation. 

When an organization is looking for bigger results, they need to switch out statistically significant micro-optimizations for experiments that will bring statistically meaningful results.

Once you’ve achieved better baseline practices – hypothesis writing, experiment design, and planning – it’s time to start making bigger bets and find other ways to measure it.

Now that you’re performing statistically meaningful tests, the final step in the evolution of experimentation is reverse-engineering solutions by identifying the right problem to solve. Bhavik explains that while we often focus on prioritizing solutions, by implementing various frameworks to estimate their reach and impact, we ought to take a step back and ask ourselves if we’re solving the right problem.

With a framework based on quality data and research, we can identify the right problem and then work on the solution, “because the best solution for the wrong problem isn’t going to have any impact,” says Bhavik.

What else can you learn from our conversation with Bhavik Patel?

  • The common drivers of experimentation and the importance of setting realistic expectations with expert guidance.
  • The role of A/B testing platforms in the future of experimentation: technology and interconnectivity.
  • The potential use of AI in experimentation: building, designing, analyzing, and reporting experiments, as well as predicting test outcomes. 
  • The future of pricing: will AI enable dynamic pricing based on the customer’s behavior?

About Bhavik Patel

A seasoned CRO expert, Bhavik Patel is the Product Analytics & Experimentation Director at Lean Convert, leading a team of optimization specialists to create better online experiences for customers through experimentation, personalization, research, data, and analytics.
In parallel, Bhavik is the founder of CRAP Talks, an acronym that stands for Conversion Rate, Analytics and Product, which unites CRO enthusiasts with thought leaders in the field through inspiring meetup events – where members share industry knowledge and ideas in an open-minded community.

About 1,000 Experiments Club

The 1,000 Experiments Club is an AB Tasty-produced podcast hosted by John Hughes, Head of Marketing at AB Tasty. Join John as he sits down with the experts in the world of experimentation to uncover their insights on what it takes to build and run successful experimentation programs.


Mutually Exclusive Experiments: Preventing the Interaction Effect

What is the interaction effect?

If you’re running multiple experiments at the same time, you may find their interpretation to be more difficult because you’re not sure which variation caused the observed effect. Worse still, you may fear that the combination of multiple variations could lead to a bad user experience.

It’s easy to imagine a negative cumulative effect of two visual variations. For example, if one variation changes the background color, and another modifies the font color, it may lead to illegibility. While this result seems quite obvious, there may be other negative combinations that are harder to spot.

Imagine launching an experiment that offers a price reduction for loyal customers, whilst in parallel running another that aims to test a promotion on a given product. This may seem like a non-issue until you realize that there’s a general rule applied to all visitors, which prohibits cumulative price reductions – leading to a glitch in the purchase process. When the visitor expects two promotional offers but only receives one, they may feel frustrated, which could negatively impact their behavior.

What is the level of risk?

With the previous examples in mind, you may think that such issues could be easily avoided. But it’s not that simple. Building several experiments on the same page becomes trickier when you consider code interaction, as well as interactions across different pages. So, if you’re interested in running 10 experiments simultaneously, you may need to plan ahead.

A simple solution would be to run these tests one after the other. However, this strategy is very time consuming, as your typical experiment requires two weeks to be performed properly in order to sample each day of the week twice.

It’s not uncommon for a large company to have 10 experiments in the pipeline and running them sequentially will take at least 20 weeks. A better solution would be to handle the traffic allocated to each test in a way that renders the experiments mutually exclusive.

This may sound similar to a multivariate test (MVT), except the goal of an MVT is almost the opposite: to find the best interaction between unitary variations.

Let’s say you want to explore the effect of two variation ideas: text and background color. The MVT will compose all combinations of the two and expose them simultaneously to isolated chunks of the traffic. The isolation part sounds promising, but the “all combinations” part is exactly what we’re trying to avoid: typically, a combination where the text and the background end up the same color will occur. So an MVT is not the solution here.

Instead, we need a specific feature: A Mutually Exclusive Experiment.

What is a Mutually Exclusive Experiment (M2E)?

AB Tasty’s Mutually Exclusive Experiment (M2E) feature enacts an allocation rule that blocks visitors from entering selected experiments depending on the previous experiments already displayed. The goal is to ensure that no interaction effect can occur when a risk is identified.

How and when should we use Mutually Exclusive Experiments?

We don’t recommend setting up all experiments to be mutually exclusive because it reduces the number of visitors for each experiment. This means it will take longer to achieve significant results and the detection power may be less effective.

The best process is to identify the different kinds of interactions you may have and compile them in a list. If we continue with the cumulative promotion example from earlier, we could create two M2E lists: one for user interface experiments and another for customer loyalty programs. This strategy will avoid negative interactions between experiments that are likely to overlap, but doesn’t waste traffic on hypothetical interactions that don’t actually exist between the two lists.
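M2E is a built-in AB Tasty feature, so you don’t implement it yourself. Purely to illustrate the underlying idea, here is a hypothetical sketch of mutually exclusive bucketing, where each visitor is deterministically assigned to exactly one experiment within a given list (all names are made up):

```python
import hashlib

def assign_exclusive_experiment(visitor_id, list_name, experiments):
    """Deterministically map a visitor to exactly one experiment within a mutually exclusive list."""
    digest = hashlib.sha256(f"{list_name}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(experiments)
    return experiments[bucket]

# Two separate M2E lists, as in the example above (names are hypothetical)
ui_experiments = ["new_background_color", "new_font_color"]
loyalty_experiments = ["loyalty_discount", "product_promotion"]

visitor = "visitor-12345"
print(assign_exclusive_experiment(visitor, "ui", ui_experiments))
print(assign_exclusive_experiment(visitor, "loyalty", loyalty_experiments))
```

Salting the hash with the list name keeps the two lists independent, so a visitor sees at most one experiment per list but can still be exposed to experiments from different lists, which is how traffic is preserved.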

What about data quality?

With the help of an M2E, we have prevented any functional issues that may arise due to interactions, but you might still have concerns that the data could be compromised by subtle interactions between tests.

Would an upstream winning experiment induce false discovery on downstream experiments? Alternatively, would a bad upstream experiment make you miss an otherwise downstream winning experiment? Here are some points to keep in mind:

  • Remember that roughly eight tests out of 10 are neutral (show no effect), so most of the time you can’t expect an interaction effect – if no effect exists in the first place.
  • In the case where an upstream test has an effect, the affected visitors will still be randomly assigned to the downstream variations. This evens out the effect, allowing the downstream experiment to correctly measure its potential lift. It’s interesting to note that the average conversion rate following an impactful upstream test will be different, but this does not prevent the downstream experiment from correctly measuring its own impact.
  • Remember that the statistical test exists precisely to account for any drift in the random split process. The drift we’re referring to here is the possibility that more of the visitors impacted by the upstream test end up in one variation of the downstream test, creating the illusion of an effect. The gain probability estimate and the confidence interval around the measured effect are there to remind you that there is some randomness in the process. In fact, the upstream test is just one example among a long list of possible interfering events – such as visitors using different computers, different connection quality, etc.

All of these theoretical explanations are supported by an empirical study from the Microsoft Experiment Platform team. This study reviewed hundreds of tests on millions of visitors and saw no significant difference between effects measured on visitors that saw just one test and visitors that saw an additional upstream test.
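To see this point in action, here is a small illustrative simulation with made-up numbers: the upstream test shifts the overall conversion level of half the visitors, but because those visitors are still split randomly downstream, the downstream lift is measured essentially correctly.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000_000                                      # simulated visitors

# Upstream test: half of the visitors see a variation that genuinely lifts their conversion rate
upstream_b = rng.random(n) < 0.5
base_rate = np.where(upstream_b, 0.055, 0.050)     # +0.5 point upstream effect

# Downstream test: independent random split, true lift of +0.3 point for its variation B
downstream_b = rng.random(n) < 0.5
rate = base_rate + np.where(downstream_b, 0.003, 0.0)

converted = rng.random(n) < rate
measured_lift = converted[downstream_b].mean() - converted[~downstream_b].mean()
print(f"Measured downstream lift: {measured_lift:+.4f} (true lift: +0.0030)")
```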

Conclusion

While experiment interaction is possible in a specific context, there are preventative measures that you may take to avoid functional loss. The most efficient solution is the Mutually Exclusive Experiment, allowing you to eliminate the functional risks of simultaneous experiments, make the most of your traffic and expedite your experimentation process.

References:

https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/a-b-interactions-a-call-to-relax/

 


A/A Testing: What is it and When Should You Use it?

A/A tests are a legacy from the early days of A/B testing. An A/A test is essentially an A/B test in which two identical versions of a web page or element are tested against each other: variation B is just a copy of A, without any modification.

One of the goals of A/A tests is to check the effectiveness and accuracy of testing tools. The expectation is that, if no winner is declared, the test is a success, whereas detecting a statistically significant difference would mean a failure, indicating a problem somewhere in the pipeline.

But it’s not always that simple. We’ll dive into this type of testing and the statistics and tech behind the scenes. We’ll look at why a failed A/A test is not proof of a pipeline failure, and why a successful A/A test isn’t a foolproof sanity check.

What is tested during an A/A test?

Why is there so much buzz around A/A testing? An A/A test can be a way to verify two components of an experimentation platform: 

  1. The statistical tool: It may be possible that the formulas chosen don’t fit the real nature of the data, or may contain bugs.
  2. The traffic allocation: The split between variations must be random and respect the proportions it has been given. When a problem occurs, we talk about Sample Ratio Mismatch (SRM); that is, the observed traffic does not match the allocation setting. This means that the split has some bias impacting the analysis quality.
Let’s explore this in more detail.

Statistical tool test

Let’s talk about a “failed” A/A test

The most common idea behind A/A tests is that the statistical tool should yield no significant difference. The A/A test is considered “failed” if it detects a difference in performance.

However, to understand how weak this conclusion is, you need to understand how statistical tests work. Let’s say that your significance threshold is 95%. This means that there is still a 5% chance that the difference you see is a statistical fluke and no real difference exists between the variations. So even with a perfectly working statistical tool, you still have one chance in twenty (1/20=5%) that you will have a “failed” A/A test and you might start looking for a problem that may not exist.

With that in mind, a more acceptable statistical procedure would be to perform 20 A/A tests and expect 19 of them to yield no statistical difference and one to detect a significant difference. Even then, if two or more tests show significant results, it may be a sign of a real problem. In other words, having one successful A/A test is not enough to validate a statistical tool. To validate it fully, you need to show that the tests are successful about 95% of the time (19/20).

Therefore, a meaningful approach would be to perform hundreds of A/A tests and expect ~5% of them to “fail”. It’s worth noting that if they “fail” noticeably less than 5% of the time, that’s also a problem, possibly indicating that the statistical test simply says “no” too often, leading to a strategy that never detects any winning variation. So one “failed” A/A test doesn’t tell you much in reality.
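
As a rough illustration of that ~5% figure, here is a sketch (assuming a simple two-sided z-test on proportions and made-up traffic numbers) that simulates many A/A tests and counts how often they “fail”:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_tests, visitors, true_rate = 1_000, 20_000, 0.05
failed = 0

for _ in range(n_tests):
    # Two identical variations: the same true conversion rate for A and B.
    conv_a = rng.binomial(visitors, true_rate)
    conv_b = rng.binomial(visitors, true_rate)
    p_a, p_b = conv_a / visitors, conv_b / visitors
    pooled = (conv_a + conv_b) / (2 * visitors)
    se = np.sqrt(pooled * (1 - pooled) * 2 / visitors)
    p_value = 2 * (1 - norm.cdf(abs(p_b - p_a) / se))
    failed += p_value < 0.05  # a "failed" A/A test

print(f"'failed' A/A tests: {failed / n_tests:.1%}")  # expect roughly 5%
```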

What if it’s a “successful A/A test”? 

A “successful” A/A test (yielding no difference) is not proof that everything is working as it should. To understand why, you need to check another important tool in an A/B test: the sample size calculator.

In a typical sample size calculator, we see that with a 5% conversion rate, you need around 30k visitors per variation to reach the 95% significance level for a variation with a 10% MDE (Minimal Detectable Effect).

But in the context of an A/A test, the Minimal Detectable Effect (MDE) is in fact 0%. Using the same formula, we’ll plug 0% as MDE.

At this point, you will discover that the form does not let you put 0% here, so let’s try a very small number instead. In this case, you get almost 300M visitors.

In fact, to be confident that there is exactly no difference between two variations, you need an infinite number of visitors, which is why the form does not let you set 0% as MDE.
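
To see how quickly the required traffic grows, here is a sketch using the classic two-proportion sample-size approximation (a textbook formula with made-up inputs, not necessarily the exact one behind any particular calculator): as the MDE shrinks toward 0%, the required sample explodes.

```python
from scipy.stats import norm

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Classic two-proportion sample-size approximation (two-sided test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

for mde in (0.10, 0.01, 0.001):
    n = sample_size_per_variation(0.05, mde)
    print(f"MDE {mde:.1%}: ~{n:,.0f} visitors per variation")
# Roughly 31k, 3M and 300M visitors: the required traffic grows with 1/MDE²,
# so an MDE of exactly 0% would require infinite traffic.
```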

Therefore, a successful A/A test only tells you that the difference between the two variations is smaller than a given number but not that the two variations perform exactly the same.

This problem comes from another principle in statistical tests: the power. 

The power of a test is the chance that you discover a difference if there is any. In the context of an A/A test, this refers to the chance you discover a statistically significant discrepancy between the two variations’ performance. 

The more power, the more chance you will discover a difference. To raise the power of a test you simply raise the number of visitors.

Sample size calculators usually assume that tests are powered at 80%. This means that even if a difference in performance exists between the variations, 20% of the time you will miss it. So one “successful” A/A test (yielding no statistical difference) may just be an occurrence of this 20%. In other words, having just one successful A/A test doesn’t ensure the efficiency of your experimentation tool: you may have a problem and there is a 20% chance that you missed it. Additionally, reaching 100% power would require an infinite number of visitors, making it impractical.
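
Using the same kind of approximation (and made-up numbers), here is a quick sketch of the power side of the equation: with around 31k visitors per variation, a real 10% relative lift on a 5% baseline is detected only about 80% of the time.

```python
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_variation, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation) ** 0.5
    z_alpha = norm.ppf(1 - alpha / 2)
    z = abs(p2 - p1) / se
    return norm.cdf(z - z_alpha) + norm.cdf(-z - z_alpha)

# A real 5% -> 5.5% lift is missed roughly 20% of the time with this traffic.
print(f"power: {power_two_proportions(0.05, 0.055, 31_000):.0%}")
```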

How do we make sure we can trust the statistical tool then? If you are using a platform that is used by thousands of other customers, chances are that the problem would have already been discovered. 

Because statistical software does not change very often and is not affected by the variation content (whereas the traffic allocation might be, as we will see later), the best option is to trust your provider, or to double-check the results with an independent provider. You can find a lot of independent calculators on the web. They only need the number of visitors and the number of conversions for each variation to produce a result, which makes the check quick to run.
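
Such an independent double-check can be as simple as a chi-square test on the raw counts, for example (the numbers below are hypothetical):

```python
from scipy.stats import chi2_contingency

# Hypothetical raw counts exported from your testing tool.
visitors_a, conversions_a = 20_000, 1_020
visitors_b, conversions_b = 20_000, 1_105

table = [
    [conversions_a, visitors_a - conversions_a],
    [conversions_b, visitors_b - conversions_b],
]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"p-value: {p_value:.3f}")  # compare with what your platform reports
```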

Traffic allocation test

In this part, we only focus on traffic, not conversions. 

The question is: does the splitting operation work as it should? We call this kind of failure an SRM, or Sample Ratio Mismatch. You may ask yourself how a simple random choice could fail. In fact, the failure happens either before or after the random choice itself.

The following demonstrates two examples where that can happen:

  • The variation contains a bug that may crash some browsers. In this case, the corresponding variation will lose visitors. The bug might depend on the browser, and then you will end up with bias in your data.
  • If the variation gives a discount coupon (or any other advantage), and some users find a way to force their browser to run the variation (to get the coupon), then you will have an excess of visitors for that variation that is not due to random chance, which results in biased data.


It’s hard to detect with the naked eye because the allocation is random, so you never get sharp numbers. 

For instance, a 50/50 allocation never precisely splits the traffic in groups with the exact same size. As a result, we would need statistical tools to check if the split observed corresponds with the desired allocation. 

SRM tests exist. They work more or less like an A/B test, except that the SRM formula indicates whether there is a difference between the desired allocation and what really happened. If an SRM is detected, there is a good chance that this difference is not due to pure randomness. This means that some data was lost or a bias occurred during the experiment, undermining trust in future (real) experiments.
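
As a sketch, an SRM check on a supposed 50/50 split boils down to a chi-square goodness-of-fit test on the visitor counts (hypothetical numbers below):

```python
from scipy.stats import chisquare

# Hypothetical visitor counts for a test configured as a 50/50 split.
observed = [50_950, 49_050]
total = sum(observed)
expected = [total * 0.5, total * 0.5]

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.01:
    print(f"possible SRM (p = {p_value:.2e}): investigate before trusting the results")
else:
    print(f"no SRM detected (p = {p_value:.3f})")
```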

On the one hand, detecting an SRM during an A/A test sounds like a good idea. On the other hand, if you think about it operationally, it might not be that useful because the chance of an SRM is low.

Even if some reports say that SRMs are more frequent than you may think, most of the time they happen on complex tests. In that sense, checking for an SRM within an A/A test will not help you prevent one on a more complex experiment later.

If you find a Sample Ratio Mismatch on a real experiment or in an A/A test, the actions remain the same: find the cause, fix it, and restart the experiment. So why waste time and traffic on an A/A test that will give you no extra information? A real experiment would have given you real information if it worked fine on the first try, and if a problem does occur, you would detect it even in a real experiment, since the SRM check only considers traffic and not conversions.

A/A tests are also unnecessary since most trustworthy A/B testing platforms (like AB Tasty) do SRM checks on an automated basis. So if an SRM occurs, you will be notified anyway. 

So where does this “habit” of practicing A/A tests come from?

Over the years, it’s something that engineers building A/B testing platforms have done. It makes sense in this case because they can run a lot of automated experiments, and even simulate users if they don’t have enough at hand, performing a sound statistical approach to A/A tests. 

They have reasons to doubt the platform in the works and they have the programming skills to automatically create hundreds of A/A tests to test it properly. Since these people can be seen as pioneers, their voice on the web is loud when they explain what an A/A test is and why it’s important (from an engineering perspective).

However, for a platform user/customer, the context is different: they’ve paid for a ready-to-use and trusted platform and want to start a real experiment as soon as possible to get a return on investment. Therefore, it makes little sense to waste time and traffic on an A/A test that won’t provide any valuable information.

Why sometimes it might be better to skip A/A tests

We can conclude that a failed A/A test is not a problem and that a successful one is not proof of sanity.

In order to gain valuable insights from A/A tests, you would need to perform hundreds of them with an infinite number of visitors. Moreover, an efficient platform like AB Tasty does the corresponding checks for you.

That’s why, unless you are developing your own A/B testing platform, running an A/A test may not give you the insights you’re looking for. A/A tests require a considerable amount of time and traffic that could otherwise be used to conduct A/B tests that could give you valuable insights on how to optimize your user experience and increase conversions. 

When it makes sense to run an A/A test

It may seem that running A/A tests may not be the right call after all. However, there may be a couple of reasons why it might still be useful to perform A/A tests. 

The first is when you want to check the data you are collecting and compare it to data already collected with other analytics tools. Keep in mind that you will never get exactly the same results, because metric definitions vary between tools. Nonetheless, this comparison is an important onboarding step to ensure that the data is properly collected.

The other reason to perform an A/A test is to know the reference value for your main metrics so you can establish a baseline to analyze your future campaigns more accurately. For example, what is your base conversion rate and/or bounce rate? Which of these metrics need to be improved and are, therefore, a good candidate for your first real A/B test?

This is why AB Tasty has a feature that helps users build A/A tests dedicated to reaching these goals while avoiding the pitfalls of “old school” methods that are not useful anymore. With our new A/A test feature, A/A test data is collected in one variant (not two); let’s call this an “A test”.

This allows you to have a more accurate estimation of these important metrics as the more data you have, the more accurate the measurements are. Meanwhile, in a classic A/A test, data is collected in two different variants which provides less accurate estimates since you have less data for each variant.

With this approach, AB Tasty enables users to automatically set up A/A tests, which gives better insights than classic “handmade” A/A tests.

Article

8min read

10 Generative AI Ideas for Your Experimentation Roadmap

Artificial intelligence has been a recurring theme for decades. However, it’s no longer science fiction – it’s a reality.

Since OpenAI launched its own form of generative AI, ChatGPT, in November 2022, the world has yet to stop talking about its striking capabilities. It’s particularly fascinating to see just how easy it is to get results from this bot, which is built on deep-learning algorithms for natural language processing.

Even Google quickly followed by launching a new and experimental project, Gemini, to revolutionize its own Search. By harnessing the power of generative AI and the capacity of large language models, Google is seeking to take its search process to the next level.

Given the rapid growth of this technological advancement over the past few months, it’s time that we talk about generative AI in the context of A/B testing and experimentation.

Whether you’re curious about how AI can impact your experiments or are ready for inspiration, we’ll discuss some of our ideas around using AI for A/B testing, personalization, and conversion rate optimization.

What is generative AI?

Generative AI is a type of artificial intelligence that is not limited to a fixed set of pre-programmed outputs, which allows it to generate new content (think ChatGPT). Instead of simply retrieving answers from a specific, pre-existing dataset, generative AI learns by indexing extensive data, focusing on patterns and using deep learning techniques and neural networks to create human-like content based on its learnings.

The way algorithms capture ideas is similar to how humans gather inspiration from previous experiences to create something unique. Based on the large amounts of data used to craft generative AI’s learning abilities, it’s capable of outputting high-quality responses that are similar to what a human would create.

However, some concerns need to be addressed:

  • Biased information: Artificial intelligence is only as good as the datasets used to train it. Therefore if the data used to train it has biases, it may create “ideas” that are equally biased or flawed.
  • Spreading misinformation: There are many concerns about the ethics of generative AI and sharing information directly from it. It’s best practice to fact-check any content written by AI to avoid putting out false or misleading information.
  • Content ownership: Since content generated with AI is not created by a human, can you ethically claim it as your own idea? In a similar sense, the same idea could potentially be generated elsewhere using a similar prompt. Copyright and ownership are then called into question.
  • Data and privacy: Data privacy is always a top-of-mind concern. With the new capabilities of artificial intelligence, data handling becomes even more challenging. It’s always best practice to avoid using sensitive information with any form of generative AI.

By keeping these limitations in mind, generative AI has the potential to streamline processes and revolutionize the way we work – just as technology has always done in the past.

10 generative AI uses for A/B testing

In the A/B testing world, we are very interested in how one can harness these technological breakthroughs for experimentation. We are brainstorming a few approaches to re-imagine the process of revolutionizing digital customer experiences to ultimately save time and resources.

Just like everyone else, we started to wonder how generative AI could impact the world of experimentation and our customers. Here are some ideas, some of them concrete and some more abstract, as to how artificial intelligence could help our industry:

DISCLAIMER: Before uploading information into any AI platform, ensure that you understand their privacy and security practices. While AI models strive to maintain a privacy standard, there’s always the risk of data breaches. Always protect your confidential information. 

1. Homepage optimization

Your homepage is likely the first thing your visitors will see, so optimization is key to staying ahead of your competitors. If you want a quick comparison of the content on your homepage versus your competitors’, you can feed this information into generative AI to give it a basis for understanding. Once the AI is loaded with information about your competitors, you can ask for a list of best practices to inform new tests for your own website.

2.  Analyze experimentation results

Reporting and analyzing are crucial to progressing on your experimentation roadmap, but it’s also time-consuming. By collecting a summary of testing logs, generative AI can help highlight important findings, summarize your results, and potentially even suggest future steps. Ideally, you can feed your A/B test hypothesis as well as the results to show your thought process and organization. After it recognizes this specific thought process and desired results, it could aid in generating new test hypotheses or suggestions.

3. Recommend optimization barriers

Generative AI can help you prioritize your efforts and identify the most impactful barriers to your conversion rate. Uploading your nonsensitive website performance data gathered from your analytics platforms can give AI the insight it needs into your performance. Whether it suggests that you update your title tags or compress images on your homepage, AI can quickly spot where you have the biggest drop-offs to suggest areas for optimization.

4. Client reviews

User feedback is your own treasure trove of information for optimization. One of the great benefits of AI that we already see is that it can understand large amounts of data quickly and summarize it. By uploading client reviews, surveys and other consumer feedback into the database, generative AI can assist you in creating detailed summaries of your users’ pain points, preferences and levels of satisfaction. The more detailed your reviews – the better the analysis will be.

5. Chatbots

Chatbots are a popular way to communicate with website visitors. As generative AI is a large language model, it can quickly generate conversational scripts, prompts and responses to reduce your brainstorming time. You can also use AI to filter and analyze conversations that your chatbot is already having to determine if there are gaps in the conversation or ways to enhance its interaction with customers.

6. Translation

Language barriers can limit a brand that has a presence in multiple regions. Whether you need translations for your chatbot conversations, CTAs or longer form copy, generative AI can provide you with translations in real time to save you time and make your content accessible in every region your brand reaches.

7. Google Adwords

Speed up brainstorming sessions by using generative AI to experiment with different copy variations. Based on the prompts you provide, AI can provide you with a series of ideas for targeting keywords and creating copy with a particular tone of voice to use with Google Adwords. Caution: be sure to double-check all keywords proposed to verify their intent. 

8. Personalization

Personalized content can be scaled at speed by leveraging artificial intelligence to produce variations of the same messages. By customizing your copy, recommendations, product suggestions and other messages based on past user interactions and consumer demographics, you can significantly boost your digital consumer engagement.

9. Product Descriptions

Finding the best wording to describe why your product is worth purchasing may be a challenge. With generative AI, you can get more ambitious with your product descriptions by testing out different variations of copy to see which version is the most promising for your visitors.

10. Predict User Behavior

Based on historical data from your user behavior, generative AI can predict behavior that can help you to anticipate your next A/B test. Tailoring your tests according to patterns and trends in user interaction can help you conduct better experiments. It’s important to note that predictions will be limited to patterns interpreted by past customer data collected and uploaded. Using generative AI is better when it’s used as a tool to guide you in your decision-making process rather than to be the deciding force alone.

The extensive use of artificial intelligence is a new and fast-evolving subject in the tech world. If you want to leverage it in the future, you need to start familiarizing yourself with its capabilities.

Keep in mind that it’s important to verify the facts and information AI generates just as you carefully verify data before you upload. Using generative AI in conjunction with your internal experts and team resources can assist in improving ideation and efficiency. However, the quality of the output from generative AI is only as good as what you put in.

Is generative AI a source of competitive advantage in A/B testing?

The great news is that this technology is accessible to everyone – from big industry leaders like Google to start-ups with a limited budget. However, the not-so-great news is that this is available to everyone. In other words, generative AI is not necessarily a source of competitive advantage.

Technology existing by itself does not create more value for a business. Rather, it’s the people driving the technology who are creating value by leveraging it in combination with their own industry-specific knowledge, past experiences, data collection and interpretation capabilities and understanding of customer needs and pain points.

While we aren’t here to say that generative AI is a replacement for human-generated ideas, this technology can definitely be used to complement and amplify your already-existing skills.

Leveraging generative AI in A/B testing

From education to copywriting or coding – all industries are starting to see the impact that these new software developments will have. Leveraging “large language models” is becoming increasingly popular as these algorithms can generate ideas, summarize long forms of text, provide insights and even translate in real-time.

Proper experimentation and A/B testing are at the core of engaging your audience, however, these practices can take a lot of time and resources to accomplish successfully. If generative AI can offer you ways to save time and streamline your processes, it might be time to use it as your not-so-secret weapon. In today’s competitive digital environment, continually enhancing your online presence should be at the top of your mind.

Want to start optimizing your website? AB Tasty is the best-in-class experience optimization platform that empowers you to create a richer digital experience – fast. From experimentation to personalization, this solution can help you activate and engage your audience to boost your conversions.

Article

17min read

AB Tasty’s JavaScript Tag Performance and Report Analysis

Hello! I am Léo, Product Manager at AB Tasty. I’m in charge, among several things, of our JavaScript tag that is currently running on thousands of websites for our clients. As you can guess, my roadmap is full of topics around data collection, privacy and… performance.

In today’s article, we are going to talk about JavaScript tag performance, open-data monitoring and competition. Let’s go!

Performance investigation

As performance has become a big and hot topic over the past few years, mainly thanks to Google’s initiative to deploy their Core Web Vitals, my team and I have focused a lot on it. We’ve changed a lot of things, improved many parts of our tag and reached excellent milestones. Many of our users have expressed their satisfaction with these improvements. I have already written a (long) series of blog articles about this here. Sorry though, it’s only in French.

From time to time, we get tickled by competitors about a specific report around performance that seems to show us as underperforming based on some metrics. Some competitors claim that they are up to 4 times faster than us! And that’s true, I mean, that’s what the report shows.

You can easily imagine how devastating this can be for the image of my company and how hard it could be for our sales team when a client draws this card. This is especially demoralizing for me and my team after all the work we’ve pushed through this topic during the last few years.

Though it was the first feeling I got when seeing this report, I know for a fact that our performance is excellent. We’ve reached tremendous improvements after the release of several projects and optimizations. Today all the benchmarks and audits I run over our customers’ websites show very good performance and a small impact on the famous Core Web Vitals.

Also, it’s very rare that a customer complains about our performance. It can happen, that’s for sure, but most of the time all their doubts disappear after a quick chat, some explanations and hints about optimization best practices.

But that report is still there, right? So maybe I’m missing something. Maybe I’m not looking at the correct metric. Maybe I’ve only audited customers where everything is good, but there’s a huge army of customers that don’t complain that our tag is drastically slowing their website down.

One easy way to tackle that would be to say that we are doing more with our tag than our competitors do.

Is CRO the same as analytics? 

On the report (I promise I will talk about it in depth below), we are grouped in the Analytics category. However, Conversion Rate Optimization isn’t the same as analytics. An analytics tool only collects data, while we activate campaigns, run personalizations, implement widgets, add pop-ins and more. In this sense, our impact will be higher.

Let’s talk about our competitors: even though we have the best solution out there, our competitors do more or less the same things as us, using the same techniques with the same limits and issues. Therefore, it’s legitimate to compare us on the same metrics. It might be true that we do a bit more than they do, but in the end, this shouldn’t explain a 4x difference in performance.

Back then, and before digging into the details, I took the results of the report with humility. My ambition was to crawl the data, analyze websites where their tag is running and try to find what they do better than us. We call that reverse engineering, and I find it healthy, as it would help everyone get a faster website.

My engagement with my management was to find where we had a performance leak and solve it to be able to decrease our average execution time and get closer to our competitors.

But first, I needed to analyze the data. And, wow, I wasn’t prepared for that.

The report

The report is a dataset generated monthly by The HTTP Archive. Here is a quote from their About page:

“Successful societies and institutions recognize the need to record their history – this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996, Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result he founded the Internet Archive which collects and permanently stores the Web’s digitized content.”

“In addition to the content of web pages, it’s important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.”

Every month, they run a Lighthouse audit on millions of websites and generate a dataset containing the raw results.

As it is open-source and legit, it can be used by anyone to draw data visualization and ease access to this type of data.

That’s what Patrick Hulce, one of the creators of Google Lighthouse, has done. Through his website, whose code is available on GitHub, he provides a nice visualization of this huge dataset and allows anyone to dig into the details through several categories such as Analytics, Ads, Social Media and more. As I said, you’ll find the CRO tools in the Analytics category.

The website is fully open-source. The methodology is known and can be accessed.

So, what’s wrong with the report?

Well, there’s nothing technically wrong with it. We could find it disappointing that the dataset isn’t automatically updated every month, but the repository is open-source, so anyone motivated could do it.

However, this is only displaying the data in a fancy manner and not providing any insights or deep analysis of it. Any flaw or inconsistency will remain hidden and it could lead to a situation where a third party is seen as having bad performance compared to others when it is not necessarily the case.

One issue though, not related to the report itself, is the flaw that an average can carry with it. That’s something we are all aware of but tend to forget. If you take 10 people, where 9 of them earn €800 a month but one earns €12 million a month, the average salary is about €1.2 million per month. Statistically correct, but it sounds a bit wrong, doesn’t it? More on that in a minute.

Knowing that, it was time to get my hands a bit dirty. With my team, we downloaded the full dataset from February 2023 to run our own audit and understand where we had performance leaks.

Note that downloading the full dataset is something we have been doing regularly for about one and a half years to monitor our trend. However, this time I decided to dig into the February 2023 report in particular.

The analysis

On this dataset, we could find the full list of websites running AB Tasty that have been crawled and the impact our tag had on them. To be more accurate, we have the exact measured execution time of our tag, in milliseconds.

This is what we extracted. The pixellated column is the website URL. The last column is the execution time in milliseconds.

With the raw data, we were able to calculate a lot of useful metrics.

Keep in mind that I am not a mathematician or anything close to a statistics expert. My methodology might sound odd, but it’s adequate for this analysis.

  • Average execution time

This is the first metric I get — the raw average for all the websites. That’s probably very close, if not equal, to what is used by the thirdpartyweb.today website. We already saw the downside of having an average, however, it’s still an interesting value to monitor.

  • Mean higher half and mean lower half

Then, I split the dataset in half. If I have 2,000 rows, I create two groups of 1,000 rows: the “higher” one and the “lower” one. It gives me a view of the websites where we perform the worst compared to those where we perform the best. Then, I calculate the average of each half.

  • The difference between the two halves

The difference between the two halves is important as it shows the disparity within the dataset. The closer the two are, the fewer extreme values we have.

  • The number of websites with a value above 6k ms

It’s just an internal metric we follow to give us a mid-term goal of having 0 websites above this value.

  • The evolution since the last dataset

I compute the evolution between the previous dataset I have and the current one. It helps me see whether we are improving in general, as well as how many websites are leaving or entering the chart. A rough sketch of how these metrics can be computed is shown below.
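
For the curious, here is roughly how these metrics can be computed from the raw export with pandas (the file name and column names below are placeholders, not the actual HTTP Archive schema):

```python
import pandas as pd

# Placeholder file and column names for the monthly extract.
df = pd.read_csv("httparchive_abtasty_2023_02.csv")
times = df["execution_time_ms"].sort_values(ascending=False)

average = times.mean()
half = len(times) // 2
mean_higher_half = times.iloc[:half].mean()  # worst-performing websites
mean_lower_half = times.iloc[half:].mean()   # best-performing websites
above_6k = (times > 6_000).sum()

print(f"average: {average:.0f} ms")
print(f"higher half: {mean_higher_half:.0f} ms, lower half: {mean_lower_half:.0f} ms")
print(f"websites above 6,000 ms: {above_6k}")
```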

The results

These are the results that we have:

Here are their corresponding graphs:

This is the evolution between October 2022 and February 2023:

Watch out: Logarithmic scale! Sorted by February 2023 execution time from left to right.

The figures say it all. But, if I can give a global conclusion, it’s that we made tremendous improvements in the first six months and then stalled a bit with finer adjustments (the famous Pareto 80/20 rule).

However, after the initial fall, two key figures are important.

First of all, the difference between the two halves is getting very close. This means that we don’t have a lot of potential performance leaks anymore (features that lead to an abnormal increase in the execution time). This is our first recent win.

Then, the evolution shows that in general, and except for the worst cases, it is steady or going down. Another recent win.

Digging into the details

What I have just shared is the raw results without having a look at the details of each row and each website that is being crawled.

However, as we say, the devil is in the details. Let’s dig in a bit.

Let’s focus on the websites where AB Tasty takes more than six seconds to execute.

Six seconds might sound like a lot (and it is), but don’t forget that the audit simulates a low-end CPU which is not representative of the average device. Instead, it shows the worst-case scenario.

In the February 2023 report, there are 33 of them, with an average execution time of 19,877 ms. I quickly identified that:

  • 27 of them are from the same AB Tasty customer
  • One of them is abtasty.com, and the total execution time of resources coming from *abtasty.com on this website is very high
  • Two others are also coming from one singular AB Tasty customer

In the end, we have only 5 customers on this list (but still 33 websites, don’t get me wrong).

Let’s now try to group up these two customers with duplicates to see the impact on the average. The customer with 27 duplicates also has websites that are below the 6k ms mark, but I’m going to ignore it for now (and to ease things up).

For each of the two customers with duplicates, I’m going to compute the average of all their duplicates. For the first one, the result is 21671 ms. For the second, the result is 14708 ms.

I’m also going to remove abtasty.com, which is not relevant.

With the new list, I went from 1223 ms for the full-list average to 1005 ms. I just improved our average by more than 200 ms!

Wait, what? But you’re just removing the worst websites. Obviously, you are getting better…

Yep, that’s true. That’s cheating for sure! But, the point of this whole article is to demonstrate that data doesn’t say it all.

Let’s talk first about what is happening with this customer that has 27 duplicates.

The same tag has been deployed on more than 50 very different websites! You might not be very familiar with AB Tasty, so let me explain why this is an issue.

You might have several websites which have the same layout (that’s often the case when you have different languages). It makes sense to have the same tag on these different domains to be able to deploy the same personalizations on all of them at once. That’s not the optimal way of doing it, but as of today, it’s the easiest way to do it with our tool.

However, if your websites are all different, there is absolutely no point in doing that. You are going to create a lot of campaigns (in this case, hundreds!) that will almost never be executed on the website (because it’s not the correct domain) but are still at least partially included in the tag. So our tag is going to spend its time checking hundreds of campaigns that have no chance to execute as the URL is rarely going to be valid.

Though we are working on a way to block this behavior (as we have alternatives and better options), it will take months before it disappears from the report.

Note: If you start using AB Tasty, you will not be advised to do that. Furthermore, the performance of your tag will be far better than that.

Again, I didn’t take the time to group all the duplicated domains as it would be pointless; the goal was to demonstrate that it is easy to show better performance if you exclude anomalies that are not representative. We can imagine that we would improve by more than 200 ms by keeping only one domain per customer.

I took the most obvious case, but a quick look at the rest of the dataset showed me some other examples.

The competitors’ figures

Knowing these facts and how our score might look worse than it is because of one single anomaly, I started looking into our competitors’ figures to see if they have the same type of issue.

I’m going to say it again: I’m not trying to say that we are better (or worse) than any of our competitors here, that’s not my point. I’m just trying to show you why statistics should be deeply analyzed to avoid any interpretation mistakes.

Let’s start by comparing AB Tasty’s figures for February 2023 with the same metrics for one of them.

Competitor's figures

In general, they look a bit better, right? A better average, and even the means for each half are better (and the lower half by a lot!).

However, between the two halves, the factor is huge: 24! Does it mean that depending on your usage, the impact of their tag might get multiplied by 24?

If I wanted to tease them a little bit, I would say that when testing the tag on your website, you might find excellent performance but when starting to use it intensely you might face serious performance drops.

But, that would be interpreting a very small part of what the data said.

Also, they have more than twice the number of websites that are above the 6k ms mark (again: this mark is an AB Tasty internal thing). And that is by keeping the duplicates in AB Tasty’s dataset that we discussed just before! They also have duplicates, but not as many as we do.

A first (and premature) conclusion is that they have more websites with a big impact on performance but at the same time, their impact is lower in general.

Now that I know that in our case we have several customers that have duplicates, I wanted to check if our competitors have the same. And this one does – big time.

Among the 2,537 websites that have been crawled, 40% of them belong to the same customer. This represents 1,016 subdomains of the same domain.

How does this impact their score?

Well, their customer wasn’t using the solution at the moment the data was collected (I made sure of it by visiting some of the subdomains). This means that the tag wasn’t doing anything at all. It was there, but inactive.

The average execution time of these 1,016 rows in the dataset is 59 ms! It also has a max value of 527 ms and a min value of 25 ms.

I don’t need to explain why this “anomaly” interestingly pulls down their average, right?

The 1,016 subdomains are not fake websites at all. I’m not implying that this competitor cheated on purpose to look better. I’m sure they didn’t. It is just a very convenient coincidence for them, whether they are aware of it or not.

To finish, let’s compare the average of our two datasets after removing these 1,016 subdomains.

AB Tasty is at 1223 ms (untouched list) when this competitor is now at… 1471 ms.

They went from 361 ms better to 248 ms worse. I told you that I can make the figures say whatever I want.

I would have a lot of other things to say about these datasets, but I didn’t run all the analysis that could have been done here. I already spent too much time on it, to be honest.

Hopefully, though, I’ve made my point of showing that the same dataset can be interpreted in a lot of different manners.

What can we conclude from all of this?

The first thing I want to say is: TEST IT.

Our solution is very easy to implement. You simply put the tag on your website and run an audit. To compare, you can put another tool’s tag on your website and run the same audit. Run it several times with the same conditions and compare. Is the second tool better on your website? Fine, then it will probably perform better for your specific case.

Does a random report on the web say that one solution is better than another? Alright, that’s one insight, but you should either crunch the data to challenge it or avoid paying too much attention to it. Just accepting the numbers as they are displayed (or worse, advertised) might make you miss a big part of the story.

Does AB Tasty have a bad performance?

No, it doesn’t. Most of our customers never complained about performance and some are very grateful for the latest improvements we’ve released on this topic.

So, some customers are complaining?

Yes. This is because sometimes AB Tasty can have a lower performance depending on your usage. But, we provide tools to help you optimize everything directly from our platform. We call this the Performance Center. It is a full section inside the platform and is dedicated to showing you which campaign is impacting your performance and what you can do to improve it. Just follow the guidelines and you’ll be good. It’s a very innovative and unique feature in the market, and we are very proud of it.

Though, I must admit that a few customers (only a few) have unrealistic expectations about performance. AB Tasty is a JS tag that is doing DOM manipulations, asynchronous checks, data collection and a lot of fancy stuff. Of course, it will impact your website more than a simple analytics tool will. The goal for you is to make sure that the effect of optimizing your conversions is higher than what it costs you in terms of performance. And it will be the same, whatever the CRO tool you are using, except if you use a server-side tool like Flagship by AB Tasty, for example.

I am convinced that we should aim towards a faster web. I am very concerned about my impact on the environment, and I’m trying to keep my devices as long as possible. My smartphone is 7 years old (and I’m currently switching to another one that is 10 years old) and my laptop isn’t very recent either. So, I know that a slow website can be a pain.

Final Remarks

Let me assure you that at AB Tasty we are fully committed to improving our performance because our customers expect us to, because I am personally motivated to do it, and because it is a very fun and interesting challenge for the team (and also because my management asks me to do it).

Also, kudos to HTTP Archive which does very important work in gathering all this data and especially sharing it with everyone. Kudos to Patrick Hulce who took the time to build a very interesting website that helps people have a visual representation of HTTP Archive’s data. Kudos to anyone that works to build a better, faster and more secure web, often for free and because that’s what they believe in.

Want to test our tool for yourself? AB Tasty is the complete platform for experimentation, content personalization, and AI-powered recommendations equipped with the tools you need to create a richer digital experience for your customers — fast. With embedded AI and automation, this platform can help you achieve omnichannel personalization and revolutionize your brand and product experiences.

Article

13min read

How to Deal with Low Traffic in CRO

If your website traffic numbers aren’t as high as you may hope for, that’s no reason to give up on your conversion rate optimization (CRO) goals.

By now you must have noticed that most CRO advice is tailored for high-traffic websites. Luckily, this doesn’t mean you can’t optimize your website even if you have lower traffic.

The truth is, any website can be optimized – you just need to tailor your optimization strategy to suit your unique situation.

In this article, we will cover how to adapt your CRO strategy when your traffic is limited.

CRO analogy

In order to make this article easier to understand, let’s start with an analogy. Imagine that instead of measuring two variants and picking a winner, we are measuring the performance of two boxers and placing bets on who will win the next 10 rounds.

So, how will we place our bet on who will win?

Imagine that boxer A and boxer B are both newbies that no one knows, and after the first round you have to make your choice. You will most likely place your bet on the boxer who won the first round. It might be risky if the winning margin is small, but you have no other information on which to base your decision.

Imagine now that boxer A is known to be a champion, and boxer B is a challenger you don’t know. Your knowledge about boxer A is what we would call a prior: information you have beforehand that influences your decision.

Based on the prior, you will be more likely to bet on boxer A as the champion for the next few rounds, even if boxer B wins the first round with a very small margin.

Furthermore, you will only choose boxer B as your predicted champion if they win the first round by a large margin. The stronger your prior, the larger the margin needs to be in order to convince you to change your bet.

Are you following? If so, the following paragraphs will be easy to grasp and you will understand where this “95% threshold” comes from.

Now, let’s move on to tips for optimizing your website with low traffic.

1. Solving the problem: “I never reach the 95% significance”

This is the most common complaint about CRO for websites with lower traffic and for lower traffic pages on bigger websites.

Before we dig into this most common problem, let’s start by answering the question, where does this 95% “golden rule” come from?

The origin of the 95% threshold

Let’s start our explanation with a very simple idea: What if optimization strategies were applied from day one? If two variants with no previous history were created at the same time, there would be no “original” version challenged by a newcomer.

This would force you to choose the best one from the beginning.

In this setting, any small difference in performance could be measured for decision-making. After a short test, you will choose the variant with the higher performance. It would not be good practice to pick the variant that had lower performance and furthermore, it would be foolish to wait for a 95% threshold to pick a winner.

But in practice, optimization is done well after the launch of a business.

So, in most real-life situations, there is a version A that already exists and a new challenger (version B) that is created.

If the new challenger, version B, comes along and the performance difference between the two variants is not significant, you will have no issues declaring version B “not a winner.”

Statistical tests are symmetric. So if we reverse the roles, swapping A and B in the statistical test will tell you that the original is not significantly better than the challenger. The “inconclusiveness” of the test is symmetric.

So, why do you set 100% of traffic toward the original at the end of an inconclusive test, implicitly declaring A as a winner? Because you have three priors:

  1. Version A was the first choice. This choice was made by the initial creator of the page.
  2. Version A has already been implemented and technically trusted. Version B is typically a mockup.
  3. Version A has a lot of data to prove its value, whereas B is a challenger with limited data that is only collected during the test period.

Points 1 & 2 are the bases of a CRO strategy, so you will need to go beyond these two priors. Point 3 explains that version A has more data to back its performance. This explains why you trust version A more than version B: version A has data.

Now you understand that this 95% confidence rule is a way of explaining a strong prior. And this prior mostly comes from historical data.

Therefore, when optimizing a page with low traffic, your decision threshold should be below 95%, because your prior on A is weaker due to its lower traffic and seniority.

The threshold should be set according to the volume of traffic that went through the original from day one. The problem with this approach is that conversion rates are not stable and can change over time. Think of seasonality: the Black Friday rush, vacation days, the Christmas increase in activity, etc. Because of these seasonal changes, you can’t compare performances across different periods.

This is why practitioners only take into account data for version A and version B taken at the same period of time and set a high threshold (95%) to accept the challenger as a winner in order to formalize a strong prior toward version A.

What is the appropriate threshold for low traffic?

It’s hard to suggest an exact number to focus on because it depends on your risk acceptance.

According to the hypothesis protocol, you should structure a time frame for the data collection period in advance.

This means that the “stop” criterion of a test is not a statistical measure or a certain number being reached. The “stop” criterion should be the timeframe coming to an end. Once the period is over, you should then look at the stats to make an appropriate decision.

AB Tasty, our customer experience optimization and feature management software, uses a Bayesian framework that produces a “chance to win” index. This index allows a direct interpretation, unlike a p-value, which has a much more complex meaning.

In other words, the “chances to win index” is the probability for a given variation to be better than the original.

Therefore, a 95% “chance to win” means that there is a 95% probability that the given variation will be the winner. This is assuming that we don’t have any prior knowledge or specific trust for the original.

The 95% threshold itself is also a default compromise between the prior you have on the original and a given level of risk acceptance (it could have even been a 98% threshold).

Although it is hard to give an exact number, let’s make a rough scale for your threshold:

  • New A & B variations: If you have a case where variation A and variation B are both new, the threshold could be as low as 50%. If there is no past data on the variations’ performance and you must make a choice for implementation, even a 51% chance to win is better than 49%.
  • New website, low traffic: If your website is new and has very low traffic, you likely have very little prior on variation A (the original variation in this case). In that case, setting 85% as a threshold is reasonable, since it means that if you put aside the little you know about the original, you still have an 85% chance of picking the winner and only a 15% chance of picking a variation that is merely equivalent to the original, with an even smaller chance that it performs worse. So depending on the context, such a bet can make sense.
  • Mature business, low traffic: If your business has a longer history, but still lower traffic, 90% is a reasonable threshold. This is because there is still little prior on the original.
  • Mature business, high traffic: Having a lot of prior, or data, on variation A suggests a 95% threshold.

The original 95% threshold is far too high if your business has low traffic because there’s little chance that you will reach it. Consequently, your CRO strategy will have no effect and data-driven decision-making becomes impossible.

By using AB Tasty as your experimentation platform, you will be given a report that includes the “chance to win” along with other statistical information about your web experiments. The report also includes the confidence interval on the estimated gain, an important indicator. The boundaries around the estimated gain are computed in a Bayesian way too, which means they can be interpreted as the best-case and worst-case scenarios.
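
To make the “chance to win” and the Bayesian gain interval less abstract, here is a minimal Monte Carlo sketch using Beta posteriors and made-up numbers. This is a common way to compute these quantities, not necessarily AB Tasty’s exact implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical results: visitors and conversions for the original (A) and the variation (B).
visitors_a, conversions_a = 4_000, 200
visitors_b, conversions_b = 4_000, 228

# Non-informative Beta(1, 1) prior, updated with the observed data.
samples_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, size=200_000)
samples_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, size=200_000)

chance_to_win = (samples_b > samples_a).mean()
relative_gain = (samples_b - samples_a) / samples_a
worst, best = np.percentile(relative_gain, [2.5, 97.5])

print(f"chance to win: {chance_to_win:.1%}")
print(f"95% credible interval on the relative gain: [{worst:+.1%}, {best:+.1%}]")
```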

The importance of Bayesian statistics

Now you understand the exact meaning of the well-known 95% “significance” level and are able to select appropriate thresholds corresponding to your particular case.

It’s important to remember that this approach only works with Bayesian statistics, since frequentist approaches give statistical indices (such as p-values and confidence intervals) that have a totally different meaning and are not suited to the logic explained above.

2. Are the stats valid with small numbers?

Yes, they are valid as long as you do not stop the test depending on the result.

Remember that the testing protocol says that once you decide on a testing period, the only reason to stop a test is that the timeframe has ended. In this case, the statistical indices (“chance to win” and the confidence interval) are valid and usable.

You may be thinking: “Okay, but then I rarely reach the 95% significance level…”

Remember that the 95% threshold doesn’t need to be the magic number for all cases. If you have low traffic, chances are that your website is not old. If you refer back to the previous point, you can take a look at our suggested scale for different scenarios.

If you’re dealing with lower traffic as a newer business, you can certainly switch to a lower threshold (like 90%). The threshold is still higher because it’s typical to have more trust in an original rather than a variant because it’s used for a longer time.

If you’re dealing with two completely new variants, at the end of your testing period it will be easier to simply pick the variant with the higher conversions (without using a statistical test), since there is no prior knowledge of the performance of A or B.

3. Go “upstream”

Sometimes the traffic problem is not due to a low-traffic website, but rather the webpage in question. Typically, pages with lower traffic are at the end of the funnel.

In this case, a great strategy is to work on optimizing the funnel closer to the user’s point of entry. There may be more to uncover with optimization in the digital customer journey before reaching the bottom of the funnel.

4. Is the CUPED technique real?

What is CUPED?

Controlled Experiment Using Pre-Experiment Data is a newer buzzword in the experimentation world. CUPED is a technique that claims to produce up to 50% faster results. Clearly, this is very appealing to small-traffic websites.

Does CUPED really work that well?

Not exactly, for two reasons: one is organizational and the other is applicability.

The organizational constraint

What’s often forgotten is that CUPED means Controlled experiment Using Pre-Experiment Data.

In practice, the ideal period of “pre-experiment data” is two weeks in order to hope for a 50% time reduction.

So, for a 2-week classic test, CUPED claims that you can end the test in only 1 week.

However, in order to properly see your results, you will need two weeks of pre-experiment data. So in fact, you must have three weeks to implement CUPED in order to have the same accuracy as a classic 2-week test.

Yes, you are reading that correctly. In the end, you will need three weeks to run the experiment.

This means that CUPED is only useful if you already have two weeks of traffic data that is unexposed to any experiment. Even if you can schedule two weeks without experiments into your experimentation planning to collect this data, it will block traffic for other experiments.

The applicability constraint

In addition to the organizational/2-week time constraint, there are two other prerequisites in order for CUPED to be effective:

  1. CUPED is only applicable to visitors browsing the site during both the pre-experiment and experiment periods.
  2. These visitors need to have the same behavior regarding the KPI under optimization. Visitors’ data must be correlated between the two periods.

You will see in the following paragraph that these two constraints make CUPED virtually impossible for e-commerce websites and only applicable to platforms.

Let’s go back to our experiment settings example:

  • Two weeks of pre-experiment data
  • Two weeks of experiment data (that we hope will only last one week as there is a supposed 50% time reduction)
  • The optimization goal is a transaction: raising the number of conversions.

Constraint number 1 states that we need to have the same visitors in pre-experiment & experiment, but the visitor’s journey in e-commerce is usually one week.

In other words, there is very little chance that you see visitors in both periods. In this context, only a very limited effect of CUPED is to be expected (up to the portion of visitors that are seen in both periods).

Constraint number 2 states that the visitors must have the same behavior regarding the conversion (the KPI under optimization). Frankly, that constraint is simply never met in e-commerce.

The e-commerce conversion occurs either during the pre-experiment or during the experiment but not in both (unless your customer frequently purchases several times during the experiment time).

This means that there is no chance that the visitors’ conversions are correlated between the periods.

In summary: CUPED is simply not applicable for e-commerce websites to optimize transactions.

It is clearly stated in the original scientific paper, but for the sake of popularity, this buzzword technique is being misrepresented in the testing industry.

In fact, and it is clearly stated in scientific literature, CUPED works only on multiple conversions for platforms that have recurring visitors performing the same actions.

Great platforms for CUPED would be search engines (like Bing, where it was invented) or streaming platforms where users come daily and do the same repeated actions (playing a video, clicking on a link in a search results page, etc.).
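
For reference, the core of CUPED is a variance-reduction adjustment using each visitor’s pre-experiment metric. The sketch below uses a made-up, platform-style recurring metric, which is exactly the setting where the correlation constraint can be met.

```python
import numpy as np

def cuped_adjust(y, x):
    """Remove the part of the experiment metric y explained by the pre-experiment metric x."""
    theta = np.cov(x, y)[0, 1] / np.var(x)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(1)
x = rng.poisson(5, 10_000).astype(float)  # pre-experiment activity per visitor
y = x + rng.normal(0, 2, 10_000)          # experiment-period activity, correlated with x

y_adjusted = cuped_adjust(y, x)
print(f"variance before: {y.var():.2f}, after CUPED: {y_adjusted.var():.2f}")
# Lower variance means narrower intervals, hence the claimed speed-up,
# but only when the pre- and in-experiment metrics are actually correlated.
```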

Even if you try to find an application of CUPED for e-commerce, you’ll find out that it’s not possible.

  • One may say that you could try to optimize the number of products seen, but the problem of constraint 1 still applies: very few visitors will be present in both datasets. And there is a more fundamental objection: this KPI should not be optimized on its own, otherwise you are potentially encouraging hesitation between products.
  • You cannot even try to optimize the number of products ordered by visitors with CUPED because constraint number 2 still holds. The act of purchase can be considered as instantaneous. Therefore, it can only happen in one period or the other – not both. If there is no visitor behavior correlation to expect then there is also no CUPED effect to expect.

Conclusion about CUPED

CUPED does not work for e-commerce websites where a transaction is the main optimization goal. Unless you are Bing, Google, or Netflix — CUPED won’t be your secret ingredient to help you to optimize your business.

This technique is surely a buzzword spiking interest fast, however, it’s important to see the full picture before wanting to add CUPED into your roadmap. E-commerce brands will want to take into account that this testing technique is not suited for their business.

Optimization for low-traffic websites

Brands with lower traffic are still prime candidates for website optimization, even though they might need to adapt to a less traditional approach.

Whether optimizing your web pages means choosing a page that’s higher up in the funnel or adopting a slightly lower threshold, continuous optimization is crucial.

Want to start optimizing your website? AB Tasty is the best-in-class experience optimization platform that empowers you to create a richer digital experience – fast. From experimentation to personalization, this solution can help you activate and engage your audience to boost your conversions.