15 Common A/B Testing Mistakes (And How to Avoid Them)

These mistakes invalidate experiments, waste time, and lead to implementing changes that don't actually work.

TL;DR — The 3 Most Critical Mistakes

1. Peeking at results — Stopping tests early inflates false positive rates to 30%+
2. No sample size calculation — Underpowered tests can't detect real effects
3. Changing tests mid-flight — Any modification invalidates all previous data
1. Stopping tests too early (peeking)

Severity: Critical

Checking results daily and stopping when you see significance is the #1 cause of false positives.

Why it's a problem

Statistical significance fluctuates during a test. Early "wins" often regress to the mean.

How to fix it

Pre-determine sample size and test duration. Don't look at results until the test is complete.
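To see why peeking is so costly, here is a minimal simulation sketch (Python with NumPy/SciPy; the traffic and conversion numbers are made up for illustration). It runs A/A tests, where there is no real difference between variants, checks a two-proportion z-test every day, and counts how often a daily peeker would have declared a "winner."

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_aa_test(n_per_day=500, days=20, p=0.05, alpha=0.05):
    """Simulate an A/A test (no real difference) with a significance check every day.
    Returns True if a peeker would have stopped and declared a winner at any point."""
    a_conv = b_conv = a_n = b_n = 0
    for _ in range(days):
        a_conv += rng.binomial(n_per_day, p)
        b_conv += rng.binomial(n_per_day, p)
        a_n += n_per_day
        b_n += n_per_day
        # Two-proportion z-test on today's cumulative counts
        pooled = (a_conv + b_conv) / (a_n + b_n)
        se = np.sqrt(pooled * (1 - pooled) * (1 / a_n + 1 / b_n))
        if se == 0:
            continue
        z = (b_conv / b_n - a_conv / a_n) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        if p_value < alpha:
            return True  # a peeker stops here and ships a false winner
    return False

false_positives = sum(run_aa_test() for _ in range(2000))
print(f"False positive rate with daily peeking: {false_positives / 2000:.1%}")
# A single look at the pre-planned end of the test would keep this near 5%.
```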

2. Not calculating sample size upfront

Severity: Critical

Running tests without knowing how many visitors you need leads to underpowered experiments.

Why it's a problem

Without an adequate sample size, real effects go undetected, and the "significant" results you do find tend to overstate the true effect.

How to fix it

Use a sample size calculator before starting. Factor in your baseline conversion rate and minimum detectable effect.
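If you prefer to compute it directly rather than use an online calculator, the standard two-proportion approximation looks roughly like this (a sketch; the 4% baseline and 10% relative lift are illustrative inputs):

```python
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, mde_relative, alpha=0.05, power=0.8):
    """Visitors needed per variant for a two-proportion test.

    baseline_rate: current conversion rate, e.g. 0.04 for 4%
    mde_relative:  minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance
    z_beta = norm.ppf(power)            # desired power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# 4% baseline, detecting a 10% relative lift (4.0% -> 4.4%)
n = sample_size_per_variant(0.04, 0.10)
print(f"{n:,} visitors per variant, {2 * n:,} total")  # roughly 39,500 per variant
```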

3. Testing too many variations

Severity: High

Running 5+ variations splits your traffic too thin and dramatically extends test duration.

Why it's a problem

Each variation needs its own full sample, so total traffic scales with the number of variants, and multiple-comparison corrections push it higher still.

How to fix it

Limit to 2-3 variations maximum. Test the biggest hypotheses first.
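A rough back-of-the-envelope sketch of how calendar time grows with the number of variants, assuming a hypothetical page with 5,000 visitors/day and roughly 40,000 visitors needed per variant (corrections for multiple comparisons would push these numbers higher):

```python
def test_duration_days(visitors_per_day, n_per_variant, num_variants):
    """Days needed when daily traffic is split evenly across all variants."""
    visitors_per_variant_per_day = visitors_per_day / num_variants
    return n_per_variant / visitors_per_variant_per_day

# Illustrative inputs: ~40,000 visitors per variant, 5,000 visitors/day.
for variants in (2, 3, 5):
    days = test_duration_days(5_000, 40_000, variants)
    print(f"{variants} variants: {days:.0f} days")
# 2 variants: 16 days, 3 variants: 24 days, 5 variants: 40 days
```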

4. Ignoring seasonality and external factors

Severity: High

Starting a test during Black Friday and ending it in December will give you misleading results.

Why it's a problem

External factors hit both variants equally, but behavior during an unusual period rarely reflects normal traffic, so a winner found then may not hold once the event passes.

How to fix it

Run tests for at least one full week to capture weekly patterns. Avoid major holidays and events.

5. Testing without a hypothesis

Severity: Medium

Random testing ("let's see what happens") wastes time and teaches you nothing.

Why it's a problem

Without a hypothesis, you can't learn from results or build institutional knowledge.

How to fix it

Write a clear hypothesis: "If we [change], then [metric] will [improve] because [reason]."

6. Focusing only on statistical significance

Severity: Medium

A statistically significant 0.1% improvement might not be worth implementing.

Why it's a problem

Statistical significance doesn't equal practical significance.

How to fix it

Define your minimum detectable effect upfront. Consider implementation cost vs. expected lift.
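One way to operationalize this is a simple ship/no-ship check that combines the statistical result with a practical threshold and a cost estimate. The function and every number below are illustrative assumptions, not a standard formula:

```python
def worth_shipping(observed_lift, ci_lower, practical_threshold,
                   annual_revenue_at_stake, implementation_cost):
    """Crude ship/no-ship check: the lift must be practically meaningful,
    not just statistically distinguishable from zero."""
    clears_threshold = ci_lower >= practical_threshold          # lift we actually care about
    expected_value = observed_lift * annual_revenue_at_stake    # rough incremental revenue
    pays_for_itself = expected_value > implementation_cost
    return clears_threshold and pays_for_itself

# A "significant" +0.1% relative lift whose confidence interval barely excludes zero:
print(worth_shipping(observed_lift=0.001, ci_lower=0.0002,
                     practical_threshold=0.02,           # we only care about >= 2% lift
                     annual_revenue_at_stake=500_000,
                     implementation_cost=15_000))        # False: not worth shipping
```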

7. Not segmenting results

Severity: Medium

Your overall result might hide that mobile users hate the change while desktop users love it.

Why it's a problem

Aggregate results can mask important segment-level differences.

How to fix it

Analyze results by device, traffic source, new vs. returning users, and other key segments.
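With per-visitor data this is a one-liner in pandas; the column names and tiny dataset below are purely illustrative, and a real analysis would also run per-segment significance tests rather than compare point estimates:

```python
import pandas as pd

# Assumed per-visitor log: assigned variant, device, and conversion flag.
events = pd.DataFrame({
    "variant":   ["A", "B", "A", "B", "A", "B", "A", "B"],
    "device":    ["mobile", "mobile", "desktop", "desktop",
                  "mobile", "mobile", "desktop", "desktop"],
    "converted": [0, 0, 1, 1, 1, 0, 0, 1],
})

# Conversion rate per (segment, variant) pair
by_segment = (events
              .groupby(["device", "variant"])["converted"]
              .agg(visitors="count", conversion_rate="mean"))
print(by_segment)
```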

8. Testing the wrong metric

Severity: High

Optimizing for clicks when you should optimize for revenue leads to hollow wins.

Why it's a problem

Proxy metrics don't always correlate with business outcomes.

How to fix it

Test against metrics that directly impact your business goals. Revenue > clicks.

9. Ignoring flicker and performance

Severity: High

If users see the original page flash before your variant loads, your results are biased.

Why it's a problem

Flicker creates a poor experience that affects conversion independent of your change.

How to fix it

Use anti-flicker techniques or tools that load synchronously. Test page load impact.

10. Not documenting experiments

Severity: Medium

Running tests without recording hypotheses, results, and learnings means you'll repeat mistakes.

Why it's a problem

Institutional knowledge is lost. Teams test the same things repeatedly.

How to fix it

Maintain an experiment log with hypothesis, results, learnings, and next steps.
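The log does not need special tooling; even a simple structured record works. The fields and the example entry below are hypothetical, just to show the kind of information worth capturing:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ExperimentRecord:
    """One entry in a team-wide experiment log (illustrative fields)."""
    name: str
    hypothesis: str          # "If we X, then Y will improve because Z"
    primary_metric: str
    start: date
    end: date
    result: str              # e.g. "+3.2% signup rate" or "no detectable effect"
    decision: str            # shipped / not shipped / retest
    learnings: list[str] = field(default_factory=list)

log = [
    ExperimentRecord(
        name="homepage-hero-copy-v2",           # hypothetical experiment
        hypothesis="If we lead with pricing transparency, signups will rise "
                   "because price anxiety is the top objection in user interviews.",
        primary_metric="signup conversion rate",
        start=date(2024, 3, 4), end=date(2024, 3, 18),
        result="no detectable effect",
        decision="not shipped",
        learnings=["Price anxiety may matter later in the funnel, not on the homepage."],
    )
]
```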

11. Testing on low-traffic pages

Severity: Medium

Testing on a page with 100 visitors/month means waiting years for significant results.

Why it's a problem

Low traffic = long test duration = stale results by the time you're done.

How to fix it

Focus on high-traffic pages. For low-traffic pages, consider qualitative research instead.
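A quick sanity check of the arithmetic, using the roughly 39,500-visitors-per-variant figure from the sample-size sketch above and the 100-visitors/month page from this example:

```python
def weeks_to_reach_sample(required_per_variant, monthly_visitors, num_variants=2):
    """Rough calendar time to collect the required sample on a given page."""
    weekly_visitors = monthly_visitors / 4.345          # average weeks per month
    weekly_per_variant = weekly_visitors / num_variants
    return required_per_variant / weekly_per_variant

print(f"{weeks_to_reach_sample(39_500, 100):.0f} weeks")
# ~3,400 weeks -- on the order of 65 years for this page
```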

12. Making changes during the test

Severity: Critical

Tweaking your variant mid-test invalidates all previous data.

Why it's a problem

You're now testing a different treatment. Previous data doesn't apply.

How to fix it

Never modify a running test. If you must change something, start a new test.

13. Not accounting for novelty effect

Severity: Medium

New designs often win initially because they're different, not because they're better.

Why it's a problem

Users notice changes and may interact differently just because it's new.

How to fix it

Run tests long enough for novelty to wear off (2+ weeks). Monitor results over time.
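A simple way to monitor this is to break the lift out by week of the test; the numbers below are invented to show the typical pattern of a fading novelty bump:

```python
import pandas as pd

# Invented weekly results for a running test.
weekly = pd.DataFrame({
    "week":        [1, 1, 2, 2, 3, 3],
    "variant":     ["A", "B"] * 3,
    "visitors":    [7000, 7000, 7200, 7100, 6900, 7000],
    "conversions": [280, 322, 290, 300, 276, 282],
})

rates = weekly.assign(rate=weekly.conversions / weekly.visitors)
lift = rates.pivot(index="week", columns="variant", values="rate")
lift["relative_lift"] = lift["B"] / lift["A"] - 1
print(lift)   # +15% in week 1, ~+5% in week 2, under +1% by week 3
```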

14. Testing changes that are too small

Severity: Low

Testing button color changes when you should be testing value propositions.

Why it's a problem

Micro-optimizations have micro-impacts. You need large sample sizes to detect small effects.

How to fix it

Test big, bold changes first. Save micro-optimizations for later.
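The sample-size math makes the cost concrete: required traffic grows roughly with the inverse square of the effect you want to detect. This sketch reuses the same two-proportion approximation as the earlier sample-size example, with an illustrative 4% baseline:

```python
from scipy.stats import norm

def n_per_variant(p, relative_mde, alpha=0.05, power=0.8):
    """Two-proportion sample-size approximation (same as the earlier sketch)."""
    p2 = p * (1 + relative_mde)
    p_bar = (p + p2) / 2
    num = (norm.ppf(1 - alpha / 2) * (2 * p_bar * (1 - p_bar)) ** 0.5
           + norm.ppf(power) * (p * (1 - p) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(num / (p2 - p) ** 2) + 1

for mde in (0.20, 0.10, 0.05, 0.01):
    print(f"{mde:>5.0%} relative lift: {n_per_variant(0.04, mde):>10,} per variant")
# Detecting a 1% lift takes hundreds of times the traffic of detecting a 20% lift.
```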

15. Not considering downstream effects

Severity: Medium

Your signup test might increase signups but decrease activation or retention.

Why it's a problem

Optimizing one step can negatively impact the rest of the funnel.

How to fix it

Track metrics across the entire funnel. Look at cohort-level impact over time.
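In practice this means reporting each variant against every downstream step, not just the tested one. The cohort table below is invented to show how a signup "win" can erode further down the funnel:

```python
import pandas as pd

# Assumed cohort table: per-variant counts at each funnel step (illustrative numbers).
funnel = pd.DataFrame({
    "variant":      ["A", "B"],
    "visitors":     [50_000, 50_000],
    "signups":      [2_000, 2_400],    # B wins the tested step (+20% signups)
    "activated":    [1_200, 1_150],    # ...but loses downstream
    "retained_30d": [800, 700],
}).set_index("variant")

funnel["signup_rate"] = funnel.signups / funnel.visitors
funnel["activation_rate"] = funnel.activated / funnel.signups
funnel["retention_rate"] = funnel.retained_30d / funnel.activated
print(funnel[["signup_rate", "activation_rate", "retention_rate"]])
# B's extra signups are lower intent: activation and retention both drop.
```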

Pre-Test Checklist

Before launching any A/B test, verify:

Clear hypothesis documented
Sample size calculated
Test duration determined
Primary metric defined
No major events during test
QA completed on all variants
Tracking verified
Segments to analyze identified

Run Mistake-Free Experiments

ExperimentHQ helps you avoid common mistakes with built-in sample size calculation, no-flicker testing, and clear statistical reporting.