15 Common A/B Testing Mistakes
(And How to Avoid Them)
These mistakes invalidate experiments, waste time, and lead to implementing changes that don't actually work.
TL;DR — The 3 Most Critical Mistakes
1. Peeking at results — Stopping tests early inflates false positive rates to 30%+
2. No sample size calculation — Underpowered tests can't detect real effects
3. Changing tests mid-flight — Any modification invalidates all previous data
Stopping tests too early (peeking)
Critical: Checking results daily and stopping when you see significance is the #1 cause of false positives.
Statistical significance fluctuates during a test. Early "wins" often regress to the mean.
Pre-determine sample size and test duration. Don't look at results until the test is complete.
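To see why peeking is so costly, here is a minimal A/A simulation: both arms share the same true conversion rate, so every "significant" result is a false positive by construction. The 2% conversion rate, 1,000 visitors per arm per day, and 14-day window are hypothetical; the exact inflation depends on how often and how long you peek, but daily checks reliably push the false positive rate far above the nominal 5%.

```python
# A/A simulation: both "variants" share the same true conversion rate, so any
# "significant" result is a false positive by construction.
import numpy as np

rng = np.random.default_rng(0)
p_true, daily_n, days, runs = 0.02, 1_000, 14, 2_000   # hypothetical rate and traffic

def significant(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test at the 5% level (|z| > 1.96)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return abs(conv_a / n_a - conv_b / n_b) / se > 1.96 if se > 0 else False

peek_fp = final_fp = 0
for _ in range(runs):
    a = rng.binomial(daily_n, p_true, days).cumsum()   # cumulative conversions, arm A
    b = rng.binomial(daily_n, p_true, days).cumsum()   # cumulative conversions, arm B
    n = daily_n * np.arange(1, days + 1)               # cumulative visitors per arm
    daily_calls = [significant(a[d], n[d], b[d], n[d]) for d in range(days)]
    peek_fp += any(daily_calls)      # "stop the moment any day looks significant"
    final_fp += daily_calls[-1]      # look once, at the pre-planned end

print(f"False positive rate with daily peeking:  {peek_fp / runs:.1%}")
print(f"False positive rate with one final look: {final_fp / runs:.1%}")
```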
Not calculating sample size upfront
Critical: Running tests without knowing how many visitors you need leads to underpowered experiments.
Without adequate sample size, you can't detect real effects or you'll get false positives.
Use a sample size calculator before starting. Factor in your baseline conversion rate and minimum detectable effect.
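Here is a minimal sketch of that calculation using statsmodels; the 4% baseline and 10% relative minimum detectable effect are placeholders to swap for your own numbers.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                        # current conversion rate (hypothetical)
mde_relative = 0.10                    # smallest lift worth acting on: 4.0% -> 4.4%
target = baseline * (1 + mde_relative)

effect = proportion_effectsize(baseline, target)       # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Visitors needed per variant: {round(n_per_arm):,}")   # roughly 40,000 here
```

Double that for a two-arm test, then compare it against your actual traffic before committing to the experiment.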
Testing too many variations
High: Running 5+ variations splits your traffic too thin and multiplies test duration.
Each additional variation requires more traffic. 5 variations means 5x the sample size.
Limit to 2-3 variations maximum. Test the biggest hypotheses first.
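A rough sketch of how total traffic scales as arms are added, reusing the hypothetical 4% baseline and 10% relative MDE from above and applying a Bonferroni correction (one common, conservative choice) when each variation is compared against the control.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.04, 0.044)      # hypothetical baseline vs. target
for variations in (1, 2, 4):
    alpha = 0.05 / variations                    # Bonferroni-adjusted per-comparison alpha
    per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=alpha, power=0.8)
    total = per_arm * (variations + 1)           # every arm, control included, needs a full sample
    print(f"{variations} variation(s) + control: {round(total):,} total visitors")
```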
Ignoring seasonality and external factors
High: Starting a test during Black Friday and ending it in December will give you misleading results.
External factors affect both variants, but the timing can skew your baseline.
Run tests for at least one full week to capture weekly patterns. Avoid major holidays and events.
Testing without a hypothesis
Medium: Random testing ("let's see what happens") wastes time and teaches you nothing.
Without a hypothesis, you can't learn from results or build institutional knowledge.
Write a clear hypothesis: "If we [change], then [metric] will [improve] because [reason]."
Focusing only on statistical significance
Medium: A statistically significant 0.1% improvement might not be worth implementing.
Statistical significance doesn't equal practical significance.
Define your minimum detectable effect upfront. Consider implementation cost vs. expected lift.
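A back-of-the-envelope check, using entirely hypothetical traffic, revenue, and cost figures, of whether a statistically significant 0.1 percentage point lift actually pays for itself.

```python
monthly_visitors = 50_000
observed_lift = 0.001            # +0.1 percentage point in conversion rate
revenue_per_conversion = 30.0
implementation_cost = 25_000     # build + QA + ongoing maintenance

extra_conversions_per_year = monthly_visitors * 12 * observed_lift
extra_revenue_per_year = extra_conversions_per_year * revenue_per_conversion

print(f"Extra revenue per year: ${extra_revenue_per_year:,.0f}")                      # $18,000
print(f"Clears implementation cost: {extra_revenue_per_year > implementation_cost}")  # False
```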
Not segmenting results
Medium: Your overall result might hide that mobile users hate the change while desktop users love it.
Aggregate results can mask important segment-level differences.
Analyze results by device, traffic source, new vs. returning users, and other key segments.
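A small illustration with made-up numbers: the blended conversion rates show a clean win for variant B, while mobile users actually convert worse on it.

```python
import pandas as pd

# Hypothetical aggregated results by variant and device.
df = pd.DataFrame(
    [
        ("A", "mobile",  12_000, 360),
        ("A", "desktop",  8_000, 400),
        ("B", "mobile",  12_000, 300),   # worse on mobile...
        ("B", "desktop",  8_000, 520),   # ...better on desktop
    ],
    columns=["variant", "device", "visitors", "conversions"],
)

overall = df.groupby("variant")[["visitors", "conversions"]].sum()
overall["cr"] = overall["conversions"] / overall["visitors"]
print(overall["cr"])        # A: 3.8%, B: 4.1% — looks like a clear win

by_segment = df.groupby(["variant", "device"])[["visitors", "conversions"]].sum()
by_segment["cr"] = by_segment["conversions"] / by_segment["visitors"]
print(by_segment["cr"])     # B loses on mobile (2.5% vs 3.0%) despite the overall lift
```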
Testing the wrong metric
High: Optimizing for clicks when you should optimize for revenue leads to hollow wins.
Proxy metrics don't always correlate with business outcomes.
Test against metrics that directly impact your business goals. Revenue > clicks.
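A toy example with invented numbers showing how a click-through "win" can hide a revenue loss; revenue per visitor is the kind of metric worth testing against instead.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("A", 10_000,   800, 24_000.0),
        ("B", 10_000, 1_100, 19_500.0),
    ],
    columns=["variant", "visitors", "clicks", "revenue"],
)
df["ctr"] = df["clicks"] / df["visitors"]                      # B "wins": 11% vs 8%
df["revenue_per_visitor"] = df["revenue"] / df["visitors"]     # B loses: $1.95 vs $2.40
print(df[["variant", "ctr", "revenue_per_visitor"]])
```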
Ignoring flicker and performance
High: If users see the original page flash before your variant loads, your results are biased.
Flicker creates a poor experience that affects conversion independent of your change.
Use anti-flicker techniques or tools that load synchronously. Test page load impact.
Not documenting experiments
Medium: Running tests without recording hypotheses, results, and learnings means you'll repeat mistakes.
Institutional knowledge is lost. Teams test the same things repeatedly.
Maintain an experiment log with hypothesis, results, learnings, and next steps.
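One lightweight way to structure that log; this is a sketch, not a standard, and the fields are just suggestions — a shared spreadsheet works equally well.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Experiment:
    name: str
    hypothesis: str          # "If we [change], then [metric] will [improve] because [reason]."
    primary_metric: str
    start: date
    end: date
    result: str = ""         # e.g. "+2.1% revenue per visitor, p = 0.03"
    decision: str = ""       # ship / iterate / abandon
    learnings: list[str] = field(default_factory=list)

experiment_log = [
    Experiment(
        name="checkout-trust-badges",
        hypothesis="If we add trust badges to checkout, completion rate will rise "
                   "because payment anxiety shows up repeatedly in session recordings.",
        primary_metric="checkout completion rate",
        start=date(2024, 3, 1),
        end=date(2024, 3, 21),
    )
]
```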
Testing on low-traffic pages
Medium: Testing on a page with 100 visitors/month means waiting years for significant results.
Low traffic = long test duration = stale results by the time you're done.
Focus on high-traffic pages. For low-traffic pages, consider qualitative research instead.
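A quick feasibility check before committing: divide the required sample by the page's monthly traffic. The 40,000-visitors-per-arm figure reuses the hypothetical power calculation from the sample size sketch above.

```python
required_per_arm = 40_000        # from the hypothetical power calculation above
arms = 2                         # control + one variant
monthly_visitors = 100           # the low-traffic page in question

months = required_per_arm * arms / monthly_visitors
print(f"Estimated duration: {months:,.0f} months (~{months / 12:.0f} years)")   # ~800 months
```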
Making changes during the test
Critical: Tweaking your variant mid-test invalidates all previous data.
You're now testing a different treatment. Previous data doesn't apply.
Never modify a running test. If you must change something, start a new test.
Not accounting for novelty effect
Medium: New designs often win initially because they're different, not because they're better.
Users notice changes and may interact differently just because it's new.
Run tests long enough for novelty to wear off (2+ weeks). Monitor results over time.
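One way to monitor for novelty is to compute the lift week by week and watch whether it decays. The sketch below assumes a per-visitor results table with exposure_date, variant, and converted columns — those names are assumptions, not a standard schema.

```python
import pandas as pd

def lift_by_week(df: pd.DataFrame) -> pd.Series:
    """Relative lift of B over A per week of exposure.

    Expects one row per visitor with columns: exposure_date, variant ('A'/'B'),
    converted (0/1).
    """
    weeks = pd.to_datetime(df["exposure_date"]).dt.to_period("W")
    rates = df.groupby([weeks, "variant"])["converted"].mean().unstack("variant")
    return (rates["B"] - rates["A"]) / rates["A"]

# A lift that shrinks week over week (say +12%, +6%, +2%) points to novelty
# rather than a durable improvement.
```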
Testing too small of changes
Low: Testing button color changes when you should be testing value propositions.
Micro-optimizations have micro-impacts. You need large sample sizes to detect small effects.
Test big, bold changes first. Save micro-optimizations for later.
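The same power calculation as the sample size sketch, showing how the required sample balloons as the effect you hope to detect shrinks (hypothetical 4% baseline).

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04
for relative_lift in (0.20, 0.10, 0.02):
    effect = proportion_effectsize(baseline, baseline * (1 + relative_lift))
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
    print(f"+{relative_lift:.0%} relative lift: ~{round(n):,} visitors per variant")
```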
Not considering downstream effects
Medium: Your signup test might increase signups but decrease activation or retention.
Optimizing one step can negatively impact the rest of the funnel.
Track metrics across the entire funnel. Look at cohort-level impact over time.
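A toy funnel with invented counts: variant B lifts the signup rate but gives some of it back at activation and retention, so the end-to-end visitor-to-retained rate ends up slightly behind.

```python
import pandas as pd

funnel = pd.DataFrame(
    {
        "visitors":        [50_000, 50_000],
        "signups":         [2_500, 3_100],
        "activated":       [1_500, 1_550],
        "retained_week_4": [900, 870],
    },
    index=["A", "B"],
)

rates = pd.DataFrame({
    "signup_rate":         funnel["signups"] / funnel["visitors"],           # B wins: 6.2% vs 5.0%
    "activation_rate":     funnel["activated"] / funnel["signups"],          # B loses: 50% vs 60%
    "retention_rate":      funnel["retained_week_4"] / funnel["activated"],  # B loses: 56% vs 60%
    "visitor_to_retained": funnel["retained_week_4"] / funnel["visitors"],   # B ends up behind
})
print(rates)
```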
Pre-Test Checklist
Before launching any A/B test, verify:
- You have a written hypothesis: "If we [change], then [metric] will [improve] because [reason]."
- Sample size and test duration are calculated and locked in before you start.
- Your primary metric ties directly to a business outcome, not a proxy.
- The test window covers at least one full week and avoids major holidays and events.
- The variant loads without flicker or a noticeable performance hit.
- You know which segments (device, traffic source, new vs. returning) you'll analyze.
- The experiment is recorded in your experiment log.
Run Mistake-Free Experiments
ExperimentHQ helps you avoid common mistakes with built-in sample size calculation, no-flicker testing, and clear statistical reporting.