The problem: With a typical threshold (α = 0.05), some false positives are expected by design. Bad practices (peeking, many metrics, changing the plan) can inflate the risk dramatically. The fix isn’t “more math”—it’s better process: one primary metric, a real stopping rule, and discipline.
Important note
Any specific percentages you see quoted online (“X% of winners are false”) depend heavily on the experiment design and on how the team runs its tests. This article focuses on the mechanisms that create false positives and the practical fixes you can adopt immediately.
The Base Rate Problem
At 95% confidence (p < 0.05), you accept a 5% false positive rate by design. This means:
If you run 20 tests where there's no real difference, on average one will show up as "significant" by pure chance.
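You can see this base rate directly by simulating A/A comparisons (no real difference between arms) and counting how often a standard two-proportion test crosses p < 0.05. A minimal sketch, assuming a made-up baseline conversion rate and sample size, and using statsmodels:

```python
# Minimal sketch: simulate A/A tests (no real difference) and count how often
# a two-proportion z-test comes out "significant" at alpha = 0.05.
# The baseline rate and sample size are illustrative assumptions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
baseline_rate = 0.10      # assumed conversion rate (same in both arms)
n_per_arm = 10_000        # assumed visitors per arm
n_tests = 1_000           # number of simulated A/A tests
alpha = 0.05

false_positives = 0
for _ in range(n_tests):
    conv_a = rng.binomial(n_per_arm, baseline_rate)
    conv_b = rng.binomial(n_per_arm, baseline_rate)  # identical rate: no real effect
    _, p_value = proportions_ztest([conv_a, conv_b], [n_per_arm, n_per_arm])
    if p_value < alpha:
        false_positives += 1

print(f"'Significant' A/A tests: {false_positives / n_tests:.1%}")  # ~5% by design
```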
What Inflates False Positive Rates
Peeking at results (High risk)
Checking results daily and stopping the moment something looks “significant” inflates the error rate (see the simulation after this list).
Multiple metrics (High risk)
Looking at many metrics increases the chance that at least one looks “significant” by chance alone.
Small samples / noisy metrics (Medium risk)
Underpowered tests produce unstable estimates and more churn.
Changing the plan mid-test (High risk)
Tweaking targeting, variants, or goals mid-run invalidates the statistical inference.
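To see why peeking is so damaging, the sketch below runs the same kind of null A/B test but checks it after every simulated “day” and stops at the first p < 0.05. The daily traffic, baseline rate, and test length are illustrative assumptions:

```python
# Sketch: peek at a null A/B test daily and stop at the first p < 0.05.
# All parameters (rate, daily traffic, test length) are illustrative assumptions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
baseline_rate = 0.10      # same conversion rate in both arms (no real effect)
daily_visitors = 1_000    # per arm, per day
days = 30
n_sims = 1_000
alpha = 0.05

declared_winner = 0
for _ in range(n_sims):
    conversions = np.zeros(2)
    visitors = np.zeros(2)
    for _ in range(days):
        conversions += rng.binomial(daily_visitors, baseline_rate, size=2)
        visitors += daily_visitors
        _, p_value = proportions_ztest(conversions, visitors)
        if p_value < alpha:          # stop as soon as anything looks "significant"
            declared_winner += 1
            break

print(f"Null tests that declared a 'winner': {declared_winner / n_sims:.1%}")
```

Even though there is no real difference, the share of runs that declare a “winner” lands well above the nominal 5%, because every extra look is another chance to cross the threshold.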
The Cumulative Effect
These effects compound. A typical “bad practice” workflow might look like:
- Run the test without a planned sample size
- Check results daily and share screenshots in Slack
- Look at 5+ metrics until one looks good
- Stop the moment something looks “significant”
This workflow can create “winners” that disappear on re-test or fail to replicate in production.
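The sketch below simulates that workflow end to end: no planned sample size, daily peeks, five metrics, and stopping at the first “significant” result on any of them. All numbers are illustrative, and the metrics are simulated as independent for simplicity:

```python
# Sketch of the "bad practice" workflow: no planned sample size, daily peeks,
# five metrics, stop at the first p < 0.05 on any of them. Both arms are drawn
# from the same distribution, so every declared "winner" is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_metrics = 5
daily_visitors = 1_000        # per arm, per day (illustrative)
days = 30
n_sims = 500
alpha = 0.05

spurious_winners = 0
for _ in range(n_sims):
    a = np.empty((0, n_metrics))
    b = np.empty((0, n_metrics))
    for _ in range(days):
        a = np.vstack([a, rng.normal(size=(daily_visitors, n_metrics))])
        b = np.vstack([b, rng.normal(size=(daily_visitors, n_metrics))])
        p_values = [stats.ttest_ind(a[:, m], b[:, m]).pvalue for m in range(n_metrics)]
        if min(p_values) < alpha:  # any metric looks good -> declare a winner
            spurious_winners += 1
            break

print(f"Null tests that produced a 'winner': {spurious_winners / n_sims:.1%}")
```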
How to Reduce False Positives
- Calculate sample size upfront and commit to it (see the sketch after this list)
- Don't peek at results until you reach sample size
- Define one primary metric before the test
- Use sequential testing if you must peek (Bayesian or SPRT)
- Apply Bonferroni correction for multiple metrics
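For the first item, a fixed-horizon sample size falls out of four inputs: the baseline rate, the minimum detectable effect, α, and power. A sketch using statsmodels’ power calculator, with assumed inputs:

```python
# Sketch: fixed-horizon sample size for a conversion-rate test, computed
# before launch. Baseline rate and minimum detectable effect are assumptions.
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed current conversion rate
target_rate = 0.11     # smallest lift worth detecting (absolute +1pp)
alpha = 0.05           # two-sided significance level
power = 0.80           # 80% chance of detecting the effect if it is real

effect_size = proportion_effectsize(target_rate, baseline_rate)
n_per_arm = math.ceil(NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0,
    alternative="two-sided",
))
print(f"Visitors needed per arm: {n_per_arm:,}")
```

Commit to that number before launch, and read the primary metric only once you reach it.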
False-positive prevention checklist
Pick one primary metric
Decide the one metric that determines “win/lose” before you launch.
Commit to a stopping rule
Fixed-horizon (sample size upfront) or a true sequential method. Avoid “stop when p < 0.05.”
Limit multiple comparisons
If you must check multiple metrics/variants, apply a correction or use hierarchical metrics (see the sketch after this checklist).
Run AA tests occasionally
AA tests (A vs A) help validate your pipeline and reveal bias/bugs.
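For the multiple-comparisons item above, the correction itself is a single call in statsmodels. A sketch with hypothetical p-values; Bonferroni is the simplest choice, and Holm is a common, slightly less conservative alternative:

```python
# Sketch: correct a set of per-metric p-values for multiple comparisons.
# The p-values below are hypothetical, not real results.
from statsmodels.stats.multitest import multipletests

metric_pvalues = [0.012, 0.034, 0.20, 0.47, 0.91]

# Bonferroni: simple and conservative (effectively divides alpha by the metric count).
reject_bonf, _, _, _ = multipletests(metric_pvalues, alpha=0.05, method="bonferroni")

# Holm: controls the same error rate but rejects at least as often as Bonferroni.
reject_holm, _, _, _ = multipletests(metric_pvalues, alpha=0.05, method="holm")

for p, rb, rh in zip(metric_pvalues, reject_bonf, reject_holm):
    print(f"p={p:.3f}  Bonferroni: {rb}  Holm: {rh}")
```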
If you’re seeing lots of “wins” that don’t replicate, also check for sample ratio mismatch (SRM).
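An SRM check is just a goodness-of-fit test on the observed traffic split. A sketch for an intended 50/50 split, with hypothetical visitor counts and a commonly used p < 0.001 alert threshold:

```python
# Sketch: sample ratio mismatch (SRM) check for an intended 50/50 split.
# Visitor counts are hypothetical (and imbalanced enough to trigger the alert);
# a tiny p-value suggests the assignment itself is broken, so the test results
# should not be trusted.
from scipy.stats import chisquare

visitors = [50_341, 49_102]                  # observed visitors per arm
expected = [sum(visitors) / 2] * 2           # expected counts under a 50/50 split

stat, p_value = chisquare(f_obs=visitors, f_exp=expected)
print(f"SRM check p-value: {p_value:.4f}")
if p_value < 0.001:                          # conservative threshold commonly used for SRM alerts
    print("Likely sample ratio mismatch: investigate before reading results.")
```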
Run Valid Tests
ExperimentHQ uses proper statistical methods and warns you about peeking. Get results you can trust.