The problem: With a typical threshold (α = 0.05), some false positives are expected by design. Bad practices (peeking, many metrics, changing the plan) can inflate the risk dramatically. The fix isn’t “more math”—it’s better process: one primary metric, a real stopping rule, and discipline.
Important note
Any specific percentages you see quoted online (“X% of winners are false”) depend heavily on the experiment design and on how the team runs its tests. This article focuses on the mechanisms that create false positives and the practical fixes you can adopt immediately.
The Base Rate Problem
At 95% confidence (p < 0.05), you accept a 5% false positive rate by design. This means:
If you run 20 tests where there's no real difference, on average one will show up as "significant" by pure chance.
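You can see this base rate directly by simulating A/A comparisons (no real difference between arms) and counting how often a standard two-proportion test crosses p < 0.05. A minimal sketch, assuming a made-up baseline conversion rate and sample size, and using statsmodels:

```python
# Minimal sketch: simulate A/A tests (no real difference) and count how often
# a two-proportion z-test comes out "significant" at alpha = 0.05.
# The baseline rate and sample size are illustrative assumptions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
baseline_rate = 0.10      # assumed conversion rate (same in both arms)
n_per_arm = 10_000        # assumed visitors per arm
n_tests = 1_000           # number of simulated A/A tests
alpha = 0.05

false_positives = 0
for _ in range(n_tests):
    conv_a = rng.binomial(n_per_arm, baseline_rate)
    conv_b = rng.binomial(n_per_arm, baseline_rate)  # identical rate: no real effect
    _, p_value = proportions_ztest([conv_a, conv_b], [n_per_arm, n_per_arm])
    if p_value < alpha:
        false_positives += 1

print(f"'Significant' A/A tests: {false_positives / n_tests:.1%}")  # ~5% by design
```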
What Inflates False Positive Rates
Peeking at results (High risk)
Checking results daily and stopping the moment something looks “significant” inflates the error rate (see the simulation after this list).
Multiple metrics (High risk)
Looking at many metrics increases the chance that at least one looks “significant” by chance alone.
Small samples / noisy metrics (Medium risk)
Underpowered tests produce unstable estimates and more churn.
Changing the plan mid-test (High risk)
Tweaking targeting, variants, or goals mid-run invalidates the statistical inference.
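To see why peeking is so damaging, the sketch below runs the same kind of null A/B test but checks it after every simulated “day” and stops at the first p < 0.05. The daily traffic, baseline rate, and test length are illustrative assumptions:

```python
# Sketch: peek at a null A/B test daily and stop at the first p < 0.05.
# All parameters (rate, daily traffic, test length) are illustrative assumptions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
baseline_rate = 0.10      # same conversion rate in both arms (no real effect)
daily_visitors = 1_000    # per arm, per day
days = 30
n_sims = 1_000
alpha = 0.05

declared_winner = 0
for _ in range(n_sims):
    conversions = np.zeros(2)
    visitors = np.zeros(2)
    for _ in range(days):
        conversions += rng.binomial(daily_visitors, baseline_rate, size=2)
        visitors += daily_visitors
        _, p_value = proportions_ztest(conversions, visitors)
        if p_value < alpha:          # stop as soon as anything looks "significant"
            declared_winner += 1
            break

print(f"Null tests that declared a 'winner': {declared_winner / n_sims:.1%}")
```

Even though there is no real difference, the share of runs that declare a “winner” lands well above the nominal 5%, because every extra look is another chance to cross the threshold.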
The Cumulative Effect
These effects compound. A typical “bad practice” workflow might look like:
- Run the test without a planned sample size
- Check results daily and share screenshots in Slack
- Look at 5+ metrics until one looks good
- Stop the moment something looks “significant”
This workflow can create “winners” that disappear on re-test or fail to replicate in production.
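The sketch below simulates that workflow end to end: no planned sample size, daily peeks, five metrics, and stopping at the first “significant” result on any of them. All numbers are illustrative, and the metrics are simulated as independent for simplicity:

```python
# Sketch of the "bad practice" workflow: no planned sample size, daily peeks,
# five metrics, stop at the first p < 0.05 on any of them. Both arms are drawn
# from the same distribution, so every declared "winner" is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_metrics = 5
daily_visitors = 1_000        # per arm, per day (illustrative)
days = 30
n_sims = 500
alpha = 0.05

spurious_winners = 0
for _ in range(n_sims):
    a = np.empty((0, n_metrics))
    b = np.empty((0, n_metrics))
    for _ in range(days):
        a = np.vstack([a, rng.normal(size=(daily_visitors, n_metrics))])
        b = np.vstack([b, rng.normal(size=(daily_visitors, n_metrics))])
        p_values = [stats.ttest_ind(a[:, m], b[:, m]).pvalue for m in range(n_metrics)]
        if min(p_values) < alpha:  # any metric looks good -> declare a winner
            spurious_winners += 1
            break

print(f"Null tests that produced a 'winner': {spurious_winners / n_sims:.1%}")
```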
How to Reduce False Positives
- Calculate sample size upfront and commit to it (see the sketch after this list)
- Don't peek at results until you reach sample size
- Define one primary metric before the test
- Use sequential testing if you must peek (Bayesian or SPRT)
- Apply Bonferroni correction for multiple metrics
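For the first item, a fixed-horizon sample size falls out of four inputs: the baseline rate, the minimum detectable effect, α, and power. A sketch using statsmodels’ power calculator, with assumed inputs:

```python
# Sketch: fixed-horizon sample size for a conversion-rate test, computed
# before launch. Baseline rate and minimum detectable effect are assumptions.
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed current conversion rate
target_rate = 0.11     # smallest lift worth detecting (absolute +1pp)
alpha = 0.05           # two-sided significance level
power = 0.80           # 80% chance of detecting the effect if it is real

effect_size = proportion_effectsize(target_rate, baseline_rate)
n_per_arm = math.ceil(NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0,
    alternative="two-sided",
))
print(f"Visitors needed per arm: {n_per_arm:,}")
```

Commit to that number before launch, and read the primary metric only once you reach it.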
False-positive prevention checklist
Pick one primary metric
Decide the one metric that determines “win/lose” before you launch.
Commit to a stopping rule
Fixed-horizon (sample size upfront) or a true sequential method. Avoid “stop when p < 0.05.”
Limit multiple comparisons
If you must check multiple metrics/variants, apply a correction or use hierarchical metrics (see the sketch after this checklist).
Run AA tests occasionally
AA tests (A vs A) help validate your pipeline and reveal bias/bugs.
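For the multiple-comparisons item above, the correction itself is a single call in statsmodels. A sketch with hypothetical p-values; Bonferroni is the simplest choice, and Holm is a common, slightly less conservative alternative:

```python
# Sketch: correct a set of per-metric p-values for multiple comparisons.
# The p-values below are hypothetical, not real results.
from statsmodels.stats.multitest import multipletests

metric_pvalues = [0.012, 0.034, 0.20, 0.47, 0.91]

# Bonferroni: simple and conservative (effectively divides alpha by the metric count).
reject_bonf, _, _, _ = multipletests(metric_pvalues, alpha=0.05, method="bonferroni")

# Holm: controls the same error rate but rejects at least as often as Bonferroni.
reject_holm, _, _, _ = multipletests(metric_pvalues, alpha=0.05, method="holm")

for p, rb, rh in zip(metric_pvalues, reject_bonf, reject_holm):
    print(f"p={p:.3f}  Bonferroni: {rb}  Holm: {rh}")
```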
If you’re seeing lots of “wins” that don’t replicate, also check for sample ratio mismatch (SRM).
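An SRM check is just a goodness-of-fit test on the observed traffic split. A sketch for an intended 50/50 split, with hypothetical visitor counts and a commonly used p < 0.001 alert threshold:

```python
# Sketch: sample ratio mismatch (SRM) check for an intended 50/50 split.
# Visitor counts are hypothetical (and imbalanced enough to trigger the alert);
# a tiny p-value suggests the assignment itself is broken, so the test results
# should not be trusted.
from scipy.stats import chisquare

visitors = [50_341, 49_102]                  # observed visitors per arm
expected = [sum(visitors) / 2] * 2           # expected counts under a 50/50 split

stat, p_value = chisquare(f_obs=visitors, f_exp=expected)
print(f"SRM check p-value: {p_value:.4f}")
if p_value < 0.001:                          # conservative threshold commonly used for SRM alerts
    print("Likely sample ratio mismatch: investigate before reading results.")
```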
Run Valid Tests
ExperimentHQ uses proper statistical methods and warns you about peeking. Get results you can trust.