
A/B Testing Statistics Explained

A comprehensive guide to understanding statistical significance, p-values, confidence intervals, and sample sizes in A/B testing. Written for practitioners, not statisticians.

1 What Is Statistical Significance?

Definition:

Statistical significance indicates how unlikely the observed difference between variants would be if it were purely the result of random chance. A result is considered statistically significant when that probability (the p-value) falls below a predetermined threshold (typically 5%).

In A/B testing, statistical significance helps answer a critical question: "Is the difference I'm seeing real, or could it have happened by chance?"

For example, if Variant A converts at 5.2% and Variant B converts at 4.8%, is that 0.4 percentage-point difference meaningful? Or could it simply be random variation in your data?
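To make this concrete, here is a minimal sketch (plain Python, standard library only) of the usual pooled two-proportion z-test applied to that example. The visitor counts are hypothetical assumptions, since the example above only states the conversion rates.

    import math

    def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
        """Two-sided p-value for the difference between two conversion rates
        (pooled two-proportion z-test)."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        # Two-sided p-value via the standard normal CDF
        return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    # Hypothetical traffic: 10,000 visitors per variant (not stated in the example)
    p = two_proportion_p_value(conv_a=520, n_a=10_000, conv_b=480, n_b=10_000)
    print(f"p-value: {p:.3f}")  # ~0.19: at this traffic, the 0.4pp gap could easily be noise

At ten times that traffic, the same 0.4 percentage-point gap would come out highly significant, which is why sample size (section 4) matters as much as the observed lift.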

The 95% Confidence Standard

The industry standard is 95% confidence (or p < 0.05), meaning:

  • If there were no real difference, a result this extreme would occur less than 5% of the time
  • The observed effect is unlikely to be explained by random chance alone
  • If you ran 100 tests where no true effect existed, you'd expect ~5 false positives

Important

Statistical significance does not mean practical significance. A result can be statistically significant but too small to matter for your business. Always consider the magnitude of the effect alongside significance.

2 Understanding P-Values

Definition:

The p-value is the probability of observing results at least as extreme as your data, assuming the null hypothesis (no difference between variants) is true.

What P-Values Mean in Practice

p = 0.01 → 1% chance of false positive

Very strong evidence. Safe to implement for high-stakes decisions.

p = 0.05 → 5% chance of false positive

Standard threshold. Acceptable for most business decisions.

p = 0.10 → 10% chance of false positive

Suggestive but not conclusive. May warrant further testing.

p = 0.50 → 50% chance of false positive

No evidence of a real effect. Essentially a coin flip.

Common P-Value Misconceptions

Wrong: "p = 0.05 means there's a 5% chance the null hypothesis is true"
Right: "p = 0.05 means there's a 5% chance of seeing this result if there were no real difference"
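One way to internalize the "right" reading is to simulate A/A tests, where both variants share the same true conversion rate, and count how often the test comes out "significant" anyway. A rough sketch (the traffic numbers and trial count are arbitrary):

    import math, random

    random.seed(0)
    N, TRIALS, TRUE_RATE = 2_000, 1_000, 0.05  # arbitrary traffic, trials, baseline

    false_positives = 0
    for _ in range(TRIALS):
        # Both "variants" draw from the SAME 5% rate, so any "winner" is pure noise
        a = sum(random.random() < TRUE_RATE for _ in range(N))
        b = sum(random.random() < TRUE_RATE for _ in range(N))
        p_a, p_b, pool = a / N, b / N, (a + b) / (2 * N)
        se = math.sqrt(pool * (1 - pool) * 2 / N)
        if se and abs(p_a - p_b) / se > 1.96:  # the usual p < 0.05 cutoff
            false_positives += 1

    print(f"'Significant' A/A tests: {false_positives / TRIALS:.1%}")  # close to 5%

Roughly 5% of these no-difference tests clear the p < 0.05 bar, which is exactly what the threshold promises: a 5% false positive rate when there is nothing to find.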

3 Confidence Intervals Explained

Definition:

A confidence interval is a range of values that likely contains the true effect size. A 95% confidence interval means that if you repeated the experiment many times, 95% of the intervals would contain the true value.

Confidence intervals are often more useful than p-values because they show both the direction and magnitude of the effect, plus the uncertainty around your estimate.

Reading Confidence Intervals

Example: Conversion rate lift = +12% (95% CI: +5% to +19%)

  • Best estimate: 12% improvement
  • Likely range: between 5% and 19% improvement
  • Since the interval doesn't include 0, it's statistically significant

Key insight: If the confidence interval includes zero (or crosses from positive to negative), the result is not statistically significant.
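For reference, here is a rough sketch of how a 95% confidence interval for the absolute difference between two conversion rates is usually computed (normal approximation; the counts below are hypothetical and unrelated to the +12% example above):

    import math

    def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
        """95% CI for the absolute lift (variant B minus variant A),
        using the unpooled normal approximation."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
        diff = p_b - p_a
        return diff - z * se, diff + z * se

    # Hypothetical data: 25,000 visitors per variant, 5.0% vs 5.6% conversion
    lo, hi = diff_confidence_interval(1_250, 25_000, 1_400, 25_000)
    print(f"95% CI for the lift: {lo:+.2%} to {hi:+.2%}")  # roughly +0.2% to +1.0%
    # The interval excludes 0, so this difference is significant at the 5% level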

4 Sample Size and Statistical Power

Definition:

Statistical power is the probability of detecting a real effect when one exists. Standard practice is 80% power, meaning you have an 80% chance of detecting a true effect.

Sample size depends on four factors:

Baseline Conversion Rate

Lower baseline rates require larger samples to detect changes.

Minimum Detectable Effect (MDE)

Smaller effects require larger samples to detect reliably.

Statistical Significance (α)

Lower p-value thresholds require larger samples.

Statistical Power (1-β)

Higher power requirements increase sample needs.
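These four inputs combine into the standard two-proportion sample-size formula, sketched below in simplified form. Dedicated calculators apply additional corrections, so expect their figures (including those in the table that follows) to differ somewhat from this approximation.

    import math

    def sample_size_per_variant(baseline, mde_relative, z_alpha=1.96, z_beta=0.84):
        """Approximate visitors needed per variant.
        z_alpha=1.96 -> 95% confidence (two-sided); z_beta=0.84 -> 80% power."""
        p1 = baseline
        p2 = baseline * (1 + mde_relative)        # conversion rate after the hoped-for lift
        variance = p1 * (1 - p1) + p2 * (1 - p2)  # sum of the two binomial variances
        return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

    # e.g. a 3% baseline with a 20% relative MDE (3.0% -> 3.6%)
    print(sample_size_per_variant(0.03, 0.20))  # ~13,900 with this simple approximation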

Sample Size Examples

Baseline Rate    MDE             Sample/Variant
3%               10% relative    ~85,000
3%               20% relative    ~21,000
5%               10% relative    ~50,000
10%              10% relative    ~24,000

* Based on 95% confidence and 80% power. Use our sample size calculator for precise estimates.

5 Common Statistical Mistakes

Peeking and Early Stopping

Checking results repeatedly and stopping when you see significance dramatically increases false positive rates. A test that looks significant at 50% completion may not be significant at 100%.
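A rough simulation makes the danger concrete: run A/A tests (no real difference), check significance at several interim looks, and stop at the first "significant" result. The checkpoints, traffic, and trial count below are arbitrary assumptions.

    import math, random

    random.seed(1)
    RATE, CHECKPOINTS, TRIALS = 0.05, (500, 1_000, 2_000, 4_000), 1_000

    def significant(a, b, n):
        """True if the pooled two-proportion z-test crosses |z| > 1.96 (p < 0.05)."""
        pool = (a + b) / (2 * n)
        se = math.sqrt(pool * (1 - pool) * 2 / n)
        return bool(se) and abs(a / n - b / n) / se > 1.96

    stopped_early = 0
    for _ in range(TRIALS):
        a = b = seen = 0
        for n in CHECKPOINTS:
            # accrue more A/A traffic up to this checkpoint, then peek
            a += sum(random.random() < RATE for _ in range(n - seen))
            b += sum(random.random() < RATE for _ in range(n - seen))
            seen = n
            if significant(a, b, n):
                stopped_early += 1
                break

    print(f"A/A tests declared winners when peeking: {stopped_early / TRIALS:.1%}")
    # Noticeably above the nominal 5%, even though no variant is actually better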

Underpowered Tests

Running tests with insufficient sample sizes leads to high false negative rates. You might conclude there's no effect when one actually exists.

Multiple Comparisons

Testing many metrics or segments without correction inflates false positive rates. If you test 20 metrics, you'd expect 1 false positive at p < 0.05 even with no real effects.
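The simplest guard is a Bonferroni correction: divide the significance threshold by the number of metrics you test. A minimal sketch with made-up metric names and p-values:

    # Hypothetical p-values from checking several metrics on the same experiment
    p_values = {
        "conversion rate": 0.03,
        "average order value": 0.20,
        "bounce rate": 0.04,
        "pages per session": 0.60,
    }

    alpha = 0.05
    adjusted_alpha = alpha / len(p_values)  # Bonferroni: 0.05 / 4 = 0.0125

    for metric, p in p_values.items():
        verdict = "significant" if p < adjusted_alpha else "not significant"
        print(f"{metric}: p = {p:.2f} -> {verdict} at the corrected threshold")

Under the corrected threshold, neither the 0.03 nor the 0.04 result survives, even though both would look like wins at a naive p < 0.05.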

Ignoring Practical Significance

A 0.1% conversion lift might be statistically significant with enough data, but may not justify implementation costs. Always consider business impact alongside statistical results.

6 Bayesian vs Frequentist Testing

There are two main statistical approaches to A/B testing, each with different philosophies and practical implications:

Frequentist Approach

  • Uses p-values and confidence intervals
  • Requires fixed sample sizes
  • Industry standard approach
  • Can't peek at results early

Bayesian Approach

  • Calculates probability of one variant being better (see the sketch after this list)
  • Allows continuous monitoring
  • More intuitive interpretation
  • Requires prior assumptions
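As a concrete illustration of the first Bayesian bullet, here is a rough Monte Carlo sketch of the "probability B beats A" calculation using Beta posteriors with uniform priors. The counts are hypothetical, and this is not a description of how any particular tool (ExperimentHQ included) implements it.

    import random

    random.seed(2)

    def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
        """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
        wins = 0
        for _ in range(draws):
            # Posterior for each variant: Beta(conversions + 1, non-conversions + 1)
            rate_a = random.betavariate(conv_a + 1, n_a - conv_a + 1)
            rate_b = random.betavariate(conv_b + 1, n_b - conv_b + 1)
            wins += rate_b > rate_a
        return wins / draws

    # Hypothetical data: 480/10,000 conversions for A vs 520/10,000 for B
    print(f"P(B beats A): {prob_b_beats_a(480, 10_000, 520, 10_000):.1%}")  # ~90%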

ExperimentHQ uses frequentist statistics with safeguards against common pitfalls. We display clear confidence levels and provide guidance on when results are trustworthy.

Key Takeaways

  • Statistical significance indicates whether results are likely real or due to chance
  • P-values below 0.05 are the standard threshold for significance
  • Confidence intervals show both direction and magnitude of effects
  • Calculate sample size before starting tests to ensure reliable results
  • Avoid peeking and other common mistakes that inflate false positives

Apply These Concepts Today

ExperimentHQ handles the statistics for you — showing clear confidence levels and significance indicators so you can make data-driven decisions.
