A/B Testing Sample Size: Complete Guide
Learn how to calculate the right sample size for your A/B tests, so you neither run underpowered experiments that miss real effects nor waste time running tests longer than necessary.
TL;DR — Quick Reference
- Sample size depends on: baseline conversion rate, minimum detectable effect, statistical power, and significance level
- Standard settings: 80% power, 5% significance level (95% confidence)
- Smaller effects need much larger sample sizes (halving the MDE roughly quadruples the required sample)
- Always calculate before starting — never peek at results early
Quick Sample Size Calculator
[Interactive calculator: enter your current conversion rate, the smallest improvement you want to detect, statistical power (typically 80%), and confidence level (typically 95%) to get the required sample size. With its default inputs it shows 8,150 users per variation, 16,300 in total.]
For a more comprehensive calculator, try our full sample size calculator.
What is Sample Size in A/B Testing?
Definition:
Sample size in A/B testing refers to the number of users or observations needed in each variation to detect a statistically significant difference, if one exists. It determines how long you need to run your test.
Calculating sample size before starting an experiment is crucial. Without adequate sample size:
- You might miss real improvements (false negatives)
- You might declare winners that aren't real (false positives)
- You waste time running tests too long
- You make decisions based on noise, not signal
Factors That Determine Sample Size
1. Baseline Conversion Rate
Your current conversion rate before any changes. Lower baseline rates require larger sample sizes to detect the same relative improvement, because the absolute difference you're looking for is smaller relative to the noise in the data.
Example: A 2% conversion rate needs more samples than a 10% rate to detect the same relative improvement.
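To make that concrete, here's a minimal sketch using the two-proportion formula covered later in this guide and Python's standard library. The 10% relative lift and the two baseline rates are illustrative assumptions, not numbers from the guide.

```python
# Required sample per variation for the same 10% relative lift at two
# different baselines (illustrative numbers; 80% power, 95% confidence).
from math import ceil
from statistics import NormalDist

def n_per_variation(p1, p2, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2                      # pooled conversion rate
    return ceil(2 * z ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2)

for baseline in (0.02, 0.10):
    n = n_per_variation(baseline, baseline * 1.10)   # +10% relative lift
    print(f"{baseline:.0%} baseline: {n:,} visitors per variation")
```

In this example the 2% baseline needs roughly five times as many visitors per variation as the 10% baseline, even though the relative lift is identical.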
2. Minimum Detectable Effect (MDE)
Definition:
Minimum Detectable Effect (MDE) is the smallest relative improvement you want to be able to detect with your test. It's expressed as a percentage of the baseline.
Smaller MDEs require dramatically larger sample sizes: the required sample grows roughly with the square of 1/MDE, so halving the MDE quadruples it. A 5% MDE needs ~4x more samples than a 10% MDE.
Practical tip: Start with a 10-20% MDE. If you need to detect smaller effects, consider whether the business impact justifies the longer test duration.
3. Statistical Power
Definition:
Statistical power is the probability of detecting a real effect when one exists. The industry standard is 80%, meaning you'll correctly identify a real winner 80% of the time.
Higher power (90% or 95%) reduces false negatives but requires larger sample sizes.
4. Significance Level (α)
Definition:
Significance level is the probability of declaring a winner when there's no real difference (false positive rate). The standard is 5% (95% confidence level).
A lower significance level (e.g., 1%, which corresponds to 99% confidence) requires a larger sample size but reduces false positives.
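Both of these settings enter the sample size formula (next section) as z-scores. As a quick sketch using only Python's standard library, here is how stricter choices translate into larger z-values, and therefore larger required samples:

```python
# z-scores behind common confidence and power choices; larger z means a
# larger required sample size.
from statistics import NormalDist

for confidence in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    print(f"{confidence:.0%} confidence -> z = {z:.2f}")   # 1.64, 1.96, 2.58

for power in (0.80, 0.90, 0.95):
    z = NormalDist().inv_cdf(power)
    print(f"{power:.0%} power      -> z = {z:.2f}")        # 0.84, 1.28, 1.64
```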
The Sample Size Formula
For a two-sample proportion test (standard A/B test):
n = 2 × (Zα/2 + Zβ)² × p̄(1-p̄) / (p₂ - p₁)²
Where:
- n = sample size per variation
- Zα/2 = Z-score for significance level (1.96 for 95%)
- Zβ = Z-score for power (0.84 for 80%)
- p̄ = pooled conversion rate ((p₁ + p₂) / 2)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate with improvement
In practice, you don't need to calculate this by hand. Use our sample size calculator or the quick calculator above.
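If you do want to check the arithmetic yourself, here is a minimal sketch of the formula above using only Python's standard library. The example inputs (a 5% baseline and a 20% relative MDE) are illustrative assumptions, not recommendations from this guide.

```python
# Sample size per variation for a two-sided, two-sample proportion test.
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # expected rate with improvement
    p_bar = (p1 + p2) / 2                          # pooled conversion rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2)

n = sample_size_per_variation(baseline=0.05, relative_mde=0.20)
print(f"{n:,} per variation, {2 * n:,} total")
# 8,159 per variation with exact z-scores (about 8,150 if you plug in the
# rounded 1.96 and 0.84 values shown above)
```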
Common Sample Size Mistakes
❌ Not calculating sample size upfront
Running a test without knowing how many visitors you need leads to either stopping too early (invalid results) or running too long (wasted time).
❌ Peeking at results
Checking results daily and stopping when you see significance inflates your false positive rate from 5% to 30%+. Wait until you reach your calculated sample size.
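To see why peeking is so costly, here's a minimal Monte Carlo sketch (not from this guide): it runs simulated A/A tests in which both arms share the same true conversion rate, so every "significant" result is a false positive. The 5% rate, the four interim looks, and the sample sizes are illustrative assumptions.

```python
# Simulated A/A tests: checking at several interim looks and stopping at the
# first "significant" result inflates the false positive rate well above 5%.
import random
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(7)
TRUE_RATE = 0.05                       # identical in both arms: any winner is a fluke
LOOKS = [1_000, 2_000, 3_000, 4_000]   # interim checks, in visitors per arm
RUNS = 1_000

peeking_hits = fixed_hits = 0
for _ in range(RUNS):
    a = [random.random() < TRUE_RATE for _ in range(LOOKS[-1])]
    b = [random.random() < TRUE_RATE for _ in range(LOOKS[-1])]
    p_values = [two_proportion_p_value(sum(a[:n]), n, sum(b[:n]), n) for n in LOOKS]
    if any(p < 0.05 for p in p_values):   # stop at the first significant look
        peeking_hits += 1
    if p_values[-1] < 0.05:               # single test at the planned sample size
        fixed_hits += 1

print(f"False positive rate with peeking:    {peeking_hits / RUNS:.1%}")
print(f"False positive rate at fixed sample: {fixed_hits / RUNS:.1%}")
```

Even with only four looks, the peeking rate comes out well above the nominal 5%; checking every day over a multi-week test adds many more looks and pushes it higher still.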
❌ Using unrealistic MDEs
Expecting a 50% improvement is unrealistic for most tests. A 5-20% relative improvement is more typical. Use realistic MDEs to get accurate sample size estimates.
❌ Ignoring the number of variations
Each additional variation needs its own full sample. With the same per-variation sample size, a test with five variations needs roughly 2.5x the total traffic of a two-variation A/B test, and correcting the significance level for multiple comparisons pushes that even higher.
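As a rough sketch (the 5% baseline, 6% target, and Bonferroni correction for comparing each variant against the control are illustrative assumptions, not this guide's methodology), here is how total traffic grows when you move from two arms to five:

```python
# Total traffic for a 2-arm vs. a 5-arm test at the same power, with the
# significance level split across the variant-vs-control comparisons.
from math import ceil
from statistics import NormalDist

def n_per_variation(p1, p2, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    return ceil(2 * z ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2)

p1, p2 = 0.05, 0.06                       # illustrative baseline and target rates
for arms in (2, 5):
    comparisons = arms - 1                # each variant compared with the control
    n = n_per_variation(p1, p2, alpha=0.05 / comparisons)
    print(f"{arms} arms: {n:,} per arm, {arms * n:,} visitors in total")
```

In this example the five-arm test needs roughly 3.5x the total traffic of the two-arm test once the correction is applied.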
Run Properly Powered Experiments
ExperimentHQ helps you calculate sample size automatically and alerts you when tests reach significance.