A/B Testing Best Practices
Run reliable experiments that drive real business results. This guide covers everything from hypothesis design to interpreting results correctly.
TL;DR — The 5 Golden Rules
1. Start with a clear, written hypothesis
2. Calculate sample size before you start
3. Never peek at results or stop early
4. Run for at least 1-2 full weeks
5. Document everything for future learning
Before the Test
Start with a clear hypothesis
Write a specific hypothesis: "If we [change], then [metric] will [improve] because [reason]." This focuses your test and helps you learn regardless of outcome.
Bad: "Let's test a new button." Good: "If we change the CTA from 'Sign Up' to 'Start Free Trial', signups will increase 15% because it reduces perceived commitment."
Calculate sample size upfront
Determine how many visitors you need before starting. This prevents stopping too early (invalid results) or running too long (wasted time).
Use our sample size calculator with your baseline conversion rate and minimum detectable effect.
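If you'd rather script the same calculation, here is a minimal sketch of the standard two-proportion power calculation using statsmodels; the baseline rate, lift, and significance settings are illustrative placeholders, not recommendations.

```python
# Per-variant sample size for a two-proportion test (standard power analysis).
# Baseline rate and minimum detectable effect below are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                     # current conversion rate (4%)
relative_mde = 0.15                 # smallest lift worth detecting (+15% relative)
target = baseline * (1 + relative_mde)

effect_size = proportion_effectsize(target, baseline)   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                     # 5% false positive rate, two-sided
    power=0.8,                      # 80% chance of detecting a real lift
    alternative="two-sided",
)
print(f"Visitors needed per variant: {n_per_variant:,.0f}")
```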
Define your primary metric
Choose one primary metric to determine the winner. Secondary metrics provide context but shouldn't change your decision.
Primary metrics should directly impact business goals. Revenue > clicks.
Set a test duration
Plan to run for at least 1-2 full weeks to capture weekly patterns. Don't stop based on early results.
Account for weekday/weekend differences and any known seasonal patterns.
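A rough way to turn a required sample size into a planned duration, assuming roughly steady traffic (all numbers here are illustrative):

```python
# Estimate test duration from required sample and daily traffic, rounded up
# to whole weeks so the test always covers full weekday/weekend cycles.
# All inputs are illustrative placeholders.
import math

n_per_variant = 9000        # e.g. the output of your sample size calculation
arms = 2                    # control + one variation
daily_visitors = 3500       # eligible visitors reaching the tested page per day

days_needed = (n_per_variant * arms) / daily_visitors
weeks = max(2, math.ceil(days_needed / 7))   # never plan for less than 2 full weeks
print(f"Plan for about {weeks} weeks ({days_needed:.1f} days of raw traffic)")
```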
During the Test
Don't peek at results
Checking results daily and stopping when you see significance inflates false positive rates from 5% to 30%+. Wait until you reach your planned sample size.
If you must monitor, use sequential testing methods that account for multiple looks.
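To make the cost of peeking concrete, here is a small simulation sketch: it runs A/A tests (no real difference between arms) and compares checking a naive z-test every day against checking once at the planned end. Traffic levels, the 28-day horizon, and the number of simulations are arbitrary assumptions.

```python
# Simulated A/A tests: peeking daily at a fixed-horizon z-test inflates the
# false positive rate well above the nominal 5%. All parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_rate = 0.04            # both arms convert at the same rate (A/A test)
daily_visitors = 1000       # per arm, per day
days = 28
simulations = 2000

peeked_fp = final_fp = 0
for _ in range(simulations):
    a = rng.binomial(daily_visitors, true_rate, size=days).cumsum()
    b = rng.binomial(daily_visitors, true_rate, size=days).cumsum()
    n = daily_visitors * np.arange(1, days + 1)

    # Two-proportion z-test at each daily "peek"
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (a / n - b / n) / se
    p_values = 2 * stats.norm.sf(np.abs(z))

    peeked_fp += (p_values < 0.05).any()    # would have stopped early on a false signal
    final_fp += p_values[-1] < 0.05         # looked only once, at the planned end

print(f"False positives when peeking daily:      {peeked_fp / simulations:.1%}")
print(f"False positives when testing once at end: {final_fp / simulations:.1%}")
```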
Never change the test mid-flight
Any modification to your variant invalidates all previous data. If you need to change something, start a new test.
This includes changing copy, design, targeting rules, or traffic allocation.
Monitor for technical issues only
It's okay to check that the test is running correctly (no errors, tracking working), but don't look at conversion data.
Set up alerts for technical issues rather than checking manually.
Avoid external interference
Don't run marketing campaigns or make other site changes that could affect your test during the experiment.
If something unavoidable happens (site outage, major news), document it for analysis.
After the Test
Wait for statistical significance
Don't declare a winner until you reach 95% confidence (or your pre-defined threshold) AND your planned sample size.
Reaching significance early doesn't mean you should stop. Wait for your full sample size.
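Once the planned sample is in, a standard two-proportion z-test is enough to check against your threshold. A minimal sketch with made-up counts, using statsmodels:

```python
# Final readout: two-proportion z-test plus confidence intervals per arm.
# Conversion counts and sample sizes below are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

conversions = [360, 414]        # control, variant
visitors = [9000, 9000]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"p-value: {p_value:.4f} -> significant at 95%: {p_value < 0.05}")

for label, conv, n in zip(["control", "variant"], conversions, visitors):
    low, high = proportion_confint(conv, n, alpha=0.05)
    print(f"{label}: {conv / n:.2%} (95% CI {low:.2%} to {high:.2%})")
```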
Consider practical significance
A statistically significant 0.1% improvement might not be worth implementing. Consider the business impact and implementation cost.
Calculate the expected revenue impact before deciding to implement.
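A back-of-envelope sketch of that revenue calculation; the traffic, rates, and order value are placeholders to swap for your own numbers:

```python
# Rough monthly revenue impact of shipping the winning variant.
# All inputs are illustrative placeholders.
monthly_visitors = 100_000
baseline_rate = 0.040            # conversion rate today
variant_rate = 0.046             # conversion rate measured in the test
revenue_per_conversion = 60.0    # average order value (or LTV proxy)

extra_conversions = monthly_visitors * (variant_rate - baseline_rate)
extra_revenue = extra_conversions * revenue_per_conversion
print(f"~{extra_conversions:.0f} extra conversions/month, "
      f"roughly ${extra_revenue:,.0f}/month before implementation cost")
```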
Analyze segments
Your overall result might hide important segment-level differences. Check mobile vs. desktop, new vs. returning users, etc.
Be careful of multiple comparisons — segment analysis is exploratory, not confirmatory.
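If you do slice results by segment, a simple Bonferroni adjustment keeps the extra comparisons from manufacturing winners. A sketch with placeholder segment p-values:

```python
# Bonferroni correction for exploratory segment analysis: divide the
# significance threshold by the number of segments examined.
# The p-values below are placeholders, not real results.
segment_p_values = {
    "mobile": 0.030,
    "desktop": 0.400,
    "new users": 0.012,
    "returning users": 0.060,
}

alpha = 0.05
adjusted_alpha = alpha / len(segment_p_values)   # 0.0125 with four segments

for segment, p in segment_p_values.items():
    verdict = "worth a confirmatory follow-up test" if p < adjusted_alpha else "likely noise"
    print(f"{segment}: p={p:.3f} -> {verdict}")
```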
Document everything
Record your hypothesis, results, learnings, and next steps. This builds institutional knowledge and prevents repeating tests.
Include screenshots of variants and any unexpected observations.
Common Mistakes to Avoid
Testing too many variations
Why it's a problem
Each variation splits your traffic, and every arm still needs the full per-variant sample size, so a test with five variations needs several times the total traffic of a simple A/B test (the short sketch below shows how this scales).
How to fix it
Limit to 2-3 variations. Test the biggest hypotheses first.
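A quick sketch of how total traffic requirements grow with the number of arms, assuming each arm needs the same per-variant sample size (the figure used is illustrative):

```python
# Total visitors required scales linearly with the number of test arms.
# The per-variant sample size is an illustrative placeholder.
n_per_variant = 9000

for arms in (2, 3, 4, 6):        # A/B, then adding more variations
    print(f"{arms} arms: {n_per_variant * arms:,} total visitors needed")
```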
Testing tiny changes
Why it's a problem
Button color changes rarely move the needle. You need huge sample sizes to detect small effects.
How to fix it
Test bold changes that could have meaningful impact.
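The sketch below shows how sharply the required sample grows as the detectable effect shrinks, using the same two-proportion power calculation as earlier (the baseline rate is an illustrative assumption):

```python
# Required per-variant sample size vs. the relative lift you want to detect.
# Baseline conversion rate is an illustrative placeholder.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04
for lift in (0.02, 0.05, 0.10, 0.20):
    effect_size = proportion_effectsize(baseline * (1 + lift), baseline)
    n = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
    print(f"{lift:.0%} relative lift: ~{n:,.0f} visitors per variant")
```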
Ignoring flicker
Why it's a problem
If users see the original before the variant loads, it biases results and hurts UX.
How to fix it
Use anti-flicker techniques or tools like ExperimentHQ that handle this automatically.
Running tests on low-traffic pages
Why it's a problem
Testing a page with 100 visitors/month means waiting years for results.
How to fix it
Focus on high-traffic pages. Use qualitative research for low-traffic pages.
Not accounting for novelty effect
Why it's a problem
New designs often win initially just because they're different.
How to fix it
Run tests for 2+ weeks. Monitor results over time to see if the effect persists.
Pre-Launch Checklist
Before launching any A/B test, verify:
- Your hypothesis is written down in the "If we [change], then [metric] will [improve] because [reason]" format
- Sample size and planned duration are calculated and documented
- One primary metric is defined; secondary metrics are noted for context only
- Tracking fires correctly and the variant renders without errors or flicker
- No marketing campaigns or other site changes are scheduled during the test window
Ready to Run Better Experiments?
ExperimentHQ makes it easy to follow best practices with built-in sample size calculation, no-flicker testing, and clear statistical reporting.