A/B Testing Best Practices

Run reliable experiments that drive real business results. This guide covers everything from hypothesis design to interpreting results correctly.

TL;DR — The 5 Golden Rules

  1. Start with a clear, written hypothesis
  2. Calculate sample size before you start
  3. Never peek at results or stop early
  4. Run for at least 1-2 full weeks
  5. Document everything for future learning

Before the Test

Start with a clear hypothesis

Write a specific hypothesis: "If we [change], then [metric] will [improve] because [reason]." This focuses your test and helps you learn regardless of outcome.

Bad: "Let's test a new button." Good: "If we change the CTA from 'Sign Up' to 'Start Free Trial', signups will increase 15% because it reduces perceived commitment."

Calculate sample size upfront

Determine how many visitors you need before starting. This prevents stopping too early (invalid results) or running too long (wasted time).

Use our sample size calculator with your baseline conversion rate and minimum detectable effect.
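
If you want to sanity-check the calculator, the same math can be sketched in Python with statsmodels. The 4% baseline and 15% minimum detectable effect below are illustrative assumptions, not recommendations.

```python
# Sketch: per-variant sample size for a two-proportion test, using
# statsmodels. Baseline and MDE are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                      # current conversion rate
mde_relative = 0.15                  # minimum detectable effect (relative lift)
target = baseline * (1 + mde_relative)

effect = abs(proportion_effectsize(baseline, target))   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,                      # 5% false positive rate (95% confidence)
    power=0.8,                       # 80% chance of detecting a real effect
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} visitors per variant")
```

With these inputs the answer comes out to roughly 18,000 visitors per variant; because sample size grows with the inverse square of the effect, halving the detectable effect roughly quadruples that number.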

Define your primary metric

Choose one primary metric to determine the winner. Secondary metrics provide context but shouldn't change your decision.

Primary metrics should directly impact business goals. Revenue > clicks.

Set a test duration

Plan to run for at least 1-2 full weeks to capture weekly patterns. Don't stop based on early results.

Account for weekday/weekend differences and any known seasonal patterns.
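
To turn a required sample size into a calendar plan, divide by your eligible daily traffic and round up to full weeks. A minimal sketch; the traffic and sample figures are hypothetical.

```python
# Sketch: convert a required sample size into a run time, rounded up to
# whole weeks so weekday/weekend patterns are captured. Figures are
# hypothetical assumptions.
import math

n_per_variant = 18_000       # from the sample size calculation (rounded)
num_arms = 2                 # control + one variant
daily_visitors = 3_000       # eligible visitors per day across all arms

days_needed = n_per_variant * num_arms / daily_visitors
weeks = max(2, math.ceil(days_needed / 7))    # enforce the full-week minimum
print(f"Plan for {weeks} full weeks ({days_needed:.0f} days of traffic)")
```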

During the Test

Don't peek at results

Checking results daily and stopping when you see significance inflates false positive rates from 5% to 30%+. Wait until you reach your planned sample size.

If you must monitor, use sequential testing methods that account for multiple looks.
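
The inflation is easy to reproduce yourself. The sketch below simulates an A/A test (both arms identical), applies a naive daily significance check as a stopping rule, and counts how often it declares a phantom winner; all traffic numbers are made up.

```python
# Sketch: simulate an A/A test (no real difference) and count how often a
# daily peek-and-stop rule declares a "winner". Numbers are made up.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, daily_n, days, sims = 0.04, 1_000, 28, 2_000
false_positives = 0

for _ in range(sims):
    a = rng.binomial(daily_n, p, days).cumsum()      # cumulative conversions, arm A
    b = rng.binomial(daily_n, p, days).cumsum()      # cumulative conversions, arm B
    n = daily_n * np.arange(1, days + 1)             # cumulative visitors per arm
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    p_vals = 2 * norm.sf(np.abs((a - b) / n / se))   # daily two-sided z-test
    if (p_vals < 0.05).any():                        # stop at the first "significant" day
        false_positives += 1

print(f"False positive rate with daily peeking: {false_positives / sims:.0%}")
```

With a single look at the planned sample size the rate would sit near the nominal 5%; with 28 daily looks it comes out several times higher.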

Never change the test mid-flight

Any modification to your variant invalidates all previous data. If you need to change something, start a new test.

This includes changing copy, design, targeting rules, or traffic allocation.

Monitor for technical issues only

It's okay to check that the test is running correctly (no errors, tracking working), but don't look at conversion data.

Set up alerts for technical issues rather than checking manually.
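
One check that is safe to automate because it never touches conversion data is comparing the observed traffic split against the planned allocation (a sample ratio mismatch check). A sketch with illustrative counts and threshold, not a description of any particular tool's feature:

```python
# Sketch: alert when the observed traffic split drifts from the planned
# 50/50 allocation. Uses assignment counts only, never conversion data.
# Counts and threshold are illustrative assumptions.
from scipy.stats import chisquare

observed = [10_400, 9_600]               # visitors assigned to control / variant
expected = [sum(observed) / 2] * 2       # planned 50/50 split

_, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:                      # conservative threshold for an alert
    print("ALERT: traffic split deviates from 50/50 -- check targeting and redirects")
```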

Avoid external interference

Don't run marketing campaigns or make other site changes that could affect your test during the experiment.

If something unavoidable happens (site outage, major news), document it for analysis.

After the Test

Wait for statistical significance

Don't declare a winner until you reach 95% confidence (or your pre-defined threshold) AND your planned sample size.

Reaching significance early doesn't mean you should stop. Wait for your full sample size.
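
Once, and only once, the planned sample size is reached, the significance check itself is a standard two-proportion test. A sketch with hypothetical end-of-test counts:

```python
# Sketch: a standard two-proportion z-test, run once after the planned
# sample size is reached. Counts are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

conversions = [742, 831]         # control, variant
visitors = [18_000, 18_000]      # matches the pre-calculated sample size

z_stat, p_value = proportions_ztest(conversions, visitors)
if p_value < 0.05:               # pre-defined 95% confidence threshold
    print(f"Significant at 95% (p = {p_value:.3f})")
else:
    print(f"Not significant (p = {p_value:.3f}) -- no winner")
```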

Consider practical significance

A statistically significant 0.1% improvement might not be worth implementing. Consider the business impact and implementation cost.

Calculate the expected revenue impact before deciding to implement.
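
A back-of-the-envelope projection is usually enough for this decision. Every input below is an assumption to replace with your own numbers.

```python
# Sketch: rough annual revenue impact of an observed lift, to weigh
# against implementation cost. All inputs are assumptions.
monthly_visitors = 90_000
control_cr = 0.0412              # observed control conversion rate
variant_cr = 0.0462              # observed variant conversion rate
revenue_per_conversion = 40.00   # average value of one conversion

extra_conversions_per_month = monthly_visitors * (variant_cr - control_cr)
annual_impact = extra_conversions_per_month * revenue_per_conversion * 12
print(f"Expected annual impact: ~${annual_impact:,.0f}")
```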

Analyze segments

Your overall result might hide important segment-level differences. Check mobile vs. desktop, new vs. returning users, etc.

Be careful of multiple comparisons — segment analysis is exploratory, not confirmatory.
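
If you do report segment results, adjust them for the number of cuts you examined. A sketch using a Holm correction (one of several standard adjustments), with made-up p-values:

```python
# Sketch: Holm correction applied to segment-level p-values so exploratory
# cuts aren't mistaken for confirmed wins. P-values are made up.
from statsmodels.stats.multitest import multipletests

segments = ["mobile", "desktop", "new users", "returning users"]
p_values = [0.03, 0.20, 0.04, 0.60]     # raw per-segment results

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for name, p_adj, significant in zip(segments, p_adjusted, reject):
    verdict = "candidate for a confirmatory test" if significant else "treat as noise"
    print(f"{name}: adjusted p = {p_adj:.2f} -> {verdict}")
```

In this made-up example two segments look significant on their raw p-values, but none survive the correction, which is exactly the trap this guideline warns about.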

Document everything

Record your hypothesis, results, learnings, and next steps. This builds institutional knowledge and prevents repeating tests.

Include screenshots of variants and any unexpected observations.
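
What you record matters less than recording it consistently. A minimal sketch of a structured test record; the field names and values are illustrative, not a required schema.

```python
# Sketch: a minimal structured test record. Field names and values are
# illustrative, not a required schema -- adapt to your own experiment log.
test_record = {
    "name": "cta-copy-start-free-trial",
    "hypothesis": "Changing the CTA from 'Sign Up' to 'Start Free Trial' will "
                  "increase signups 15% because it reduces perceived commitment.",
    "primary_metric": "signup conversion rate",
    "dates": {"start": "2024-03-04", "end": "2024-03-18"},
    "sample_size": {"planned_per_variant": 18_000, "actual_per_variant": 18_240},
    "result": {"observed_lift": "+12%", "p_value": 0.021, "decision": "ship variant"},
    "learnings": "Lower-commitment copy outperformed; try the pricing page CTA next.",
    "screenshots": ["variants/control.png", "variants/start-free-trial.png"],
    "anomalies": "None observed.",
}
```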

Common Mistakes to Avoid

Testing too many variations

Why it's a problem

Each variation splits your traffic, and every arm still needs its full per-variant sample size, so total traffic and run time scale with the number of arms: a five-arm test needs 2.5x the traffic of a two-arm test.

How to fix it

Limit to 2-3 variations. Test the biggest hypotheses first.

Testing tiny changes

Why it's a problem

Button color changes rarely move the needle. You need huge sample sizes to detect small effects.

How to fix it

Test bold changes that could have meaningful impact.
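
The power math behind this is unforgiving: required sample size grows with the inverse square of the lift you want to detect. A sketch using the same assumptions as the earlier sample size example:

```python
# Sketch: how required sample size grows as the detectable lift shrinks.
# The 4% baseline, 80% power, and 95% confidence are assumptions carried
# over from the earlier sample size example.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04
for lift in (0.02, 0.05, 0.15):                       # relative lifts
    effect = abs(proportion_effectsize(baseline, baseline * (1 + lift)))
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
    print(f"{lift:.0%} relative lift -> ~{n:,.0f} visitors per variant")
```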

Ignoring flicker

Why it's a problem

If users see the original before the variant loads, it biases results and hurts UX.

How to fix it

Use anti-flicker techniques or tools like ExperimentHQ that handle this automatically.

Running tests on low-traffic pages

Why it's a problem

Testing a page with 100 visitors/month means waiting years for results.

How to fix it

Focus on high-traffic pages. Use qualitative research for low-traffic pages.

Not accounting for novelty effect

Why it's a problem

New designs often win initially just because they're different.

How to fix it

Run tests for 2+ weeks. Monitor results over time to see if the effect persists.
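
A simple way to monitor persistence is to compare the lift week by week once the test has finished. A sketch with hypothetical weekly counts:

```python
# Sketch: compare the observed lift week by week to see whether a novelty
# effect is fading. Weekly counts are hypothetical.
weekly = [  # (control_conversions, control_n, variant_conversions, variant_n)
    (180, 4_500, 228, 4_500),
    (176, 4_500, 205, 4_500),
    (183, 4_500, 199, 4_500),
]
for week, (c_conv, c_n, v_conv, v_n) in enumerate(weekly, start=1):
    lift = (v_conv / v_n) / (c_conv / c_n) - 1
    print(f"Week {week}: lift = {lift:+.1%}")
```

A lift that shrinks every week, as in this made-up data, is a sign the initial win was partly novelty.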

Pre-Launch Checklist

Before launching any A/B test, verify:

Hypothesis documented
Sample size calculated
Primary metric defined
Test duration planned
QA completed on all variants
Tracking verified
No conflicting campaigns
Team aligned on decision criteria

Ready to Run Better Experiments?

ExperimentHQ makes it easy to follow best practices with built-in sample size calculation, no-flicker testing, and clear statistical reporting.