EmailGuide.dev
Analytics · 6 min read

Email A/B Testing: A Statistical Approach

How to design meaningful tests, calculate sample sizes, and avoid common pitfalls.

Priya Kapoor

Growth Strategist

· January 28, 2026

The Problem With Most Email A/B Tests

Here's a scenario that plays out every day: a marketer sends subject line A to 500 people and subject line B to 500 people. A gets a 22% open rate, B gets a 24% open rate. The marketer declares B the winner and moves on. The problem? With those sample sizes and that difference, there's roughly a 40% chance the result is pure noise. The test told them nothing, but they think it did — and that's worse than not testing at all.

Meaningful A/B testing requires statistical discipline. You need a clear hypothesis, adequate sample sizes, proper test duration, and the ability to interpret results honestly — including the interpretation "this test was inconclusive."

Step 1: Form a Specific Hypothesis

Every test should start with a hypothesis that is specific and measurable. Vague hypotheses produce vague results.

Bad hypothesis: "Let's test two subject lines to see which one performs better."

Good hypothesis: "Adding the recipient's first name to the subject line will increase open rate by at least 3 percentage points compared to the non-personalized version."

A good hypothesis defines:

  • The variable being changed: What exactly is different between A and B?
  • The expected metric impact: Which metric will change, and by how much?
  • The direction: Will it increase or decrease? By how much (your minimum detectable effect)?

The minimum detectable effect (MDE) is the smallest difference you care about detecting. If a 0.5 percentage point improvement in click rate isn't meaningful to your business, don't design a test to detect it — you'll need an impractically large sample size.

Step 2: Calculate the Required Sample Size

Sample size is where most email A/B tests fall apart. Most people dramatically underestimate how many subscribers they need per variant.

The Key Variables

Sample size depends on four factors:

  • Baseline rate: Your current metric value (e.g., 3% CTR, 25% open rate).
  • Minimum detectable effect: The smallest improvement you want to detect (e.g., 1 percentage point).
  • Confidence level: Typically 95% (meaning a 5% chance of a false positive).
  • Statistical power: Typically 80% (meaning a 20% chance of missing a real effect).
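
These four factors combine through the standard two-proportion sample size formula. Here is a minimal sketch in Python (standard library only, normal approximation, two-sided test); exact figures vary between calculators depending on whether the test is one- or two-sided and which approximation is used, so treat the output as a ballpark rather than a definitive requirement:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided,
    two-proportion test (normal approximation)."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
         + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2 / mde ** 2
    return ceil(n)

# 25% baseline open rate, 3pp minimum detectable effect
print(sample_size_per_variant(0.25, 0.03))  # roughly 3,400 per variant
```

Shrinking the MDE from 3pp to 1pp pushes the requirement up by nearly a factor of ten, which is why the MDE choice dominates test feasibility.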

Sample Size Examples

Here are realistic sample sizes needed per variant (not total — per variant) for common email metrics at 95% confidence and 80% power:

Metric             | Baseline | MDE   | Sample per variant
Open rate          | 25%      | 3pp   | ~2,500
Open rate          | 25%      | 1pp   | ~22,000
Click-through rate | 3%       | 1pp   | ~4,800
Click-through rate | 3%       | 0.5pp | ~19,000
Conversion rate    | 1%       | 0.5pp | ~7,000

The numbers are humbling. Detecting a 1 percentage point change in a 3% CTR requires nearly 5,000 subscribers per variant — meaning your total test population needs to be at least 10,000. If your list is smaller than this, you either need a larger MDE or you need to accept that your tests will frequently be inconclusive.

Use an online sample size calculator (Evan Miller's is a good free option at evanmiller.org/ab-testing) to compute the exact requirement for your specific baseline and MDE.

Step 3: Determine Test Duration

Email opens and clicks don't all arrive at once. Most opens happen within the first 24 hours, but a significant tail extends for days — especially for B2B audiences who may not check email on weekends.

Rules of Thumb

  • B2C emails: Run the test for at least 7 days to capture a full weekly cycle. Weekend behavior is different from weekday behavior, and you need both in your data.
  • B2B emails: Run for at least 14 days. Business audiences check email primarily Monday-Friday, and you need at least two full business weeks for reliable data.
  • Never call a test early. Decide on your test duration before you start and commit to it. Checking results daily and stopping when you see a "winner" is called "peeking," and it dramatically inflates your false positive rate. A result that looks significant after 2 days may reverse by day 7.
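
The damage from peeking is easy to demonstrate with a quick Monte Carlo simulation. The sketch below (illustrative assumptions: 500 recipients per variant per day, a 25% true rate for both variants, a standard two-proportion z-test) compares two identical variants, so any "winner" is a false positive. Checking once at the end holds the false positive rate near 5%; checking every day and stopping at the first "win" inflates it substantially:

```python
import math
import random

def z_test_significant(conv_a, n_a, conv_b, n_b):
    """Two-sided, two-proportion z-test at ~95% confidence."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return z > 1.96

def simulate(peeks, n_per_day=500, rate=0.25, trials=400, seed=1):
    """Fraction of A/A tests (no real difference) declared 'significant'
    when results are checked after each day and stopped at the first win."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            conv_a += sum(rng.random() < rate for _ in range(n_per_day))
            conv_b += sum(rng.random() < rate for _ in range(n_per_day))
            n += n_per_day
            if z_test_significant(conv_a, n, conv_b, n):
                false_positives += 1
                break
    return false_positives / trials

print(f"check once at the end: {simulate(peeks=1):.1%} false positives")
print(f"peek daily for 7 days: {simulate(peeks=7):.1%} false positives")
```

Both variants are identical here, so every declared winner is noise; the daily-peeking rate comes out several times higher than the single-check rate.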

Step 4: Choose What to Test

Not all test variables are created equal. Prioritize tests based on the expected impact and the ease of implementation.

High Impact, Easy to Test

  • Subject lines: The highest-impact, most frequently tested variable. Subject lines affect open rates, which cascade to every downstream metric. Test length (short vs. long), tone (formal vs. casual), personalization (first name vs. no name), urgency (deadline vs. no deadline), and question vs. statement formats.
  • Send time: Test 2-hour windows (e.g., 8-10 AM vs. 12-2 PM vs. 6-8 PM) rather than specific times. The optimal send time varies significantly by audience. Some ESPs offer send-time optimization that tests this automatically.

Medium Impact, Moderate Effort

  • CTA (call to action): Test copy ("Buy Now" vs. "Shop the Sale" vs. "See Details"), color (brand color vs. contrasting color), and placement (above the fold vs. below content). CTAs directly affect click rate.
  • Content length: Short vs. long email bodies. Some audiences prefer concise emails; others engage more with detailed content. Test this with your actual audience rather than assuming.
  • Preview text: The text snippet shown next to the subject line. Often overlooked but can significantly affect open rates. Test complementary preview text (extends the subject line) vs. summarizing text (gives an overview of the email content).

Lower Impact (But Worth Testing)

  • Personalization: First name in body copy vs. generic greeting. The impact varies by industry — often smaller than people expect.
  • From name: Company name vs. person's name vs. "Name at Company." This can have a surprisingly large impact on open rates.
  • Image vs. no image: Some audiences engage more with image-heavy emails; others prefer text-forward content.

The One Variable Rule

Test only one variable at a time. If you change both the subject line and the CTA, you can't determine which change caused the difference in results. This feels slow, but it's the only way to build reliable knowledge about what works for your audience.

The exception is multivariate testing, where you intentionally test multiple variables simultaneously using a matrix of combinations. This requires significantly larger sample sizes (multiply the per-variant requirement by the number of combinations) and statistical methods like factorial analysis. For most email programs, stick with simple A/B tests until you have the volume and analytical capability for multivariate.

Step 5: Interpret Results Correctly

When your test concludes, resist the urge to simply compare the two percentages.

Check the Confidence Interval

Don't just look at the p-value or the "statistical significance" badge your ESP shows. Look at the confidence interval for the difference between the variants. If variant B's CTR is 3.5% and variant A's is 3.0%, but the 95% confidence interval for the difference is [-0.2%, +1.2%], the interval includes zero — meaning the true difference could be negative. The result is inconclusive, even though the raw percentages suggest B won.
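
To make the interval check concrete, here is a minimal sketch using the Wald (normal-approximation) confidence interval for the difference of two proportions. The counts are hypothetical, chosen to land near the numbers in the example above; many ESPs and calculators use slightly different interval methods, so treat this as an illustration:

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the difference in rates
    (variant B minus variant A), normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical test: A gets 150/5,000 clicks (3.0%), B gets 175/5,000 (3.5%)
lo, hi = diff_ci(150, 5000, 175, 5000)
print(f"95% CI for the difference: [{lo:+.2%}, {hi:+.2%}]")
if lo <= 0 <= hi:
    print("Interval includes zero: inconclusive.")
```

With 5,000 recipients per variant, a 0.5pp observed lift still produces an interval spanning zero, which is exactly the situation described above.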

Statistical Significance vs. Practical Significance

A result can be statistically significant but practically meaningless. If you test on 100,000 subscribers and find a 0.1 percentage point improvement in open rate with high confidence, that's a real effect — but it's too small to matter for your business. Always ask: "Is this difference large enough to justify changing our approach?"

Common Mistakes

  • Calling tests too early: Peeking at results before the test duration is complete and declaring a winner inflates false positive rates to 20-30% instead of the intended 5%.
  • Testing on too-small segments: If you're testing on 500 people per variant, you can only detect enormous effects (10+ percentage point differences). Most real improvements are 1-3 percentage points.
  • Not accounting for day-of-week effects: If variant A is sent on Tuesday and variant B on Wednesday, day-of-week differences will confound your results. Always send both variants simultaneously.
  • Running too many tests simultaneously: At a 5% false positive rate, roughly one in twenty tests with no real effect will still show a "significant" result by pure chance. If you run 20 tests per month, expect about one spurious win among them. Track your test volume and be appropriately skeptical of isolated wins.
  • Never testing at all: Analysis paralysis about sample sizes and statistics leads some teams to never test. Even imperfect tests build intuition over time. Start testing and improve your methodology as you go.
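
The multiple-testing point is easy to quantify. Assuming the tests are independent, the chance that at least one test with no real effect crosses the 5% significance threshold grows quickly with test volume:

```python
def p_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests with no
    real effect shows a 'significant' result by luck alone."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20):
    print(f"{n:>2} tests: {p_any_false_positive(n):.0%} chance of a spurious 'win'")
```

At 20 tests the chance of at least one spurious winner is roughly 64%, which is why an isolated positive result in a high-volume testing program deserves a replication before you act on it.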

Priya Kapoor

Growth Strategist

Growth lead who has scaled email programs from zero to millions of subscribers. Data-obsessed and allergic to vanity metrics.