Growth Strategy · 7 min read

Open Rate Is Lying to You (How to Actually Pick an Email A/B Test Winner)

Open rates measure your subject line. Reply rates measure your email. Here's how to tell them apart, how long to run the test, and why most founders pick the wrong winner.

By Vamshi Reddy · April 15, 2026 · theKrew

The Scene Most Founders Know Too Well

You ran an A/B test last week. Variant A beat Variant B on open rate by 8 points. You declared a winner, rolled it out to the rest of your list, and waited for the replies to pour in.

They didn't.

Three weeks later you're staring at the numbers and the "winning" email is generating fewer meetings than the loser did in its small test cohort. What went wrong? You picked the wrong metric. That's the short answer. The longer answer is what this post is about.

I've watched this exact pattern play out at every business I've touched, from the trading desks I ran on Wall Street to the small business owners we now work with at theKrew. Someone runs a test, celebrates the number that's easiest to celebrate, and later finds out it had nothing to do with revenue. The cycle repeats because nobody taught them how to read the data, and running the tests manually is tedious enough that most people stop after three or four anyway.

Here's how to call a real winner in three rules, and how to make sure you actually keep doing this long enough for it to matter.

Rule 1: Open Rate Tells You About Your Subject Line. That's It.

Open rate measures one thing: how many people saw the subject line in their inbox and decided it was worth a click. That's useful. It's also not the same as "this email is working."

Picture a subject line like "Quick question about your Q2 pipeline." It might pull a 42% open rate. Great hook. Now the body of the email asks the reader to watch a 14-minute video before booking a call. Most people bail. The open rate said "great email." The behavior said otherwise.

Mailchimp's benchmarks put the average open rate around 21% across industries, with B2B cold outreach usually running 15-25%. If Variant A hits 35% and Variant B hits 25%, yes, A's subject line is winning. But that only tells you which version got more people through the door. What happens once they're inside is a separate question.

The trap: when open rates spike, founders assume the whole email is better and ship it to the rest of the list. Then they're surprised when the reply rate doesn't move or actually drops.

Before you call a winner, check whether reply rate followed the same pattern. If A won opens but B won replies, the body copy of A is losing the reader after the click. Keep B's body. Use A's subject line. That Frankenstein is the actual winner.
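The open-vs-reply mismatch check is easy to automate yourself. Here is a minimal sketch; the function name and the dict shape (`sends`, `opens`, `replies`) are illustrative, not any particular tool's API:

```python
def mismatch(a, b):
    """Flag when opens and replies disagree about which variant won.

    `a` and `b` are illustrative dicts: {"sends": ..., "opens": ..., "replies": ...}.
    """
    # Opens judge the subject line; replies judge the body copy.
    open_winner = "A" if a["opens"] / a["sends"] >= b["opens"] / b["sends"] else "B"
    reply_winner = "A" if a["replies"] / a["sends"] >= b["replies"] / b["sends"] else "B"
    if open_winner != reply_winner:
        return (f"Mismatch: take the subject line from {open_winner} "
                f"and the body copy from {reply_winner}")
    return f"No mismatch: variant {open_winner} wins both"

# A's subject hooks more opens, but B's body earns more replies
print(mismatch({"sends": 400, "opens": 168, "replies": 5},
               {"sends": 400, "opens": 112, "replies": 10}))
```

Run against the hypothetical numbers above, this flags the Frankenstein combination: A's subject line, B's body.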

This is also the first place where theKrew's Campaign Strategist agent earns its keep. Most email tools report opens and replies in separate dashboards, on separate days, with no pattern analysis. Our system pairs them automatically and flags the mismatch the moment it shows up. You don't have to notice it, and you don't have to remember to check. That's the whole point of letting a system run your experiments instead of doing it in spreadsheets at midnight.

Rule 2: Reply Rate Is the Only Metric That Becomes Meetings

Replies are the goal. Not opens. Not clicks. Replies.

A reply means someone read the email, thought about it, and decided you were worth 30 seconds of typing back. That's one step from a sales conversation. Everything else is noise you're measuring because it's easy to measure.

HubSpot's 2024 data puts healthy cold outreach reply rates at 1-5%, with anything above 5% being strong. Warm email to an engaged list should hit 8-15%. If you're below 1% on cold, the email isn't doing its job regardless of what the open rate says.

The pattern that trips people up goes like this. Variant A pulls 42% opens and 1.2% replies. Variant B pulls 28% opens and 2.4% replies. A looks better on the surface. B is better by every measure that matters for revenue.

I watched this play out at Tuple Technologies last year on a cold sequence targeting IT decision-makers. The "winning" variant by open rate produced 3 meetings out of 400 sends. The "losing" variant produced 11 meetings out of 400. Same sender, same list, different body copy. The reply rate told me the truth a week earlier than revenue did, and if I'd shipped the open-rate winner to the rest of the list the way I originally planned, I would have quietly killed a sequence that was working.

Trust replies above all else. Even when the open rate is yelling at you to pick the other one.

The reason this matters for your theKrew subscription: reply rates move slowly. They show up weeks behind open rates. Most founders cancel at month two because the big headline numbers (opens, clicks) look flat and they haven't learned to read the signal underneath. By month three, the reply-rate compounding kicks in. That's the same arc I wrote about in why marketing takes time and it applies here in miniature. Don't pull the plug before the data can talk back.

Rule 3: Small Samples Lie. Big Samples Lie Less.

Here's where most tests fall apart.

Best practice is to send each variant to at least 100 recipients before calling anything. In reality, most small business A/B tests run on 40-60 sends per variant. At that sample size, random noise can flip which variant looks like the winner by 15-20 percentage points. You're not measuring email quality. You're measuring which 50 people happened to be at their desk that hour.
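You can see the flip-the-winner effect in a quick simulation. The reply rates below are illustrative (a variant that is genuinely twice as good as the other); the point is how often small samples crown the wrong one:

```python
import random

def flip_rate(p_better, p_worse, n, trials=2000, seed=7):
    """Fraction of simulated A/B tests where the truly worse variant
    ties or beats the truly better one at sample size n per variant."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    flips = 0
    for _ in range(trials):
        better = sum(rng.random() < p_better for _ in range(n))
        worse = sum(rng.random() < p_worse for _ in range(n))
        if worse >= better:
            flips += 1
    return flips / trials

# True reply rates: 2.4% vs 1.2% -- one variant really is twice as good
small = flip_rate(0.024, 0.012, n=50)    # typical small-business test size
large = flip_rate(0.024, 0.012, n=1000)  # well-powered test
print(f"wrong winner at n=50: {small:.0%}, at n=1000: {large:.0%}")
```

At 50 sends per variant the worse email ties or wins a large share of the time; at 1,000 per variant it almost never does. Same emails, same true quality gap, wildly different reliability.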

Campaign Monitor's guidance shows that to detect a 5-point difference in open rate with 95% confidence, you need at least 385 recipients per variant. For reply rate differences, which are much smaller absolute numbers, you need even more.
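That 385-per-variant figure matches the classic margin-of-error formula, n = z²·p(1−p)/e², evaluated at the conservative worst case p = 0.5. A sketch of the arithmetic (the function name is mine, and this is the simple single-proportion version, not a full two-sample power calculation):

```python
import math

def sample_size(margin, z=1.96, p=0.5):
    """Recipients per variant to estimate a rate within +/- `margin`
    at ~95% confidence (z = 1.96), using the worst-case p = 0.5."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size(0.05))  # 5-point margin -> 385 per variant
print(sample_size(0.02))  # tighter 2-point margin -> 2401 per variant
```

Notice how fast the requirement grows as the difference you want to detect shrinks, which is exactly why reply-rate tests (small absolute numbers) need bigger lists than open-rate tests.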

What to do if your list is small:

If your results are close (within 2-3 points on reply rate), the test hasn't told you anything useful yet. Run the same experiment again next week to see if the pattern holds. If the same variant wins twice, you have signal. If it flips, you had noise.

If the results are wildly different (6+ points on reply rate), you probably have signal even with a small sample. But don't ship to your entire list right away. Roll it out to the next 250-500 recipients and confirm before fully committing.

If your list is genuinely tiny (under 500 total contacts), you have a different problem: you probably can't A/B test productively at all. Focus on qualitative feedback instead. Send your best guess, then read the replies and the rejections carefully for pattern recognition. At that size, 10 thoughtful reads beat 10 statistics tabs.
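The three rules above can be encoded as a simple decision helper. The thresholds are the illustrative ones from this post, not statistical gospel, and the function name is mine:

```python
def next_step(reply_gap_points, list_size):
    """Map a test's reply-rate gap (in percentage points) and total list
    size to the recommended next move, per the three rules above."""
    if list_size < 500:
        return "Skip A/B testing; read replies and rejections for patterns"
    if reply_gap_points >= 6:
        return "Likely signal: confirm on the next 250-500 sends before full rollout"
    if reply_gap_points <= 3:
        return "Too close to call: rerun the same test next week"
    return "Ambiguous: keep testing before committing"

print(next_step(reply_gap_points=2, list_size=3000))
print(next_step(reply_gap_points=7, list_size=3000))
print(next_step(reply_gap_points=7, list_size=400))
```

The tiny-list check comes first on purpose: a 7-point gap on a 400-person list is still better read qualitatively than shipped on faith.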

This is the second place where running this manually kills most founders. Statistical discipline is boring, and nobody has the patience to run the same test three weeks in a row to confirm signal when they're also running the business. theKrew's AI agents pool data across sequences so your 60-person test gets analyzed against the broader pattern of what's working in your campaigns, not just in isolation. The math that takes a founder three hours of spreadsheet time happens in the background while you sleep. That's what you're paying for, and it's why the system gets sharper the longer you stay subscribed. Month one's tests are decent. Month six's tests are surgical, because the data has compounded.

How to Actually Call a Winner

Most email platforms have a Statistics tab in their Sequences view. Whether you're using software or a spreadsheet, the workflow is the same:

1. Check reply rate first. If one variant is clearly winning, use its body copy.
2. Check open rate second. If the winning-body variant also won opens, ship it as-is. If a different variant won opens, combine: best subject line from one, best body from the other.
3. Check sample size. Below 100 per variant, don't call it yet. Run again or expand the test.
4. Lock the winner in your sequences. Archive the losing body copy. Write down what you think caused the difference, because that hypothesis will make your next test sharper.
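Whether it lives in a spreadsheet macro or a script, the whole workflow fits in a few lines. A sketch under the same assumptions as before (the dict shape and the 100-send threshold come from this post; everything else is illustrative):

```python
MIN_PER_VARIANT = 100  # the step-3 threshold; raise it if your stakes are higher

def call_winner(a, b, hypothesis=""):
    """Steps 1-4 above. `a`/`b`: {"sends": ..., "opens": ..., "replies": ...}."""
    # Step 3, checked first as a gate: too small a sample, don't call it.
    if min(a["sends"], b["sends"]) < MIN_PER_VARIANT:
        return {"decision": "run again or expand the test"}
    # Step 1: replies pick the body copy.
    body = "A" if a["replies"] / a["sends"] >= b["replies"] / b["sends"] else "B"
    # Step 2: opens pick the subject line.
    subject = "A" if a["opens"] / a["sends"] >= b["opens"] / b["sends"] else "B"
    # Step 4: lock it in and record the hypothesis for the next test.
    return {"decision": "ship", "subject_from": subject,
            "body_from": body, "hypothesis": hypothesis}

result = call_winner({"sends": 400, "opens": 168, "replies": 5},
                     {"sends": 400, "opens": 112, "replies": 10},
                     hypothesis="shorter ask in B's body lowered friction")
print(result)
```

The hypothesis field is the part founders skip and shouldn't: it turns a one-off test into a library of things you know about your audience.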

The part nobody tells you: your first three or four A/B tests won't teach you anything definitive. Sample sizes will be too small, differences will be within noise, and you'll flip-flop on what's working. That's normal. Testing is a skill that compounds. By test number ten you'll be reading signal much faster. By test twenty you'll stop arguing with your data. By test fifty, which most solo founders never reach because they quit running tests, you'll have a library of subject lines and body copy that converts for your specific audience.

This is the real reason theKrew's subscription pays for itself. Not because $99/month replaces an agency. Because it runs test fifteen and test thirty and test fifty while you're in client meetings or at family dinner. The compounding only works if the experiments keep running. Humans stop. Systems don't.

The Takeaway

Open rate is seductive because it's big, it moves fast, and it gives you something to celebrate. Reply rate is less fun to look at. The numbers are smaller, the validation is slower, but it's the metric that shows up in your bank account.

Pick the variant that drives replies. Trust the body over the subject line. Be patient with small samples. And if you're tired of calling winners from single unreliable tests and want a system that runs dozens of experiments in the background and actually remembers what worked three months ago, theKrew does that for $99/month. You still make the final call. You just don't do the math, and you don't quit before the compounding kicks in.

P.S. The best A/B test I ever ran at Tuple Tech was one where both variants lost. The winning version came from a third email I wrote out of frustration two days later, the day I stopped trying to be clever. Sometimes the data is telling you to start over, not pick between two bad options. The good news is that if theKrew is running your sequences, it's already writing that third email while you're still staring at the first two.

Vamshi Reddy

18 years in technology on Wall Street, founder of Tuple Technologies (managed IT & cloud services), and builder of theKrew.ai. Writes about what small businesses actually need to grow — based on a decade of building and running them.


Stay ahead with practical marketing insights

Join the waitlist and get early access to theKrew — plus the marketing playbooks that actually move the needle.
