
How to A/B Test Push Notifications (And What to Actually Measure)
A practical guide to A/B testing push notifications. Covers what to test, how to reach statistical significance, which metrics matter, and common mistakes that invalidate results.
Most teams that run push notification A/B tests are doing it wrong. They test the wrong things, declare winners too early, measure the wrong metrics, and end up with results that don't actually improve engagement. The good news is that fixing these mistakes is straightforward once you know what to look for.
This guide covers everything you need to run valid, actionable A/B tests on your push notifications — from picking what to test first, to reaching statistical significance, to translating results into real improvements in open rate and conversion.
Why A/B Test Push Notifications?
Push notifications are one of the few marketing channels where small copy changes produce measurable lift within hours. An email A/B test might take a week to reach significance. A push notification test on a mid-size app can reach significance in a single send. That speed makes testing extremely valuable — if you're willing to actually run the tests.
The business case is straightforward. If your current push notifications have a 3% open rate and a good test reveals a message format that gets 4.5%, that's a 50% improvement in one of your primary re-engagement channels. At scale, that's a meaningful difference in daily active users, session frequency, and ultimately revenue.
The compounding effect of continuous testing
A single well-designed test might improve open rates by 20-30%. Run one good test per month and the compounding effect is significant. Teams that test consistently typically see open rates 2-3× higher than the industry average within six months — not because they had a breakthrough, but because they eliminated bad practices one test at a time.
The challenge isn't running a test. Most push platforms have split-testing built in. The challenge is structuring tests so the results are actually trustworthy and actionable.
What to Test (And in What Order)
Not everything is worth testing equally. Some variables produce large, reliable lifts. Others produce small or inconsistent results that burn test traffic without teaching you much. Start with the highest-impact variables and work down.
Notification Title / Headline
The title is displayed in bold and is the first thing a user reads before deciding to tap or dismiss. It has the single largest impact on open rate of any testable variable. Test:
- Question vs. statement ("Did you miss this?" vs. "Your weekly recap is ready")
- Urgency vs. curiosity ("Last chance: offer ends tonight" vs. "Something interesting happened")
- Personalised vs. generic ("[Name], your streak is at risk" vs. "Don't break your streak")
- Short (<5 words) vs. medium (5-10 words)
Send Timing
When you send has a massive impact on open rates. The optimal time varies by app category, user timezone, and user behaviour. Don't assume industry benchmarks apply to your audience — test it directly. Good timing comparisons:
- Morning (7-9am) vs. evening (7-9pm) local time
- Weekday vs. weekend sends for the same message type
- User-local time vs. a fixed timezone
- Behaviour-triggered (within 1hr of last session) vs. scheduled
Body Copy and Emoji
The body text is often truncated, especially on lock screens, but it still affects tap rate. Test body copy length (one sentence vs. two), tone (casual vs. professional), and the presence or absence of emoji. Emoji results vary significantly by audience — never assume they help.
- With emoji in title vs. without emoji
- Emoji at start vs. emoji at end
- Specific emoji (🔥 vs. ⚡ vs. 📣) for the same message
- Descriptive body vs. teaser body (withholding information to drive tap)
Rich Media and Action Buttons
Images and action buttons (on supported platforms) add complexity and production overhead. Test these after you've established strong copy and timing baselines — otherwise you're optimising the decoration before the foundation.
Statistical Significance Made Simple
Statistical significance answers one question: how confident can you be that the difference you observed isn't just random noise? The standard threshold is 95% confidence, meaning that if the two variants actually performed identically, there would be less than a 5% chance of seeing a difference as large as the one you observed.
For push notifications, you're typically testing open rate (a binary event: opened or didn't). The math involves comparing two proportions. Most push platforms calculate this automatically. What you need to know as a practitioner:
You can trust the result when:
- ✓ p-value is below 0.05 (95% confidence)
- ✓ Both variants had the same exposure time
- ✓ Users were randomly assigned (not by segment)
- ✓ You tested one variable at a time
- ✓ The test ran to your planned sample size
Don't trust the result when:
- ✗ You stopped the test early because one variant looked better
- ✗ The sample was too small (<1,000 per variant)
- ✗ You changed the test while it was running
- ✗ Variants were sent to different segments
- ✗ The time window crossed a major holiday or event
The most common mistake is peeking at results early and stopping when one variant is ahead. This inflates false positive rates dramatically. Decide your stopping point before you launch.
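For the curious, here's a minimal sketch of the two-proportion comparison described above, using only the Python standard library. The counts are made-up placeholders, not real campaign data.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(opens_a: int, sent_a: int, opens_b: int, sent_b: int) -> float:
    """Return the two-sided p-value for the difference in open rates."""
    p_a = opens_a / sent_a
    p_b = opens_b / sent_b
    # Pooled open rate under the null hypothesis that both variants perform the same
    p_pool = (opens_a + opens_b) / (sent_a + sent_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / sent_a + 1 / sent_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical example: variant A at 3.0% open rate, variant B at 3.6%, 20,000 sends each
p_value = two_proportion_z_test(600, 20_000, 720, 20_000)
print(f"p-value: {p_value:.4f}")  # below 0.05 means the difference is unlikely to be noise
```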
How Many Users Do You Need?
Sample size depends on three things: your baseline open rate, the minimum improvement you want to be able to detect (called the minimum detectable effect or MDE), and your significance level. As a practical guide for push notifications:
| Baseline Open Rate | Detect a 20% relative lift | Detect a 10% relative lift | Detect a 5% relative lift |
|---|---|---|---|
| 2% | ~17,000 / variant | ~67,000 / variant | ~268,000 / variant |
| 5% | ~6,500 / variant | ~26,000 / variant | ~104,000 / variant |
| 10% | ~3,000 / variant | ~12,000 / variant | ~48,000 / variant |
| 20% | ~1,300 / variant | ~5,200 / variant | ~20,800 / variant |
If your audience is too small to reach significance
Focus on testing large, obvious differences (not subtle copy tweaks) and accumulate results across multiple sends of the same test. Treat each test as directional evidence rather than a definitive result, and only act when you see a consistent pattern across 3+ tests.
Most push platforms calculate sample size for you. If yours doesn't, use a free A/B test sample size calculator (search "Evan Miller sample size calculator" for a widely used option).
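If you'd rather sanity-check the table yourself, here's a minimal sketch of the standard two-proportion sample-size formula, assuming a two-sided test at 95% confidence and 80% power. Exact figures will differ somewhat from the table above and from any particular calculator, depending on the assumptions each one uses.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect a relative lift in open rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# e.g. 5% baseline open rate, detecting a 20% relative lift (5% -> 6%)
print(sample_size_per_variant(0.05, 0.20))  # roughly 8,200 with these assumptions
```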
Metrics That Actually Matter
Don't just look at open rate. It's the most available metric but not always the most important one. Here's the hierarchy of metrics to track and how to interpret each:
Open Rate (Direct Open Rate)
The percentage of delivered notifications that resulted in the app opening via the notification. Good baseline metric. Measures attention and immediate appeal.
Watch for: iOS vs Android differences (iOS requires explicit opt-in, so its opted-in audience is self-selected and open rates typically skew higher)
Click-Through Rate (CTR)
The percentage of users who tapped through to a specific in-app destination from the notification. More meaningful than open rate for conversion-focused campaigns.
Watch for: Users who open the app from the icon rather than the notification itself
Conversion Rate
The percentage of users who completed the desired action (purchase, subscription, feature use) within a defined attribution window after receiving the notification. The true north-star metric for ROI.
Watch for: Attribution windows vary — 1hr, 24hr, and 7-day windows give very different numbers
Opt-Out Rate
The percentage of users who disabled notifications after receiving this campaign. An often-ignored metric that's crucial for long-term health. A campaign with a high open rate but elevated opt-outs is destroying your future reach.
Alert threshold: if opt-out rate is >0.5% on a single campaign, investigate immediately
Session Length / Feature Engagement
Advanced metric. Measures whether the notification brought users back into a meaningful session or just a quick open-and-close. Useful for evaluating deep-link notifications vs. generic re-engagement sends.
Requires custom event tracking and a longer measurement window (24-48hr)
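To make the hierarchy concrete, here's a minimal sketch of how these metrics could be computed from per-campaign counts. The field names and figures are illustrative, not a real analytics schema.

```python
from dataclasses import dataclass

@dataclass
class CampaignStats:
    delivered: int
    direct_opens: int        # app opened via the notification itself
    clicks: int              # taps that reached the deep-linked in-app destination
    conversions_24h: int     # desired action within a 24-hour attribution window
    opt_outs: int            # users who disabled notifications after this send

def summarise(stats: CampaignStats) -> dict:
    return {
        "open_rate": stats.direct_opens / stats.delivered,
        "ctr": stats.clicks / stats.delivered,
        "conversion_rate_24h": stats.conversions_24h / stats.delivered,
        "opt_out_rate": stats.opt_outs / stats.delivered,
    }

summary = summarise(CampaignStats(delivered=50_000, direct_opens=2_100,
                                  clicks=1_400, conversions_24h=310, opt_outs=120))
# opt_out_rate here is 0.24%, below the 0.5% alert threshold mentioned above
print(summary)
```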
What Not to Test (Yet)
Testing everything simultaneously is a good way to learn nothing. Some variables require prerequisites before they're worth testing:
Don't test yet if you haven't established:
- Personalisation — test only after you have reliable user data. Bad personalisation ("Hi !" due to missing name) performs worse than generic copy.
- Rich media — adds production overhead. Only worth testing after copy and timing are optimised.
- Notification frequency — frequency tests require a multi-week window and cohort tracking. Don't run these until you have robust analytics.
- Multivariate tests — testing title + timing + emoji simultaneously requires a very large audience. Start with single-variable tests.
Also avoid:
- Testing during anomalous periods — major app updates, seasonal spikes, and marketing campaigns all distort results.
- Testing on your most valuable segment — run initial tests on a broad segment before exposing your highest-LTV users to experimental content.
- Testing too many variants — two variants is almost always optimal. Three or more require proportionally more sample size and extend test duration.
Common A/B Testing Mistakes
These are the mistakes that produce invalid results most often. They're surprisingly common even in teams that have been running push notifications for years.
1. Peeking and stopping early
The most damaging mistake. When you stop a test early because one variant looks ahead, you're exploiting random variation, not detecting a real effect. False positive rates can reach 30-40% from early stopping, even if you check at 95% confidence. Set your sample size before launch and don't check results until you've hit it.
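The inflation is easy to demonstrate with a small simulation: both variants below share an identical true open rate, yet checking for significance at several interim points "finds" a winner far more often than the nominal 5%. A rough sketch:

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(opens_a, n_a, opens_b, n_b):
    p_pool = (opens_a + opens_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (opens_a / n_a - opens_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

TRUE_RATE = 0.05          # both variants identical: any "winner" is a false positive
FINAL_N = 10_000          # planned sample size per variant
CHECKPOINTS = [1_000, 2_500, 5_000, 7_500, 10_000]
SIMULATIONS = 500

false_positives_peeking = 0
false_positives_waiting = 0
for _ in range(SIMULATIONS):
    a = [random.random() < TRUE_RATE for _ in range(FINAL_N)]
    b = [random.random() < TRUE_RATE for _ in range(FINAL_N)]
    # Peeking: declare a winner the first time any interim check crosses p < 0.05
    if any(p_value(sum(a[:n]), n, sum(b[:n]), n) < 0.05 for n in CHECKPOINTS):
        false_positives_peeking += 1
    # Waiting: check only once, at the planned sample size
    if p_value(sum(a), FINAL_N, sum(b), FINAL_N) < 0.05:
        false_positives_waiting += 1

print(f"peeking: {false_positives_peeking / SIMULATIONS:.1%}")   # typically well above 5%
print(f"waiting: {false_positives_waiting / SIMULATIONS:.1%}")   # close to the nominal 5%
```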
2. Testing on different segments by accident
If variant A goes to Android users and variant B goes to iOS users, you're not testing your copy — you're testing platforms. Random assignment must be at the individual user level, not the device type, geography, or cohort level.
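One simple way to guarantee user-level assignment regardless of platform, geography, or cohort is to hash a stable user ID together with the test name. This is a generic sketch, not a description of any particular platform's internals:

```python
import hashlib

def assign_variant(user_id: str, test_name: str, variants: tuple = ("A", "B")) -> str:
    """Deterministically assign a user to a variant, independent of device or segment."""
    # Including the test name means the same user can land in different buckets
    # across different tests, so assignments aren't correlated between experiments.
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user_48210", "title_urgency_vs_curiosity"))
```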
3. Changing the test mid-run
Editing the notification copy, adjusting the send schedule, or changing the attribution window after the test starts invalidates all prior data. If you spot an error, end the test, fix it, and start fresh.
4. Optimising for open rate when conversion is the goal
Clickbait titles reliably increase open rate. They do not reliably increase conversions, and they often increase opt-out rates. Always include a downstream metric (at minimum, a session quality metric) alongside open rate when declaring a winner.
5. Not keeping a test log
Test results are only valuable if you remember what you tested, when, against what baseline, and what you concluded. Without a log, you'll repeat the same tests and lose institutional knowledge when team members leave. Even a simple spreadsheet is better than nothing.
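A test log needs no special tooling; appending a row per completed test to a CSV is enough to preserve the knowledge. A minimal sketch with illustrative column names:

```python
import csv
from datetime import date
from pathlib import Path

LOG_PATH = Path("push_ab_test_log.csv")
FIELDS = ["date", "campaign_type", "variable_tested", "variant_a", "variant_b",
          "sample_per_variant", "winner", "p_value", "relative_lift", "conclusion"]

def log_test(row: dict) -> None:
    """Append one completed test to the shared CSV log, writing a header if needed."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

log_test({
    "date": date.today().isoformat(),
    "campaign_type": "re-engagement",
    "variable_tested": "title",
    "variant_a": "Your weekly recap is ready",
    "variant_b": "Did you miss this?",
    "sample_per_variant": 12000,
    "winner": "B",
    "p_value": 0.03,
    "relative_lift": 0.18,
    "conclusion": "Question titles outperform statements for weekly recap pushes",
})
```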
Running Tests with PushPilot
PushPilot has built-in A/B testing with automatic winner selection and significance tracking. The workflow:
1. Create a campaign and enable A/B test mode. Select the variable you're testing (title, body, send time) and enter your variants.
2. Set your test audience and sample split. Choose what percentage of your audience to include in the test. PushPilot randomises assignment automatically at the user level.
3. Set your stopping criterion. Choose between a time-based stop (run for N hours), a sample-size stop, or automatic winner detection at 95% confidence.
4. Review results and send the winner to the remainder. Once the test completes, PushPilot shows confidence intervals and a recommended winner. Optionally auto-send the winner to the remaining audience.
5. Log and apply learnings. Export results to your test log. Apply the winning pattern as a default template for future campaigns of the same type.
AI-generated variants in PushPilot
PushPilot can generate multiple notification variants automatically based on your campaign brief. Instead of manually writing 2-3 title variants, you describe the campaign goal and the AI produces a set of diverse variants — optimised for different psychological triggers (urgency, curiosity, social proof, benefit-led) — ready to test directly.
How to Act on Your Results
Winning a test is only useful if you apply the learning. A structured approach:
- Clear win (p < 0.05, meaningful effect size of ≥15% lift): apply the pattern to all future campaigns of this type and update your template library.
- Inconclusive (p > 0.05, or effect size under 10%): neither variant is better. Re-test with a more dramatic difference between variants, or accept the null.
- Unexpected loss (the variant you expected to win lost): don't dismiss it; this is valuable signal. Investigate why and check whether the result holds across segments.
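If it helps to make the decision rule explicit, here's a minimal sketch that encodes the thresholds above (the cut-offs are this article's, not universal constants):

```python
def recommend(p_value: float, relative_lift: float) -> str:
    """Translate a finished test into a next step, using the thresholds above."""
    if p_value < 0.05 and relative_lift >= 0.15:
        return "Clear win: roll the pattern into your template library."
    if p_value >= 0.05 or abs(relative_lift) < 0.10:
        return "Inconclusive: re-test with a bolder difference, or accept the null."
    # Anything else (including a significant loss for the expected winner)
    # deserves investigation before you generalise.
    return "Investigate: check whether the result holds across segments."

print(recommend(p_value=0.02, relative_lift=0.22))  # Clear win
```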
Resist the temptation to over-generalise. A title format that wins for your re-engagement campaigns may not win for your transactional notifications. Segment your test log by notification type (re-engagement, transactional, promotional, onboarding) and maintain separate learnings for each.
The most valuable output of a consistent testing programme isn't any single winning variant — it's the accumulated knowledge of what works for your specific audience. That knowledge compounds and becomes a durable competitive advantage that new entrants can't replicate quickly.
FAQs
How long should a push notification A/B test run?
Until you reach your pre-set sample size, not a fixed time. For most push campaigns, this happens within 24-48 hours of sending. Don't base stopping on time elapsed — base it on sample size reached.
Can I test more than two variants at once?
Yes, but each additional variant multiplies the required sample size. Three variants need roughly 1.5× the total sample of two (one extra group to fill, plus a multiple-comparison correction on top). Unless you have a large audience, stick to two variants.
My open rates are improving but conversions aren't. Why?
The winning title is attracting taps from users who aren't actually interested in converting — a classic relevance mismatch. Audit whether the notification content accurately represents the in-app experience. Mismatched expectations (the notification says one thing, the app shows another) are the most common cause of high open rates with low conversion.
Should I A/B test every notification I send?
No. Test notification templates and campaign types, not individual one-off sends. Once you establish strong patterns for a template (e.g., your weekly digest push), apply the winning format systematically. Reserve testing budget for new campaign types, major copy changes, or when you see a significant drop in performance from an established template.
What's a good open rate benchmark for push notifications in 2026?
Industry averages range from 2-5% for broad campaigns to 8-15% for well-segmented, behaviour-triggered notifications. Opt-in rates have declined on Android since the introduction of the runtime permission requirement in Android 13, so many apps now have a smaller but more engaged opted-in audience. Don't benchmark against cross-industry averages — benchmark against your own historical baseline.
Ready to automate your push notifications?
Start sending AI-powered push notifications in under 5 minutes. Free plan available — no credit card required.
Get Started Free