A/B testing turns a question (does this change move the metric?) into a hypothesis test. This calculator runs a two-proportion z-test on your control and variant conversion rates, then helps you plan future tests by computing required sample sizes and test durations. It is meant for marketers, growth engineers, and product managers comparing conversion rates on landing pages, signup forms, email CTAs, checkout flows, and similar split tests.

How significance testing works

We start by assuming both variants share the same true conversion rate (the null hypothesis) and ask how unlikely your observed difference would be under that assumption. The pooled z-score measures how many standard errors separate the two observed rates; the p-value converts that z into a probability via the standard normal CDF.
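The computation described above can be sketched in a few lines of Python. This is an illustrative implementation, not the calculator's actual source; the function name is ours.

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # null hypothesis: shared rate
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se                          # standard errors of separation
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # standard normal CDF, both tails
    return z, p_value

# 1,000 visitors per variant: control converts at 10%, variant at 13%
z, p = two_proportion_z_test(100, 1000, 130, 1000)
```

For these numbers z lands a little above 2, so the difference clears the 95% bar but not the 99% bar.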

A small p-value (e.g. below 0.05) means: if both variants really converted at the same rate, you'd see a difference this big or bigger by chance less than 5% of the time. That's the threshold we call 95% statistical significance. The chart on the calculator tab shades the rejection region for your chosen confidence and marks where your observed z falls.
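The shaded rejection boundary on that chart is just the inverse normal CDF evaluated at the chosen confidence. A minimal sketch (helper name is ours, not the calculator's):

```python
from statistics import NormalDist

def critical_z(confidence, two_tailed=True):
    """z beyond which the observed statistic falls in the rejection region."""
    alpha = 1 - confidence
    tail = alpha / 2 if two_tailed else alpha  # two-tailed splits alpha across both sides
    return NormalDist().inv_cdf(1 - tail)

critical_z(0.95)  # ~1.96: the familiar two-tailed 95% threshold
critical_z(0.99)  # ~2.576
```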

Inputs that move the result

Sample size is the biggest lever. The standard error shrinks with the square root of traffic, so it takes four times the visitors to halve it; more traffic sharpens the confidence interval and makes small uplifts detectable. Effect size matters too: a 50% relative lift needs far less traffic than a 5% lift. Confidence level trades off Type I error (false positives): 95% is the marketing default, 99% is appropriate when the cost of shipping a bad variant is high. The tail choice rarely changes business decisions; keep two-tailed unless you have a pre-registered directional hypothesis.
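The interplay of these levers shows up in the textbook two-proportion sample-size formula. The sketch below uses that standard formula at 80% power; the planner's exact method may differ, and the function name is ours.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(base_rate, relative_lift, confidence=0.95, power=0.80):
    """Visitors needed per variant to detect a relative lift over the base
    conversion rate (standard two-proportion formula, two-tailed test)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # significance
    z_beta = NormalDist().inv_cdf(power)                      # power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# On a 10% base rate, a 50% relative lift needs a few hundred visitors
# per variant; a 5% lift needs tens of thousands.
sample_size_per_variant(0.10, 0.50)
sample_size_per_variant(0.10, 0.05)
```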

Pitfalls and limits

The single biggest mistake in A/B testing is peeking: checking the p-value daily and stopping as soon as it crosses 0.05. This inflates the false-positive rate dramatically. Run the test for the duration the Sample Size Planner returns, then check once. Multiple variants (A/B/C/n tests) need a multiple-comparisons correction (Bonferroni, Holm) or you'll see false winners. Novelty and primacy effects can move metrics in week one but disappear by week three; weekly cycles in traffic mean tests shorter than a full week often miss real patterns. Finally, this calculator uses a normal approximation; for very small samples (n < 30 conversions per variant) or extreme rates (< 1% or > 99%), prefer an exact test such as Fisher's.
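In practice most people reach for scipy.stats.fisher_exact for that small-sample case. To make the mechanics visible, here is a from-scratch sketch of the two-sided test using only the standard library (it sums the probabilities of all 2x2 tables with the same margins that are no more likely than the observed one):

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]
    (conversions / non-conversions for control and variant)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def prob(x):
        # Hypergeometric probability of x conversions in the first row
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)
    p_obs = prob(a)
    lo = max(0, col1 - (n - row1))
    hi = min(col1, row1)
    # Tolerance guards float ties between symmetric tables
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# A tiny test: 8/100 vs 15/100 conversions, too sparse for the z-test
fisher_exact_p(8, 92, 15, 85)
```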