A/B testing turns a question (does this change move the metric?) into a hypothesis test. This calculator runs a two-proportion z-test on your control and variant conversion rates, then helps you plan future tests by computing required sample sizes and test durations. It is meant for marketers, growth engineers, and product managers comparing conversion rates on landing pages, signup forms, email CTAs, checkout flows, and similar split tests.

How significance testing works

We start by assuming both variants share the same true conversion rate (the null hypothesis) and ask how unlikely your observed difference would be under that assumption. The pooled z-score measures how many standard errors separate the two observed rates; the p-value converts that z into a probability via the standard normal CDF.
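The computation described above can be sketched in a few lines of Python. This is an illustrative implementation, not the calculator's actual source; the function name is ours.

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # null hypothesis: shared rate
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se                          # standard errors of separation
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # standard normal CDF, both tails
    return z, p_value

# 1,000 visitors per variant: control converts at 10%, variant at 13%
z, p = two_proportion_z_test(100, 1000, 130, 1000)
```

For these numbers z lands a little above 2, so the difference clears the 95% bar but not the 99% bar.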

A small p-value (e.g. below 0.05) means: if both variants really converted at the same rate, you'd see a difference this big or bigger by chance less than 5% of the time. That's the threshold we call 95% statistical significance. The chart on the calculator tab shades the rejection region for your chosen confidence and marks where your observed z falls.
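The shaded rejection boundary on that chart is just the inverse normal CDF evaluated at the chosen confidence. A minimal sketch (helper name is ours, not the calculator's):

```python
from statistics import NormalDist

def critical_z(confidence, two_tailed=True):
    """z beyond which the observed statistic falls in the rejection region."""
    alpha = 1 - confidence
    tail = alpha / 2 if two_tailed else alpha  # two-tailed splits alpha across both sides
    return NormalDist().inv_cdf(1 - tail)

critical_z(0.95)  # ~1.96: the familiar two-tailed 95% threshold
critical_z(0.99)  # ~2.576
```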

Inputs that move the result

Sample size is the biggest lever. The standard error shrinks with the square root of traffic, so it takes four times the visitors to halve it; more traffic sharpens the confidence interval and makes small uplifts detectable. Effect size matters too: a 50% relative lift needs far less traffic than a 5% lift. Confidence level trades off Type I error (false positives): 95% is the marketing default, 99% is appropriate when the cost of shipping a bad variant is high. The tail choice rarely changes business decisions; keep two-tailed unless you have a pre-registered directional hypothesis.
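The interplay of these levers shows up in the textbook two-proportion sample-size formula. The sketch below uses that standard formula at 80% power; the planner's exact method may differ, and the function name is ours.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(base_rate, relative_lift, confidence=0.95, power=0.80):
    """Visitors needed per variant to detect a relative lift over the base
    conversion rate (standard two-proportion formula, two-tailed test)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # significance
    z_beta = NormalDist().inv_cdf(power)                      # power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# On a 10% base rate, a 50% relative lift needs a few hundred visitors
# per variant; a 5% lift needs tens of thousands.
sample_size_per_variant(0.10, 0.50)
sample_size_per_variant(0.10, 0.05)
```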

Pitfalls and limits

The single biggest mistake in A/B testing is peeking: checking the p-value daily and stopping as soon as it crosses 0.05. This inflates the false-positive rate dramatically. Run the test for the duration the Sample Size Planner returns, then check once. Multiple variants (A/B/C/n tests) need a multiple-comparisons correction (Bonferroni, Holm) or you'll see false winners. Novelty and primacy effects can move metrics in week one but disappear by week three; weekly cycles in traffic mean tests shorter than a full week often miss real patterns. Finally, this calculator uses a normal approximation; for very small samples (n < 30 conversions per variant) or extreme rates (< 1% or > 99%), prefer an exact test such as Fisher's.
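In practice most people reach for scipy.stats.fisher_exact for that small-sample case. To make the mechanics visible, here is a from-scratch sketch of the two-sided test using only the standard library (it sums the probabilities of all 2x2 tables with the same margins that are no more likely than the observed one):

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]
    (conversions / non-conversions for control and variant)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def prob(x):
        # Hypergeometric probability of x conversions in the first row
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)
    p_obs = prob(a)
    lo = max(0, col1 - (n - row1))
    hi = min(col1, row1)
    # Tolerance guards float ties between symmetric tables
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# A tiny test: 8/100 vs 15/100 conversions, too sparse for the z-test
fisher_exact_p(8, 92, 15, 85)
```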