Test Setup

Enter your test statistic, α level, and tail direction, then calculate. The results panel reports the p-value, critical value, decision, and Cohen's d effect size; the sample size entered here is also used for effect-size and power estimation.

For a two-tailed test, p = 2 × (1 − Φ(|z|)); reject H₀ if p < α.

Z-Score to P-Value Reference Table

One-tailed and two-tailed p-values for z-scores 0.0 to 3.5 (step 0.1). Columns: Z-Score, Right-tail p, Two-tail p, Significant at α = 0.05?

P-Value Decision Guide

P-Value Range      Evidence Against H₀   Action
p > 0.10           Little or none        Fail to reject H₀
0.05 < p ≤ 0.10    Weak / marginal       Inconclusive — gather more data
0.01 < p ≤ 0.05    Moderate              Reject H₀ at 5% level
0.001 < p ≤ 0.01   Strong                Reject H₀ at 1% level
p ≤ 0.001          Very strong           Reject H₀ at 0.1% level

Multiple Comparisons — Bonferroni Correction

Running k tests simultaneously inflates the Type I error rate. The Bonferroni correction divides your α threshold by the number of tests to control the family-wise error rate.

Example: with k = 5 tests, adjusted α = 0.05 / 5 = 0.0100

Each individual test should be evaluated against the adjusted α to maintain overall significance.
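As a sketch in Python, the correction is a one-liner applied to each test; the p-values below are hypothetical:

```python
def bonferroni_alpha(alpha: float, k: int) -> float:
    """Family-wise threshold: each of k tests is judged at alpha / k."""
    return alpha / k

# With 5 simultaneous tests at a family-wise alpha of 0.05:
adjusted = bonferroni_alpha(0.05, 5)          # 0.0100
p_values = [0.003, 0.02, 0.04, 0.011, 0.30]   # hypothetical test results
significant = [p < adjusted for p in p_values]
```

Only the first test (p = 0.003) clears the adjusted threshold, even though four of the five would be significant at the unadjusted α = 0.05.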

Power Calculator

Enter the expected effect size, sample size, and α to compute statistical power.

Effect size is Cohen's d: |mean difference| / pooled SD (small = 0.2, medium = 0.5, large = 0.8). Sample size is the total n, or the per-group n for two-sample tests.

Results: statistical power 1 − β (the probability of detecting the effect), the Type II error rate β, the sample sizes needed for 80% and 90% power, and the minimum detectable d at the chosen α level.

Power vs Sample Size

The calculator plots how statistical power increases as sample size grows; dashed lines mark the 80% and 90% power thresholds.
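The power curve can be reproduced with the standard library alone, using the usual normal approximation for a two-sided one-sample z-test (function names here are illustrative, not the calculator's internals):

```python
from math import ceil, sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal

def z_test_power(d: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test for effect size d and sample size n."""
    z_crit = _N.inv_cdf(1 - alpha / 2)   # e.g. 1.96 at alpha = 0.05
    shift = d * sqrt(n)                  # noncentrality of the test statistic
    return _N.cdf(shift - z_crit) + _N.cdf(-shift - z_crit)

def n_for_power(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Smallest n reaching the target power (normal approximation)."""
    z_crit = _N.inv_cdf(1 - alpha / 2)
    z_pow = _N.inv_cdf(power)
    return ceil(((z_crit + z_pow) / d) ** 2)

# A medium effect (d = 0.5) needs about 32 subjects for 80% power.
```

This matches the familiar rule of thumb n ≈ ((z_{α/2} + z_power) / d)², rounded up.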

Understanding Type I & Type II Errors

Type I Error (α)

False Positive

Rejecting H₀ when it is actually true. Probability = α (your significance level). Also called a "false alarm."

Controlled by choosing a smaller α (e.g. 0.01 instead of 0.05).

Type II Error (β)

False Negative

Failing to reject H₀ when it is actually false. Probability = β (1 − Power). Also called a "miss."

Reduced by increasing sample size or effect size.

Power (1−β)

True Positive Rate

The probability of correctly detecting a real effect. Most studies aim for ≥80% power before data collection.

Increases with larger n, larger effect size, or higher α.

How to Use This Calculator

1. Choose Your Test Type

Select one-sample Z, two-sample Z, or T-test. Enter the test statistic from your analysis. For T-tests, also enter degrees of freedom.

2. Set Tail Direction & α

Choose left-tailed, right-tailed, or two-tailed based on your alternative hypothesis. Select your significance level (0.05 is standard).

3. Read the Results

The calculator shows the p-value, decision (reject/fail to reject H₀), effect size, and a live bell curve with the shaded p-value region.

Formula & Methodology

Right-Tailed P-Value

p = P(Z ≥ z) = 1 − Φ(z)

The area in the right tail beyond the test statistic. Use when H₁: μ > μ₀.

Left-Tailed P-Value

p = P(Z ≤ z) = Φ(z)

The area in the left tail up to the test statistic. Use when H₁: μ < μ₀.

Two-Tailed P-Value

p = 2 × P(Z ≥ |z|) = 2 × (1 − Φ(|z|))

Twice the one-tailed probability; tests for any difference from H₀ regardless of direction. Most common in practice.
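The three z-based formulas can be checked with a short script; Φ is available in Python's standard library via statistics.NormalDist (the helper name is mine):

```python
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal CDF, Φ

def p_value(z: float, tail: str = "two") -> float:
    """P-value for a z statistic: 'left', 'right', or 'two'-tailed."""
    if tail == "left":
        return phi(z)                 # P(Z <= z)
    if tail == "right":
        return 1 - phi(z)             # P(Z >= z)
    return 2 * (1 - phi(abs(z)))      # 2 * P(Z >= |z|)

# z = 2.15 two-tailed -> p ≈ 0.0316; z = 1.50 right-tailed -> p ≈ 0.0668
```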

T-Distribution (small samples)

p (two-tailed) = I_x(df/2, 1/2), where x = df/(df + t²)

For T-tests with df < 200, uses the regularized incomplete beta function for exact p-values (heavier tails than normal).
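Assuming SciPy is available, the incomplete-beta route can be sketched directly: scipy.special.betainc computes the regularized I_x(a, b), and the identity above yields the two-tailed p in one line (function name mine):

```python
from scipy.special import betainc  # regularized incomplete beta I_x(a, b)

def t_p_two_tailed(t: float, df: int) -> float:
    """Exact two-tailed p-value for a t statistic with df degrees of freedom."""
    x = df / (df + t * t)
    return betainc(df / 2, 0.5, x)

# t = 2.57, df = 25 -> p ≈ 0.017 (the psychology example below)
```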

Key Terms

P-Value
The probability of obtaining a test statistic as extreme as the one observed, given that H₀ is true.
Null Hypothesis (H₀)
The default assumption of no effect or no difference in the population.
Alternative Hypothesis (H₁)
The claim being tested; asserts that an effect or difference exists.
Significance Level (α)
The threshold below which p leads to rejection of H₀; commonly 0.05.
Type I Error
Rejecting a true null hypothesis (false positive). Probability = α.
Type II Error (β)
Failing to reject a false null hypothesis (false negative). Probability = 1 − power.
Cohen's d
Standardised effect size: |mean difference| / pooled SD. Small ≈ 0.2, medium ≈ 0.5, large ≈ 0.8.
Statistical Power
Probability of correctly rejecting a false H₀ = 1 − β. Target ≥ 0.80.
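Under the pooled-SD definition in the glossary, Cohen's d can be computed from two raw samples; a stdlib sketch (the sample data is made up):

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """|mean difference| / pooled SD for two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return abs(mean(a) - mean(b)) / sqrt(pooled_var)

group_a = [5.1, 4.8, 5.5, 5.0, 4.9]   # hypothetical measurements
group_b = [4.2, 4.6, 4.4, 4.1, 4.5]
```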

Real-World Examples

Example 1 — Drug Trial

Does a new drug lower blood pressure?

z = 2.15, two-tailed, α = 0.05, n = 50

p = 0.0316 → Significant. Reject H₀. d ≈ 0.30 (small-medium effect).

Example 2 — A/B Test

Does version B of a landing page convert better?

z = 1.50, right-tailed, α = 0.05, n = 200

p = 0.0668 → Not significant. Fail to reject H₀. Need more data.

Example 3 — Psychology Study

Does a mindfulness intervention reduce anxiety? (T-test)

t = 2.57, df = 25, two-tailed, α = 0.05

p ≈ 0.017 → Significant. Reject H₀. Exact t-distribution used.

P-Values: Interpreting Statistical Significance

What a P-Value Is (and Is Not)

A p-value is the probability of seeing data as extreme as yours if the null hypothesis were true. It is not the probability that the null hypothesis is true. A p-value of 0.03 means: if H₀ were true, there is only a 3% chance of seeing a result this extreme — that's all. It says nothing about whether H₀ is actually correct.

The Replication Crisis & P-Hacking

Over-reliance on the p < 0.05 threshold has contributed to irreproducible research. "P-hacking" — running multiple analyses until something is significant — is a real problem. Modern best practices include pre-registering hypotheses, reporting effect sizes alongside p-values, using confidence intervals, and adopting Bayesian methods where appropriate.

Effect Size Is As Important As P-Value

With large enough samples, nearly any difference becomes statistically significant. A p-value of 0.0001 for a Cohen's d of 0.05 means you have a real but trivially small effect. Always pair p-values with effect size estimates (Cohen's d, r², η²) and confidence intervals for a complete picture.

Frequently Asked Questions

What is a p-value?

A p-value is the probability of observing a test statistic as extreme as yours (or more extreme), given that the null hypothesis is true. It does NOT tell you the probability that the null hypothesis is correct — that's a common and important misconception.

When should I use a one-tailed vs two-tailed test?

Use a two-tailed test when you're testing for any difference regardless of direction (H₁: μ ≠ μ₀) — this is the most common and conservative choice. Use a one-tailed test only when you have a strong pre-specified directional hypothesis (e.g. H₁: μ > μ₀) before seeing the data. Never choose one-tailed to get a smaller p-value after seeing results.

What is the difference between a Z-test and a T-test?

A Z-test is used when the population standard deviation is known or when sample sizes are large (n > 30). A T-test is used when the population SD is unknown and must be estimated from the data — which is nearly always the case in practice. The t-distribution has heavier tails than the normal distribution, especially for small df, which leads to larger (more conservative) p-values.

Can p > 0.05 prove the null hypothesis?

No. A large p-value means you lack sufficient evidence to reject H₀ — not that H₀ is proven true. The study may have been underpowered (too small n), or the effect may be real but smaller than detectable. Equivalence tests (TOST) or Bayesian methods are needed to formally support the null.

What is the Bonferroni correction and when should I use it?

When you run multiple hypothesis tests, each at α=0.05, the probability of at least one false positive grows. With 20 tests, you'd expect one false positive by chance. The Bonferroni correction sets the threshold to α/k (e.g. 0.05/5 = 0.01 for 5 tests). Use it when tests are independent and you want to control the family-wise error rate. For correlated tests, consider FDR (Benjamini-Hochberg) instead.
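The Benjamini-Hochberg step-up procedure mentioned above fits in a few lines (function name mine; the p-values are hypothetical):

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """FDR control: reject the k smallest p-values, where k is the largest
    rank i (1-based, sorted ascending) with p_(i) <= i * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest:
            reject[idx] = True
    return reject

# Four hypothetical p-values: Bonferroni (0.05/4 = 0.0125) rejects only the
# first two, while BH rejects all four.
pvals = [0.001, 0.012, 0.014, 0.040]
```

Because BH compares each ordered p-value against a rising threshold i·α/m rather than the flat α/k, it is less conservative than Bonferroni when several tests show effects.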

How do I interpret statistical power?

Power (1−β) is the probability of correctly detecting a real effect. Power of 0.80 means an 80% chance of a significant result if the effect is real. Underpowered studies often fail to replicate. Use the Power Analysis tab to determine the required sample size for your expected effect size before collecting data.

What is Cohen's d and how do I interpret it?

Cohen's d is a standardized effect size: d = |μ₁ − μ₂| / σ_pooled. Rules of thumb: d ≈ 0.2 is small, d ≈ 0.5 is medium, d ≈ 0.8 is large. Context matters — a small d may be practically important in medicine (e.g. mortality reduction) but trivial in psychology. Always interpret effect sizes within your domain.

What is a critical value?

The critical value is the threshold your test statistic must exceed to reject H₀ at a given α. For a two-tailed z-test at α=0.05, the critical values are ±1.96. If |z| > 1.96, you reject H₀. Equivalently, p < α ⟺ |z| > z_critical.
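The equivalence p < α ⟺ |z| > z_critical can be verified directly with the standard library (helper name mine):

```python
from statistics import NormalDist

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value, ≈ 1.96

def two_tailed_p(z: float) -> float:
    return 2 * (1 - NormalDist().cdf(abs(z)))

# two_tailed_p(2.0) ≈ 0.0455 < 0.05, and |2.0| > 1.96 — both sides agree.
```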