Hypothesis testing is the formal procedure scientists use to decide whether observed effects in sample data are real or just sampling noise. The five tests this calculator supports — one-sample t, two-sample (Welch) t, paired t, chi-square goodness-of-fit, and chi-square independence — cover the vast majority of introductory and applied statistics use cases. Understanding when to use which test, what the p-value actually means, and why effect size matters alongside significance separates competent statistical practice from cargo-cult ritual.
Choosing the Right Test
Three questions select the test for you. First, are you comparing means (continuous numeric data) or counts (categorical data)? Means → t-test family; counts → chi-square family. Second, how many groups? One sample compared to a known reference → one-sample t. Two independent groups → two-sample Welch t. Two related measurements on the same subjects (before/after, twin pairs) → paired t. Third, what's the structure of the categorical data? One variable with k categories → chi-square goodness-of-fit. Two categorical variables cross-tabulated in an r×c contingency table → chi-square independence. The flowchart is mechanical once you internalize the questions. Common student errors include running a t-test on percentages (which are proportions, not means — you usually want chi-square or a proportion test) and running a two-sample test on paired data (which ignores the pairing and loses statistical power for no benefit).
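The three-question flowchart can be sketched as a small helper function. This is an illustrative sketch, not part of the calculator's actual interface — the function name and argument spellings are invented here:

```python
def choose_test(data_type, groups=2, paired=False, variables=1):
    """Map the three selection questions to one of the five tests.

    data_type: "means" (continuous numeric) or "counts" (categorical).
    groups / paired apply to the t-test family; variables to chi-square.
    """
    if data_type == "means":
        if groups == 1:
            return "one-sample t"            # compare to a known reference
        return "paired t" if paired else "two-sample Welch t"
    if data_type == "counts":
        # one variable with k categories vs. two cross-tabulated variables
        return ("chi-square goodness-of-fit" if variables == 1
                else "chi-square independence")
    raise ValueError("data_type must be 'means' or 'counts'")

print(choose_test("means", paired=True))      # before/after measurements
print(choose_test("counts", variables=2))     # contingency table
```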
p-values Don't Mean What People Think
The single most common interpretive error in statistics is treating the p-value as 'the probability that H₀ is true' or 'the probability that the result is due to chance.' Neither is correct. The p-value is the probability of observing data at least as extreme as yours, assuming H₀ is true. It is conditional on the null being true, not informative about whether the null is true. Small p-values constitute evidence against H₀ — strong evidence the data are unlikely under the null — but they do not quantify how likely H₀ is. A p of 0.04 is not 'a 4% chance the null is right'; it's 'a 4% chance of seeing data this extreme if the null is right.' This is why modern statistics emphasizes effect sizes alongside p-values: a tiny but statistically significant difference (p = 0.001, d = 0.05) is real but trivially small; a large but non-significant difference (p = 0.12, d = 0.6) is suggestive but under-powered. Report both numbers.
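Reporting both numbers is easy because the effect size comes straight from the sample statistics. A minimal stdlib sketch of Cohen's d for two independent samples, using the standard pooled-SD formula (the d = 0.05 / d = 0.6 figures above are illustrative thresholds, not outputs of this code):

```python
from statistics import mean, stdev
from math import sqrt

def cohens_d(a, b):
    """Cohen's d: standardized mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# A 2-point mean shift against a pooled SD of ~1.58 is a large effect (|d| > 0.8).
print(round(cohens_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]), 2))
```

The sign of d just encodes direction; report |d| alongside the p-value so readers can judge whether a significant difference is also a meaningful one.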
Why Welch's t Beats Student's t
Most introductory textbooks still present 'Student's pooled t-test' as the default two-sample comparison, with Welch's adjustment shown as an exception case for 'unequal variances.' Modern statistics has reversed this advice: Welch's t-test should be the default because the assumption of equal variances (the homoscedasticity assumption that Student's pooled test relies on) is rarely met in practice, and the pooled test gives inflated false-positive rates when it's violated. Welch's adjustment uses the Welch–Satterthwaite approximation for df and computes the standard error without pooling, producing nearly-correct Type I error rates regardless of whether the variances are equal. The R language made Welch's the default in t.test() for this reason; SciPy exposes it via ttest_ind(equal_var=False), though its default remains the pooled test, so you must opt in. Unless you have a specific theoretical reason to pool variances, use Welch's — there is essentially no cost when variances are equal, and a real benefit when they aren't.
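The unpooled standard error and the Welch–Satterthwaite df can be computed directly from the two sample variances. A stdlib-only sketch (converting t and df into a p-value still requires a t-distribution CDF, e.g. scipy.stats.t.sf):

```python
from statistics import mean, variance
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic and Welch–Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)     # unpooled sample variances
    se2 = va / na + vb / nb               # squared standard error, no pooling
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch–Satterthwaite approximation: df falls between min(n)-1 and na+nb-2.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), round(df, 2))  # df < 8 because the variances differ (2.5 vs 10)
```

Note the fractional df: unlike the pooled test's fixed na + nb − 2, Welch's df shrinks toward the smaller-sample side as the variances diverge, which is exactly what keeps the Type I error rate near nominal.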
Chi-Square Caveats
The chi-square approximation relies on expected cell counts being large enough that the discrete sampling distribution of counts is well-approximated by the continuous chi-square distribution. The standard rule: every expected cell count should be at least 5; some references relax this to 'at least 80% of cells ≥ 5 and none below 1.' Sparse contingency tables (expected counts below 5 in multiple cells) need either Fisher's exact test (for 2×2) or a Monte Carlo permutation test (for larger tables) — applying chi-square anyway gives unreliable p-values. The calculator flags violations of the expected-count rule with a warning; treat the resulting p-value as approximate when it appears. For 2×2 tables specifically, Yates' continuity correction is sometimes applied to slightly improve the approximation, though modern practice often prefers Fisher's exact test for any 2×2 case where computational cost permits.
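The expected counts the rule refers to come from the row and column totals, so the check is easy to reproduce by hand. A minimal sketch of the independence statistic plus a low-expected-count flag, assuming the plain threshold-of-5 version of the rule (the calculator's own warning logic may differ):

```python
def chi2_independence(table):
    """Chi-square statistic for an r x c table, plus a count of low-expected cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2, low_cells = 0.0, 0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: (row total * column total) / grand total.
            expected = row_totals[i] * col_totals[j] / total
            if expected < 5:
                low_cells += 1   # approximation is shaky in these cells
            chi2 += (observed - expected) ** 2 / expected
    return chi2, low_cells

chi2, low = chi2_independence([[10, 20], [30, 40]])
print(round(chi2, 3), low)   # all expected counts here are >= 12, so no flags
```

Degrees of freedom are (r − 1)(c − 1), so 1 for this 2×2 table; any nonzero low_cells count is the cue to switch to Fisher's exact test or a permutation test rather than trust the chi-square p-value.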