Hypothesis testing is the formal procedure scientists use to decide whether observed effects in sample data are real or just sampling noise. The five tests this calculator supports — one-sample t, two-sample (Welch) t, paired t, chi-square goodness-of-fit, and chi-square independence — cover the vast majority of introductory and applied statistics use cases. Understanding when to use which test, what the p-value actually means, and why effect size matters alongside significance separates competent statistical practice from cargo-cult ritual.
Choosing the Right Test
Three questions select the test for you. First, are you comparing means (continuous numeric data) or counts (categorical data)? Means → t-test family; counts → chi-square family. Second, how many groups? One sample compared to a known reference → one-sample t. Two independent groups → two-sample Welch t. Two related measurements on the same subjects (before/after, twin pairs) → paired t. Third, what's the structure of the categorical data? One variable with k categories → chi-square goodness-of-fit. Two categorical variables cross-tabulated in an r×c contingency table → chi-square independence. The flowchart is mechanical once you internalize the questions. Common student errors include running a t-test on percentages (which are proportions, not means — you usually want chi-square or a proportion test) and running a two-sample test on paired data (which ignores the pairing and loses statistical power for no benefit).
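The three-question flowchart can be sketched as a small helper function. This is an illustrative sketch, not part of the calculator's actual interface — the function name and argument spellings are invented here:

```python
def choose_test(data_type, groups=2, paired=False, variables=1):
    """Map the three selection questions to one of the five tests.

    data_type: "means" (continuous numeric) or "counts" (categorical).
    groups / paired apply to the t-test family; variables to chi-square.
    """
    if data_type == "means":
        if groups == 1:
            return "one-sample t"            # compare to a known reference
        return "paired t" if paired else "two-sample Welch t"
    if data_type == "counts":
        # one variable with k categories vs. two cross-tabulated variables
        return ("chi-square goodness-of-fit" if variables == 1
                else "chi-square independence")
    raise ValueError("data_type must be 'means' or 'counts'")

print(choose_test("means", paired=True))      # before/after measurements
print(choose_test("counts", variables=2))     # contingency table
```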
p-values Don't Mean What People Think
The single most common interpretive error in statistics is treating the p-value as 'the probability that H₀ is true' or 'the probability that the result is due to chance.' Neither is correct. The p-value is the probability of observing data at least as extreme as yours, assuming H₀ is true. It is conditional on the null being true, not informative about whether the null is true. Small p-values constitute evidence against H₀ — strong evidence the data are unlikely under the null — but they do not quantify how likely H₀ is. A p of 0.04 is not 'a 4% chance the null is right'; it's 'a 4% chance of seeing data this extreme if the null is right.' This is why modern statistics emphasizes effect sizes alongside p-values: a tiny but statistically significant difference (p = 0.001, d = 0.05) is real but trivially small; a large but non-significant difference (p = 0.12, d = 0.6) is suggestive but under-powered. Report both numbers.
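Reporting both numbers is easy because the effect size comes straight from the sample statistics. A minimal stdlib sketch of Cohen's d for two independent samples, using the standard pooled-SD formula (the d = 0.05 / d = 0.6 figures above are illustrative thresholds, not outputs of this code):

```python
from statistics import mean, stdev
from math import sqrt

def cohens_d(a, b):
    """Cohen's d: standardized mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# A 2-point mean shift against a pooled SD of ~1.58 is a large effect (|d| > 0.8).
print(round(cohens_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]), 2))
```

The sign of d just encodes direction; report |d| alongside the p-value so readers can judge whether a significant difference is also a meaningful one.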
Why Welch's t Beats Student's t
Most introductory textbooks still present 'Student's pooled t-test' as the default two-sample comparison, with Welch's adjustment shown as an exception case for 'unequal variances.' Modern statistics has reversed this advice: Welch's t-test should be the default because the assumption of equal variances (the homoscedasticity assumption that Student's pooled test relies on) is rarely met in practice, and the pooled test gives inflated false-positive rates when it's violated. Welch's adjustment uses the Welch–Satterthwaite approximation for df and computes the standard error without pooling, producing nearly-correct Type I error rates regardless of whether the variances are equal. The R language made Welch's the default in t.test() for this reason; SciPy exposes it via ttest_ind(equal_var=False), though its default remains the pooled test, so you must opt in. Unless you have a specific theoretical reason to pool variances, use Welch's — there is essentially no cost when variances are equal, and a real benefit when they aren't.
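The unpooled standard error and the Welch–Satterthwaite df can be computed directly from the two sample variances. A stdlib-only sketch (converting t and df into a p-value still requires a t-distribution CDF, e.g. scipy.stats.t.sf):

```python
from statistics import mean, variance
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic and Welch–Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)     # unpooled sample variances
    se2 = va / na + vb / nb               # squared standard error, no pooling
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch–Satterthwaite approximation: df falls between min(n)-1 and na+nb-2.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), round(df, 2))  # df < 8 because the variances differ (2.5 vs 10)
```

Note the fractional df: unlike the pooled test's fixed na + nb − 2, Welch's df shrinks toward the smaller-sample side as the variances diverge, which is exactly what keeps the Type I error rate near nominal.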
Chi-Square Caveats
The chi-square approximation relies on expected cell counts being large enough that the discrete sampling distribution of counts is well-approximated by the continuous chi-square distribution. The standard rule: every expected cell count should be at least 5; some references relax this to 'at least 80% of cells ≥ 5 and none below 1.' Sparse contingency tables (expected counts below 5 in multiple cells) need either Fisher's exact test (for 2×2) or a Monte Carlo permutation test (for larger tables) — applying chi-square anyway gives unreliable p-values. The calculator flags violations of the expected-count rule with a warning; treat the resulting p-value as approximate when it appears. For 2×2 tables specifically, Yates' continuity correction is sometimes applied to slightly improve the approximation, though modern practice often prefers Fisher's exact test for any 2×2 case where computational cost permits.
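The expected counts the rule refers to come from the row and column totals, so the check is easy to reproduce by hand. A minimal sketch of the independence statistic plus a low-expected-count flag, assuming the plain threshold-of-5 version of the rule (the calculator's own warning logic may differ):

```python
def chi2_independence(table):
    """Chi-square statistic for an r x c table, plus a count of low-expected cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2, low_cells = 0.0, 0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: (row total * column total) / grand total.
            expected = row_totals[i] * col_totals[j] / total
            if expected < 5:
                low_cells += 1   # approximation is shaky in these cells
            chi2 += (observed - expected) ** 2 / expected
    return chi2, low_cells

chi2, low = chi2_independence([[10, 20], [30, 40]])
print(round(chi2, 3), low)   # all expected counts here are >= 12, so no flags
```

Degrees of freedom are (r − 1)(c − 1), so 1 for this 2×2 table; any nonzero low_cells count is the cue to switch to Fisher's exact test or a permutation test rather than trust the chi-square p-value.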