Correlation measures the degree to which two variables move together. It says nothing about why they move together. This distinction between correlation and causation is one of the most important in data analysis, yet it is blurred constantly in headlines, research summaries, and business reports.
Why Correlation Is Not Causation
Two variables can be correlated for three reasons: (1) X causes Y, (2) Y causes X, or (3) a third variable Z causes both. Ice cream sales and drowning rates correlate strongly, not because ice cream causes drowning but because summer heat drives both. The third variable is called a confounder, and the resulting relationship is a spurious correlation.
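The ice-cream-and-drowning pattern is easy to reproduce in simulation. The sketch below uses made-up coefficients: temperature (the confounder Z) drives both series, which never influence each other, yet their correlation comes out strongly positive.

```python
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)

# Hypothetical daily data: temperature (the confounder Z) drives both
# ice cream sales (X) and drownings (Y). X and Y never touch each other.
temps = [random.gauss(20, 8) for _ in range(500)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temps]
drownings = [0.5 * t + random.gauss(0, 3) for t in temps]

print(round(pearson_r(ice_cream, drownings), 2))  # strongly positive
```

Removing the shared driver (for example, by correlating the residuals after regressing each series on temperature) would make the relationship vanish, which is exactly what distinguishes a confounded correlation from a causal one.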
When Correlation Is Useful Without Causation
You do not need causation for correlation to be valuable. If credit scores correlate strongly with loan defaults, a bank can use the score to predict risk without knowing the causal mechanism. Prediction and explanation are different goals. For prediction alone, correlation is sufficient.
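A minimal sketch of pure prediction, using hypothetical portfolio data: the slope of the least-squares line can be built directly from the correlation coefficient (slope = r x sd_y / sd_x), so a bank can score a new applicant without any causal model of why scores and defaults move together.

```python
def linear_predictor(xs, ys):
    """Least-squares predictor built from correlation: slope = r * sd_y / sd_x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    r = cov / (sx * sy)
    slope = r * sy / sx
    return lambda x: my + slope * (x - mx)

# Hypothetical portfolio: credit score vs. observed default rate (%).
scores   = [580, 620, 650, 700, 720, 760, 800]
defaults = [12.1, 9.8, 8.0, 5.5, 4.9, 3.2, 2.0]

predict = linear_predictor(scores, defaults)
print(round(predict(680), 1))  # estimated default rate for a new applicant
```

The predictor is useful exactly as far as the correlation holds; it says nothing about what would happen if the bank intervened to change someone's score.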
Establishing Causation
The gold standard is a randomized controlled trial (RCT), where subjects are randomly assigned to conditions, ruling out confounders. Observational data can support causal inference through methods like difference-in-differences, instrumental variables, or regression discontinuity, but these require strong assumptions that correlation alone cannot satisfy.
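To make one of those observational methods concrete, here is difference-in-differences on made-up numbers: subtract the control group's before/after change from the treated group's change, so that any shared trend cancels out. The figures are illustrative, and the estimate is only causal under the parallel-trends assumption.

```python
# Hypothetical outcome means (e.g. weekly sales) before and after a policy
# in a treated region and an untreated control region.
treat_pre, treat_post = 10.0, 18.0
control_pre, control_post = 9.0, 12.0

# Difference-in-differences: the treated group's change minus the control
# group's change. Valid only if both groups would have trended in parallel
# absent the policy (the parallel-trends assumption).
did = (treat_post - treat_pre) - (control_post - control_pre)
print(did)  # 5.0: estimated causal effect of the policy
```

Naively comparing treat_post to treat_pre would attribute the full change of 8.0 to the policy; the control group reveals that 3.0 of it was just the background trend.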
Anscombe's Quartet — Always Plot Your Data
In 1973, Francis Anscombe demonstrated that four completely different datasets can share the same means, variances, Pearson r, and fitted regression line. One dataset is linear. One has a curved relationship. One is perfectly linear except for a single outlier. One has no variation in x at all except for one high-leverage point that manufactures the correlation. This is why the scatter plot in this calculator is not optional decoration; it is essential information. No summary number alone replaces visual inspection.
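The quartet's published values make this easy to verify directly. The sketch below computes the summary statistics for all four datasets; despite the wildly different shapes described above, every row prints essentially the same numbers.

```python
def mean(v):
    return sum(v) / len(v)

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Anscombe's quartet (1973). Datasets I-III share the same x values;
# dataset IV is constant in x except for one high-leverage point.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

for i, (xs, ys) in enumerate(quartet, 1):
    print(f"dataset {i}: mean_x={mean(xs):.2f} "
          f"mean_y={mean(ys):.2f} r={pearson_r(xs, ys):.2f}")
```

All four report mean_x = 9.00, mean_y = 7.50, and r = 0.82, which is precisely why the numbers cannot substitute for the plot.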