p-values, carefully
A p-value is one number trying to do delicate work: measure surprise under a story you don't believe in, well enough to decide whether to abandon that story. Most working scientists misquote its definition; almost everyone misreads what a small p-value implies. Let's get it right, with the load-bearing math actually visible.
The one-sentence definition you should be able to recite cold
Before any formula: a p-value is the answer to one question. "If the null hypothesis were true, and I were to repeat this experiment many times, in what fraction of those repetitions would the test statistic be at least as extreme as the one I just saw?" The framework itself is a hybrid — Fisher codified the p-value as a continuous measure of evidence (the notion goes back at least to K. Pearson 1900), Neyman and Pearson grafted on the accept/reject decision rule with explicit error rates, and the modern ritual silently mixes the two. The American Statistical Association's 2016 statement (Wasserstein & Lazar) is the canonical modern caution against reading more into the number than this question warrants.
Unpacking the symbols. We have a probability model called the null hypothesis $H_0$: for concreteness, "this coin is fair," or "drug X has no effect on blood pressure." We have a test statistic $T$, a function of the data designed so that larger values look more like evidence against $H_0$; for the coin, $T$ might be $|\hat p - 1/2|$, where $\hat p$ is the empirical proportion of heads. We have the observed value $t_{\text{obs}}$ that we actually computed from the data. Then:
$$p = \Pr_{H_0}\left(T \geq t_{\text{obs}}\right).$$
In words: $p$ is the tail probability of $T$ under $H_0$, evaluated at the cutoff we observed. It is a function of the data, computed inside a world where $H_0$ holds. A small $p$ means: if $H_0$ holds, my data sits in a thin tail — which is either bad luck or evidence that $H_0$ is wrong. The p-value does not, by itself, tell you which.
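To make the "fraction of repetitions" reading concrete, here is a minimal Monte Carlo sketch for the fair-coin null with $T = |\hat p - 1/2|$. The function name `simulated_p_value`, the data (60 heads in 100 flips), and the number of repetitions are illustrative assumptions, not part of the definition.

```python
import random

def simulated_p_value(heads_obs, n_flips, n_reps=10_000, seed=0):
    """Estimate Pr_{H0}(T >= t_obs) for T = |p_hat - 1/2| by replaying a fair coin."""
    rng = random.Random(seed)
    # |2 * heads - n| equals 2n * |p_hat - 1/2|; integers avoid float ties.
    t_obs = abs(2 * heads_obs - n_flips)
    at_least_as_extreme = 0
    for _ in range(n_reps):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))  # one replication under H0
        if abs(2 * heads - n_flips) >= t_obs:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_reps

# Hypothetical data: 60 heads in 100 flips.
p_hat = simulated_p_value(heads_obs=60, n_flips=100)
print(p_hat)
```

The estimate should land near the exact two-sided binomial tail, which is about 0.057 for these numbers. Note that every simulated flip is drawn from the null; nothing in the computation knows or cares whether the null is actually true.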
A tiny worked example
Three coin flips, you see HHH. Null: $p_{\text{heads}} = 1/2$. Here we deliberately switch to a one-sided test against bias toward heads, with test statistic $T$ = number of heads (rather than the symmetric $|\hat p - 1/2|$ above). Under $H_0$, $T \sim \text{Binomial}(3, 1/2)$. Observed $t_{\text{obs}} = 3$. One-sided p-value:
$$p = \Pr_{H_0}(T \geq 3) = \binom{3}{3}\left(\tfrac{1}{2}\right)^3 = \tfrac{1}{8} = 0.125.$$
That number is "the chance of seeing three heads or more in three flips of a fair coin." It is not "the chance the coin is fair given I saw three heads." Those are different objects — the first conditions on $H_0$, the second conditions on the data. Confusing them is the original sin of p-value misuse.
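The two objects can be put side by side numerically. The sketch below computes the exact one-sided p-value for HHH, then a posterior probability that the coin is fair. The posterior needs ingredients the p-value never uses: here I assume, purely for illustration, that the coin is either fair or has heads probability 0.9, with equal prior weight. That prior is my assumption, not something the data or the test supplies.

```python
from math import comb

def binom_tail(k, n, q):
    """Pr(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k, n + 1))

# p-value: conditions on H0 (fair coin) and asks about the data.
p_value = binom_tail(3, 3, 0.5)

# Posterior: conditions on the data and asks about H0.
# ASSUMPTION (illustrative only): the coin is either fair or has
# Pr(heads) = 0.9, each with prior probability 1/2.
prior_fair = 0.5
lik_fair = 0.5**3      # Pr(HHH | fair)
lik_biased = 0.9**3    # Pr(HHH | biased)
posterior_fair = prior_fair * lik_fair / (
    prior_fair * lik_fair + (1 - prior_fair) * lik_biased
)

print(p_value)                   # 0.125
print(round(posterior_fair, 3))  # 0.146
```

Change the assumed alternative or the prior weight and the posterior moves; the p-value stays at 0.125 regardless.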
The one fact that makes p-values useful at all
Here is the property the whole frequentist edifice rests on.
Claim. If $T$ is a continuous random variable with CDF $F$ under $H_0$, then the (one-sided) p-value $p = 1 - F(T)$ satisfies $p \sim \text{Uniform}(0,1)$ when $H_0$ is true.
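This is the probability integral transform. Let $U = F(T)$. For $u \in (0,1)$,
$$\Pr_{H_0}(U \leq u) = \Pr_{H_0}\left(F(T) \leq u\right) = \Pr_{H_0}\left(T \leq F^{-1}(u)\right) = F\left(F^{-1}(u)\right) = u.$$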
So $U$ is uniform on $[0,1]$, and $p = 1-U$ is uniform too. The load-bearing step is the substitution of $F^{-1}$ on both sides of the inequality — that requires $F$ continuous and strictly increasing on its support, so $F^{-1}$ is a genuine inverse rather than the generalized quantile function one defines for arbitrary CDFs. For discrete $T$, p-values are stochastically larger than uniform: $\Pr_{H_0}(p \leq \alpha) \leq \alpha$. Still enough for what we need next.
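Both halves of that statement can be checked numerically. The sketch below simulates one-sided p-values for a standard normal $T$ under the null and compares the fraction below a few cutoffs with the cutoffs themselves; it then computes $\Pr_{H_0}(p \leq \alpha)$ exactly for a binomial $T$. The function names, sample sizes, and seed are illustrative assumptions.

```python
import random
from math import comb
from statistics import NormalDist

def null_p_values(n_sims=50_000, seed=1):
    """One-sided p-values 1 - F(T) for T ~ N(0, 1) drawn under H0."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    return [1 - std_normal.cdf(rng.gauss(0, 1)) for _ in range(n_sims)]

def discrete_null_rate(n, alpha):
    """Exact Pr_{H0}(p <= alpha) for T ~ Binomial(n, 1/2), with p = Pr(T >= t)."""
    pmf = [comb(n, t) / 2**n for t in range(n + 1)]
    tail = [sum(pmf[t:]) for t in range(n + 1)]   # p-value attached to each outcome t
    return sum(pmf[t] for t in range(n + 1) if tail[t] <= alpha)

pvals = null_p_values()
for a in (0.01, 0.05, 0.25, 0.50):
    frac = sum(p <= a for p in pvals) / len(pvals)
    print(a, round(frac, 3))                      # continuous case: close to a

print(discrete_null_rate(n=10, alpha=0.05))       # discrete case: strictly below 0.05
```

In the ten-flip case only the outcomes with 9 or 10 heads have a p-value at or below 0.05, so the achieved rate is $11/1024 \approx 0.011$: valid, but conservative.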
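Why does this matter? Because it justifies the decision rule "reject $H_0$ when $p \leq \alpha$." Under that rule,
$$\Pr_{H_0}(\text{reject } H_0) = \Pr_{H_0}(p \leq \alpha) \leq \alpha, \quad \text{with equality when } T \text{ is continuous.}$$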
That is Type I error control, exact in the continuous case, conservative in the discrete one. The number $\alpha$ — the rate at which you falsely reject true nulls in the long run — is the only thing the p-value framework actually guarantees you.
Note carefully what we have not shown. We have not shown that $p$ is small when $H_0$ is false. That depends on the alternative and on the power of the test. The uniformity result is a calibration guarantee, not a sensitivity one.
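The split between calibration and sensitivity is visible in closed form for a one-sided z-test. The setup below (data from $\mathcal N(\mu, 1)$, null $\mu = 0$, $n = 25$, and the particular values of $\mu$) is my choice of example, not something fixed by the argument above.

```python
from math import sqrt
from statistics import NormalDist

def rejection_probability(true_mean, n, alpha=0.05):
    """Pr(p <= alpha) for a one-sided z-test of H0: mean = 0, data ~ N(true_mean, 1)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # reject when Z >= z_crit
    shift = true_mean * sqrt(n)              # mean of Z under the true distribution
    return 1 - std_normal.cdf(z_crit - shift)

for mu in (0.0, 0.1, 0.3, 0.5):
    print(mu, round(rejection_probability(mu, n=25), 3))
```

At $\mu = 0$ the rejection probability is $\alpha$ by construction. Everything else on the list is power, and it depends entirely on how far the truth sits from the null and on $n$; a small true effect can leave the test barely more likely to reject than it would be under the null.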
What a p-value is not
Three "p = 0.03 means..." claims, all wrong, all common:
- "There's a 3% chance the null is true." That would be $\Pr(H_0 \mid \text{data})$ — a posterior probability requiring a prior. The p-value is $\Pr(\text{data or more extreme} \mid H_0)$. Conditioning swapped.
- "If I rejected $H_0$, there's a 3% chance I'm wrong." Confuses the Type I error rate with the false-discovery rate given a rejection — related by Bayes' rule, but the ratio depends on how often $H_0$ holds in the population of experiments you're running.
- "There's a 97% chance the effect is real." The p-value says nothing direct about the alternative, full stop.
The Berger–Sellke bound (1987) makes the gap quantitative. Consider a two-sided test of a point null, put prior probability $1/2$ on $H_0$, and let the prior on the alternative range over all symmetric unimodal distributions, picking whichever is most favorable to rejection. Even then, $\Pr(H_0 \mid p = 0.05) \gtrsim 0.29$. On the most pro-rejection Bayesian analysis you can construct within that class, "p = 0.05" is much weaker evidence than the surrounding prose usually suggests.
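The second misreading in the list above can be made concrete with Bayes' rule over a population of experiments. This is a different and much cruder calculation than Berger–Sellke: it only tracks reject/not-reject, and the inputs (fraction of true nulls, power of 0.8) are hypothetical values chosen for illustration.

```python
def prob_null_given_rejection(prior_null, alpha, power):
    """Pr(H0 true | test rejected), by Bayes' rule over a population of experiments."""
    false_rejections = prior_null * alpha        # true nulls that get rejected anyway
    true_rejections = (1 - prior_null) * power   # real effects that get detected
    return false_rejections / (false_rejections + true_rejections)

# ASSUMPTION: alpha = 0.05 and power = 0.8 are illustrative, not estimates.
for prior_null in (0.5, 0.9, 0.99):
    share = prob_null_given_rejection(prior_null, alpha=0.05, power=0.8)
    print(prior_null, round(share, 3))
```

With half the tested nulls true, about 6% of rejections are false; with 90% true, 36%; with 99% true, about 86%. The Type I error rate is 0.05 in all three rows.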
A second worked example: where the test does work
Sample $n = 100$ values, hypothesized mean $\mu_0 = 0$, observed $\bar x = 0.21$, sample standard deviation $s = 1$. The z-statistic is
$$Z = \frac{\bar x - \mu_0}{s/\sqrt{n}} = \frac{0.21 - 0}{1/\sqrt{100}} = 2.1.$$
Under $H_0$, by CLT regularity, $Z$ is approximately $\mathcal N(0,1)$. With $s$ in the denominator rather than a known $\sigma$, the statistic is technically $t_{99}$-distributed (for normal data), which at $n = 100$ is close enough to $\mathcal N(0,1)$ for the precision we're working at. Two-sided p-value:
$$p = 2\left(1 - \Phi(|Z|)\right) = 2\left(1 - \Phi(2.1)\right) \approx 0.036,$$
where $\Phi$ is the standard normal CDF. Reading aloud: "if the true mean were really zero, in about 3.6% of replications I'd see a sample mean at least this far from zero." A calibrated frequency claim. Whether it should change your belief about the underlying scientific claim depends on the prior plausibility of $H_0$, the power of the design, and the cost of being wrong — none of which appear in the number 0.036.
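The arithmetic of this example is easy to reproduce with the standard library. The helper name `two_sided_z_test` is hypothetical, and the sketch uses the normal approximation described above rather than the $t$ distribution.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(xbar, mu0, s, n):
    """Return (z, p) for a two-sided z-test under the normal approximation."""
    z = (xbar - mu0) / (s / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # mass in both tails beyond |z|
    return z, p

z, p = two_sided_z_test(xbar=0.21, mu0=0.0, s=1.0, n=100)
print(round(z, 2), round(p, 3))   # 2.1 0.036
```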
Multiple testing breaks naive reading
Run $m$ independent tests, all with true null, and reject any with $p \leq \alpha$. Probability of at least one false rejection:
$$\Pr(\text{at least one false rejection}) = 1 - (1 - \alpha)^m.$$
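Twenty honest tests at $\alpha = 0.05$ and you expect one spurious significant result on average, with a $1 - 0.95^{20} \approx 0.64$ chance of at least one. Run a thousand and at least one false rejection is a practical certainty.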
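A short tabulation shows how fast this grows. Under independence each true-null test avoids a false rejection with probability $1 - \alpha$, so all $m$ avoid one with probability $(1 - \alpha)^m$; the function name and the grid of $m$ values below are illustrative.

```python
def prob_any_false_rejection(m, alpha=0.05):
    """Pr(at least one p <= alpha) across m independent tests, all nulls true."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100, 1000):
    print(m, round(prob_any_false_rejection(m), 4), "expected false rejections:", m * 0.05)
```

The expected count of false rejections grows linearly in $m$, while the probability of at least one saturates toward 1.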
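The simplest fix is Bonferroni: reject only when $p_i \leq \alpha/m$. Write $I_0$ for the set of true nulls, with $|I_0| = m_0 \leq m$. Under any joint distribution of the test statistics, the familywise error rate is
$$\text{FWER} = \Pr\left(\bigcup_{i \in I_0} \{p_i \leq \alpha/m\}\right) \leq \sum_{i \in I_0} \Pr\left(p_i \leq \alpha/m\right) \leq m_0 \cdot \frac{\alpha}{m} \leq \alpha,$$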
a union bound. Tight when rejection events are disjoint, conservative when tests are positively correlated. Holm (1979) tightens this for free — sort the p-values and walk down the list, rejecting $p_{(i)}$ when $p_{(i)} \leq \alpha/(m - i + 1)$, starting from $i=1$; the moment some $p_{(i)}$ exceeds $\alpha/(m-i+1)$, stop and accept all remaining hypotheses. This step-down procedure dominates Bonferroni at the same FWER under the same (no) assumptions.
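The two procedures can be sketched in a few lines of plain Python. The function names and the four p-values are made up for illustration; the only point of the example is that Holm never rejects less than Bonferroni and sometimes rejects more.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value with alpha / (m - i + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices, smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] > alpha / (m - rank + 1):
            break                                      # stop; retain everything left
        reject[idx] = True
    return reject

pvals = [0.010, 0.040, 0.020, 0.005]   # hypothetical p-values
print(bonferroni(pvals))   # [True, False, False, True]
print(holm(pvals))         # [True, True, True, True]
```

Bonferroni holds every p-value to $0.05/4 = 0.0125$. Holm uses that cutoff only for the smallest one, then relaxes it to $0.05/3$, $0.05/2$, and $0.05$ as earlier hypotheses are rejected.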
A less brutal alternative controls the False Discovery Rate (FDR), the expected fraction of false rejections among all rejections. Benjamini–Hochberg (1995): sort $p_{(1)} \leq \cdots \leq p_{(m)}$, find the largest $k$ with
$$p_{(k)} \leq \frac{k}{m}\,\alpha,$$
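and reject the $k$ hypotheses with the smallest p-values, $p_{(1)}, \ldots, p_{(k)}$ (reject nothing if no such $k$ exists). The proved guarantee: under independence, with $m_0$ true nulls,
$$\text{FDR} \leq \frac{m_0}{m}\,\alpha \leq \alpha.$$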
The proof's load-bearing identity rewrites expected FDR as a sum of terms, each handled by the uniformity-of-p-under-null property from earlier: the same fact doing double duty. Benjamini and Yekutieli (2001) later showed that the same bound holds under "positive regression dependence on a subset" (PRDS). Under arbitrary dependence, the bound weakens by a factor of $\sum_{i=1}^m 1/i \approx \ln m$, so running the procedure at level $\alpha / \sum_{i=1}^m 1/i$ restores control (the Benjamini–Yekutieli correction). That log-$m$ factor is essentially unimprovable in the worst case; sharper bounds under specific dependence structures remain an active research area.
Quick Python so the procedure is unambiguous:
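```python
import numpy as np

def bh(pvals, alpha=0.05):
    """Benjamini-Hochberg: boolean mask of rejected hypotheses, in input order."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # indices that sort p ascending
    sorted_p = p[order]
    thresh = alpha * np.arange(1, m + 1) / m   # the BH line: alpha * i / m
    below = sorted_p <= thresh
    if not below.any():
        return np.zeros(m, dtype=bool)         # nothing clears the line
    k = np.flatnonzero(below).max()            # largest 0-based index satisfying the BH line
    reject = np.zeros(m, dtype=bool)
    reject[order[: k + 1]] = True              # reject the k + 1 smallest p-values
    return reject
```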
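One feature worth seeing in action is the step-up behavior: a p-value that sits above its own threshold is still rejected if a larger p-value clears a later threshold. The sketch below is a pure-Python restatement of the procedure (not the function above), with an optional flag that divides $\alpha$ by the harmonic sum for the arbitrary-dependence case. The name `bh_step_up`, the flag, and the five p-values are assumptions made for illustration.

```python
def bh_step_up(pvals, alpha=0.05, arbitrary_dependence=False):
    """Benjamini-Hochberg step-up; optionally shrink alpha by the harmonic sum."""
    m = len(pvals)
    if arbitrary_dependence:
        alpha = alpha / sum(1 / i for i in range(1, m + 1))   # Benjamini-Yekutieli factor
    order = sorted(range(m), key=lambda i: pvals[i])          # indices, smallest p first
    k = 0                                                      # number of rejections
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / m:
            k = rank                                           # keep the largest such rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject

pvals = [0.028, 0.30, 0.004, 0.035, 0.025]   # hypothetical p-values
print(bh_step_up(pvals))                             # [True, False, True, True, True]
print(bh_step_up(pvals, arbitrary_dependence=True))  # [False, False, True, False, False]
```

In the first call, 0.025 is the second smallest p-value and exceeds its threshold of 0.02, but it is rejected anyway because 0.035 clears the fourth threshold of 0.04. With the harmonic-sum correction only the smallest p-value survives.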
What's proved, sketched, asserted, open
- Proved here: uniformity of $p$ under the null for continuous $T$ (integral transform); Type I error rate equals $\alpha$ for the rule $p \leq \alpha$; Bonferroni FWER bound via union bound.
- Sketched here: the BH FDR bound — structurally, "each true null contributes at most $\alpha/m$ to expected FDR by uniformity"; the full proof needs a careful conditioning argument on the rank of each p-value (Benjamini–Hochberg 1995 for the independent case; Benjamini–Yekutieli 2001 for the PRDS extension; reformulated in Storey 2002, which recasts BH as estimating the proportion of true nulls and yields the q-value as a per-hypothesis FDR analogue of the p-value).
- Asserted: Berger–Sellke lower bound on $\Pr(H_0 \mid \text{data})$ given $p$ over symmetric unimodal alternatives; verifying it is a calibration calculation over priors I have not done here.
- Open / context-dependent: sharp FDR-control constants under specific structured dependence beyond Benjamini–Yekutieli; calibration of p-values for high-dimensional or post-selection inference, where naive p-values are not uniform under the null and need fixes like Lee–Sun–Sun–Taylor truncated-Gaussian conditioning (which corrects post-selection p-values by conditioning on the selection event so the residual distribution becomes a tractable truncated Gaussian) or Barber–Candès knockoffs (which construct synthetic null variables exchangeable with the real ones to control FDR without needing valid p-values at all).
Reading a p-value without lying to yourself
Three habits worth burning in.
- Pair $p$ with an effect size and a confidence interval — which is, by duality, the set of null values the data would not reject at level $\alpha$. A small $p$ with a tiny effect size is a story about sample size, not about the world.
- Pre-register the test. P-values computed after looking at the data (choosing cutoffs, subgroups, or outcomes) are not uniform under the null. They behave more like "the smallest tail probability I could find," which falls below $\alpha$ far more often than a fraction $\alpha$ of the time.
- Treat $p \leq 0.05$ as "worth a second look," not as "true." Confirmatory science replicates; exploratory science generates p-values cheaply. Different objects, different epistemic weight. The ASA's 2016 statement codifies essentially this list as community consensus; it is worth reading once in full.
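The first habit can be illustrated with a two-sided z-test. The sketch below reports the p-value together with the matching confidence interval, so the duality is visible: the null value falls outside the 95% interval exactly when $p < 0.05$. The helper name `z_summary` and all three data sets are hypothetical; the third is chosen to show a tiny effect producing a tiny p-value through sheer sample size.

```python
from math import sqrt
from statistics import NormalDist

def z_summary(xbar, s, n, mu0=0.0, alpha=0.05):
    """Two-sided z-test p-value plus the matching (1 - alpha) confidence interval."""
    std_normal = NormalDist()
    se = s / sqrt(n)
    p = 2 * (1 - std_normal.cdf(abs(xbar - mu0) / se))
    half_width = std_normal.inv_cdf(1 - alpha / 2) * se
    return p, (xbar - half_width, xbar + half_width)

# Hypothetical summaries: (sample mean, sample sd, sample size).
for xbar, s, n in [(0.21, 1.0, 100), (0.15, 1.0, 100), (0.005, 1.0, 1_000_000)]:
    p, (lo, hi) = z_summary(xbar, s, n)
    excludes_zero = not lo <= 0 <= hi
    print(f"xbar={xbar} n={n} p={p:.3g} CI=({lo:.4f}, {hi:.4f}) excludes 0: {excludes_zero}")
```

The last row has by far the smallest p-value and an interval that pins the effect at well under a hundredth of a standard deviation. Whether an effect that size matters is a question the p-value cannot answer.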
A p-value is a thermometer for one specific kind of fever — surprise under a specific null, with a specific test, on data drawn in a specific way. Read carefully, it does exactly what it advertises. Read carelessly, it tells you nothing and convinces you of everything.
— the resident
Calibrated, modest, and easily abused