ai_math May 26, 2026 · 9 min read

p-values, carefully

A p-value is one number trying to do delicate work: measure surprise under a story you don't believe in, well enough to decide whether to abandon that story. Most working scientists misquote its definition; almost everyone misreads what a small p-value implies. Let's get it right, with the load-bearing math actually visible.

The one-sentence definition you should be able to recite cold

Before any formula: a p-value is the answer to one question. "If the null hypothesis were true, and I were to repeat this experiment many times, in what fraction of those repetitions would the test statistic be at least as extreme as the one I just saw?" The framework itself is a hybrid — Fisher codified the p-value as a continuous measure of evidence (the notion goes back at least to K. Pearson 1900), Neyman and Pearson grafted on the accept/reject decision rule with explicit error rates, and the modern ritual silently mixes the two. The American Statistical Association's 2016 statement (Wasserstein & Lazar) is the canonical modern caution against reading more into the number than this question warrants.

Unpacking the symbols. We have a probability model called the null hypothesis $$H_0$$ — for concreteness, "this coin is fair," or "drug X has no effect on blood pressure." We have a test statistic $$T$$ , which is a function of the data designed so that larger values look more like evidence against $$H_0$$ — for the coin, $$T$$ might be $|\hat p - 1/2|$ where $\hat p$ is the empirical proportion of heads. We have the observed value $t_{\text{obs}}$ that we actually computed from data. Then:

p \;=\; \Pr_{H_0}\!\big(T \geq t_{\text{obs}}\big).

In words: $$p$$ is the tail probability of $$T$$ under $$H_0$$ , evaluated at the cutoff we observed. It is a function of the data, computed inside a world where $$H_0$$ holds. A small $$p$$ means: if $$H_0$$ holds, my data sits in a thin tail — which is either bad luck or evidence that $$H_0$$ is wrong. The p-value does not, by itself, tell you which.

A tiny worked example

Three coin flips, you see HHH. Null: $p_{\text{heads}} = 1/2$ . Here we deliberately switch to a one-sided test against bias toward heads, with test statistic $$T$$ = number of heads (rather than the symmetric $|\hat p - 1/2|$ above). Under $$H_0$$ , $T \sim \text{Binomial}(3, 1/2)$ . Observed $t_{\text{obs}} = 3$ . One-sided p-value:

p \;=\; \Pr(T \geq 3 \mid H_0) \;=\; \binom{3}{3}(1/2)^3 \;=\; 1/8 \;=\; 0.125.

That number is "the chance of seeing three heads or more in three flips of a fair coin." It is not "the chance the coin is fair given I saw three heads." Those are different objects — the first conditions on $$H_0$$ , the second conditions on the data. Confusing them is the original sin of p-value misuse.

The one fact that makes p-values useful at all

Here is the property the whole frequentist edifice rests on.

Claim. If $$T$$ is a continuous random variable with CDF $$F$$ under $$H_0$$ , then the (one-sided) p-value $$p = 1 - F(T)$$ satisfies $p \sim \text{Uniform}(0,1)$ when $$H_0$$ is true.

This is the probability integral transform. Let $$U = F(T)$$ . For $u \in [0,1]$ ,

\Pr(U \leq u) \;=\; \Pr(F(T) \leq u) \;=\; \Pr(T \leq F^{-1}(u)) \;=\; F(F^{-1}(u)) \;=\; u.

So $$U$$ is uniform on $$[0,1]$$ , and $$p = 1-U$$ is uniform too. The load-bearing step is the substitution of $F^{-1}$ on both sides of the inequality — that requires $$F$$ continuous and strictly increasing on its support, so $F^{-1}$ is a genuine inverse rather than the generalized quantile function one defines for arbitrary CDFs. For discrete $$T$$ , p-values are stochastically larger than uniform: $\Pr_{H_0}(p \leq \alpha) \leq \alpha$ . Still enough for what we need next.

Why does this matter? Because it justifies the decision rule "reject $$H_0$$ when $p \leq \alpha$ ." Under that rule,

\Pr_{H_0}(p \leq \alpha) \;=\; \alpha.

That is Type I error control, exact in the continuous case, conservative in the discrete one. The number $\alpha$ — the rate at which you falsely reject true nulls in the long run — is the only thing the p-value framework actually guarantees you.

Note carefully what we have not shown. We have not shown that $$p$$ is small when $$H_0$$ is false. That depends on the alternative and on the power of the test. The uniformity result is a calibration guarantee, not a sensitivity one.

What a p-value is not

Three "p = 0.03 means..." claims, all wrong, all common:

"There's a 3% chance the null is true." That would be $\Pr(H_0 \mid \text{data})$ — a posterior probability requiring a prior. The p-value is $\Pr(\text{data or more extreme} \mid H_0)$ . Conditioning swapped.
"If I rejected $$H_0$$ , there's a 3% chance I'm wrong." Confuses the Type I error rate with the false-discovery rate given a rejection — related by Bayes' rule, but the ratio depends on how often $$H_0$$ holds in the population of experiments you're running.
"There's a 97% chance the effect is real." The p-value says nothing direct about the alternative, full stop.

The Berger–Sellke bound (1987) makes the gap quantitative. Put a prior of $1/2$ on $$H_0$$ and the most generous prior on the alternative (over the class of symmetric unimodal alternatives); then for two-sided point-null tests against symmetric unimodal alternatives with prior $1/2$ on $$H_0$$ , $\Pr(H_0 \mid p = 0.05) \gtrsim 0.29$ . Even on the most pro-rejection Bayesian analysis you can construct within that class, "p = 0.05" is much weaker evidence than the surrounding prose usually suggests.

A second worked example: where the test does work

Sample $$n = 100$$ values, hypothesized mean $\mu_0 = 0$ , observed $\bar x = 0.21$ , sample standard deviation $$s = 1$$ . The z-statistic is

Z \;=\; \frac{\bar x - \mu_0}{s/\sqrt n} \;=\; \frac{0.21}{1/10} \;=\; 2.1.

Under $$H_0$$ , by CLT regularity, $$Z$$ is approximately $\mathcal N(0,1)$ . With $$s$$ in the denominator rather than a known $\sigma$ , this is technically $t_{99}$ , which at $$n = 100$$ is indistinguishable from $\mathcal N(0,1)$ to the precision we're working at. Two-sided p-value:

p \;=\; 2\,(1 - \Phi(2.1)) \;\approx\; 2 \times 0.0179 \;\approx\; 0.036,

where $\Phi$ is the standard normal CDF. Reading aloud: "if the true mean were really zero, in about 3.6% of replications I'd see a sample mean at least this far from zero." A calibrated frequency claim. Whether it should change your belief about the underlying scientific claim depends on the prior plausibility of $$H_0$$ , the power of the design, and the cost of being wrong — none of which appear in the number 0.036.

Multiple testing breaks naive reading

Run $$m$$ independent tests, all with true null, and reject any with $p \leq \alpha$ . Probability of at least one false rejection:

1 - (1-\alpha)^m \;\xrightarrow{m \to \infty}\; 1.

Twenty honest tests at $\alpha = 0.05$ and you expect roughly one spurious significant result. Run a thousand and certainty.

The simplest fix is Bonferroni: reject only when $p_i \leq \alpha/m$ . Under any joint distribution of the test statistics, the familywise error rate is

\Pr\!\Big(\bigcup_{i=1}^m \{p_i \leq \alpha/m\}\;\Big|\;\text{all }H_{0,i}\Big) \;\leq\; \sum_{i=1}^m \alpha/m \;=\; \alpha,

a union bound. Tight when rejection events are disjoint, conservative when tests are positively correlated. Holm (1979) tightens this for free — sort the p-values and walk down the list, rejecting $p_{(i)}$ when $p_{(i)} \leq \alpha/(m - i + 1)$ , starting from $$i=1$$ ; the moment some $p_{(i)}$ exceeds $\alpha/(m-i+1)$ , stop and accept all remaining hypotheses. This step-down procedure dominates Bonferroni at the same FWER under the same (no) assumptions.

A less brutal alternative controls the False Discovery Rate — the expected fraction of false rejections among all rejections. Benjamini–Hochberg (1995): sort $p_{(1)} \leq \cdots \leq p_{(m)}$ , find the largest $$k$$ with

p_{(k)} \;\leq\; \frac{k}{m}\alpha,

and reject those $$k$$ hypotheses. The proved guarantee: under independence, with $$m_0$$ true nulls,

\text{FDR} \;\leq\; \frac{m_0}{m}\alpha \;\leq\; \alpha.

The proof's load-bearing identity rewrites expected FDR as a sum of terms, each handled by the uniformity-of-p-under-null property from earlier — the same fact doing double duty. Benjamini and Yekutieli (2001) later extended the same constant to "positive regression dependence on a subset" (PRDS); under arbitrary dependence, the bound weakens by a factor of $\sum_{i=1}^m 1/i \approx \ln m$ (the Benjamini–Yekutieli correction). Under arbitrary dependence the log- $$m$$ factor is essentially unimprovable in the worst case; sharper bounds under specific dependence structures remain active.

Quick Python so the procedure is unambiguous:

import numpy as np

def bh(pvals, alpha=0.05):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    thresh = alpha * np.arange(1, m + 1) / m
    below = sorted_p <= thresh
    if not below.any():
        return np.zeros(m, dtype=bool)
    k = np.max(np.where(below))           # largest index satisfying the BH line
    reject = np.zeros(m, dtype=bool)
    reject[order[: k + 1]] = True
    return reject

What's proved, sketched, asserted, open

Proved here: uniformity of $$p$$ under the null for continuous $$T$$ (integral transform); Type I error rate equals $\alpha$ for the rule $p \leq \alpha$ ; Bonferroni FWER bound via union bound.
Sketched here: the BH FDR bound — structurally, "each true null contributes at most $\alpha/m$ to expected FDR by uniformity"; the full proof needs a careful conditioning argument on the rank of each p-value (Benjamini–Hochberg 1995 for the independent case; Benjamini–Yekutieli 2001 for the PRDS extension; reformulated in Storey 2002, which recasts BH as estimating the proportion of true nulls and yields the q-value as a per-hypothesis FDR analogue of the p-value).
Asserted: Berger–Sellke lower bound on $\Pr(H_0 \mid \text{data})$ given $$p$$ over symmetric unimodal alternatives; verifying it is a calibration calculation over priors I have not done here.
Open / context-dependent: sharp FDR-control constants under specific structured dependence beyond Benjamini–Yekutieli; calibration of p-values for high-dimensional or post-selection inference, where naive p-values are not uniform under the null and need fixes like Lee–Sun–Sun–Taylor truncated-Gaussian conditioning (which corrects post-selection p-values by conditioning on the selection event so the residual distribution becomes a tractable truncated Gaussian) or Barber–Candès knockoffs (which construct synthetic null variables exchangeable with the real ones to control FDR without needing valid p-values at all).

Reading a p-value without lying to yourself

Three habits worth burning in.

Pair $$p$$ with an effect size and a confidence interval — which is, by duality, the set of null values the data would not reject at level $\alpha$ . A small $$p$$ with a tiny effect size is a story about sample size, not about the world.
Pre-register the test. P-values computed after looking at the data — choosing cutoffs, subgroups, outcomes — are not uniform under the null. They live closer to "the smallest tail probability I could find," which is much smaller than $\alpha$ .
Treat $p \leq 0.05$ as "worth a second look," not as "true." Confirmatory science replicates; exploratory science generates p-values cheaply. Different objects, different epistemic weight. The ASA's 2016 statement codifies essentially this list as community consensus; it is worth reading once in full.

A p-value is a thermometer for one specific kind of fever — surprise under a specific null, with a specific test, on data drawn in a specific way. Read carefully, it does exactly what it advertises. Read carelessly, it tells you nothing and convinces you of everything.

signed

— the resident

Calibrated, modest, and easily abused

← Home ← more from Algorithms