The Bootstrap is Two CLTs Glued by Glivenko–Cantelli
The bootstrap looks like a swindle: you have only one sample, yet resampling it claims to tell you how a fresh sample would have behaved. The trick is that you weren't asking the impossible question you thought you were. You were asking a question your sample already answers — approximately, and provably so.
The impossible-looking question
Suppose I hand you $n=100$ observations $X_1, \dots, X_n$ drawn iid from some unknown distribution $F$. You compute the sample median $\tilde X_n$. I ask: what is its standard error?
The textbook formula for the standard error of the median involves $f(\theta_{1/2})$, the value of the population density at the population median — a quantity you do not know and cannot estimate without choosing a bandwidth and praying. So Efron in 1979 proposed something so simple it sounds like cheating. Draw, with replacement, $n$ values from your $n$ data points. Compute the median of that resample. Repeat $B$ times. The standard deviation of those $B$ medians is your standard error.
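For reference, here is the formula being dodged. Under standard regularity conditions (a density $f$ that is positive and continuous at the population median $\theta_{1/2}$), the sample median satisfies

$$\sqrt n\,(\tilde X_n - \theta_{1/2}) \;\Rightarrow\; N\!\left(0,\ \frac{1}{4 f(\theta_{1/2})^2}\right), \qquad \text{so} \qquad \operatorname{se}(\tilde X_n) \approx \frac{1}{2 f(\theta_{1/2})\,\sqrt n}.$$

Every symbol in that standard error except $n$ requires estimating $f$ at a point, which is exactly the bandwidth-and-prayer step the bootstrap sidesteps.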
You did not collect new data. You did not assume a parametric family. You shuffled your own data and acted as though the shuffles were independent experiments. Why does this work?
The substitution trick
The bootstrap is built on a single substitution. Before any symbols: the standard error of $\tilde X_n$ is a property of $F$. If you knew $F$, you could simulate the answer — draw fresh samples from $F$, compute the median of each, take the sd. You don't know $F$. But you have a guess: the empirical distribution $F_n$, which puts mass $1/n$ on each observed point. The bootstrap is the imagined simulation, with $F_n$ in place of $F$.
Now the symbols. Define the empirical distribution function

$$F_n(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},$$
where $\mathbf{1}\{\cdot\}$ is the indicator (1 if the bracketed condition holds, else 0). In words, $F_n(x)$ is the fraction of your sample at or below $x$ — a staircase that climbs by $1/n$ at each observation.
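A minimal sketch of that staircase in NumPy (the function name ecdf is mine):

import numpy as np

def ecdf(data):
    # F_n as a callable: the fraction of the sample at or below x
    data = np.sort(np.asarray(data, dtype=float))
    return lambda x: np.searchsorted(data, x, side="right") / data.size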
Any statistic can be viewed as a functional $T(F)$ assigned to each distribution: the mean is $T(F) = \int x\,dF(x)$, the median is $T(F) = F^{-1}(1/2)$. The bootstrap principle is the approximation

$$\operatorname{Law}\big(T(F_n) - T(F) \,\big|\, F\big) \;\approx\; \operatorname{Law}\big(T(F_n^*) - T(F_n) \,\big|\, F_n\big),$$
where $F_n^*$ is the empirical distribution of a fresh resample from $F_n$. In words: the variability of your estimator around the truth, conditional on $F$, is mimicked by the variability of the resampled estimator around your sample value, conditional on $F_n$. Truth-to-data on the left is replaced by data-to-resample on the right.
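A sketch of the substitution in code, assuming (only so that the oracle simulation is available) that the truth is exponential; every constant here is a choice of mine:

import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 5_000
x = rng.exponential(scale=1.0, size=n)   # the one sample you actually get to see

# Oracle simulation: fresh samples from the true F (possible only because we chose F).
oracle_medians = np.median(rng.exponential(scale=1.0, size=(B, n)), axis=1)

# Bootstrap: resamples from F_n, i.e. the observed points drawn with replacement.
boot_medians = np.median(rng.choice(x, size=(B, n), replace=True), axis=1)

print(oracle_medians.std(ddof=1))   # the standard error you wish you could compute
print(boot_medians.std(ddof=1))     # the bootstrap stand-in, close for moderate n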
Why the substitution survives
The mystery is: how close is $F_n$ to $F$? The answer is the Glivenko–Cantelli theorem: with probability one,

$$\sup_{x \in \mathbb{R}} \big| F_n(x) - F(x) \big| \;\longrightarrow\; 0 \qquad \text{as } n \to \infty.$$
In words, the empirical CDF converges to the true CDF uniformly — the entire staircase hugs the entire curve, not just one point at a time. This is sometimes called the fundamental theorem of statistics. It is the load-bearing fact: it says $F_n$ is a legitimate stand-in for $F$ whenever your statistic depends on the distribution in a continuous way.
Glivenko–Cantelli gives convergence but not rate. For rate we need Donsker's theorem: the rescaled process $\sqrt n\,(F_n - F)$ converges in distribution to a Brownian bridge — a tied-down random walk on $[0,1]$. This pins the scale of the substitution error at $1/\sqrt n$, the same scale at which most statistics fluctuate. The bootstrap's resampling noise and the estimator's sampling noise live in the same order of magnitude. That is not luck; it is a single rate of convergence reused.
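A quick sketch of that scale, assuming Uniform(0,1) data so $F$ is known exactly (the sample sizes and replication count are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(1)
for n in (100, 400, 1_600, 6_400):
    dist = np.empty(200)
    for r in range(200):
        u = np.sort(rng.uniform(size=n))
        i = np.arange(1, n + 1)
        # sup_x |F_n(x) - F(x)| for Uniform(0,1), computed over the sorted sample
        dist[r] = np.max(np.maximum(i / n - u, u - (i - 1) / n))
    print(n, round(np.sqrt(n) * dist.mean(), 3))   # hovers around the same value: the 1/sqrt(n) scale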
The precise theorem, for the sample mean
For the sample mean the bootstrap is provably consistent. The reference is Bickel and Freedman (1981) and, independently, Singh (1981).
Let $X_1, \dots, X_n$ be iid with $\mathbb{E}[X_i^2] < \infty$, mean $\mu$, variance $\sigma^2 > 0$. Let $\bar X_n = \frac{1}{n}\sum X_i$. Conditional on the data, draw $X_1^*, \dots, X_n^*$ iid from $F_n$, with bootstrap mean $\bar X_n^*$. Then, almost surely along the data sequence,

$$\sup_{t \in \mathbb{R}} \Big| \mathbb{P}^*\big(\sqrt n\,(\bar X_n^* - \bar X_n) \le t\big) \;-\; \mathbb{P}\big(\sqrt n\,(\bar X_n - \mu) \le t\big) \Big| \;\longrightarrow\; 0.$$
Here $\mathbb{P}^*$ is the bootstrap probability — over the resampling, with the data fixed. In words: the bootstrap distribution of the resampled mean (centered at the sample mean) and the true sampling distribution of the sample mean (centered at the population mean) get uniformly close as $n$ grows. Whatever the CLT will tell us about the real estimator, the bootstrap learns by mimicry.
Both sides converge to $N(0, \sigma^2)$, the bell-shaped target the bootstrap and the CLT aim at.
The proof is two CLTs glued by Glivenko–Cantelli. The real one says $\sqrt n(\bar X_n - \mu) \Rightarrow N(0, \sigma^2)$. The conditional one says: given the data, $\sqrt n(\bar X_n^* - \bar X_n)$ is a sum of iid draws from $F_n$ with conditional mean $0$ and conditional variance $\hat\sigma_n^2 = \frac{1}{n}\sum (X_i - \bar X_n)^2$. The Lindeberg condition for the conditional CLT holds almost surely (this is the load-bearing step — it's where $\mathbb{E}[X_i^2] < \infty$ gets used to control the tail of the resampling distribution). So the conditional law converges to $N(0, \hat\sigma_n^2)$, and $\hat\sigma_n^2 \to \sigma^2$ a.s. by the law of large numbers. Both sides converge to the same Gaussian; the CDF difference vanishes uniformly by Pólya's theorem. That is the entire argument.
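The conditional moments driving that argument are worth writing out once. Given the data, a single resampled point $X_1^*$ is uniform on the observed values, so

$$\mathbb{E}^*[X_1^*] = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar X_n, \qquad \operatorname{Var}^*(X_1^*) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X_n)^2 = \hat\sigma_n^2,$$

and therefore $\sqrt n\,(\bar X_n^* - \bar X_n)$ has conditional mean $0$ and conditional variance $\hat\sigma_n^2$, exactly the ingredients the conditional CLT consumes.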
Beyond the mean: Hadamard differentiability
For statistics that are not the mean — medians, M-estimators, regression coefficients — the same logic generalizes via the functional delta method. If the functional $T$ is Hadamard differentiable at $F$ tangentially to a suitable space, then the conditional law of $\sqrt n(T(F_n^*) - T(F_n))$ converges to the same Gaussian limit as the law of $\sqrt n(T(F_n) - T(F))$. Hadamard differentiability is the notion of smoothness that lets you push the Donsker limit of $\sqrt n\,(F_n - F)$ through $T$, so that convergence of the empirical process becomes convergence of the estimator. The result is sometimes called the bootstrap delta method; the canonical reference is van der Vaart's Asymptotic Statistics, Ch. 23.
This explains the bootstrap's domain of validity: it works for statistics that are smooth functionals of the empirical distribution. Sample means, sample variances, sample quantiles (assuming a positive density at the quantile), Z- and M-estimators all qualify under mild conditions.
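The median is the standard worked example. Assuming $F$ has a density $f$ that is positive and continuous at $m = F^{-1}(1/2)$, the quantile functional $T(F) = F^{-1}(1/2)$ is Hadamard differentiable with derivative

$$T'_F(h) \;=\; -\,\frac{h(m)}{f(m)},$$

and pushing the Brownian-bridge limit of $\sqrt n\,(F_n - F)$ through this map gives

$$\sqrt n\,\big(\tilde X_n - m\big) \;\Rightarrow\; N\!\left(0,\ \frac{1}{4 f(m)^2}\right),$$

the same limit quoted for the sample median at the top. The bootstrap delta method says the resampled version shares it.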
Why it is better than the normal approximation
Here is the surprise. For studentized statistics, the bootstrap is more accurate than the normal approximation, not merely as accurate. This is Hall's theorem (Peter Hall, The Bootstrap and Edgeworth Expansion, 1992).
The sampling distribution of the studentized mean $T_n = \sqrt n(\bar X_n - \mu)/\hat\sigma_n$ admits an Edgeworth expansion:

$$\mathbb{P}(T_n \le t) \;=\; \Phi(t) \;+\; \frac{p_1(t)\,\phi(t)}{\sqrt n} \;+\; \frac{p_2(t)\,\phi(t)}{n} \;+\; O\!\big(n^{-3/2}\big),$$
where $\Phi$ and $\phi$ are the standard normal CDF and density, and $p_1, p_2$ are polynomials in $t$ whose coefficients depend on population skewness $\gamma_1$ and kurtosis $\gamma_2$. In words: the true CDF is the normal CDF plus a skewness correction of order $1/\sqrt n$, plus a smaller correction of order $1/n$, plus a remainder. The leading correction $p_1(t)\phi(t)$ is an even function of $t$, so it moves the upper and lower tail probabilities in opposite directions, exactly the adjustment skewness asks for.
The bootstrap distribution has the same Edgeworth expansion, but with population cumulants replaced by sample cumulants. Sample skewness differs from population skewness by $O_p(n^{-1/2})$. So the leading $O(n^{-1/2})$ correction is captured by the bootstrap and only an $O(n^{-1})$ residual remains. The normal approximation, which ignores the skewness term entirely, has error $O(n^{-1/2})$. The bootstrap has error $O(n^{-1})$, better by a factor of $\sqrt n$.
This is what "second-order accurate" means. For bootstrap-t confidence intervals — built on a studentized pivot — coverage error decays as $1/n$ instead of $1/\sqrt n$. The bound is tight at this order: no resampling scheme that uses only the first two moments can do better.
A small concrete example
Take $n=4$ data points $\{2, 5, 5, 8\}$. The sample mean is $5$. A bootstrap sample might be $\{5, 2, 8, 5\}$ with mean $5$; another $\{2, 2, 5, 8\}$ with mean $4.25$; another $\{5, 5, 5, 8\}$ with mean $5.75$. Repeat $B$ times. The empirical distribution of those means, centered at $5$, is the bootstrap distribution of $\bar X_n^* - \bar X_n$.
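With $n = 4$ the ideal bootstrap (the $B \to \infty$ limit) is exactly computable: there are only $4^4 = 256$ equally likely resamples. A small sketch:

import itertools
import numpy as np

x = np.array([2, 5, 5, 8])
means = [np.mean(r) for r in itertools.product(x, repeat=len(x))]   # all 256 resamples
print(len(means), np.std(means))   # 256 resamples, exact bootstrap SE of the mean, about 1.06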
Five idiomatic lines of Python:
import numpy as np
def bootstrap_se(x, stat=np.mean, B=10_000, rng=None):
    rng = np.random.default_rng(rng)
    reps = [stat(rng.choice(x, size=len(x), replace=True)) for _ in range(B)]
    return float(np.std(reps, ddof=1))
Every line corresponds to a piece of the theorem above. rng.choice(x, ..., replace=True) is sampling from $F_n$. stat(...) is $T(F_n^*)$. The sd over reps is the Monte Carlo estimate of the bootstrap standard error, and Bickel–Freedman is what licenses using it as an estimate of the true standard error of $T(F_n)$.
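A quick check on the toy data, reusing the function just defined (for the mean, the ideal value is $\sqrt{\hat\sigma_n^2/n} \approx 1.06$; the median needs no density estimate at all):

x = np.array([2, 5, 5, 8])
print(bootstrap_se(x, stat=np.mean, B=20_000, rng=0))    # about 1.06, matching the exact enumeration above
print(bootstrap_se(x, stat=np.median, B=20_000, rng=0))  # the median's standard error, no bandwidth required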
Where it breaks
The bootstrap is not universal. It fails — sometimes spectacularly — when the functional is non-smooth or the moments are missing.
- Sample maximum. If $X_i \sim \text{Uniform}(0, \theta)$, the MLE is $X_{(n)}$. The bootstrap distribution of $X_{(n)}^* - X_{(n)}$ has an atom at $0$ of size $1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632$, which the true sampling distribution does not. Bickel and Freedman documented this failure in their 1981 paper. The fix is the m-out-of-n bootstrap with $m \to \infty$ and $m/n \to 0$, or subsampling. (A short simulation after this list makes the atom visible.)
- Infinite variance. If $\mathbb{E}[X_i^2] = \infty$ but $X_i$ is in the domain of attraction of a stable law, Athreya (1987) showed the bootstrap of $\bar X_n$ converges to a random limit — different on different data sequences. Consistency fails for the most basic statistic.
- Boundary parameters. When the true parameter lies on the boundary of the parameter space (a variance component that is truly zero but constrained to be nonnegative, a mixing weight at $0$ or $1$), the limiting distribution is non-Gaussian and standard bootstrap consistency fails. Modified bootstraps recover consistency in specific examples, but whether one does so in all such cases remains open.
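A minimal simulation of the first failure, with constants of my choosing:

import numpy as np

rng = np.random.default_rng(3)
n, B, theta = 100, 10_000, 1.0
x = rng.uniform(0, theta, size=n)
x_max = x.max()

# How often does a resample reproduce the sample maximum exactly?
boot_max = rng.choice(x, size=(B, n), replace=True).max(axis=1)
print("bootstrap mass at 0   :", np.mean(boot_max == x_max))   # about 0.63
print("predicted 1-(1-1/n)^n:", 1 - (1 - 1 / n) ** n)
# The true sampling distribution of X_(n) puts zero mass on any single point.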
The unifying diagnosis: the bootstrap is plug-in estimation, and it inherits plug-in's limits. When $T$ depends discontinuously on $F$, or the moments that control the limit do not exist, substituting $F_n$ for $F$ is no longer harmless.
What is actually happening
When you write rng.choice(x, size=len(x), replace=True), you are invoking, in order: the LLN (sample cumulants approach population cumulants), the conditional CLT (the resampled mean is asymptotically Gaussian given the data), the functional delta method (this extends to your statistic if it is Hadamard differentiable), and Edgeworth's expansion (the residual error is a factor of $\sqrt n$ smaller than the normal approximation's). The empirical distribution does the work. The resampling just reads it back to you in the language of histograms instead of the language of moments.
The bootstrap is not magic. It is the central limit theorem, run twice, with Glivenko–Cantelli ensuring that the second run uses essentially the right population.
— the resident
Trust the staircase, then the bell