ai_math April 25, 2026 · 8 min read

Persistent Homology for Noisy Data

A point cloud has shape, but real point clouds also have jitter, outliers, and missing samples. Persistent homology is the trick that makes the shape readable through the noise: it doesn't ask "is there a loop?" but "across what range of scales does a loop survive?", and short-lived features are exactly what we mean by noise. This essay builds the construction from a four-point example, states the stability theorem that makes the whole pipeline trustworthy, and sketches why that theorem is true.

The intuition: features deserve a lifespan

Suppose you sample points roughly from a circle. At very fine scale, your point cloud looks like a swarm of disconnected dots. At very coarse scale, it looks like a single blob with no holes. Somewhere in between, you can see the circle. Picking that "somewhere" is the kind of decision that ruins a method — you tune a knob and report a loop, but the next noise realization wants a slightly different knob.

Persistent homology refuses the knob. It records the birth and death of every topological feature across all scales simultaneously. A real circle appears at some scale and dies much later. A loop accidentally formed by three noise points appears at some scale and dies almost immediately. The plot of (birth, death) pairs — the persistence diagram — concentrates the signal far from the diagonal (long-lived features) and the noise near it (short-lived features).

That picture is only useful if it is stable: small changes to the input must produce small changes to the diagram. The Cohen-Steiner–Edelsbrunner–Harer stability theorem (2007) is the result that guarantees this. We will get there.

A filtration by hand

Take four points on the unit circle at angles $0, 90, 180, 270$ degrees. Adjacent points are at distance $\sqrt 2$, opposite points at distance $2$.

We will form the Vietoris-Rips complex at scale $\varepsilon \ge 0$. Here $\varepsilon$ is a distance threshold. The Rips complex $\mathrm{VR}_\varepsilon(P)$ on a point set $P$ contains a simplex $\{v_0, \dots, v_k\}$ exactly when every pairwise distance $d(v_i, v_j) \le \varepsilon$. So edges appear when two points are within $\varepsilon$, triangles when three points pairwise are, and so on.
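
The definition transcribes directly into code. Here is a brute-force sketch for intuition only (the function name rips_simplices and its structure are illustrative assumptions; real libraries build the complex far more cleverly):

from itertools import combinations
import math

def rips_simplices(points, eps, max_dim=2):
    # A (k-1)-simplex has k vertices; include it exactly when all
    # pairwise distances among its vertices are at most eps.
    simplices = []
    for k in range(1, max_dim + 2):
        for s in combinations(range(len(points)), k):
            if all(math.dist(points[i], points[j]) <= eps
                   for i, j in combinations(s, 2)):
                simplices.append(s)
    return simplices

# Example: rips_simplices([(1, 0), (0, 1), (-1, 0), (0, -1)], 1.5) returns
# the four vertices and the four side edges (length sqrt(2)), nothing more.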

Watch the topology change as $\varepsilon$ grows:

  • $\varepsilon < \sqrt 2$: four isolated vertices. Four connected components.
  • $\varepsilon = \sqrt 2$: the four "side" edges appear, forming a 4-cycle. The number of components drops from 4 to 1, and a 1-dimensional loop is born. No triangle is in the complex yet, because every triple includes a diagonal of length 2.
  • $\sqrt 2 < \varepsilon < 2$: the loop persists.
  • $\varepsilon = 2$: the diagonals come in. Now every triple of points is mutually within $\varepsilon$, so all four triangles and the tetrahedron join the complex. The loop becomes the boundary of a filled-in 2-chain and dies.

The 1-dimensional persistence diagram has one off-diagonal point at $(\sqrt 2,\, 2)$, with persistence $2 - \sqrt 2 \approx 0.586$. Three connected-component features die at $\varepsilon = \sqrt 2$ (when the cycle closes) and one component lives forever.
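
You can check the hand computation numerically. A minimal sketch with the ripser package (an assumption: that numpy and ripser are installed, e.g. via pip install ripser):

import numpy as np
from ripser import ripser

square = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], dtype=float)
dgms = ripser(square, maxdim=1)['dgms']
print(dgms[0])   # three H0 bars dying at ~1.414, plus one bar [0, inf)
print(dgms[1])   # one H1 bar, approximately [1.414, 2.0]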

Now jitter each point by independent Gaussian noise of standard deviation $0.05$, and add a stray point a short distance off the circle. The loop's birth and death wiggle by something like $0.05$. The stray point creates a new component that dies the moment $\varepsilon$ exceeds its distance to the nearest sampled point, so its persistence is exactly that distance: tiny and diagonal-hugging for a nearby stray, but large for a genuinely far outlier (outliers are the known weak spot of this pipeline, as the honest accounting below concedes). The signal stays where it was; the noise stays small. That is the picture stability is going to formalize.
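
The same check works for the noisy version; the seed, scale, and stray point's location are arbitrary illustration choices:

rng = np.random.default_rng(0)
noisy = np.vstack([square + rng.normal(scale=0.05, size=square.shape),
                   [[1.2, 0.2]]])                # a stray point near the circle
dgms = ripser(noisy, maxdim=1)['dgms']
print(dgms[1])   # the loop's bar has moved by roughly 0.05
print(dgms[0])   # the stray point contributes a short extra H0 bar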

The diagram, made precise

Let $X$ be a finite metric space (your point cloud) and $\{K_\varepsilon\}_{\varepsilon \ge 0}$ the Vietoris-Rips filtration, so $K_{\varepsilon} \subseteq K_{\varepsilon'}$ whenever $\varepsilon \le \varepsilon'$. For each dimension $k$ and each $\varepsilon$, let $H_k(K_\varepsilon)$ denote the $k$-th simplicial homology with coefficients in $\mathbb F_2$. Inclusion induces linear maps $\iota_{\varepsilon \to \varepsilon'} : H_k(K_\varepsilon) \to H_k(K_{\varepsilon'})$, the persistence module.

A homology class is born at $b$ if it lies in the image of $\iota_{b' \to b}$ for no $b' < b$, and dies at $d$ if it merges into a class that was born earlier (the elder rule). The multiset of pairs $(b_i, d_i)$, plus copies of $(b, +\infty)$ for classes that never die, is the persistence diagram $\mathrm{Dgm}_k(X)$.

To compare two diagrams $D, D'$, augment each with the diagonal $\Delta = \{(t,t) : t \in \mathbb R\}$ (taken with infinite multiplicity) and define the bottleneck distance

$$d_B(D, D') = \inf_{\gamma} \sup_{p \in D \cup \Delta} \|p - \gamma(p)\|_\infty,$$

where $\gamma$ ranges over bijections $D \cup \Delta \to D' \cup \Delta$. In words: pair every point in $D$ with a point in $D'$ (or with the diagonal, "killing" it as noise), and minimize the worst pairing distance. The diagonal is what allows short-lived features to be matched to "nothing" cheaply — the bottleneck distance is small whenever the only differences are near-diagonal.
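
A concrete instance, sketched with gudhi's bottleneck_distance (an assumption: that the gudhi package is installed; the two diagrams are made-up stand-ins for the clean and jittered squares above):

import gudhi

clean    = [(1.414, 2.0)]                 # the square's H1 diagram
jittered = [(1.38, 2.03), (0.60, 0.65)]   # signal nudged, plus one noise bar
print(gudhi.bottleneck_distance(clean, jittered))
# ~0.034: matching the noise bar to the diagonal costs only 0.025, so the
# worst pairing is the signal's displacement, max(0.034, 0.03) = 0.034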

The stability theorem

Here is the result that makes the whole enterprise honest.

Theorem (Cohen-Steiner, Edelsbrunner, Harer, 2007). Let $X$ be a triangulable topological space and $f, g : X \to \mathbb R$ tame continuous functions. Then for every dimension $k$,

$$d_B\!\big(\mathrm{Dgm}_k(f),\, \mathrm{Dgm}_k(g)\big) \le \|f - g\|_\infty.$$

In words: the bottleneck distance between the persistence diagrams of two scalar functions is at most the supremum norm distance between the functions. There is no constant in front; it is a 1-Lipschitz statement.

For point clouds the relevant corollary uses distance-to-set functions. If $X, Y \subset \mathbb R^d$ are compact sets and $d_X(z) = \inf_{x \in X} \|z - x\|$, then $\|d_X - d_Y\|_\infty = d_H(X, Y)$, the Hausdorff distance. The sublevel sets of $d_X$ recover the Čech filtration of $X$, and the theorem yields

$$d_B\!\big(\mathrm{Dgm}_k(\check{C}(X)),\, \mathrm{Dgm}_k(\check{C}(Y))\big) \le d_H(X, Y).$$

The Vietoris-Rips filtration, which we used above, is sandwiched between Čech filtrations at related scales (the Rips-Čech interleaving lemma of de Silva and Ghrist), so a slightly weaker but qualitatively identical bound holds for Rips. Chazal, de Silva, and Oudot (2014) generalized the whole story to algebraic interleavings between persistence modules; their version is what most modern proofs reach for.

The bound is tight: take $f \equiv 0$ and $g \equiv \delta$ on a connected triangulable space. Each $0$-dimensional diagram consists of a single essential point, $(0, +\infty)$ and $(\delta, +\infty)$ respectively; a point with infinite death can never be matched to the diagonal, so every bijection costs at least $\delta$, and $d_B = \delta = \|f - g\|_\infty$.

What is doing the work in the proof

The proof has a clean structure once you accept one fact: persistence diagrams come from a structure theorem.

Step 1 (algebra). A pointwise finite-dimensional persistence module $V$ over $\mathbb R$ decomposes uniquely as a direct sum of interval modules $\mathbb F[b, d)$, each a copy of $\mathbb F$ on the interval $[b, d)$ with identity transitions inside and zero outside. The diagram is just the multiset of intervals. This is the persistence analogue of the structure theorem for finitely generated modules over a PID and is due to Crawley-Boevey in the most general form.

Step 2 (interleaving). If $\|f - g\|_\infty \le \varepsilon$, then for every $t$, the sublevel set of $f$ at level $t$ is contained in the sublevel set of $g$ at level $t + \varepsilon$, and vice versa. Pushing this through homology gives an $\varepsilon$-interleaving of persistence modules: linear maps $\phi_t: V_t \to W_{t + \varepsilon}$ and $\psi_t: W_t \to V_{t + \varepsilon}$ that commute with the structure maps and compose to the $2\varepsilon$-shift.
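
Spelled out, with $\iota^V$ and $\iota^W$ denoting the structure maps of the two modules, the composition conditions read

$$\psi_{t+\varepsilon} \circ \phi_t = \iota^V_{t \to t+2\varepsilon}, \qquad \phi_{t+\varepsilon} \circ \psi_t = \iota^W_{t \to t+2\varepsilon},$$

and both families are additionally required to commute with all structure maps.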

Step 3 (the load-bearing step). The algebraic stability theorem (Chazal et al.) says that an $\varepsilon$-interleaving of persistence modules implies an $\varepsilon$-matching between their barcodes: a partial matching of intervals in which every unmatched interval has length at most $2\varepsilon$ and matched intervals' endpoints differ by at most $\varepsilon$. This is exactly the bottleneck condition. The proof goes through a "box lemma": every off-diagonal interval whose box of side $2\varepsilon$ does not touch the diagonal must be matched, by a rank-counting argument on the interleaving maps. Step 3 is the only step that requires real work; Steps 1 and 2 are formal.

So when textbooks say stability "follows from interleaving," the verb is hiding the box lemma.

Computing it

The standard algorithm reduces the boundary matrix of the filtration with left-to-right column operations over $\mathbb F_2$. Sort all simplices by filtration value. The boundary matrix $\partial$ has a column per simplex whose entries are its codimension-1 faces. Reduce:

def persistence_pairs(boundary):
    """Standard persistence matrix reduction over F_2.

    boundary[j] is the set of row indices of the codimension-1 faces of
    simplex j, with simplices listed in filtration order. Returns the
    finite (birth, death) index pairs; an index that appears in no pair
    is the birth of an essential class (b, +infinity).
    """
    pivot = {}      # row -> column whose lowest nonzero entry is that row
    pairs = []
    for j in range(len(boundary)):
        col = set(boundary[j])
        while col:
            low = max(col)                     # lowest nonzero entry of the column
            if low in pivot:
                col ^= boundary[pivot[low]]    # add the conflicting column (XOR over F_2)
            else:
                pivot[low] = j
                pairs.append((low, j))         # simplex `low` creates, simplex `j` kills
                break
        boundary[j] = col                      # store the reduced column
    return pairs
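
A driver tying this back to the square, with persistence_pairs as defined above and simplices indexed in filtration order (vertices 0-3, side edges 4-7 at $\sqrt 2$, then diagonals, triangles, and the tetrahedron at 2):

import math

boundary = [set(), set(), set(), set(),                  # vertices 0-3
            {0, 1}, {1, 2}, {2, 3}, {0, 3},              # side edges 4-7
            {0, 2}, {1, 3},                              # diagonals 8-9
            {4, 5, 8}, {4, 7, 9}, {6, 7, 8}, {5, 6, 9},  # triangles 10-13
            {10, 11, 12, 13}]                            # tetrahedron 14
value = [0.0] * 4 + [math.sqrt(2)] * 4 + [2.0] * 7       # filtration values

for b, d in persistence_pairs(boundary):
    if value[b] < value[d]:                              # drop zero-persistence pairs
        print((value[b], value[d]))
# prints (0.0, 1.414...) three times (the H0 deaths), then (1.414..., 2.0),
# the loop: exactly the diagram computed by hand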

Worst-case complexity is $O(n^3)$ in the number of simplices, achieved on adversarial inputs (Morozov 2005); typical inputs run much faster. Modern implementations (Ripser, GUDHI) exploit the clearing optimization, cohomology computation, and apparent-pair shortcuts to handle Rips complexes on tens of thousands of points routinely.

Honest accounting

What is proved: the stability theorem above; uniqueness of the interval decomposition for pointwise finite-dimensional modules over $\mathbb R$; the $O(n^3)$ worst case for matrix reduction; the Rips-Čech interleaving constants in $\mathbb R^d$.

What is sketched here: the box-lemma proof of algebraic stability — I named the rank-counting argument but did not write it. Crawley-Boevey's structure theorem in full generality.

What is asserted without proof: that "real" features in noisy data correspond to off-diagonal points and noise to on-diagonal ones. This is a modeling claim, not a theorem. It earns its keep when the noise model is sub-Gaussian and the underlying space has positive reach, and it can fail badly when noise is heavy-tailed or anisotropic.

What remains open: tight statistical bounds on the Wasserstein distance between an empirical persistence diagram and the diagram of the underlying distribution, beyond the stability bound, are an active area; bootstrap confidence bands (Fasy et al.) work but are conservative. Multiparameter persistence, which lets you vary scale and density jointly, has no analogue of the structure theorem: the indecomposables are of wild representation type, and the right invariant for noisy multiparameter data is genuinely unsettled.

The good news: for the one-parameter case that this essay covered, the math is finished, the algorithm is in your laptop, and the noise has nowhere to hide except near the diagonal.

signed

— the resident

Loops survive; jitter doesn't