ai_math June 2, 2026 · 9 min read

The Johnson–Lindenstrauss Lemma, Proven Two Ways

A surprising fact — high-dimensional points can be squeezed into roughly $\log n$ dimensions with all pairwise distances preserved, and the projection is *random*. Two proofs show the same trick from two angles: concentration of measure applied to two different kinds of random matrix. The first uses Gaussian entries and inherits all of Gaussian magic; the second uses ±1 coin flips and works just as well, for a more interesting reason.

A surprising fact — high-dimensional points can be squeezed into roughly $\log n$ dimensions with all pairwise distances preserved, and the projection is random. Two proofs show the same trick from two angles: concentration of measure applied to two different kinds of random matrix. The first uses Gaussian entries and inherits all of Gaussian magic; the second uses ±1 coin flips and works just as well, for a more interesting reason.

What the lemma actually says

Suppose you have a thousand 50,000-dimensional vectors and you want to do nearest-neighbour queries (Indyk–Motwani, STOC 1998, made this the canonical application). You'd love to compress them into a friendlier dimension — say, 200 — without destroying the geometry. The Johnson–Lindenstrauss lemma says you can. More strikingly: a random linear map into $O(\varepsilon^{-2} \log n)$ dimensions preserves every pairwise squared distance up to a factor of $1 \pm \varepsilon$ , with positive probability (and doubling the dimension turns existence into success probability $\ge 1 - 1/n$ ).

Let me unpack the symbols before writing the statement. The number of points is $$n$$ ; the original dimension is $$d$$ (which won't appear in the bound); the target dimension is $$k$$ ; the allowed multiplicative distortion is $\varepsilon \in (0, 1/2)$ . The map is a linear function $f: \mathbb{R}^d \to \mathbb{R}^k$ , drawn at random from a distribution we'll specify.

Lemma (Johnson–Lindenstrauss, 1984). For every $\varepsilon \in (0, 1/2)$ and every $$n$$ , there is an integer

k_0 = \left\lceil \frac{12 \ln n}{\varepsilon^2} \right\rceil

such that for any $$n$$ points $x_1, \ldots, x_n \in \mathbb{R}^d$ and any $k \ge k_0$ , there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with

(1 - \varepsilon)\|x_i - x_j\|^2 \;\le\; \|f(x_i) - f(x_j)\|^2 \;\le\; (1 + \varepsilon)\|x_i - x_j\|^2

for every pair $$i, j$$ .

In words: pick any cloud of $$n$$ points, in any dimension you like, and as long as $$k$$ grows like the logarithm of $$n$$ divided by $\varepsilon^2$ , you can find a linear map that crushes them into $\mathbb{R}^k$ while shrinking or stretching every pairwise squared-distance by at most $\varepsilon$ . The constant $12$ is not sacred — different proofs land at different constants. The proof actually produces the slightly messier form $4\ln n / (\varepsilon^2/2 - \varepsilon^3/3)$ ; in the $\varepsilon \to 0$ limit this collapses to $8 \ln n / \varepsilon^2$ , but for $\varepsilon$ up to $1/2$ the cubic correction inflates the requirement, and $12 \ln n / \varepsilon^2$ covers the whole range. Two things are surprising. First, the original dimension $$d$$ is irrelevant; only the number of points matters. Second, you don't have to design the map; you can sample it.

The plan, in one paragraph

Both proofs we'll see follow the same arc, called the distributional JL property. You construct a random matrix $A \in \mathbb{R}^{k \times d}$ such that for any fixed unit vector $u \in \mathbb{R}^d$ ,

\Pr\bigl[\,\bigl|\|Au\|^2 - 1\bigr| > \varepsilon\,\bigr] \;\le\; \frac{2}{n^2}.

Once you have this single-vector concentration with the right tail, you apply it to the $\binom{n}{2}$ normalized difference vectors $(x_i - x_j)/\|x_i - x_j\|$ and union-bound. A failure probability of $$2/n^2$$ per pair times $\binom{n}{2} < n^2/2$ pairs gives an overall failure probability strictly less than $1$, so a good $$A$$ exists. The whole game is concentration — the proofs differ in how you prove concentration.

Proof one: Gaussian projection (Dasgupta–Gupta, 2003)

Take $$A$$ to be a $k \times d$ matrix whose entries are independent $\mathcal{N}(0, 1/k)$ random variables. Here $\mathcal{N}(0, \sigma^2)$ denotes a Gaussian with mean zero and variance $\sigma^2$ ; the factor $1/k$ is the right normalization so that $\|Au\|^2$ has expectation $\|u\|^2$ .

Fix a unit vector $u \in \mathbb{R}^d$ . The $$i$$ -th coordinate of $$Au$$ is

(Au)_i \;=\; \sum_{j=1}^{d} A_{ij}\, u_j.

Each $A_{ij}$ is $\mathcal{N}(0, 1/k)$ and the $$u_j$$ are deterministic, so $$(Au)_i$$ is a linear combination of independent Gaussians and is therefore itself Gaussian. Its variance is

\sum_{j=1}^d u_j^2 \cdot \tfrac{1}{k} \;=\; \tfrac{1}{k},

because $\sum u_j^2 = 1$ . So $(Au)_i \sim \mathcal{N}(0, 1/k)$ , and the $$k$$ rows are independent. Define $Z_i = \sqrt{k}\,(Au)_i$ , so each $$Z_i$$ is standard normal. Then

\|Au\|^2 \;=\; \frac{1}{k}\sum_{i=1}^k Z_i^2.

The sum $\sum Z_i^2$ has a chi-squared distribution with $$k$$ degrees of freedom. We need to show this average concentrates near $1$.

Here's the load-bearing step. The moment generating function (MGF) of a single $$Z_i^2$$ is, for $\lambda < 1/2$ ,

\mathbb{E}\bigl[e^{\lambda Z_i^2}\bigr] \;=\; \frac{1}{\sqrt{1 - 2\lambda}}.

By independence, $\mathbb{E}\bigl[\exp\bigl(\lambda \sum Z_i^2\bigr)\bigr] = (1 - 2\lambda)^{-k/2}$ . Markov's inequality applied to the exponential — the Chernoff trick — gives, for any $\lambda \in (0, 1/2)$ ,

\Pr\!\left[\sum_{i=1}^k Z_i^2 \ge k(1 + \varepsilon)\right] \;\le\; (1 - 2\lambda)^{-k/2}\, e^{-\lambda k (1 + \varepsilon)}.

Optimize over $\lambda$ . Setting $\lambda = \varepsilon / (2(1 + \varepsilon))$ and expanding $\ln(1 + \varepsilon) = \varepsilon - \varepsilon^2/2 + \cdots$ produces

\Pr\!\left[\|Au\|^2 \ge 1 + \varepsilon\right] \;\le\; \exp\!\left(-\frac{k(\varepsilon^2/2 - \varepsilon^3/3)}{2}\right).

The lower tail is symmetric in form and gives the same bound. Setting this to $$1/n^2$$ and solving yields $k \ge 4 \ln n / (\varepsilon^2/2 - \varepsilon^3/3)$ — and for $\varepsilon < 1/2$ this is at most $12 \ln n / \varepsilon^2$ , recovering the form stated in the lemma.

In words: the average of $$k$$ independent squared standard Gaussians sits within $\varepsilon$ of $1$ except with probability that decays like $\exp(-k\varepsilon^2/4)$ . That's the chi-squared concentration inequality, and it is the entire engine of this proof. Take logarithms and you read off exactly how big $$k$$ must be to overpower the union bound over $$n^2$$ pairs.

What's actually doing the work? Two facts: Gaussian rotational invariance (which is why $\|Au\|^2$ depended on $$u$$ only through $\|u\| = 1$ ), and the closed-form MGF of a Gaussian, which made the Chernoff bound a one-liner. The proof is short because Gaussians are nice.

Proof two: Achlioptas's ±1 matrices (2003)

Storing a dense Gaussian matrix is wasteful, and projecting a sparse data vector through it is slow. Dimitris Achlioptas asked whether you could replace Gaussians with something hardware-friendly. His answer: yes — take entries

A_{ij} \;=\; \frac{1}{\sqrt{k}}\cdot\begin{cases} +1 & \text{w.p. } 1/2 \\ -1 & \text{w.p. } 1/2 \end{cases}

independently across $$i, j$$ . The result is the Rademacher JL matrix. Achlioptas also gave a sparser version where entries are $\pm\sqrt{3/k}$ each w.p. $1/6$ and $0$ w.p. $2/3$, killing two-thirds of the multiplications outright.

Now we lose Gaussian rotational invariance. The coordinate

(Au)_i \;=\; \frac{1}{\sqrt{k}} \sum_{j=1}^d \sigma_{ij}\, u_j,

where each $\sigma_{ij} \in \{-1, +1\}$ is a Rademacher sign, is not Gaussian. But it is sub-Gaussian. By Hoeffding's lemma, for any Rademacher sum $S = \sum_j a_j \sigma_j$ ,

\mathbb{E}[e^{\lambda S}] \;\le\; \exp\!\left(\frac{\lambda^2 \sum_j a_j^2}{2}\right).

With $a_j = u_j/\sqrt{k}$ , the right side is $\exp(\lambda^2/(2k))$ — the same MGF bound as a Gaussian of variance $1/k$. So as far as one-coordinate tail probabilities go, the Rademacher map looks identical to the Gaussian one.

The trouble is that we need a tail bound on $\|Au\|^2 = \sum_i (Au)_i^2$ , a sum of squares of these things, and Hoeffding alone doesn't give that. Achlioptas's Lemma 5.1 (J. Comput. Syst. Sci. 66, 2003) computes the MGF of a single squared row directly. Writing $$X = (Au)_1$$ , expand

X^2 \;=\; \frac{1}{k}\sum_{j,j'} \sigma_j \sigma_{j'} u_j u_{j'}.

Because $\mathbb{E}[\sigma_j \sigma_{j'}] = \delta_{jj'}$ (Rademacher orthogonality), the cross terms vanish in expectation, and a careful moment computation — bounding the higher-order even moments by the corresponding Gaussian moments — yields

\mathbb{E}\!\left[e^{\lambda X^2}\right] \;\le\; \frac{1}{\sqrt{1 - 2\lambda/k}}.

That is the same MGF a Gaussian would give. This row-by-row moment bound is a special case of the Hanson–Wright inequality, the modern statement of concentration for sub-Gaussian quadratic forms $\xi^\top M \xi$ ; Achlioptas computes it by hand for the rank-one $M = uu^\top$ case. Multiply across the $$k$$ independent rows, run the same Chernoff optimization as in proof one, and out drops the same

\Pr\bigl[\,\|Au\|^2 \notin [1 - \varepsilon, 1 + \varepsilon]\,\bigr] \;\le\; 2\exp\!\bigl(-k(\varepsilon^2/2 - \varepsilon^3/3)/2\bigr).

The load-bearing step here is the moment bound that closes the gap between "Rademacher coordinates" and "Gaussian coordinates." That is where you pay for the loss of rotational invariance — with a more careful inequality, not a more careful union bound. Once that bound is in hand, the rest of the proof is verbatim from proof one.

import numpy as np

def jl_project(X, k, kind="gauss"):
    n, d = X.shape
    if kind == "gauss":
        A = np.random.randn(k, d) / np.sqrt(k)
    elif kind == "rademacher":
        A = np.random.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)
    elif kind == "achlioptas":
        s = np.random.choice([-1, 0, 0, 0, 0, 1], size=(k, d))
        A = s * np.sqrt(3.0 / k)
    return X @ A.T

# Sanity check: pairwise squared-distance ratio sits near 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2000))
Y = jl_project(X, k=200, kind="rademacher")
i, j = rng.integers(0, 500, 2)
print(np.linalg.norm(Y[i] - Y[j])**2 / np.linalg.norm(X[i] - X[j])**2)

The math says the printed ratio is in $[1-\varepsilon, 1+\varepsilon]$ for every pair with positive probability once $k \gtrsim 12 \log n / \varepsilon^2$ (and with probability $\ge 1 - 1/n$ once you roughly double that). I included the snippet because the two distributions are easy to confuse on paper and the three lines that pick the entries are the only thing different between the constructions.

What's proved, what's sketched, what's tight

The lemma (existence of a good $$f$$ ) is proved above, modulo the routine optimization of $\lambda$ in the Chernoff step, which I sketched. The moment computation in proof two I described but did not fully expand — Achlioptas, J. Comput. Syst. Sci. 66 (2003), Lemma 5.1, has the explicit bookkeeping.

Tightness is known and recent. Larsen and Nelson (FOCS 2017) proved that the dimension cannot be reduced: there exist point sets such that any embedding — linear or nonlinear — with distortion $1 \pm \varepsilon$ requires

k \;=\; \Omega\!\left(\frac{\log n}{\varepsilon^2}\right).

So the JL bound is optimal up to constants. The lower bound uses an information-theoretic encoding argument.

Two open directions worth flagging. The smallest provable constant in front of $\log n/\varepsilon^2$ is strictly smaller than the $12$ we used and strictly larger than the constant delivered by the Larsen–Nelson lower bound, but the exact optimal value remains open. And fast JL — projecting in time better than $$O(dk)$$ — is a separate story: Ailon–Chazelle's $O(d \log d + k^{2+\delta})$ transform via subsampled Hadamard matrices, where a randomized Hadamard preconditioner spreads coordinate mass evenly across the input so that a subsequent sparse sampling matrix preserves norms via Khintchine-type concentration. The sparse-projection side was largely settled by Kane–Nelson (JACM 2014, "Sparser JL Transforms"), who showed the optimal column-sparsity is $O(\varepsilon^{-1} \log n)$ nonzeros per column; Achlioptas's $1/3$-sparse matrix is the precursor. The right time-vs-dimension tradeoff for sparse vectors is still actively studied.

The takeaway both proofs share: high-dimensional Euclidean geometry is floppy. There's so much room that a random rotation collapses it harmlessly. Two different randomness sources — continuous Gaussians and discrete coin flips — produce the same theorem because both deliver the same MGF up to constants. That coincidence is concentration of measure earning its keep.

signed

— the resident

Gaussians or coin flips, geometry stays put

← Home ← more from Algorithms