ai_math May 19, 2026 · 7 min read

Markov's Inequality and Its Children

A one-line bound about nonnegative random variables grows up, after one substitution at a time, into Chebyshev, Chernoff, Hoeffding, Bernstein, and the entire concentration-of-measure toolkit. The trick is always the same; the art is choosing the function you apply it to.

The picture

Markov's inequality is a conservation argument. If a nonnegative random variable $$X$$ has mean $\mu$ , then it cannot spend too much probability far above $\mu$ without dragging the mean up. So $P(X \geq a)$ has to scale at most like $\mu/a$ .

That's all of it. Chebyshev, Chernoff, Hoeffding, Bernstein, and the rest of the dynasty are the same one move repeated: take a clever nonnegative function of $$X$$ , apply Markov to that, read off the bound. The whole pyramid of concentration inequalities is one inequality in a hall of mirrors.

Markov

In plain English: a nonnegative random variable with mean $\mu$ is at most as likely to exceed $$a$$ as $\mu/a$ would allow.

Symbols. Let $$X$$ be a random variable with $X \geq 0$ almost surely. Let $$a > 0$$ be a threshold. Let $\mathbb{E}[X] = \mu$ denote the mean. Let $\mathbf{1}\{X \geq a\}$ be the indicator: $1$ when $X \geq a$ , $0$ otherwise.

P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}.

In words: bigger $$a$$ means smaller bound, and the bound decays like $1/a$ — no faster. That decay rate is the entire content of Markov.

Proof, with the load-bearing step marked. The argument is a three-step chain:

\mathbb{E}[X] \;\geq\; \mathbb{E}\big[X \cdot \mathbf{1}\{X \geq a\}\big] \;\geq\; a \cdot \mathbb{E}\big[\mathbf{1}\{X \geq a\}\big] \;=\; a \cdot P(X \geq a).

The first inequality drops the (nonnegative) contribution from $\{X < a\}$ . The second — the step that actually does the work — replaces $$X$$ by $$a$$ on the event $\{X \geq a\}$ , which is legal because $X \geq a$ there. The third is the definition of an indicator's expectation. Divide by $$a$$ .

Markov is tight. Pick any $a \geq \mu$ , then let $$X = a$$ with probability $\mu/a$ and $$X = 0$$ otherwise. Then $\mathbb{E}[X] = \mu$ and $P(X \geq a) = \mu/a$ , with equality. So when the only thing you know about a nonnegative distribution is its mean, you cannot do better than $1/a$ decay — period.

Chebyshev: Markov, dressed up

Now suppose $$X$$ has finite variance, and you want to bound how far $$X$$ wanders from its mean. The move: apply Markov not to $$X$$ , but to $(X - \mu)^2$ , which is nonnegative.

Variables: $\mu = \mathbb{E}[X]$ , $\sigma^2 = \text{Var}(X) = \mathbb{E}[(X-\mu)^2]$ , $$k > 0$$ .

P(|X - \mu| \geq k\sigma) \;=\; P\!\left((X-\mu)^2 \geq k^2 \sigma^2\right) \;\leq\; \frac{\mathbb{E}[(X-\mu)^2]}{k^2 \sigma^2} \;=\; \frac{1}{k^2}.

In words: the probability of being at least $$k$$ standard deviations from the mean is at most $$1/k^2$$ — for any distribution with finite variance. Two sigmas: at most $25\%$ . Three sigmas: at most $\sim 11\%$ . (For a Gaussian those would be $5\%$ and $0.27\%$ ; Chebyshev is loose for the Gaussian and tight for a particular three-point distribution.) Nothing about Chebyshev is independent of Markov — only the choice of nonnegative function changed.

A modestly sharper one-sided version exists: Cantelli's inequality says that under the same hypotheses,

P(X - \mu \geq k\sigma) \leq \frac{1}{1 + k^2}.

At three sigmas this gives $1/10$ instead of the $1/9$ that one-sided Chebyshev would give. The proof: apply Markov to $(X - \mu + s)^2$ and optimize over the shift $$s$$ .

The Chernoff method: pick a much better function

Chebyshev wastes information. The squared deviation is one quadratic; if we know more about $$X$$ , we should use more. The Chernoff method says: apply Markov to $e^{tX}$ , an entire one-parameter family of nonnegative functions, and then optimize the parameter.

Variables: $$t > 0$$ , threshold $$a$$ , moment-generating function $M_X(t) = \mathbb{E}[e^{tX}]$ .

P(X \geq a) \;=\; P(e^{tX} \geq e^{ta}) \;\leq\; \frac{\mathbb{E}[e^{tX}]}{e^{ta}} \;=\; M_X(t) \, e^{-ta}.

This holds for every $$t > 0$$ , so take the infimum:

P(X \geq a) \;\leq\; \inf_{t > 0} M_X(t)\, e^{-ta} \;=\; \exp\!\Big(-\sup_{t > 0}\big[ta - \log M_X(t)\big]\Big) \;=\; e^{-I(a)}.

The exponent $I(a) = \sup_{t > 0}\big[ta - \log M_X(t)\big]$ is the Legendre transform of $\log M_X$ — the Cramér rate function. Unlike Markov or Chebyshev, this bound has the right exponent, not just the right shape. Cramér's theorem makes that precise: for i.i.d. $X_1, X_2, \dots$ with mean $\mu$ and sample mean $\bar X_n$ ,

\lim_{n \to \infty} \frac{1}{n} \log P(\bar X_n \geq a) = -I(a) \quad \text{for } a > \mu.

So Chernoff is asymptotically sharp at the exponential rate.

Hoeffding: the workhorse of learning theory

Hoeffding is the Chernoff method specialized to bounded variables, with the MGF estimated once and reused everywhere.

Statement. Let $X_1, \dots, X_n$ be independent with $a_i \leq X_i \leq b_i$ almost surely. Let $S_n = \sum_i X_i$ . Then for any $$t > 0$$ ,

P\!\big(S_n - \mathbb{E}[S_n] \geq t\big) \;\leq\; \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).

In the i.i.d. case with $X_i \in [0,1]$ and $\mathbb{E}[X_i] = p$ , writing $\hat p = S_n/n$ ,

P(\hat p - p \geq \varepsilon) \;\leq\; e^{-2n\varepsilon^2}.

In words: to estimate a Bernoulli mean within $\varepsilon$ with confidence $1 - \delta$ , take $n \geq \log(1/\delta) / (2\varepsilon^2)$ samples. The Gaussian-shaped exponent $e^{-2n\varepsilon^2}$ is the source of the $1/\varepsilon^2$ sample complexity that appears everywhere in PAC learning, A/B testing, and online algorithms.

Where the bound comes from. The load-bearing fact is Hoeffding's lemma: if $$X$$ is centered ( $\mathbb{E}[X] = 0$ ) and bounded in $$[a, b]$$ , then

\mathbb{E}[e^{tX}] \;\leq\; \exp\!\left(\frac{t^2 (b - a)^2}{8}\right).

The proof of the lemma uses the convexity of $x \mapsto e^{tx}$ on $$[a,b]$$ to bound $e^{tX}$ by a linear interpolation, takes expectations, and then a careful Taylor argument on the resulting log-MGF — that Taylor step is where the constant $1/8$ comes from, and it is the only nontrivial calculation in the whole pipeline. Once the lemma is in hand, independence gives $M_{S_n}(t) = \prod_i M_{X_i}(t)$ , multiply the lemma's bounds, run Chernoff, optimize $$t$$ . Done.

When the variance is small, Bernstein beats Hoeffding

Hoeffding pays the full range $$(b - a)^2$$ , even when most of $$X_i$$ 's mass sits in a tiny pocket. Bernstein's inequality does better when variance is much smaller than range. With $|X_i| \leq M$ a.s., $\sigma^2 = \text{Var}(X_i)$ , and $S_n = \sum_i X_i$ ,

P\!\big(S_n - \mathbb{E}[S_n] \geq t\big) \;\leq\; \exp\!\left(-\frac{t^2/2}{n\sigma^2 + Mt/3}\right).

For small $$t$$ (i.e., $t \ll 3 n\sigma^2 / M$ ) the denominator is dominated by $n\sigma^2$ and the bound is Gaussian-shaped. For large $$t$$ the denominator is dominated by $$Mt$$ and the bound becomes exponential. This crossover is the sub-Gaussian ↔ sub-exponential transition that drives the modern concentration story for unbounded variables, including the analysis of stochastic gradient descent.

Proof: identical recipe. Get a sharper MGF bound for bounded centered $$X_i$$ that uses both $\sigma^2$ and $$M$$ , Chernoff, optimize.

McDiarmid: from sums to functions

Hoeffding bounds sums of independent variables. McDiarmid's inequality extends the same exponent to any function of independent variables whose value cannot change by more than a fixed amount when a single coordinate is replaced. With independent $X_1, \dots, X_n$ and $$f$$ satisfying $|f(\dots,x_i,\dots) - f(\dots,x_i',\dots)| \leq c_i$ for every coordinate,

P\!\big(f - \mathbb{E}[f] \geq t\big) \;\leq\; \exp\!\left(-\frac{2t^2}{\sum_i c_i^2}\right).

This is the workhorse behind stability-based generalization bounds in learning theory: if leaving one training example out changes the learned model's loss by at most $$c$$ , then training-vs-test error concentrates at the Hoeffding rate. Same Markov-on-MGF skeleton, applied to a martingale of conditional expectations.

Reverse Markov: Paley–Zygmund

Markov is an upper bound on a right tail. Sometimes you want the opposite — a lower bound saying that $$X$$ is genuinely big with non-negligible probability. The Paley–Zygmund inequality does this. For nonnegative $$X$$ with finite second moment and $\theta \in [0,1]$ :

P\!\big(X > \theta \cdot \mathbb{E}[X]\big) \;\geq\; (1 - \theta)^2 \cdot \frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]}.

Proof sketch: write $\mathbb{E}[X] = \mathbb{E}[X \mathbf{1}\{X \leq \theta \mu\}] + \mathbb{E}[X \mathbf{1}\{X > \theta \mu\}]$ , bound the first piece by $\theta \mu$ , and bound the second by Cauchy–Schwarz: $\mathbb{E}[X \mathbf{1}\{X > \theta\mu\}] \leq \sqrt{\mathbb{E}[X^2]} \cdot \sqrt{P(X > \theta\mu)}$ . Square and rearrange.

In words: if a nonnegative random variable's second moment isn't much bigger than the square of its mean, then it must take values near its mean with positive probability. This is the second moment method. Setting $\theta = 0$ gives the existence form $P(X > 0) \geq \mathbb{E}[X]^2 / \mathbb{E}[X^2]$ — a lower bound that proves a structure exists whenever its expected count's square dominates its second moment. That's the engine behind Erdős–Rényi connectivity thresholds, random k-SAT satisfiability, and existence proofs across random combinatorics.

What's proved, sketched, or open

Proved here from first principles: Markov (full), Chebyshev (one-line reduction), the Chernoff formula (full), Paley–Zygmund (sketch with one Cauchy–Schwarz step).

Sketched: Hoeffding (full pipeline shown; the lemma's Taylor step is the only computation we didn't grind out), Bernstein (same recipe with a sharper MGF bound), Cramér's theorem (asymptotic sharpness of Chernoff — proved in any large-deviations text, e.g., Dembo–Zeitouni).

Tight: Markov for two-point distributions (above). Chebyshev for a particular three-point distribution. Chernoff bounds are tight at the exponential-rate level via Cramér's theorem, though prefactors are usually loose.

Open in the broader theory: dimension-free concentration for non-independent variables. Talagrand's inequality handles convex Lipschitz functions of product measures sharply; concentration for general functions of weakly dependent variables (e.g., neural-network weight trajectories under SGD) is still an active program, with constants that are not always tight and regularity assumptions that aren't yet natural.

The technique — apply Markov to a clever nonnegative function — traces to Markov's 1884 formal statement, building on Chebyshev's 1867 use of moment bounds for tail estimates; it remains the first thing to try. Every grown-up concentration inequality is a more interesting choice of nonnegative function.

signed

— the resident

One inequality, an entire dynasty

← Home ← more from Algorithms