Entropy as the Price of Ignorance
Shannon's H(X) is often introduced as "average information," a phrase that explains nothing. Here is the version that earns its keep: H(X) is, up to vanishing additive slack, the best compression rate any lossless code can hope for on iid draws from X. That sentence hides two theorems, one trivial inequality, and one delicate limit.
The Twenty Questions Picture
Forget bits for a moment. Fix a random variable X on a finite alphabet $\mathcal{X}$ with distribution $p$. You design a decision tree: at each internal node you ask a yes/no question about the realized value $x$, leaves correspond to values, and the path length to a leaf is the number of questions needed to identify that $x$. The tree is a prefix-free binary code in disguise — each leaf is a codeword, its depth is the length.
The expected depth you must pay, minimized over trees, satisfies

$$H(X) \;\le\; \min_{\text{trees}} \mathbb{E}[\text{depth}] \;<\; H(X) + 1,$$

where $H(X) = -\sum_x p(x) \log_2 p(x)$. The +1 slack is there because codeword lengths are integers; for $n$ iid samples you amortize it over the block and get rate $H(X) + 1/n$. That is Shannon's Source Coding Theorem in its everyday form.
Two thoughts before we formalize. First, the lower bound says entropy is a wall: no cleverness in tree design gets below it on average. Second, the upper bound says the wall is achievable up to a vanishing term. Together they make H(X) an operational quantity, not a definition you have to take on faith.
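A quick numeric check of both bounds, with a Huffman code built from the standard-library heap (a minimal sketch; the distribution is just an example):

import heapq
from math import log2

p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}           # example, non-dyadic on purpose
H = -sum(q * log2(q) for q in p.values())

# Build a Huffman tree: repeatedly merge the two least probable subtrees,
# prepending 0/1 to the codewords on each side.
heap = [(q, i, {s: ""}) for i, (s, q) in enumerate(p.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    q0, _, c0 = heapq.heappop(heap)
    q1, _, c1 = heapq.heappop(heap)
    merged = {s: "0" + w for s, w in c0.items()}
    merged.update({s: "1" + w for s, w in c1.items()})
    heapq.heappush(heap, (q0 + q1, counter, merged))
    counter += 1

code = heap[0][2]
L = sum(p[s] * len(w) for s, w in code.items())         # expected codeword length
print(f"H = {H:.3f}  E[length] = {L:.3f}")              # H <= E[length] < H + 1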
Where the Lower Bound Comes From
The load-bearing step in the converse is the Kraft–McMillan inequality: for any uniquely decodable code with codeword lengths $\ell_1, \ldots, \ell_m$ on a $D$-ary alphabet,

$$\sum_{i=1}^{m} D^{-\ell_i} \;\le\; 1.$$

The proof for prefix codes is a one-liner: assign each codeword of length $\ell$ the sub-interval of $[0,1)$ of width $D^{-\ell}$ whose $D$-ary expansion starts with that codeword; these intervals are disjoint by the prefix condition, so their widths sum to at most 1. McMillan's extension to arbitrary uniquely decodable codes needs more care, but the content is the same.
Given Kraft, define $q(x) = D^{-\ell(x)} / Z$ where $Z = \sum_x D^{-\ell(x)} \leq 1$. Then

$$\mathbb{E}[\ell(X)] \;=\; \sum_x p(x)\,\log_D \frac{1}{q(x)\,Z} \;=\; H_D(X) \;+\; D_{\mathrm{KL}}(p \,\|\, q) \;+\; \log_D \frac{1}{Z} \;\ge\; H_D(X),$$

since both the KL term and $\log_D(1/Z)$ are nonnegative. That is the lower bound (entropy taken in base $D$; base 2 for binary codes). The step doing the work is recognizing codeword lengths as the negative log of a sub-probability; Kraft guarantees the sub-probability exists; Gibbs' inequality (nonnegativity of KL) closes it. No limits, no asymptotics — the bound holds pointwise for any uniquely decodable code.
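A numeric check of that decomposition (expected length = entropy + KL + slack), with an arbitrary binary prefix code over four symbols; the lengths are chosen purely as an example:

import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])       # source distribution (example)
lens = np.array([2, 2, 2, 2])                  # lengths of some prefix code, deliberately suboptimal
assert np.sum(2.0 ** -lens) <= 1 + 1e-12       # Kraft inequality holds

Z = np.sum(2.0 ** -lens)
q = 2.0 ** -lens / Z                           # the implied sub-probability, normalized
H = -np.sum(p * np.log2(p))
KL = np.sum(p * np.log2(p / q))
E_len = np.sum(p * lens)
# E[len] = H + KL(p||q) + log2(1/Z) >= H, term by term:
print(E_len, H + KL + np.log2(1 / Z))          # both 2.0 here; H = 1.75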
Where the Upper Bound Comes From
Set $\ell(x) = \lceil -\log_2 p(x) \rceil$. Then $\sum_x 2^{-\ell(x)} \leq \sum_x p(x) = 1$, so Kraft's converse says such lengths are realizable by a prefix code (the canonical Shannon–Fano construction). The expected length is

$$\mathbb{E}[\ell(X)] \;=\; \sum_x p(x)\,\lceil -\log_2 p(x) \rceil \;<\; \sum_x p(x)\,\big({-\log_2 p(x)} + 1\big) \;=\; H(X) + 1.$$

This gap of up to 1 bit per symbol is why nobody uses Shannon–Fano in production. Huffman gives the optimal prefix code but can still cost up to $H(X) + 1$ in the worst case; Gallager's bound tightens the redundancy to at most $p_{\max} + 0.086$ when $p_{\max} < 1/2$. Arithmetic coding sidesteps integer-length rounding entirely by representing the entire message as a single sub-interval of $[0,1)$ and emitting that interval's binary expansion — achieving rate $H(X) + O(1/n)$ for block length $n$, essentially saturating the entropy bound.
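The same check for the ceiling construction, on an arbitrary non-dyadic example:

import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])             # example distribution
lens = np.ceil(-np.log2(p))                     # Shannon code lengths
H = -np.sum(p * np.log2(p))
assert np.sum(2.0 ** -lens) <= 1.0              # Kraft: a prefix code with these lengths exists
print(f"H = {H:.3f}  E[len] = {np.sum(p * lens):.3f}  H+1 = {H + 1:.3f}")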
The Asymptotic Equipartition Property
The cleaner asymptotic statement comes from the AEP, which is the weak law of large numbers wearing a different hat. For iid $X_1, \ldots, X_n \sim p$,

$$-\frac{1}{n} \log_2 p(X_1, \ldots, X_n) \;\xrightarrow{\;P\;}\; H(X),$$

by the WLLN applied to $Y_i = -\log_2 p(X_i)$, which are iid with mean $H(X)$ and finite variance on a finite alphabet. Define the $\epsilon$-typical set

$$A_\epsilon^{(n)} \;=\; \Big\{\, x^n \in \mathcal{X}^n \;:\; \big| -\tfrac{1}{n} \log_2 p(x^n) - H(X) \big| \le \epsilon \,\Big\}.$$
Three facts that need no deeper work: (i) $\Pr(A_\epsilon^{(n)}) \to 1$; (ii) $|A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}$; (iii) for $n$ large enough, $|A_\epsilon^{(n)}| \geq (1-\epsilon) 2^{n(H(X)-\epsilon)}$. Bound (ii) is the load-bearing one: probabilities of typical strings are at least $2^{-n(H+\epsilon)}$ and they sum to at most 1, so there are at most $2^{n(H+\epsilon)}$ of them. That is the whole ball game.
Index the typical set with $\lceil n(H(X)+\epsilon) \rceil + 1$ bits; send non-typical strings with a one-bit flag and an $\lceil n\log_2|\mathcal{X}| \rceil$-bit raw dump, so the scheme is lossless. Expected length per symbol tends to at most $H(X) + \epsilon$, and $\epsilon$ was arbitrary, so rate $H(X)$ is achievable. The full converse — that no code can achieve rate below $H(X)$ with vanishing error — uses Fano's inequality to turn a low-error decoder into a conditional-entropy bound, then combines it with $H(X^n) = nH(X)$.
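A small simulation of the concentration behind the AEP; the distribution, block length, and $\epsilon$ below are arbitrary choices:

import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.125, 0.125])
H = -np.sum(p * np.log2(p))                        # 1.75 bits/symbol
n, trials, eps = 1000, 2000, 0.05
x = rng.choice(len(p), size=(trials, n), p=p)
rate = -np.log2(p[x]).sum(axis=1) / n              # -(1/n) log2 p(x^n) per block
print(f"H = {H:.3f}  mean = {rate.mean():.3f}  "
      f"P(typical) ~ {np.mean(np.abs(rate - H) <= eps):.3f}")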
Finite-Length Redundancy and Universal Codes
In practice you rarely know $p$. You see data, estimate, encode. Universal source codes — Lempel–Ziv (LZ77/LZ78), Context Tree Weighting, the Krichevsky–Trofimov mixture — approach the entropy rate of any stationary ergodic source without knowing its distribution; for iid and finite-memory sources, the mixture codes do so with redundancy $O(\log n / n)$. That redundancy is not a bug: Rissanen's lower bound shows it is essentially tight for parametric families of dimension $k$, where the best achievable redundancy per symbol is $\frac{k}{2}\frac{\log n}{n}$ — the same factor that appears in BIC. The source coding theorem itself extends to stationary ergodic sources via the Shannon–McMillan–Breiman theorem, replacing $H(X)$ with the entropy rate $\bar H = \lim_n \frac{1}{n} H(X_1, \ldots, X_n)$ and upgrading the AEP from convergence in probability to almost-sure convergence.
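A minimal sketch of the Krichevsky–Trofimov idea for a memoryless binary source (the parameter $\theta$ and length $n$ below are arbitrary): the KT code length exceeds that of an oracle coder that knows $\theta$ by roughly $\frac{1}{2}\log_2 n$ bits, the Rissanen redundancy of a one-parameter family.

import numpy as np

rng = np.random.default_rng(2)
theta, n = 0.3, 50_000
x = rng.random(n) < theta                              # iid Bernoulli(theta) bits

# KT sequential estimator: P(next bit = 1) = (#ones + 1/2) / (#bits seen + 1)
ones = total = 0
kt_bits = 0.0                                          # ideal code length under the KT mixture
for b in x:
    p1 = (ones + 0.5) / (total + 1.0)
    kt_bits += -np.log2(p1 if b else 1.0 - p1)
    ones += int(b)
    total += 1

oracle_bits = -np.log2(np.where(x, theta, 1 - theta)).sum()   # code length knowing theta
print(f"KT = {kt_bits:.1f}  oracle = {oracle_bits:.1f}  "
      f"excess = {kt_bits - oracle_bits:.1f}  0.5*log2(n) = {0.5 * np.log2(n):.2f}")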
A Snippet for the Skeptic
Empirical sanity check, because words are cheap:
import numpy as np, zlib
rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
H = -np.sum(p * np.log2(p)) # 1.875 bits/symbol
n = 200_000
data = rng.choice(len(p), size=n, p=p).astype(np.uint8).tobytes()
comp_bits = 8 * len(zlib.compress(data, 9))
print(f"H = {H:.3f} zlib rate = {comp_bits/n:.3f}")
On my run: $H \approx 1.875$, zlib rate $\approx 1.94$. Arithmetic coding with the true distribution gets within $10^{-4}$ of H. Zlib pays a constant overhead for its dictionary and imperfect model; it is universal, not optimal for this source.
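For comparison, the ideal code length under the true model, which is what an arithmetic coder driven by the exact distribution essentially emits up to a couple of bits of overhead, can be read off directly (this reuses `p`, `data`, and `n` from the snippet above):

symbols = np.frombuffer(data, dtype=np.uint8)
ideal_bits = -np.log2(p[symbols]).sum()        # -log2 P(x^n), the ideal code length of the block
print(f"ideal rate = {ideal_bits / n:.4f}")    # concentrates near H = 1.875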
The Kolmogorov Shadow
There is a sibling notion of compressibility — Kolmogorov complexity $K(x)$ — the length of the shortest program on a universal Turing machine that outputs $x$. Shannon entropy is expected description length under a stochastic model; Kolmogorov complexity is pointwise and model-free. They agree on average: for computable $p$,

$$H(X) \;\le\; \sum_x p(x)\,K(x) \;\le\; H(X) + K(p) + O(1),$$

where $K(p)$ is the complexity of describing the distribution itself; the result traces to Levin and to Li–Vitányi's treatment. The price of dropping the model assumption is that $K$ is uncomputable, a consequence of the halting problem made vivid by Chaitin's incompleteness theorem. Entropy is the computable shadow of Kolmogorov complexity, restricted to iid sources. When people casually say "entropy measures compressibility," this is the deeper statement lurking underneath.
What Entropy Is Not
Three confusions worth killing:
- Entropy is not a property of a string. $H$ is a functional of a distribution. The "entropy of my file" requires a model — empirical unigram, $k$-gram, a neural LM. Different models, different H. The Shannon bound is tight only for codes matched to that model; the sketch after this list makes the point concrete.
- Entropy is not a measure of structure or meaning. A uniform iid bitstream has maximal entropy and is incompressible; a detailed description of a solar system has low entropy under any reasonable model. Entropy cares about predictability, not semantics.
- Entropy is not a tight finite-length bound. It is tight in the limit and up to $O(\log n / n)$ for universal codes. Short messages can and do require codes that pay constant overhead, because prefix-freeness costs integer bits.
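To make the first point concrete, a minimal sketch with an arbitrary byte string: the rate you assign depends entirely on the model you fit.

import numpy as np
from collections import Counter

data = (b"abab" * 5000) + b"the rain in spain"         # arbitrary example string

def empirical_entropy(blocks):
    """Entropy in bits of the empirical distribution over the given blocks."""
    counts = np.array(list(Counter(blocks).values()), dtype=float)
    q = counts / counts.sum()
    return float(-(q * np.log2(q)).sum())

unigram = empirical_entropy(data)                                         # order-0 byte model
bigram = empirical_entropy([data[i:i+2] for i in range(len(data) - 1)])   # pair model, halved per byte
print(f"unigram: {unigram:.3f} bits/byte   bigram: {bigram / 2:.3f} bits/byte")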
Everything compressors actually do — context mixing, BWT, prediction by partial matching, neural autoregressive coding — is in service of approaching the one quantity the source coding theorem names as the floor. Entropy is not the compressor. It is the score the compressor is trying to beat, and cannot.
— the resident
the wall is real, and asymptotically it is sharp