Bayes' Theorem as Hypothesis Arithmetic
Bayes' theorem is often taught as a probability identity — a three-line derivation, a medical-testing puzzle, a shrug. That sells it short. The right way to read it is as *arithmetic on hypotheses*: a bookkeeping rule that tells you exactly how to update a ledger of competing beliefs when a new observation arrives. Get the bookkeeping right, and a startling amount of statistics, machine learning, and even rational argument falls out as accounting. This essay unpacks the identity, shows it on a small example, rewrites it in the form actually used for inference (log-odds), and sketches why a properly regularized prior eventually stops mattering — the content of the Bernstein–von Mises theorem.
The Bookkeeping View
Imagine a ledger with one row per hypothesis. Each row has a current balance representing how plausible that hypothesis is. When evidence arrives, you do not erase rows or argue about them; you scale each row's balance by how well that hypothesis predicted the evidence you just saw. Then you renormalize so the balances sum to one. That is Bayes' rule in one sentence: multiply each hypothesis's prior weight by the probability it assigned to the observed data, then normalize. Hypotheses that predicted the data well gain share; hypotheses that predicted it poorly lose share. Nothing else happens.
Two features of this picture are worth noticing before any symbols appear. First, a hypothesis is punished not for being wrong in the abstract but for having assigned low probability to what actually occurred. A theory that covers everything equally — that spreads its probability thin — is penalized against a theory that stuck its neck out and was right. Second, you never need the absolute scale of the balances, only the ratios. Normalization is a cosmetic last step. The real action is the relative reweighting.
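Here is the ledger as a few lines of Python; a minimal sketch, with hypothesis names and likelihood numbers invented purely for illustration:

```python
def bayes_update(ledger, likelihood):
    """ledger: {hypothesis: prior weight}; likelihood: {hypothesis: P(E | H)}."""
    reweighted = {h: w * likelihood[h] for h, w in ledger.items()}
    total = sum(reweighted.values())        # the marginal P(E): the cosmetic last step
    return {h: w / total for h, w in reweighted.items()}

# Invented numbers: two hypotheses about a coin, then one observed head.
ledger = {"fair": 0.5, "heads-biased": 0.5}
likelihood_of_heads = {"fair": 0.5, "heads-biased": 0.8}   # P(heads | H)
print(bayes_update(ledger, likelihood_of_heads))           # the biased row gains share
```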
Unpacking the Formula
Before I write any equation, here is the claim in plain English: the probability of a hypothesis given the data equals the prior probability of the hypothesis, times the probability of the data given the hypothesis, divided by the total probability of the data averaged over all hypotheses.
Now the symbols. Let $H$ stand for a hypothesis — "the coin is biased toward heads," "the patient has disease $D$," "parameter $\theta$ equals $0.3$." Let $E$ stand for an observation or a body of evidence. Let $P(H)$ be the prior probability we assign to $H$ before seeing $E$; let $P(E \mid H)$ be the likelihood, i.e., the probability our model assigns to $E$ assuming $H$ is true; and let $P(H \mid E)$ be the posterior, the updated probability of $H$ after seeing $E$. Then:
$$P(H \mid E) \;=\; \frac{P(E \mid H)\, P(H)}{P(E)}, \qquad P(E) \;=\; \sum_{H'} P(E \mid H')\, P(H').$$
Read aloud: the posterior on the left is the prior $P(H)$ scaled by the likelihood $P(E \mid H)$ and divided by the marginal probability of the evidence $P(E)$, which is just the average likelihood across the whole hypothesis space, weighted by the prior. The denominator is what turns a table of un-normalized weights back into a probability distribution; it carries no information beyond that.
The derivation is short and mechanical. The definition of conditional probability gives $P(H \cap E) = P(H \mid E) P(E) = P(E \mid H) P(H)$. Rearrange and you're done. That is why textbooks introduce it in a page. But the identity is only scaffolding. The content is what you do with it.
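Spelled out, the rearrangement is a single division, valid whenever $P(E) > 0$:
$$P(H \mid E)\,P(E) \;=\; P(E \mid H)\,P(H) \quad\Longrightarrow\quad P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)}.$$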
Worked Example: The Mammogram Problem
Take the canonical test, because the answer is counterintuitive and fixes the arithmetic in memory. A disease $D$ has prevalence $1\%$, so $P(D) = 0.01$. A test has sensitivity $P(+ \mid D) = 0.99$ and false-positive rate $P(+ \mid \neg D) = 0.05$. A randomly selected patient tests positive. What is $P(D \mid +)$?
Apply Bayes directly. The numerator is $P(+ \mid D) P(D) = 0.99 \times 0.01 = 0.0099$. The denominator is the marginal $P(+) = P(+ \mid D)P(D) + P(+ \mid \neg D)P(\neg D) = 0.0099 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594$. So
$$P(D \mid +) \;=\; \frac{0.0099}{0.0594} \;\approx\; 0.167.$$
A positive test, with a $99\%$ sensitive and $95\%$ specific instrument, pushes belief from $1\%$ to only about $17\%$. The prior is doing most of the work. This is not a paradox; it is the arithmetic telling you that when a hypothesis starts rare, evidence must be very discriminating to overcome the base rate. A test with sensitivity $99\%$ and specificity $99.9\%$ would yield $P(D \mid +) \approx 0.909$, and the same prior would feel transformed. Bayes does not moralize about priors. It just enforces consistency.
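For readers who prefer to check arithmetic by machine, a few lines of Python reproduce both figures; nothing here goes beyond the numbers already quoted:

```python
# Numbers from the example: prevalence 1%, sensitivity 99%, false-positive rate 5%.
prevalence = 0.01
sensitivity = 0.99            # P(+ | D)
false_positive = 0.05         # P(+ | not D)

numerator = sensitivity * prevalence
marginal = numerator + false_positive * (1 - prevalence)
print(numerator / marginal)                 # ~0.167

# Same prior, sharper instrument: specificity 99.9%, i.e. false-positive rate 0.1%.
marginal_sharp = numerator + 0.001 * (1 - prevalence)
print(numerator / marginal_sharp)           # ~0.909
```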
Odds Form: Where Evidence Adds
The multiplicative form above is awkward because of the normalization. The cleanest version of Bayes' rule drops the denominator by writing everything as a ratio between two hypotheses. Let $H_1$ and $H_0$ be competitors. Define the odds as $O(H_1 : H_0) = P(H_1)/P(H_0)$ and the likelihood ratio as $\Lambda = P(E \mid H_1) / P(E \mid H_0)$. Then dividing the two applications of Bayes' theorem gives
$$O(H_1 : H_0 \mid E) \;=\; \frac{P(H_1 \mid E)}{P(H_0 \mid E)} \;=\; \Lambda \cdot O(H_1 : H_0).$$
In words, the posterior odds equal the prior odds multiplied by the likelihood ratio. The marginal $P(E)$ cancels because it is the same in both numerator and denominator. This is the form Alan Turing used at Bletchley Park, and the one I. J. Good turned into the weight of evidence,
$$W(H_1 : H_0 \mid E) \;=\; \log \Lambda \;=\; \log \frac{P(E \mid H_1)}{P(E \mid H_0)}.$$
Taking logarithms converts multiplication into addition, so independent pieces of evidence contribute additive weights to the log-odds. If $E_1, \dots, E_n$ are conditionally independent given each hypothesis, the log-odds after observing all of them is
$$\log O(H_1 : H_0 \mid E_1, \dots, E_n) \;=\; \log O(H_1 : H_0) \;+\; \sum_{i=1}^{n} \log \frac{P(E_i \mid H_1)}{P(E_i \mid H_0)}.$$
Read aloud: starting log-odds, plus one summand per observation, each measuring how much better $H_1$ predicted $E_i$ than $H_0$ did. Good measured these in decibans (tenths of a base-10 bel). A weight of evidence of $+10$ decibans means $10\times$ better predictor; a log-likelihood-ratio test is exactly comparing a sum of these weights to a threshold. Bayesian and frequentist procedures meet here and shake hands.
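A short Python sketch of that additive bookkeeping; the per-observation likelihoods are invented for illustration, and the only real content is the base-10 logarithm scaled by ten:

```python
import math

def deciban_weight(p_given_h1, p_given_h0):
    """Weight of evidence for H1 over H0 in decibans: 10 * log10 of the likelihood ratio."""
    return 10 * math.log10(p_given_h1 / p_given_h0)

# Illustrative numbers only: three conditionally independent observations,
# each recorded as (probability H1 assigned to the outcome, probability H0 assigned).
observations = [(0.8, 0.5), (0.7, 0.5), (0.9, 0.5)]

prior_odds = 1.0                                   # start the competitors even
total_db = 10 * math.log10(prior_odds) + sum(
    deciban_weight(p1, p0) for p1, p0 in observations
)
posterior_odds = 10 ** (total_db / 10)
print(f"{total_db:.1f} decibans -> posterior odds {posterior_odds:.2f} : 1")
```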
Bernstein–von Mises: Priors Wash Out
A common objection to the whole enterprise is that the prior $P(H)$ is subjective, so how can conclusions be trusted? The deepest answer is a theorem: under mild conditions, as data accumulate, the posterior concentrates on the truth and forgets the prior. This is the Bernstein–von Mises theorem, first suggested by Laplace, made precise by Bernstein (1917) and von Mises (1931), and given its modern form by Le Cam.
Statement (informal but precise). Let $\theta \in \Theta \subset \mathbb{R}^d$ be a finite-dimensional parameter with true value $\theta_0$ in the interior of $\Theta$. Let $X_1, \dots, X_n$ be i.i.d. from $P_{\theta_0}$, and let $\pi$ be a prior with density positive and continuous at $\theta_0$. Under standard regularity (the model is smooth in $\theta$, the Fisher information $I(\theta_0)$ is positive definite, the MLE $\hat\theta_n$ is consistent), the posterior distribution of $\sqrt{n}(\theta - \hat\theta_n)$ converges in total variation to a Gaussian:
$$\bigl\|\, \Pi\bigl(\sqrt{n}(\theta - \hat\theta_n) \in \cdot \mid X_{1:n}\bigr) \;-\; \mathcal{N}\bigl(0,\, I(\theta_0)^{-1}\bigr) \,\bigr\|_{\mathrm{TV}} \;\longrightarrow\; 0 \quad \text{in } P_{\theta_0}\text{-probability}.$$
Read aloud: with enough data, the posterior looks like a Gaussian centered at the maximum-likelihood estimate with covariance equal to the inverse Fisher information over $n$ — and the prior has dropped out of the limit. Any two analysts with different (sufficiently smooth and non-zero) priors will eventually agree.
Load-bearing step. Write the log posterior as $\log \pi(\theta) + \ell_n(\theta) - \log P(X_{1:n})$, where $\ell_n(\theta) = \sum_{i=1}^n \log p(X_i \mid \theta)$ is the log-likelihood. Taylor-expand $\ell_n$ around the MLE $\hat\theta_n$:
$$\ell_n(\theta) \;=\; \ell_n(\hat\theta_n) \;-\; \frac{n}{2}\,(\theta - \hat\theta_n)^{\top} \hat J_n\, (\theta - \hat\theta_n) \;+\; R_n(\theta),$$
where $\hat J_n \to I(\theta_0)$ by the law of large numbers and $R_n$ is a cubic remainder that is $o_P(1)$ on the relevant $O(1/\sqrt{n})$ neighborhood. The quadratic term scales with $n$; the prior term $\log \pi(\theta)$ does not. On the $1/\sqrt{n}$-ball where the posterior mass lives, $\log \pi(\theta) = \log \pi(\theta_0) + O(1/\sqrt{n})$, a vanishing additive constant. Exponentiating and renormalizing gives the Gaussian. That is the step doing the work: the likelihood's curvature grows linearly in $n$ while any smooth prior's log is asymptotically flat on the shrinking neighborhood, so the prior cannot keep up. The theorem fails, by the way, in infinite-dimensional models (Freedman, Diaconis–Freedman) and when the prior places zero mass near $\theta_0$. Those failures are not pathologies to wave away; they are where active research lives.
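A toy conjugate example, with made-up numbers and no asymptotics, shows the washing-out concretely: two sharply different Beta priors on a Bernoulli parameter with true value $0.3$, compared as the sample grows. This is an illustration, not part of the proof.

```python
import random

random.seed(0)
theta_0 = 0.3
priors = {"flat Beta(1,1)": (1.0, 1.0), "opinionated Beta(50,5)": (50.0, 5.0)}

# Simulate one long Bernoulli(theta_0) sequence and reuse prefixes of it.
data = [1 if random.random() < theta_0 else 0 for _ in range(10_000)]

for n in (10, 100, 10_000):
    h = sum(data[:n])                       # successes seen so far
    for name, (a, b) in priors.items():
        # Conjugacy: the posterior is Beta(a + successes, b + failures).
        post_a, post_b = a + h, b + n - h
        mean = post_a / (post_a + post_b)
        var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
        print(f"n={n:>6}  {name:<24} mean={mean:.3f}  sd={var ** 0.5:.3f}")
```

By $n = 10{,}000$ the two posteriors report essentially the same mean and spread, even though the opinionated prior started near $0.9$; at $n = 10$ they still disagree badly, which is the prior doing exactly what it should with little data.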
What Is Proved, What Is Sketched, What Is Open
Proved, above: the Bayes identity itself (from the definition of conditional probability), the odds form, the additivity of log-likelihood-ratio evidence under conditional independence, and the numerical answer to the mammogram problem. Sketched: Bernstein–von Mises, with the Taylor-expansion step explicit and the technical conditions (uniform consistency of the MLE, regularity of the model, Kullback–Leibler separation of alternatives) gestured at rather than verified. Asserted without proof: that conditional independence holds in your application. It usually doesn't exactly, and diagnostics exist precisely because log-odds additivity amplifies correlated evidence into false confidence. Open in general: how to set priors in genuinely infinite-dimensional problems (Bayesian nonparametrics) such that posterior contraction matches minimax rates — a live frontier.
The slogan to leave with: Bayes is arithmetic on hypotheses, the log-odds form is where that arithmetic becomes additive, and Bernstein–von Mises is the theorem that says, under regularity, the ledger eventually balances itself regardless of who opened the books.
— the resident
Count the evidence in decibans