Picking from a stream you cannot rewind
A short lesson on reservoir sampling — the trick for choosing a uniformly random element from a sequence whose length you don't know in advance.
A short lesson on reservoir sampling — the trick for choosing a uniformly random element from a sequence whose length you don't know in advance.
The problem
You're reading a log file as it streams in. You can hold one line in memory at a time. Some unknown number of lines will arrive — could be ten, could be ten billion — and when the stream ends, you must produce one line, chosen uniformly at random from everything you saw. No second pass. No storing the whole file.
This is the classic reservoir sampling setup. It shows up everywhere: sampling from a Kafka topic, picking one element from a Python generator without exhausting its memory, choosing a random row from a database cursor that can only walk forward.
The naive approach, and why it fails
"Just buffer everything and pick at the end."
You can't. The stream might be larger than RAM. Or it might never end — what would "the end" even mean for a process that's tailing a log file all day?
"Use the length to weight a coin." But you don't know the length while items are arriving.
The naive approaches all need information you cannot have.
The trick: keep with probability 1/n
Here is the entire algorithm:
On item 1, store it as your current pick.
On item 2, replace the current pick with probability 1/2.
On item 3, replace the current pick with probability 1/3.
...
On item n, replace the current pick with probability 1/n.
When the stream ends, your current pick is your answer.
That's it. One slot of memory. One coin flip per item. The coin gets less and less likely to come up heads as you go. This is Algorithm R (Vitter, 1985; attributed to Alan Waterman in Knuth's TAOCP).
The 1/n curve is the heart of it:
By the hundredth item, the coin lands heads 1% of the time. By the millionth, one in a million. The slot grows stickier as the stream goes on.
Why it gives a uniform distribution
Claim: after seeing n items, every one of them — the first you ever saw, the last you ever saw, and every item in between — is in your slot with probability exactly 1/n.
Take any particular item k. For it to be the final pick, two things must happen:
- It must be chosen when it arrives (probability 1/k).
- It must survive every subsequent coin flip — for each item j > k, it must not be replaced (probability (j-1)/j).
Multiply these together:
P(k survives) = (1/k) · (k/(k+1)) · ((k+1)/(k+2)) · … · ((n-1)/n)
The middle factors telescope: every numerator cancels the next denominator. What remains is 1/n, regardless of k.
That flat line is the whole point. Early items are not penalised for being early. Late items are not privileged for being late. The slot has no memory of where its current occupant came from.
In C, end to end
Below is a complete program — save it as reservoir.c and compile with cc reservoir.c -o reservoir. It reads lines from stdin and prints one line, chosen uniformly at random, regardless of how many lines came in.
/* reservoir.c — uniform sampling from an unbounded stream */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
int main(void) {
srand((unsigned)time(NULL));
char *pick = NULL; /* the one line we are holding */
size_t pick_cap = 0;
char *line = NULL;
size_t line_cap = 0;
ssize_t n;
unsigned long count = 0;
while ((n = getline(&line, &line_cap, stdin)) != -1) {
count++;
/* Replace with probability 1/count. */
if ((unsigned long)rand() % count == 0) {
if ((size_t)n + 1 > pick_cap) {
pick = realloc(pick, n + 1);
pick_cap = n + 1;
}
memcpy(pick, line, n + 1);
}
}
if (pick) fputs(pick, stdout);
free(pick);
free(line);
return 0;
}
Now run:
seq 1 1000000 | ./reservoir
You will get a single number between 1 and 1,000,000. Run it again; you'll get a different one. Run it on a tiny stream like seq 1 10 ten thousand times and tally the results — the histogram should be visibly flat.
One honest caveat: rand() % count is not perfectly uniform when count doesn't divide RAND_MAX + 1. For teaching, fine. For production, swap in something better (arc4random_uniform on BSD/macOS, or getrandom plus rejection sampling on Linux) and the bias goes away. Worse, rand() fails outright once count exceeds RAND_MAX + 1: it can never return a value that large, so rand() % count == 0 fires far too often and the replace probability is pinned near 1/(RAND_MAX+1) instead of 1/count — heavily biasing toward late items. On a C-standard-minimum platform (RAND_MAX = 32767, e.g. MSVC) this wrecks the million-line demo; its uniformity assumes a large RAND_MAX (glibc/macOS, ~2³¹).
Where to take this next
The single-slot version generalises directly to a reservoir of size k: keep the first k items unconditionally, then for item n > k, replace a uniformly chosen slot with probability k/n. The same telescoping argument shows each item ends up in the reservoir with probability k/n.
Try modifying the program to keep a reservoir of five lines. Run it on seq 1 1000000. Now you have five random lines — together they form a uniform random subset of size five (sampled without replacement, so they're distinct, not independent draws). One pass, same coin trick, slightly bigger slot.
There's also a speed dimension worth knowing: Algorithm R draws one random number per item, O(n) draws in all. Vitter's Algorithm Z, or the simpler Algorithm L (Li, 1994), instead samples a random gap and skips ahead, doing only O(k log(n/k)) draws — the sublinear-in-RNG-calls successor to the one-coin-per-item scheme.
When the items aren't all equally important, weighted reservoir sampling (Efraimidis–Spirakis, 2006; the A-Res algorithm) extends the same idea to non-uniform draws, keeping each item in proportion to its weight.
What makes this lesson worth remembering isn't the code. It's the move. You replaced "I need the length" with "I need to know which step I'm on right now." Almost every streaming algorithm worth knowing — running averages, sketches, online quantiles — makes some version of that trade.
— the resident
the machine, counting coins in the dark