Lesson 3 — A real hash table: hashing, collisions, and load factor
Lesson 3 — A real hash table: hashing, collisions, and load factor
C From Scratch · 3 of 12
In lesson 2 the store learned to grow. But every GET still walked the array from index 0 looking for a matching key. At 7 entries that's invisible. At 7 million it's a feature you can never ship. Today we replace the linear scan with a real hash table: a hash function, a probing scheme for collisions, tombstones so DEL doesn't break the next GET, and a resize that fires when the table gets crowded. By the end, a million-key lookup will touch fewer than two slots on average.
The thing we're fixing
Here's lesson 2's lookup, the heart of every GET and SET:
static int find_key(const char *key) {
for (size_t i = 0; i < S.cap; i++) {
if (S.data[i].in_use && strcmp(S.data[i].key, key) == 0) {
return (int)i;
}
}
return -1;
}
It's O(n). Doubling the capacity makes the table hold twice the data and makes every lookup twice as slow. A hash table flips that: doubling the capacity makes lookups faster, not slower, because each key is reachable in (essentially) a constant number of slot reads regardless of how much else is in there.
The trick has three parts.
Part 1 — Hashing the key
A hash function turns a string into a 64-bit number, deterministically. Same input, same output. Different inputs, almost-certainly different outputs (a collision is when two different inputs map to the same number — we'll deal with those in Part 2).
For a learning project, the sweet spot is FNV-1a (Fowler/Noll/Vo, the "1a" variant): roughly ten lines of C, no dependencies, no licensing, and good enough to use in production for short string keys, so long as keys aren't attacker-controlled — for adversarial inputs use SipHash, which is the canonical keyed-hash replacement. Two magic numbers do all the work:
static uint64_t fnv1a(const char *s) {
uint64_t h = 14695981039346656037ULL; /* offset basis */
while (*s) {
h ^= (unsigned char)*s++;
h *= 1099511628211ULL; /* FNV prime */
}
return h;
}
The numbers aren't random — they were chosen by Fowler, Noll, and Vo to give good spread. The order matters too: XOR the byte in first, then multiply. The "1a" in FNV-1a is exactly that ordering, and it gives slightly better avalanche on short strings than the older FNV-1.
uint64_t is from <stdint.h> — an unsigned integer that is exactly 64 bits wide on every platform. The ULL suffix tells the compiler the literal is unsigned long long, big enough to hold the constant without overflow during parsing.
To get a slot index from a hash, we want h % cap. But there's a faster way if cap is a power of two: h & (cap - 1). We'll enforce a power-of-two capacity throughout.
Part 2 — Collisions: open addressing with linear probing
Two different keys can hash to the same slot. There are two classic responses:
- Chained hashing: each slot is a linked list, collisions get appended. Simple, but every lookup pays a pointer dereference and the keys live scattered in malloc'd nodes.
- Open addressing: collisions stay in the table itself. If your home slot is taken, you look at the next slot, and the next, until you find an empty one.
We're doing open addressing with linear probing — just (i + 1) & mask to advance. It's cache-friendly: adjacent slots sit sequentially in memory, so the CPU's hardware prefetcher streams them in as we walk. With ~112-byte entries each probe straddles a couple of 64-byte cache lines, but the linear stride is exactly the access pattern the prefetcher is built for. Linear probing's cost is primary clustering — runs of filled slots merge; quadratic probing jumps by 1, 4, 9, … and double hashing uses a second hash to pick the stride — both scatter the probe sequence, breaking clustering at the cost of cache locality. For this load factor, linear wins.
One name to flag before the snippet: last_probes is a file-scope static size_t — declared alongside the store state we'll see at the end — that each probe loop writes to so the REPL's stats command can read the count.
The probe loop for a lookup:
static size_t find_slot_for_get(const char *key, uint64_t h) {
size_t mask = S.cap - 1;
size_t i = (size_t)h & mask;
size_t probes = 0;
for (;;) {
probes++;
struct entry *e = &S.data[i];
if (e->state == EMPTY) { /* key cannot exist past here */
last_probes = probes;
return (size_t)-1;
}
if (e->state == USED && e->hash == h && strcmp(e->key, key) == 0) {
last_probes = probes;
return i;
}
i = (i + 1) & mask;
}
}
Two small details worth pausing on. First, we cache the hash in the entry: when a probe lands on a different USED slot, comparing 8 bytes is much cheaper than calling strcmp. Second, the slot has a state field, not the old in_use flag. That's because we need three states, not two.
Part 3 — Tombstones
Here's the bug you get if you skip this section. Say "alpha" hashes to slot 3, but slot 3 is taken, so it lands at slot 4. Now you DEL alpha. The naive move is to mark slot 4 empty. Now GET on whoever lives at slot 5 (also slot-3-then-probed) starts at slot 3, looks at slot 4, sees EMPTY, and concludes the key doesn't exist. We just lost a perfectly valid entry.
The fix is a third slot state, and while we're here, the per-slot record itself (with the key/value sizes carried over from lesson 2):
#define KEY_LEN 32
#define VAL_LEN 64
enum slot_state { EMPTY = 0, USED = 1, TOMB = 2 };
struct entry {
uint8_t state;
uint64_t hash;
char key[KEY_LEN];
char val[VAL_LEN];
};
We store state as uint8_t to keep the struct tight; the enum slot_state names are just compile-time constants we assign into that byte.
A tombstone marks a slot that was used. Lookups skip past tombstones; only EMPTY terminates the probe chain. Inserts may reuse a tombstone — but only after confirming the key isn't already further down the chain:
static size_t find_slot_for_set(const char *key, uint64_t h, int *found_out) {
size_t mask = S.cap - 1;
size_t i = (size_t)h & mask;
size_t first_tomb = (size_t)-1; /* "not seen one yet" */
size_t probes = 0;
for (;;) {
probes++;
struct entry *e = &S.data[i];
if (e->state == EMPTY) {
last_probes = probes;
*found_out = 0;
return (first_tomb != (size_t)-1) ? first_tomb : i;
}
if (e->state == USED && e->hash == h && strcmp(e->key, key) == 0) {
last_probes = probes;
*found_out = 1;
return i;
}
if (e->state == TOMB && first_tomb == (size_t)-1) {
first_tomb = i; /* remember it, keep going */
}
i = (i + 1) & mask;
}
}
(size_t)-1 is the all-ones sentinel — size_t is unsigned, so -1 wraps to the maximum value. We use it as a "no tombstone seen" marker because a real index could never be that large.
Part 4 — Resize: the load factor
A hash table with one free slot has approximately one free slot, but the expected number of probes for an insert into a table that's 95% full is about 200. (Knuth's formula (TAOCP Vol. 3, §6.4) for unsuccessful search under linear probing is ≈½(1 + 1/(1−α)²); inserts behave like unsuccessful searches. Successful lookups are much cheaper — ≈½(1 + 1/(1−α)) — about 10 at the same load, versus about 200 for the insert.) Crowded tables are slow tables.
The load factor is used / cap — fraction of slots holding live data. We resize when load gets too high. The standard threshold for linear probing is somewhere in 0.5 – 0.75; we'll use 0.7. And we'll count tombstones too, because they also slow probes down:
#define LOAD_NUM 7 /* rehash when (used+tombs)/cap >= 7/10 */
#define LOAD_DEN 10
static void maybe_grow(void) {
if ((S.used + S.tombs + 1) * LOAD_DEN >= S.cap * LOAD_NUM) {
rehash(S.cap * 2);
}
}
The integer form avoids floating-point: a/b >= c/d is the same as a*d >= b*c whenever b and d are positive. The + 1 accounts for the insert that's about to happen.
rehash allocates a fresh table at the new capacity, then reinserts every USED entry. Tombstones evaporate — which is the second reason to grow when they accumulate, even without new data: cleanup. Because we cached hash on each entry, we don't have to re-run FNV-1a; we just h & new_mask-then-probe into the new array:
static void rehash(size_t new_cap) {
struct entry *old_data = S.data;
size_t old_cap = S.cap;
struct entry *new_data = calloc(new_cap, sizeof *new_data);
if (!new_data) { fprintf(stderr, "out of memory\n"); exit(1); }
S.data = new_data;
S.cap = new_cap;
S.tombs = 0;
/* S.used unchanged: same live entries, new homes. */
size_t mask = new_cap - 1;
for (size_t k = 0; k < old_cap; k++) {
if (old_data[k].state != USED) continue;
size_t i = (size_t)old_data[k].hash & mask;
while (S.data[i].state == USED) i = (i + 1) & mask;
S.data[i] = old_data[k]; /* whole-struct copy */
}
printf("[rehash %zu -> %zu]\n", old_cap, new_cap);
free(old_data);
}
S.data[i] = old_data[k]; is a struct assignment — C copies the bytes for you. No memcpy needed; the compiler figures it out. That's a small luxury we'll lean on more in later lessons.
Putting it together
The store struct grows to track tombstones:
struct store {
struct entry *data;
size_t cap; /* power of two */
size_t used; /* USED slots */
size_t tombs; /* TOMB slots */
};
static size_t last_probes; /* probe count from the most recent lookup, for stats */
last_probes is the instrumentation counter the probe loops write to on every call; the stats command in the REPL reads it back out.
And kv_set becomes a quiet little thing (copy_str is the bounded copy helper carried over from lesson 2):
static void kv_set(const char *key, const char *val) {
maybe_grow();
uint64_t h = fnv1a(key);
int found = 0;
size_t i = find_slot_for_set(key, h, &found);
if (!found) {
if (S.data[i].state == TOMB) S.tombs--;
S.data[i].state = USED;
S.data[i].hash = h;
copy_str(S.data[i].key, key, KEY_LEN);
S.used++;
}
copy_str(S.data[i].val, val, VAL_LEN);
printf("OK\n");
}
One thing to flag about maybe_grow(): it's called before we hash and find a slot, on purpose. If it grows, S.data moves to a new address and all the indexes from the old table are stale. By probing only in the freshly-resized table, we sidestep that whole class of bug.
One Linux build note: clock_gettime (we'll use it in the benchmark) needs a POSIX feature-test macro under -std=c11. Add this before any #include:
#define _POSIX_C_SOURCE 200112L
Build:
gcc -std=c11 -Wall -Wextra -O2 -o kv kv.c
A live trace
A small REPL session showing the three pieces — probing, tombstone-then-reinsert, and a rehash:
$ ./kv
kv 0.3 — type 'help'
> set alpha one
OK
> set beta two
OK
> set gamma three
OK
> get alpha
one
> get missing
(nil)
> stats
used=3 tombs=0 cap=8 load=0.38 last_probes=2
> del beta
OK
> get beta
(nil)
> set beta TWO
OK
> get beta
TWO
> list
gamma = three
beta = TWO
alpha = one
(3 entries, cap 8)
> stats
used=3 tombs=0 cap=8 load=0.38 last_probes=1
> quit
bye
A few honest things to notice. list no longer prints insertion order — entries are in hash order now, because we walk the table from index 0 and the keys landed wherever FNV-1a sent them. The probe count for the failing get missing was 2 (it walked one slot then hit EMPTY); the probe for re-fetching beta after a delete+reinsert was 1. That round-trip is the proof the tombstone path actually works: without tombstones, del beta would have left an EMPTY at beta's slot, and any key whose probe chain ran through that slot would have become unfindable.
And the resize, in another short session — starting capacity is 8, threshold is 70%:
> set a 1
OK
> set b 2
OK
> set c 3
OK
> set d 4
OK
> set e 5
OK
> set f 6
[rehash 8 -> 16]
OK
> stats
used=6 tombs=0 cap=16 load=0.38 last_probes=1
The sixth insert tripped the threshold — (5 + 0 + 1) * 10 = 60 >= 8 * 7 = 56 — and rehashed before placing f. After the rehash, load is back to 0.38.
Does it stay fast?
The point of all this is that lookup cost shouldn't grow with the table. The store has a bench N command (see source) that inserts N synthetic keys, then times N random lookups and counts the average number of slots each probe loop visits. Four runs:
bench n=1000 used=1000 cap=2048 load=0.49
1000 lookups in 0.120 ms (120 ns/lookup)
avg probes/lookup = 1.36 hits=1000
bench n=10000 used=10000 cap=16384 load=0.61
10000 lookups in 1.343 ms (134 ns/lookup)
avg probes/lookup = 2.15 hits=10000
bench n=100000 used=100000 cap=262144 load=0.38
100000 lookups in 16.191 ms (162 ns/lookup)
avg probes/lookup = 1.35 hits=100000
bench n=1000000 used=1000000 cap=2097152 load=0.48
1000000 lookups in 313.458 ms (313 ns/lookup)
avg probes/lookup = 1.43 hits=1000000
A million keys, 1.43 average probes per successful lookup. Compare to lesson 2's find_key, which at a million keys would average half a million comparisons per lookup. That's the bargain we just made.
A couple of caveats so the numbers don't mislead. The 10,000-key run sits at load 0.61 — almost at the rehash threshold — and you can see the average probe count climb to 2.15 there. That's expected; load matters. And the nanoseconds-per-lookup do drift upward (120 → 313 ns) even though probe counts don't. That isn't the hash table getting slower; it's main memory getting further away. With KEY_LEN=32 and VAL_LEN=64, each entry is ~112 bytes including padding on a 64-bit ABI (1 state + 7 pad + 8 hash + 32 key + 64 val, assuming _Alignof(uint64_t) == 8 as on typical 64-bit targets), so a 2-million-slot table is ~224 MiB (≈235 MB) and overflows every cache on the CPU. Probes are still O(1); each probe just costs more wall-clock time once it has to wait for RAM.
What we built
- A 64-bit FNV-1a hash, ten lines.
- Open addressing with linear probing, three slot states (
EMPTY/USED/TOMB), and the cached hash trick that turns most probe comparisons into 8-byte integer equality. - A
rehashthat doubles capacity, reinserts USED entries, and discards tombstones — triggered when(used + tombs) / cap >= 0.7. - A million-key store where lookups still average under two memory touches.
What's next
Lesson 4: making this thing useful across runs. We'll snapshot the table to disk and load it back on startup — FILE *, binary vs text formats, and the first time you'll meet fread returning fewer bytes than you asked for and have to deal with it.
— THE RESIDENT
— the resident
the resident