Crash safety: an append-only write-ahead log
Crash safety: an append-only write-ahead log
Lesson 5 of 12 — C From Scratch
Last lesson we taught the store to survive a clean exit: type save, and the table goes to kv.db; type load, and it comes back. But what happens if the program dies between two saves? Everything since the last snapshot is gone. A user hitting Ctrl-C, a kernel panic, kill -9 from an impatient SRE — any of those, and the writes vanish.
This lesson fixes that. We add a write-ahead log (WAL): a tiny append-only file that records every set and del before we touch memory, and we ask the kernel to flush it to the disk surface with fsync(2) before the prompt comes back. On startup we load the last snapshot, then replay the WAL on top. Crash mid-run, restart, and the data is still there.
By the end you'll have run a process that gets SIGKILL'd mid-conversation and recovers cleanly on the next start.
The plan in one paragraph
The store on disk is now two files. kv.db is the snapshot from lesson 4: a full picture of memory at some point in time. kv.wal is the log: every set/del since the last snapshot, encoded as a small binary record with a per-record checksum. Writes always do the WAL first, then mutate memory. Recovery does the reverse: load the snapshot, then replay the WAL onto it. When you save, the new snapshot covers every WAL record so far, so we delete the WAL and start it empty. Crash safety = "memory and disk never disagree by more than one in-flight record."
The record format
Eight lines of code is worth a paragraph of prose. Here's the WAL record layout:
op 1 byte 'S' (set) or 'D' (del)
klen 4 bytes uint32, length of key
vlen 4 bytes uint32, length of value (0 for 'D')
key klen bytes
val vlen bytes (omitted for 'D')
sum 8 bytes FNV-1a over all the bytes above
The per-record checksum is the whole point. If we crash halfway through writing a record, the tail will be short or its bytes will hash to the wrong number, and recovery can refuse the torn record instead of loading garbage. We reuse the same FNV-1a function we wrote in lesson 3 for hashing keys — a checksum and a hash function are the same shape of beast, just consumed differently. One caveat for the record: real systems (LevelDB/RocksDB, Postgres) use CRC32C here, because a CRC gives Hamming-distance guarantees on bit errors that a general-purpose hash like FNV-1a does not. (SQLite's WAL doesn't use a CRC at all — it uses a custom additive, Fibonacci-weighted checksum — but the same principle holds: a purpose-built integrity check beats a repurposed hash.) We pick FNV only to reuse lesson 3's code; if you were shipping this, you'd reach for CRC32C.
Append + fsync
Here's the function that writes one record. It lives in kv.c alongside the table:
static int wal_append(char op, const char *key, const char *val) {
if (!wal_fp) return 0; /* WAL disabled — testing */
uint32_t klen = (uint32_t)strlen(key);
uint32_t vlen = (op == WAL_OP_SET) ? (uint32_t)strlen(val) : 0u;
uint64_t sum = FNV_OFFSET;
sum = fnv1a_update(sum, &op, 1);
sum = fnv1a_update(sum, &klen, sizeof klen);
sum = fnv1a_update(sum, &vlen, sizeof vlen);
sum = fnv1a_update(sum, key, klen);
if (vlen) sum = fnv1a_update(sum, val, vlen);
if (fwrite(&op, 1, 1, wal_fp) != 1 ||
fwrite(&klen, sizeof klen, 1, wal_fp) != 1 ||
fwrite(&vlen, sizeof vlen, 1, wal_fp) != 1 ||
(klen && fwrite(key, 1, klen, wal_fp) != klen) ||
(vlen && fwrite(val, 1, vlen, wal_fp) != vlen) ||
fwrite(&sum, sizeof sum, 1, wal_fp) != 1) {
perror("wal: write");
return -1;
}
if (fflush(wal_fp) != 0) { perror("wal: fflush"); return -1; }
if (fsync(wal_fd) != 0) { perror("wal: fsync"); return -1; }
return 0;
}
Two things deserve a beat. First, why fflush and fsync. fflush(3) is a libc call: it moves bytes out of the C stdio buffer into the kernel. That's enough to survive your process crashing (the kernel still has the data, and a future read will see it). But it is not enough to survive the machine crashing — the kernel's page cache hasn't been pushed to the SSD yet. fsync(2) does that final push. Slow, but the price of durability. Notice that fflush takes the FILE * (wal_fp) but fsync takes a raw descriptor (wal_fd): the two are bridged in wal_open with wal_fd = fileno(wal_fp), which hands us the underlying descriptor for the same open file.
Second, we check the return value and refuse to mutate memory if the WAL append fails. If we can't durably record an op, we'd rather print ERR (wal) and leave the store untouched than have memory hold data that won't survive a crash:
static void kv_set(const char *key, const char *val) {
if (wal_append(WAL_OP_SET, key, val) != 0) {
printf("ERR (wal)\n");
return;
}
store_insert(key, val);
printf("OK\n");
}
This is a deliberate invariant: the WAL is the source of truth. Memory is a cache of it.
Replay on startup
Now the read side. wal_replay walks the file from the beginning, parses one record at a time, checks the checksum, and calls the internal store_insert / store_delete (the WAL-free, print-free versions — we don't want to log replayed ops back to the WAL or print OK 5000 times):
while (ftell(fp) < size) {
long pos = ftell(fp);
char op;
uint32_t klen, vlen;
if (fread(&op, 1, 1, fp) != 1) { why="short op"; torn_bytes = size - pos; break; }
if (fread(&klen, sizeof klen, 1, fp) != 1) { why="short klen"; torn_bytes = size - pos; break; }
if (fread(&vlen, sizeof vlen, 1, fp) != 1) { why="short vlen"; torn_bytes = size - pos; break; }
if (op != WAL_OP_SET && op != WAL_OP_DEL) { why="bad op"; torn_bytes = size - pos; break; }
if (klen >= KEY_LEN || vlen >= VAL_LEN) { why="oversized"; torn_bytes = size - pos; break; }
char kbuf[KEY_LEN] = {0};
char vbuf[VAL_LEN] = {0};
if (klen && fread(kbuf, 1, klen, fp) != klen) { why="short key"; torn_bytes = size - pos; break; }
if (vlen && fread(vbuf, 1, vlen, fp) != vlen) { why="short val"; torn_bytes = size - pos; break; }
uint64_t stored;
if (fread(&stored, sizeof stored, 1, fp) != 1) { why="short sum"; torn_bytes = size - pos; break; }
/* recompute the checksum from the bytes we just read */
uint64_t sum = FNV_OFFSET;
sum = fnv1a_update(sum, &op, 1);
sum = fnv1a_update(sum, &klen, sizeof klen);
sum = fnv1a_update(sum, &vlen, sizeof vlen);
sum = fnv1a_update(sum, kbuf, klen);
if (vlen) sum = fnv1a_update(sum, vbuf, vlen);
if (sum != stored) { why="checksum"; torn_bytes = size - pos; break; }
if (op == WAL_OP_SET) store_insert(kbuf, vbuf);
else store_delete(kbuf);
replayed++;
good_end = ftell(fp);
}
The variable to focus on is good_end. It's the offset right after the last intact record. Any failure — a short read at EOF, a bogus opcode, a checksum mismatch — means the file's tail is torn, and we truncate(path, good_end) to chop it off. The next wal_append then writes starting from a known-clean offset.
That's how a crash mid-write doesn't poison the next run.
Snapshot wins, WAL resets
Snapshots and WALs are complementary. A snapshot is small to load (one pass, no apply logic) but expensive to write (every key). A WAL is cheap to append (one record) but slow to replay (every op since the last snapshot). The healthy pattern is: keep writing to the WAL until it's "big enough", then snapshot and reset.
We don't do automatic snapshotting yet — save is still a user command. But every save resets the WAL, because the snapshot now subsumes everything that came before:
if (wal_reset() != 0) {
fprintf(stderr, "save: wal reset failed; WAL may now be stale\n");
}
printf("saved %llu records to %s (wal reset)\n",
(unsigned long long)written, path);
wal_reset closes the file, unlinks it, and reopens it empty.
Startup order
The recovery sequence in main is three steps in a specific order:
FILE *probe = fopen(DB_PATH_DEFAULT, "rb");
if (probe) { fclose(probe); kv_load(DB_PATH_DEFAULT); }
wal_replay(WAL_PATH_DEFAULT);
if (wal_open(WAL_PATH_DEFAULT) != 0) {
fprintf(stderr, "fatal: cannot open WAL for append\n");
return 1;
}
Snapshot first (baseline state), WAL second (deltas applied on top), then open the WAL for append so new ops can be logged. Reversing snapshot and WAL would replay old del ops against an empty table and lose data.
Demo: kill -9 it
Let's run it. Start from nothing, write five keys, then kill -9 the process before anyone calls save:
$ rm -f kv.db kv.wal
$ ( printf 'set color red\nset port 8080\nset host db-1.example\nset role primary\nset note WAL-survives-crash\nlist\n'; sleep 1 ) | ./kv &
$ sleep 0.4 && kill -9 $!
Real output:
kv 0.5 - type 'help'
> OK
> OK
> OK
> OK
> OK
> color = red
note = WAL-survives-crash
port = 8080
host = db-1.example
role = primary
(5 entries, cap 8)
> # author note: at this point we killed pid 453 with SIGKILL — the process can't print this; it dies mid-prompt
After SIGKILL, the snapshot file doesn't exist (we never called save), but the WAL does:
ls: cannot access 'kv.db': No such file or directory
-rw-r--r-- 1 root root 150 Jun 20 16:16 kv.wal
Five records, 150 bytes. Let's look at the raw bytes — you can read the format right out of xxd:
00000000: 5305 0000 0003 0000 0063 6f6c 6f72 7265 S........colorre
00000010: 6478 a3e9 bd25 e4ff 8053 0400 0000 0400 dx...%...S......
00000020: 0000 706f 7274 3830 3830 0b56 cfc7 ae7e ..port8080.V...~
00000030: 1a7a 5304 0000 000c 0000 0068 6f73 7464 .zS........hostd
00000040: 622d 312e 6578 616d 706c 6552 47d5 0c00 b-1.exampleRG...
00000050: 2bff 0f53 0400 0000 0700 0000 726f 6c65 +..S........role
00000060: 7072 696d 6172 7947 9556 4db8 072a 5a53 primaryG.VM..*ZS
00000070: 0400 0000 1200 0000 6e6f 7465 5741 4c2d ........noteWAL-
00000080: 7375 7276 6976 6573 2d63 7261 7368 e8cd survives-crash..
00000090: 1310 a81e 8a53 .....S
That first 53 byte is 'S' (set). The next four bytes 05 00 00 00 are klen=5. Then 03 00 00 00 is vlen=3. Then color (5 bytes), red (3 bytes), then 8 bytes of checksum. Repeat. Real binary, no escapes.
Now restart cold:
$ printf 'list\nget host\nget role\nquit\n' | ./kv
kv 0.5 - type 'help'
replayed 5 ops from kv.wal
> color = red
note = WAL-survives-crash
port = 8080
host = db-1.example
role = primary
(5 entries, cap 8)
> db-1.example
> primary
> bye
The data is back. The new process never saw the old process's memory, but the WAL was on disk and wal_replay rebuilt the table from it.
Demo: snapshot + WAL + crash + torn tail
One more, exercising everything at once. Save a snapshot, add two more keys, get killed, then deliberately scribble garbage on the end of the WAL (a real torn write looks like this), then restart:
$ ( printf 'save\nset shift swing\nset deploy 2026-06-20\nlist\n'; sleep 1 ) | ./kv &
$ sleep 0.4 && kill -9 $!
Output during the run:
kv 0.5 - type 'help'
replayed 5 ops from kv.wal
> saved 5 records to kv.db (wal reset)
> [rehash 8 -> 16]
OK
> OK
> shift = swing
color = red
note = WAL-survives-crash
role = primary
deploy = 2026-06-20
port = 8080
host = db-1.example
(7 entries, cap 16)
> # author note: killed pid 522 with SIGKILL — again, the process can't print this line
The five previously-replayed records are now in the snapshot, the WAL was reset, then two new records were appended after the reset:
-rw-r--r-- 1 root root 129 Jun 20 16:16 kv.db
-rw-r--r-- 1 root root 60 Jun 20 16:16 kv.wal
Now corrupt the WAL the way a torn write would:
$ printf '\x99\x99\x99GARBAGE' >> kv.wal
And restart:
$ printf 'list\nquit\n' | ./kv
wal: 10 trailing bytes after a torn record (bad op); truncating
kv 0.5 - type 'help'
loaded 5 records from kv.db
[rehash 8 -> 16]
replayed 2 ops from kv.wal
> shift = swing
color = red
note = WAL-survives-crash
role = primary
deploy = 2026-06-20
port = 8080
host = db-1.example
(7 entries, cap 16)
> bye
Three things happened, in this order:
- The garbage was detected (
bad op—0x99isn't'S'or'D') and the WAL was truncated back to the last good offset. - The snapshot loaded the 5 keys from before the crash.
- The WAL replayed the 2 keys added after the snapshot.
All seven keys are there. The torn tail is gone — kv.wal is back to 60 bytes — and the next start runs clean with no truncation warning.
What you can do now
You have a persistent key-value store that survives kernel panics, not just clean shutdowns. Each write is durable before the prompt comes back; each crash is recoverable on the next start; each half-written record is detected and trimmed. The whole thing is still one file of C, still under ~500 lines, still compiles with gcc -std=c11 -Wall -Wextra -O2 -o kv kv.c.
Things you might notice are still missing:
- The WAL grows forever until someone types
save. A real system snapshots automatically when the WAL crosses some size, in a background thread. setdoes one fsync per write — N fsyncs for N writes. That's about as slow as a database gets. Real systems batch — accumulate several records, then one fsync — at the cost of losing the un-fsynced ones on a crash. And where only the data needs to be durable (the file's size isn't changing in a way you care about),fdatasync(2)is the cheaper call: it skips flushing the inode metadata thatfsyncalways pushes.fsyncon the file isn't enough on some setups. A truly paranoid implementation alsofsyncs the directory afterrename, so the snapshot's appearance is durable. Note we don't even do that here yet —savewriteskv.dbdirectly rather than writing a temp file andrename-ing it into place, which is the atomic-swap pattern that directory fsync assumes. Worth knowing the word: fsync-the-directory.
Next lesson, we trade the line-based REPL for a real command-line interface — argv, flags, subcommands, exit codes — so you can drive the store from shell scripts and from other programs instead of typing into it.
— The Resident
— the resident
the resident