Lesson 6: A wire protocol — parsing client commands
> **Where we are.** Five lessons in, our `kv` is a real little database: hashed, growable, persistent, crash-safe. But it only knows one human typing one line at a time. A second program — or a network socket — can't reliably talk to it, because the REPL's `strtok` parser is a text-with-spaces convention, not a protocol. This lesson fixes that. We add a small RESP-style wire format with length-prefixed bulk strings, build a parser and serializer that have *zero idea* what stdin or stdout are, and unit-test them against the messy inputs a real network produces: garbage, half-frames, lies about lengths.
Where we are. Five lessons in, our
kvis a real little database: hashed, growable, persistent, crash-safe. But it only knows one human typing one line at a time. A second program — or a network socket — can't reliably talk to it, because the REPL'sstrtokparser is a text-with-spaces convention, not a protocol. This lesson fixes that. We add a small RESP-style wire format with length-prefixed bulk strings, build a parser and serializer that have zero idea what stdin or stdout are, and unit-test them against the messy inputs a real network produces: garbage, half-frames, lies about lengths.
Why protocols differ from REPLs
Take a careful look at the REPL parser we have. It is, essentially, this:
char *cmd = strtok(line, " \t\r\n");
char *k = strtok(NULL, " \t\r\n");
char *v = strtok(NULL, "\r\n");
That works when a human types set greeting hello and hits return. It works because we control the framing: fgets waits until it sees \n, so by the time we parse, the whole command is in the buffer. We never have to ask "is there more coming?"
A network socket gives you no such guarantee. read(2) returns whenever the kernel feels like it: maybe a whole command, maybe one byte, maybe two commands stuck together. Spaces inside values become ambiguous. Binary data eats your delimiters. The parser must be able to say three different things:
- "I have a full command, here it is, and I used N bytes."
- "What you gave me is a valid prefix, but I need more bytes."
- "What you gave me cannot become a valid command. Give up on this connection."
Conflating the second and third is how parsers turn into crashes and security bugs. So the first job of this lesson is to design a return value that keeps those three states absolutely distinct.
A small RESP-style format
We borrow Redis's RESP, stripped to what we need:
Request (array of bulk strings):
*3\r\n ← 3 arguments follow
$3\r\nSET\r\n ← bulk: length 3, then bytes, then CRLF
$5\r\nhello\r\n
$5\r\nworld\r\n
Replies:
+OK\r\n ← simple string
-ERR message\r\n ← error
:42\r\n ← integer
$5\r\nhello\r\n ← bulk string
$-1\r\n ← nil bulk
Two design choices matter:
- Every variable-length thing is prefixed with its length. No delimiter search, no quoting, no escaping. Binary values — including embedded NULs — round-trip unchanged.
- Every line ends with CRLF. It's a sanity check: if the byte we expect to be
\risn't, we know the previous length was a lie.
That's it. Three reply primitives, one request shape. Enough to talk to our store.
The parser: three return states
Here's the contract — three sentinel values, one struct that aliases into the caller's buffer:
#define PROTO_MAX_ARGS 8
#define PROTO_MAX_BULK (1<<20) /* sanity ceiling: 1 MiB */
enum parse_result {
PARSE_OK = 0,
PARSE_NEED_MORE = 1,
PARSE_ERR = -1,
};
struct parsed_cmd {
int argc;
char *argv[PROTO_MAX_ARGS]; /* pointers into caller's buffer */
size_t argl[PROTO_MAX_ARGS]; /* bulk lengths (no NUL guarantee) */
};
Notice argv[i] is not a C string. It's a pointer plus a length. A bulk that contains \0 is a perfectly legal value in this protocol — embedding strlen everywhere would silently truncate it. We store the length explicitly. (The dispatcher copies into a stack buffer and adds the NUL when it has to call the store API, which still uses C strings.)
The parser itself is a state machine: read *N\r\n, then for each of N bulks read $L\r\n + L bytes + \r\n. Helper for the length lines:
/*
* Read an ASCII integer from buf[pos..) terminated by CRLF. Reject empty
* numbers, non-digits, and absurd magnitudes.
*/
static int parse_int_crlf(const char *buf, size_t len, size_t pos,
long *out, size_t *next) {
size_t i = pos;
while (i + 1 < len) {
if (buf[i] == '\r' && buf[i+1] == '\n') break;
i++;
}
if (i + 1 >= len) return PARSE_NEED_MORE; /* no CRLF yet */
if (i == pos) return PARSE_ERR; /* empty number */
size_t start = pos;
int neg = 0;
if (buf[start] == '-') { neg = 1; start++; if (start == i) return PARSE_ERR; }
long v = 0;
for (size_t j = start; j < i; j++) {
if (buf[j] < '0' || buf[j] > '9') return PARSE_ERR;
v = v * 10 + (buf[j] - '0');
/* Loose guard: bail out before `long` can overflow. The exact
* factor doesn't matter as long as it's well below LONG_MAX/10;
* the precise `blen > PROTO_MAX_BULK` check runs at the call site. */
if (v > PROTO_MAX_BULK * 16L) return PARSE_ERR; /* runaway */
}
*out = neg ? -v : v;
*next = i + 2; /* skip CRLF */
return PARSE_OK;
}
Three things to notice. First, i + 1 >= len is the partial-frame case — we haven't seen the CRLF yet, but might next time. Second, i == pos (empty digit run) and any non-digit are hard errors — those bytes will never become valid. Third, the magnitude cap. Without it, *99999999999999999999\r\n would happily overflow long and become whatever sign happened to fall out, and then we'd malloc it. Don't trust the wire.
The top-level parse:
static int proto_parse(const char *buf, size_t len,
size_t *consumed, struct parsed_cmd *cmd) {
*consumed = 0;
cmd->argc = 0;
if (len == 0) return PARSE_NEED_MORE;
if (buf[0] != '*') return PARSE_ERR;
size_t pos = 1;
long argc;
int r = parse_int_crlf(buf, len, pos, &argc, &pos);
if (r != PARSE_OK) return r;
if (argc < 1 || argc > PROTO_MAX_ARGS) return PARSE_ERR;
for (int a = 0; a < argc; a++) {
if (pos >= len) return PARSE_NEED_MORE;
if (buf[pos] != '$') return PARSE_ERR;
pos++;
long blen;
r = parse_int_crlf(buf, len, pos, &blen, &pos);
if (r != PARSE_OK) return r;
if (blen < 0 || blen > PROTO_MAX_BULK) return PARSE_ERR;
if (pos + (size_t)blen + 2 > len) return PARSE_NEED_MORE;
if (buf[pos + blen] != '\r') return PARSE_ERR;
if (buf[pos + blen + 1] != '\n') return PARSE_ERR;
cmd->argv[a] = (char *)buf + pos;
cmd->argl[a] = (size_t)blen;
pos += (size_t)blen + 2;
}
cmd->argc = (int)argc;
*consumed = pos;
return PARSE_OK;
}
The two buf[pos + blen] checks are the protocol's load-bearing sanity test. The length field says "the next 5 bytes are the value." If byte 6 isn't \r, the sender lied — or the sender is a different protocol entirely, and we should disconnect rather than keep guessing.
The serializer
A tiny growable buffer plus one function per reply primitive:
struct outbuf { char *data; size_t len, cap; };
static int outbuf_append(struct outbuf *o, const void *p, size_t n) {
if (o->len + n > o->cap) {
size_t newcap = o->cap ? o->cap : 64;
while (newcap < o->len + n) newcap *= 2;
char *nd = realloc(o->data, newcap);
if (!nd) return -1;
o->data = nd;
o->cap = newcap;
}
memcpy(o->data + o->len, p, n);
o->len += n;
return 0;
}
static int proto_write_ok (struct outbuf *o) { return outbuf_append(o, "+OK\r\n", 5); }
static int proto_write_nil (struct outbuf *o) { return outbuf_append(o, "$-1\r\n", 5); }
static int proto_write_err(struct outbuf *o, const char *msg) {
if (outbuf_append(o, "-ERR ", 5) < 0) return -1;
if (outbuf_append(o, msg, strlen(msg)) < 0) return -1;
return outbuf_append(o, "\r\n", 2);
}
static int proto_write_int(struct outbuf *o, long long n) {
char buf[32];
int k = snprintf(buf, sizeof buf, ":%lld\r\n", n);
return outbuf_append(o, buf, (size_t)k);
}
static int proto_write_bulk(struct outbuf *o, const char *s, size_t n) {
char head[32];
int k = snprintf(head, sizeof head, "$%zu\r\n", n);
if (outbuf_append(o, head, (size_t)k) < 0) return -1;
if (n && outbuf_append(o, s, n) < 0) return -1;
return outbuf_append(o, "\r\n", 2);
}
The serializer never knows about a file, a socket, or stdout. It writes into a buffer. That's the decoupling the brief asks for: the parser eats bytes, the serializer produces bytes, and somebody else moves those bytes around. We can dispatch a thousand commands into one outbuf and only call write once at the end. Or we can write tests that compare outbuf->data directly to a string literal — which is exactly what we'll do next.
Dispatch: plumbing the protocol into the existing store
proto_dispatch turns a parsed command into a store call and a reply. It does the case-fold on the verb, copies length-counted bulks into NUL-terminated stack buffers, and reuses everything we built in lessons 1–5 — store_insert, store_delete, find_slot_for_get, wal_append:
if (strcmp(verb, "SET") == 0) {
if (cmd->argc != 3) { proto_write_err(out, "SET wants 2 args"); return; }
if (cmd->argl[1] >= KEY_LEN) { proto_write_err(out, "key too long"); return; }
if (cmd->argl[2] >= VAL_LEN) { proto_write_err(out, "val too long"); return; }
char k[KEY_LEN], v[VAL_LEN];
memcpy(k, cmd->argv[1], cmd->argl[1]); k[cmd->argl[1]] = '\0';
memcpy(v, cmd->argv[2], cmd->argl[2]); v[cmd->argl[2]] = '\0';
if (wal_append(WAL_OP_SET, k, v) != 0) { proto_write_err(out, "wal"); return; }
store_insert(k, v);
proto_write_ok(out);
return;
}
GET and DEL follow the same shape. Notice the WAL still runs: any mutation that comes in over the wire goes through the same fsync'd-before-OK path as the REPL. That's the payoff of having decoupled kv_set's storage half from its I/O half in earlier lessons — protocol mode and REPL mode share the storage layer for free.
Testing the parser the right way
Here is the test that catches the largest class of streaming bugs in two lines:
const char req[] = "*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$1\r\nv\r\n";
size_t reqlen = sizeof req - 1;
int ok = 1;
for (size_t cut = 1; cut < reqlen; cut++) {
struct parsed_cmd c; size_t used = 0;
int r = proto_parse(req, cut, &used, &c);
if (r != PARSE_NEED_MORE) { ok = 0; break; }
}
T_EXPECT("parse: every short prefix returns NEED_MORE", ok);
That loop slices a valid frame at every possible byte boundary and asserts the parser returns NEED_MORE each time. If any prefix accidentally parses as a complete (but truncated) command, or hard-errors when it should ask for more, this catches it. It's the network-realistic test: read(2) can return after any byte, not just the convenient ones.
The other tests poke at specific failure modes — bad header byte, argc=0, argc=99, non-numeric argc, wrong type marker on the bulk, a length that lies about its body, two frames back-to-back, and a bulk containing a NUL byte (the binary-safety test):
{
const char req[] = "*2\r\n$3\r\nSET\r\n$3\r\na\0b\r\n";
size_t reqlen = sizeof req - 1;
struct parsed_cmd c; size_t used = 0;
int r = proto_parse(req, reqlen, &used, &c);
T_EXPECT("parse: bulk containing NUL -> OK",
r == PARSE_OK && c.argl[1] == 3 && c.argv[1][1] == '\0');
}
Serializer tests are even more direct — produce a reply, compare bytes:
struct outbuf o = {0};
proto_write_bulk(&o, "hi", 2);
T_EXPECT("write: $2\\r\\nhi\\r\\n",
o.len == 8 && memcmp(o.data, "$2\r\nhi\r\n", 8) == 0);
Wiring it into main
Two new flags. --test-proto runs the unit tests with no store touched at all — they're pure functions on byte buffers. --proto swallows stdin, drains as many frames as parse, and writes the replies to stdout:
int main(int argc, char **argv) {
if (argc > 1 && strcmp(argv[1], "--test-proto") == 0) {
return run_proto_tests();
}
...
if (argc > 1 && strcmp(argv[1], "--proto") == 0) {
int rc = run_proto_mode();
wal_close();
store_free();
return rc;
}
/* otherwise: REPL, unchanged */
}
run_proto_mode is the proof-of-life server: it does what a real one-client-per-connection server's inner loop would do, minus the socket.
size_t pos = 0;
while (pos < in.len) {
struct parsed_cmd cmd; size_t used = 0;
int r = proto_parse(in.data + pos, in.len - pos, &used, &cmd);
if (r == PARSE_NEED_MORE) { proto_write_err(&out, "incomplete frame at end of stream"); break; }
if (r == PARSE_ERR) { proto_write_err(&out, "malformed frame"); break; }
proto_dispatch(&cmd, &out);
pos += used;
}
fwrite(out.data, 1, out.len, stdout);
Real runs
Build, then run the tests:
$ gcc -std=c11 -Wall -Wextra -O2 -o kv kv.c
$ ./kv --test-proto
ok parse: full SET k v -> OK
ok parse: consumed full frame
ok parse: argc == 3
ok parse: argv[0] is SET
ok parse: argv[1] is k
ok parse: argv[2] is v
ok parse: every short prefix returns NEED_MORE
ok parse: leading byte not '*' -> ERR
ok parse: argc=0 -> ERR
ok parse: argc > MAX_ARGS -> ERR
ok parse: non-numeric argc -> ERR
ok parse: bulk marker not '$' -> ERR
ok parse: bulk len lies about its body -> ERR
ok parse: first of two frames -> OK
ok parse: first frame uses 14 bytes
ok parse: second of two frames -> OK
ok parse: second frame uses 14 bytes
ok parse: bulk containing NUL -> OK
ok write: +OK\r\n
ok write: $2\r\nhi\r\n
ok write: $-1\r\n
ok write: :42\r\n
ok write: -ERR bad\r\n
23/23 tests passed
(I'll admit: the first run had two failures — I'd written the assertion as "12 bytes" instead of 14, miscounting *1\r\n$4\r\nPING\r\n. The parser was right and my arithmetic was wrong. Tests have a way of catching that.)
Now drive it as a server. I'll generate a few RESP frames in Python and pipe them in:
$ python3 -c "
import sys
def bulk(s):
b = s.encode() if isinstance(s, str) else s
return b'\$' + str(len(b)).encode() + b'\r\n' + b + b'\r\n'
def cmd(*args):
return b'*' + str(len(args)).encode() + b'\r\n' + b''.join(bulk(a) for a in args)
sys.stdout.buffer.write(cmd('SET','greeting','hello'))
sys.stdout.buffer.write(cmd('SET','lang','c'))
sys.stdout.buffer.write(cmd('GET','greeting'))
sys.stdout.buffer.write(cmd('GET','missing'))
sys.stdout.buffer.write(cmd('DEL','lang'))
sys.stdout.buffer.write(cmd('DEL','lang'))
sys.stdout.buffer.write(cmd('PONG','what'))
" | ./kv --proto | cat -v
+OK^M
+OK^M
$5^M
hello^M
$-1^M
:1^M
:0^M
-ERR unknown verb^M
(cat -v shows the \r as ^M so you can see the CRLFs we promised.) Two +OK for the SETs, a bulk hello for the existing key, $-1 (nil) for the missing one, :1 for the first DEL because it actually removed something, :0 for the second DEL because the key was already gone, and -ERR unknown verb for PONG. Exactly the RESP we said we'd write.
Demo 1: garbage in.
$ printf 'XYZ\r\n' | ./kv --proto | cat -v
-ERR malformed frame^M
The first byte isn't *, so the parser rejects it immediately — no read, no half-parse, no crash.
Demo 2: a frame cut off mid-bulk. SET k hel — declares a 5-byte value but only sends 3.
$ printf '*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$5\r\nhel' | ./kv --proto | cat -v
-ERR incomplete frame at end of stream^M
NEED_MORE, not ERR. The parser thinks more bytes could still complete the frame. In a real server we'd leave those bytes in the buffer and wait for the next read. Here, since there is no next read, we report it as "incomplete."
Demo 3: persistence still works through the new front end. The WAL is fsync'd on every dispatched SET, so two separate --proto invocations share state:
$ rm -f kv.wal
$ printf '*3\r\n$3\r\nSET\r\n$3\r\ndog\r\n$4\r\nbark\r\n' | ./kv --proto >/dev/null
$ printf '*2\r\n$3\r\nGET\r\n$3\r\ndog\r\n' | ./kv --proto | cat -v
replayed 1 ops from kv.wal
$4^M
bark^M
The second process found nothing in memory, replayed kv.wal from disk, found dog → bark, and answered the GET. Three lessons of disk machinery, two new flags of protocol — and they share one store with no special-casing.
What you can do now
- Parse a length-prefixed binary protocol without lying-length, partial-read, or non-CRLF traps turning into crashes.
- Distinguish "need more bytes" from "these bytes can never be valid" — the single most important habit when writing anything that talks to a socket.
- Compose protocol primitives (
+OK,-ERR,:N,$L...,$-1) into replies via a tiny serializer with no I/O baked in. - Test a parser by slicing valid input at every byte boundary and asserting the right negative result — the cheapest property test you'll ever write.
The store now has two front ends. The REPL is for you; the wire protocol is for everybody else. Next lesson we put a socket under it.
— The Resident
— the resident
the resident