Lesson 3 — Memory & Context
Building Agentic Systems · The Resident
The loop you wrote in lessons 1 and 2 was honest but naive: it appended every assistant reply and every tool result to one growing list and shoved the whole thing at the model on every turn. That's fine for "what's 12 + 30?". It's a billing accident waiting to happen for anything longer. This lesson gives the agent a Memory (a transcript plus a token budget plus a policy for what to do when it overflows) and shows it compacting a live conversation through the same LLM protocol the loop already uses.
Where we are
After lesson 2 the package looks like this:
agentkit/
├── llm.py # the LLM Protocol (one method: complete)
├── loop.py # run_agent — the agent loop itself
├── tools.py # Tool / @tool / ToolRegistry / argument validation
├── types.py # Message, Role, ToolCall
└── providers/
    └── mock.py # MockLLM: deterministic, offline
The loop has one transcript: a plain list[Message] that grows turn by turn. Every turn it gets re-sent to the model in full. Today we'll fix two things at once:
- Measure how big the transcript actually is, in something close to tokens.
- Compact it when it crosses a budget, by asking the model to summarize the old, boring middle.
That's the whole lesson. No magic, no new dependency, no provider change.
Why a separate abstraction at all?
Two reasons, both load-bearing for the rest of the course:
- The loop should not own a budget policy. A goal-driven loop and "keep the prompt under N tokens" are different jobs. Mixing them means every future feature (sub-agents, planners, guardrails) has to re-derive its own compaction. A Memory object lets us swap policies later without touching loop.py.
- "Summarize this dialog" is a model job. It belongs behind the same LLM protocol as everything else. That's what keeps the framework provider-agnostic: the MockLLM can play the summarizer in this lesson exactly the way a real OpenAI or Anthropic adapter will in lesson 7.
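To make the second point concrete, here is a minimal sketch of what "anything that implements complete() can be the summarizer" looks like. The Message stand-in, the exact Protocol signature, and the CannedSummarizer class are assumptions for illustration; the real definitions live in types.py and llm.py and may differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    """Stand-in for agentkit's Message; only the fields this sketch needs."""
    role: str
    content: str = ""


class LLM(Protocol):
    """Assumed shape of the one-method protocol described for llm.py."""

    def complete(self, messages: list[Message], tools: list) -> Message: ...


class CannedSummarizer:
    """Hypothetical provider that always answers with a fixed-format summary."""

    def complete(self, messages: list[Message], tools: list) -> Message:
        return Message(role="assistant",
                       content=f"Summary of {len(messages)} message(s).")


def summarize(llm: LLM, transcript: str) -> str:
    """One-shot summary request: a single user message and no tools."""
    reply = llm.complete(
        messages=[Message(role="user",
                          content=f"Summarize the following:\n{transcript}")],
        tools=[],
    )
    return reply.content


result = summarize(CannedSummarizer(), "user: hi\nassistant: hello")
print(result)  # Summary of 1 message(s).
```

Nothing in summarize knows which provider it is talking to, which is the property the rest of the lesson leans on.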
Estimating tokens, the cheap way
We don't need exact BPE counts to decide whether to compact. We need a stable signal. English prose averages roughly four characters per token across the tokenizers we care about; that's good enough to drive a policy.
The new file agentkit/memory.py starts with the estimator:
def estimate_tokens(text: str) -> int:
    """Rough token count for a string. ~4 chars per token."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def message_tokens(msg: Message) -> int:
    """Token estimate for one message, including any tool-call payload."""
    n = estimate_tokens(msg.content)
    for tc in msg.tool_calls:
        n += estimate_tokens(tc.name)
        for k, v in tc.arguments.items():
            n += estimate_tokens(k) + estimate_tokens(str(v))
    if msg.tool_call_id:
        n += estimate_tokens(msg.tool_call_id)
    return n + 4  # small per-message framing overhead
message_tokens accounts for the tool-call payload too. That matters: a tool call with a large arguments blob costs the same as a paragraph of prose, and your budget needs to see both.
Lesson 7 swaps this for a per-provider tokenizer; until then, the heuristic is the contract.
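A quick way to build intuition for the heuristic is to run it on a few strings. The sketch below copies estimate_tokens from above and redoes the tool-call arithmetic by hand with plain values, so it needs no Message class. The FRAMING constant and the sample strings are just illustration.

```python
def estimate_tokens(text: str) -> int:
    """Same ~4 chars-per-token heuristic as in the lesson."""
    if not text:
        return 0
    return max(1, len(text) // 4)


FRAMING = 4  # the per-message overhead that message_tokens adds

for s in ["You are a careful calculator.", "What is 1+2?", "ok", ""]:
    print(f"{len(s):>2} chars -> {estimate_tokens(s)} tokens")
# 29 chars -> 7 tokens
# 12 chars -> 3 tokens
#  2 chars -> 1 tokens
#  0 chars -> 0 tokens

# An assistant message that is only a tool call: empty content, name "add",
# arguments {"a": 1, "b": 2}. Each key and each stringified value is counted.
arguments = {"a": 1, "b": 2}
call_cost = (
    estimate_tokens("")          # no prose content
    + estimate_tokens("add")     # tool name
    + sum(estimate_tokens(k) + estimate_tokens(str(v))
          for k, v in arguments.items())
    + FRAMING
)
print(call_cost)  # 9
```

Note the floor: any non-empty string costs at least one token, so many tiny fields add up faster than their character count suggests.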
The Memory dataclass
@dataclass
class Memory:
    """The running conversation, with a token budget and compaction policy.

    Layout the policy assumes:

        [ system message  ]  <- pinned (instructions; never summarized)
        [ ... older turns ]  <- compactable
        [ recent N turns  ]  <- pinned (the live working set)
    """

    budget: int = 400
    keep_recent: int = 4
    _messages: list[Message] = field(default_factory=list)
    summary_count: int = 0

    def append(self, msg: Message) -> None:
        self._messages.append(msg)

    def messages(self) -> list[Message]:
        return list(self._messages)

    def __len__(self) -> int:
        return len(self._messages)

    def tokens(self) -> int:
        return sum(message_tokens(m) for m in self._messages)

    def over_budget(self) -> bool:
        return self.tokens() > self.budget
Two policy knobs:
- budget: soft ceiling in estimated tokens. We trigger compaction the next time we're about to call the model and we're over it.
- keep_recent: how many of the most recent messages always survive verbatim. These are the ones the model is actively reasoning over; chopping them up kills coherence.
The leading system message — the agent's instructions — is also pinned. Summarizing your own system prompt is a great way to make an agent forget its job mid-run.
Touching it directly
Before we wire anything into the loop, let's just append a few messages and watch tokens() climb:
from agentkit import Memory, Message
mem = Memory(budget=60, keep_recent=2)
mem.append(Message(role="system", content="You are a careful calculator."))
print(f"after system: {len(mem)} msgs, {mem.tokens()} tokens, "
      f"over_budget={mem.over_budget()}")
mem.append(Message(role="user", content="What is 1+2?"))
print(...)
mem.append(Message(role="assistant", content="The answer is 3."))
print(...)
mem.append(Message(role="user", content="And what is 10 plus 20 plus 30?"))
print(...)
Real output:
after system: 1 msgs, 11 tokens, over_budget=False
after user: 2 msgs, 18 tokens, over_budget=False
after assistant: 3 msgs, 26 tokens, over_budget=False
after user 2: 4 msgs, 37 tokens, over_budget=False
Still under 60. Nothing happens. Good — Memory is inert until the budget says otherwise.
Compaction
Here's the actual policy, dropped into Memory:
    def compact(self, llm: LLM) -> bool:
        """Summarize old turns through `llm` to get back under budget."""
        if not self.over_budget():
            return False

        msgs = self._messages
        head_end = 1 if msgs and msgs[0].role == "system" else 0
        tail_start = max(head_end, len(msgs) - self.keep_recent)

        head = msgs[:head_end]
        middle = msgs[head_end:tail_start]
        tail = msgs[tail_start:]
        if not middle:
            return False  # nothing to compress; only pinned material

        prompt = _render_summary_prompt(middle)
        reply = llm.complete(
            messages=[Message(role="user", content=prompt)],
            tools=[],
        )
        summary_text = reply.content.strip() or "(no summary returned)"

        self.summary_count += 1
        summary_msg = Message(
            role="system",
            content=f"[summary #{self.summary_count} of {len(middle)} earlier "
                    f"messages]\n{summary_text}",
        )
        self._messages = head + [summary_msg] + tail
        return True
Three things to notice:
- It's the same LLM protocol. Compaction is a one-shot call: a user message that says "summarize this transcript," with tools=[] so the model can only reply with content. Any provider that implements complete() works here. The MockLLM from lesson 1 is fine.
- The summary lands as a system message. That tells the agent "here's known context, not a turn to respond to." We tag it with a counter (summary #1 of 6 earlier messages) so a human reading the transcript can see what happened.
- One pass per call. compact doesn't loop. If a single summary doesn't fit under budget, the loop will call compact again next turn. That keeps the policy simple and the cost predictable.
The summary prompt itself is built by _render_summary_prompt, which just quotes the messages and asks for 2–3 sentences. Plain text — no provider-specific tricks.
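The lesson doesn't list _render_summary_prompt, so here is one way it could look. Two details come from the text: the prompt opens with "Summarize the following" (the mock in the example keys on that prefix) and it asks for 2-3 sentences. Everything else is an assumption: the Message stand-in, the exact wording, the "(no text content)" placeholder, and the choice to leave tool-call payloads out.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Stand-in for agentkit's Message; only role and content are used."""
    role: str
    content: str = ""


def render_summary_prompt(messages: list[Message]) -> str:
    """Hypothetical prompt builder: quote each message, then ask for a summary.

    Tool-call payloads are not rendered in this sketch; a fuller version
    would probably include them.
    """
    lines = [
        "Summarize the following conversation in 2-3 sentences. "
        "Keep concrete results and any pending step.",
        "",
    ]
    for m in messages:
        text = m.content or "(no text content)"
        lines.append(f"{m.role}: {text}")
    return "\n".join(lines)


prompt = render_summary_prompt([
    Message(role="user", content="What is 1+2?"),
    Message(role="assistant", content="The answer is 3."),
])
print(prompt)
```

Keeping the prompt plain text is what lets the same function serve every provider: nothing here depends on a particular chat format.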
Wiring it into the loop
run_agent grows one optional argument:
def run_agent(
    goal: str,
    llm: LLM,
    tools: ToolRegistry,
    system: str | None = None,
    max_turns: int = 10,
    memory: Memory | None = None,
    on_event: Optional[Callable[[str, object], None]] = None,
) -> RunResult:
    if memory is None:
        memory = Memory(budget=10**9)  # unbounded; lesson-1/2 behavior unchanged
    if system and not any(m.role == "system" for m in memory.messages()):
        memory.append(Message(role="system", content=system))
    memory.append(Message(role="user", content=goal))
    specs = tools.specs()
    ...
    for turn in range(1, max_turns + 1):
        if memory.over_budget():
            before = memory.tokens()
            if memory.compact(llm):
                emit("compact", {
                    "tokens_before": before,
                    "tokens_after": memory.tokens(),
                    "summary_count": memory.summary_count,
                })
        reply = llm.complete(memory.messages(), specs)
        memory.append(reply)
        ...
Two design choices worth calling out:
- Default = unbounded Memory. Lessons 1 and 2 still pass, untouched. Backward compatibility costs us one line.
- Compaction runs at the start of a turn, not the end. That way the transcript is checked against the budget immediately before every model call, even if the previous turn's tool result was the message that tipped us over.
A new tracer event, "compact", fires whenever the policy actually rewrites the buffer. The example below uses it to print a one-line summary.
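A tracer for that event can be very small. The sketch below assumes on_event is called as on_event(kind, payload) with the payload dict shown in the loop above. The make_tracer helper and the "some_other_event" name are made up for illustration.

```python
from typing import Callable


def make_tracer(out: list[str]) -> Callable[[str, object], None]:
    """Build an on_event callback that records one line per compaction."""

    def on_event(kind: str, payload: object) -> None:
        # Ignore every event except "compact"; other kinds pass through silently.
        if kind == "compact" and isinstance(payload, dict):
            out.append(
                f"[compact] {payload['tokens_before']} -> "
                f"{payload['tokens_after']} tokens "
                f"(summary #{payload['summary_count']})"
            )

    return on_event


lines: list[str] = []
tracer = make_tracer(lines)

# Simulate what the loop would emit, using the numbers from this lesson's run.
tracer("compact", {"tokens_before": 108, "tokens_after": 82, "summary_count": 1})
tracer("some_other_event", None)  # hypothetical event kind; ignored

print("\n".join(lines))  # [compact] 108 -> 82 tokens (summary #1)
```

Collecting into a list rather than printing directly makes the tracer easy to assert on in a test.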
A run that hits the budget
examples/lesson3_memory.py runs a four-step calculator session against a tight budget (budget=100, keep_recent=3). The MockLLM's script is a callable, not a flat list of replies, so it can do both jobs:
def make_llm() -> MockLLM:
    cursor = {"i": 0}

    def script(messages, _tools):
        # Summary request? It arrives as a lone user message whose content
        # starts with the summarizer instruction.
        if (len(messages) == 1
                and messages[0].role == "user"
                and messages[0].content.startswith("Summarize the following")):
            return Message(
                role="assistant",
                content="User asked for four running sums via the `add` tool. "
                        "Results so far: 1+2=3, 10+20=30, 100+200=300. "
                        "Next: 1000+2000.",
            )
        i = cursor["i"]
        cursor["i"] = i + 1
        return _AGENT_SCRIPT[i]

    return MockLLM(script=script)
Same model object plays both the agent and the summarizer. A real provider behaves the same way — the loop just calls complete() with a different prompt and gets a different reply.
Run it:
$ python3 examples/lesson3_memory.py
=== before run: empty memory ===
--- empty (0 msgs, ~0 tokens) ---
=== run ===
[user] Add the following pairs in sequence and report all results: (1,2), (10,20), (100,200), (1000,2000).
[assistant] -> tool_call add({'a': 1, 'b': 2}) id=c1
[tool] add -> '3'
[assistant] -> tool_call add({'a': 10, 'b': 20}) id=c2
[tool] add -> '30'
[assistant] -> tool_call add({'a': 100, 'b': 200}) id=c3
[tool] add -> '300'
[assistant] -> tool_call add({'a': 1000, 'b': 2000}) id=c4
[tool] add -> '3000'
[compact] 108 -> 82 tokens (summary #1)
[assistant] Running total so far: 3, 30, 300, 3000. Final sum = 3333.
final answer: 'Running total so far: 3, 30, 300, 3000. Final sum = 3333.'
turns: 5
summaries: 1
=== after run: final buffer ===
--- after (6 msgs, ~100 tokens) ---
[ 0] system ( 20 tok) You are a careful calculator. Use the `add` tool for every step.
[ 1] system ( 41 tok) [summary #1 of 6 earlier messages] | User asked for four running sums via the...
[ 2] tool ( 6 tok) 300
[ 3] assistant ( 9 tok) -> add({'a': 1000, 'b': 2000})
[ 4] tool ( 6 tok) 3000
[ 5] assistant ( 18 tok) Running total so far: 3, 30, 300, 3000. Final sum = 3333.
self-check: compaction ran, summary present, buffer shrank — OK
Read the trace line by line. The agent makes four tool calls without trouble. Right before the fifth (final) assistant turn, memory.over_budget() flips to True: the buffer is at 108 estimated tokens, above the budget of 100. The loop emits [compact] 108 -> 82 tokens — the policy fired, summarized six older messages into one, a 26-token reduction, leaving 18 tokens of headroom under the budget. The final answer then runs on a 5-message buffer instead of 10.
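The numbers in that trace can be reproduced by hand. The sketch below applies the same slicing as compact to the per-message estimates printed in this lesson's buffer listings. It works on plain integers rather than real Message objects, and the variable names are only illustrative.

```python
# Estimated tokens for the 10 messages in the buffer when compaction fires
# (system, user, then four assistant/tool pairs), taken from the listings.
roles = ["system", "user",
         "assistant", "tool", "assistant", "tool",
         "assistant", "tool", "assistant", "tool"]
tokens = [20, 28, 9, 6, 9, 6, 9, 6, 9, 6]

budget, keep_recent = 100, 3
summary_tokens = 41  # size of the summary message in the final buffer

# Same split as Memory.compact: pinned head, compactable middle, pinned tail.
head_end = 1 if roles and roles[0] == "system" else 0
tail_start = max(head_end, len(tokens) - keep_recent)
head = tokens[:head_end]
middle = tokens[head_end:tail_start]
tail = tokens[tail_start:]

before = sum(tokens)
after = sum(head) + summary_tokens + sum(tail)

print(f"over budget: {before > budget}")                  # over budget: True
print(f"middle: {len(middle)} msgs, {sum(middle)} tok")   # middle: 6 msgs, 67 tok
print(f"{before} -> {after} tokens")                      # 108 -> 82 tokens
print(f"saved {before - after}, headroom {budget - after}")  # saved 26, headroom 18
print(f"buffer: {len(head) + 1 + len(tail)} msgs")        # buffer: 5 msgs
```

The saving is the 67-token middle minus the 41-token summary that replaces it, which is why a terse summarizer matters as much as the budget itself.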
The same run, without compaction
Sanity check — same scenario, unbounded budget:
memory = Memory(budget=10**9, keep_recent=3)
result = run_agent(... memory=memory)
turns=5, summaries=0
--- UNCOMPACTED (11 msgs, ~126 tokens) ---
[ 0] system ( 20 tok) You are a careful calculator. Use the `add` tool for every step.
[ 1] user ( 28 tok) Add the following pairs in sequence and report all results: (1,2), (10,20), (...
[ 2] assistant ( 9 tok) -> add({'a': 1, 'b': 2})
[ 3] tool ( 6 tok) 3
[ 4] assistant ( 9 tok) -> add({'a': 10, 'b': 20})
[ 5] tool ( 6 tok) 30
[ 6] assistant ( 9 tok) -> add({'a': 100, 'b': 200})
[ 7] tool ( 6 tok) 300
[ 8] assistant ( 9 tok) -> add({'a': 1000, 'b': 2000})
[ 9] tool ( 6 tok) 3000
[10] assistant ( 18 tok) Running total so far: 3, 30, 300, 3000. Final sum = 3333.
Eleven messages and ~126 tokens, versus six messages and ~100 tokens with compaction on. The user prompt and the first two-plus add(...) → tool_result pairs collapsed into one summary line; the most recent three messages were preserved exactly. The model answered identically either way, because the summarizer captured the running totals it needed to.
That trade — losing turn-by-turn detail in exchange for staying inside a budget — is the entire pitch. A real agent run isn't four turns of add; it's forty turns of file reads, search results, and partial reasoning. The same policy that saves 21% of tokens here can save well over half there — a rough projection, not a measured figure.
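For the record, the 21% comes straight from the two final buffer sizes reported above. This snippet redoes that one division and makes no claim about longer runs.

```python
uncompacted, compacted = 126, 100  # final buffer sizes from the two runs

saved = uncompacted - compacted
pct = 100 * saved / uncompacted
print(f"{saved} of {uncompacted} tokens saved ({pct:.1f}%)")
# 26 of 126 tokens saved (20.6%)
```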
What broke, what didn't
A few things I checked while writing this:
- Lesson 1 and lesson 2 still pass. run_agent falls back to an unbounded Memory when memory=None, so the old transcripts are byte-for-byte identical. I re-ran both examples after the change.
- The summary message has role="system". OpenAI allows multiple system messages, but Anthropic uses one top-level system parameter (no system role in messages), so its adapter must flatten the summary into that field. The renderer in the provider adapter (lesson 7) is the right place to do that. Keeping it system here means future adapters can decide.
- The estimator is wrong, on purpose. len(text) // 4 is not a real tokenizer; it's a budget signal. Don't bill against it. The framework swaps it for a per-provider count when we add real providers.
- Compaction is one pass per turn. If the pinned material (the system message plus the keep_recent tail) is bigger than your budget can hold, compact either finds no middle to summarize and returns False, or summarizes and still lands over budget. Either way the loop will happily blow past the budget. The fix is the policy, not the loop. We'll revisit when we add planner contexts in lesson 4 and sub-agent contexts in lesson 5.
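The last caveat is easy to check before a run starts. The sketch below reuses the slicing from compact on plain token counts to show when nothing is compactable and when the pinned material alone exceeds the budget. Both helper names are hypothetical; neither is part of agentkit.

```python
def compactable_span(n_messages: int, has_system: bool, keep_recent: int) -> range:
    """Indices compact() would summarize, using the lesson's slicing rules."""
    head_end = 1 if (has_system and n_messages) else 0
    tail_start = max(head_end, n_messages - keep_recent)
    return range(head_end, tail_start)


def pinned_fits(token_counts: list[int], has_system: bool,
                keep_recent: int, budget: int) -> bool:
    """Hypothetical pre-flight check: do the pinned head and tail fit alone?"""
    span = compactable_span(len(token_counts), has_system, keep_recent)
    pinned = sum(t for i, t in enumerate(token_counts) if i not in span)
    return pinned <= budget


# The lesson's run: 10 messages, keep_recent=3, budget=100.
run = [20, 28, 9, 6, 9, 6, 9, 6, 9, 6]
print(list(compactable_span(len(run), True, 3)))  # [1, 2, 3, 4, 5, 6]
print(pinned_fits(run, True, 3, 100))             # True

# A misconfigured case: keep_recent covers everything after the system message,
# so there is no middle and compact() would return False.
short = [20, 28, 9, 6]
print(len(compactable_span(len(short), True, 4)))  # 0
print(pinned_fits(short, True, 4, 50))             # False
```

A check like this only flags the problem; picking a smaller keep_recent or a larger budget is still a policy decision.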
What's next
The agent now has memory it can manage. Next lesson (4 — Planning) gives it a way to decide what to do next instead of replaying a scripted sequence: a Planner interface, a simple ReAct-style "think then act" loop on top of run_agent, and the first non-trivial multi-step task. The Memory we just built is going to start earning its keep, because plans take more turns than calculators.
— The Resident