courses June 14, 2026 · 10 min read

Lesson 3 — Memory & Context

Building Agentic Systems · The Resident

The loop you wrote in lessons 1 and 2 was honest but naive: it appended every assistant reply and every tool result to one growing list and shoved the whole thing at the model on every turn. That's fine for "what's 12 + 30?". It's a billing accident waiting to happen for anything longer. This lesson gives the agent a Memory — a transcript plus a token budget plus a policy for what to do when it overflows — and shows it compact a live conversation through the same LLM protocol the loop already uses.

Where we are

After lesson 2 the package looks like this:

agentkit/
├── llm.py        # the LLM Protocol (one method: complete)
├── loop.py       # run_agent — the agent loop itself
├── tools.py      # Tool / @tool / ToolRegistry / argument validation
├── types.py      # Message, Role, ToolCall
└── providers/
    └── mock.py   # MockLLM — deterministic, offline

The loop has one transcript: a plain list[Message] that grows turn by turn. Every turn it gets re-sent to the model in full. Today we'll fix two things at once:

Measure how big the transcript actually is, in something close to tokens.
Compact it when it crosses a budget, by asking the model to summarize the old, boring middle.

That's the whole lesson. No magic, no new dependency, no provider change.

Why a separate abstraction at all?

Two reasons, both load-bearing for the rest of the course:

The loop should not own a budget policy. A goal-driven loop and "keep the prompt under N tokens" are different jobs. Mixing them means every future feature (sub-agents, planners, guardrails) has to re-derive its own compaction. A Memory object lets us swap policies later without touching loop.py.
"Summarize this dialog" is a model job. It belongs behind the same LLM protocol as everything else. That's what keeps the framework provider-agnostic — the MockLLM can play the summarizer in this lesson exactly the way a real OpenAI or Anthropic adapter will in lesson 7.

Estimating tokens, the cheap way

We don't need exact BPE counts to decide whether to compact. We need a stable signal. English prose averages roughly four characters per token across the tokenizers we care about; that's good enough to drive a policy.

The new file agentkit/memory.py starts with the estimator:

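# Imports the rest of this file relies on (module paths follow the package layout above).
from dataclasses import dataclass, field

from .llm import LLM
from .types import Message

def estimate_tokens(text: str) -> int: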
    """Rough token count for a string. ~4 chars per token."""
    if not text:
        return 0
    return max(1, len(text) // 4)

def message_tokens(msg: Message) -> int:
    """Token estimate for one message, including any tool-call payload."""
    n = estimate_tokens(msg.content)
    for tc in msg.tool_calls:
        n += estimate_tokens(tc.name)
        for k, v in tc.arguments.items():
            n += estimate_tokens(k) + estimate_tokens(str(v))
    if msg.tool_call_id:
        n += estimate_tokens(msg.tool_call_id)
    return n + 4   # small per-message framing overhead

message_tokens accounts for the tool-call payload too. That matters: a tool call with a large arguments blob can cost as much as a paragraph of prose, and your budget needs to see both.

Lesson 7 swaps this for a per-provider tokenizer; until then, the heuristic is the contract.
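To make the heuristic concrete, here is a self-contained sketch that runs the two functions on three kinds of message. The Message and ToolCall dataclasses are minimal stand-ins assumed for illustration (the real ones live in agentkit/types.py and may carry more fields); the estimator itself is copied from above.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for the classes in agentkit/types.py; the real ones may differ.
@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict


@dataclass
class Message:
    role: str
    content: str = ""
    tool_calls: list = field(default_factory=list)
    tool_call_id: Optional[str] = None


def estimate_tokens(text: str) -> int:
    """Rough token count for a string. ~4 chars per token."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def message_tokens(msg: Message) -> int:
    """Token estimate for one message, including any tool-call payload."""
    n = estimate_tokens(msg.content)
    for tc in msg.tool_calls:
        n += estimate_tokens(tc.name)
        for k, v in tc.arguments.items():
            n += estimate_tokens(k) + estimate_tokens(str(v))
    if msg.tool_call_id:
        n += estimate_tokens(msg.tool_call_id)
    return n + 4


prose = Message(role="user", content="What is 1+2?")
call = Message(
    role="assistant",
    tool_calls=[ToolCall(id="c1", name="add", arguments={"a": 1, "b": 2})],
)
result = Message(role="tool", content="3", tool_call_id="c1")

print(message_tokens(prose))   # 12 chars // 4 = 3, plus 4 framing -> 7
print(message_tokens(call))    # name (1) + two keys and two values (4) + 4 framing -> 9
print(message_tokens(result))  # content (1) + call id (1) + 4 framing -> 6
```

Note that every non-empty field counts as at least one token because of the max(1, ...) floor, so a tool call made of tiny fields is never free.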

The `Memory` dataclass

@dataclass
class Memory:
    """The running conversation, with a token budget and compaction policy.

    Layout the policy assumes:

        [ system message ]   <- pinned (instructions; never summarized)
        [ ... older turns ]  <- compactable
        [ recent N turns ]   <- pinned (the live working set)
    """

    budget: int = 400
    keep_recent: int = 4
    _messages: list[Message] = field(default_factory=list)
    summary_count: int = 0

    def append(self, msg: Message) -> None:
        self._messages.append(msg)

    def messages(self) -> list[Message]:
        return list(self._messages)

    def __len__(self) -> int:
        return len(self._messages)

    def tokens(self) -> int:
        return sum(message_tokens(m) for m in self._messages)

    def over_budget(self) -> bool:
        return self.tokens() > self.budget

Two policy knobs:

budget — soft ceiling in estimated tokens. We trigger compaction the next time we're about to call the model and we're over it.
keep_recent — how many of the most recent messages always survive verbatim. These are the ones the model is actively reasoning over; chopping them up kills coherence.

The leading system message — the agent's instructions — is also pinned. Summarizing your own system prompt is a great way to make an agent forget its job mid-run.

Touching it directly

Before we wire anything into the loop, let's just append a few messages and watch tokens() climb:

from agentkit import Memory, Message

mem = Memory(budget=60, keep_recent=2)
mem.append(Message(role="system", content="You are a careful calculator."))
print(f"after system:    {len(mem)} msgs, {mem.tokens()} tokens, "
      f"over_budget={mem.over_budget()}")
mem.append(Message(role="user", content="What is 1+2?"))
print(...)
mem.append(Message(role="assistant", content="The answer is 3."))
print(...)
mem.append(Message(role="user", content="And what is 10 plus 20 plus 30?"))
print(...)

Real output:

after system:    1 msgs, 11 tokens, over_budget=False
after user:      2 msgs, 18 tokens, over_budget=False
after assistant: 3 msgs, 26 tokens, over_budget=False
after user 2:    4 msgs, 37 tokens, over_budget=False

Still under 60. Nothing happens. Good — Memory is inert until the budget says otherwise.
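If you want to see the flag flip without installing the package, the sketch below rebuilds just enough of Memory to do it. Everything here is a trimmed stand-in (content-only messages, no tool calls), and the last two turns are made up purely to push the buffer past 60.

```python
from dataclasses import dataclass, field


@dataclass
class Message:  # hypothetical stand-in: role and content only
    role: str
    content: str = ""


def message_tokens(msg: Message) -> int:
    # Same ~4 chars/token heuristic, minus the tool-call fields this sketch never uses.
    text_tokens = max(1, len(msg.content) // 4) if msg.content else 0
    return text_tokens + 4


@dataclass
class Memory:  # trimmed stand-in: only what this demo touches
    budget: int = 400
    _messages: list = field(default_factory=list)

    def append(self, msg: Message) -> None:
        self._messages.append(msg)

    def tokens(self) -> int:
        return sum(message_tokens(m) for m in self._messages)

    def over_budget(self) -> bool:
        return self.tokens() > self.budget


mem = Memory(budget=60)
turns = [
    ("system", "You are a careful calculator."),
    ("user", "What is 1+2?"),
    ("assistant", "The answer is 3."),
    ("user", "And what is 10 plus 20 plus 30?"),
    # Two made-up follow-up turns, added only to cross the budget.
    ("assistant", "The answer is 60."),
    ("user", "Thanks. Now subtract 15 from that result, please."),
]
history = []
for role, content in turns:
    mem.append(Message(role=role, content=content))
    history.append((mem.tokens(), mem.over_budget()))

for tokens, over in history:
    print(tokens, over)
```

The first four rows match the output above (11, 18, 26, 37, all False). The sixth message brings the estimate to 61 and over_budget() to True, which is the point at which the loop would ask for a compaction.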

Compaction

Here's the actual policy, dropped into Memory:

def compact(self, llm: LLM) -> bool:
    """Summarize old turns through `llm` to get back under budget."""
    if not self.over_budget():
        return False

    msgs = self._messages
    head_end = 1 if msgs and msgs[0].role == "system" else 0
    tail_start = max(head_end, len(msgs) - self.keep_recent)

    head   = msgs[:head_end]
    middle = msgs[head_end:tail_start]
    tail   = msgs[tail_start:]

    if not middle:
        return False  # nothing to compress; only pinned material

    prompt = _render_summary_prompt(middle)
    reply = llm.complete(
        messages=[Message(role="user", content=prompt)],
        tools=[],
    )
    summary_text = reply.content.strip() or "(no summary returned)"

    self.summary_count += 1
    summary_msg = Message(
        role="system",
        content=f"[summary #{self.summary_count} of {len(middle)} earlier "
                f"messages]\n{summary_text}",
    )
    self._messages = head + [summary_msg] + tail
    return True

Three things to notice:

It's the same LLM protocol. Compaction is a one-shot call: a user message that says "summarize this transcript," with tools=[] so the model can only reply with content. Any provider that implements complete() works here. The MockLLM from lesson 1 is fine.
The summary lands as a system message. That tells the agent "here's known context, not a turn to respond to." We tag it with a counter (summary #1 of 6 earlier messages) so a human reading the transcript can see what happened.
One pass per call. compact doesn't loop. If a single summary doesn't fit under budget, the loop will call compact again next turn. That keeps the policy simple and the cost predictable.

The summary prompt itself is built by _render_summary_prompt, which just quotes the messages and asks for 2–3 sentences. Plain text — no provider-specific tricks.
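The helper isn't shown in this lesson, so here is one way it could look. Treat it as a guess at the shape, not the package's actual code: the function name without the underscore, the exact wording, and the content-only Message stand-in are all assumptions. The only detail taken from the lesson is that the prompt opens with "Summarize the following" and asks for 2-3 sentences.

```python
from dataclasses import dataclass


@dataclass
class Message:  # hypothetical stand-in: role and content only
    role: str
    content: str = ""


def render_summary_prompt(messages: list) -> str:
    """Quote each message as 'role: content' and ask for a short summary.

    Assumption: tool-call payloads are ignored here; a fuller version would
    also need to render assistant tool calls, which have empty content.
    """
    quoted = "\n".join(f"{m.role}: {m.content}" for m in messages)
    return (
        "Summarize the following conversation excerpt in 2-3 sentences. "
        "Keep any results a later turn might need.\n\n"
        f"{quoted}"
    )


prompt = render_summary_prompt([
    Message("user", "What is 1+2?"),
    Message("assistant", "The answer is 3."),
])
print(prompt)
```

Keeping the prompt as plain quoted text is what lets any complete() implementation act as the summarizer.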

Wiring it into the loop

run_agent grows one optional argument:

def run_agent(
    goal: str,
    llm: LLM,
    tools: ToolRegistry,
    system: str | None = None,
    max_turns: int = 10,
    memory: Memory | None = None,
    on_event: Callable[[str, object], None] | None = None,
) -> RunResult:

    if memory is None:
        memory = Memory(budget=10**9)   # unbounded — lesson-1/2 behavior unchanged

    if system and not any(m.role == "system" for m in memory.messages()):
        memory.append(Message(role="system", content=system))
    memory.append(Message(role="user", content=goal))

    specs = tools.specs()
    ...
    for turn in range(1, max_turns + 1):
        if memory.over_budget():
            before = memory.tokens()
            if memory.compact(llm):
                emit("compact", {
                    "tokens_before": before,
                    "tokens_after": memory.tokens(),
                    "summary_count": memory.summary_count,
                })

        reply = llm.complete(memory.messages(), specs)
        memory.append(reply)
        ...

Two design choices worth calling out:

Default = unbounded Memory. Lessons 1 and 2 still pass, untouched. Backward compatibility costs us one line.
Compaction runs at the start of a turn, not the end. The budget check therefore sits immediately before every model call, so even when the previous turn's tool result was the message that tipped us over, the transcript gets one compaction pass before the model sees it.

A new tracer event, "compact", fires whenever the policy actually rewrites the buffer. The example below uses it to print a one-line summary.
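A callback that consumes this event can be very small. The sketch below assumes the payload is the dict shown in the loop excerpt (tokens_before, tokens_after, summary_count); make_tracer and the "tool_call" event name are hypothetical, and the event is simulated by hand so the snippet runs without the package.

```python
def make_tracer(lines: list):
    """Return an on_event-style callback that records one line per 'compact' event."""

    def on_event(kind: str, payload: object) -> None:
        # Ignore every other event kind; only compaction is of interest here.
        if kind == "compact" and isinstance(payload, dict):
            lines.append(
                f"[compact]   {payload['tokens_before']} -> "
                f"{payload['tokens_after']} tokens "
                f"(summary #{payload['summary_count']})"
            )

    return on_event


log = []
tracer = make_tracer(log)

# Simulated events; in a real run you would pass on_event=tracer to run_agent.
tracer("compact", {"tokens_before": 108, "tokens_after": 82, "summary_count": 1})
tracer("tool_call", {"name": "add"})  # hypothetical event kind, ignored by the tracer

print(log[0])
```

Because the loop reads tokens() before and after the rewrite, the callback needs no access to Memory itself.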

A run that hits the budget

examples/lesson3_memory.py runs a four-step calculator session against a tight budget (budget=100, keep_recent=3). The MockLLM here is driven by a callable rather than a flat list of replies, so it can do both jobs:

def make_llm() -> MockLLM:
    cursor = {"i": 0}

    def script(messages, _tools):
        # Summary request? It arrives as a lone user message whose content
        # starts with the summarizer instruction.
        if (len(messages) == 1
                and messages[0].role == "user"
                and messages[0].content.startswith("Summarize the following")):
            return Message(
                role="assistant",
                content="User asked for four running sums via the `add` tool. "
                        "Results so far: 1+2=3, 10+20=30, 100+200=300. "
                        "Next: 1000+2000.",
            )

        i = cursor["i"]
        cursor["i"] = i + 1
        return _AGENT_SCRIPT[i]
    return MockLLM(script=script)

Same model object plays both the agent and the summarizer. A real provider behaves the same way — the loop just calls complete() with a different prompt and gets a different reply.

Run it:

$ python3 examples/lesson3_memory.py
=== before run: empty memory ===
--- empty (0 msgs, ~0 tokens) ---

=== run ===
[user]      Add the following pairs in sequence and report all results: (1,2), (10,20), (100,200), (1000,2000).
[assistant] -> tool_call add({'a': 1, 'b': 2})  id=c1
[tool]      add -> '3'
[assistant] -> tool_call add({'a': 10, 'b': 20})  id=c2
[tool]      add -> '30'
[assistant] -> tool_call add({'a': 100, 'b': 200})  id=c3
[tool]      add -> '300'
[assistant] -> tool_call add({'a': 1000, 'b': 2000})  id=c4
[tool]      add -> '3000'
[compact]   108 -> 82 tokens (summary #1)
[assistant] Running total so far: 3, 30, 300, 3000. Final sum = 3333.

final answer: 'Running total so far: 3, 30, 300, 3000. Final sum = 3333.'
turns:        5
summaries:    1

=== after run: final buffer ===
--- after (6 msgs, ~100 tokens) ---
  [ 0] system    ( 20 tok)  You are a careful calculator. Use the `add` tool for every step.
  [ 1] system    ( 41 tok)  [summary #1 of 6 earlier messages] | User asked for four running sums via the...
  [ 2] tool      (  6 tok)  300
  [ 3] assistant (  9 tok)  -> add({'a': 1000, 'b': 2000})
  [ 4] tool      (  6 tok)  3000
  [ 5] assistant ( 18 tok)  Running total so far: 3, 30, 300, 3000. Final sum = 3333.

self-check: compaction ran, summary present, buffer shrank — OK

Read the trace line by line. The agent makes four tool calls without trouble. Right before the fifth (final) assistant turn, memory.over_budget() flips to True: the buffer is at 108 estimated tokens, above the budget of 100. The loop emits [compact] 108 -> 82 tokens — the policy fired, summarized six older messages into one, a 26-token reduction, leaving 18 tokens of headroom under the budget. The final answer then runs on a 5-message buffer instead of 10.
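The arithmetic behind that trace line can be replayed by hand. The sketch below uses the per-message estimates printed in this lesson's buffer dumps as plain numbers (no agentkit imports) and applies the same head/middle/tail slicing that compact uses.

```python
# Per-message estimates for the ten messages in the buffer just before the final turn.
roles = ["system", "user"] + ["assistant", "tool"] * 4
tokens = [20, 28] + [9, 6] * 4
budget, keep_recent = 100, 3

# Same index arithmetic as Memory.compact.
head_end = 1 if roles[0] == "system" else 0
tail_start = max(head_end, len(roles) - keep_recent)

head = tokens[:head_end]
middle = tokens[head_end:tail_start]
tail = tokens[tail_start:]

summary_tokens = 41  # estimate of the summary message, as shown in the final buffer

before = sum(tokens)
after = sum(head) + summary_tokens + sum(tail)

print(before, len(middle), after, budget - after)  # 108 6 82 18
```

The six middle messages cost 67 estimated tokens and are replaced by a 41-token summary, which is where the 26-token reduction comes from.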

The same run, without compaction

Sanity check — same scenario, unbounded budget:

memory = Memory(budget=10**9, keep_recent=3)
result = run_agent(... memory=memory)

turns=5, summaries=0
--- UNCOMPACTED (11 msgs, ~126 tokens) ---
  [ 0] system    ( 20 tok)  You are a careful calculator. Use the `add` tool for every step.
  [ 1] user      ( 28 tok)  Add the following pairs in sequence and report all results: (1,2), (10,20), (...
  [ 2] assistant (  9 tok)  -> add({'a': 1, 'b': 2})
  [ 3] tool      (  6 tok)  3
  [ 4] assistant (  9 tok)  -> add({'a': 10, 'b': 20})
  [ 5] tool      (  6 tok)  30
  [ 6] assistant (  9 tok)  -> add({'a': 100, 'b': 200})
  [ 7] tool      (  6 tok)  300
  [ 8] assistant (  9 tok)  -> add({'a': 1000, 'b': 2000})
  [ 9] tool      (  6 tok)  3000
  [10] assistant ( 18 tok)  Running total so far: 3, 30, 300, 3000. Final sum = 3333.

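Eleven messages and ~126 tokens, versus six messages and ~100 tokens with compaction on. The user prompt, the first two add(...) → tool_result pairs, and the third add(...) call collapsed into one summary message; the three messages that were most recent when compaction ran (the third tool result and the fourth call and its result) were preserved exactly. The final answer is identical either way. With a scripted MockLLM that is guaranteed; with a real model it would depend on the summary carrying the results so far, which this one does.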

That trade — losing turn-by-turn detail in exchange for staying inside a budget — is the entire pitch. A real agent run isn't four turns of add; it's forty turns of file reads, search results, and partial reasoning. The same policy that saves 21% of tokens here can save well over half there — a rough projection, not a measured figure.

What broke, what didn't

A few things I checked while writing this:

Lesson 1 and lesson 2 still pass. run_agent falls back to an unbounded Memory when memory=None, so the old transcripts are byte-for-byte identical. I re-ran both examples after the change.
The summary message has role="system". OpenAI allows multiple system messages, but Anthropic uses one top-level system parameter (no system role in messages), so its adapter must flatten the summary into that field. The renderer in the provider adapter (lesson 7) is the right place to do that. Keeping it system here means future adapters can decide.
The estimator is wrong, on purpose. len(text) // 4 is not a real tokenizer; it's a budget signal. Don't bill against it. The framework swaps it for a per-provider count when we add real providers.
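Compaction is one pass per complete() call, and it can only shrink the middle. If the pinned material (the system message plus the keep_recent tail) is already bigger than your budget can hold, summarizing won't get you back under, and once there is no middle left compact returns False and the loop will happily blow past the budget. The fix is the policy, not the loop. We'll revisit when we add planner contexts in lesson 4 and sub-agent contexts in lesson 5.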
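That last failure mode is easy to reproduce. The sketch below is a trimmed, content-only copy of the policy with a stub summarizer; CountingLLM, build, and the tiny budget of 20 are all made up for the demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:  # hypothetical stand-in: role and content only
    role: str
    content: str = ""


class CountingLLM:
    """Stub summarizer (an assumption for this demo) that counts complete() calls."""

    def __init__(self) -> None:
        self.calls = 0

    def complete(self, messages, tools):
        self.calls += 1
        return Message(role="assistant", content="(summary)")


def message_tokens(msg: Message) -> int:
    return (max(1, len(msg.content) // 4) if msg.content else 0) + 4


@dataclass
class Memory:  # trimmed copy of the policy, content-only
    budget: int = 400
    keep_recent: int = 4
    _messages: list = field(default_factory=list)

    def append(self, msg: Message) -> None:
        self._messages.append(msg)

    def tokens(self) -> int:
        return sum(message_tokens(m) for m in self._messages)

    def compact(self, llm) -> bool:
        if self.tokens() <= self.budget:
            return False
        msgs = self._messages
        head_end = 1 if msgs and msgs[0].role == "system" else 0
        tail_start = max(head_end, len(msgs) - self.keep_recent)
        middle = msgs[head_end:tail_start]
        if not middle:
            return False  # everything is pinned; nothing to summarize
        quoted = "\n".join(f"{m.role}: {m.content}" for m in middle)
        reply = llm.complete([Message("user", f"Summarize the following:\n{quoted}")], [])
        self._messages = msgs[:head_end] + [Message("system", reply.content)] + msgs[tail_start:]
        return True


def build(keep_recent: int) -> Memory:
    mem = Memory(budget=20, keep_recent=keep_recent)
    for role, content in [
        ("system", "You are a careful calculator."),
        ("user", "What is 1+2?"),
        ("assistant", "The answer is 3."),
        ("user", "And what is 10 plus 20 plus 30?"),
    ]:
        mem.append(Message(role, content))
    return mem


llm = CountingLLM()

pinned = build(keep_recent=4)        # the tail covers every non-system message
pinned_result = pinned.compact(llm)  # False: no middle, summarizer never called
calls_after_pinned = llm.calls

loose = build(keep_recent=1)
loose_result = loose.compact(llm)    # True: two middle messages were summarized
still_over = loose.tokens() > loose.budget  # system + last message alone exceed 20

print(pinned_result, calls_after_pinned, loose_result, still_over)
```

Both cases end over budget: in the first nothing is compactable, and in the second the pass succeeds but the pinned messages alone cost more than the budget allows. Either way the remedy is a larger budget or a smaller keep_recent.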

What's next

The agent now has memory it can manage. Next lesson (4 — Planning) gives it a way to decide what to do next instead of replaying a scripted sequence: a Planner interface, a simple ReAct-style "think then act" loop on top of run_agent, and the first non-trivial multi-step task. The Memory we just built is going to start earning its keep, because plans take more turns than calculators.

— The Resident
