Planning & Decomposition
Lesson 4 of 9 — Building Agentic Systems
The loop is great at "do one thing." Real goals span many things. This lesson teaches the agent to write itself a checklist — break a high-level goal into ordered sub-tasks, work them one at a time, and carry results forward without dragging the whole transcript along.
Where we are
Three lessons in, we have a working agent skeleton:
- Lesson 1: the loop. run_agent(goal, llm, tools) keeps asking the model until it stops calling tools.
- Lesson 2: typed @tool functions with auto-generated schemas and argument validation.
- Lesson 3: Memory with a token budget and a compact(llm) step that summarizes old turns through the same LLM protocol.
What we don't have is any sense of structure across a long task. Ask the loop a multi-step question and it will, in the best case, do the whole thing in one transcript. In the worst case, it forgets step 1 by the time it gets to step 3, or burns context re-thinking decisions it already made. We need an explicit plan: a small data structure the agent reads from and writes to.
What "planning" means here
We're not building a tree search or a PDDL solver. The smallest useful planner is two things:
- One LLM call that turns a goal into an ordered list of sub-tasks.
- A driver that runs the existing agent loop once per sub-task, threading prior results forward via a shared scratchpad.
That's it. The data model is two dataclasses. The "intelligence" is a single prompt that demands JSON output.
Why per-task loops, not one big loop?
The natural reflex is to make one giant run_agent call where the model handles every step itself. Two problems:
- The transcript grows linearly with steps. Lesson 3's compaction helps, but a multi-step goal can blow past the budget before compaction has anything safe to throw away.
- The model has to keep re-deciding what's "next." Each turn it re-reads the whole history to figure out where it is. That's expensive and error-prone.
A plan turns implicit state ("what was I doing again?") into explicit state ("task 2 of 3, in progress, prior result 42"). Each sub-task gets a small, focused transcript. Inter-step memory lives on the Plan, not in the conversation buffer.
The data model
A Task is a unit of work with a status; a Plan is an ordered list of them, anchored to a goal. Both are plain dataclasses — same discipline as Message in lesson 1, zero runtime dependencies.
# agentkit/planner.py (excerpt)
from dataclasses import dataclass, field
from typing import Literal

TaskStatus = Literal["pending", "in_progress", "done", "failed"]


@dataclass
class Task:
    id: int                      # 1-based, prints nicely
    description: str
    status: TaskStatus = "pending"
    result: str | None = None    # the executor's final answer


@dataclass
class Plan:
    goal: str
    tasks: list[Task] = field(default_factory=list)

    def next_pending(self) -> Task | None:
        # First task that has not been started yet, in plan order.
        for t in self.tasks:
            if t.status == "pending":
                return t
        return None

    def is_complete(self) -> bool:
        return all(t.status == "done" for t in self.tasks)

    def mark(self, task_id: int, status: TaskStatus, result: str | None = None) -> None:
        # Update one task in place; an unknown id is a programming error.
        for t in self.tasks:
            if t.id == task_id:
                t.status = status
                if result is not None:
                    t.result = result
                return
        raise KeyError(f"no task with id={task_id}")
Plan.render() is the one bit of presentation logic. It produces a checklist that doubles as the scratchpad we feed the executor:
    def render(self) -> str:
        marks = {"pending": "[ ]", "in_progress": "[~]", "done": "[x]", "failed": "[!]"}
        lines = [f"Goal: {self.goal}", "Plan:"]
        for t in self.tasks:
            line = f" {marks[t.status]} {t.id}. {t.description}"
            if t.result is not None:
                line += f" -> {t.result}"
            lines.append(line)
        return "\n".join(lines)
The same string is shown to humans (in the trace) and to the model (in the per-task sub-goal). One source of truth is easier to reason about and easier to debug.
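To see the scratchpad evolve, here is a minimal, self-contained sketch. It re-declares trimmed copies of Task and Plan so it runs on its own; the goal and task descriptions are made up for illustration, and the module-level MARKS name is an assumption of this sketch rather than something in agentkit/planner.py.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: a module-level copy of the status markers used by render().
MARKS = {"pending": "[ ]", "in_progress": "[~]", "done": "[x]", "failed": "[!]"}


@dataclass
class Task:
    id: int
    description: str
    status: str = "pending"
    result: Optional[str] = None


@dataclass
class Plan:
    goal: str
    tasks: list = field(default_factory=list)

    def mark(self, task_id, status, result=None):
        for t in self.tasks:
            if t.id == task_id:
                t.status = status
                if result is not None:
                    t.result = result
                return
        raise KeyError(f"no task with id={task_id}")

    def render(self) -> str:
        lines = [f"Goal: {self.goal}", "Plan:"]
        for t in self.tasks:
            line = f" {MARKS[t.status]} {t.id}. {t.description}"
            if t.result is not None:
                line += f" -> {t.result}"
            lines.append(line)
        return "\n".join(lines)


# Illustrative data only.
plan = Plan(goal="Double the area",
            tasks=[Task(1, "Compute the area."), Task(2, "Double it.")])
plan.mark(1, "done", result="The area is 42.")
plan.mark(2, "in_progress")
scratchpad = plan.render()
print(scratchpad)
```

The printed checklist is exactly the text that would be handed to the executor for task 2, with task 1's answer already pinned next to its description.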
The planning step
make_plan(goal, llm) is one LLM call. The prompt asks for a JSON array of strings; we parse it and wrap each string in a Task. The marker line [planning request] lets a scripted MockLLM cleanly branch on "this is a planning call" vs. "this is an executor turn." Real providers don't need the marker — the instruction body is enough — but it costs nothing and makes tests deterministic.
PLANNER_MARKER = "[planning request]"

PLANNER_INSTRUCTIONS = (
    f"{PLANNER_MARKER}\n"
    "You are a planning agent. Decompose the goal below into 2-6 ordered "
    "sub-tasks the executor will perform in sequence. Each sub-task must be "
    "small, concrete, and runnable on its own with the tools provided.\n\n"
    "Respond with a JSON array of strings - one entry per sub-task - and "
    "nothing else."
)


def make_plan(goal: str, llm: LLM) -> Plan:
    reply = llm.complete(
        messages=[Message(role="user", content=f"{PLANNER_INSTRUCTIONS}\n\nGoal: {goal}")],
        tools=[],
    )
    descriptions = _parse_plan_json(reply.content)
    return Plan(goal=goal, tasks=[Task(id=i + 1, description=d)
                                  for i, d in enumerate(descriptions)])
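The marker-based branching mentioned before the listing can be sketched as follows. This is not the course's MockLLM: MarkerMockLLM, its constructor arguments, and the trimmed Message stand-in are all hypothetical, written only to show how a test double might route on PLANNER_MARKER.

```python
from collections import deque
from dataclasses import dataclass

PLANNER_MARKER = "[planning request]"


@dataclass
class Message:
    # Trimmed stand-in for the course's Message type (assumption).
    role: str
    content: str = ""


class MarkerMockLLM:
    """Hypothetical scripted LLM: one canned plan, then queued executor replies."""

    def __init__(self, plan_reply, executor_replies):
        self.plan_reply = plan_reply
        self.executor_replies = deque(executor_replies)
        self.planning_calls = 0

    def complete(self, messages, tools):
        # Route on the marker in the latest message, not on call order.
        if PLANNER_MARKER in messages[-1].content:
            self.planning_calls += 1
            return Message(role="assistant", content=self.plan_reply)
        return Message(role="assistant", content=self.executor_replies.popleft())


llm = MarkerMockLLM('["a", "b"]', ["first", "second"])
plan_msg = llm.complete([Message("user", f"{PLANNER_MARKER}\nGoal: x")], tools=[])
exec_msg = llm.complete([Message("user", "Focus on task 1 only: a")], tools=[])
print(plan_msg.content, exec_msg.content, llm.planning_calls)
```

Routing on content rather than call order means a test can also assert how many planning calls were made, independent of how many executor turns ran in between.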
_parse_plan_json is tolerant in a small, controlled way: if the model wrapped its JSON in prose, we fall back to the first [ … ] span. Anything else raises — better a loud failure than a malformed plan executed silently.
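The lesson does not list _parse_plan_json, so the following is only a sketch of the behavior described above, under stated assumptions: it tries a strict parse first, falls back to decoding from the first "[" in the reply, and raises ValueError for everything else. The function name parse_plan_json (no leading underscore) and the extra validation rules (non-empty list, non-empty strings) are choices of this sketch, not confirmed details of agentkit.

```python
import json


def parse_plan_json(text: str) -> list:
    """Sketch: return sub-task descriptions from a planner reply, or raise ValueError."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: the model wrapped its JSON in prose. Decode one JSON
        # value starting at the first "[" and ignore whatever follows it.
        start = text.find("[")
        if start == -1:
            raise ValueError("planner reply contains no JSON array")
        # raw_decode raises JSONDecodeError (a ValueError) on bad input.
        data, _end = json.JSONDecoder().raw_decode(text, start)

    if not isinstance(data, list) or not data:
        raise ValueError("planner reply must be a non-empty JSON array")
    if not all(isinstance(d, str) and d.strip() for d in data):
        raise ValueError("every sub-task must be a non-empty string")
    return [d.strip() for d in data]


print(parse_plan_json('Here is the plan:\n["a", "b"]\nGood luck.'))
```

Because json.JSONDecodeError subclasses ValueError, callers only need to catch one exception type to handle every kind of malformed plan.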
The executor
run_plan is a while plan.next_pending() loop. Each task gets its own fresh run_agent call (and therefore its own fresh Memory). The shared state between tasks is the Plan itself, woven into a per-task sub-goal:
def _subgoal_prompt(plan: Plan, task: Task) -> str:
    return (
        f"{plan.render()}\n\n"
        f"Focus on task {task.id} only: {task.description}\n"
        f"Use the tools provided. When you have the answer, reply with it."
    )


def run_plan(plan, llm, tools, system=None, max_turns_per_task=6, on_event=None) -> PlanRun:
    steps: list[RunResult] = []
    emit = (lambda k, p: None) if on_event is None else on_event
    emit("plan", plan)
    while (task := plan.next_pending()) is not None:
        task.status = "in_progress"
        emit("task_start", task)
        # Fresh run_agent call per task; the rendered plan carries prior results.
        result = run_agent(
            goal=_subgoal_prompt(plan, task),
            llm=llm, tools=tools, system=system,
            max_turns=max_turns_per_task,
            on_event=on_event,
        )
        plan.mark(task.id, "done", result=result.answer)
        steps.append(result)
        emit("task_done", task)
    return PlanRun(plan=plan, steps=steps)
Three new event kinds — plan, task_start, task_done — flow through the same on_event channel from lesson 1. Nothing else changes.
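As an illustration of consuming those events, here is a small tracer sketch. The payload shapes are inferred from the run_plan listing (a Plan for plan, a Task for the two task events); make_tracer, the trimmed Task stand-in, and the exact line formats are assumptions modeled on the trace output, not the course's actual trace function.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Task:
    # Trimmed stand-in for the planner's Task (assumption).
    id: int
    description: str
    status: str = "pending"
    result: Optional[str] = None


def make_tracer(lines: list):
    """Return an on_event callback that appends one formatted line per event."""
    def trace(kind: str, payload: Any) -> None:
        if kind == "task_start":
            lines.append(f"[task {payload.id} start] {payload.description}")
        elif kind == "task_done":
            lines.append(f"[task {payload.id} done] -> {payload.result!r}")
        elif kind == "plan":
            # Use render() when the payload offers it, else fall back to str().
            render = getattr(payload, "render", None)
            lines.append("[plan]\n" + (render() if callable(render) else str(payload)))
        else:
            # Any other event kind passes through untouched.
            lines.append(f"[{kind}] {payload}")
    return trace


log = []
trace = make_tracer(log)
task = Task(1, "Compute the area.")
trace("task_start", task)
task.status, task.result = "done", "The area is 42."
trace("task_done", task)
print("\n".join(log))
```

Collecting lines into a list instead of printing directly keeps the tracer easy to assert against in tests.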
The example
A three-step arithmetic goal: compute a triangle's area, double it, express the result as a percentage. The scripted MockLLM has seven replies — one planner JSON, then a (tool call, final answer) pair per task. Notice that task 2's calc("42 * 2") is only possible because the scratchpad pinned task 1's result into the sub-goal. The executor doesn't have to remember; the plan remembers for it.
# examples/lesson4_planning.py (excerpt)
PLAN_JSON = (
    '["Compute the area of a triangle with base 14 and height 6.", '
    '"Double the area produced by step 1.", '
    '"Express the doubled value as a percentage of 100."]'
)

llm = MockLLM(script=[
    Message(role="assistant", content=PLAN_JSON),  # planner
    Message(role="assistant", tool_calls=[ToolCall(id="t1c1", name="calc",
            arguments={"expression": "14 * 6 / 2"})]),  # task 1
    Message(role="assistant", content="The area is 42."),
    Message(role="assistant", tool_calls=[ToolCall(id="t2c1", name="calc",
            arguments={"expression": "42 * 2"})]),  # task 2
    Message(role="assistant", content="Doubled it is 84."),
    Message(role="assistant", tool_calls=[ToolCall(id="t3c1", name="calc",
            arguments={"expression": "84 / 100 * 100"})]),  # task 3
    Message(role="assistant", content="84 is 84% of 100."),
])

goal = ("Compute the area of a triangle with base 14 and height 6, "
        "double the area, then express the doubled value as a percentage of 100.")

plan = make_plan(goal, llm)
run = run_plan(plan=plan, llm=llm, tools=tools, system="...", on_event=trace)
Running it
$ python3 examples/lesson4_planning.py
=== phase 1: planning ===
goal: Compute the area of a triangle with base 14 and height 6, double the area, then express the doubled value as a percentage of 100.
planner returned:
Goal: Compute the area of a triangle with base 14 and height 6, double the area, then express the doubled value as a percentage of 100.
Plan:
[ ] 1. Compute the area of a triangle with base 14 and height 6.
[ ] 2. Double the area produced by step 1.
[ ] 3. Express the doubled value as a percentage of 100.
=== phase 2: execution ===
[plan]
Goal: ...
Plan:
[ ] 1. Compute the area of a triangle with base 14 and height 6.
[ ] 2. Double the area produced by step 1.
[ ] 3. Express the doubled value as a percentage of 100.
[task 1 start] Compute the area of a triangle with base 14 and height 6.
[user] Goal: ...
Plan:
[~] 1. Compute the area of a triangle with base 14 and height 6.
[ ] 2. Double the area produced by step 1.
[ ] 3. Express the doubled value as a percentage of 100.
Focus on task 1 only: Compute the area of a triangle with base 14 and height 6.
Use the tools provided. When you have the answer, reply with it.
[assistant] -> tool_call calc({'expression': '14 * 6 / 2'}) id=t1c1
[tool] calc -> '42.0'
[assistant] The area is 42.
[task 1 done] -> 'The area is 42.'
[task 2 start] Double the area produced by step 1.
[user] Goal: ...
Plan:
[x] 1. Compute the area of a triangle with base 14 and height 6. -> The area is 42.
[~] 2. Double the area produced by step 1.
[ ] 3. Express the doubled value as a percentage of 100.
Focus on task 2 only: Double the area produced by step 1.
Use the tools provided. When you have the answer, reply with it.
[assistant] -> tool_call calc({'expression': '42 * 2'}) id=t2c1
[tool] calc -> '84'
[assistant] Doubled it is 84.
[task 2 done] -> 'Doubled it is 84.'
[task 3 start] Express the doubled value as a percentage of 100.
[user] Goal: ...
Plan:
[x] 1. Compute the area of a triangle with base 14 and height 6. -> The area is 42.
[x] 2. Double the area produced by step 1. -> Doubled it is 84.
[~] 3. Express the doubled value as a percentage of 100.
Focus on task 3 only: Express the doubled value as a percentage of 100.
Use the tools provided. When you have the answer, reply with it.
[assistant] -> tool_call calc({'expression': '84 / 100 * 100'}) id=t3c1
[tool] calc -> '84.0'
[assistant] 84 is 84% of 100.
[task 3 done] -> '84 is 84% of 100.'
=== final plan state ===
Goal: ...
Plan:
[x] 1. Compute the area of a triangle with base 14 and height 6. -> The area is 42.
[x] 2. Double the area produced by step 1. -> Doubled it is 84.
[x] 3. Express the doubled value as a percentage of 100. -> 84 is 84% of 100.
final answer: '84 is 84% of 100.'
tasks done: 3/3
total turns: 6 across 3 task runs
self-check: plan complete, results threaded through scratchpad, exactly one planning call - OK
(I've elided the repeated goal text as Goal: ... in the dumps above; the live trace prints the full goal each time. The calc tool also returned 42.0 and 84.0, the floats that Python's / division produces, and the replies simply summarize them as 42 and 84.)
Watch the [~] and [x] markers travel down the page. By the time task 3 starts, the executor's sub-goal already contains:
[x] 1. Compute the area of a triangle ... -> The area is 42.
[x] 2. Double the area produced by step 1. -> Doubled it is 84.
[~] 3. Express the doubled value as a percentage of 100.
That's the whole inter-step memory mechanism. No prior_results=[...] argument, no manual tool-call threading — the scratchpad is the API.
What we built
- A Task / Plan / PlanRun data model in agentkit/planner.py.
- make_plan(goal, llm): one LLM call returning a JSON list of sub-tasks.
- run_plan(plan, llm, tools, ...): drives run_agent once per task, marking each done with the executor's final answer and re-rendering the scratchpad into the next sub-goal.
- Three new tracer events (plan, task_start, task_done) on the existing on_event channel.
What it costs us
- An extra LLM call up front. For a one-shot question this is pure overhead. The planner pays for itself only when the goal genuinely has steps.
- Brittle JSON. Real providers will occasionally produce malformed output. Our _parse_plan_json salvages prose-wrapped JSON but won't fix every failure. Lesson 7 (real providers) and lesson 8 (guardrails) tighten this.
- No replanning. A task that fails is marked failed and the loop stops cold; there's no "rethink the plan based on what we learned." That's deliberate: replanning is its own design problem and belongs in lesson 5, where we add reflection and recovery.
Try it yourself
- Add a fourth sub-task to PLAN_JSON (say, "Subtract 4 from the percentage") and append the matching scripted replies. The driver loop needs no changes; it just runs more iterations.
- Make the planner return a single task. Confirm run_plan collapses to one run_agent call.
- Force a failed status mid-run by raising in calc. Notice that the next next_pending() still finds task 3, because failed ≠ pending. Decide whether that's the policy you want, and what to change if not.
- Read the final plan's render() output as a report: it's a literal audit trail of what the agent did and why. That's not a side benefit; it's the point.
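For the failed-status exercise, one possible starting point is a driver with an explicit failure policy. Everything here is a sketch under assumptions: run_with_policy, the stop_on_failure flag, and the execute callable (a stand-in for the per-task run_agent call) are hypothetical names, and the trimmed Task and Plan exist only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    id: int
    description: str
    status: str = "pending"
    result: Optional[str] = None


@dataclass
class Plan:
    goal: str
    tasks: list = field(default_factory=list)

    def next_pending(self) -> Optional[Task]:
        return next((t for t in self.tasks if t.status == "pending"), None)


def run_with_policy(plan: Plan, execute: Callable[[Task], str],
                    stop_on_failure: bool = True) -> Plan:
    """Hypothetical driver: `execute` stands in for one run_agent call."""
    while (task := plan.next_pending()) is not None:
        task.status = "in_progress"
        try:
            task.result = execute(task)
            task.status = "done"
        except Exception as exc:
            task.status = "failed"
            task.result = f"error: {exc}"
            if stop_on_failure:
                break  # later tasks stay pending
    return plan


def flaky(task: Task) -> str:
    # Simulate a tool that blows up on the second task.
    if task.id == 2:
        raise RuntimeError("calc blew up")
    return f"ok {task.id}"


def fresh_plan() -> Plan:
    return Plan("demo", [Task(1, "a"), Task(2, "b"), Task(3, "c")])


stopped = run_with_policy(fresh_plan(), flaky, stop_on_failure=True)
continued = run_with_policy(fresh_plan(), flaky, stop_on_failure=False)
print([t.status for t in stopped.tasks])    # ['done', 'failed', 'pending']
print([t.status for t in continued.tasks])  # ['done', 'failed', 'done']
```

With the flag off, the loop skips past the failed task because next_pending() only looks for "pending"; with it on, task 3 is never attempted. Which behavior is right depends on whether later tasks can make sense without the failed task's result.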
Next lesson: Reflection & Recovery. The plan succeeds in this lesson because the script is perfect. Real runs aren't. We'll add a step that inspects the result of each task, decides whether it actually satisfied the description, and — when it didn't — either retries or amends the plan.
— The Resident