I Added a 71-Line Black Box to My Python Agent, Then Queried the $200 Crash With DuckDB
The incident started with a boring support automation task.
Take a user request, search a private document index, summarize the answer, and hand the result to a reviewer. Nothing heroic. The kind of Python agent you build when the demo is over and the real workflow begins.
Then one run got stuck in a retry loop.
It did not burn $200 before I caught it. The actual test run was cheaper. The problem was the projection: same bad loop, same document search, same model calls, left inside the overnight batch. The estimate landed close to $200 for one avoidable failure.
The answer it produced looked polished enough to pass a sleepy review. The trace behind it was not polished at all. The agent had called the right tool with the wrong input, retried against stale context, summarized old results, and kept paying for each turn.
That is when I stopped treating the agent like a chat feature.
I started treating it like a system that needs a black box.
Not a dashboard. Not a full observability stack. Not another hosted service.
Just one local file that can answer:
- What did the agent try?
- Which tool did it call?
- What input did the tool receive?
- Did the tool fail?
- How long did it take?
- Did the run cross a cost or turn limit?
- Can I query the run after everything is over?
We will build that black box in plain Python, then use DuckDB to inspect it like a tiny crash database.
Before And After
Before the fix, debugging looked like this:
The final answer is wrong.
The model probably hallucinated.
Maybe the search tool returned bad data.
Maybe the retry loop reused an old message.
Maybe the cost spike came from the model call.
That is not debugging. That is guessing with syntax highlighting.
After the fix, debugging looked like this:
Turn 1 called search_docs with the wrong query.
The tool timed out after 147.82 ms.
The retry used stale context.
The guard stopped the run at $0.0124.
DuckDB shows one tool_error and one guard_stop.
Same bug. Very different day.
The Shape Of The Problem
A normal Python script usually fails in one place.
An agent fails across a chain.
User Request -> Model Decision -> Tool Call -> Tool Result -> Next Turn -> Final Answer
If you only log the final answer, you have a diary entry.
If you record the chain, you have evidence.
The simplest useful format is JSONL. One event per line.
{"type":"tool_start","tool":"search_docs","input":{"query":"rate limits"}}
{"type":"tool_end","tool":"search_docs","duration_ms":83.4,"ok":true}
{"type":"turn_end","turn":2,"total_cost_usd":0.0041}
JSONL is boring in exactly the right way. It appends cleanly, survives crashes better than one large JSON document, and can be searched with normal tools.
A Small Recorder That Does Real Work
Here is the recorder.
It does four things:
- gives every run a unique id
- writes append-only JSONL events
- measures tool duration
- sanitizes obvious secrets before writing anything to disk
from __future__ import annotations
import json
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4
SECRET_KEYS = re.compile(
r"(api[_-]?key|token|password|secret|authorization|cookie)",
re.IGNORECASE,
)
@dataclass
class Event:
run_id: str
event_id: str
type: str
timestamp: float
data: dict[str, Any] = field(default_factory=dict)
def sanitize(value: Any) -> Any:
if isinstance(value, dict):
cleaned = {}
for key, item in value.items():
if SECRET_KEYS.search(str(key)):
cleaned[key] = "[redacted]"
else:
cleaned[key] = sanitize(item)
return cleaned
if isinstance(value, list):
return [sanitize(item) for item in value]
return value
class AgentBlackBox:
def __init__(self, path: str | Path, run_id: str | None = None) -> None:
self.path = Path(path)
self.run_id = run_id or uuid4().hex
self.path.parent.mkdir(parents=True, exist_ok=True)
def record(self, event_type: str, **data: Any) -> None:
event = Event(
run_id=self.run_id,
event_id=uuid4().hex,
type=event_type,
timestamp=time.time(),
data=sanitize(data),
)
with self.path.open("a", encoding="utf-8") as file:
file.write(json.dumps(asdict(event), default=str) + "\n")
@contextmanager
def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
started = time.perf_counter()
self.record("tool_start", tool=name, input=tool_input)
try:
yield
except Exception as exc:
self.record(
"tool_error",
tool=name,
error_type=type(exc).__name__,
error=str(exc),
traceback=traceback.format_exc(limit=6),
duration_ms=round((time.perf_counter() - started) * 1000, 2),
)
raise
else:
self.record(
"tool_end",
tool=name,
ok=True,
duration_ms=round((time.perf_counter() - started) * 1000, 2),
)
The sanitize() function is not perfect. It is a seatbelt, not a vault.
Still, it prevents the most embarrassing version of this pattern: building a helpful debug trace that quietly stores API keys.
Wrap One Tool First
Start with one tool. Do not instrument everything on day one.
import random
import time
def search_docs(query: str, api_key: str) -> list[str]:
time.sleep(random.uniform(0.05, 0.2))
if "timeout" in query:
raise TimeoutError("Document search timed out")
return [
"JSONL works well for append-only traces.",
"Context managers are useful around tool calls.",
"DuckDB can query JSON files without a server.",
]
Now record the call:
box = AgentBlackBox("traces/run.jsonl")
query = "python agent trace format"
with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
docs = search_docs(query=query, api_key="sk-not-a-real-key")
box.record("tool_result", tool="search_docs", result_count=len(docs))
Open traces/run.jsonl and the key is redacted.
{"tool":"search_docs","input":{"query":"python agent trace format","api_key":"[redacted]"}}
That tiny detail matters. Debugging should not create a second incident.
Add A Cheap Run Guard
Most runaway agent stories start with a loop that looked harmless.
So the black box should not only record what happened. It should record when it refused to continue.
class RunStopped(RuntimeError):
pass
def stop_if_needed(
box: AgentBlackBox,
*,
turn: int,
max_turns: int,
spent_usd: float,
max_usd: float,
) -> None:
box.record(
"guard_check",
turn=turn,
max_turns=max_turns,
spent_usd=round(spent_usd, 6),
max_usd=round(max_usd, 6),
)
if turn > max_turns:
box.record("guard_stop", reason="max_turns", turn=turn)
raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")
if spent_usd > max_usd:
box.record("guard_stop", reason="budget", spent_usd=spent_usd)
raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")
This is not exact billing. Use your provider response for real token counts when you have them.
The goal here is a local tripwire. You want the run to leave a clear reason when it stops.
A Tiny Agent Loop
This fake loop keeps the moving parts small.
Replace the pretend model section with your real model call.
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
return input_tokens * 0.0000005 + output_tokens * 0.0000015
def run_agent(question: str) -> str:
box = AgentBlackBox("traces/run.jsonl")
messages = [{"role": "user", "content": question}]
spent_usd = 0.0
max_turns = 3
max_usd = 0.01
box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)
for turn in range(1, max_turns + 1):
stop_if_needed(
box,
turn=turn,
max_turns=max_turns,
spent_usd=spent_usd,
max_usd=max_usd,
)
box.record("turn_start", turn=turn, message_count=len(messages))
# Pretend the model picked this tool input.
query = question if turn == 1 else "python jsonl duckdb traces"
with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
docs = search_docs(query=query, api_key="sk-not-a-real-key")
messages.append({"role": "tool", "content": "\n".join(docs)})
turn_cost = estimate_cost(
input_tokens=sum(len(message["content"].split()) for message in messages),
output_tokens=120,
)
spent_usd += turn_cost
box.record(
"turn_end",
turn=turn,
message_count=len(messages),
turn_cost_usd=round(turn_cost, 6),
total_cost_usd=round(spent_usd, 6),
)
answer = "Record every tool call as JSONL, then query failures after the run."
box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
return answer
Run it once with a normal question.
print(run_agent("How should I debug Python agent tools?"))
Then run it with a bad one.
print(run_agent("timeout during document search"))
The second run should fail, but now it fails with a trail.
To force a budget stop for testing, temporarily set max_usd = 0.0001. The next guard check will write a guard_stop event instead of letting the loop continue quietly.
Query The Crash With DuckDB
This is the part that makes JSONL feel less like logging and more like a debugging tool.
Install DuckDB:
pip install duckdb
Then query the trace:
import duckdb
def 
