I Added a 71-Line Black Box to My Python Agent, Then Queried the $200 Crash With DuckDB

The incident started with a boring support automation task.

Take a user request, search a private document index, summarize the answer, and hand the result to a reviewer. Nothing heroic. The kind of Python agent you build when the demo is over and the real workflow begins.

Then one run got stuck in a retry loop.

It did not burn $200 before I caught it. The actual test run was cheaper. The problem was the projection: the same bad loop, with the same document search and the same model calls, left running inside the overnight batch would have cost close to $200 for one avoidable failure.

The answer it produced looked polished enough to pass a sleepy review. The trace behind it was not polished at all. The agent had called the right tool with the wrong input, retried against stale context, summarized old results, and kept paying for each turn.

That is when I stopped treating the agent like a chat feature.

I started treating it like a system that needs a black box.

Not a dashboard. Not a full observability stack. Not another hosted service.

Just one local file that can answer:

  • What did the agent try?
  • Which tool did it call?
  • What input did the tool receive?
  • Did the tool fail?
  • How long did it take?
  • Did the run cross a cost or turn limit?
  • Can I query the run after everything is over?

We will build that black box in plain Python, then use DuckDB to inspect it like a tiny crash database.

Before And After

Before the fix, debugging looked like this:

The final answer is wrong.
The model probably hallucinated.
Maybe the search tool returned bad data.
Maybe the retry loop reused an old message.
Maybe the cost spike came from the model call.

That is not debugging. That is guessing with syntax highlighting.

After the fix, debugging looked like this:

Turn 1 called search_docs with the wrong query.
The tool timed out after 147.82 ms.
The retry used stale context.
The guard stopped the run at $0.0124.
DuckDB shows one tool_error and one guard_stop.

Same bug. Very different day.

The Shape Of The Problem

A normal Python script usually fails in one place.

An agent fails across a chain.

User Request -> Model Decision -> Tool Call -> Tool Result -> Next Turn -> Final Answer

[Figure: agent run flow diagram]

If you only log the final answer, you have a diary entry.

If you record the chain, you have evidence.

The simplest useful format is JSONL. One event per line.

{"type":"tool_start","tool":"search_docs","input":{"query":"rate limits"}}
{"type":"tool_end","tool":"search_docs","duration_ms":83.4,"ok":true}
{"type":"turn_end","turn":2,"total_cost_usd":0.0041}

JSONL is boring in exactly the right way. It appends cleanly, survives crashes better than one large JSON document, and can be searched with normal tools.
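
The crash-survival claim can be made concrete with a short reader. This is a minimal sketch using only the standard library, and the names `read_events` and `count_event_types` are made up for illustration. It assumes each event is a JSON object with a `type` field, and it skips a partial last line of the kind a crash mid-write can leave behind.

```python
import json
from collections import Counter
from pathlib import Path


def read_events(path):
    """Yield one dict per valid JSONL line, skipping blank or damaged lines."""
    with Path(path).open(encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                # A crash mid-write can leave a truncated line; ignore it.
                continue
            if isinstance(event, dict):
                yield event


def count_event_types(path):
    """Count events per "type" value across the whole trace file."""
    return Counter(event.get("type", "unknown") for event in read_events(path))
```

Because every line stands alone, one damaged line costs one event, not the whole file. A single large JSON document would fail to parse entirely in the same situation.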

[Figure: JSONL trace from a failed run]

A Small Recorder That Does Real Work

Here is the recorder.

It does four things:

  • gives every run a unique id
  • writes append-only JSONL events
  • measures tool duration
  • sanitizes obvious secrets before writing anything to disk

from __future__ import annotations

import json
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4


SECRET_KEYS = re.compile(
    r"(api[_-]?key|token|password|secret|authorization|cookie)",
    re.IGNORECASE,
)


@dataclass
class Event:
    run_id: str
    event_id: str
    type: str
    timestamp: float
    data: dict[str, Any] = field(default_factory=dict)


def sanitize(value: Any) -> Any:
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            if SECRET_KEYS.search(str(key)):
                cleaned[key] = "[redacted]"
            else:
                cleaned[key] = sanitize(item)
        return cleaned

    if isinstance(value, (list, tuple)):
        # Tuples are walked too, so secrets nested inside them are not skipped.
        return [sanitize(item) for item in value]

    return value


class AgentBlackBox:
    def __init__(self, path: str | Path, run_id: str | None = None) -> None:
        self.path = Path(path)
        self.run_id = run_id or uuid4().hex
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, **data: Any) -> None:
        event = Event(
            run_id=self.run_id,
            event_id=uuid4().hex,
            type=event_type,
            timestamp=time.time(),
            data=sanitize(data),
        )

        with self.path.open("a", encoding="utf-8") as file:
            file.write(json.dumps(asdict(event), default=str) + "\n")

    @contextmanager
    def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
        started = time.perf_counter()
        self.record("tool_start", tool=name, input=tool_input)

        try:
            yield
        except Exception as exc:
            self.record(
                "tool_error",
                tool=name,
                error_type=type(exc).__name__,
                error=str(exc),
                traceback=traceback.format_exc(limit=6),
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
            raise
        else:
            self.record(
                "tool_end",
                tool=name,
                ok=True,
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )

The sanitize() function is not perfect. It is a seatbelt, not a vault.

Still, it prevents the most embarrassing version of this pattern: building a helpful debug trace that quietly stores API keys.
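
To see where the seatbelt ends: key-based redaction only fires when the dictionary key looks sensitive. A secret embedded in a string value, such as a header pasted into a query, passes straight through. The sketch below shows one possible second pass that masks values instead of keys. The names `SECRET_VALUES`, `scrub_text`, and `scrub` are hypothetical, and the two patterns are assumptions for illustration, not a complete list of credential formats.

```python
import re

# Hypothetical value patterns: "sk-" style keys and bearer tokens.
# Extend these for the credentials you actually handle.
SECRET_VALUES = re.compile(
    r"\b(?:sk-[A-Za-z0-9_-]{8,}|Bearer\s+[A-Za-z0-9._~+/=-]{8,})"
)


def scrub_text(text: str) -> str:
    """Mask anything inside a string that matches a known secret pattern."""
    return SECRET_VALUES.sub("[redacted]", text)


def scrub(value):
    """Walk nested dicts and lists, masking secret-looking string values."""
    if isinstance(value, str):
        return scrub_text(value)
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub(item) for item in value]
    return value
```

Pattern matching on values will still miss formats it has never seen, so this narrows the gap without closing it.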

Wrap One Tool First

Start with one tool. Do not instrument everything on day one.

import random
import time


def search_docs(query: str, api_key: str) -> list[str]:
    time.sleep(random.uniform(0.05, 0.2))

    if "timeout" in query:
        raise TimeoutError("Document search timed out")

    return [
        "JSONL works well for append-only traces.",
        "Context managers are useful around tool calls.",
        "DuckDB can query JSON files without a server.",
    ]

Now record the call:

box = AgentBlackBox("traces/run.jsonl")

query = "python agent trace format"

with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
    docs = search_docs(query=query, api_key="sk-not-a-real-key")

box.record("tool_result", tool="search_docs", result_count=len(docs))

Open traces/run.jsonl and the key is redacted. Inside the tool_start event, the data field looks like this:

{"tool":"search_docs","input":{"query":"python agent trace format","api_key":"[redacted]"}}

That tiny detail matters. Debugging should not create a second incident.

Add A Cheap Run Guard

Most runaway agent stories start with a loop that looked harmless.

So the black box should not only record what happened. It should record when it refused to continue.

class RunStopped(RuntimeError):
    pass


def stop_if_needed(
    box: AgentBlackBox,
    *,
    turn: int,
    max_turns: int,
    spent_usd: float,
    max_usd: float,
) -> None:
    box.record(
        "guard_check",
        turn=turn,
        max_turns=max_turns,
        spent_usd=round(spent_usd, 6),
        max_usd=round(max_usd, 6),
    )

    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")

    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")

This is not exact billing. The guard only compares whatever spent_usd it is handed, so pass in real token counts from your provider's response when you have them.

The goal here is a local tripwire. You want the run to leave a clear reason when it stops.
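
One way to hand the guard a better number than an estimate is to price the token counts your provider reports. The following is a minimal sketch under stated assumptions: `Pricing` and `cost_from_usage` are hypothetical names, the `input_tokens` and `output_tokens` keys stand in for whatever your provider calls those fields, and the default prices are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Placeholder per-token prices in USD; substitute your provider's rates.
    input_usd: float = 0.0000005
    output_usd: float = 0.0000015


def cost_from_usage(usage: dict, pricing: Pricing = Pricing()) -> float:
    """Convert reported token counts into a dollar figure for the guard.

    The "input_tokens" and "output_tokens" keys are assumed names;
    providers label these fields differently.
    """
    input_tokens = int(usage.get("input_tokens", 0))
    output_tokens = int(usage.get("output_tokens", 0))
    return input_tokens * pricing.input_usd + output_tokens * pricing.output_usd
```

Add the returned value to the running total after each model call, then pass that total to the guard as spent_usd.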

A Tiny Agent Loop

This fake loop keeps the moving parts small.

Replace the pretend model section with your real model call.

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * 0.0000005 + output_tokens * 0.0000015


def run_agent(question: str) -> str:
    box = AgentBlackBox("traces/run.jsonl")
    messages = [{"role": "user", "content": question}]
    spent_usd = 0.0
    max_turns = 3
    max_usd = 0.01

    box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)

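    # With a bounded for-loop the max_turns guard cannot fire; it earns its
    # keep once this becomes an open-ended loop driven by the model.
    for turn in range(1, max_turns + 1):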
        stop_if_needed(
            box,
            turn=turn,
            max_turns=max_turns,
            spent_usd=spent_usd,
            max_usd=max_usd,
        )

        box.record("turn_start", turn=turn, message_count=len(messages))

        # Pretend the model picked this tool input.
        query = question if turn == 1 else "python jsonl duckdb traces"

        with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
            docs = search_docs(query=query, api_key="sk-not-a-real-key")

        messages.append({"role": "tool", "content": "\n".join(docs)})

        turn_cost = estimate_cost(
            input_tokens=sum(len(message["content"].split()) for message in messages),
            output_tokens=120,
        )
        spent_usd += turn_cost

        box.record(
            "turn_end",
            turn=turn,
            message_count=len(messages),
            turn_cost_usd=round(turn_cost, 6),
            total_cost_usd=round(spent_usd, 6),
        )

    answer = "Record every tool call as JSONL, then query failures after the run."
    box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
    return answer

Run it once with a normal question.

print(run_agent("How should I debug Python agent tools?"))

Then run it with a bad one.

print(run_agent("timeout during document search"))

The second run should fail, but now it fails with a trail.

To force a budget stop for testing, temporarily set max_usd = 0.0001. The first turn already costs more than that, so the guard check at the start of turn 2 writes a guard_stop event and raises RunStopped instead of letting the loop continue quietly.

Query The Crash With DuckDB

This is the part that makes JSONL feel less like logging and more like a debugging tool.

Install DuckDB:

pip install duckdb
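
With the package installed, a first query might simply count events per type. This is a sketch rather than the article's own query: `event_count_sql` and `count_events` are hypothetical helper names, and the SQL assumes the top-level `type` field written by the recorder above. It relies on DuckDB's `read_json_auto` table function to read the JSONL file directly.

```python
def event_count_sql(trace_path: str) -> str:
    """Build a query that counts trace events per type."""
    # Double any single quotes so the path is a valid SQL string literal.
    literal = str(trace_path).replace("'", "''")
    return (
        'SELECT "type", count(*) AS events '
        f"FROM read_json_auto('{literal}') "
        'GROUP BY "type" ORDER BY events DESC'
    )


def count_events(trace_path: str) -> list:
    """Run the count query and return (type, events) rows."""
    # Imported lazily so the SQL helper still works without the package.
    import duckdb

    return duckdb.sql(event_count_sql(trace_path)).fetchall()
```

Calling `count_events("traces/run.jsonl")` after a failed run should surface rows for event types such as tool_error or guard_stop alongside the routine ones, which is usually the first thing worth checking.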

Then query the trace:

import duckdb


def 
