Three-Pane Agent Trace Evaluation (eval_trace)
The eval_trace display takes a single agent trace and splits it into three
synchronized side-by-side panes:
Reasoning | Function Calls | Final Answer
This lets an evaluator see at a glance what the agent thought, what it did, and what it produced. That makes it well suited to continuous evaluation, where new traces arrive and must be judged quickly. It is the purpose-built display for showing an agent's thought traces, function calls, and final answer side by side.
Unlike agent_trace (which stacks an interleaved trace
vertically in a single column) or pairwise
(which compares two separate traces), eval_trace decomposes one interleaved
trace into its three semantic components.
Quick start
python potato/flask_server.py start examples/agent-traces/continuous-eval/config.yaml -p 8000
See examples/agent-traces/continuous-eval/ for the full runnable project,
including the directory-watch variant (config-watch.yaml).
Configuration
instance_display:
  layout:
    direction: vertical   # task header above the (internally horizontal) panes
    gap: 12px
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: trace          # the field holding the agent trace
      type: eval_trace
      label: "Agent Trace"
      display_options:
        pane_labels: ["Reasoning", "Function Calls", "Final Answer"]
        show_step_numbers: true
        collapse_long_outputs: true
        max_output_lines: 12
        link_steps: true
Options
| Option | Default | Description |
|---|---|---|
| pane_labels | ["Reasoning", "Function Calls", "Final Answer"] | Headers for the three panes (list of 3 strings; padded with defaults if fewer). |
| show_step_numbers | true | Show #N step numbers on reasoning/call cards. |
| collapse_long_outputs | true | Collapse tool results longer than max_output_lines into an expandable block. |
| max_output_lines | 20 | Line threshold for collapsing results. |
| link_steps | true | Enable cross-pane highlighting: clicking a card highlights the linked cards in the other panes. |
| compact | false | Tighter padding/spacing. |
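As a rough illustration of how these defaults combine with a partial display_options block, the sketch below overlays user-supplied values on the documented defaults and pads a short pane_labels list. The helper name resolve_display_options is hypothetical, and truncating label lists longer than three is an assumption; Potato's own resolution logic may differ.

```python
from typing import Any, Optional

# Defaults copied from the options table above.
DEFAULTS = {
    "pane_labels": ["Reasoning", "Function Calls", "Final Answer"],
    "show_step_numbers": True,
    "collapse_long_outputs": True,
    "max_output_lines": 20,
    "link_steps": True,
    "compact": False,
}


def resolve_display_options(user_options: Optional[dict] = None) -> dict:
    """Hypothetical helper: overlay user options on the documented defaults."""
    resolved: dict = {**DEFAULTS, **(user_options or {})}
    # Assumption: extra labels beyond the three panes are ignored.
    labels: list = list(resolved["pane_labels"])[:3]
    # Pad a short label list with the default labels for the missing panes.
    labels += DEFAULTS["pane_labels"][len(labels):]
    resolved["pane_labels"] = labels
    return resolved


options: dict = resolve_display_options(
    {"pane_labels": ["Thoughts"], "max_output_lines": 12}
)
```

Here options keeps the custom first label and the overridden threshold, while the remaining labels and flags fall back to the table's defaults.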
Data format
eval_trace accepts the same trace formats as agent_trace. The most common is
a list of {speaker, text} steps:
{
  "id": "eval_001",
  "task_description": "Find a vegan lasagna recipe.",
  "trace": [
    {"speaker": "Agent (Thought)", "text": "I'll search for a highly-rated recipe."},
    {"speaker": "Agent (Action)", "text": "web_search(query='vegan lasagna')"},
    {"speaker": "Environment", "text": "10 results found..."},
    {"speaker": "Agent (Final Answer)", "text": "Here's a great recipe: ..."}
  ]
}
The thought/action/observation (one dict expands to up to three steps) and
step_type/content formats are also supported.
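If traces are produced programmatically, a small script can emit records in the {speaker, text} shape. The sketch below is one way to do that, assuming the field names id, task_description, and trace from the example above; make_record and to_jsonl are hypothetical helpers, not part of Potato.

```python
import json


def make_record(record_id, task, steps):
    """Build one instance; `steps` is a list of (speaker, text) pairs."""
    return {
        "id": record_id,
        "task_description": task,
        "trace": [{"speaker": speaker, "text": text} for speaker, text in steps],
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


record = make_record(
    "eval_001",
    "Find a vegan lasagna recipe.",
    [
        ("Agent (Thought)", "I'll search for a highly-rated recipe."),
        ("Agent (Action)", "web_search(query='vegan lasagna')"),
        ("Environment", "10 results found..."),
        ("Agent (Final Answer)", "Here's a great recipe: ..."),
    ],
)
jsonl_text = to_jsonl([record])
```

Writing jsonl_text to a .jsonl file yields one trace per line.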
How steps map to panes
| Step (type inferred from speaker/label) | Pane |
|---|---|
| Thought / reasoning / planning / system | Reasoning |
| Action / tool / function / call | Function Calls (the adjacent Environment/result nests under the call as ↳) |
| Final Answer / send_message / respond / finish, or the last action if none match | Final Answer |
To set an explicit final answer, end the trace with a step whose speaker matches
an answer pattern (e.g. "Agent (Final Answer)"), or a send_message(...) action.
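The mapping can be approximated with a few keyword patterns. The sketch below illustrates the rules in the table and is not Potato's actual matcher: the regular expressions, the assign_panes name, and the choice to leave the fallback action in the calls list are all assumptions.

```python
import re

# Keyword patterns taken from the mapping table; the real matcher may differ.
ANSWER = re.compile(r"final answer|send_message|respond|finish", re.I)
CALL = re.compile(r"action|tool|function|call", re.I)
RESULT = re.compile(r"environment|result|observation", re.I)


def assign_panes(trace):
    """Return (reasoning, calls, final_answer) for a list of {speaker, text} steps."""
    reasoning, calls, final = [], [], None
    for step in trace:
        speaker, text = step["speaker"], step["text"]
        # An answer-like speaker, or an action whose text starts with an
        # answer keyword such as send_message(...), sets the final answer.
        if ANSWER.search(speaker) or (CALL.search(speaker) and ANSWER.match(text)):
            final = text
        elif CALL.search(speaker):
            calls.append({"call": text, "result": None})
        elif RESULT.search(speaker) and calls and calls[-1]["result"] is None:
            calls[-1]["result"] = text  # nest the result under its call
        else:
            reasoning.append(text)
    if final is None and calls:
        final = calls[-1]["call"]  # fall back to the last action
    return reasoning, calls, final


sample = [
    {"speaker": "Agent (Thought)", "text": "I'll search for a highly-rated recipe."},
    {"speaker": "Agent (Action)", "text": "web_search(query='vegan lasagna')"},
    {"speaker": "Environment", "text": "10 results found..."},
    {"speaker": "Agent (Final Answer)", "text": "Here's a great recipe: ..."},
]
reasoning, calls, final = assign_panes(sample)
```

For the sample trace, the thought lands in reasoning, the web_search call carries its Environment result, and the last step becomes the final answer.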
Step linking
Steps are grouped into logical cycles: a thought (or thoughts) plus the calls it
triggers share a data-step-index. With link_steps: true, clicking any card
highlights every card sharing that index across the panes, so you can trace a
thought to the action it produced.
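One way to picture the grouping is as a single pass over the classified steps, where a thought that follows a call opens the next cycle. The sketch below rests on that assumption; the assign_step_indices name, the "thought"/"call"/"result" labels, and the zero-based numbering are illustrative and may not match the values Potato writes to data-step-index.

```python
def assign_step_indices(kinds):
    """Give each step a cycle index; `kinds` is a list of "thought"/"call"/"result"."""
    indices, current, seen_call = [], 0, False
    for kind in kinds:
        # A thought that follows a call opens the next cycle.
        if kind == "thought" and seen_call:
            current += 1
            seen_call = False
        if kind == "call":
            seen_call = True
        indices.append(current)
    return indices


kinds = ["thought", "thought", "call", "result", "thought", "call", "call"]
step_indices = assign_step_indices(kinds)
```

Cards with equal indices would be highlighted together: here the first two thoughts share a cycle with the first call and its result, and the third thought shares one with the two calls that follow it.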
Continuous evaluation
Pair eval_trace with any of Potato's runtime ingestion transports so traces are
evaluated as they arrive:
- Webhook + SSE: setting trace_ingestion: {enabled: true} exposes POST /api/traces/webhook and notifies annotators via GET /api/traces/stream.
- Langfuse polling: add a langfuse source under trace_ingestion.sources.
- Directory watch: data_directory + watch_data_directory: true ingests dropped .json/.jsonl files.
Runtime-added traces are immediately assignable to annotators (dynamic sources
default the per-user quota to unlimited). See
examples/agent-traces/continuous-eval/README.md for curl examples.
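For the webhook transport, a producer can push traces from Python using only the standard library. The sketch below builds the POST request. It assumes the server from the quick start is listening on localhost:8000 and that the webhook accepts a JSON body shaped like the data format above; neither the payload shape nor any authentication requirement is confirmed here, so check the example README before relying on it. build_webhook_request is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: the server from the quick start is listening on localhost:8000.
WEBHOOK_URL = "http://localhost:8000/api/traces/webhook"


def build_webhook_request(record, url=WEBHOOK_URL):
    """Wrap one trace record in a JSON POST request (payload shape is assumed)."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_webhook_request(
    {
        "id": "eval_002",
        "task_description": "Find a vegan lasagna recipe.",
        "trace": [
            {"speaker": "Agent (Final Answer)", "text": "Here's a great recipe: ..."}
        ],
    }
)
# To actually send it (requires a running server):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the payload easy to inspect and test without a live server.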
Notes & limitations
- eval_trace is display-only: it collects no annotations itself. Pair it with annotation schemes (e.g. reasoning_quality, tool_use_correctness, answer_helpfulness) as in the example.
- Span annotation is not supported on eval_trace (per-pane card IDs do not follow the single .text-content wrapper contract). Use agent_trace or code if you need span highlighting on trace text.
Related
- Agent Traces — vertical step-card display and evaluation patterns
- Coding Agent Annotation — diff/terminal/file-tree trace display
- Instance Display — display-field configuration reference
- LangChain Integration — webhook ingestion