Annotating Agent Traces with Potato
This guide covers how to use Potato for evaluating AI agent traces (trajectories). Potato provides all the infrastructure needed for rigorous agent evaluation: multi-annotator coordination, rich annotation schemas, inter-annotator agreement, quality control, and crowdsourcing support.
Overview
An agent trace (or trajectory) is a record of an AI agent's step-by-step actions while completing a task. Each step typically includes:
- Thought/Reasoning: The agent's internal planning
- Action: A tool call or function invocation
- Observation: The result from the environment
Potato maps these traces onto its existing annotation infrastructure:
| Evaluation Level | What's Annotated | Potato Mechanism |
|---|---|---|
| Trajectory | Overall success, efficiency, safety | radio, likert, multiselect, slider schemas |
| Step | Action correctness, tool choice quality | per_turn_ratings on dialogue display, multirate schema |
| Span | Hallucinations, factual errors in text | span schema with dialogue as span_target |
| Comparison | Which agent is better | Pairwise display + pairwise schema |
Data Format
Format agent traces as JSONL, with one trace per line. The example below is pretty-printed for readability; in the data file, each record must occupy a single line:
{
  "id": "trace_001",
  "task_description": "Book a flight from NYC to London",
  "agent_name": "GPT-4-ReAct",
  "metadata_table": [
    {"Property": "Steps", "Value": "7"},
    {"Property": "Tokens", "Value": "2340"},
    {"Property": "Cost", "Value": "$0.12"}
  ],
  "conversation": [
    {"speaker": "Agent (Thought)", "text": "I need to search for flights..."},
    {"speaker": "Agent (Action)", "text": "search_flights(origin='JFK', destination='LHR')"},
    {"speaker": "Environment", "text": "Found 3 flights: BA117 ($450)..."}
  ]
}
Key Fields
- id: Unique trace identifier
- task_description: What the agent was asked to do
- conversation: List of {speaker, text} dicts representing the trace steps
- metadata_table: Optional list of {Property, Value} dicts for the spreadsheet display
- agent_name: Optional agent identifier
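Before loading a data file, it can help to check that every line parses and carries the non-optional fields listed above. The following is a minimal sketch using only the standard library; `validate_trace` and `validate_jsonl` are hypothetical helper names, not part of Potato, and the rule that the three non-optional fields are required is inferred from the field list rather than from Potato's own validation.

```python
import json

# Assumption: the fields not marked "Optional" above are treated as required.
REQUIRED_FIELDS = ("id", "task_description", "conversation")


def validate_trace(record):
    """Return a list of problems found in one trace record (empty if none)."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    conversation = record.get("conversation", [])
    if not isinstance(conversation, list):
        problems.append("conversation must be a list")
        conversation = []
    for i, turn in enumerate(conversation):
        # Every turn needs both keys of the {speaker, text} shape.
        if not isinstance(turn, dict) or not {"speaker", "text"} <= set(turn):
            problems.append(f"conversation[{i}] needs speaker and text")
    return problems


def validate_jsonl(text):
    """Validate JSONL content; returns {line_number: [problems]} for bad lines."""
    report = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_trace(record)
        if problems:
            report[lineno] = problems
    return report


good = ('{"id": "trace_001", "task_description": "Book a flight", '
        '"conversation": [{"speaker": "Environment", "text": "ok"}]}')
bad = '{"id": "trace_002", "conversation": [{"speaker": "User"}]}'
report = validate_jsonl(good + "\n" + bad)
```

Here `report` only contains an entry for the second line, which lacks `task_description` and has a turn without `text`.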
Speaker Naming Conventions
Use consistent speaker names to enable per-turn ratings:
- Agent (Thought) - Agent reasoning/planning
- Agent (Action) - Tool calls and actions
- Environment - Tool/API results
- User - User messages (for interactive agents)
- System - System messages
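If your traces are stored as a list of thought/action/observation steps, the naming conventions above can be applied mechanically. This is a minimal sketch, not Potato's converter: `react_steps_to_trace` and `to_jsonl_line` are hypothetical helpers, and the input shape (a list of dicts with `thought`, `action`, `observation` keys) is an assumption.

```python
import json

# Mapping from assumed step keys to the speaker names recommended above.
SPEAKER_FOR_KEY = (
    ("thought", "Agent (Thought)"),
    ("action", "Agent (Action)"),
    ("observation", "Environment"),
)


def react_steps_to_trace(trace_id, task, steps, agent_name=None):
    """Build one trace record in the documented shape from ReAct-style steps."""
    conversation = []
    for step in steps:
        # Emit turns in thought -> action -> observation order, skipping absent parts.
        for key, speaker in SPEAKER_FOR_KEY:
            if step.get(key):
                conversation.append({"speaker": speaker, "text": str(step[key])})
    record = {"id": trace_id, "task_description": task, "conversation": conversation}
    if agent_name is not None:
        record["agent_name"] = agent_name
    return record


def to_jsonl_line(record):
    """Serialize a record to a single line, as JSONL requires."""
    return json.dumps(record, ensure_ascii=False)


steps = [
    {
        "thought": "I need to search for flights...",
        "action": "search_flights(origin='JFK', destination='LHR')",
        "observation": "Found 3 flights: BA117 ($450)...",
    }
]
trace = react_steps_to_trace(
    "trace_001", "Book a flight from NYC to London", steps, agent_name="GPT-4-ReAct"
)
line = to_jsonl_line(trace)
```

Keeping the speaker strings in one mapping makes it harder for a typo to silently exclude a step from the speakers list used by per_turn_ratings.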
Display Configuration
Basic: Dialogue Display
The simplest approach uses the dialogue display type, which works with zero code changes:
instance_display:
  layout:
    direction: vertical
    gap: 16px
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: metadata_table
      type: spreadsheet
      label: "Trace Info"
      display_options:
        compact: true
        max_height: 120
    - key: conversation
      type: dialogue
      label: "Agent Trace"
      span_target: true
      display_options:
        show_turn_numbers: true
        alternating_shading: true
        per_turn_ratings:
          speakers: ["Agent (Action)"]
          schema_name: "action_correctness"
          scheme:
            type: likert
            size: 3
            labels: ["Wrong", "Right"]
Rich: Agent Trace Display
The agent_trace display type provides purpose-built rendering with step cards:
instance_display:
  fields:
    - key: conversation
      type: agent_trace
      label: "Agent Trace"
      span_target: true
      display_options:
        show_step_numbers: true
        show_summary: true
        collapse_observations: false
        step_type_colors:
          thought: "#e8f4fd"
          action: "#fff3e0"
          observation: "#e8f5e9"
The agent trace display automatically:
- Color-codes steps by type (thought/action/observation)
- Shows a summary header with step counts
- Supports collapsible observations for long traces
- Renders inline screenshot thumbnails if present
Visual Agents: Image + Dialogue
For GUI/web agents with screenshots:
instance_display:
  layout:
    direction: vertical
    gap: 16px
  fields:
    - key: screenshot_url
      type: image
      label: "Screenshot"
      display_options:
        max_height: 400
    - key: conversation
      type: dialogue
      label: "Agent Actions"
      display_options:
        show_turn_numbers: true
Screenshot Sequences: Gallery Display
For step-by-step screenshot sequences:
instance_display:
  fields:
    - key: screenshots
      type: gallery
      label: "Screenshots"
      display_options:
        layout: horizontal
        thumbnail_size: 300
        show_captions: true
        zoomable: true
Annotation Schemas
Trajectory-Level Assessment
annotation_schemes:
  # Task success (3-way classification)
  - annotation_type: radio
    name: task_success
    description: "Did the agent complete the task?"
    labels:
      - name: success
        tooltip: "Fully completed correctly"
      - name: partial
        tooltip: "Partially completed or with errors"
      - name: failure
        tooltip: "Not completed"
    sequential_key_binding: true
  # Efficiency (5-point Likert)
  - annotation_type: likert
    name: efficiency
    description: "How efficient was the agent?"
    min_label: "Very inefficient"
    max_label: "Optimal"
    size: 5
  # Safety
  - annotation_type: radio
    name: safety
    labels:
      - name: safe
      - name: minor_concern
      - name: major_violation
Error Taxonomy (MAST Framework)
The MAST taxonomy (NeurIPS 2025, kappa=0.88) defines 14 failure modes. The schema below covers a subset of them, grouped by category, plus a no_errors option:
- annotation_type: multiselect
  name: mast_errors
  description: "Select all failure modes observed"
  labels:
    # System Design
    - name: disobey_task_spec
      tooltip: "Does not follow task specification"
    - name: step_repetition
      tooltip: "Repeats same action unnecessarily"
    - name: loss_of_history
      tooltip: "Forgets prior context"
    - name: unaware_of_termination
      tooltip: "Does not know when to stop"
    # Inter-Agent
    - name: fail_to_clarify
      tooltip: "Fails to ask for needed clarification"
    - name: task_derailment
      tooltip: "Goes off track from main task"
    - name: reasoning_action_mismatch
      tooltip: "Reasoning contradicts action taken"
    # Task Verification
    - name: premature_termination
      tooltip: "Stops before task is complete"
    - name: incomplete_verification
      tooltip: "Does not fully verify results"
    - name: no_errors
      tooltip: "No errors observed"
Span-Level Hallucination Marking
- annotation_type: span
  name: hallucination_spans
  labels:
    - name: hallucination
      tooltip: "Claim not grounded in evidence"
    - name: incorrect_fact
      tooltip: "Factually incorrect statement"
Per-Step Ratings (via per_turn_ratings)
Per-turn ratings add inline Likert widgets to specific speakers in the dialogue.
Single dimension (one rating per step):
# In the dialogue display_options:
per_turn_ratings:
  speakers: ["Agent (Action)"]
  schema_name: "action_correctness"
  scheme:
    type: likert
    size: 3
    labels: ["Wrong", "Right"]
Multi-dimension (multiple ratings per step):
# In the dialogue display_options:
per_turn_ratings:
  speakers: ["Agent (Action)"]
  schemes:
    - schema_name: action_correctness
      scheme:
        type: likert
        size: 3
        labels: ["Wrong", "Right"]
    - schema_name: reasoning_quality
      scheme:
        type: likert
        size: 5
        labels: ["Poor", "Excellent"]
Each scheme gets its own hidden input and annotation column. This is useful for evaluating multiple dimensions of step quality simultaneously (e.g., correctness and reasoning quality).
Dynamic Multirate (options_from_data)
For traces with varying numbers of steps, use options_from_data to generate multirate rows dynamically from instance data:
- annotation_type: multirate
  name: step_ratings
  description: "Rate each step"
  options_from_data: step_summaries  # reads from instance["step_summaries"]
  labels: ["Incorrect", "Questionable", "Correct"]
Your data should include the step_summaries field:
{
  "id": "trace_001",
  "step_summaries": ["Search for flights", "Compare prices", "Book cheapest"],
  ...
}
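If your traces do not already carry a step_summaries field, one option is to derive it from the action turns of the conversation. The sketch below is one possible approach under stated assumptions: `add_step_summaries` is a hypothetical helper, and using the (truncated) text of each "Agent (Action)" turn as the summary is an illustrative choice, not something Potato prescribes.

```python
def add_step_summaries(record, speaker="Agent (Action)", max_chars=60):
    """Return a copy of the record with step_summaries derived from one speaker's turns."""
    summaries = []
    for turn in record.get("conversation", []):
        if turn.get("speaker") != speaker:
            continue
        # Collapse whitespace so multi-line tool calls fit on one rating row.
        text = " ".join(str(turn.get("text", "")).split())
        if len(text) > max_chars:
            text = text[: max_chars - 3] + "..."
        summaries.append(text)
    enriched = dict(record)  # shallow copy; the input record is left unchanged
    enriched["step_summaries"] = summaries
    return enriched


record = {
    "id": "trace_001",
    "conversation": [
        {"speaker": "Agent (Thought)", "text": "I need to search for flights..."},
        {"speaker": "Agent (Action)", "text": "search_flights(origin='JFK', destination='LHR')"},
        {"speaker": "Environment", "text": "Found 3 flights: BA117 ($450)..."},
        {"speaker": "Agent (Action)", "text": "book_flight(flight='BA117')"},
    ],
}
enriched = add_step_summaries(record)
```

Because the number of summaries follows the number of action turns, each trace gets as many multirate rows as it has actions.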
Trace Converter CLI
Convert traces from common agent frameworks to Potato's format:
# Convert ReAct JSON traces
python -m potato.trace_converter --input traces.json --input-format react --output data.jsonl
# Convert LangChain/LangSmith exports
python -m potato.trace_converter --input langsmith_export.json --input-format langchain --output data.jsonl
# Convert Langfuse exports
python -m potato.trace_converter --input langfuse_traces.json --input-format langfuse --output data.jsonl
# Convert WebArena benchmark traces
python -m potato.trace_converter --input webarena_results.json --input-format webarena --output data.jsonl
# Convert ATIF (academic standard) traces
python -m potato.trace_converter --input atif_traces.json --input-format atif --output data.jsonl
# Auto-detect format
python -m potato.trace_converter --input traces.json --auto-detect --output data.jsonl
# List supported formats
python -m potato.trace_converter --list-formats
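When converting many files, the commands above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in this section (--input, --input-format, --auto-detect, --output); `build_convert_command` is a hypothetical helper and the file names are placeholders. It only builds the argument lists; running them requires Potato to be installed.

```python
import sys


def build_convert_command(input_path, output_path, input_format=None):
    """Build the argument list for one trace_converter invocation."""
    cmd = [sys.executable, "-m", "potato.trace_converter", "--input", str(input_path)]
    if input_format is None:
        # No explicit format given: fall back to auto-detection.
        cmd.append("--auto-detect")
    else:
        cmd += ["--input-format", input_format]
    cmd += ["--output", str(output_path)]
    return cmd


# Placeholder jobs: (source file, format or None for auto-detect).
jobs = [
    ("traces.json", "react"),
    ("langsmith_export.json", "langchain"),
    ("unknown.json", None),
]
commands = [
    build_convert_command(src, src.replace(".json", ".jsonl"), fmt) for src, fmt in jobs
]
# To execute each command: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.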
Supported Input Formats
| Format | Source | Key Fields |
|---|---|---|
| react | Generic ReAct JSON | steps[].thought/action/observation |
| langchain | LangSmith export | child_runs[].run_type/inputs/outputs |
| langfuse | Langfuse export | observations[].type/input/output |
| atif | Academic standard | steps[].thought/action/observation + structured task/agent |
| webarena | GUI benchmarks | actions[].action_type/element + screenshots[] |
| openai | OpenAI Chat/Assistants API | messages[].role/content/tool_calls |
| anthropic | Anthropic Claude Messages API | messages[].content[] with typed blocks (text, tool_use, tool_result) |
| swebench | SWE-bench benchmark | instance_id + problem_statement + patch/model_patch |
| otel | OpenTelemetry/OTLP | resourceSpans (OTLP) or trace_id/span_id (flat) |
| multi_agent | CrewAI, AutoGen, LangGraph | Auto-detected: agents+steps[].agent, messages[].sender, events[].node |
| mcp | Model Context Protocol | interactions[].method (tools/call, resources/read, prompts/get) |
| aider | Aider code editing | Chat history with edit blocks and file changes |
| claude_code | Claude Code CLI | Session traces with tool calls and file edits |
| web_agent | Generic web agent | actions[].action_type with DOM selectors and screenshots |
| swe_agent_trajectory | SWE-Agent trajectory | Full trajectory with history[].action/observation pairs |
New Format Examples
OpenAI Chat Completions
python -m potato.trace_converter --input openai_logs.json --input-format openai --output data.jsonl
Input format — messages with role/content strings, optional tool_calls:
{
  "id": "chatcmpl-abc123",
  "model": "gpt-4",
  "messages": [
    {"role": "user", "content": "What is the weather?"},
    {"role": "assistant", "content": null, "tool_calls": [
      {"id": "call_1", "type": "function",
       "function": {"name": "get_weather", "arguments": "{\"location\":\"NYC\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "Sunny, 72F"},
    {"role": "assistant", "content": "It's sunny and 72F in NYC."}
  ]
}
Also supports the OpenAI Assistants API format with assistant_id and steps arrays.
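To make the mapping from Chat Completions messages to trace turns concrete, here is a minimal sketch of one way the flattening could work. It is not Potato's converter: `openai_messages_to_turns` is a hypothetical helper, and labelling plain assistant text with the speaker "Agent" is an assumption, since the naming conventions in this guide do not name a speaker for final answers.

```python
def openai_messages_to_turns(messages):
    """Flatten Chat Completions messages into {speaker, text} turns."""
    turns = []
    for msg in messages:
        role = msg.get("role")
        content = msg.get("content")
        if role == "user":
            turns.append({"speaker": "User", "text": content or ""})
        elif role == "system":
            turns.append({"speaker": "System", "text": content or ""})
        elif role == "tool":
            # Tool results play the role of environment observations.
            turns.append({"speaker": "Environment", "text": str(content or "")})
        elif role == "assistant":
            if content:
                # Assumption: plain assistant text is shown under "Agent".
                turns.append({"speaker": "Agent", "text": content})
            for call in msg.get("tool_calls") or []:
                fn = call.get("function", {})
                text = f"{fn.get('name', '')}({fn.get('arguments', '')})"
                turns.append({"speaker": "Agent (Action)", "text": text})
    return turns


messages = [
    {"role": "user", "content": "What is the weather?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"location\":\"NYC\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "Sunny, 72F"},
    {"role": "assistant", "content": "It's sunny and 72F in NYC."},
]
turns = openai_messages_to_turns(messages)
```

For the example log above, this yields four turns: the user question, one action, one environment result, and the final agent reply.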
Anthropic Claude Messages
python -m potato.trace_converter --input claude_logs.json --input-format anthropic --output data.jsonl
Input format — content is a list of typed blocks:
{
  "id": "msg_abc123",
  "model": "claude-sonnet-4-20250514",
  "messages": [
    {"role": "user", "content": "Analyze this data."},
    {"role": "assistant", "content": [
      {"type": "text", "text": "Let me analyze this."},
      {"type": "tool_use", "id": "toolu_1", "name": "python", "input": {"code": "df.describe()"}}
    ]},
    {"role": "user", "content": [
      {"type": "tool_result", "tool_use_id": "toolu_1", "content": "count 100\nmean 42.5"}
    ]}
  ]
}
Handles thinking blocks, tool_result with is_error, and request/response pair format.
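The typed-block structure can be flattened into trace turns along the following lines. This is a minimal sketch under stated assumptions, not Potato's converter: `anthropic_messages_to_turns` is a hypothetical helper, the "Agent" speaker for plain assistant text and the "[error] " prefix for failed tool results are illustrative choices, and the request/response pair format is not handled.

```python
import json


def _result_text(content):
    """tool_result content may be a string or a list of text blocks."""
    if isinstance(content, str):
        return content
    return "\n".join(b.get("text", "") for b in content or [] if isinstance(b, dict))


def anthropic_messages_to_turns(messages):
    """Flatten Messages API content blocks into {speaker, text} turns."""
    turns = []
    for msg in messages:
        plain_speaker = "User" if msg.get("role") == "user" else "Agent"
        content = msg.get("content")
        if isinstance(content, str):
            turns.append({"speaker": plain_speaker, "text": content})
            continue
        for block in content or []:
            kind = block.get("type")
            if kind == "text":
                turns.append({"speaker": plain_speaker, "text": block.get("text", "")})
            elif kind == "thinking":
                turns.append({"speaker": "Agent (Thought)", "text": block.get("thinking", "")})
            elif kind == "tool_use":
                call = f"{block.get('name', '')}({json.dumps(block.get('input', {}))})"
                turns.append({"speaker": "Agent (Action)", "text": call})
            elif kind == "tool_result":
                text = _result_text(block.get("content"))
                if block.get("is_error"):
                    text = "[error] " + text  # illustrative marker for failed calls
                turns.append({"speaker": "Environment", "text": text})
    return turns


messages = [
    {"role": "user", "content": "Analyze this data."},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Let me analyze this."},
        {"type": "tool_use", "id": "toolu_1", "name": "python", "input": {"code": "df.describe()"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_1", "content": "count 100\nmean 42.5"},
    ]},
]
turns = anthropic_messages_to_turns(messages)
```

Note that tool results arrive in a message with role "user", so the block type, not the role, decides that they are shown as Environment turns.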
SWE-bench
python -m potato.trace_converter --input swebench_results.json --input-format swebench --output data.jsonl
Input format — coding benchmark instances:
{
  "instance_id": "django__django-16527",
  "problem_statement": "Bug description...",
  "repo": "django/django",
  "model_patch": "diff --git a/...",
  "FAIL_TO_PASS": "[\"test_case_1\"]",
  "test_result": "resolved"
}
OpenTelemetry/OTLP
python -m potato.trace_converter --input otel_export.json --input-format otel --output data.jsonl
Supports both OTLP nested format (resourceSpans) and flattened per-span format (trace_id/span_id). Uses GenAI Semantic Conventions (gen_ai.prompt, gen_ai.completion, tool.name).
Multi-Agent (CrewAI / AutoGen / LangGraph)
python -m potato.trace_converter --input multi_agent_log.json --input-format multi_agent --output data.jsonl
Auto-detects the sub-format:
- CrewAI: agents list + steps with agent field
- AutoGen: messages with sender/receiver
- LangGraph: events with node field
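The detection rules in the list above can be expressed as a few key checks. The sketch below illustrates the idea only; `detect_multi_agent_format` is a hypothetical function, and the order in which the checks run is an assumption rather than the converter's documented behavior.

```python
def _any_has_key(items, key):
    """True if any dict in the list contains the given key."""
    return any(isinstance(item, dict) and key in item for item in items or [])


def detect_multi_agent_format(log):
    """Guess the multi-agent sub-format of a parsed log, or return None."""
    if not isinstance(log, dict):
        return None
    # CrewAI: an agents list plus steps that name their agent.
    if isinstance(log.get("agents"), list) and _any_has_key(log.get("steps"), "agent"):
        return "crewai"
    # AutoGen: messages that carry a sender.
    if _any_has_key(log.get("messages"), "sender"):
        return "autogen"
    # LangGraph: events that carry a node.
    if _any_has_key(log.get("events"), "node"):
        return "langgraph"
    return None


crewai_log = {"agents": ["planner"], "steps": [{"agent": "planner", "output": "plan"}]}
autogen_log = {"messages": [{"sender": "user_proxy", "receiver": "assistant", "content": "hi"}]}
langgraph_log = {"events": [{"node": "retrieve", "output": {}}]}
detected = [detect_multi_agent_format(x) for x in (crewai_log, autogen_log, langgraph_log)]
```

Returning None for unrecognized input lets a caller fall back to an explicit format instead of guessing.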
MCP Interaction Logs
python -m potato.trace_converter --input mcp_session.json --input-format mcp --output data.jsonl
Input format — JSON-RPC 2.0 interactions:
{
  "id": "session_001",
  "server": "my-mcp-server",
  "interactions": [
    {"method": "tools/call", "params": {"name": "search", "arguments": {"q": "test"}},
     "result": {"content": [{"type": "text", "text": "Found 5 results"}]}}
  ]
}
Export Format
Export agent evaluation annotations for analysis:
python -m potato.export --format agent_eval --config config.yaml --output eval_results/
This produces:
- agent_evaluation.json: Full structured output with per-trace and aggregate statistics
- agent_evaluation_summary.csv: One-row-per-trace summary for spreadsheet analysis
Quality Control
All of Potato's existing QC features work with agent traces:
Gold Standards
Include traces with known-correct labels:
{"id": "trace_gold_001", "task_description": "...", "conversation": [...],
"gold_labels": {"task_success": "success", "efficiency": 5}}
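Gold labels are typically used by comparing an annotator's answers on these traces against the known values. The sketch below shows the comparison only; `gold_accuracy` is a hypothetical helper, and representing an annotator's answers as a flat dict keyed by schema name is an assumption, not Potato's stored annotation format.

```python
def gold_accuracy(gold_labels, annotation):
    """Fraction of gold-labelled schemas on which the annotator matches the gold value."""
    if not gold_labels:
        return None  # nothing to score against
    hits = sum(1 for name, value in gold_labels.items() if annotation.get(name) == value)
    return hits / len(gold_labels)


gold = {"task_success": "success", "efficiency": 5}
annotator_answer = {"task_success": "success", "efficiency": 4}
score = gold_accuracy(gold, annotator_answer)
```

Exact matching is strict for Likert items such as efficiency; a tolerance of one scale point is a common alternative if that is too harsh for your task.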
Training Phase
Configure a training phase to calibrate annotators:
training:
  enabled: true
  num_items: 5
  feedback_mode: immediate
  passing_score: 0.8
Inter-Annotator Agreement
Potato automatically computes IAA metrics:
- Cohen's kappa for trajectory-level categorical labels
- Krippendorff's alpha for Likert scales
- Available in the admin dashboard
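For readers who want to sanity-check a reported agreement figure, Cohen's kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. The following is a minimal standalone sketch of that formula; it is not Potato's implementation, and the label sequences are made-up illustration data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal frequency per label.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative task_success labels from two annotators on four traces.
annotator_1 = ["success", "success", "partial", "failure"]
annotator_2 = ["success", "partial", "partial", "failure"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on 3 of 4 items (p_o = 0.75) and chance agreement is 5/16, giving kappa = 7/11, roughly 0.64.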
Example Projects
Potato includes eight ready-to-use example projects:
1. Agent Trace Evaluation
python potato/flask_server.py start examples/agent-traces/agent-trace-evaluation/config.yaml -p 8000
Text agent traces with full annotation schema (success, efficiency, MAST errors, hallucination spans).
2. Visual Agent Evaluation
python potato/flask_server.py start examples/agent-traces/visual-agent-evaluation/config.yaml -p 8000
GUI agent traces with screenshots and grounding accuracy assessment.
3. Agent Comparison
python potato/flask_server.py start examples/agent-traces/agent-comparison/config.yaml -p 8000
Side-by-side comparison of two agent traces on the same task.
4. RAG Evaluation
python potato/flask_server.py start examples/agent-traces/rag-evaluation/config.yaml -p 8000
RAG pipeline evaluation with retrieval relevance and citation accuracy.
5. OpenAI Trace Evaluation
python potato/flask_server.py start examples/agent-traces/openai-evaluation/config.yaml -p 8000
OpenAI Chat Completions traces with tool calls, evaluated for task success and response quality.
6. Anthropic Claude Trace Evaluation
python potato/flask_server.py start examples/agent-traces/anthropic-evaluation/config.yaml -p 8000
Anthropic Messages API traces with tool_use/tool_result blocks and thinking blocks.
7. SWE-bench Patch Evaluation
python potato/flask_server.py start examples/agent-traces/swebench-evaluation/config.yaml -p 8000
SWE-bench coding agent patches with correctness assessment and code quality ratings.
8. Multi-Agent Evaluation
python potato/flask_server.py start examples/agent-traces/multi-agent-evaluation/config.yaml -p 8000
Multi-agent traces from CrewAI, AutoGen, and LangGraph with coordination quality assessment.
Workflow-Specific Configurations
Customer Service Agent
- Display: dialogue with per_turn_ratings
- Key schemas: Issue resolution (radio), response quality (likert), policy compliance (span)
Coding Agent (SWE-bench style)
- Display: dialogue for reasoning + code for generated patch
- Key schemas: Patch correctness (radio), code quality (likert), root cause vs symptom (radio)
Research/RAG Agent
- Display: dialogue for trace + span for citation verification
- Key schemas: Retrieval relevance (multirate), faithfulness (likert), citation accuracy (span)
GUI/Computer-Use Agent
- Display: image or gallery for screenshots + dialogue for actions
- Key schemas: Grounding accuracy (radio), navigation efficiency (likert), GUI errors (multiselect)
Related Documentation
- Schemas and Templates - All annotation type reference
- Quality Control - Gold standards, training, IAA
- Configuration - Complete configuration reference
- Admin Dashboard - Monitoring and management