Coding Agent Trace Annotation
Potato supports annotation of agentic coding system traces -- sessions from tools like Claude Code, OpenCode, Cursor, Aider, SWE-Agent, and other AI coding assistants. This guide covers how to set up coding agent evaluation projects.
Overview
Coding agent traces consist of sequences of tool calls (file reads, edits, terminal commands, searches) interleaved with agent reasoning. Potato renders these with purpose-built formatting:
- Code diffs (Edit/Write): Red/green unified diff view
- Terminal blocks (Bash): Dark monospace terminal styling
- Code blocks (Read/Grep): Line-numbered code display
- File tree sidebar: Shows all files touched, grouped by operation
- Collapsible outputs: Long outputs auto-collapse with expand controls
Quick Start
# Run the example from the repository root
python potato/flask_server.py start examples/agent-traces/coding-agent-evaluation/config.yaml -p 8000
Data Format
Structured Turns Format (Recommended)
The structured_turns format preserves full tool call structure for rich rendering:
{
  "id": "session_001",
  "task_description": "Fix the authentication bypass in login.py",
  "model": "claude-sonnet-4-20250514",
  "structured_turns": [
    {
      "role": "user",
      "content": "Fix the authentication bypass in login.py",
      "tool_calls": []
    },
    {
      "role": "assistant",
      "content": "I'll investigate the auth issue.",
      "tool_calls": [
        {
          "tool": "Read",
          "input": {"file_path": "src/auth/login.py"},
          "output": "def login(user, password):\n if user.role == 'admin':\n return True\n ...",
          "output_type": "code",
          "language": "python"
        },
        {
          "tool": "Edit",
          "input": {
            "file_path": "src/auth/login.py",
            "old_string": "if user.role == 'admin':\n return True",
            "new_string": "if verify_password(password, user.password_hash):"
          },
          "output": "Edit applied successfully.",
          "output_type": "diff"
        },
        {
          "tool": "Bash",
          "input": {"command": "pytest tests/test_auth.py -v"},
          "output": "4 passed",
          "output_type": "terminal"
        }
      ]
    }
  ]
}
Tool Call Fields
Each tool call in tool_calls has:
| Field | Required | Description |
|---|---|---|
| tool | Yes | Tool name (Read, Edit, Bash, Grep, Glob, Write, etc.) |
| input | Yes | Tool input parameters (dict) |
| output | No | Tool output (string) |
| output_type | No | Rendering hint: code, diff, terminal, generic (auto-detected if omitted) |
| language | No | Programming language for syntax hints (auto-detected from file extension) |
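Before loading data into Potato, it can help to check each tool call against the field table above. The following is a minimal sketch, not part of Potato itself: the function names `validate_tool_call` and `validate_session` are hypothetical, and the only rules it enforces are the ones stated in the table.

```python
from __future__ import annotations

# Rendering hints listed in the field table above.
ALLOWED_OUTPUT_TYPES = {"code", "diff", "terminal", "generic"}


def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems found in one tool call (empty if none)."""
    problems = []
    tool = call.get("tool")
    if not isinstance(tool, str) or not tool:
        problems.append("'tool' is required and must be a non-empty string")
    if not isinstance(call.get("input"), dict):
        problems.append("'input' is required and must be a dict")
    if "output" in call and not isinstance(call["output"], str):
        problems.append("'output' must be a string when present")
    if "output_type" in call and call["output_type"] not in ALLOWED_OUTPUT_TYPES:
        problems.append(f"unknown output_type: {call['output_type']!r}")
    return problems


def validate_session(session: dict) -> list[str]:
    """Check every tool call in a session's structured_turns."""
    problems = []
    for t, turn in enumerate(session.get("structured_turns", [])):
        for c, call in enumerate(turn.get("tool_calls", [])):
            for problem in validate_tool_call(call):
                problems.append(f"turn {t}, call {c}: {problem}")
    return problems
```

Running `validate_session` over each record before writing the JSONL file surfaces missing `tool` or `input` fields early, rather than at render time.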
Converting From Other Formats
Use the trace converter to convert from Anthropic Messages API, SWE-Agent, or other formats:
# Convert Claude Code / Anthropic Messages API traces
python -m potato.trace_converter -i traces.json -f claude_code -o data/converted.jsonl
# Auto-detect format
python -m potato.trace_converter -i traces.json --auto-detect -o data/converted.jsonl
The claude_code converter handles:
- Anthropic Messages API format (content blocks with tool_use/tool_result)
- Pre-structured structured_turns format
- Generic turns or steps format with tool calls
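To illustrate what converting Anthropic Messages API content blocks involves, here is a rough sketch that pairs each `tool_use` block with its `tool_result` and emits `structured_turns`. It is not the code behind `potato.trace_converter`; the function names are hypothetical, and it assumes the standard Messages API block fields (`type`, `id`, `name`, `input`, `tool_use_id`, `content`).

```python
from __future__ import annotations


def _result_text(content) -> str:
    """tool_result content may be a plain string or a list of text blocks."""
    if isinstance(content, str):
        return content
    return "\n".join(b.get("text", "") for b in content if b.get("type") == "text")


def messages_to_structured_turns(messages: list[dict]) -> list[dict]:
    turns = []
    pending = {}  # tool_use id -> tool call dict still waiting for its result
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        texts, calls = [], []
        for block in content:
            kind = block.get("type")
            if kind == "text":
                texts.append(block["text"])
            elif kind == "tool_use":
                call = {"tool": block["name"], "input": block["input"]}
                pending[block["id"]] = call
                calls.append(call)
            elif kind == "tool_result":
                # Attach the output to the call emitted in an earlier turn.
                call = pending.pop(block["tool_use_id"], None)
                if call is not None:
                    call["output"] = _result_text(block.get("content", ""))
        # A user message that holds only tool results adds no turn of its own.
        if texts or calls:
            turns.append(
                {"role": msg["role"], "content": "\n".join(texts), "tool_calls": calls}
            )
    return turns
```

The sketch leaves `output_type` and `language` unset, relying on the auto-detection described in the field table.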
Configuration
Display Configuration
Use the coding_trace display type in your instance_display config:
instance_display:
  layout:
    direction: vertical
    gap: 16px
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: structured_turns
      type: coding_trace
      label: "Agent Session"
      display_options:
        show_file_tree: true          # Show file tree sidebar
        diff_view: unified            # Diff rendering style
        collapse_long_outputs: true   # Auto-collapse long outputs
        max_output_lines: 50          # Lines before collapsing
        terminal_theme: dark          # Terminal block theme
        show_step_numbers: true       # Show step numbers
        show_reasoning: true          # Show agent reasoning text
Display Options
| Option | Default | Description |
|---|---|---|
| show_file_tree | true | Show sidebar with all files touched |
| diff_view | unified | Diff rendering style |
| collapse_long_outputs | true | Auto-collapse outputs longer than max_output_lines |
| max_output_lines | 50 | Number of lines before collapsing |
| terminal_theme | dark | Terminal block color theme |
| show_step_numbers | true | Show step numbers for assistant turns |
| show_tool_badges | true | Show tool name badges on tool calls |
| show_reasoning | true | Show agent reasoning text |
| compact | false | Use compact layout |
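The interaction between collapse_long_outputs and max_output_lines can be pictured with a small sketch. This only illustrates the rule stated in the table (outputs longer than max_output_lines are collapsed); the function `collapse_output` is hypothetical and is not Potato's rendering code.

```python
def collapse_output(output: str, max_output_lines: int = 50,
                    collapse_long_outputs: bool = True):
    """Return (visible_text, hidden_line_count) for one tool output."""
    lines = output.splitlines()
    # Outputs at or under the limit, or with collapsing disabled, stay whole.
    if not collapse_long_outputs or len(lines) <= max_output_lines:
        return output, 0
    visible = "\n".join(lines[:max_output_lines])
    return visible, len(lines) - max_output_lines
```

With the defaults, a 60-line test log would show its first 50 lines and report 10 hidden lines behind the expand control.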
Annotation Schemas
The coding trace display works with all standard Potato annotation schemas. Common combinations:
annotation_schemes:
  # Task-level success rating
  - annotation_type: radio
    name: task_success
    description: "Did the agent complete the task?"
    labels:
      - name: success
      - name: partial
      - name: failure
  # Code quality rating
  - annotation_type: likert
    name: code_quality
    description: "Rate the quality of the code changes"
    size: 5
  # Issue identification
  - annotation_type: multiselect
    name: issues
    description: "Select any issues observed"
    labels:
      - name: unnecessary_reads
      - name: wrong_tool
      - name: incomplete_fix
      - name: regression
      - name: missing_tests
      - name: scope_creep
  # Free-form notes
  - annotation_type: text
    name: notes
    description: "Additional observations"
Process Reward Schema (PRM)
For collecting binary per-step correctness signals for training Process Reward Models:
- annotation_type: process_reward
  name: step_rewards
  description: "Mark the first incorrect step"
  steps_key: structured_turns
  mode: first_error  # or "per_step"
| Option | Default | Description |
|---|---|---|
| steps_key | steps | Key in instance data containing the steps array |
| step_text_key | action | Key within each step for display text |
| mode | first_error | first_error: click first wrong step, rest auto-marked. per_step: annotate each independently |
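One way to read first_error mode is that a single click expands into a full vector of per-step labels. The sketch below assumes that steps before the clicked one count as correct and that the clicked step and everything after it count as incorrect; that interpretation, the 1/0 encoding, and the function name `expand_first_error` are all assumptions, not Potato's documented export format.

```python
from typing import List, Optional


def expand_first_error(num_steps: int, first_error_index: Optional[int]) -> List[int]:
    """Expand one 'first wrong step' click into per-step labels (1 = correct)."""
    if first_error_index is None:
        # No error marked: every step is treated as correct.
        return [1] * num_steps
    if not 0 <= first_error_index < num_steps:
        raise ValueError("first_error_index is out of range")
    correct = [1] * first_error_index
    incorrect = [0] * (num_steps - first_error_index)
    return correct + incorrect
```

For a five-step trace where the annotator clicks the third step (index 2), this yields two correct labels followed by three incorrect ones. In per_step mode the annotator supplies each label directly, so no expansion is needed.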
Code Review Schema
For GitHub PR review-style annotation with inline comments:
- annotation_type: code_review
  name: review
  description: "Review the agent's code changes"
  comment_categories: [bug, style, suggestion, security]
  verdict_options: [approve, request_changes, comment_only]
  file_rating_dimensions: [correctness, quality]
| Option | Default | Description |
|---|---|---|
| comment_categories | [bug, style, suggestion, security, question] | Categories for inline comments |
| verdict_options | [approve, request_changes, comment_only] | Overall review verdict options |
| file_rating_dimensions | [correctness, quality] | Per-file rating dimensions (1-5 scale) |
Click on diff lines in the coding_trace display to add inline comments with file path and line number auto-filled.
Trace Converters
Convert traces from various coding agent formats:
# Claude Code / Anthropic Messages API
python -m potato.trace_converter -i traces.json -f claude_code -o data/converted.jsonl
# Aider chat history
python -m potato.trace_converter -i chat.md -f aider -o data/converted.jsonl
# SWE-Agent trajectories
python -m potato.trace_converter -i trajectory.json -f swe_agent_trajectory -o data/converted.jsonl
# Auto-detect format
python -m potato.trace_converter -i traces.json --auto-detect -o data/converted.jsonl
Export Formats
Export annotations for ML training pipelines:
python -m potato.export -f coding_eval -o exports/ --types prm,preference,swebench,code_review
| Format | Output | Use Case |
|---|---|---|
| prm | prm_training_data.jsonl | Process Reward Model training |
| preference | preference_pairs.jsonl | DPO/RLHF from pairwise annotations |
| swebench | swebench_results.jsonl | SWE-bench compatible evaluation |
| code_review | code_reviews.jsonl | Structured review data |
Supported Tool Types
The display renders each tool type with appropriate formatting:
| Tool Names | Rendering | Style |
|---|---|---|
| Read, read | Code block with line numbers | Blue badge |
| Edit, edit, Replace | Unified diff (red/green lines) | Orange badge |
| Write, write, Create | "New file" code block (all green) | Green badge |
| Bash, Terminal, Shell | Dark terminal block with $ prompt | Dark badge |
| Grep, Glob, Search, Find | Code block (search results) | Purple badge |
| Other tools | JSON-formatted input/output | Grey badge |
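The table above implies a lookup from tool name to rendering, which is roughly what auto-detection of output_type and language has to do when those fields are omitted. The sketch below is an illustration under assumptions: mapping Write/Create to diff follows the "Code diffs (Edit/Write)" grouping in the Overview, the extension table is a small hypothetical sample, and neither function is Potato's actual detection code.

```python
import os

# Tool names from the table above, mapped to an output_type rendering hint.
TOOL_TO_OUTPUT_TYPE = {
    "Read": "code", "read": "code",
    "Edit": "diff", "edit": "diff", "Replace": "diff",
    "Write": "diff", "write": "diff", "Create": "diff",  # assumption: all-green diff
    "Bash": "terminal", "Terminal": "terminal", "Shell": "terminal",
    "Grep": "code", "Glob": "code", "Search": "code", "Find": "code",
}

# Hypothetical sample of extension-to-language hints.
EXTENSION_TO_LANGUAGE = {".py": "python", ".js": "javascript", ".ts": "typescript",
                         ".go": "go", ".rs": "rust", ".java": "java"}


def detect_output_type(call: dict) -> str:
    """An explicit output_type wins; otherwise fall back to the tool name."""
    if call.get("output_type"):
        return call["output_type"]
    return TOOL_TO_OUTPUT_TYPE.get(call.get("tool", ""), "generic")


def detect_language(call: dict):
    """An explicit language wins; otherwise guess from the file extension."""
    if call.get("language"):
        return call["language"]
    path = call.get("input", {}).get("file_path", "")
    return EXTENSION_TO_LANGUAGE.get(os.path.splitext(path)[1].lower())
```

Unknown tools fall through to generic, matching the "Other tools" row.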
Live Coding Agent Mode
Watch a coding agent work in real time, intervene, roll back, replay with different instructions, and edit agent actions.
Quick Start
# With Ollama (fully local, no API key needed)
python potato/flask_server.py start examples/agent-traces/live-coding-agent/config.yaml -p 8000
Configuration
live_coding_agent:
  backend_type: ollama_tool_use       # or anthropic_tool_use, claude_sdk
  ai_config:
    model: qwen2.5-coder:7b           # Any Ollama model with tool support
    base_url: http://localhost:11434
  working_dir: ./workspace
  max_turns: 20
  sandbox_mode: worktree              # worktree (default), docker, direct
Agent Backends
| Backend | Config Key | Requirements |
|---|---|---|
| Ollama (local) | ollama_tool_use | Ollama running locally, no API key |
| Anthropic API | anthropic_tool_use | ANTHROPIC_API_KEY env var |
| Claude Agent SDK | claude_sdk | claude-agent-sdk package installed |
Sandbox Modes
| Mode | Description | Best For |
|---|---|---|
| worktree | Git worktree per session (lightweight copy) | Production use, safe isolation |
| docker | Docker container with mounted workspace | Maximum isolation |
| direct | Agent works directly in working_dir | Development, simple setup |
Controls
During a live session, annotators can:
- Pause/Resume: Stop the agent between tool calls
- Send Instructions: Guide the agent ("try a different approach")
- Stop: End the session and save the trace
Checkpoints and Rollback
After each file-modifying tool call, a git checkpoint is created. Annotators can:
- View all checkpoints in a timeline
- Roll back to any previous step (restores files and conversation)
- See diffs between checkpoints
Branching and Replay
From any checkpoint, create alternative trajectories:
- Replay from step: Branch with new instructions
- Edit action: Modify a tool call's input and re-execute
- Compare branches: View different approaches side by side
- All branches are saved in the trace export
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/live_coding_agent/start | POST | Start a session |
| /api/live_coding_agent/stream/<id> | GET | SSE event stream |
| /api/live_coding_agent/pause/<id> | POST | Pause agent |
| /api/live_coding_agent/resume/<id> | POST | Resume agent |
| /api/live_coding_agent/instruct/<id> | POST | Send instruction |
| /api/live_coding_agent/stop/<id> | POST | Stop and save |
| /api/live_coding_agent/checkpoints/<id> | GET | List checkpoints |
| /api/live_coding_agent/rollback/<id> | POST | Rollback to step |
| /api/live_coding_agent/replay/<id> | POST | Create branch and replay |
| /api/live_coding_agent/branches/<id> | GET | List branches |
| /api/live_coding_agent/switch_branch/<id> | POST | Switch branch |
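For scripting against these endpoints, a thin helper that builds the right URL and HTTP method for each action may be enough. The sketch below uses only the paths and methods from the table; the base URL assumes the server from Quick Start on port 8000, the request body fields are not documented here (the `payload` shown in the usage note is purely hypothetical), and `build_request` is not a Potato function.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumption: server started with -p 8000
GET_ACTIONS = {"stream", "checkpoints", "branches"}  # all other actions use POST


def build_request(action: str, session_id: Optional[str] = None,
                  payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one live_coding_agent endpoint."""
    url = f"{BASE_URL}/api/live_coding_agent/{action}"
    if session_id is not None:
        url += f"/{session_id}"
    method = "GET" if action in GET_ACTIONS else "POST"
    data, headers = None, {}
    if payload is not None:
        # Body schema is an assumption; check the server code for real fields.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

A request built this way can be sent with `urllib.request.urlopen(build_request("pause", session_id))`; an instruction might look like `build_request("instruct", session_id, {"instruction": "try a different approach"})`, where the `instruction` key is a guess at the body format.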
Examples
See examples/agent-traces/ for complete example projects:
- live-coding-agent/ -- Live coding agent with real-time streaming and controls
- coding-agent-evaluation/ -- Static coding agent trace evaluation
- coding-agent-prm/ -- Fast PRM data collection with first_error mode
- coding-agent-review/ -- GitHub PR-style code review with inline comments
- coding-agent-comparison/ -- Multi-dimensional agent quality comparison
- swebench-evaluation/ -- SWE-bench coding agent evaluation
Related Documentation
- Schemas and Templates -- All annotation schema types
- Configuration Reference -- Complete configuration options