Coding Agent Trace Annotation

Potato supports annotation of agentic coding system traces -- sessions from tools like Claude Code, OpenCode, Cursor, Aider, SWE-Agent, and other AI coding assistants. This guide covers how to set up coding agent evaluation projects.

Overview

Coding agent traces consist of sequences of tool calls (file reads, edits, terminal commands, searches) interleaved with agent reasoning. Potato renders these with purpose-built formatting:

Code diffs (Edit/Write): Red/green unified diff view
Terminal blocks (Bash): Dark monospace terminal styling
Code blocks (Read/Grep): Line-numbered code display
File tree sidebar: Shows all files touched, grouped by operation
Collapsible outputs: Long outputs auto-collapse with expand controls

Quick Start

# Run the example from the repository root
python potato/flask_server.py start examples/agent-traces/coding-agent-evaluation/config.yaml -p 8000

Data Format

Structured Turns Format (Recommended)

The structured_turns format preserves full tool call structure for rich rendering:

{
  "id": "session_001",
  "task_description": "Fix the authentication bypass in login.py",
  "model": "claude-sonnet-4-20250514",
  "structured_turns": [
    {
      "role": "user",
      "content": "Fix the authentication bypass in login.py",
      "tool_calls": []
    },
    {
      "role": "assistant",
      "content": "I'll investigate the auth issue.",
      "tool_calls": [
        {
          "tool": "Read",
          "input": {"file_path": "src/auth/login.py"},
          "output": "def login(user, password):\n    if user.role == 'admin':\n        return True\n    ...",
          "output_type": "code",
          "language": "python"
        },
        {
          "tool": "Edit",
          "input": {
            "file_path": "src/auth/login.py",
            "old_string": "if user.role == 'admin':\n        return True",
            "new_string": "if verify_password(password, user.password_hash):"
          },
          "output": "Edit applied successfully.",
          "output_type": "diff"
        },
        {
          "tool": "Bash",
          "input": {"command": "pytest tests/test_auth.py -v"},
          "output": "4 passed",
          "output_type": "terminal"
        }
      ]
    }
  ]
}

Tool Call Fields

Each tool call in tool_calls has:

Field	Required	Description
`tool`	Yes	Tool name (Read, Edit, Bash, Grep, Glob, Write, etc.)
`input`	Yes	Tool input parameters (dict)
`output`	No	Tool output (string)
`output_type`	No	Rendering hint: `code`, `diff`, `terminal`, `generic` (auto-detected if omitted)
`language`	No	Programming language for syntax hints (auto-detected from file extension)
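
The field rules in the table above can be checked before data is loaded. The following is a minimal sketch, not part of Potato: `validate_tool_call` is a hypothetical helper, and the error messages are illustrative.

```python
from typing import Any

# Rendering hints listed in the table above.
KNOWN_OUTPUT_TYPES = {"code", "diff", "terminal", "generic"}


def validate_tool_call(call: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one tool call (empty if none)."""
    problems = []
    # `tool` and `input` are the two required fields.
    if not isinstance(call.get("tool"), str) or not call.get("tool"):
        problems.append("'tool' must be a non-empty string")
    if not isinstance(call.get("input"), dict):
        problems.append("'input' must be a dict")
    # `output` and `output_type` are optional, so only check them when present.
    if "output" in call and not isinstance(call["output"], str):
        problems.append("'output' must be a string when present")
    if "output_type" in call and call["output_type"] not in KNOWN_OUTPUT_TYPES:
        problems.append(f"unknown output_type: {call['output_type']!r}")
    return problems
```

Running this over every entry in `tool_calls` before starting the server can surface malformed records early, instead of at render time.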

Converting From Other Formats

Use the trace converter to convert from Anthropic Messages API, SWE-Agent, or other formats:

# Convert Claude Code / Anthropic Messages API traces
python -m potato.trace_converter -i traces.json -f claude_code -o data/converted.jsonl

# Auto-detect format
python -m potato.trace_converter -i traces.json --auto-detect -o data/converted.jsonl

The claude_code converter handles:

- Anthropic Messages API format (content blocks with tool_use/tool_result)
- Pre-structured structured_turns format
- Generic turns or steps format with tool calls
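
To illustrate what converting the Anthropic Messages API format involves, here is a rough sketch that pairs each tool_use block with its matching tool_result and emits structured_turns. It is not Potato's actual converter: `messages_to_structured_turns` is a hypothetical name, and the sketch ignores output_type and language, which the documentation says are auto-detected.

```python
from typing import Any


def messages_to_structured_turns(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Pair tool_use blocks with their tool_result blocks, turn by turn."""
    turns: list[dict[str, Any]] = []
    pending: dict[str, dict[str, Any]] = {}  # tool_use id -> call awaiting its output

    for message in messages:
        content = message["content"]
        if isinstance(content, str):  # plain-text message
            content = [{"type": "text", "text": content}]

        text_parts, tool_calls = [], []
        for block in content:
            if block["type"] == "text":
                text_parts.append(block["text"])
            elif block["type"] == "tool_use":
                call = {"tool": block["name"], "input": block["input"]}
                pending[block["id"]] = call
                tool_calls.append(call)
            elif block["type"] == "tool_result":
                # Attach the result to the earlier call instead of a new turn.
                call = pending.pop(block["tool_use_id"], None)
                if call is not None:
                    result = block.get("content", "")
                    if isinstance(result, list):  # result given as content blocks
                        result = "".join(b.get("text", "") for b in result)
                    call["output"] = result

        # Messages that only carried tool results produce no turn of their own.
        if text_parts or tool_calls:
            turns.append({
                "role": message["role"],
                "content": "\n".join(text_parts),
                "tool_calls": tool_calls,
            })
    return turns
```

The design choice here is to fold tool results into the assistant turn that issued the call, which matches the structured_turns example where each tool call carries its own `output`.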

Configuration

Display Configuration

Use the coding_trace display type in your instance_display config:

instance_display:
  layout:
    direction: vertical
    gap: 16px
  fields:
    - key: task_description
      type: text
      label: "Task"
    - key: structured_turns
      type: coding_trace
      label: "Agent Session"
      display_options:
        show_file_tree: true        # Show file tree sidebar
        diff_view: unified          # Diff rendering style
        collapse_long_outputs: true # Auto-collapse long outputs
        max_output_lines: 50        # Lines before collapsing
        terminal_theme: dark        # Terminal block theme
        show_step_numbers: true     # Show step numbers
        show_reasoning: true        # Show agent reasoning text

Display Options

Option	Default	Description
`show_file_tree`	`true`	Show sidebar with all files touched
`diff_view`	`unified`	Diff rendering style
`collapse_long_outputs`	`true`	Auto-collapse outputs longer than `max_output_lines`
`max_output_lines`	`50`	Number of lines before collapsing
`terminal_theme`	`dark`	Terminal block color theme
`show_step_numbers`	`true`	Show step numbers for assistant turns
`show_tool_badges`	`true`	Show tool name badges on tool calls
`show_reasoning`	`true`	Show agent reasoning text
`compact`	`false`	Use compact layout

Annotation Schemas

The coding trace display works with all standard Potato annotation schemas. Common combinations:

annotation_schemes:
  # Task-level success rating
  - annotation_type: radio
    name: task_success
    description: "Did the agent complete the task?"
    labels:
      - name: success
      - name: partial
      - name: failure

  # Code quality rating
  - annotation_type: likert
    name: code_quality
    description: "Rate the quality of the code changes"
    size: 5

  # Issue identification
  - annotation_type: multiselect
    name: issues
    description: "Select any issues observed"
    labels:
      - name: unnecessary_reads
      - name: wrong_tool
      - name: incomplete_fix
      - name: regression
      - name: missing_tests
      - name: scope_creep

  # Free-form notes
  - annotation_type: text
    name: notes
    description: "Additional observations"

Process Reward Schema (PRM)

For collecting binary per-step correctness signals for training Process Reward Models:

  - annotation_type: process_reward
    name: step_rewards
    description: "Mark the first incorrect step"
    steps_key: structured_turns
    mode: first_error  # or "per_step"

Option	Default	Description
`steps_key`	`steps`	Key in instance data containing the steps array
`step_text_key`	`action`	Key within each step for display text
`mode`	`first_error`	`first_error`: click the first wrong step and the remaining steps are auto-marked. `per_step`: annotate each step independently
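
One way to picture first_error mode is as a single click that expands into a label for every step. The sketch below assumes that steps before the clicked one count as correct and that the clicked step and everything after it count as incorrect. The documentation does not spell out the auto-marking rule, so treat that rule, the label strings, and the `expand_first_error` name as assumptions.

```python
from typing import Optional


def expand_first_error(num_steps: int, first_error_index: Optional[int]) -> list[str]:
    """Expand one first-error click into a binary label per step.

    `first_error_index` is zero-based; None means no step was marked wrong.
    """
    if first_error_index is None:
        return ["correct"] * num_steps
    if not 0 <= first_error_index < num_steps:
        raise IndexError("first_error_index out of range")
    # Assumed rule: earlier steps are correct, the rest are incorrect.
    correct = ["correct"] * first_error_index
    incorrect = ["incorrect"] * (num_steps - first_error_index)
    return correct + incorrect
```

Under this assumption, one click on a ten-step trace yields ten labels, which is why first_error mode is faster than per_step annotation.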

Code Review Schema

For GitHub PR review-style annotation with inline comments:

  - annotation_type: code_review
    name: review
    description: "Review the agent's code changes"
    comment_categories: [bug, style, suggestion, security]
    verdict_options: [approve, request_changes, comment_only]
    file_rating_dimensions: [correctness, quality]

Option	Default	Description
`comment_categories`	`[bug, style, suggestion, security, question]`	Categories for inline comments
`verdict_options`	`[approve, request_changes, comment_only]`	Overall review verdict options
`file_rating_dimensions`	`[correctness, quality]`	Per-file rating dimensions (1-5 scale)

Click on diff lines in the coding_trace display to add inline comments with file path and line number auto-filled.

Trace Converters

Convert traces from various coding agent formats:

# Claude Code / Anthropic Messages API
python -m potato.trace_converter -i traces.json -f claude_code -o data/converted.jsonl

# Aider chat history
python -m potato.trace_converter -i chat.md -f aider -o data/converted.jsonl

# SWE-Agent trajectories
python -m potato.trace_converter -i trajectory.json -f swe_agent_trajectory -o data/converted.jsonl

# Auto-detect format
python -m potato.trace_converter -i traces.json --auto-detect -o data/converted.jsonl

Export Formats

Export annotations for ML training pipelines:

python -m potato.export -f coding_eval -o exports/ --types prm,preference,swebench,code_review

Format	Output	Use Case
`prm`	`prm_training_data.jsonl`	Process Reward Model training
`preference`	`preference_pairs.jsonl`	DPO/RLHF from pairwise annotations
`swebench`	`swebench_results.jsonl`	SWE-bench compatible evaluation
`code_review`	`code_reviews.jsonl`	Structured review data

Supported Tool Types

The display renders each tool type with appropriate formatting:

Tool Names	Rendering	Style
`Read`, `read`	Code block with line numbers	Blue badge
`Edit`, `edit`, `Replace`	Unified diff (red/green lines)	Orange badge
`Write`, `write`, `Create`	"New file" code block (all green)	Green badge
`Bash`, `Terminal`, `Shell`	Dark terminal block with `$` prompt	Dark badge
`Grep`, `Glob`, `Search`, `Find`	Code block (search results)	Purple badge
Other tools	JSON-formatted input/output	Grey badge
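
The table amounts to a lookup from tool name to renderer, with a JSON fallback for anything unrecognized. A minimal sketch of that dispatch follows. The renderer labels (`code`, `diff`, `new_file`, `terminal`, `search`, `json`) and the `renderer_for` function are hypothetical names chosen for illustration, not Potato identifiers.

```python
# Tool names grouped exactly as in the table above.
RENDERERS = {
    "code": {"Read", "read"},
    "diff": {"Edit", "edit", "Replace"},
    "new_file": {"Write", "write", "Create"},
    "terminal": {"Bash", "Terminal", "Shell"},
    "search": {"Grep", "Glob", "Search", "Find"},
}


def renderer_for(tool_name: str) -> str:
    """Return the renderer label for a tool, falling back to JSON."""
    for renderer, names in RENDERERS.items():
        if tool_name in names:
            return renderer
    return "json"  # "Other tools" row: JSON-formatted input/output
```

When traces come from an agent with its own tool names, a lookup like this shows quickly which calls would fall through to the generic JSON view.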

Live Coding Agent Mode

Watch a coding agent work in real time, intervene, roll back to earlier steps, replay with different instructions, and edit agent actions.

Quick Start

# With Ollama (fully local, no API key needed)
python potato/flask_server.py start examples/agent-traces/live-coding-agent/config.yaml -p 8000

Configuration

live_coding_agent:
  backend_type: ollama_tool_use   # or anthropic_tool_use, claude_sdk
  ai_config:
    model: qwen2.5-coder:7b       # Any Ollama model with tool support
    base_url: http://localhost:11434
  working_dir: ./workspace
  max_turns: 20
  sandbox_mode: worktree          # worktree (default), docker, direct

Agent Backends

Backend	Config Key	Requirements
Ollama (local)	`ollama_tool_use`	Ollama running locally, no API key
Anthropic API	`anthropic_tool_use`	`ANTHROPIC_API_KEY` env var
Claude Agent SDK	`claude_sdk`	`claude-agent-sdk` package installed

Sandbox Modes

Mode	Description	Best For
`worktree`	Git worktree per session (lightweight copy)	Production use, safe isolation
`docker`	Docker container with mounted workspace	Maximum isolation
`direct`	Agent works directly in working_dir	Development, simple setup

Controls

During a live session, annotators can:

- Pause/Resume: Stop the agent between tool calls
- Send Instructions: Guide the agent ("try a different approach")
- Stop: End the session and save the trace

Checkpoints and Rollback

After each file-modifying tool call, a git checkpoint is created. Annotators can:

- View all checkpoints in a timeline
- Roll back to any previous step (restores files and conversation)
- See diffs between checkpoints

Branching and Replay

From any checkpoint, create alternative trajectories:

- Replay from step: Branch with new instructions
- Edit action: Modify a tool call's input and re-execute
- Compare branches: View different approaches side by side
- All branches are saved in the trace export
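
The relationship between checkpoints, rollback, and branching can be modeled in a few lines. The sketch below uses in-memory snapshots in place of the git checkpoints Potato creates, so it shows only the bookkeeping, not the real mechanism. `Checkpoint`, `Session`, and their methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    step: int
    files: dict[str, str]   # path -> contents at this step
    conversation_len: int   # number of turns kept at this step


@dataclass
class Session:
    files: dict[str, str] = field(default_factory=dict)
    conversation: list[str] = field(default_factory=list)
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def record_edit(self, path: str, contents: str, turn: str) -> None:
        """Apply a file-modifying tool call and snapshot the result."""
        self.files[path] = contents
        self.conversation.append(turn)
        self.checkpoints.append(
            Checkpoint(len(self.checkpoints), dict(self.files), len(self.conversation))
        )

    def rollback(self, step: int) -> None:
        """Restore files and conversation to the state at `step`."""
        cp = self.checkpoints[step]
        self.files = dict(cp.files)
        self.conversation = self.conversation[:cp.conversation_len]
        self.checkpoints = self.checkpoints[:step + 1]

    def branch_from(self, step: int) -> "Session":
        """Start an alternative trajectory without touching this one."""
        cp = self.checkpoints[step]
        return Session(
            dict(cp.files),
            self.conversation[:cp.conversation_len],
            list(self.checkpoints[:step + 1]),
        )
```

The difference worth noting is that a rollback rewrites the current trajectory, while a branch copies state so both trajectories remain available for side-by-side comparison.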

API Endpoints

Endpoint	Method	Description
`/api/live_coding_agent/start`	POST	Start a session
`/api/live_coding_agent/stream/<id>`	GET	SSE event stream
`/api/live_coding_agent/pause/<id>`	POST	Pause agent
`/api/live_coding_agent/resume/<id>`	POST	Resume agent
`/api/live_coding_agent/instruct/<id>`	POST	Send instruction
`/api/live_coding_agent/stop/<id>`	POST	Stop and save
`/api/live_coding_agent/checkpoints/<id>`	GET	List checkpoints
`/api/live_coding_agent/rollback/<id>`	POST	Rollback to step
`/api/live_coding_agent/replay/<id>`	POST	Create branch and replay
`/api/live_coding_agent/branches/<id>`	GET	List branches
`/api/live_coding_agent/switch_branch/<id>`	POST	Switch branch
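
A client for these endpoints needs two pieces: URL construction and parsing of the SSE stream. The sketch below covers both using only the standard library. The helper names `endpoint` and `iter_sse_data` are hypothetical, request bodies are left out because their fields are not documented here, and the parser follows the generic Server-Sent Events line format and does not assume anything about Potato's event payloads.

```python
from urllib.parse import quote

API_PREFIX = "/api/live_coding_agent"
# Actions from the table that take a session id in the path.
SESSION_ACTIONS = {
    "stream", "pause", "resume", "instruct", "stop",
    "checkpoints", "rollback", "replay", "branches", "switch_branch",
}


def endpoint(base_url, action, session_id=None):
    """Build the URL for one of the live coding agent endpoints."""
    base = base_url.rstrip("/")
    if action == "start":
        return f"{base}{API_PREFIX}/start"
    if action not in SESSION_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if session_id is None:
        raise ValueError(f"{action} requires a session id")
    return f"{base}{API_PREFIX}/{action}/{quote(session_id, safe='')}"


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format strips one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data is joined with "\n".
            yield "\n".join(buffer)
            buffer = []
```

In practice `iter_sse_data` could be fed the decoded lines of a streaming HTTP response opened on the `stream` URL. Comment lines beginning with `:` are skipped, as are the other SSE fields.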

Examples

See examples/agent-traces/ for complete example projects:

live-coding-agent/ -- Live coding agent with real-time streaming and controls
coding-agent-evaluation/ -- Static coding agent trace evaluation
coding-agent-prm/ -- Fast PRM data collection with first_error mode
coding-agent-review/ -- GitHub PR-style code review with inline comments
coding-agent-comparison/ -- Multi-dimensional agent quality comparison
swebench-evaluation/ -- SWE-bench coding agent evaluation

Related Documentation

Schemas and Templates -- All annotation schema types
Configuration Reference -- Complete configuration options
