Agent Evaluation Guide

This guide covers evaluating AI agent systems with Potato, including coding agents, web agents, RAG pipelines, and multi-agent systems.

Overview

Potato supports evaluating AI agents at multiple levels:

Level	What You Annotate	Example
Trajectory	Overall task success	"Did the agent complete the task?"
Step	Individual action correctness	Per-turn Likert ratings on each agent step
Span	Specific text segments	Highlight hallucinated claims, factual errors
Comparison	Side-by-side evaluation	"Which agent performed better?"

Trace Conversion

Import traces from any major agent framework:

python -m potato.trace_converter --input traces.json --input-format openai --output data.jsonl

Supported formats: OpenAI, Anthropic/Claude, ReAct, LangChain, LangFuse, WebArena, SWE-bench, OpenTelemetry, CrewAI/AutoGen/LangGraph, MCP, Aider, Claude Code, ATIF, SWE-Agent, and Web Agent.

For full details, see Agent Traces.

Coding Agent Evaluation

Evaluate agentic coding systems (Claude Code, SWE-Agent, Aider) with:

Diff rendering for code changes
Process Reward Model (PRM) annotation
Code review workflows

See Coding Agent Annotation.

Web Agent Evaluation

Review GUI agent traces with an interactive screenshot viewer:

Step-by-step navigation through screenshots
SVG overlays showing clicks, bounding boxes, mouse paths, and scroll actions
Inline annotation controls per step
Live browsing mode with automatic trace recording

See Web Agent Annotation.

Trajectory Evaluation

Per-step error marking with typed error taxonomies:

Mark individual steps as correct, incorrect, or partially correct
Assign error types (hallucination, reasoning error, tool misuse, etc.)
Span-level annotation within agent output

See Trajectory Evaluation.

Live Agent Interaction

Observe and interact with a live AI agent in real time, recording traces as you go:

See Live Agent Interaction.

Using AI Assistance for Evaluation

Speed up agent evaluation with AI-powered features:

AI Support - LLM label suggestions for agent evaluation tasks
Chat Support - Ask an LLM questions about complex agent traces

Example Configurations

Ready-to-use examples in examples/agent-traces/:

Example	What It Evaluates
`agent-trace-evaluation/`	Text agent traces with MAST error taxonomy
`visual-agent-evaluation/`	GUI agents with screenshot grounding
`agent-comparison/`	Side-by-side A/B agent comparison
`rag-evaluation/`	RAG retrieval relevance and citation accuracy
`openai-evaluation/`	OpenAI Chat API traces with tool calls
`anthropic-evaluation/`	Claude messages with tool_use blocks
`swebench-evaluation/`	Coding agents with patch correctness ratings
`multi-agent-evaluation/`	Multi-agent coordination (CrewAI, AutoGen, LangGraph)
`web-agent-review/`	Pre-recorded web traces with overlay viewer
`web-agent-creation/`	Live web browsing with trace recording

Run any example:

python potato/flask_server.py start examples/agent-traces/agent-trace-evaluation/config.yaml -p 8000