Agent Evaluation Guide
This guide covers evaluating AI agent systems with Potato, including coding agents, web agents, RAG pipelines, and multi-agent systems.
Overview
Potato supports evaluating AI agents at multiple levels:
| Level | What You Annotate | Example |
|---|---|---|
| Trajectory | Overall task success | "Did the agent complete the task?" |
| Step | Individual action correctness | Per-turn Likert ratings on each agent step |
| Span | Specific text segments | Highlight hallucinated claims, factual errors |
| Comparison | Side-by-side evaluation | "Which agent performed better?" |
Trace Conversion
Import traces from any major agent framework:
python -m potato.trace_converter --input traces.json --input-format openai --output data.jsonl
Supported formats: OpenAI, Anthropic/Claude, ReAct, LangChain, LangFuse, WebArena, SWE-bench, OpenTelemetry, CrewAI/AutoGen/LangGraph, MCP, Aider, Claude Code, ATIF, SWE-Agent, and Web Agent.
For full details, see Agent Traces.
Coding Agent Evaluation
Evaluate agentic coding systems (Claude Code, SWE-Agent, Aider) with:
- Diff rendering for code changes
- Process Reward Model (PRM) annotation
- Code review workflows
Web Agent Evaluation
Review GUI agent traces with an interactive screenshot viewer:
- Step-by-step navigation through screenshots
- SVG overlays showing clicks, bounding boxes, mouse paths, and scroll actions
- Inline annotation controls per step
- Live browsing mode with automatic trace recording
See Web Agent Annotation.
Trajectory Evaluation
Per-step error marking with typed error taxonomies:
- Mark individual steps as correct, incorrect, or partially correct
- Assign error types (hallucination, reasoning error, tool misuse, etc.)
- Span-level annotation within agent output
Live Agent Interaction
Observe and interact with a live AI agent in real time, recording traces as you go:
Using AI Assistance for Evaluation
Speed up agent evaluation with AI-powered features:
- AI Support - LLM label suggestions for agent evaluation tasks
- Chat Support - Ask an LLM questions about complex agent traces
Example Configurations
Ready-to-use examples in examples/agent-traces/:
| Example | What It Evaluates |
|---|---|
agent-trace-evaluation/ |
Text agent traces with MAST error taxonomy |
visual-agent-evaluation/ |
GUI agents with screenshot grounding |
agent-comparison/ |
Side-by-side A/B agent comparison |
rag-evaluation/ |
RAG retrieval relevance and citation accuracy |
openai-evaluation/ |
OpenAI Chat API traces with tool calls |
anthropic-evaluation/ |
Claude messages with tool_use blocks |
swebench-evaluation/ |
Coding agents with patch correctness ratings |
multi-agent-evaluation/ |
Multi-agent coordination (CrewAI, AutoGen, LangGraph) |
web-agent-review/ |
Pre-recorded web traces with overlay viewer |
web-agent-creation/ |
Live web browsing with trace recording |
Run any example:
python potato/flask_server.py start examples/agent-traces/agent-trace-evaluation/config.yaml -p 8000