Multi-Agent Team Annotation¶
Annotating multi-agent systems needs more than a flat per-turn transcript — you need to attribute outcomes to which agent, which step, and which handoff. This page covers Potato's multi-agent-specific annotation surfaces (the M-series), which build on the agent trace and MAST taxonomy.
Failure attribution (failure_attribution)¶
Capture the (responsible agent, decisive step, reason) triple that the failure-attribution literature needs (Zhang et al., "Which Agent Causes Task Failures and When?", ICML 2025; the Who&When dataset). The agent dropdown and step picker are populated from the trace's own turns at render time, so the annotator chooses from what actually happened.
annotation_schemes:
- annotation_type: failure_attribution
name: attribution
description: "If it failed: which agent, which step, and why?"
steps_key: steps # field in the instance data holding the turn list
agent_key: agent # which field of each turn names the agent
# agents: [Planner, Coder, Reviewer] # optional static list instead of deriving from the trace
Stored as {"responsible_agent", "decisive_step", "reason"}. Pair it with the
agent trace display so annotators see the interactions while
attributing the failure. A runnable example is at
examples/agent-traces/failure-attribution/:
python potato/flask_server.py start examples/agent-traces/failure-attribution/config.yaml -p 8000
Orchestration pattern (recipe)¶
The orchestration architecture often dominates a run's outcome (MAESTRO,
2601.00481), so it's worth capturing as a first-class label. No new schema is needed
— a radio confirms or corrects the run's pattern, paired with the trace display.
The label then guides both the downstream evaluation lens and how you lay the trace
out (sequential → lanes, hierarchical → tree, group-chat → board).
annotation_schemes:
- annotation_type: radio
name: orchestration_pattern
description: "Which orchestration pattern does this run actually follow?"
labels: [single_agent, sequential_pipeline, hierarchical_manager, group_chat, blackboard, debate, hub_and_spoke]
has_free_response: true
Runnable example: examples/agent-traces/orchestration-pattern/. Pair with
agent_interaction_graph (structure) and agent_scorecard (per-agent scoring).
MAST tagging at step granularity (recipe)¶
You don't need a new schema to bind the MAST taxonomy to the
exact step (and therefore the acting agent) where a failure occurred — configure the
existing per-step trajectory_eval schema with the 14 MAST
modes as its error_types, grouped by the three MAST categories. Annotators then tag
each turn with the precise failure mode instead of labeling the trace as a whole.
Pair it with failure_attribution (responsible agent) and handoff_review
(inter-agent edges) for full coverage.
annotation_schemes:
- annotation_type: trajectory_eval
name: mast_steps
description: "Tag each step with the MAST failure mode(s) it exhibits."
steps_key: steps
step_text_key: content
error_types:
- name: "Specification & System Design"
subtypes: ["1.1 Disobey task specification", "1.2 Disobey role specification", "1.3 Step repetition", "1.4 Loss of conversation history", "1.5 Unaware of termination conditions"]
- name: "Inter-Agent Misalignment"
subtypes: ["2.1 Conversation reset", "2.2 Fail to ask for clarification", "2.3 Task derailment", "2.4 Information withholding", "2.5 Ignored other agent's input", "2.6 Reasoning-action mismatch"]
- name: "Task Verification & Termination"
subtypes: ["3.1 Premature termination", "3.2 No or incomplete verification", "3.3 Incorrect verification"]
Runnable example: examples/agent-traces/mast-step-tagging/.
Interaction graph (agent_interaction_graph)¶
Render the whole run as a directed interaction graph — nodes are the agents, edges are the message/handoff transitions between them (thicker = more frequent) — and let the annotator mark the critical path (click a node) and flag problematic edges (click an edge to cycle normal → critical → problematic). No open competitor offers a clickable agent-interaction graph (cf. AgentGraph, AAAI 2026). The graph is laid out automatically from the trace, so it needs no precomputed coordinates.
annotation_schemes:
- annotation_type: agent_interaction_graph
name: graph
description: "Mark the critical path and flag any problematic handoffs."
steps_key: steps
agent_key: agent
Stored as {"critical_nodes": [...], "edges": {"A->B": "problematic", ...}}. Every
node and edge is keyboard-focusable and activates on Enter/Space, and a live text
summary lists critical nodes and flagged edges so meaning is never conveyed by color
alone (WCAG). Example: examples/agent-traces/interaction-graph/.
Cross-lane emergent behavior (emergent_behavior)¶
Tag collective behaviors that span multiple turns and agents — collusion, groupthink, cascading errors, role drift (collective-behavior work, 2604.05339). An emergent behavior isn't a contiguous text span; it's a set of participating turns, possibly from different agents/lanes. For each behavior the annotator checks the turns that participate and adds a note — a "cross-lane span" expressed as a turn-set (which keeps it independent of, and safe for, the core span engine).
annotation_schemes:
- annotation_type: emergent_behavior
name: emergent
description: "For each collective behavior, check the turns (across agents) that participate."
steps_key: steps
agent_key: agent
behaviors: [collusion, groupthink, cascading_error, role_drift]
allow_note: true
Stored as {behavior: {turns: [idx...], note}} (only non-empty behaviors). Example:
examples/agent-traces/emergent-behavior/.
Handoff review (handoff_review)¶
Treat every handoff — one agent passing control to another — as a first-class
object to annotate. Wherever the acting agent changes between consecutive turns,
Potato emits a handoff card A → B; the annotator flags inter-agent misalignment
and rates the handoff quality. Grounded in MAST's inter-agent failure modes, LACP
(Zhang et al., 2510.13821) and the "Echoing" phenomenon (2511.09710).
annotation_schemes:
- annotation_type: handoff_review
name: handoffs
description: "For each handoff: flag any misalignment and rate the quality."
steps_key: steps
agent_key: agent
flags: [info_loss, dropped_constraint, garbling, goal_drift] # customizable
quality_scale: 5
Stored as a list of {index, step, from, to, flags, quality}. Handoffs are derived
from the trace at render time (no manual setup). Example:
examples/agent-traces/handoff-review/.
Per-agent + per-team scorecard (agent_scorecard)¶
Score a run on two levels at once (MultiAgentBench, Zhou et al., ACL 2025, 2503.01935): each agent gets per-dimension scores (role fidelity, contribution, coordination), the team gets shared-dimension scores, and optional milestones are checked off. Agent rows are derived from the trace's own turns, so the matrix matches who actually participated.
annotation_schemes:
- annotation_type: agent_scorecard
name: scorecard
description: "Score each agent, the team, and which milestones were reached."
steps_key: steps
agent_key: agent
scale: 5
agent_dimensions: [role fidelity, contribution, coordination]
team_dimensions: [coordination, communication, efficiency]
milestones: [plan produced, task delegated correctly, result verified] # optional
Stored as {"agents": {name: {dim: score}}, "team": {dim: score}, "milestones": {name: bool}}.
Example: examples/agent-traces/agent-scorecard/.
Tool / resource-contention timeline (tool_contention)¶
Visualize concurrent tool/resource use across agents on a multi-lane timeline (one lane per agent) and flag concurrency failures — deadlock, circular wait, race conditions, shared-resource collisions (DPBench, 2602.13255). Contention regions where two calls touch the same resource at overlapping times are highlighted across the lanes and listed for classification.
annotation_schemes:
- annotation_type: tool_contention
name: contention
description: "Classify each shared-resource contention region."
calls_key: calls # list of {agent, tool, start, end, resource}
agent_key: agent
resource_key: resource
contention_labels: [deadlock, circular_wait, race_condition, benign]
Contentions are computed at render time (same resource, overlapping interval).
Stored as {"contentions": {idx: label}}. Example:
examples/agent-traces/tool-contention/.
Tool-call review (tool_call_review)¶
Judge each tool / function call in a trace individually: was the right tool
chosen, were the arguments correct, was the ordering right? (mirrors BFCL v4 /
MCPMark). Tool calls are extracted from the trace steps at render time — each step's
tool_calls/tool_call/action becomes a card showing the tool name and
pretty-printed arguments, with a per-call verdict and notes.
annotation_schemes:
- annotation_type: tool_call_review
name: tool_review
description: "Judge each tool call: right tool? correct arguments?"
steps_key: steps
# verdict_options: [correct, wrong_tool, wrong_args, wrong_order] # customizable
Stored as a list of {index, step, tool, verdict, notes}. Example:
examples/agent-traces/tool-call-review/.
Related documentation¶
- Failure-Mode Taxonomy (MAST) — tag how it failed
- Agent Traces — the trace display
- Trajectory Evaluation — per-step error annotation