Agent Failure-Mode Taxonomy (MAST)

Tag why an agent trace failed using a built-in, research-backed taxonomy — without hand-authoring the label set. The flagship preset is MAST (the Multi-Agent System failure taxonomy from Cemri et al. 2025, "Why Do Multi-Agent LLM Systems Fail?"): 14 failure modes across 3 categories, empirically derived with κ=0.88 human inter-annotator agreement.

Commercial tools auto-detect failure modes; Potato gives you a turnkey human failure-mode annotation workflow — optionally seeded by an LLM-judge pre-label that annotators then validate.

Quick start

Add taxonomy_preset: mast to any hierarchical_multiselect scheme:

annotation_schemes:
  - annotation_type: hierarchical_multiselect
    name: failure_modes
    description: "Tag every MAST failure mode this trace exhibits"
    taxonomy_preset: mast      # auto-fills the 14 modes + hover definitions
    show_search: true

That's it — no need to list the modes. Each mode renders with its code (e.g. 1.1 Disobey task specification) and an ⓘ marker; hovering or keyboard-focusing it shows the mode's definition so annotators apply the modes consistently.

A runnable example is at examples/agent-traces/failure-taxonomy/:

python potato/flask_server.py start examples/agent-traces/failure-taxonomy/config.yaml -p 8000

The MAST taxonomy

Category	Modes
Specification & System Design	1.1 Disobey task specification · 1.2 Disobey role specification · 1.3 Step repetition · 1.4 Loss of conversation history · 1.5 Unaware of termination conditions
Inter-Agent Misalignment	2.1 Conversation reset · 2.2 Fail to ask for clarification · 2.3 Task derailment · 2.4 Information withholding · 2.5 Ignored other agent's input · 2.6 Reasoning-action mismatch
Task Verification & Termination	3.1 Premature termination · 3.2 No or incomplete verification · 3.3 Incorrect verification

Each mode carries a one-line definition (shown as a tooltip). See potato/server_utils/failure_taxonomy.py for the full text.

Options

Option	Description
`taxonomy_preset`	Name of a built-in taxonomy (currently `mast`). Fills `taxonomy` + per-mode tooltips.
`taxonomy`	An explicit nested taxonomy. Wins over the preset if both are given.
`tooltips`	A `{label: text}` map of hover definitions. Merged over the preset's (explicit wins).
`show_search`	Show a search box — handy for 14+ labels.

Because it is a normal hierarchical_multiselect scheme, selections export as a comma-separated list of coded mode labels, and all the usual options (max_selections, auto_select_parent, …) apply.
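
For downstream analysis, the exported cell can be split back into individual modes. The following is a minimal sketch, not part of Potato: it assumes each exported label has the form "code name" (as the modes render in the UI, e.g. 1.1 Disobey task specification), that labels are separated by commas, and that no label itself contains a comma. The function name `parse_failure_modes` and the `MAST_CATEGORIES` lookup are hypothetical helpers introduced for illustration.

```python
# Category names from the MAST table, keyed by the leading digit of a mode code.
# This lookup is an illustrative helper, not something Potato exports.
MAST_CATEGORIES = {
    "1": "Specification & System Design",
    "2": "Inter-Agent Misalignment",
    "3": "Task Verification & Termination",
}


def parse_failure_modes(cell: str) -> list[dict]:
    """Split an exported comma-separated cell into one record per selected mode.

    Assumes each label looks like "<code> <name>" and contains no comma.
    """
    records = []
    for label in cell.split(","):
        label = label.strip()
        if not label:
            continue  # tolerate empty cells and trailing commas
        code, _, name = label.partition(" ")
        records.append(
            {
                "code": code,
                "name": name,
                "category": MAST_CATEGORIES.get(code.split(".")[0], "unknown"),
            }
        )
    return records


modes = parse_failure_modes("1.3 Step repetition, 3.2 No or incomplete verification")
print([m["code"] for m in modes])
```

If your export uses a different separator or omits the code prefix, adjust the split and partition steps accordingly.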

Pairing with an LLM judge

The taxonomy is small and explicit enough for an LLM judge to apply. A common loop:

1. An LLM proposes failure modes for each incoming trace (a pre-label).
2. Annotators validate or correct the proposal in the UI above.
3. Judge Alignment reports judge↔human agreement (κ) per mode, so you can see where the judge is reliable and where humans must review.

This mirrors the MAST paper's own workflow (human κ=0.88, LLM-judge κ=0.77) and turns failure-mode tagging into a calibrated, auditable signal.
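
To make the per-mode agreement in the loop above concrete, here is a minimal sketch of Cohen's κ computed separately for each mode. It is an illustration of the statistic, not Judge Alignment's actual implementation, which may differ. It assumes each trace is represented as a set of mode codes, once from the judge and once from a human; the function names and the toy data are hypothetical.

```python
def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two binary raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Chance agreement: both say "present" or both say "absent".
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is undefined here.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def per_mode_kappa(
    judge: list[set[str]], human: list[set[str]], codes: list[str]
) -> dict[str, float]:
    """Treat each mode as a binary present/absent label and score it on its own."""
    return {
        code: cohen_kappa([code in j for j in judge], [code in h for h in human])
        for code in codes
    }


# Toy data: four traces, judge vs. human selections (mode codes only).
judge = [{"1.3"}, {"1.3", "3.2"}, set(), {"3.2"}]
human = [{"1.3"}, {"3.2"}, set(), {"3.2"}]
kappas = per_mode_kappa(judge, human, ["1.3", "3.2"])
print(kappas)
```

Scoring each mode as its own binary task is what lets you see that a judge may be reliable on one mode and unreliable on another, even when overall agreement looks acceptable.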

Adding your own taxonomy

Add an entry to TAXONOMY_PRESETS in potato/server_utils/failure_taxonomy.py. Each preset is an ordered {category: [(code, name, description), ...]} mapping; the schema wiring (to_hierarchical, to_tooltips) converts it into the nested taxonomy and the per-mode tooltips.
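
As a rough illustration of that shape, the sketch below defines a made-up preset and derives a nested label structure and a tooltip map from it. The preset contents (`MY_PRESET`, the "Tool Use" and "Planning" categories) are invented for the example, and `sketch_hierarchy` / `sketch_tooltips` are stand-ins written here, not the real to_hierarchical / to_tooltips functions, whose signatures and output format are not shown in this page. The "code name" label format is an assumption based on how MAST modes render.

```python
# Hypothetical preset in the documented shape:
# {category: [(code, name, description), ...]}
MY_PRESET = {
    "Tool Use": [
        ("T.1", "Wrong tool selected",
         "The agent called a tool that cannot accomplish the step."),
        ("T.2", "Malformed tool arguments",
         "The call was rejected because its arguments were invalid."),
    ],
    "Planning": [
        ("P.1", "Plan abandoned",
         "The agent dropped its stated plan without explanation."),
    ],
}


def sketch_hierarchy(preset: dict) -> dict:
    """Map each category to its coded labels, e.g. 'T.1 Wrong tool selected'."""
    return {
        category: [f"{code} {name}" for code, name, _ in modes]
        for category, modes in preset.items()
    }


def sketch_tooltips(preset: dict) -> dict:
    """Map each coded label to its one-line definition."""
    return {
        f"{code} {name}": description
        for modes in preset.values()
        for code, name, description in modes
    }


hierarchy = sketch_hierarchy(MY_PRESET)
tooltips = sketch_tooltips(MY_PRESET)
print(hierarchy["Planning"])
```

Keeping code, name, and description together in one tuple means the label text and its tooltip cannot drift apart when a mode is renamed.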

Related

Trajectory Evaluation: per-step error taxonomy + severity
Process Supervision (PRM): per-step reward labeling
Judge Alignment: judge↔human agreement on the tags
Three-Pane Trace Eval: richer trace display