Glossary

Definitions of annotation and agent-evaluation terms used in Potato. Each term links to the relevant feature.

Annotation

Annotation — adding structured labels (categories, spans, ratings, rankings) to data so it can train or evaluate models. In Potato, tasks are defined in YAML.

Span annotation — highlighting and labeling a contiguous segment of text (e.g. an entity, an error, a hallucinated claim). See span linking.

Inter-annotator agreement (IAA) — how consistently multiple annotators label the same items; measured with Cohen's κ, Fleiss' κ, or Krippendorff's α. See Quality Control.
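As an illustration of the simplest of these measures, Cohen's κ for two annotators can be sketched in a few lines. This is a minimal, dependency-free sketch and not Potato's implementation; the function name `cohens_kappa` and the example labels are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Hypothetical helper for illustration; not part of Potato's API.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same category
    # if each annotator labeled at random with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels from two annotators on four items.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A κ of 1 means perfect agreement and 0 means agreement no better than chance. Fleiss' κ and Krippendorff's α extend the same idea to more than two annotators and to missing labels.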

Adjudication — resolving disagreements between annotators to produce gold labels. See Adjudication.

Gold label — the agreed-upon correct annotation for an item, used as a reference for scoring judges or models.

Agentic evaluation

Agentic annotation / agent evaluation — evaluating an AI agent's run (reasoning, tool calls, outputs), not just static text. See the Agent Evaluation Guide.

Trace — a record of an agent's execution: the sequence of messages, tool calls, and observations for one task. Potato imports traces from OpenAI, Anthropic, LangChain, LangGraph, CrewAI, OpenTelemetry, and more.

Trajectory — the ordered sequence of an agent's steps (thoughts → tool calls → observations → answer). Trajectory evaluation scores each step.

Tool call (function call) — an agent's invocation of an external tool/function with arguments; a primary unit of agent evaluation.

Trajectory match — a deterministic evaluator that compares an agent's tool-call sequence to a reference (strict / unordered / subset / superset). See Evaluators.
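The four modes can be sketched as follows. This sketch assumes that only tool names are compared (arguments are ignored), that "subset" means the agent made no calls beyond the reference, and that "superset" means the agent made at least every reference call. Those semantics and the function name `trajectory_match` are assumptions for illustration, not Potato's confirmed behavior.

```python
from collections import Counter


def trajectory_match(actual, reference, mode="strict"):
    """Compare two tool-call name sequences under one of four modes.

    Hypothetical helper; the mode semantics are assumptions (see lead-in).
    """
    if mode == "strict":
        # Same calls in the same order.
        return list(actual) == list(reference)

    actual_counts = Counter(actual)
    reference_counts = Counter(reference)

    if mode == "unordered":
        # Same calls with the same multiplicity, in any order.
        return actual_counts == reference_counts
    if mode == "subset":
        # The agent made no call that the reference does not contain.
        return all(actual_counts[name] <= reference_counts[name] for name in actual_counts)
    if mode == "superset":
        # The agent made every reference call, possibly plus extras.
        return all(reference_counts[name] <= actual_counts[name] for name in reference_counts)
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up tool names for illustration.
reference = ["search", "fetch", "answer"]
reordered = ["fetch", "search", "answer"]
is_strict = trajectory_match(reordered, reference, mode="strict")
is_unordered = trajectory_match(reordered, reference, mode="unordered")
```

Because the check is deterministic, the same trace always gets the same verdict, which makes it a useful baseline next to rubric-based scoring.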

LLM-as-judge — using an LLM to score model/agent outputs against a rubric. Potato measures and calibrates the judge against human gold (Cohen's κ, ECE).
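ECE here stands for expected calibration error: how far a judge's stated confidence is from its actual accuracy against human gold labels. The sketch below assumes the judge reports a confidence in [0, 1] for each verdict and that each verdict has been marked correct or incorrect against the gold label. The function name and the equal-width binning are illustrative assumptions, not Potato's implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: judge confidence per item, each in [0, 1].
    correct: True where the judge's verdict matched the human gold label.
    Hypothetical helper for illustration.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and the same length")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_confidence = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's confidence/accuracy gap by its share of items.
        ece += (len(items) / n) * abs(accuracy - avg_confidence)
    return ece


# Made-up example: the judge claims 90% confidence but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

An ECE near 0 means the judge's confidence can be taken at face value; a large ECE means its scores need recalibration before they are trusted.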

Process Reward Model (PRM) — a model that scores the correctness of each step in a trajectory (not just the final answer); Potato's process_reward scheme collects this training data.

RLHF / DPO / SFT data — preference and demonstration data for training: SFT (prompt → completion), DPO (prompt → chosen / rejected). Potato exports these from pairwise, ranking, and trajectory-correction tasks.
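The two record shapes can be illustrated with plain dictionaries. The field names follow the definition above (prompt, completion, chosen, rejected); the helper names and the pairwise input shape are assumptions for illustration and do not describe Potato's actual export format.

```python
import json


def to_sft_record(prompt, completion):
    """One SFT demonstration: a prompt and the completion to imitate."""
    return {"prompt": prompt, "completion": completion}


def pairwise_to_dpo_record(prompt, response_a, response_b, preferred):
    """Turn one pairwise judgment into a DPO preference record.

    preferred: "a" or "b", the response the annotator picked.
    Hypothetical helper for illustration.
    """
    if preferred not in ("a", "b"):
        raise ValueError("preferred must be 'a' or 'b'")
    if preferred == "a":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Made-up annotation for illustration.
dpo = pairwise_to_dpo_record(
    prompt="Summarize the ticket.",
    response_a="The user cannot log in after the password reset.",
    response_b="Ticket.",
    preferred="a",
)
jsonl_line = json.dumps(dpo)  # one line of a JSONL training file
```

Rankings can be reduced to the same shape by emitting one record for each pair in which one response is ranked above another.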

Experiment — one run of evaluators over a versioned dataset, producing comparable aggregate scores over time.

Dynamic slice — a saved semantic + metadata filter that auto-includes new matching traces, used to find what to review. See Semantic Curation.

Model arena — sending one prompt to several models side by side and recording which is best (a win-rate leaderboard). See Model Arena.
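A win-rate leaderboard of this kind can be computed from the recorded votes. The sketch below assumes each vote stores the models that were shown and the single winner, and defines win rate as wins divided by appearances. Both the data shape and the function name are assumptions for illustration, not Potato's implementation.

```python
from collections import defaultdict


def win_rate_leaderboard(votes):
    """Rank models by wins divided by appearances.

    votes: iterable of (models_shown, winner) pairs.
    Hypothetical helper for illustration.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for models_shown, winner in votes:
        if winner not in models_shown:
            raise ValueError(f"winner {winner!r} was not among the models shown")
        for model in models_shown:
            appearances[model] += 1
        wins[winner] += 1

    board = [(model, wins[model] / appearances[model]) for model in appearances]
    # Highest win rate first; ties broken alphabetically for a stable order.
    return sorted(board, key=lambda row: (-row[1], row[0]))


# Made-up votes over three placeholder model names.
votes = [
    (("model-a", "model-b", "model-c"), "model-a"),
    (("model-a", "model-b"), "model-b"),
    (("model-a", "model-c"), "model-a"),
]
leaderboard = win_rate_leaderboard(votes)
```

Dividing by appearances rather than total votes keeps the comparison fair when some models are shown more often than others.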