Frequently Asked Questions

Short, direct answers to common questions about Potato, annotation, and agent (agentic) evaluation. See the linked guides for detail.

What is Potato?

Potato is a free, open-source, self-hosted annotation and agent-evaluation platform for NLP, agentic, and GenAI research. You configure tasks entirely in YAML — no coding — to annotate text, audio, video, images, documents, and AI agent traces. See Quick Start.

What is agentic annotation?

Agentic annotation is the practice of evaluating AI agent runs — their reasoning steps, tool calls, and final outputs — rather than just labeling static text. Potato renders agent trajectories and lets humans (and LLM judges) rate correctness step by step, mark error spans, edit trajectories, and compare agents. See the Agent Evaluation Guide.

Is Potato a free alternative to LangSmith, Labelbox, or Braintrust?

Yes. Potato is free and self-hosted, and it covers the agent-evaluation loop those tools provide (programmatic evaluators, versioned datasets and experiments, automation rules, CI gating, LLM-as-judge calibration, and a multi-model arena) without per-seat or per-trace fees and without sending your data to a third-party SaaS provider. See the comparison.

How do I evaluate AI agent trajectories?

Import a trace (OpenAI, Anthropic, LangChain, LangGraph, CrewAI, OpenTelemetry, and more), display it with the agent_trace or three-pane eval_trace view, and score it with human schemes or programmatic evaluators (deterministic trajectory match, tool-use correctness, LLM-judge). See Agent Traces and Trajectory Evaluation.
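As a rough illustration of what a deterministic trajectory match checks, the sketch below compares the tool names an agent actually called against an expected sequence. It is not Potato's evaluator: the function names `exact_match` and `in_order_match`, and the example tool names, are hypothetical and chosen only for this example.

```python
from typing import Sequence


def exact_match(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """True when the agent called exactly the expected tools, in order."""
    return list(expected) == list(actual)


def in_order_match(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """True when the expected tools appear in order; extra calls are allowed."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator, so ordering is enforced.
    return all(name in remaining for name in expected)


# Hypothetical tool names, for illustration only.
expected_tools = ["search", "fetch_page", "summarize"]
actual_tools = ["search", "search", "fetch_page", "summarize"]

strict = exact_match(expected_tools, actual_tools)
relaxed = in_order_match(expected_tools, actual_tools)
print(strict, relaxed)
```

The strict check fails here because of the repeated `search` call, while the relaxed check passes. Which of the two is appropriate depends on whether redundant tool calls should count as errors in your task.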

How do I annotate agent traces and tool calls?

Convert traces with python -m potato.trace_converter, then annotate tool calls, reasoning, and observations per step. Tool calls render natively in the coding_trace and eval_trace displays. See Agent Traces.

How do I do LLM-as-judge evaluation and calibration?

Configure an LLM judge, run it over human-labeled items, and measure its agreement with the human labels (Cohen's κ) and the quality of its confidence estimates (ECE/Brier). Potato can auto-calibrate the judge by turning human corrections into few-shot examples, and it can judge categorical, span, and free-text outputs. See Judge Alignment and Judge Calibration.
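To make the agreement metrics concrete, here is a minimal, library-free sketch of Cohen's κ and the Brier score computed over paired human and judge labels. It is not Potato's implementation. The label values and judge confidences are made-up illustration data, and the function names are assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("need two non-empty sequences of equal length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) == 0 or len(probs) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Illustration data only.
human = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
judge = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
judge_confidence = [0.9, 0.4, 0.2, 0.8, 0.1, 0.7]  # judge's P(item is correct)
human_binary = [1 if label == "correct" else 0 for label in human]

kappa = cohens_kappa(human, judge)
brier = brier_score(judge_confidence, human_binary)
print(f"kappa={kappa:.3f} brier={brier:.3f}")
```

A κ near 1 means the judge agrees with humans well beyond chance, and a lower Brier score means its confidences track the human labels more closely.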

How do I collect RLHF / DPO / preference data?

Use pairwise or ranking schemes (or the model arena for side-by-side model comparison), then export the results to SFT/DPO formats via the trajectory-correction and dataset exporters.
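For orientation, the sketch below shows how pairwise judgments map onto the prompt/chosen/rejected layout that DPO training tools commonly expect. It does not reproduce Potato's exporters or export schema: the input field names (`response_a`, `response_b`, `preferred`) and the function `to_dpo_records` are hypothetical.

```python
import json
from typing import Dict, List


def to_dpo_records(judgments: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn pairwise judgments into prompt/chosen/rejected records."""
    records = []
    for item in judgments:
        preferred = item.get("preferred")
        if preferred not in ("a", "b"):
            continue  # skip ties and unlabeled pairs
        if preferred == "a":
            chosen, rejected = item["response_a"], item["response_b"]
        else:
            chosen, rejected = item["response_b"], item["response_a"]
        records.append(
            {"prompt": item["prompt"], "chosen": chosen, "rejected": rejected}
        )
    return records


# Hypothetical annotation output, for illustration only.
judgments = [
    {"prompt": "Summarize the report.", "response_a": "Short summary.",
     "response_b": "Off-topic reply.", "preferred": "a"},
    {"prompt": "Translate 'hello'.", "response_a": "bonjour",
     "response_b": "hola", "preferred": "tie"},
]

records = to_dpo_records(judgments)
for record in records:
    print(json.dumps(record))  # one JSON object per line (JSONL)
```

Ties are dropped here because a DPO pair needs a clear winner. Whether to drop them or resolve them another way is a design choice for your dataset.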

Can I gate CI on evaluation quality?

Yes. The pytest plugin runs your evaluations in CI and fails the build when a metric drops below a threshold (for example, --potato-threshold correct=0.8), so prompt and model regressions are caught on every PR.
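The gating logic itself is simple, and the sketch below illustrates the idea of comparing measured metrics against a `name=value` threshold such as `correct=0.8`. It is not the plugin's code: `parse_threshold` and `failing_metrics` are hypothetical helpers, and treating a missing metric as 0.0 is an assumption made for this example.

```python
from typing import Dict, List, Tuple


def parse_threshold(spec: str) -> Tuple[str, float]:
    """Split a 'name=value' spec such as 'correct=0.8'."""
    name, _, value = spec.partition("=")
    if not name.strip() or not value.strip():
        raise ValueError(f"expected 'name=value', got {spec!r}")
    return name.strip(), float(value)


def failing_metrics(
    metrics: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Describe every metric that falls below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        measured = metrics.get(name, 0.0)  # assumption: missing counts as 0.0
        if measured < minimum:
            failures.append(f"{name}: {measured:.2f} < {minimum:.2f}")
    return failures


name, minimum = parse_threshold("correct=0.8")
failures = failing_metrics({"correct": 0.75}, {name: minimum})
print(failures)
```

In a CI job, a non-empty `failures` list would translate into a failed test or a non-zero exit code.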

How do I capture agent runs from my own code?

Instrument your agent with the potato_trace SDK: decorate functions with @traceable, and their runs are captured and sent to Potato. An OpenTelemetry exporter is also available.
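To show the general pattern behind decorator-based tracing, here is a self-contained sketch that records each call's inputs, output, and duration. It is not the potato_trace SDK and does not use its API: `traceable_sketch`, `CAPTURED_RUNS`, and `lookup_weather` are hypothetical names, and an in-memory list stands in for sending data to a server.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Stand-in for a trace backend; a real SDK would ship records elsewhere.
CAPTURED_RUNS: List[Dict[str, Any]] = []


def traceable_sketch(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record the inputs, output or error, and duration of each call."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        started = time.perf_counter()
        record: Dict[str, Any] = {
            "name": func.__name__,
            "args": args,
            "kwargs": kwargs,
        }
        try:
            result = func(*args, **kwargs)
            record["output"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call is captured.
            record["seconds"] = time.perf_counter() - started
            CAPTURED_RUNS.append(record)

    return wrapper


@traceable_sketch
def lookup_weather(city: str) -> str:
    """Hypothetical agent tool."""
    return f"Sunny in {city}"


answer = lookup_weather("Oslo")
print(CAPTURED_RUNS[0]["name"], CAPTURED_RUNS[0]["output"])
```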

Does Potato support crowdsourcing (MTurk, Prolific)?

Yes. Potato integrates natively with MTurk and Prolific, including platform-specific authentication, and provides quality control, training phases, and inter-annotator agreement.

What annotation types does Potato support?

Potato supports 20+ schemes: radio/checkbox/Likert, span/NER, span linking, coreference, ranking, best-worst scaling, pairwise/conjoint comparison, sliders, soft labels, rubric grids, and agent-specific schemes (trajectory eval, process-reward, code review). See the Schema Gallery.

Is my data private?

Yes. Potato is self-hosted — your data stays on your infrastructure. There is no required external service.

How do I install and run Potato?

pip install potato-annotation
python potato/flask_server.py start <config.yaml> -p 8000

See Quick Start and Usage.

Glossary — definitions of annotation & agent-evaluation terms
Agent Evaluation Guide
Comparison with other tools