Datasets & Experiments¶
Potato's evaluation backbone: versioned datasets of evaluation examples and experiments that score outputs against them with programmatic evaluators. Together they turn Potato from "annotate once" into "evaluate continuously" — curate a test set, run evaluators, and track scores across prompt/model versions over time.
Enabling¶
datasets:
enabled: true
storage: file # "file" (default) | "sqlite"
storage selects the backend:
file— git-diffable JSONL snapshots under<output_annotation_dir>/eval_store/datasets/. The default.sqlite— a single<output_annotation_dir>/eval_store/datasets.sqlite, for large dataset/experiment counts and faster queries.
When enabled, the admin dashboard shows a Datasets & Experiments link in its header.
Datasets¶
A dataset is a named collection of examples:
| Field | Meaning |
|---|---|
id |
Unique example id |
inputs |
The task input (prompt, trace, question) |
reference_outputs |
Optional gold output / reference trajectory |
metadata |
Arbitrary extra fields (e.g. outputs, rejected, source) |
split |
test (default), train, … |
Versioning¶
Every add/update/delete creates a new immutable version (v0001, v0002,
…). Versions can be tagged (e.g. prod), and reads can pin a version with
as_of:
as_of=latest(default) — newest versionas_of=prod— the version carrying theprodtagas_of=v0002— an explicit version id
A tag points to exactly one version; re-tagging moves it.
Curating examples¶
- From the live task — Import loaded instances (or
POST .../import_instances) turns the task's loaded instances into examples. - From ingested traces only — Import ingested traces (or
POST .../import_traces) imports just the runtime-ingested traces (webhook / LangSmith / Langfuse), optionally filtered to onesource. - With human annotations as references — tick include human annotations as
references (or pass
include_annotations: true). The aggregated human annotation per scheme becomes each example'sreference_outputs. Two methods (aggregation_method):majority(default) — exact-match majority vote; vote counts + agreement recorded in metadata.dawid_skene— worker-reliability-weighted consensus. Dawid-Skene jointly estimates each annotator's reliability (via EM over their confusion matrices) and re-weights votes accordingly, so a careful annotator outvotes a careless one and you get a per-example confidence. This is the standard upgrade over majority vote for noisy crowds; per-annotator reliability and per-example confidence are recorded in metadata. (Seepotato/server_utils/consensus.py.)
- Via the API — add examples directly (see below).
- Otherwise reference outputs can be added later, or scored against
metadata['outputs'].
Experiments¶
An experiment runs one or more evaluators against a dataset version and
records per-example results plus aggregate scores. Pick a dataset and evaluators
on the overview page and click Run, or POST .../experiments/run.
The flagship agent evaluators (trajectory_match, tool_use,
tool_call_accuracy, llm_trajectory_judge) appear first in the picker. See
Evaluators for the full list and semantics.
LLM-judge evaluators call your configured
ai_supportendpoint and may take a while on large datasets.
Comparing experiments¶
Select two or more experiments and Compare to see aggregate scores side by side. The first is the baseline; deltas and the best value per metric are highlighted so regressions stand out.
Each delta is annotated with a paired-bootstrap significance badge and a 95%
confidence interval, computed per-example against the baseline (aligned by
example_id). A change is flagged significant only when its CI excludes 0 —
so a +0.02 that's really noise reads as n.s., while a decisive gain reads as
significant. This is the same statistics layer
(potato/server_utils/eval_stats.py: bootstrap CIs, Wilson intervals for
win-rates, paired significance) used by the Model Arena
leaderboard, so error bars and significance are consistent across the suite.
Export to fine-tuning data¶
Any dataset version exports to JSONL for fine-tuning, reusing the same record shapes as the trajectory correction exporter:
- SFT —
{"prompt": <inputs>, "completion": <reference_outputs>}(examples withoutreference_outputsare skipped). - DPO —
{"prompt": <inputs>, "chosen": <reference_outputs>, "rejected": <metadata.rejected | metadata.outputs>}(examples without both are skipped).
Use the Export SFT / Export DPO buttons on the dataset detail page, or:
curl -OJ "http://localhost:8000/datasets/api/datasets/agent-eval-v1/export?format=sft"
The X-Skipped-Examples response header reports how many examples were skipped.
Reward data & the eval→improve loop¶
Beyond SFT/DPO, the suite turns evaluations into reward-model data and closes the loop to prompt improvement — each step human-grounded:
- Rubrics-as-Rewards (E9) —
GET /datasets/api/experiments/<id>/export_rewardsconverts an experiment's rubric-DAG / agent-as-judge results into criterion-level reward rows ({prompt, response, reward, criteria}) for RM/RLVR training in non-verifiable domains. (server_utils/rubric_reward.py.) - Active preference sampling (E10) —
GET /admin/arena/api/suggest_pairs?strategy=uncertaintyranks which response pairs to label next by how informative the comparison is (closest Bradley-Terry scores first), with an honestrandombaseline. (server_utils/active_preference.py.) - Metric induction (E11) —
GET /admin/api/induce-metrics?schema=<free-text scheme>mines recurring evaluation metrics from annotators' free-text comments (AutoLibra- style) and proposes candidates for a human to confirm into a rubric. (server_utils/metric_induction.py.) - Eval→improve export (E12) —
GET /datasets/api/datasets/<name>/optimize_export?fmt=gepaexports a curated dataset as a GEPA/DSPy optimization trainset; the optimizer proposes a prompt rewrite, surfaced as aPromptDiffa human approves or rejects before it ships (the optimizer never silently changes the prompt). (server_utils/prompt_optimization.py.)
API reference¶
All endpoints require admin auth (X-API-Key header or admin session).
| Method | Path | Purpose |
|---|---|---|
| GET | /datasets/api/datasets |
List datasets |
| POST | /datasets/api/datasets |
Create a dataset |
| GET | /datasets/api/datasets/<name> |
Dataset detail (versions) |
| DELETE | /datasets/api/datasets/<name> |
Delete a dataset |
| GET | /datasets/api/datasets/<name>/examples |
List examples (as_of, splits) |
| POST | /datasets/api/datasets/<name>/examples |
Add examples (new version) |
| POST | /datasets/api/datasets/<name>/tag |
Tag a version |
| GET | /datasets/api/datasets/<name>/export?format=sft\|dpo |
Export JSONL |
| POST | /datasets/api/datasets/<name>/import_instances |
Curate from loaded instances (include_annotations) |
| POST | /datasets/api/datasets/<name>/import_traces |
Curate from ingested traces (source, include_annotations) |
| POST | /datasets/api/experiments/run |
Run an experiment |
| GET | /datasets/api/experiments |
List experiments (summaries) |
| GET | /datasets/api/experiments/<id> |
Experiment detail |
Inspecting & controlling the annotation process¶
The eval-admin API (/admin/eval/..., admin-only) inspects and controls the
annotation process for these tasks — surfaced as the Annotation process panel
on the overview page:
| Method | Path | Purpose |
|---|---|---|
| GET | /admin/eval/status |
Overview: datasets, experiments, annotation progress, ingested-trace counts, assignment state |
| GET | /admin/eval/progress |
Per-instance annotation status (source, #annotators, saturated, triage priority) |
| GET | /admin/eval/ingested_traces |
Runtime-ingested traces with source breakdown |
| POST | /admin/eval/assignment |
{action: "pause"\|"resume"} — freeze/resume new assignments (existing assignments untouched) |
For full inter-annotator agreement use /admin/iaa; for
per-annotator timing use /admin/api/annotators.
Example¶
curl -X POST http://localhost:8000/datasets/api/datasets \
-H "Content-Type: application/json" \
-d '{"name": "agent-eval-v1", "description": "Tool-use correctness"}'
curl -X POST http://localhost:8000/datasets/api/experiments/run \
-H "Content-Type: application/json" \
-d '{"dataset": "agent-eval-v1",
"evaluators": [{"name": "trajectory_match", "params": {"mode": "unordered"}}]}'
Example project¶
examples/agent-traces/experiments/ is a runnable demo:
python potato/flask_server.py start examples/agent-traces/experiments/config.yaml -p 8000