Release Notes: v2.6.1 — Agentic Evaluation Suite (G1–G10)

This release builds out a full agent-evaluation loop on top of Potato's annotation core, closing the gaps identified against LangSmith and Labelbox. It adds programmatic evaluators, versioned datasets and experiments, a tracing SDK, an automation-rules engine, pytest CI gating, automated judge calibration, span and free-text judging, semantic curation, and a multi-model arena. Together these cover the loop end to end: capture → automate → curate → evaluate → gate → calibrate. Every feature ships with tests, docs, a runnable example, and an admin UI.

Programmatic evaluators (`potato.evaluators`)

A Flask-free, dependency-light evaluator library usable in the server, CI, and the automation engine. It provides deterministic trajectory match (strict/unordered/subset/superset with per-tool argument matching), tool-use correctness, a reference-free LLM judge, and heuristics (exact/contains/regex/edit-distance/JSON/embedding). An optional lazy adapter wraps the MIT-licensed agentevals package. See Programmatic Evaluators.
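To make the four trajectory-match modes concrete, here is a minimal standalone sketch. It does not use the `potato.evaluators` API; the function name `trajectory_match`, the `(tool_name, arguments)` tuple shape, and the exact subset/superset semantics are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Any

# Hypothetical representation: a tool call is (tool name, argument dict).
ToolCall = tuple[str, dict[str, Any]]


def _key(call: ToolCall):
    """Turn a tool call into a hashable key that includes its arguments."""
    name, args = call
    return (name, tuple(sorted((k, repr(v)) for k, v in args.items())))


def trajectory_match(actual: list[ToolCall], reference: list[ToolCall],
                     mode: str = "strict") -> bool:
    """Compare an agent's tool-call trajectory with a reference trajectory.

    Assumed semantics:
      strict    - same calls in the same order
      unordered - same calls, any order
      subset    - every actual call also appears in the reference
      superset  - the actual calls cover every reference call
    """
    a = [_key(c) for c in actual]
    r = [_key(c) for c in reference]
    if mode == "strict":
        return a == r
    count_a, count_r = Counter(a), Counter(r)
    if mode == "unordered":
        return count_a == count_r
    if mode == "subset":
        return not (count_a - count_r)  # nothing left over in actual
    if mode == "superset":
        return not (count_r - count_a)  # nothing missing from actual
    raise ValueError(f"unknown mode: {mode}")
```

Because arguments are part of the key, a call to the right tool with the wrong arguments counts as a mismatch in every mode.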

Datasets & Experiments (`potato.eval_datasets`, `potato.experiments`)

First-class versioned datasets (snapshot-per-mutation, tags, as_of, splits) behind a pluggable file/sqlite store, and experiments that score a dataset version with evaluators and compare runs over time. Curate examples from loaded instances, ingested traces, or aggregated human annotations; export to SFT/DPO. Ships with an admin UI and the eval inspect/control API (/admin/eval/*, including assignment pause/resume). See Datasets & Experiments.
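The snapshot-per-mutation idea can be sketched in a few lines. This is an in-memory illustration only, not the `potato.eval_datasets` store: the class `VersionedDataset` and its methods are hypothetical, and `as_of` here accepts a version number or tag name, which may differ from what the real API accepts.

```python
import copy


class VersionedDataset:
    """Hypothetical in-memory dataset where every mutation creates a snapshot."""

    def __init__(self):
        self._snapshots = [{}]  # version 0 is the empty dataset
        self._tags = {}         # tag name -> version number

    @property
    def version(self):
        return len(self._snapshots) - 1

    def _commit(self, examples):
        self._snapshots.append(examples)
        return self.version

    def add(self, example_id, example):
        nxt = dict(self._snapshots[-1])  # copy, so older snapshots stay intact
        nxt[example_id] = copy.deepcopy(example)
        return self._commit(nxt)

    def remove(self, example_id):
        nxt = dict(self._snapshots[-1])
        del nxt[example_id]
        return self._commit(nxt)

    def tag(self, name, version=None):
        self._tags[name] = self.version if version is None else version

    def as_of(self, ref):
        """Return the examples as they were at a version number or tag."""
        version = self._tags[ref] if isinstance(ref, str) else ref
        return copy.deepcopy(self._snapshots[version])
```

Pinning an experiment to a version or tag is what makes runs comparable over time: later edits to the dataset do not change what an earlier run was scored against.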

Tracing SDK (`potato_trace`)

A lightweight, top-level @traceable decorator + context manager (sync & async) that captures nested run trees and ships them to Potato's ingestion webhook, plus an optional OpenTelemetry exporter. It is dependency-light: importing it never pulls in Flask. See Tracing SDK.

Automation rules engine (`potato.automation`)

A programmable filter → sampling rate → actions engine over every item entering Potato (loaded or runtime-ingested). Actions route items to the annotation queue, curate them into datasets, run evaluators, fire outbound webhooks, or notify annotators (heavy actions run on a background worker; sampling is deterministic). The generic webhook normalizer now carries through status/feedback/score/tags so triage and automation rules can match those signals. See Automation Rules.
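One common way to make sampling deterministic is to hash the item identifier into a bucket and compare it with the rate. The sketch below shows that technique together with a toy filter → sampling → actions pipeline. Whether `potato.automation` uses this exact scheme is an assumption, and `should_sample`, `apply_rule`, and the rule dictionary shape are hypothetical names.

```python
import hashlib


def should_sample(item_id: str, rule_name: str, rate: float) -> bool:
    """Deterministically decide whether an item falls inside the sampling rate.

    The same (rule, item) pair always gets the same answer, across restarts
    and across machines, because the decision depends only on the hash.
    """
    digest = hashlib.sha256(f"{rule_name}:{item_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def apply_rule(item: dict, rule: dict) -> list[str]:
    """Return the action names a rule would fire for an item (possibly none)."""
    if not rule["filter"](item):
        return []
    if not should_sample(item["id"], rule["name"], rule["rate"]):
        return []
    return list(rule["actions"])


# Hypothetical rule: route every low-scoring item to annotation and a dataset.
low_score_rule = {
    "name": "low-score-triage",
    "filter": lambda item: item.get("score", 1.0) < 0.5,
    "rate": 1.0,
    "actions": ["route_to_queue", "curate_to_dataset"],
}
```

Including the rule name in the hash input means two rules with the same rate select different, independent samples of the stream.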

CI evaluation (`potato.testing`)

A pytest plugin (@pytest.mark.potato_eval, the potato_eval fixture, and an expect(...) assertion API) that runs evaluations in your own test suite and fails the build on score-threshold regressions (--potato-threshold), with optional experiment recording. Registered via a pytest11 entry point. See CI Evaluation.
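The gating idea itself is simple: compute scores in a test and fail when the aggregate drops below a threshold. The sketch below shows it with plain Python assertions so that it runs without the plugin. It deliberately does not use the `potato_eval` marker, fixture, or `expect(...)` API, whose signatures are not shown here; `exact_match`, `assert_score_at_least`, and `fake_agent` are hypothetical stand-ins.

```python
from statistics import mean


def exact_match(output: str, expected: str) -> float:
    """Heuristic evaluator: 1.0 on an exact (whitespace-trimmed) match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def assert_score_at_least(scores: list[float], threshold: float) -> float:
    """Fail (as a test failure) when the mean score is below the threshold."""
    average = mean(scores)
    if average < threshold:
        raise AssertionError(
            f"mean score {average:.2f} is below threshold {threshold:.2f}"
        )
    return average


def fake_agent(question: str) -> str:
    """Stand-in for the system under test."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(question, "unknown")


def test_agent_meets_threshold():
    cases = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
    scores = [exact_match(fake_agent(q), a) for q, a in cases]
    assert_score_at_least(scores, threshold=0.6)
```

Run under pytest, a regression that pushes the mean below the threshold fails the test and therefore the build.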

Judge: automated calibration + span/free-text (`judge_alignment`)

Auto-calibration: instances where a human corrected the judge become few-shot examples (leakage-guarded). The judge is then re-run under a new prompt version, and base-vs-new κ is reported.
Beyond categorical: the judge now scores span (IoU-F1 vs human spans) and free-text (rubric dimensions: continuous/boolean/categorical) outputs, not just radio/likert.

See Judge Alignment.
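The two agreement metrics named above can be sketched as follows. Cohen's κ is the standard definition; the IoU-F1 variant (greedy one-to-one matching of predicted to human spans at an IoU threshold of 0.5) is one reasonable reading and an assumption, since the exact matching rule used by `judge_alignment` is not stated here. Function names are hypothetical.

```python
from collections import Counter

Span = tuple[int, int]  # (start, end) character offsets, end exclusive


def iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def span_f1(pred: list[Span], gold: list[Span], threshold: float = 0.5) -> float:
    """F1 where a predicted span counts as correct if it overlaps an
    unmatched human span with IoU >= threshold (greedy matching)."""
    if not pred and not gold:
        return 1.0
    unmatched = list(gold)
    true_pos = 0
    for p in pred:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / len(pred), true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    count_h, count_j = Counter(human), Counter(judge)
    expected = sum(count_h[k] * count_j[k] for k in count_h) / n**2
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Comparing `cohen_kappa(human, judge_base)` with `cohen_kappa(human, judge_new)` on held-out instances gives the base-vs-new κ comparison; the held-out part matters because the corrected instances used as few-shot examples would otherwise inflate the new score.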

`eval_trace` span annotation

The three-pane reasoning | function-calls | final-answer display is now span-annotatable (span_target: true) for fine-grained error localization, via the shared multi-field span pipeline. See eval_trace.

Semantic curation / Catalog (`potato.curation`)

An embedding index (lazy, pluggable; no ML import at startup) powering similarity search ("find traces like this failure") and dynamic slices — saved semantic + metadata filters that auto-include new matching traces and curate into datasets. See Semantic Curation.
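A small sketch of the two ideas, similarity search and dynamic slices, using toy vectors and brute-force cosine similarity. It is not the `potato.curation` index: `most_similar`, `DynamicSlice`, and the trace dictionary shape (`id`, `vec`, `meta`) are hypothetical, and a real index would use an embedding model and likely an approximate-nearest-neighbour structure.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def most_similar(index: list[dict], query_vec: list[float], k: int = 3) -> list[str]:
    """Return the ids of the k traces closest to the query vector."""
    ranked = sorted(index, key=lambda t: cosine(t["vec"], query_vec), reverse=True)
    return [t["id"] for t in ranked[:k]]


class DynamicSlice:
    """A saved semantic + metadata filter, re-evaluated whenever it is read."""

    def __init__(self, query_vec, min_similarity, metadata=None):
        self.query_vec = query_vec
        self.min_similarity = min_similarity
        self.metadata = metadata or {}

    def matches(self, trace: dict) -> bool:
        meta_ok = all(trace["meta"].get(k) == v for k, v in self.metadata.items())
        return meta_ok and cosine(trace["vec"], self.query_vec) >= self.min_similarity

    def members(self, index: list[dict]) -> list[str]:
        # No membership list is stored, so newly added traces that match
        # the filter are included automatically on the next read.
        return [t["id"] for t in index if self.matches(t)]
```

Storing the filter rather than its results is what makes the slice "dynamic": membership is a function of the current index contents.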

Model arena (`potato.arena`)

Fan one prompt out to N models side by side (provider-agnostic via AIEndpointFactory), compare responses, pick the best, and track a win-rate leaderboard. One model failing never aborts the others. See Model Arena.
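The fan-out with failure isolation and the win-rate leaderboard can be sketched with the standard library alone. This does not use AIEndpointFactory or the `potato.arena` API: models are represented as plain callables, and `fan_out` and `Leaderboard` are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompt: str, models: dict) -> dict:
    """Send one prompt to every model callable in parallel.

    Each call is wrapped so that an exception from one model is captured
    as an error entry instead of aborting the others.
    """
    def call(fn):
        try:
            return {"ok": True, "text": fn(prompt)}
        except Exception as exc:
            return {"ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {name: pool.submit(call, fn) for name, fn in models.items()}
        return {name: future.result() for name, future in futures.items()}


class Leaderboard:
    """Tracks wins per model over the battles it took part in."""

    def __init__(self):
        self.wins = defaultdict(int)
        self.battles = defaultdict(int)

    def record(self, contestants: list[str], winner: str) -> None:
        for name in contestants:
            self.battles[name] += 1
        self.wins[winner] += 1

    def win_rates(self) -> dict:
        return {name: self.wins[name] / self.battles[name] for name in self.battles}
```

In this sketch only models that returned a response would be passed as `contestants`, so a provider outage lowers a model's number of battles rather than its win rate; whether the real leaderboard treats failures this way is an assumption.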

HuggingFace Spaces, models & docs

A manifest-driven HuggingFace Spaces deploy toolkit (build/deploy demo Spaces from spaces_manifest.yaml + 10 demo projects), a new Using HuggingFace Models guide, a Hub round-trip section in HuggingFace export, and accessibility/chrome polish (focus rings, larger nav targets).

Notes

New config blocks: datasets, automation, curation, arena — see the config reference.
The agent-eval datasets package is potato.eval_datasets (not datasets) to avoid shadowing the HuggingFace datasets library on the import path.
