Release Notes: v2.6.1 - Agentic Evaluation Suite (G1–G10)
This release builds out a full agent-evaluation loop on top of Potato's annotation core, closing the gaps identified against LangSmith and Labelbox. It adds programmatic evaluators, versioned datasets and experiments, a tracing SDK, an automation-rules engine, pytest CI gating, automated judge calibration, span/free-text judging, semantic curation, and a multi-model arena: capture → automate → curate → evaluate → gate → calibrate, end to end. Every feature ships with tests, docs, a runnable example, and an admin UI.
Programmatic evaluators (potato.evaluators)
A Flask-free, dependency-light evaluator library usable in the server, CI, and
the automation engine: deterministic trajectory match (strict/unordered/
subset/superset with per-tool argument matching), tool-use correctness,
reference-free LLM-judge, and heuristics (exact/contains/regex/edit-distance/
JSON/embedding). Optional lazy adapter for the MIT agentevals package. See
Programmatic Evaluators.
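To make the four trajectory-match modes concrete, here is a minimal standalone sketch. It does not use potato.evaluators; the function name, the {"name", "args"} call shape, and the exact meaning of subset/superset (actual relative to reference) are assumptions made for illustration.

```python
import json
from collections import Counter


def _key(call):
    # Canonical, hashable form of a tool call: name plus sorted-JSON arguments.
    return (call["name"], json.dumps(call.get("args", {}), sort_keys=True))


def trajectory_match(actual, reference, mode="strict"):
    """Compare two tool-call trajectories (hypothetical helper, not Potato's API)."""
    a = [_key(c) for c in actual]
    r = [_key(c) for c in reference]
    if mode == "strict":        # same calls, same order
        return a == r
    ca, cr = Counter(a), Counter(r)
    if mode == "unordered":     # same calls, any order
        return ca == cr
    if mode == "subset":        # every actual call appears in the reference
        return not (ca - cr)
    if mode == "superset":      # every reference call appears in the actual run
        return not (cr - ca)
    raise ValueError(f"unknown mode: {mode}")


reference = [
    {"name": "search", "args": {"q": "potato"}},
    {"name": "fetch", "args": {"id": 1}},
]
swapped = list(reversed(reference))
extra = reference + [{"name": "summarize", "args": {}}]
```

Under these assumptions, swapped fails strict but passes unordered, and extra passes superset but fails subset.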
Datasets & Experiments (potato.eval_datasets, potato.experiments)
First-class versioned datasets (snapshot-per-mutation, tags, as_of, splits)
behind a pluggable file/sqlite store, and experiments that score a
dataset version with evaluators and compare runs over time. Curate examples from
loaded instances, ingested traces, or aggregated human annotations; export to
SFT/DPO. Admin UI + the eval inspect/control API (/admin/eval/*, including
assignment pause/resume). See Datasets & Experiments.
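The snapshot-per-mutation idea can be illustrated with a small in-memory sketch. This is not the potato.eval_datasets store: the class name and methods are hypothetical, and as_of here resolves a version number or tag, whereas the real store may also resolve timestamps.

```python
import copy


class VersionedDataset:
    """Toy dataset where every mutation creates a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [[]]   # version 0 is the empty dataset
        self._tags = {}

    @property
    def version(self):
        return len(self._snapshots) - 1

    def add(self, example):
        # Copy the latest snapshot, append, and store it as the next version.
        nxt = copy.deepcopy(self._snapshots[-1])
        nxt.append(copy.deepcopy(example))
        self._snapshots.append(nxt)
        return self.version

    def tag(self, name, version=None):
        # Pin a human-readable name to a version (defaults to the latest).
        self._tags[name] = self.version if version is None else version

    def as_of(self, ref):
        # Resolve a tag name or version number to a copy of that snapshot.
        v = self._tags[ref] if isinstance(ref, str) else ref
        return copy.deepcopy(self._snapshots[v])


ds = VersionedDataset()
v1 = ds.add({"input": "a", "expected": "A"})
ds.tag("baseline")
v2 = ds.add({"input": "b", "expected": "B"})
```

An experiment scored against as_of("baseline") keeps seeing one example even after later additions, which is what makes runs comparable over time.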
Tracing SDK (potato_trace)
A lightweight, top-level @traceable decorator + context manager (sync & async)
that captures nested run trees and ships them to Potato's ingestion webhook, plus
an optional OpenTelemetry exporter. Dependency-light — importing it never pulls
Flask. See Tracing SDK.
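As a rough illustration of how a decorator can capture nested run trees, the sketch below uses contextvars to track the current parent run. It is not the potato_trace implementation: the run fields and ROOT_RUNS are hypothetical, only sync functions are handled, and finished runs are kept in a list instead of being shipped to a webhook.

```python
import contextvars
import functools
import time

_current = contextvars.ContextVar("current_run", default=None)
ROOT_RUNS = []  # finished top-level runs; a real SDK would send these somewhere


def traceable(fn):
    """Record each call as a run and nest it under the active parent run."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        run = {"name": fn.__name__, "children": [], "error": None}
        parent = _current.get()
        token = _current.set(run)          # this run becomes the parent
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            run["output"] = result
            return result
        except Exception as exc:
            run["error"] = repr(exc)
            raise
        finally:
            run["seconds"] = time.perf_counter() - start
            _current.reset(token)          # restore the previous parent
            target = parent["children"] if parent is not None else ROOT_RUNS
            target.append(run)
    return wrapper


@traceable
def lookup(query):
    return query.upper()


@traceable
def agent(query):
    return lookup(query) + "!"


agent("hi")
```

After the call, ROOT_RUNS holds one agent run whose children list contains the lookup run.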
Automation rules engine (potato.automation)
A programmable filter → sampling rate → actions engine over every item entering
Potato (loaded or runtime-ingested). Actions route items to the annotation queue,
curate them into datasets, run evaluators, fire outbound webhooks, or notify
annotators (heavy actions run on a background worker; sampling is deterministic).
The generic webhook normalizer now carries through status/feedback/score/
tags so triage and automation rules can match those signals. See
Automation Rules.
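One common way to get deterministic sampling is to hash the item id into a fixed bucket, so the same item always gets the same decision for a given rule. The sketch below shows that technique inside a filter → sampling rate → actions pipeline; the rule shape and function names are hypothetical and are not taken from potato.automation.

```python
import hashlib


def sampled(item_id, rule_name, rate):
    """Deterministically decide whether an item falls inside the sampling rate."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    digest = hashlib.sha256(f"{rule_name}:{item_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # roughly uniform in [0, 1]
    return bucket < rate


def apply_rule(rule, item):
    """Run filter, then sampling, then every action; return the action results."""
    if not rule["filter"](item):
        return []
    if not sampled(item["id"], rule["name"], rule["rate"]):
        return []
    return [action(item) for action in rule["actions"]]


# Hypothetical rule: route every errored trace to the annotation queue.
rule = {
    "name": "errors-to-queue",
    "filter": lambda item: item.get("status") == "error",
    "rate": 1.0,
    "actions": [lambda item: ("queue", item["id"])],
}
```

Including the rule name in the hash means two rules with the same rate do not select the same items.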
CI evaluation (potato.testing)
A pytest plugin (@pytest.mark.potato_eval, the potato_eval fixture, and an
expect(...) assertion API) that runs evaluations in your own test suite and
fails the build on score-threshold regressions (--potato-threshold), with
optional experiment recording. Registered via a pytest11 entry point. See
CI Evaluation.
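The gating idea itself is simple: score a dataset, then fail the test when the score drops below a threshold. The sketch below shows it as a plain pytest-style test function without the plugin; it does not use the potato_eval marker, fixture, or expect(...) API, and every name in it is hypothetical.

```python
from statistics import mean


def exact_match(output, expected):
    # Heuristic evaluator: 1.0 on an exact (whitespace-trimmed) match, else 0.0.
    return 1.0 if output.strip() == expected.strip() else 0.0


def mean_score(target, dataset, evaluator):
    # Run the target on every example and average the evaluator scores.
    return mean(evaluator(target(ex["input"]), ex["expected"]) for ex in dataset)


def assert_threshold(score, threshold):
    # Raising AssertionError is what makes pytest mark the test as failed.
    if score < threshold:
        raise AssertionError(f"score {score:.3f} is below threshold {threshold:.3f}")


DATASET = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]


def toy_agent(prompt):
    # Stand-in for the system under test.
    left, right = prompt.split("+")
    return str(int(left) + int(right))


def test_agent_meets_threshold():
    assert_threshold(mean_score(toy_agent, DATASET, exact_match), threshold=0.9)
```

Because the check is an ordinary test failure, any CI system that runs pytest will block the build without extra wiring.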
Judge: automated calibration + span/free-text (judge_alignment)
- Auto-calibration — instances where a human corrected the judge become few-shot examples (leakage-guarded), re-run under a new prompt version, with base-vs-new κ reported.
- Beyond categorical — the judge now scores span (IoU-F1 vs human spans) and free-text (rubric dimensions: continuous/boolean/categorical) outputs, not just radio/likert.
See Judge Alignment.
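For readers unfamiliar with the two metrics named above, here is a minimal sketch of Cohen's κ for categorical labels and of a span F1 in which a predicted span counts as correct when its IoU with an unmatched human span reaches a threshold. The matching strategy and the 0.5 threshold are assumptions; judge_alignment may define IoU-F1 differently.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    if expected == 1.0:          # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def span_iou(p, q):
    """Intersection over union of two half-open (start, end) character spans."""
    inter = max(0, min(p[1], q[1]) - max(p[0], q[0]))
    union = (p[1] - p[0]) + (q[1] - q[0]) - inter
    return inter / union if union else 0.0


def span_f1(pred, gold, threshold=0.5):
    """F1 over spans, greedily matching each prediction to its best gold span."""
    if not pred and not gold:
        return 1.0
    unmatched = list(gold)
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda g: span_iou(p, g), default=None)
        if best is not None and span_iou(p, best) >= threshold:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

Reporting κ for the base prompt and for the recalibrated prompt on the same held-out instances gives the base-vs-new comparison described above.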
eval_trace span annotation
The three-pane reasoning | function-calls | final-answer display is now
span-annotatable (span_target: true) for fine-grained error localization,
via the shared multi-field span pipeline. See eval_trace.
Semantic curation / Catalog (potato.curation)
An embedding index (lazy, pluggable; no ML import at startup) powering similarity search ("find traces like this failure") and dynamic slices — saved semantic + metadata filters that auto-include new matching traces and curate into datasets. See Semantic Curation.
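A small sketch of the two ideas, similarity search and a dynamic slice, using cosine similarity over precomputed vectors. Embeddings are assumed to exist already; the index layout, the slice definition fields, and the function names are hypothetical and not part of potato.curation.

```python
import math


def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(query_vec, index, k=3):
    """Return the ids of the k vectors closest to the query, best first."""
    scored = sorted(
        ((cosine(query_vec, vec), trace_id) for trace_id, vec in index.items()),
        reverse=True,
    )
    return [trace_id for _, trace_id in scored[:k]]


def in_slice(trace, slice_def):
    """A dynamic slice: a metadata filter plus a similarity floor to an anchor."""
    meta_ok = all(trace["meta"].get(k) == v for k, v in slice_def["where"].items())
    return meta_ok and cosine(trace["vec"], slice_def["anchor"]) >= slice_def["min_sim"]


index = {"t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [0.0, 1.0]}
slice_def = {"anchor": [1.0, 0.0], "min_sim": 0.9, "where": {"status": "error"}}
```

Evaluating in_slice on each newly ingested trace is one way a saved slice could pick up new matches automatically.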
Model arena (potato.arena)
Fan one prompt out to N models side by side (provider-agnostic via
AIEndpointFactory), compare responses, pick the best, and track a win-rate
leaderboard. One model failing never aborts the others. See
Model Arena.
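The failure-isolation and leaderboard behaviour can be sketched with the standard library alone. Models are represented here as plain callables rather than AIEndpointFactory endpoints, and the result and vote shapes are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompt, models):
    """Send one prompt to every model; a failure is recorded, never propagated."""
    def call(item):
        name, fn = item
        try:
            return name, {"ok": True, "text": fn(prompt)}
        except Exception as exc:
            return name, {"ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=len(models) or 1) as pool:
        return dict(pool.map(call, models.items()))


def win_rates(votes):
    """votes: list of (winner, participants) pairs; returns wins / appearances."""
    wins, games = Counter(), Counter()
    for winner, participants in votes:
        for model in participants:
            games[model] += 1
        wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


def broken(prompt):
    # Hypothetical model that is currently unavailable.
    raise RuntimeError("down")


results = fan_out("hi", {"a": lambda p: p.upper(), "b": broken})
rates = win_rates([("a", ["a", "b"]), ("a", ["a", "b"]), ("b", ["a", "b"])])
```

Catching the exception inside each worker, rather than around the whole fan-out, is what keeps one failing model from aborting the others.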
HuggingFace Spaces, models & docs
A manifest-driven HuggingFace Spaces deploy toolkit (build/deploy demo Spaces
from spaces_manifest.yaml + 10 demo projects), a new Using HuggingFace Models
guide, a Hub round-trip section in HuggingFace export, and accessibility/chrome
polish (focus rings, larger nav targets).
Notes
- New config blocks: datasets, automation, curation, arena (see the config reference).
- The agent-eval datasets package is potato.eval_datasets (not datasets) to avoid shadowing the HuggingFace datasets library on the import path.