Release Notes: v2.6.2 — Agent-Evaluation Differentiation (D/E series) + Multi-Agent & Multimodal Annotation (M series)
Building on the v2.6.1 agentic-evaluation suite (G1–G10), this release pushes Potato beyond parity with LangSmith/Labelbox into capabilities no open tool offers: a clickable multi-agent interaction graph, cross-agent failure attribution, purpose-built multimodal-agent annotation surfaces, statistically rigorous judge–human alignment, and reward/optimization export. It adds 13 new annotation schemas, new evaluators and analytics dashboards, and a set of robustness fixes. Every new schema ships with unit + Selenium persistence tests, a runnable example, and docs.
New evaluators & judge tooling (D-series)
- rubric_dag, rag_triad, agent_as_judge evaluators. The RAG triad scores context relevance / groundedness / answer relevance; rubric_dag evaluates a dependency graph of criteria; agent_as_judge runs an agentic judge.
- Judge robustness eval cards — verbosity-bias, position-swap consistency, and Expected Calibration Error (ECE).
- Hardened core judge JSON parsing to tolerate fenced/<think>-wrapped model output.
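As an illustration of what tolerating fenced or <think>-wrapped judge output can involve, here is a minimal sketch. It is not Potato's actual parser: the helper name parse_judge_json is hypothetical, and it assumes the reasoning sits inside a closed <think>...</think> block that precedes the JSON payload.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this example readable

# Assumption: reasoning is wrapped in a closed <think>...</think> block.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# Matches a fenced block with an optional "json" language tag.
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def parse_judge_json(raw: str) -> dict:
    """Hypothetical helper: extract a JSON object from noisy judge output."""
    text = _THINK_BLOCK.sub("", raw).strip()
    fenced = _FENCED.search(text)
    if fenced:
        text = fenced.group(1).strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Last resort: take the outermost {...} span and try again.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("judge output is valid JSON but not an object")
    return parsed
```

The fallback to the outermost brace span is a deliberate trade-off in this sketch: it accepts chatty preambles at the cost of possibly mis-parsing output that contains several separate JSON objects.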
Statistical rigor & consensus (D-series)
Bootstrap confidence intervals, Wilson intervals, paired significance tests, Dawid–Skene (EM) consensus, judge-drift trend tracking, and trace analytics.
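For readers unfamiliar with two of the interval types listed above, the following sketch shows a Wilson score interval for a proportion (for example, a judge–human agreement rate) and a percentile bootstrap interval for a mean. The function names, the 95% default, and the resample count are illustrative assumptions, not Potato's API.

```python
import math
import random
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_mean_ci(values, confidence: float = 0.95,
                      n_resamples: int = 2000, seed: int = 0):
    """Percentile bootstrap interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

With 8 agreements out of 10, wilson_interval(8, 10) gives roughly (0.49, 0.94), which shows how wide the uncertainty is at small sample sizes.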
Model arena (D-series)
Provider-agnostic multi-model arena with Elo / Bradley–Terry rankings and pairwise-preference (DPO) export; persona-driven multi-turn user simulation.
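To make the Elo ranking concrete, here is a minimal sketch of the standard Elo update applied to a sequence of pairwise outcomes. The K-factor of 32, the starting rating of 1000, and the function names are assumptions for illustration; the arena's actual parameters are not stated in these notes.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(matches, k: float = 32.0, initial: float = 1000.0):
    """Apply sequential Elo updates.

    `matches` is an iterable of (model_a, model_b, score_a), where score_a is
    1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    ratings = {}
    for model_a, model_b, score_a in matches:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ea = expected_score(ra, rb)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] = ra + k * (score_a - ea)
        ratings[model_b] = rb - k * (score_a - ea)
    return ratings
```

Elo is order-dependent because updates are sequential; Bradley–Terry instead fits all comparisons jointly, which is one reason an arena might report both.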
Curation, integrity & optimization (E-series)
- Failure-mode discovery — cosine k-means clustering of failures with LLM-labeled axes, surfaced on the Catalog.
- LLM-cheating / low-effort detection — Correlated-Agreement + LLM-echo signals on a new annotation-integrity dashboard.
- Perspectivist export — soft-label distributions that preserve disagreement.
- Reward export & optimization — Rubrics-as-Rewards export, active-preference sampling, automatic metric induction, and a prompt-optimization (eval→improve) loop.
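As a small illustration of the perspectivist idea in the list above, the sketch below turns per-annotator votes into a soft-label distribution instead of a majority label. The function name and input shape are hypothetical; the actual export format is not described here.

```python
from collections import Counter


def soft_label(votes, label_set):
    """Return the share of annotators choosing each label.

    Hypothetical helper: `votes` is one label per annotator for a single item,
    `label_set` is the full list of allowed labels (so unchosen labels get 0.0).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    unknown = set(counts) - set(label_set)
    if unknown:
        raise ValueError(f"votes contain labels outside label_set: {sorted(unknown)}")
    total = len(votes)
    return {label: counts[label] / total for label in label_set}
```

For votes of "toxic", "toxic", "ok", this keeps the one-third minority signal that a majority vote would discard.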
Multi-agent team annotation (M-series) — new
Annotate the team structure, not just a flat transcript:
- agent_interaction_graph — a clickable directed graph (nodes = agents, edges = handoffs); mark the critical path and flag problematic edges. (No open competitor offers this.)
- failure_attribution — responsible agent + decisive step + reason (Who&When).
- handoff_review — every agent→agent handoff as an annotatable object with inter-agent-misalignment flags + quality.
- agent_scorecard — per-agent role fidelity / contribution / coordination, team dimensions, and milestones (MultiAgentBench-style).
- tool_contention — per-agent lanes of concurrent tool calls; flag deadlock / race / shared-resource collisions.
- emergent_behavior — tag turn-sets spanning agents for collusion / groupthink / cascading errors / role drift.
- Recipes: MAST-at-step (the 14 MAST modes on trajectory_eval) and orchestration-pattern confirmation.
See Multi-Agent Team Annotation.
Multimodal-agent annotation (M-series) — new
For agents that drive GUIs, watch video, or hold spoken conversations:
- gui_trajectory — per-step screenshot + action with click-grounding (OSWorld / ScreenSpot / AndroidWorld).
- voice_interaction — dual-track (user/agent) timeline with barge-in/overlap classification (Full-Duplex-Bench).
- temporal_grounding — mark gold event intervals on video with a live IoU vs. the model's prediction (ET-Bench / TimeScope).
- speech_transcript — per-segment ASR/TTS/pronunciation/disfluency tagging with inline correction.
- multimodal_reasoning — rate each step of an interleaved text↔image↔tool trace for coherence and visual hallucination.
- tool_call_review — per-tool-call correctness (right tool / args / order; BFCL / MCPMark).
- table_grid — document-table cell-structure annotation (OmniDocBench / RealHiTBench).
See Multimodal-Agent Annotation.
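The IoU shown by temporal_grounding compares a gold interval with a predicted one. A minimal sketch of the usual one-dimensional intersection-over-union follows; the function name and the (start, end) tuple representation in seconds are assumptions, not the schema's actual data model.

```python
def temporal_iou(gold, pred) -> float:
    """Intersection-over-union of two (start, end) time intervals."""
    gold_start, gold_end = gold
    pred_start, pred_end = pred
    if gold_end < gold_start or pred_end < pred_start:
        raise ValueError("interval end must not precede its start")
    intersection = max(0.0, min(gold_end, pred_end) - max(gold_start, pred_start))
    union = (gold_end - gold_start) + (pred_end - pred_start) - intersection
    # Two zero-length intervals have no union; treat that as no overlap.
    return intersection / union if union > 0 else 0.0
```

For a gold event at 2–6 s and a prediction at 4–8 s, the overlap is 2 s and the union is 6 s, so the IoU is one third.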
Fixes
- agent_interaction_graph — persist edge classification before re-render so the restore path can't clobber an in-progress edge cycle.
- trajectory_eval — tolerate plain-string / label-keyed error_types instead of raising an opaque 'str' object has no attribute 'get' (which safe_generate_layout silently turned into an error block).
- table_grid — compact, table-like cells (previously aspect-ratio:1/1 ballooned them in wide containers).
- Examples — restore the missing data file for the interaction-graph example, and add an integrity guard test that runs the real boot-time data-file validation (which the preview CLI skips) across all agent-eval examples.
Reliability — trace & span round-trip integrity
- Span serialization data loss fixed. On save→reload, spans silently lost target_field, id, knowledge-base entity links (kb_id/kb_source/kb_label), discontinuous-span additional_parts, and format_coords — the on-disk serializer wrote fewer fields than the loader read. This broke multi-field span rendering across sessions. Both serializers now delegate to SpanAnnotation.to_dict() as the single source of truth.
- Export CLI span crash fixed. The CLI treated stored span data as a dict and crashed on .items(); it now correctly parses the list-of-[span, value] pairs into the schema-keyed shape every exporter expects (with legacy-dict fallback).
- Round-trip regression suites added for spans/links/events and for the full agent trace import → store → export pipeline: CanonicalTrace fidelity with a drift guard, per-importer auto-detection and normalization, the capture-SDK payload path, and real example-trace data.
Annotator progress dashboard (opt-in, read-only) — new
An optional /progress page lets annotators see their own progress plus project-level
totals, without exposing admin actions or other annotators' identities. Off by
default; enable with an annotator_dashboard config block. Project stats are computed
by a shared helper reused by the admin Overview tab. Accessible (progressbar ARIA,
live regions, keyboard-friendly) and resilient (request timeout, graceful empty states).
See Annotator Progress Dashboard.
Notes
- The agent-eval datasets package remains potato.eval_datasets (not datasets) to avoid shadowing the Hugging Face datasets library on the import path.
- Potato now ships 53 annotation schema types, all free and open-source.