Release Notes: v2.6.2 — Agent-Evaluation Differentiation (D/E series) + Multi-Agent & Multimodal Annotation (M series)

Building on the v2.6.1 agentic-evaluation suite (G1–G10), this release pushes Potato beyond parity with LangSmith and Labelbox into capabilities no open tool offers: a clickable multi-agent interaction graph, cross-agent failure attribution, purpose-built multimodal-agent annotation surfaces, statistically rigorous judge–human alignment, and reward/optimization export. It adds 13 new annotation schemas, new evaluators and analytics dashboards, and a set of robustness fixes. Every new schema ships with unit and Selenium persistence tests, a runnable example, and docs.

New evaluators & judge tooling (D-series)

  • rubric_dag, rag_triad, agent_as_judge evaluators. The RAG triad scores context relevance / groundedness / answer relevance; rubric_dag evaluates a dependency graph of criteria; agent_as_judge runs an agentic judge.
  • Judge robustness eval cards — verbosity-bias, position-swap consistency, and Expected Calibration Error (ECE).
  • Hardened core judge JSON parsing to tolerate fenced/<think>-wrapped model output.
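
To illustrate what tolerating fenced or <think>-wrapped judge output can look like, here is a minimal sketch. It is not Potato's actual parser: the function name `parse_judge_json` and the fallback order are assumptions made for this example.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this listing readable

# Reasoning blocks that some models emit before the final answer.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# An optional "json" tag after the opening fence, then the fenced payload.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def parse_judge_json(raw: str):
    """Hypothetical helper: extract a JSON value from noisy judge output."""
    text = THINK_RE.sub("", raw).strip()
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Last resort: parse the outermost {...} and ignore surrounding chatter.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])
```

A call such as `parse_judge_json('Sure! {"score": 2} done')` goes through the last-resort branch and yields `{"score": 2}`.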

Statistical rigor & consensus (D-series)

Bootstrap confidence intervals, Wilson intervals, paired significance tests, Dawid–Skene (EM) consensus, judge-drift trend tracking, and trace analytics.
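
For readers unfamiliar with two of the interval types named above, the following sketch shows the standard Wilson score interval for a pass rate and a percentile bootstrap interval for a mean score. It uses only the standard library; the function names and defaults are assumptions for illustration and may not match Potato's own implementation.

```python
import math
import random
from statistics import mean


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_ci(values, n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With 8 passes out of 10, `wilson_interval(8, 10)` gives roughly (0.49, 0.94), which is wider and better behaved near 0 or 1 than the naive normal approximation.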

Model arena (D-series)

Provider-agnostic multi-model arena with Elo / Bradley–Terry rankings and pairwise-preference (DPO) export; persona-driven multi-turn user simulation.
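
As a reminder of how Elo rankings are derived from pairwise outcomes, here is a small sketch of the textbook update rule. The K-factor of 32, the base rating of 1500, and the `(model_a, model_b, score_a)` battle tuple are assumptions for this example, not Potato's arena settings or data format.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """Return updated (A, B) ratings; score_a is 1 for an A win, 0.5 tie, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


def rank_models(battles, base: float = 1500.0, k: float = 32):
    """Replay (model_a, model_b, score_a) battles in order and return ratings."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.get(model_a, base)
        rb = ratings.get(model_b, base)
        ratings[model_a], ratings[model_b] = elo_update(ra, rb, score_a, k)
    return ratings
```

Elo depends on the order in which battles are replayed, whereas Bradley–Terry fits all comparisons jointly, which is one reason an arena may report both.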

Curation, integrity & optimization (E-series)

  • Failure-mode discovery — cosine k-means clustering of failures with LLM-labeled axes, surfaced on the Catalog.
  • LLM-cheating / low-effort detection — Correlated-Agreement + LLM-echo signals on a new annotation-integrity dashboard.
  • Perspectivist export — soft-label distributions that preserve disagreement.
  • Reward export & optimization — Rubrics-as-Rewards export, active-preference sampling, automatic metric induction, and a prompt-optimization (eval→improve) loop.
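
To make the perspectivist idea concrete: rather than collapsing annotator votes into a single majority label, a soft label keeps the full vote distribution. The sketch below is illustrative only; `soft_label` and its arguments are hypothetical names, and Potato's export format may differ.

```python
from collections import Counter


def soft_label(votes, label_set):
    """Turn one item's annotator votes into a probability distribution."""
    counts = Counter(votes)
    total = sum(counts[label] for label in label_set)
    if total == 0:
        raise ValueError("no votes for any label in label_set")
    # Labels nobody chose stay in the output with probability 0.0.
    return {label: counts[label] / total for label in label_set}
```

For example, three "toxic" votes and one "ok" vote become `{"toxic": 0.75, "ok": 0.25}`, so the one dissenting judgment survives into training data.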

Multi-agent team annotation (M-series) — new

Annotate the team structure, not just a flat transcript:

  • agent_interaction_graph — a clickable directed graph (nodes = agents, edges = handoffs); mark the critical path and flag problematic edges. (No open competitor offers this.)
  • failure_attribution — responsible agent + decisive step + reason (Who&When).
  • handoff_review — every agent→agent handoff as an annotatable object with inter-agent-misalignment flags + quality.
  • agent_scorecard — per-agent role fidelity / contribution / coordination, team dimensions, and milestones (MultiAgentBench-style).
  • tool_contention — per-agent lanes of concurrent tool calls; flag deadlock / race / shared-resource collisions.
  • emergent_behavior — tag turn-sets spanning agents for collusion / groupthink / cascading errors / role drift.
  • Recipes: MAST-at-step (the 14 MAST modes on trajectory_eval) and orchestration-pattern confirmation.

See Multi-Agent Team Annotation.

Multimodal-agent annotation (M-series) — new

For agents that drive GUIs, watch video, or hold spoken conversations:

  • gui_trajectory — per-step screenshot + action with click-grounding (OSWorld / ScreenSpot / AndroidWorld).
  • voice_interaction — dual-track (user/agent) timeline with barge-in/overlap classification (Full-Duplex-Bench).
  • temporal_grounding — mark gold event intervals on video with a live IoU vs. the model's prediction (ET-Bench / TimeScope).
  • speech_transcript — per-segment ASR/TTS/pronunciation/disfluency tagging with inline correction.
  • multimodal_reasoning — rate each step of an interleaved text↔image↔tool trace for coherence and visual hallucination.
  • tool_call_review — per-tool-call correctness (right tool / args / order; BFCL / MCPMark).
  • table_grid — document-table cell-structure annotation (OmniDocBench / RealHiTBench).
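
The IoU shown by temporal_grounding is the usual intersection-over-union of two time intervals. A minimal sketch, assuming intervals are `(start, end)` pairs in seconds (the function name is hypothetical):

```python
def temporal_iou(gold, pred) -> float:
    """Intersection-over-union of two (start, end) intervals."""
    gold_start, gold_end = gold
    pred_start, pred_end = pred
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(gold_end, pred_end) - max(gold_start, pred_start))
    union = (gold_end - gold_start) + (pred_end - pred_start) - inter
    return inter / union if union > 0 else 0.0
```

A gold interval of 2 s to 6 s against a prediction of 4 s to 8 s overlaps for 2 s out of a 6 s union, giving an IoU of one third.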

See Multimodal-Agent Annotation.

Fixes

  • agent_interaction_graph — persist edge classification before re-render so the restore path can't clobber an in-progress edge cycle.
  • trajectory_eval — tolerate plain-string / label-keyed error_types instead of raising an opaque 'str' object has no attribute 'get' (which safe_generate_layout silently turned into an error block).
  • table_grid — compact, table-like cells (previously aspect-ratio:1/1 ballooned them in wide containers).
  • Examples — restore the missing data file for the interaction-graph example, and add an integrity guard test that runs the real boot-time data-file validation (which the preview CLI skips) across all agent-eval examples.

Reliability — trace & span round-trip integrity

  • Span serialization data loss fixed. On save→reload, spans silently lost target_field, id, knowledge-base entity links (kb_id/kb_source/kb_label), discontinuous-span additional_parts, and format_coords — the on-disk serializer wrote fewer fields than the loader read. This broke multi-field span rendering across sessions. Both serializers now delegate to SpanAnnotation.to_dict() as the single source of truth.
  • Export CLI span crash fixed. The CLI treated stored span data as a dict and crashed on .items(); it now correctly parses the list-of-[span, value] pairs into the schema-keyed shape every exporter expects (with legacy-dict fallback).
  • Round-trip regression suites added for spans/links/events and for the full agent trace import → store → export pipeline: CanonicalTrace fidelity with a drift guard, per-importer auto-detection and normalization, the capture-SDK payload path, and real example-trace data.
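
The export fix above amounts to normalizing two stored shapes into one. The sketch below shows the general idea under loud assumptions: the `schema` key on each span, the `value` key added to each output row, and the function name are all hypothetical and are not taken from Potato's code.

```python
def spans_by_schema(stored):
    """Group stored span annotations into a schema-keyed dict (illustrative)."""
    if isinstance(stored, dict):
        # Legacy fallback: assume the data is already keyed by schema name.
        return {schema: list(items) for schema, items in stored.items()}
    grouped = {}
    for span, value in stored:  # current shape: a list of [span, value] pairs
        schema = span.get("schema", "unknown")  # assumed field name
        grouped.setdefault(schema, []).append({**span, "value": value})
    return grouped
```

Calling `.items()` directly on the list shape is what raises the crash described above; branching on the container type first avoids it.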

Annotator progress dashboard (opt-in, read-only) — new

An optional /progress page lets annotators see their own progress plus project-level totals, without exposing admin actions or other annotators' identities. Off by default; enable with an annotator_dashboard config block. Project stats are computed by a shared helper reused by the admin Overview tab. Accessible (progressbar ARIA, live regions, keyboard-friendly) and resilient (request timeout, graceful empty states). See Annotator Progress Dashboard.

Notes

  • The agent-eval datasets package remains potato.eval_datasets (not datasets) to avoid shadowing the HuggingFace datasets library on the import path.
  • Potato now ships 53 annotation schema types, all free and open-source.