Release Notes: v2.4.0 — Agent Evaluation, AI-Assisted Annotation & Enterprise Integration

This release transforms Potato into a comprehensive platform for evaluating AI agents, with new interactive annotation interfaces, AI-assisted labeling, enterprise integrations, and significant quality-of-life improvements. It comprises 54 commits since v2.3.0 and adds 200+ new tests.

Agent Evaluation

Web Agent Annotation

Interactive trace viewer for web-browsing AI agents with SVG overlays (click markers, bounding boxes, mouse paths, scroll indicators)
Review Mode: Filmstrip navigation through pre-recorded agent screenshots with per-step ratings
Creation Mode: iframe-based live web browsing with automatic interaction recording (clicks, typing, scrolling)
Keyboard shortcuts for rapid step-by-step evaluation
Trace converters for WebArena, Mind2Web, and Anthropic Computer Use formats
8 example projects covering review, creation, and specialized evaluation workflows

Live Agent Evaluation

Watch AI agents execute tasks in real time while annotating their behavior
Agent Runner Manager for parallel agent execution with task control
Trace Ingestion webhook receiver for capturing running agent traces
Step-level inter-annotator agreement and quality control modules
Web playback UI with screenshots, overlays, and navigation controls

Agent Trace Examples

14 agent trace evaluation examples, including:
agent-trace-evaluation — Text agents with MAST error taxonomy
visual-agent-evaluation — GUI agents with screenshot grounding
agent-comparison — Side-by-side A/B comparison
web-agent-review — Pre-recorded web traces with overlay viewer
web-agent-creation — Live web browsing with trace recording
live-agent-evaluation — Real-time agent execution
complex-annotation — Multi-schema trace annotation (radio, likert, span, text)
rag-evaluation, openai-evaluation, anthropic-evaluation, swebench-evaluation, multi-agent-evaluation, langchain-integration

AI-Assisted Annotation

Collapsible AI assistant panel for annotator guidance on difficult instances
Multi-turn conversations with intelligent context (task description, labels, current instance text)
ChatManager singleton with system prompt templating and endpoint dispatch
Native multi-turn support for OpenAI, Anthropic, and Ollama endpoints
All conversations logged as behavioral data (ChatMessage dataclass in BehavioralData)
Three REST endpoints: /api/chat/send, /api/chat/history, /api/chat/config
Keyboard shortcuts and instance change detection
See example: examples/ai-assisted/llm-chat/

Advanced Active Learning

5 query strategies for efficient annotation:
Uncertainty sampling
Diversity-based selection
BADGE (Batch Active Learning by Diverse Gradient Embeddings)
BALD (Bayesian Active Learning by Disagreement)
Hybrid ensemble approach
LLM cold-start for intelligent instance selection before any labels exist
CoverICL ensemble for diverse in-context learning examples
Probability calibration and confidence estimation (logprob and consistency methods)
TF-IDF default vectorizer with sentence-transformer support
See Active Learning Guide for details
Examples: examples/advanced/active-learning-strategies/, examples/advanced/active-learning-llm-cold-start/
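
To make the first strategy in the list above concrete, here is a minimal sketch of entropy-based uncertainty sampling. It is not Potato's implementation: the function names, instance ids, and probabilities are hypothetical, and it assumes a classifier that returns one probability per label for each unlabeled instance.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(
    predictions: Dict[str, Sequence[float]], batch_size: int
) -> List[str]:
    """Return the ids of the batch_size instances with the highest entropy."""
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical classifier output: instance id -> class probabilities.
predictions = {
    "doc-1": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "doc-2": [0.34, 0.33, 0.33],  # near-uniform, highest entropy
    "doc-3": [0.60, 0.30, 0.10],
}

queue = select_most_uncertain(predictions, batch_size=2)
print(queue)  # ['doc-2', 'doc-3']
```

Instances the model is least sure about are sent to annotators first, which is the intuition the other strategies refine with diversity or Bayesian disagreement terms.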

Enterprise Integration

Webhook System

Event-driven integration with external systems via signed payloads
5 event types: annotation.created, item.fully_annotated, task.completed, user.phase_completed, quality.attention_check_failed
HMAC-SHA256 payload signing (Standard Webhooks spec)
Non-blocking delivery with SQLite retry store and configurable retries
Wildcard event subscription
Admin APIs for webhook management and test firing
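
A receiver of signed payloads would normally verify the signature before trusting an event. The sketch below follows the Standard Webhooks scheme (HMAC-SHA256 over "id.timestamp.body", base64-encoded and prefixed with "v1,"). It assumes Potato sends the spec's webhook-id, webhook-timestamp, and webhook-signature headers; the secret, message id, and payload shape are invented for illustration.

```python
import base64
import hashlib
import hmac
import time
from typing import Mapping, Optional


def sign(secret: str, msg_id: str, timestamp: int, body: bytes) -> str:
    """Build a Standard Webhooks style signature: 'v1,' + base64(HMAC-SHA256)."""
    if secret.startswith("whsec_"):
        secret = secret[len("whsec_"):]
    key = base64.b64decode(secret)
    signed_content = f"{msg_id}.{timestamp}.".encode("utf-8") + body
    digest = hmac.new(key, signed_content, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode("ascii")


def verify(
    secret: str,
    headers: Mapping[str, str],
    body: bytes,
    tolerance: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Check timestamp freshness, then compare signatures in constant time."""
    try:
        msg_id = headers["webhook-id"]
        timestamp = int(headers["webhook-timestamp"])
        received = headers["webhook-signature"]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if abs(current - timestamp) > tolerance:
        return False  # stale or future-dated delivery, possible replay
    expected = sign(secret, msg_id, timestamp, body).encode("ascii")
    # The header may carry several space-separated signatures (key rotation).
    return any(
        hmac.compare_digest(expected, candidate.encode("utf-8"))
        for candidate in received.split()
    )


# Hypothetical secret and payload, for demonstration only.
secret = "whsec_" + base64.b64encode(b"demo-signing-key").decode("ascii")
body = b'{"type": "annotation.created"}'
headers = {
    "webhook-id": "msg_123",
    "webhook-timestamp": "1700000000",
    "webhook-signature": sign(secret, "msg_123", 1700000000, body),
}

print(verify(secret, headers, body, now=1700000010))         # True
print(verify(secret, headers, body + b" ", now=1700000010))  # False
```

The signature must be computed over the raw request bytes, so a receiver should verify before parsing the JSON body.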

HuggingFace Ecosystem

Hub Export: Push annotation datasets with auto-generated DatasetCards — pip install potato-annotation[huggingface]
Datasets Integration: load_as_dataset() and load_annotations() for zero-copy in-memory loading
Spaces Deployment: Pre-configured Docker setup for one-click HuggingFace Spaces deployment
Live Demo: huggingface.co/spaces/Blablablab/potato

LangChain Callback Handler

PotatoCallbackHandler extends LangChain's BaseCallbackHandler for automatic trace ingestion
Parent-child tracking of chain/LLM/tool runs with hierarchy
LangSmith-compatible payloads sent on root chain completion
Thread-safe background sending — pip install potato-annotation[langchain]

SSO/OAuth Authentication

Google, GitHub, and generic OIDC provider support via Authlib
Automatic user provisioning on first login — pip install potato-annotation[auth]

Password Management

PBKDF2-SHA256 hashing with per-user salts and constant-time hash comparison via hmac.compare_digest
Admin CLI/API password reset: potato reset-password
Self-service token-based reset flow with email templates
Database backend (SQLite/PostgreSQL) via DatabaseAuthBackend
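
The hashing scheme named above can be illustrated with the Python standard library alone. This is a sketch under stated assumptions, not Potato's code: the iteration count, salt length, and function names are illustrative choices.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # illustrative work factor, not Potato's actual setting
SALT_BYTES = 16       # illustrative salt length


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Derive a PBKDF2-SHA256 digest; generate a fresh random salt if none is given."""
    if salt is None:
        salt = os.urandom(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected_digest: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, expected_digest)


# Store the salt and digest per user; never store the password itself.
salt, stored = hash_password("correct horse battery staple")

print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))                   # False
```

A per-user salt means identical passwords produce different digests, and hmac.compare_digest avoids leaking how many leading bytes matched through timing differences.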

Annotation System Improvements

Required Annotation Enforcement

Server-side blocking prevents forward navigation when required fields are empty
Client-side validation with real-time visual feedback on unfilled schemas
validation: required and required: true config options

Display System

Collapsible Instructions Banner: Inline annotation instructions rendered as a collapsible panel via annotation_instructions config key
Span Target Contract: Enforced across all display types — displays declaring supports_span_target = True must produce a .text-content wrapper
Dialogue Span Annotations: Full support for highlighting text within dialogue/conversation displays
Span Schema columns Option: Control grid layout of span label checkboxes (e.g., columns: 4)
Header Logo Support: Custom task branding via config
Base CSS Injection: Project-level custom CSS via base_css config option
Multi-Phase Workflows: Conditional rendering for complex multi-stage annotation tasks

Data & Export

Parquet Export: Columnar annotation output via pyarrow — pip install potato-annotation[export]
Per-File Encoding: Specify encoding for individual entries in data_files
6 Trace Converters: OpenAI, Anthropic, SWE-bench, OTEL, multi-agent, MCP formats
Admin Export API: Trigger exports via REST (POST /admin/api/export, GET /admin/api/export/formats)
AI Endpoint Enhancement: chat_query_with_image method for image-aware AI conversations

Quality Control

Enhanced QC metrics and behavioral data extraction
JSON/JSONL support for quality control files
Categorical value normalization for agreement analysis
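
To show why normalizing categorical values matters for agreement analysis, the sketch below normalizes labels before computing Cohen's kappa for two annotators. The normalization rule (case folding plus whitespace collapsing) and all names are assumptions for illustration; the rules Potato actually applies are not specified here.

```python
from collections import Counter
from typing import Sequence


def normalize(value: object) -> str:
    """Collapse whitespace and case so 'YES ' and 'yes' count as the same label."""
    return " ".join(str(value).split()).casefold()


def cohens_kappa(a: Sequence[object], b: Sequence[object]) -> float:
    """Cohen's kappa for two annotators over the same items, after normalization."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    a_norm = [normalize(v) for v in a]
    b_norm = [normalize(v) for v in b]
    n = len(a_norm)
    # Observed agreement: share of items with identical normalized labels.
    observed = sum(x == y for x, y in zip(a_norm, b_norm)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(a_norm), Counter(b_norm)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels with inconsistent casing and stray whitespace.
annotator_a = ["Yes", "no", "YES ", "No"]
annotator_b = ["yes", "no", "no", "no"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa)  # 0.5
```

Without normalization, only one of the four raw label pairs above would match exactly, which would understate agreement.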

UX

Powered by Potato footer with GitHub and citation links
Semantic CSS classes and design-system variables
Keybinding conflict resolution across multi-schema annotations

Bug Fixes

Next button permanently disabled after page load — setLoading(false) skipped re-enabling the button
Dialogue span offset mismatch between DOM textContent and data-original-text causing silent span creation failures
SurveyFlow phase advancement broken by annotation.js intercepting form submissions (Issue #126)
os import shadowing in run_server that broke startup
UnicodeDecodeError on Windows with non-ASCII data files (Issue #112)
Per-phase layout overwriting when multiple phases used separate layout files (Issue #119)
HTML sanitizer too restrictive for structural content (Issue #120)
Radio/multiselect persistence not restoring correctly after navigation
Span overlay positioning and class name mismatches
Label format bug in admin export (list-of-tuples crash)
All-phases navigation loop and has_free_response boolean crash (Issue #111)

Dependency Changes

Dependencies are now organized into optional groups for lighter installs:

pip install potato-annotation              # Core only
pip install potato-annotation[ai]          # + OpenAI, Ollama
pip install potato-annotation[huggingface] # + Hub, Datasets
pip install potato-annotation[langchain]   # + LangChain callback handler
pip install potato-annotation[auth]        # + OAuth/SSO via Authlib
pip install potato-annotation[formats]     # + PDF, DOCX, code highlighting
pip install potato-annotation[export]      # + Parquet via pyarrow
pip install potato-annotation[viz]         # + UMAP visualization
pip install potato-annotation[all]         # Everything

Testing

200+ new tests covering all major features:
Password management (38), config validation (9), password reset API (11)
Chat sidebar (47)
Active learning unit (54) and integration (20)
Web agent UI (72 Selenium tests)
Dialogue span annotation and display contracts
Multi-phase workflow integration