Release Notes: v2.4.0 — Agent Evaluation, AI-Assisted Annotation & Enterprise Integration

This release extends Potato into a platform for evaluating AI agents, adding interactive annotation interfaces, AI-assisted labeling, enterprise integrations, and quality-of-life improvements. It includes 54 commits since v2.3.0 and more than 200 new tests.

Agent Evaluation

Web Agent Annotation

  • Interactive trace viewer for web-browsing AI agents with SVG overlays (click markers, bounding boxes, mouse paths, scroll indicators)
  • Review Mode: Filmstrip navigation through pre-recorded agent screenshots with per-step ratings
  • Creation Mode: iframe-based live web browsing with automatic interaction recording (clicks, typing, scrolling)
  • Keyboard shortcuts for rapid step-by-step evaluation
  • Trace converters for WebArena, Mind2Web, and Anthropic Computer Use formats
  • 8 example projects covering review, creation, and specialized evaluation workflows

Live Agent Evaluation

  • Watch AI agents execute tasks in real time while annotating their behavior
  • Agent Runner Manager for parallel agent execution with task control
  • Trace Ingestion webhook receiver for capturing traces from running agents
  • Step-level inter-annotator agreement and quality control modules
  • Web playback UI with screenshots, overlays, and navigation controls

Agent Trace Examples

  • 14 agent trace evaluation examples:
      • agent-trace-evaluation: Text agents with MAST error taxonomy
      • visual-agent-evaluation: GUI agents with screenshot grounding
      • agent-comparison: Side-by-side A/B comparison
      • web-agent-review: Pre-recorded web traces with overlay viewer
      • web-agent-creation: Live web browsing with trace recording
      • live-agent-evaluation: Real-time agent execution
      • complex-annotation: Multi-schema trace annotation (radio, likert, span, text)
      • rag-evaluation, openai-evaluation, anthropic-evaluation, swebench-evaluation, multi-agent-evaluation, langchain-integration

AI-Assisted Annotation

LLM Chat Sidebar

  • Collapsible AI assistant panel for annotator guidance on difficult instances
  • Multi-turn conversations with intelligent context (task description, labels, current instance text)
  • ChatManager singleton with system prompt templating and endpoint dispatch
  • Native multi-turn support for OpenAI, Anthropic, and Ollama endpoints
  • All conversations logged as behavioral data (ChatMessage dataclass in BehavioralData)
  • Three REST APIs: /api/chat/send, /api/chat/history, /api/chat/config
  • Keyboard shortcuts and instance change detection
  • See example: examples/ai-assisted/llm-chat/

Advanced Active Learning

  • 5 query strategies for efficient annotation:
      • Uncertainty sampling
      • Diversity-based selection
      • BADGE (Batch Active Learning by Diverse Gradient Embeddings)
      • BALD (Bayesian Active Learning by Disagreement)
      • Hybrid ensemble approach
  • LLM cold-start for intelligent instance selection before any labels exist
  • CoverICL ensemble for diverse in-context learning examples
  • Probability calibration and confidence estimation (logprob and consistency methods)
  • TF-IDF default vectorizer with sentence-transformer support
  • See Active Learning Guide for details
  • Examples: examples/advanced/active-learning-strategies/, examples/advanced/active-learning-llm-cold-start/
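
To illustrate the first strategy in the list above, here is a minimal sketch of entropy-based uncertainty sampling. It is not Potato's implementation: the function names (entropy, select_most_uncertain) and the shape of the predictions dictionary are assumptions made for this example. The idea is to send annotators the instances whose predicted label distribution is closest to uniform.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, k):
    """Return the ids of the k instances with the highest predictive entropy.

    `predictions` maps an instance id to a list of class probabilities.
    This structure is hypothetical, chosen only for the illustration.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy model outputs for three unlabeled instances.
predictions = {
    "item-1": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "item-2": [0.40, 0.35, 0.25],  # near-uniform, high entropy
    "item-3": [0.70, 0.20, 0.10],
}

queue = select_most_uncertain(predictions, k=2)
print(queue)  # ['item-2', 'item-3']
```

Diversity-based and hybrid strategies would typically add a second term so that the selected batch is not made up of near-duplicate instances.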

Enterprise Integration

Webhook System

  • Event-driven integration with external systems via signed payloads
  • 5 event types: annotation.created, item.fully_annotated, task.completed, user.phase_completed, quality.attention_check_failed
  • HMAC-SHA256 payload signing (Standard Webhooks spec)
  • Non-blocking delivery with SQLite retry store and configurable retries
  • Wildcard event subscription
  • Admin APIs for webhook management and test firing
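
A receiver can check the signature before trusting a payload. The sketch below follows the general Standard Webhooks scheme: HMAC-SHA256 over "id.timestamp.body", base64-encoded and prefixed with "v1,". Whether Potato uses exactly this secret encoding and header layout is an assumption here; the helper names sign and verify and the sample values are hypothetical. Consult the webhook documentation for the actual header names.

```python
import base64
import hashlib
import hmac


def sign(secret: str, msg_id: str, timestamp: str, body: bytes) -> str:
    """Compute a Standard Webhooks style signature ("v1,<base64 digest>")."""
    # Standard Webhooks secrets are base64 strings, usually prefixed "whsec_".
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed_content = f"{msg_id}.{timestamp}.".encode() + body
    digest = hmac.new(key, signed_content, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode()


def verify(secret: str, msg_id: str, timestamp: str, body: bytes,
           signature_header: str) -> bool:
    """Return True if any space-separated signature in the header matches."""
    expected = sign(secret, msg_id, timestamp, body)
    return any(
        hmac.compare_digest(expected, candidate)
        for candidate in signature_header.split()
    )


# Illustrative values only; a real receiver reads these from the request.
secret = "whsec_" + base64.b64encode(b"demo-secret-for-illustration").decode()
body = b'{"event": "annotation.created"}'
header = sign(secret, "msg_1", "1700000000", body)

print(verify(secret, "msg_1", "1700000000", body, header))         # True
print(verify(secret, "msg_1", "1700000000", body + b" ", header))  # False
```

A production receiver would also reject timestamps outside a short tolerance window to limit replay attacks; that check is omitted here for brevity.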

HuggingFace Ecosystem

  • Hub Export: Push annotation datasets with auto-generated DatasetCards — pip install potato-annotation[huggingface]
  • Datasets Integration: load_as_dataset() and load_annotations() for zero-copy in-memory loading
  • Spaces Deployment: Pre-configured Docker setup for one-click HuggingFace Spaces deployment
  • Live Demo: huggingface.co/spaces/Blablablab/potato

LangChain Callback Handler

  • PotatoCallbackHandler extends LangChain's BaseCallbackHandler for automatic trace ingestion
  • Parent-child tracking of chain, LLM, and tool runs, preserving the run hierarchy
  • LangSmith-compatible payloads sent on root chain completion
  • Thread-safe background sending — pip install potato-annotation[langchain]

SSO/OAuth Authentication

  • Google, GitHub, and generic OIDC provider support via Authlib
  • Automatic user provisioning on first login — pip install potato-annotation[auth]

Password Management

  • PBKDF2-SHA256 hashing with per-user salts and constant-time comparison via hmac.compare_digest
  • Admin CLI/API password reset: potato reset-password
  • Self-service token-based reset flow with email templates
  • Database backend (SQLite/PostgreSQL) via DatabaseAuthBackend

Annotation System Improvements

Required Annotation Enforcement

  • Server-side blocking prevents forward navigation when required fields are empty
  • Client-side validation with real-time visual feedback on unfilled schemas
  • validation: required and required: true config options

Display System

  • Collapsible Instructions Banner: Inline annotation instructions rendered as a collapsible panel via annotation_instructions config key
  • Span Target Contract: Enforced across all display types — displays declaring supports_span_target = True must produce a .text-content wrapper
  • Dialogue Span Annotations: Full support for highlighting text within dialogue/conversation displays
  • Span Schema columns Option: Control grid layout of span label checkboxes (e.g., columns: 4)
  • Header Logo Support: Custom task branding via config
  • Base CSS Injection: Project-level custom CSS via base_css config option
  • Multi-Phase Workflows: Conditional rendering for complex multi-stage annotation tasks

Data & Export

  • Parquet Export: Columnar annotation output via pyarrow — pip install potato-annotation[export]
  • Per-File Encoding: Specify encoding for individual entries in data_files
  • 6 Trace Converters: OpenAI, Anthropic, SWE-bench, OTEL, multi-agent, MCP formats
  • Admin Export API: Trigger exports via REST (POST /admin/api/export, GET /admin/api/export/formats)
  • AI Endpoint Enhancement: chat_query_with_image method for image-aware AI conversations

Quality Control

  • Enhanced QC metrics and behavioral data extraction
  • JSON/JSONL support for quality control files
  • Categorical value normalization for agreement analysis

UX

  • "Powered by Potato" footer with GitHub and citation links
  • Semantic CSS classes and design-system variables
  • Keybinding conflict resolution across multi-schema annotations

Bug Fixes

  • Next button permanently disabled after page load — setLoading(false) skipped re-enabling the button
  • Dialogue span offset mismatch between DOM textContent and data-original-text causing silent span creation failures
  • SurveyFlow phase advancement broken by annotation.js intercepting form submissions (Issue #126)
  • os import shadowing in run_server that broke startup
  • UnicodeDecodeError on Windows with non-ASCII data files (Issue #112)
  • Per-phase layout overwriting when multiple phases used separate layout files (Issue #119)
  • HTML sanitizer too restrictive for structural content (Issue #120)
  • Radio/multiselect persistence not restoring correctly after navigation
  • Span overlay positioning and class name mismatches
  • Label format bug in admin export (list-of-tuples crash)
  • All-phases navigation loop and has_free_response boolean crash (Issue #111)

Dependency Changes

Dependencies are now organized into optional groups for lighter installs:

pip install potato-annotation              # Core only
pip install potato-annotation[ai]          # + OpenAI, Ollama
pip install potato-annotation[huggingface] # + Hub, Datasets
pip install potato-annotation[langchain]   # + LangChain callback handler
pip install potato-annotation[auth]        # + OAuth/SSO via Authlib
pip install potato-annotation[formats]     # + PDF, DOCX, code highlighting
pip install potato-annotation[export]      # + Parquet via pyarrow
pip install potato-annotation[viz]         # + UMAP visualization
pip install potato-annotation[all]         # Everything

Testing

200+ new tests covering all major features:

  • Password management (38), config validation (9), password reset API (11)
  • Chat sidebar (47)
  • Active learning unit (54) and integration (20)
  • Web agent UI (72 Selenium tests)
  • Dialogue span annotation and display contracts
  • Multi-phase workflow integration