Release Notes: v2.4.0 — Agent Evaluation, AI-Assisted Annotation & Enterprise Integration
This release turns Potato into a comprehensive platform for evaluating AI agents, adding interactive annotation interfaces, AI-assisted labeling, enterprise integrations, and quality-of-life improvements. It comprises 54 commits since v2.3.0 and 200+ new tests.
Agent Evaluation
Web Agent Annotation
- Interactive trace viewer for web-browsing AI agents with SVG overlays (click markers, bounding boxes, mouse paths, scroll indicators)
- Review Mode: Filmstrip navigation through pre-recorded agent screenshots with per-step ratings
- Creation Mode: iframe-based live web browsing with automatic interaction recording (clicks, typing, scrolling)
- Keyboard shortcuts for rapid step-by-step evaluation
- Trace converters for WebArena, Mind2Web, and Anthropic Computer Use formats
- 8 example projects covering review, creation, and specialized evaluation workflows
Live Agent Evaluation
- Watch AI agents execute tasks in real time while annotating their behavior
- Agent Runner Manager for parallel agent execution with task control
- Trace Ingestion webhook receiver for capturing running agent traces
- Step-level inter-annotator agreement and quality control modules
- Web playback UI with screenshots, overlays, and navigation controls
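As an illustration of step-level agreement, the sketch below computes Cohen's kappa over the per-step ratings two annotators gave the same trace. The release notes do not say which statistic the new modules compute, so the choice of kappa is an assumption, and the function name and data layout are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Fraction of steps on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical per-step ratings of one four-step trace.
annotator_1 = ["correct", "correct", "error", "correct"]
annotator_2 = ["correct", "error", "error", "correct"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four steps (0.75 observed) against 0.5 expected by chance, which gives a kappa of 0.5.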
Agent Trace Examples
- 14 agent trace evaluation examples:
  - agent-trace-evaluation: Text agents with MAST error taxonomy
  - visual-agent-evaluation: GUI agents with screenshot grounding
  - agent-comparison: Side-by-side A/B comparison
  - web-agent-review: Pre-recorded web traces with overlay viewer
  - web-agent-creation: Live web browsing with trace recording
  - live-agent-evaluation: Real-time agent execution
  - complex-annotation: Multi-schema trace annotation (radio, likert, span, text)
  - rag-evaluation, openai-evaluation, anthropic-evaluation, swebench-evaluation, multi-agent-evaluation, langchain-integration
AI-Assisted Annotation
LLM Chat Sidebar
- Collapsible AI assistant panel for annotator guidance on difficult instances
- Multi-turn conversations with intelligent context (task description, labels, current instance text)
- ChatManager singleton with system prompt templating and endpoint dispatch
- Native multi-turn support for OpenAI, Anthropic, and Ollama endpoints
- All conversations logged as behavioral data (ChatMessage dataclass in BehavioralData)
- Three REST APIs: /api/chat/send, /api/chat/history, /api/chat/config
- Keyboard shortcuts and instance change detection
- See example: examples/ai-assisted/llm-chat/
Advanced Active Learning
- 5 query strategies for efficient annotation:
  - Uncertainty sampling
  - Diversity-based selection
  - BADGE (Batch Active Learning by Diverse Gradient Embeddings)
  - BALD (Bayesian Active Learning by Disagreement)
  - Hybrid ensemble approach
- LLM cold-start for intelligent instance selection before any labels exist
- CoverICL ensemble for diverse in-context learning examples
- Probability calibration and confidence estimation (logprob and consistency methods)
- TF-IDF default vectorizer with sentence-transformer support
- See Active Learning Guide for details
- Examples: examples/advanced/active-learning-strategies/, examples/advanced/active-learning-llm-cold-start/
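To make the simplest of these strategies concrete, here is a minimal sketch of entropy-based uncertainty sampling. It is not Potato's implementation: the function names and the shape of the `predictions` mapping are hypothetical, and entropy is only one common way to score uncertainty.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, k):
    """Return the ids of the k instances with the highest predictive entropy.

    `predictions` maps an instance id to its list of class probabilities.
    """
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:k]


# Hypothetical model outputs for three unlabeled instances.
predictions = {
    "doc-1": [0.98, 0.02],  # confident prediction
    "doc-2": [0.55, 0.45],  # close to the decision boundary
    "doc-3": [0.80, 0.20],
}
queue = select_most_uncertain(predictions, k=2)
```

The instances the model is least sure about ("doc-2", then "doc-3") are queued for annotation first, while the confidently classified "doc-1" is deferred.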
Enterprise Integration
Webhook System
- Event-driven integration with external systems via signed payloads
- 5 event types: annotation.created, item.fully_annotated, task.completed, user.phase_completed, quality.attention_check_failed
- HMAC-SHA256 payload signing (Standard Webhooks spec)
- Non-blocking delivery with SQLite retry store and configurable retries
- Wildcard event subscription
- Admin APIs for webhook management and test firing
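A receiver can check signed payloads roughly as follows. This sketch follows the Standard Webhooks scheme, in which the HMAC-SHA256 is computed over the message id, timestamp, and body joined by dots, using a base64 secret prefixed with whsec_. Whether Potato's payloads match this layout in every detail is an assumption, and the helper names and example values are hypothetical.

```python
import base64
import hashlib
import hmac

PREFIX = "whsec_"


def sign(secret, msg_id, timestamp, body):
    """Compute a Standard Webhooks style 'v1' signature string."""
    raw = secret[len(PREFIX):] if secret.startswith(PREFIX) else secret
    key = base64.b64decode(raw)
    signed_content = f"{msg_id}.{timestamp}.{body}".encode("utf-8")
    digest = hmac.new(key, signed_content, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode("ascii")


def verify(secret, msg_id, timestamp, body, signature_header):
    """Accept if any space-separated signature in the header matches."""
    expected = sign(secret, msg_id, timestamp, body)
    return any(
        hmac.compare_digest(expected, candidate)
        for candidate in signature_header.split()
    )


# Hypothetical secret and event, for illustration only.
secret = PREFIX + base64.b64encode(b"example-signing-key").decode("ascii")
body = '{"type": "annotation.created"}'
header = sign(secret, "msg_1", "1700000000", body)
valid = verify(secret, "msg_1", "1700000000", body, header)
tampered = verify(secret, "msg_1", "1700000000", body + " ", header)
```

A production receiver would also reject timestamps outside a short tolerance window to limit replay attacks; that check is omitted here for brevity.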
HuggingFace Ecosystem
- Hub Export: Push annotation datasets with auto-generated DatasetCards (pip install potato-annotation[huggingface])
- Datasets Integration: load_as_dataset() and load_annotations() for zero-copy in-memory loading
- Spaces Deployment: Pre-configured Docker setup for one-click HuggingFace Spaces deployment
- Live Demo: huggingface.co/spaces/Blablablab/potato
LangChain Callback Handler
- PotatoCallbackHandler extends LangChain's BaseCallbackHandler for automatic trace ingestion
- Parent-child tracking of chain/LLM/tool runs with hierarchy
- LangSmith-compatible payloads sent on root chain completion
- Thread-safe background sending (pip install potato-annotation[langchain])
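The parent-child tracking described above can be sketched without any dependencies. The class below mirrors the run_id and parent_run_id arguments that LangChain passes to callback methods, but RunTracker, its method names, and the payload shape are hypothetical and are not the PotatoCallbackHandler API. Locking and background sending are left out.

```python
import uuid


class RunTracker:
    """Collects runs keyed by run_id and emits a tree when the root run ends."""

    def __init__(self, send):
        self._send = send  # callable that receives the finished trace
        self._runs = {}    # run_id -> run record

    def on_start(self, kind, name, run_id, parent_run_id=None):
        self._runs[run_id] = {
            "kind": kind, "name": name, "parent": parent_run_id,
            "children": [], "output": None,
        }
        # Link the new run under its parent, if the parent is being tracked.
        if parent_run_id in self._runs:
            self._runs[parent_run_id]["children"].append(run_id)

    def on_end(self, run_id, output):
        run = self._runs[run_id]
        run["output"] = output
        if run["parent"] is None:  # root run finished: ship the whole tree
            self._send(self._to_tree(run_id))
            self._runs.clear()

    def _to_tree(self, run_id):
        run = self._runs[run_id]
        return {
            "kind": run["kind"], "name": run["name"], "output": run["output"],
            "children": [self._to_tree(c) for c in run["children"]],
        }


sent = []
tracker = RunTracker(sent.append)
root_id, llm_id = uuid.uuid4(), uuid.uuid4()
tracker.on_start("chain", "qa_chain", root_id)
tracker.on_start("llm", "chat_model", llm_id, parent_run_id=root_id)
tracker.on_end(llm_id, "answer")
tracker.on_end(root_id, "answer")
```

Nothing is sent while child runs finish; a single nested payload is emitted only once the root chain completes.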
SSO/OAuth Authentication
- Google, GitHub, and generic OIDC provider support via Authlib
- Automatic user provisioning on first login (pip install potato-annotation[auth])
Password Management
- PBKDF2-SHA256 hashing with per-user salts and constant-time comparison via hmac.compare_digest
- Admin CLI/API password reset: potato reset-password
- Self-service token-based reset flow with email templates
- Database backend (SQLite/PostgreSQL) via DatabaseAuthBackend
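The hashing scheme named above can be sketched with the standard library alone. The iteration count, salt length, and function names below are assumptions for illustration; the release notes do not state the parameters Potato uses.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; not taken from the release notes


def hash_password(password, salt=None):
    """Return (salt, derived_key); a fresh random salt is generated per user."""
    if salt is None:
        salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password, salt, stored_key):
    """Re-derive the key with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_key)


salt, stored = hash_password("correct horse battery staple")
ok = verify_password("correct horse battery staple", salt, stored)
bad = verify_password("wrong guess", salt, stored)
```

Using hmac.compare_digest instead of == avoids leaking, through timing differences, how many leading bytes of a guess match the stored key.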
Annotation System Improvements
Required Annotation Enforcement
- Server-side blocking prevents forward navigation when required fields are empty
- Client-side validation with real-time visual feedback on unfilled schemas
- validation: required and required: true config options
Display System
- Collapsible Instructions Banner: Inline annotation instructions rendered as a collapsible panel via the annotation_instructions config key
- Span Target Contract: Enforced across all display types; displays declaring supports_span_target = True must produce a .text-content wrapper
- Dialogue Span Annotations: Full support for highlighting text within dialogue/conversation displays
- Span Schema columns Option: Control grid layout of span label checkboxes (e.g., columns: 4)
- Header Logo Support: Custom task branding via config
- Base CSS Injection: Project-level custom CSS via the base_css config option
- Multi-Phase Workflows: Conditional rendering for complex multi-stage annotation tasks
Data & Export
- Parquet Export: Columnar annotation output via pyarrow (pip install potato-annotation[export])
- Per-File Encoding: Specify encoding for individual entries in data_files
- 6 Trace Converters: OpenAI, Anthropic, SWE-bench, OTEL, multi-agent, MCP formats
- Admin Export API: Trigger exports via REST (POST /admin/api/export, GET /admin/api/export/formats)
- AI Endpoint Enhancement: chat_query_with_image method for image-aware AI conversations
Quality Control
- Enhanced QC metrics and behavioral data extraction
- JSON/JSONL support for quality control files
- Categorical value normalization for agreement analysis
UX
- "Powered by Potato" footer with GitHub and citation links
- Semantic CSS classes and design-system variables
- Keybinding conflict resolution across multi-schema annotations
Bug Fixes
- Next button permanently disabled after page load: setLoading(false) skipped re-enabling the button
- Dialogue span offset mismatch between DOM textContent and data-original-text causing silent span creation failures
data-original-textcausing silent span creation failures - SurveyFlow phase advancement broken by annotation.js intercepting form submissions (Issue #126)
- os import shadowing in run_server that broke startup
- UnicodeDecodeError on Windows with non-ASCII data files (Issue #112)
- Per-phase layout overwriting when multiple phases used separate layout files (Issue #119)
- HTML sanitizer too restrictive for structural content (Issue #120)
- Radio/multiselect persistence not restoring correctly after navigation
- Span overlay positioning and class name mismatches
- Label format bug in admin export (list-of-tuples crash)
- All-phases navigation loop and has_free_response boolean crash (Issue #111)
Dependency Changes
Dependencies are now organized into optional groups for lighter installs:
pip install potato-annotation # Core only
pip install potato-annotation[ai] # + OpenAI, Ollama
pip install potato-annotation[huggingface] # + Hub, Datasets
pip install potato-annotation[langchain] # + LangChain callback handler
pip install potato-annotation[auth] # + OAuth/SSO via Authlib
pip install potato-annotation[formats] # + PDF, DOCX, code highlighting
pip install potato-annotation[export] # + Parquet via pyarrow
pip install potato-annotation[viz] # + UMAP visualization
pip install potato-annotation[all] # Everything
Testing
200+ new tests covering all major features:
- Password management (38), config validation (9), password reset API (11)
- Chat sidebar (47)
- Active learning unit (54) and integration (20)
- Web agent UI (72 Selenium tests)
- Dialogue span annotation and display contracts
- Multi-phase workflow integration