Release Notes: v2.6.0 - QDA Mode, LLM-as-Judge Calibration & Trajectory Editing

This release brings Potato into full qualitative-data-analysis (QDA) territory and deepens its agent-evaluation toolkit. It adds an interactive QDA Mode (universal persistence, memos, search, a living codebook, and cases), an LLM-as-judge calibration/alignment workflow with a signal-based triage queue, and trajectory-editing schemas for producing SFT/DPO training data. It also relicenses Potato to GPL-3.0-or-later and lands a large robustness wave. The release comprises 84 commits since v2.5.0.

Qualitative Data Analysis (QDA) Mode

A new opt-in qda_mode setting turns Potato into a collaborative qualitative-coding environment. Enabling it automatically cascades sensible defaults (codebook open, memos on, cases on).

Universal SQLite persistence — a shared project database underpins the new QDA surfaces (memos, search index, codebook, cases) alongside the existing annotation store.
Memos — analyst memos with a sidebar UI, REST API, and a dedicated exporter flag, so reflective notes live next to the data.
Universal FTS5 search & claim — full-text search across instances with an annotator search-and-claim sidebar; admin search and a claim guard prevent double-work.
Living codebook + cases (QDA Phase 1) — a universal codebook with case grouping.
In-vivo coding — create a code straight from a text selection.
On-the-fly add — add codes in place with an in-form reconcile and a review worklist.
Retroactive curation — append-only merge/split of codes with an LLM propose-confirm flow.
Revision provenance & versioning — the codebook bumps a revision on any change, with a lightweight /version poll so clients pick up edits; provenance is tracked as an overlay separate from prompt-facing records.
Docs & examples — see QDA Mode; a composed qda-mode example plus per-feature example READMEs are included.
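
The revision and append-only ideas in the list above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the Codebook class, its fields, and needs_refresh are hypothetical names invented for illustration, not Potato's data model or API. It shows a revision counter that bumps on every change, a merge recorded as a log entry rather than a deletion, and the comparison a client might make after a cheap version poll.

```python
from dataclasses import dataclass, field


@dataclass
class Codebook:
    """Hypothetical in-memory codebook; not Potato's actual data model."""

    revision: int = 0
    codes: dict = field(default_factory=dict)      # code name -> description
    merge_log: list = field(default_factory=list)  # (revision, source, target)

    def add_code(self, name: str, description: str = "") -> int:
        """Add or update a code and bump the revision."""
        self.codes[name] = description
        self.revision += 1
        return self.revision

    def merge(self, source: str, target: str) -> int:
        """Record that `source` now resolves to `target`; nothing is deleted."""
        if source not in self.codes or target not in self.codes:
            raise KeyError("both codes must exist before merging")
        self.revision += 1
        self.merge_log.append((self.revision, source, target))
        return self.revision

    def resolve(self, name: str) -> str:
        """Follow merge records to the code that `name` currently maps to."""
        mapping = {src: dst for _, src, dst in self.merge_log}
        seen = set()
        while name in mapping and name not in seen:  # guard against cycles
            seen.add(name)
            name = mapping[name]
        return name


def needs_refresh(client_revision: int, book: Codebook) -> bool:
    """True when the client's cached revision is older than the codebook's."""
    return client_revision < book.revision


book = Codebook()
book.add_code("trust")
book.add_code("confidence")
cached = book.revision             # the client remembers revision 2
book.merge("confidence", "trust")  # a curator merges two codes
print(needs_refresh(cached, book))  # True
print(book.resolve("confidence"))   # trust
```

Because the merge is stored as a log entry, annotations made under the old code name stay valid and can be resolved to the surviving code at read time.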

Agent Evaluation & LLM-as-Judge

LLM-as-judge auto-labeling + blind human calibration — auto-label trajectories with an LLM judge, then calibrate against blind human judgments to measure and tune trustworthiness. See Judge Calibration.
Judge↔human alignment + signal-based triage queue — quantify judge/human agreement and route the most informative items to humans first via a signal-based triage queue. See Judge Alignment and Triage Queue.
eval_trace display — a three-pane agent-trace display purpose-built for continuous evaluation. See eval_trace.
Trajectory editing schemas — trajectory_edit and trajectory_correction for capturing edited/corrected agent trajectories as SFT/DPO training data. See Trajectory Correction.
Coding-agent eval — openai_tool_use coding backend, a fixed coding-agent eval pipeline and web trace-review UI, and agentic display defect fixes surfaced by screenshot/live verification.
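
To make "quantify judge/human agreement" concrete, here is a minimal sketch using Cohen's kappa, one common chance-corrected agreement statistic. The release notes do not say which metric Potato computes, so treat the choice of kappa, the function names, and the sample labels as assumptions for illustration only.

```python
from collections import Counter


def cohens_kappa(judge: list, human: list) -> float:
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Fraction of items where the two raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance from each rater's label frequencies.
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(judge: list, human: list) -> list:
    """Indices where judge and human differ: candidates for human review."""
    return [i for i, (j, h) in enumerate(zip(judge, human)) if j != h]


judge_labels = ["pass", "pass", "fail", "fail"]  # hypothetical LLM-judge labels
human_labels = ["pass", "fail", "fail", "fail"]  # hypothetical blind human labels
print(cohens_kappa(judge_labels, human_labels))   # 0.5
print(disagreements(judge_labels, human_labels))  # [1]
```

Raw agreement here is 0.75, but half of that is expected by chance given the label frequencies, which is why the kappa is lower.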

Annotation Workflow & Assignment

Heterogeneous coverage — per-item annotator caps, IAA reporting, and adjudication routing for tasks where items need different numbers of annotators.
Reclaim abandoned assignments — recover annotation assignments abandoned by Prolific or QC-blocked workers, with configurable retention policies and idempotent, transaction-isolated reclaim.
Custom Batch assignment strategy (#160) — assign predefined batches of items to specific annotators.
Phase page ordering — order pages within a phase from configuration; phase type is now inferred from the canonical phase name (#154).
Reverse-proxy URL prefixes — serve Potato under a sub-path behind a reverse proxy (#161).
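
The phrase "idempotent, transaction-isolated reclaim" can be sketched with Python's standard sqlite3 module. This is a generic illustration, not Potato's implementation: the assignments table, its columns, the status values, and the cutoff parameter are all hypothetical. The point is that the reclaim runs inside one explicit transaction and only touches rows that are still assigned, so running it a second time changes nothing.

```python
import sqlite3


def reclaim_abandoned(conn: sqlite3.Connection, cutoff: float) -> int:
    """Release assignments last seen before `cutoff`; return how many changed."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading or writing
    try:
        cur = conn.execute(
            "UPDATE assignments SET status = 'available', annotator = NULL "
            "WHERE status = 'assigned' AND last_seen < ?",
            (cutoff,),
        )
        conn.execute("COMMIT")
        return cur.rowcount
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN/COMMIT statements above fully control the transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE assignments ("
    "item TEXT PRIMARY KEY, annotator TEXT, status TEXT, last_seen REAL)"
)
conn.executemany(
    "INSERT INTO assignments VALUES (?, ?, ?, ?)",
    [
        ("a", "w1", "assigned", 100.0),  # stale, will be reclaimed
        ("b", "w2", "assigned", 900.0),  # recently active, kept
        ("c", "w3", "done", 50.0),       # finished work is never reclaimed
    ],
)

first = reclaim_abandoned(conn, cutoff=500.0)
second = reclaim_abandoned(conn, cutoff=500.0)
print(first, second)  # 1 0
```

The WHERE clause carries the idempotence: once a row has been released it no longer matches, so repeated or concurrent reclaim runs cannot release it twice.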

Licensing

Relicensed to GPL-3.0-or-later (from PolyForm Shield). setup.py, LICENSE, and the README badge all reflect the change.

Performance & Robustness

A large QA-hardening wave (F-022 through F-051):

Boot performance — the ML stack is no longer eager-loaded at server boot (F-051): import time ~6.5s→2s, 50k-item boot ~10s→5.7s, RSS ~750→365MB.
Schema rename — annotation_type: highlight → span with a migration (F-050).
Dynamic sources — ingested traces are now assignable to annotators under the dynamic-source quota (F-037); data_directory/data_sources-only configs validate correctly (F-038).
Route registration — registered 14 previously-dead admin/API routes and added an invariant test (F-042); registered the test reset_state route and stabilized the video-persistence suite (F-041).
Training phase — fixed the __main__/module-split 404 and wired the all-phases example (F-043/F-044).
Prolific & payloads — fixed the Prolific __main__-split (F-045) and hardened /updateinstance against non-dict payloads (F-046).
Persistence — persist display-based bounding boxes for document/PDF displays (F-040); stop populateInputValues clobbering restored runtime codebook codes.
Export & survey — stop silently dropping survey/phase responses (F-047); don't title-case author-written survey/consent labels (F-049).
Webhooks — fixed four save-path bugs (500 on save, miswired events).
Solo mode — backend fixes (B1–B7), UI polish, and fixed prompt optimization (F-022) so the optimizer is constructed and errors are no longer masked.
Active learning — LLM cold-start honors nested llm config and tolerates fenced JSON.
Search UX — matched terms highlight with <mark> instead of literal brackets.
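
For readers unfamiliar with FTS5, the following sketch shows how SQLite's built-in highlight() function can wrap matched terms in <mark> tags. It assumes a Python build whose SQLite includes the FTS5 extension, and the instances table and search helper are hypothetical names; the notes do not describe how Potato itself produces the markup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A one-column full-text index; requires SQLite compiled with FTS5.
conn.execute("CREATE VIRTUAL TABLE instances USING fts5(text)")
conn.executemany(
    "INSERT INTO instances(text) VALUES (?)",
    [
        ("The agent retried the failing tool call",),
        ("No tool was needed for this answer",),
        ("The user asked about licensing",),
    ],
)


def search(query: str) -> list:
    """Return matching rows with each matched term wrapped in <mark> tags."""
    rows = conn.execute(
        # highlight(table, column_index, open_text, close_text)
        "SELECT highlight(instances, 0, '<mark>', '</mark>') "
        "FROM instances WHERE instances MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


for hit in search("tool"):
    print(hit)
```

Note that highlight() does not HTML-escape the surrounding text, so a web UI would need to escape the stored text separately before rendering the result.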

Upgrade

pip install --upgrade potato-annotation==2.6.0
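
After upgrading, one way to confirm which version is active in the current environment is to query the installed package metadata from Python's standard library. The installed_version helper is a hypothetical name for this sketch; the package name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "potato-annotation") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("potato-annotation is not installed in this environment")
else:
    print(f"potato-annotation {found} is installed")
```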