Release Notes: v2.6.0 — QDA Mode, LLM-as-Judge Calibration & Trajectory Editing
This release brings Potato into full qualitative-data-analysis (QDA) territory and deepens its agent-evaluation toolkit. It adds an interactive QDA Mode (universal persistence, memos, search, a living codebook, and cases), an LLM-as-judge calibration/alignment workflow with a signal-based triage queue, and trajectory-editing schemas for producing SFT/DPO training data. It also relicenses Potato to GPL-3.0-or-later and lands a large robustness wave. 84 commits since v2.5.0.
Qualitative Data Analysis (QDA) Mode
A new opt-in qda_mode turns Potato into a collaborative qualitative-coding environment. Enabling it auto-cascades sensible defaults (codebook open, memos on, cases on).
- Universal SQLite persistence — a shared project database underpins the new QDA surfaces (memos, search index, codebook, cases) alongside the existing annotation store.
- Memos — analyst memos with a sidebar UI, REST API, and a dedicated exporter flag, so reflective notes live next to the data.
- Universal FTS5 search & claim — full-text search across instances with an annotator search-and-claim sidebar; admin search and a claim guard prevent double-work.
- Living codebook + cases (QDA Phase 1) — a universal codebook with case grouping.
- In-vivo coding — create a code straight from a text selection.
- On-the-fly add — add codes in place with an in-form reconcile and a review worklist.
- Retroactive curation — append-only merge/split of codes with an LLM propose-confirm flow.
- Revision provenance & versioning — the codebook bumps a revision on any change, with a lightweight /version poll so clients pick up edits; provenance is tracked as an overlay separate from prompt-facing records.
- Docs & examples — see QDA Mode; a composed qda-mode example plus per-feature example READMEs are included.
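As a rough illustration of the append-only curation and revision bump described above, the sketch below models a codebook as an event log. The class and method names (`Codebook`, `add_code`, `merge`, `resolve`) are hypothetical and are not Potato's API. The sketch assumes only that a merge is recorded as a new event instead of rewriting earlier records, and that every change increments a revision number a client could poll.

```python
from dataclasses import dataclass, field


@dataclass
class Codebook:
    """Hypothetical append-only codebook; not Potato's actual data model."""

    revision: int = 0
    events: list = field(default_factory=list)      # append-only history
    redirects: dict = field(default_factory=dict)   # index derived from merge events

    def _record(self, kind, **payload):
        # Every change appends one event and bumps the revision.
        self.revision += 1
        self.events.append({"rev": self.revision, "kind": kind, **payload})

    def add_code(self, name):
        self._record("add", code=name)

    def merge(self, source, target):
        # Nothing is deleted: the merge is stored as a redirect plus an event.
        self.redirects[source] = target
        self._record("merge", source=source, target=target)

    def resolve(self, name):
        # Follow redirects so older annotations map to the surviving code.
        while name in self.redirects:
            name = self.redirects[name]
        return name


book = Codebook()
book.add_code("frustration")
book.add_code("annoyance")
book.merge("annoyance", "frustration")
print(book.revision, book.resolve("annoyance"))
```

A client that remembers the last revision it saw can compare it with the current one and refetch only when the number has changed, which is the general idea behind a lightweight version poll.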
Agent Evaluation & LLM-as-Judge
- LLM-as-judge auto-labeling + blind human calibration — auto-label trajectories with an LLM judge, then calibrate against blind human judgments to measure and tune trustworthiness. See Judge Calibration.
- Judge↔human alignment + signal-based triage queue — quantify judge/human agreement and route the most informative items to humans first via a signal-based triage queue. See Judge Alignment and Triage Queue.
- eval_trace display — a three-pane agent-trace display purpose-built for continuous evaluation. See eval_trace.
- Trajectory editing schemas — trajectory_edit and trajectory_correction for capturing edited/corrected agent trajectories as SFT/DPO training data. See Trajectory Correction.
- Coding-agent eval — openai_tool_use coding backend, a fixed coding-agent eval pipeline and web trace-review UI, and agentic display defect fixes surfaced by screenshot/live verification.
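To make "judge/human agreement" and "most informative items first" concrete, here is a minimal sketch. These notes do not say which agreement statistic or which triage signals Potato uses. Cohen's kappa is used here as one standard chance-corrected measure, and the per-item judge confidence is a hypothetical signal invented for the example.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Agreement expected if both raters labeled at random with their own rates.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(judge, human)

# Hypothetical triage signal: the judge's self-reported confidence per item.
confidence = [0.9, 0.55, 0.8, 0.95, 0.6, 0.7]
# Least-confident items are sent to human reviewers first.
triage_order = sorted(range(len(judge)), key=lambda i: confidence[i])

print(round(kappa, 3), triage_order)
```

In this toy data the judge and the human agree on 4 of 6 items, and chance agreement is 0.5, so kappa comes out at one third. A real triage queue could combine several signals; sorting by a single score is only the simplest version of the idea.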
Annotation Workflow & Assignment
- Heterogeneous coverage — per-item annotator caps, IAA reporting, and adjudication routing for tasks where items need different numbers of annotators.
- Reclaim abandoned assignments — recover annotation assignments abandoned by Prolific or QC-blocked workers, with configurable retention policies and idempotent, transaction-isolated reclaim.
- Custom Batch assignment strategy (#160) — assign predefined batches of items to specific annotators.
- Phase page ordering — order pages within a phase from configuration; phase type is now inferred from the canonical phase name (#154).
- Reverse-proxy URL prefixes — serve Potato under a sub-path behind a reverse proxy (#161).
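For readers unfamiliar with serving an app under a sub-path, the following sketch shows the generic WSGI pattern of moving a URL prefix from PATH_INFO to SCRIPT_NAME. It assumes a WSGI-style server and is not taken from Potato's code. `PrefixMiddleware`, `demo_app`, `call`, and the `/potato` prefix are all illustrative names, and how Potato itself is configured for this is not described in these notes.

```python
class PrefixMiddleware:
    """Serve a WSGI app under a URL prefix such as "/potato" (illustrative)."""

    def __init__(self, app, prefix):
        self.app = app
        self.prefix = "/" + prefix.strip("/")

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path == self.prefix or path.startswith(self.prefix + "/"):
            # Move the prefix from PATH_INFO to SCRIPT_NAME so routing sees
            # the bare path while generated URLs can keep the prefix.
            environ["SCRIPT_NAME"] = environ.get("SCRIPT_NAME", "") + self.prefix
            environ["PATH_INFO"] = path[len(self.prefix):] or "/"
            return self.app(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found under this prefix\n"]


def demo_app(environ, start_response):
    # Echo what the wrapped app sees after the middleware has run.
    start_response("200 OK", [("Content-Type", "text/plain")])
    body = f"{environ['SCRIPT_NAME']}|{environ['PATH_INFO']}"
    return [body.encode()]


def call(app, path):
    """Invoke a WSGI app with a minimal environ and return (status, body)."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app({"PATH_INFO": path, "SCRIPT_NAME": ""}, start_response))
    return seen["status"], body.decode()


wrapped = PrefixMiddleware(demo_app, "/potato")
print(call(wrapped, "/potato/annotate"))
print(call(wrapped, "/other"))
```

The reverse proxy forwards requests with the prefix intact, and the application then routes on the remaining path. Keeping the prefix in SCRIPT_NAME is what lets frameworks build links that still point under the sub-path.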
Licensing
- Relicensed to GPL-3.0-or-later (from PolyForm Shield). setup.py, LICENSE, and the README badge all reflect the change.
Performance & Robustness
A large QA-hardening wave (F-022 through F-051):
- Boot performance — the ML stack is no longer eager-loaded at server boot (F-051): import time ~6.5s→2s, 50k-item boot ~10s→5.7s, RSS ~750→365MB.
- Schema rename — annotation_type: highlight→span with a migration (F-050).
- Dynamic sources — ingested traces are now assignable to annotators under the dynamic-source quota (F-037); data_directory/data_sources-only configs validate correctly (F-038).
- Route registration — registered 14 previously-dead admin/API routes and added an invariant test (F-042); registered the test reset_state route and stabilized the video-persistence suite (F-041).
- Training phase — fixed the __main__/module-split 404 and wired the all-phases example (F-043/F-044).
- Prolific & payloads — fixed the Prolific __main__-split (F-045) and hardened /updateinstance against non-dict payloads (F-046).
- Persistence — persist display-based bounding boxes for document/PDF displays (F-040); stop populateInputValues clobbering restored runtime codebook codes.
- Export & survey — stop silently dropping survey/phase responses (F-047); don't title-case author-written survey/consent labels (F-049).
- Webhooks — fixed four save-path bugs (500 on save, miswired events).
- Solo mode — backend fixes (B1–B7), UI polish, and fixed prompt optimization (F-022) so the optimizer is constructed and errors are no longer masked.
- Active learning — LLM cold-start honors nested llm config and tolerates fenced JSON.
- Search UX — matched terms highlight with <mark> instead of literal brackets.
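The "tolerates fenced JSON" behaviour noted under Active learning refers to LLM replies that wrap their JSON in a Markdown code fence. The sketch below shows one way such tolerance could work. It is an assumption for illustration, not Potato's implementation, and `parse_llm_json` is a hypothetical helper name.

```python
import json
import re

TICKS = "`" * 3  # a Markdown code-fence delimiter, built without typing it literally
FENCE = re.compile(
    TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS,
    re.DOTALL | re.IGNORECASE,
)


def parse_llm_json(text):
    """Parse JSON from an LLM reply, with or without a Markdown code fence."""
    match = FENCE.search(text)
    # Use the fenced body when a fence is present, else the whole reply.
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


fenced_reply = (
    "Here you go:\n" + TICKS + 'json\n{"label": "positive", "score": 0.9}\n' + TICKS
)
bare_reply = '{"label": "negative"}'

print(parse_llm_json(fenced_reply))
print(parse_llm_json(bare_reply))
```

Replies that contain no valid JSON still raise `json.JSONDecodeError`, so a caller would decide whether to retry or fall back.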
Upgrade
pip install --upgrade potato-annotation==2.6.0