Release Notes: v2.6.0 — QDA Mode, LLM-as-Judge Calibration & Trajectory Editing

This release brings Potato into full qualitative-data-analysis (QDA) territory and deepens its agent-evaluation toolkit. It adds an interactive QDA Mode (universal persistence, memos, search, a living codebook, and cases), an LLM-as-judge calibration/alignment workflow with a signal-based triage queue, and trajectory-editing schemas for producing SFT/DPO training data. It also relicenses Potato to GPL-3.0-or-later and lands a large robustness wave. 84 commits since v2.5.0.

Qualitative Data Analysis (QDA) Mode

A new opt-in qda_mode turns Potato into a collaborative qualitative-coding environment. Enabling it auto-cascades sensible defaults (codebook open, memos on, cases on).

  • Universal SQLite persistence — a shared project database underpins the new QDA surfaces (memos, search index, codebook, cases) alongside the existing annotation store.
  • Memos — analyst memos with a sidebar UI, REST API, and a dedicated exporter flag, so reflective notes live next to the data.
  • Universal FTS5 search & claim — full-text search across instances with an annotator search-and-claim sidebar; admin search and a claim guard prevent double-work.
  • Living codebook + cases (QDA Phase 1) — a universal codebook with case grouping.
  • In-vivo coding — create a code straight from a text selection.
  • On-the-fly add — add codes in place with an in-form reconcile and a review worklist.
  • Retroactive curation — append-only merge/split of codes with an LLM propose-confirm flow.
  • Revision provenance & versioning — the codebook bumps a revision on any change, with a lightweight /version poll so clients pick up edits; provenance is tracked as an overlay separate from prompt-facing records.
  • Docs & examples — see QDA Mode; a composed qda-mode example plus per-feature example READMEs are included.
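
To make the search-and-claim idea concrete, here is a minimal sketch using Python's built-in sqlite3 module with an FTS5 virtual table. It is not Potato's actual schema or API: the table names (instance_fts, claims) and the functions search and claim are hypothetical, and the sketch assumes the local SQLite build includes FTS5. The claim guard is modeled as a primary-key constraint, so a second claim on the same instance is simply ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: an FTS5 index over instance text plus a claims table.
conn.execute("CREATE VIRTUAL TABLE instance_fts USING fts5(instance_id UNINDEXED, text)")
conn.execute("CREATE TABLE claims (instance_id TEXT PRIMARY KEY, annotator TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO instance_fts (instance_id, text) VALUES (?, ?)",
    [
        ("i1", "The nurse described burnout after night shifts"),
        ("i2", "Participants praised the flexible schedule"),
        ("i3", "Burnout was rarely discussed with managers"),
    ],
)
conn.commit()


def search(conn, query):
    """Return ids of instances matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT instance_id FROM instance_fts WHERE instance_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


def claim(conn, instance_id, annotator):
    """Claim an instance; return False if it has already been claimed."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO claims (instance_id, annotator) VALUES (?, ?)",
            (instance_id, annotator),
        )
    # rowcount is 0 when the primary key already existed and the insert was ignored.
    return cur.rowcount == 1


hits = search(conn, "burnout")
first_claim = claim(conn, "i1", "alice")
second_claim = claim(conn, "i1", "bob")
```

Letting the database enforce uniqueness keeps the guard correct even when two annotators click "claim" at nearly the same time, which an application-level check-then-insert would not guarantee.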

Agent Evaluation & LLM-as-Judge

  • LLM-as-judge auto-labeling + blind human calibration — auto-label trajectories with an LLM judge, then calibrate against blind human judgments to measure and tune trustworthiness. See Judge Calibration.
  • Judge↔human alignment + signal-based triage queue — quantify judge/human agreement and route the most informative items to humans first via a signal-based triage queue. See Judge Alignment and Triage Queue.
  • eval_trace display — a three-pane agent-trace display purpose-built for continuous evaluation. See eval_trace.
  • Trajectory editing schemas — trajectory_edit and trajectory_correction for capturing edited/corrected agent trajectories as SFT/DPO training data. See Trajectory Correction.
  • Coding-agent eval — an openai_tool_use coding backend, a fixed coding-agent eval pipeline and web trace-review UI, and fixes for agentic display defects surfaced by screenshot/live verification.
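
As an illustration of what quantifying judge/human agreement can look like, the sketch below computes Cohen's kappa between a judge's labels and blind human labels for the same items. These notes do not say which agreement statistic Potato reports, so the choice of kappa is an assumption, and the function name cohens_kappa and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(judge)
    # Fraction of items where judge and human gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge_labels, human_labels)
```

Here the raters agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5. Raw agreement alone would overstate how far the judge can be trusted when one label dominates.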

Annotation Workflow & Assignment

  • Heterogeneous coverage — per-item annotator caps, IAA reporting, and adjudication routing for tasks where items need different numbers of annotators.
  • Reclaim abandoned assignments — recover annotation assignments abandoned by Prolific or QC-blocked workers, with configurable retention policies and idempotent, transaction-isolated reclaim.
  • Custom Batch assignment strategy (#160) — assign predefined batches of items to specific annotators.
  • Phase page ordering — order pages within a phase from configuration; phase type is now inferred from the canonical phase name (#154).
  • Reverse-proxy URL prefixes — serve Potato under a sub-path behind a reverse proxy (#161).
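
The idempotent reclaim described above can be pictured with a small sqlite3 sketch. It assumes a hypothetical assignments table with status and last_activity columns and a hypothetical reclaim_abandoned function; Potato's real storage layout and retention policies may differ. The whole update runs in one transaction, and because it only touches rows still marked as assigned, running it a second time changes nothing.

```python
import sqlite3


def reclaim_abandoned(conn, cutoff):
    """Release assignments idle since before `cutoff`; return how many were released."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE assignments SET status = 'reclaimed' "
            "WHERE status = 'assigned' AND last_activity < ?",
            (cutoff,),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
conn.execute(
    "CREATE TABLE assignments "
    "(item_id TEXT, annotator TEXT, status TEXT, last_activity REAL)"
)
conn.executemany(
    "INSERT INTO assignments VALUES (?, ?, ?, ?)",
    [
        ("i1", "w1", "assigned", 100.0),   # idle since before the cutoff
        ("i2", "w2", "assigned", 900.0),   # still active
        ("i3", "w3", "completed", 50.0),   # finished work is never reclaimed
    ],
)
conn.commit()

first_pass = reclaim_abandoned(conn, cutoff=500.0)
second_pass = reclaim_abandoned(conn, cutoff=500.0)
```

The first pass releases one assignment and the second releases none, so a scheduled job can retry safely after a crash or timeout.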

Licensing

  • Relicensed to GPL-3.0-or-later (from PolyForm Shield). setup.py, LICENSE, and the README badge all reflect the change.

Performance & Robustness

A large QA-hardening wave (F-022 through F-051):

  • Boot performance — the ML stack is no longer eager-loaded at server boot (F-051): import time ~6.5s→2s, 50k-item boot ~10s→5.7s, RSS ~750→365MB.
  • Schema rename — annotation_type: highlight → span, with a migration (F-050).
  • Dynamic sources — ingested traces are now assignable to annotators under the dynamic-source quota (F-037); data_directory/data_sources-only configs validate correctly (F-038).
  • Route registration — registered 14 previously-dead admin/API routes and added an invariant test (F-042); registered the test reset_state route and stabilized the video-persistence suite (F-041).
  • Training phase — fixed the __main__/module-split 404 and wired the all-phases example (F-043/F-044).
  • Prolific & payloads — fixed the Prolific __main__-split (F-045) and hardened /updateinstance against non-dict payloads (F-046).
  • Persistence — persist display-based bounding boxes for document/PDF displays (F-040); stop populateInputValues clobbering restored runtime codebook codes.
  • Export & survey — stop silently dropping survey/phase responses (F-047); don't title-case author-written survey/consent labels (F-049).
  • Webhooks — fixed four save-path bugs (500 on save, miswired events).
  • Solo mode — backend fixes (B1–B7), UI polish, and fixed prompt optimization (F-022) so the optimizer is constructed and errors are no longer masked.
  • Active learning — LLM cold-start honors nested llm config and tolerates fenced JSON.
  • Search UX — matched terms highlight with <mark> instead of literal brackets.
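
For the active-learning item above, "tolerates fenced JSON" refers to LLM replies that wrap their JSON in a Markdown code fence. The following is a minimal sketch of one way to handle that, not Potato's actual parser: the function name parse_llm_json and the regular expression are assumptions made for illustration.

```python
import json
import re

# A Markdown code fence, built from single backticks so it can sit inside this example.
FENCE = "`" * 3

# Opening fence, optional language tag, the payload, then the closing fence.
_FENCED = re.compile(
    re.escape(FENCE) + r"[a-zA-Z]*\s*\n(.*?)\n?\s*" + re.escape(FENCE),
    re.DOTALL,
)


def parse_llm_json(reply):
    """Parse JSON from an LLM reply, unwrapping a Markdown code fence if present."""
    match = _FENCED.search(reply)
    payload = match.group(1) if match else reply
    return json.loads(payload.strip())


fenced_reply = "Here you go:\n" + FENCE + 'json\n{"label": "positive"}\n' + FENCE
plain_reply = '{"label": "negative"}'

fenced_result = parse_llm_json(fenced_reply)
plain_result = parse_llm_json(plain_reply)
```

Replies with no fence fall through to a plain json.loads call, so well-behaved models are unaffected. Malformed JSON still raises json.JSONDecodeError rather than being silently swallowed.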

Upgrade

pip install --upgrade potato-annotation==2.6.0