Quality Control Features

Potato provides comprehensive quality control features to ensure high-quality annotations in your projects. This guide covers five key features:

Attention Checks - Verify annotator engagement with known-answer items
Gold Standards - Track accuracy against expert-labeled items
Pre-annotation Support - Pre-fill forms with model predictions
Agreement Metrics - Calculate inter-annotator agreement in real-time
Step-Level QC - Per-step quality control for agent trace evaluation

Attention Checks

Attention checks are items with known correct answers that are periodically injected into the annotation flow to verify that annotators are paying attention and not randomly clicking.

Configuration

attention_checks:
  enabled: true

  # Path to JSON file containing attention check items
  items_file: "attention_checks.json"

  # How often to inject attention checks (choose one):
  frequency: 10              # Insert one every 10 items
  # OR
  probability: 0.1           # 10% chance per item

  # Optional: flag suspiciously fast responses
  min_response_time: 3.0     # Flag if answered in < 3 seconds

  # Failure handling
  failure_handling:
    warn_threshold: 2        # Show warning after 2 failures
    warn_message: "Please read items carefully before answering."
    block_threshold: 5       # Block user after 5 failures
    block_message: "You have been blocked due to too many incorrect responses."

Attention Check Items File Format

Create a JSON file with your attention check items:

[
  {
    "id": "attn_001",
    "text": "Please select 'Positive' for this item to verify you are reading carefully.",
    "expected_answer": {
      "sentiment": "positive"
    }
  },
  {
    "id": "attn_002",
    "text": "This is a test item. The correct answer is 'Negative'. Please select it now.",
    "expected_answer": {
      "sentiment": "negative"
    }
  }
]

Fields: - id (required): Unique identifier for the attention check - text (required): The text to display to annotators - expected_answer (required): Dictionary mapping schema names to expected values

How It Works

Attention check items are loaded at server startup
Based on frequency or probability, checks are injected into the annotation flow
When an annotator submits a response, it's compared to the expected answer
Failures are tracked per-user
Warnings and blocks are triggered at configured thresholds

Admin Dashboard

View attention check statistics in the admin dashboard at /admin: - Overall pass/fail rates - Per-annotator statistics - Individual failure history

Gold Standards

Gold standards are expert-labeled items used to measure annotator accuracy. By default, gold standards are silent - results are recorded for admin review in the dashboard, but annotators don't see feedback. This allows you to track quality without influencing annotator behavior.

Configuration

gold_standards:
  enabled: true

  # Path to JSON file containing gold standard items
  items_file: "gold_standards.json"

  # How to use gold standards
  mode: "mixed"              # Options: training, mixed, separate
  # - training: Show only during training phase
  # - mixed: Mix into regular annotation (silent tracking)
  # - separate: Dedicated evaluation phase

  # For mixed mode, how often to inject
  frequency: 20              # Insert one every 20 items

  # Accuracy requirements (tracked in admin dashboard)
  accuracy:
    min_threshold: 0.7       # Minimum required accuracy (70%)
    evaluation_count: 10     # Evaluate after this many gold items

  # Feedback settings (disabled by default for silent tracking)
  # Enable for training scenarios where you want to give annotators feedback
  feedback:
    show_correct_answer: false  # Show correct answer after submission
    show_explanation: false     # Show explanation if provided

  # Auto-promotion: items become gold standards when annotators agree
  auto_promote:
    enabled: true
    min_annotators: 3          # Minimum annotators before checking
    agreement_threshold: 1.0   # 1.0 = unanimous, 0.8 = 80% agree

Gold Standard Items File Format

[
  {
    "id": "gold_001",
    "text": "The service was absolutely terrible and I will never return.",
    "gold_label": {
      "sentiment": "negative"
    },
    "explanation": "Strong negative language ('absolutely terrible', 'never return') clearly indicates negative sentiment.",
    "difficulty": "easy"
  },
  {
    "id": "gold_002",
    "text": "The food was okay but nothing special.",
    "gold_label": {
      "sentiment": "neutral"
    },
    "explanation": "Mixed signals balance to neutral sentiment.",
    "difficulty": "medium"
  }
]

Fields: - id (required): Unique identifier - text (required): The text to display - gold_label (required): Dictionary with correct annotations - explanation (optional): Explanation shown to annotators - difficulty (optional): Metadata for analysis

Feedback Display

After submitting a gold standard item, annotators see: - Whether their answer was correct or incorrect - The correct answer (if show_correct_answer: true) - An explanation (if show_explanation: true and explanation provided) - Accuracy warning if below threshold

Admin Dashboard

View gold standard metrics in the admin dashboard: - Overall accuracy across all annotators - Per-annotator accuracy tracking - Per-item difficulty analysis (which items are most often missed) - Users below accuracy threshold

Auto-Promotion to Gold Standard

You can configure Potato to automatically promote items to the gold standard pool when multiple annotators agree on the label. This is useful for: - Growing your gold standard pool organically - Identifying "easy" items where everyone agrees - Reducing the burden of manually creating gold standards

gold_standards:
  enabled: true
  items_file: "initial_gold_standards.json"  # Seed items (optional)

  auto_promote:
    enabled: true
    min_annotators: 3          # Wait for at least 3 annotators
    agreement_threshold: 1.0   # 1.0 = all must agree (unanimous)
                               # 0.8 = 80% must agree

How it works: 1. As items are annotated, the system tracks all responses 2. When min_annotators have annotated an item, agreement is checked 3. If agreement meets agreement_threshold, the item is promoted 4. Promoted items are added to the gold standard pool and used for future quality checks

Admin visibility: - View promoted items in /admin/api/quality_control - See "promotion candidates" (items close to threshold) - Track which items were auto-promoted vs. manually defined

Pre-annotation Support

Pre-annotation allows you to pre-fill annotation forms with model predictions, useful for: - Active learning workflows - Correcting model outputs - Bootstrapping from existing annotations

Configuration

pre_annotation:
  enabled: true

  # Field in data items containing predictions
  field: "predictions"

  # Can annotators change pre-filled values?
  allow_modification: true

  # Show confidence scores if available
  show_confidence: true

  # Highlight items below this confidence threshold
  highlight_low_confidence: 0.7

Data Format

Include predictions in your data items:

{
  "id": "item_001",
  "text": "I love this product!",
  "predictions": {
    "sentiment": "positive",
    "confidence": 0.92
  }
}

For span annotations:

{
  "id": "item_002",
  "text": "Apple announced new iPhone in California.",
  "predictions": {
    "entities": [
      {"start": 0, "end": 5, "label": "ORG", "confidence": 0.85},
      {"start": 27, "end": 37, "label": "LOC", "confidence": 0.91}
    ]
  }
}

How It Works

When an item is loaded, the predictions field is extracted
If the annotator hasn't already annotated this item, predictions are used to pre-fill the form
Annotators can modify the pre-filled values (if allow_modification: true)
Low-confidence items can be visually highlighted

Best Practices

Use pre-annotation for correction workflows where you have model predictions
Set allow_modification: true to let annotators fix errors
Use confidence thresholds to flag items needing more attention
Track modification rates to assess model quality

Agreement Metrics

Real-time inter-annotator agreement metrics are available in the admin dashboard, using Krippendorff's alpha.

Configuration

agreement_metrics:
  enabled: true

  # Minimum annotators per item for calculation
  min_overlap: 2

  # Auto-refresh settings
  auto_refresh: true
  refresh_interval: 60       # Seconds between updates

Interpreting Krippendorff's Alpha

Alpha Value	Interpretation
α ≥ 0.8	Good agreement - reliable for most purposes
0.67 ≤ α < 0.8	Tentative agreement - draw tentative conclusions
0.33 ≤ α < 0.67	Low agreement - review guidelines
α < 0.33	Poor agreement - significant issues

Admin Dashboard

The Agreement tab in the admin dashboard shows: - Overall average alpha across all schemas - Per-schema agreement metrics - Number of items evaluated - Metric type (nominal vs interval) - Human-readable interpretation

When to Use Different Metrics

The system automatically selects the appropriate metric: - Nominal metric: For categorical annotations (radio, multiselect) - Interval metric: For numeric annotations (likert, slider, number)

API Endpoints

Quality Control Metrics

GET /admin/api/quality_control

Returns:

{
  "enabled": true,
  "attention_checks": {
    "enabled": true,
    "total_checks": 50,
    "total_passed": 45,
    "total_failed": 5,
    "by_user": {
      "user1": {"passed": 10, "failed": 0, "pass_rate": 1.0},
      "user2": {"passed": 8, "failed": 2, "pass_rate": 0.8}
    }
  },
  "gold_standards": {
    "enabled": true,
    "total_evaluations": 30,
    "total_correct": 25,
    "by_user": {...},
    "by_item": {...}
  }
}

Agreement Metrics

GET /admin/api/agreement

Returns:

{
  "enabled": true,
  "overall": {
    "average_krippendorff_alpha": 0.75,
    "interpretation": "Tentative agreement"
  },
  "by_schema": {
    "sentiment": {
      "krippendorff_alpha": 0.82,
      "items_evaluated": 100,
      "interpretation": "Good agreement"
    }
  }
}

Example Configuration

Here's a complete example with all quality control features enabled:

annotation_task_name: "Sentiment Analysis with Quality Control"

# Main annotation scheme
annotation_schemes:
  - name: sentiment
    annotation_type: radio
    labels: [positive, negative, neutral]
    description: "Select the sentiment of the text"

# Quality Control Configuration
attention_checks:
  enabled: true
  items_file: "data/attention_checks.json"
  frequency: 15
  failure_handling:
    warn_threshold: 2
    block_threshold: 5

# Optional: discard all completed work from QC-blocked users.
# If omitted, completed annotations are preserved, while the failed
# attention-check response itself is not kept.
instance_reclaim:
  quality_control:
    preserve_completed_annotations: false

gold_standards:
  enabled: true
  items_file: "data/gold_standards.json"
  mode: mixed
  frequency: 25
  accuracy:
    min_threshold: 0.7
    evaluation_count: 5
  feedback:
    show_correct_answer: true
    show_explanation: true

pre_annotation:
  enabled: true
  field: "model_prediction"
  allow_modification: true

agreement_metrics:
  enabled: true
  min_overlap: 2
  refresh_interval: 60

Step-Level Quality Control

For agent trace evaluation and trajectory annotation tasks, Potato supports step-level quality control — checking annotator performance at the granularity of individual agent steps rather than whole instances.

Configuration

quality_control:
  step_level:
    enabled: true

    # Gold standards with known-correct step labels
    gold_standards_file: "data/gold_steps.json"

    # Attention checks at the step level
    attention_checks:
      enabled: true
      frequency: 5  # Insert attention check every N steps

Step-level QC is particularly useful for: - Trajectory evaluation tasks where each step has independent correctness - Process reward model annotation where per-step labels are critical - Code review tasks with multiple files to review

Step-level agreement metrics (Cohen's kappa) are computed per annotator pair for step-level schemas and available in the admin dashboard.

Troubleshooting

Attention checks not appearing

Verify items_file path is correct (relative to task directory)
Check that items have required fields (id, expected_answer)
Ensure frequency or probability is set

Gold standard feedback not showing

Check feedback.show_correct_answer is true
Verify items have gold_label field
Check browser console for JavaScript errors

Agreement metrics showing "No items with N+ annotators"

Ensure items have been annotated by multiple users
Reduce min_overlap if needed
Check that annotations are being saved correctly

Pre-annotations not appearing

Verify field matches the field name in your data
Check that predictions format matches expected schema
Ensure user hasn't already annotated the item (pre-annotations only appear for un-annotated items)