Annotation Filtering

This guide covers how to filter data based on prior annotation decisions, which is useful in multi-phase annotation workflows where the output of one task feeds into the next.

Common Use Cases

Triage -> Full Annotation: Keep only items that were "accepted" in a rapid triage phase
Quality Control: Keep only items that passed quality checks
Expert Review: Select items flagged for expert review
Multi-stage Annotation: Chain annotation tasks together

Option 1: CLI Tool

The filter_by_annotation CLI tool filters data files based on prior annotations.

Basic Usage

# Filter for accepted items from triage
python -m potato.filter_by_annotation \
    --annotations annotation_output/ \
    --data data/items.json \
    --schema data_quality \
    --value accept \
    --output accepted_items.json
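
When the filter runs as one step of a scripted pipeline, the same command can be assembled from Python. The sketch below only builds the argument list from the flags documented in this guide; `build_filter_command` is a hypothetical helper, not part of Potato, and actually executing the command assumes Potato is installed in the current environment.

```python
import sys


def build_filter_command(annotations, data, schema, values, output):
    """Assemble the CLI invocation as an argument list (hypothetical helper)."""
    return [
        sys.executable, "-m", "potato.filter_by_annotation",
        "--annotations", annotations,
        "--data", data,
        "--schema", schema,
        "--value", *values,  # one or more values, space-separated on the CLI
        "--output", output,
    ]


cmd = build_filter_command(
    "annotation_output/", "data/items.json", "data_quality",
    ["accept"], "accepted_items.json",
)
print(" ".join(cmd[1:]))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.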

CLI Options

Option	Description
`--annotations`, `-a`	Path to annotation_output directory (required)
`--data`, `-d`	Path to original data file (JSON or JSONL)
`--schema`, `-s`	Name of the annotation schema to filter by (required)
`--value`, `-v`	Value(s) to filter for (e.g., `accept` or `accept maybe`)
`--output`, `-o`	Output file path for filtered data
`--id-key`	Key in data items containing the instance ID (default: `id`)
`--invert`	Invert filter: return items that DON'T match
`--format`	Output format: `json` or `jsonl` (default: `json`)
`--summary`	Show annotation summary instead of filtering
`--verbose`, `-V`	Enable verbose logging
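
The `--data` and `--format` options refer to two file layouts: a JSON file holding a single array of items, and a JSONL file with one JSON object per line. The following is a minimal sketch of the difference using only the standard library; `dump_items` and `load_items` are hypothetical names for illustration and do not describe the tool's internals.

```python
import json


def dump_items(items, fmt="json"):
    """Serialize a list of items as a JSON array or as JSONL text."""
    if fmt == "jsonl":
        return "\n".join(json.dumps(item) for item in items) + "\n"
    return json.dumps(items, indent=2)


def load_items(text, fmt="json"):
    """Parse text produced by dump_items back into a list of dicts."""
    if fmt == "jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)


items = [
    {"id": "item_001", "text": "Hello world"},
    {"id": "item_002", "text": "Bad data"},
]

print(dump_items(items, "jsonl"))
```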

Examples

Filter for multiple values:

python -m potato.filter_by_annotation \
    --annotations annotation_output/ \
    --data data/items.json \
    --schema quality \
    --value good acceptable \
    --output filtered.json

Invert filter (get items whose triage value is not accept, such as rejected or skipped items):

python -m potato.filter_by_annotation \
    --annotations annotation_output/ \
    --data data/items.json \
    --schema triage \
    --value accept \
    --invert \
    --output rejected_items.json

Show annotation summary:

python -m potato.filter_by_annotation \
    --annotations annotation_output/ \
    --schema data_quality \
    --summary

Output:

Annotation summary for schema 'data_quality':
----------------------------------------
  accept: 150 (60.0%)
  reject: 75 (30.0%)
  skip: 25 (10.0%)
----------------------------------------
  Total: 250
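
Each percentage in the summary is that value's share of the total count. The sketch below reproduces the figures above from the raw counts; `format_summary` is an illustrative function and not the tool's own code.

```python
counts = {"accept": 150, "reject": 75, "skip": 25}


def format_summary(schema, counts):
    """Render counts as lines similar to the --summary output shown above."""
    total = sum(counts.values())
    lines = [f"Annotation summary for schema '{schema}':"]
    for value, n in counts.items():
        share = 100.0 * n / total if total else 0.0
        lines.append(f"  {value}: {n} ({share:.1f}%)")
    lines.append(f"  Total: {total}")
    return lines


summary_lines = format_summary("data_quality", counts)
print("\n".join(summary_lines))
```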

Option 2: Config-Based Filtering

Filter data automatically during server startup using configuration.

Configuration

In your config YAML, write the data_files entry as a dict with a path key and a filter_by_prior_annotation block, instead of a plain file path:

data_files:
  - path: data/items.json
    filter_by_prior_annotation:
      annotation_dir: ../triage-task/annotation_output/
      schema: data_quality
      value: accept

Configuration Options

Option	Type	Description
`annotation_dir`	string	Path to the annotation_output directory from the prior task
`schema`	string	Name of the annotation schema to filter by
`value`	string or list	Value(s) to filter for
`invert`	boolean	If true, return items that DON'T match (default: false)
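
Because `value` accepts either a string or a list and `invert` flips the result, the matching rule can be pictured as follows. This is a sketch of the semantics described in the table, assuming each item has a single annotated value for the schema; `keep_item` is a hypothetical name, and how Potato treats items with no prior annotation is not covered here.

```python
def keep_item(annotated_value, value, invert=False):
    """Return True if an item with this annotated value passes the filter."""
    # Normalize a single string to a one-element set so both forms behave alike.
    wanted = {value} if isinstance(value, str) else set(value)
    matched = annotated_value in wanted
    return not matched if invert else matched


filter_cfg = {"schema": "quality", "value": ["good", "acceptable"]}

print(keep_item("good", filter_cfg["value"]))
print(keep_item("poor", filter_cfg["value"]))
print(keep_item("rejected", "rejected", invert=True))
```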

Examples

Filter for single value:

data_files:
  - path: data/all_items.json
    filter_by_prior_annotation:
      annotation_dir: ../triage/annotation_output/
      schema: triage
      value: accept

Filter for multiple values:

data_files:
  - path: data/all_items.json
    filter_by_prior_annotation:
      annotation_dir: ../quality-check/annotation_output/
      schema: quality
      value:
        - good
        - acceptable

Invert filter (exclude rejected items):

data_files:
  - path: data/all_items.json
    filter_by_prior_annotation:
      annotation_dir: ../review/annotation_output/
      schema: review_status
      value: rejected
      invert: true

Workflow Example: Triage -> Full Annotation

Step 1: Create Triage Task

# triage-task/config.yaml
annotation_task_name: "Data Quality Triage"
task_dir: .

data_files:
  - data/raw_items.json

item_properties:
  id_key: id
  text_key: text

annotation_schemes:
  - annotation_type: triage
    name: data_quality
    description: Is this data suitable for annotation?
    auto_advance: true

output_annotation_dir: annotation_output

Step 2: Run Triage

python potato/flask_server.py start triage-task/config.yaml -p 8000

Annotators rapidly accept/reject/skip items.

Step 3: Create Full Annotation Task

Option A: Using CLI to create filtered data file

# Filter accepted items
python -m potato.filter_by_annotation \
    --annotations triage-task/annotation_output/ \
    --data triage-task/data/raw_items.json \
    --schema data_quality \
    --value accept \
    --output full-annotation-task/data/accepted_items.json

Then point the new task's config at the filtered file:

# full-annotation-task/config.yaml
annotation_task_name: "Full Annotation"
task_dir: .

data_files:
  - data/accepted_items.json  # Pre-filtered data

annotation_schemes:
  - annotation_type: span
    name: entities
    description: Annotate named entities
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION

output_annotation_dir: annotation_output

Option B: Using config-based filtering

# full-annotation-task/config.yaml
annotation_task_name: "Full Annotation"
task_dir: .

data_files:
  - path: ../triage-task/data/raw_items.json
    filter_by_prior_annotation:
      annotation_dir: ../triage-task/annotation_output/
      schema: data_quality
      value: accept

annotation_schemes:
  - annotation_type: span
    name: entities
    description: Annotate named entities
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION

output_annotation_dir: annotation_output

Output Format

The filtered output preserves all original fields from the input data:

Input (raw_items.json):

[
  {"id": "item_001", "text": "Hello world", "category": "greeting"},
  {"id": "item_002", "text": "Bad data", "category": "noise"},
  {"id": "item_003", "text": "Good data", "category": "content"}
]

Triage annotations:

- item_001: accept
- item_002: reject
- item_003: accept

Output (accepted_items.json):

[
  {"id": "item_001", "text": "Hello world", "category": "greeting"},
  {"id": "item_003", "text": "Good data", "category": "content"}
]

All original fields (id, text, category) are preserved, making the filtered output immediately usable as input to another Potato task.
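
The example above can be reproduced in a few lines of plain Python. This is a sketch of the filtering idea only: the triage decisions are written as a simple ID-to-value mapping for brevity, and `filter_items` is a hypothetical helper, not Potato's implementation.

```python
import json

raw_items = [
    {"id": "item_001", "text": "Hello world", "category": "greeting"},
    {"id": "item_002", "text": "Bad data", "category": "noise"},
    {"id": "item_003", "text": "Good data", "category": "content"},
]
triage = {"item_001": "accept", "item_002": "reject", "item_003": "accept"}


def filter_items(items, decisions, wanted, id_key="id"):
    """Keep whole items (all fields intact) whose decision is in `wanted`."""
    return [item for item in items if decisions.get(item[id_key]) in wanted]


accepted = filter_items(raw_items, triage, {"accept"})
print(json.dumps(accepted, indent=2))
```

Since each kept item is passed through unchanged, any extra fields in the input carry over to the output.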

Python API

For programmatic use:

from potato.filter_by_annotation import (
    filter_items_by_annotation,
    get_annotation_summary,
    load_annotations_from_dir,
)

# Filter items
filtered = filter_items_by_annotation(
    annotation_dir="annotation_output/",
    data_file="data/items.json",
    schema_name="triage",
    filter_value="accept",
    id_key="id"
)

# Get summary
summary = get_annotation_summary("annotation_output/", "triage")
print(summary)  # {'accept': 100, 'reject': 50, 'skip': 25}

# Load raw annotations
annotations = load_annotations_from_dir("annotation_output/")
# Returns: {'item_001': {'triage': {'name': 'accept', 'value': 'accept'}}, ...}
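
The structure shown for `load_annotations_from_dir` maps instance IDs to per-schema annotations. Assuming that shape holds (it is copied from the comment above, and output from multiple annotators may look different), the IDs for a given value can be collected as sketched below. The dict literal stands in for the real return value so the example runs without Potato installed, and `ids_with_value` is a hypothetical helper.

```python
annotations = {
    "item_001": {"triage": {"name": "accept", "value": "accept"}},
    "item_002": {"triage": {"name": "reject", "value": "reject"}},
    "item_003": {"triage": {"name": "accept", "value": "accept"}},
}


def ids_with_value(annotations, schema_name, wanted):
    """Return sorted instance IDs whose annotation for the schema equals `wanted`."""
    return sorted(
        instance_id
        for instance_id, by_schema in annotations.items()
        if by_schema.get(schema_name, {}).get("value") == wanted
    )


accepted_ids = ids_with_value(annotations, "triage", "accept")
print(accepted_ids)
```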

Troubleshooting

No items after filtering

Check the annotation summary to see what values exist:

python -m potato.filter_by_annotation \
    --annotations annotation_output/ \
    --schema YOUR_SCHEMA \
    --summary

Items not matching

Ensure the schema name matches exactly (case-sensitive) and the annotation_dir points to the correct location.
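
One quick way to spot a case mismatch is to compare the requested schema name against the names actually present in the annotations. The sketch below assumes the annotation structure shown in the Python API section; `suggest_schema_names` is a hypothetical diagnostic helper, not a Potato function.

```python
def suggest_schema_names(annotations, requested):
    """List schema names that differ from `requested` only by letter case."""
    present = {name for by_schema in annotations.values() for name in by_schema}
    if requested in present:
        return []  # exact match exists, nothing to suggest
    return sorted(n for n in present if n.lower() == requested.lower())


sample_annotations = {
    "item_001": {"data_quality": {"name": "accept", "value": "accept"}},
    "item_002": {"data_quality": {"name": "reject", "value": "reject"}},
}

print(suggest_schema_names(sample_annotations, "Data_Quality"))
```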

ID mismatch

Verify that the field named by id_key (--id-key on the CLI, default: id) exists in your data items and that its values match the instance IDs recorded in the annotations.
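
A set comparison makes ID mismatches easy to see. The sketch below is a hypothetical diagnostic, not part of Potato: it takes the data items and the annotated instance IDs as plain Python values and reports what appears on only one side, plus how many items lack the ID field entirely.

```python
def compare_ids(items, annotated_ids, id_key="id"):
    """Report IDs present on only one side (hypothetical diagnostic helper)."""
    data_ids = {item[id_key] for item in items if id_key in item}
    annotated = set(annotated_ids)
    return {
        "missing_id_key": sum(1 for item in items if id_key not in item),
        "only_in_data": sorted(data_ids - annotated, key=str),
        "only_in_annotations": sorted(annotated - data_ids, key=str),
    }


items = [{"id": "item_001"}, {"id": "item_002"}, {"doc_id": "item_003"}]
report = compare_ids(items, ["item_001", "item_003"])
print(report)
```

IDs are compared without type conversion, so an integer ID in the data and a string ID in the annotations show up as a mismatch, which is often the underlying problem.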