Annotation Filtering
This guide covers how to filter data based on prior annotation decisions. This is useful for multi-phase annotation workflows where the output of one task feeds into another.
Common Use Cases
- Triage -> Full Annotation: Filter items that were "accepted" in a rapid triage phase
- Quality Control: Filter items that passed quality checks
- Expert Review: Filter items flagged for expert review
- Multi-stage Annotation: Chain annotation tasks together
Option 1: CLI Tool
The filter_by_annotation CLI tool filters data files based on prior annotations.
Basic Usage
# Filter for accepted items from triage
python -m potato.filter_by_annotation \
--annotations annotation_output/ \
--data data/items.json \
--schema data_quality \
--value accept \
--output accepted_items.json
CLI Options
| Option | Description |
|---|---|
| --annotations, -a | Path to annotation_output directory (required) |
| --data, -d | Path to original data file (JSON or JSONL) |
| --schema, -s | Name of the annotation schema to filter by (required) |
| --value, -v | Value(s) to filter for (e.g., accept or accept maybe) |
| --output, -o | Output file path for filtered data |
| --id-key | Key in data items containing the instance ID (default: id) |
| --invert | Invert filter: return items that DON'T match |
| --format | Output format: json or jsonl (default: json) |
| --summary | Show annotation summary instead of filtering |
| --verbose, -V | Enable verbose logging |
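The --format option chooses between two common serializations. As a rough illustration of the difference (a sketch only, not the tool's actual implementation; the helper name serialize_items is hypothetical), JSON output is a single array, while JSONL writes one object per line:

```python
import json

def serialize_items(items, fmt="json"):
    """Hypothetical helper: render items as a JSON array or as JSON Lines."""
    if fmt == "json":
        # One document: a single array holding every item.
        return json.dumps(items, indent=2)
    if fmt == "jsonl":
        # One compact JSON object per line, with no enclosing array.
        return "\n".join(json.dumps(item) for item in items)
    raise ValueError(f"unsupported format: {fmt!r}")

items = [
    {"id": "item_001", "text": "Hello world"},
    {"id": "item_003", "text": "Good data"},
]
print(serialize_items(items, "jsonl"))
```

JSONL is usually the easier choice for large files, since each line can be parsed independently.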
Examples
Filter for multiple values:
python -m potato.filter_by_annotation \
--annotations annotation_output/ \
--data data/items.json \
--schema quality \
--value good acceptable \
--output filtered.json
Invert filter (get rejected items):
python -m potato.filter_by_annotation \
--annotations annotation_output/ \
--data data/items.json \
--schema triage \
--value accept \
--invert \
--output rejected_items.json
Show annotation summary:
python -m potato.filter_by_annotation \
--annotations annotation_output/ \
--schema data_quality \
--summary
Output:
Annotation summary for schema 'data_quality':
----------------------------------------
accept: 150 (60.0%)
reject: 75 (30.0%)
skip: 25 (10.0%)
----------------------------------------
Total: 250
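The percentages in a summary like the one above are each value's share of the total number of annotations. A minimal sketch of that arithmetic, assuming a flat list of label strings (the summarize helper is hypothetical and is not part of Potato):

```python
from collections import Counter

def summarize(values):
    """Hypothetical helper: count each annotation value and its share of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() orders values from most to least frequent.
    return {
        value: (count, 100.0 * count / total)
        for value, count in counts.most_common()
    }

# Synthetic labels chosen to match the counts in the sample output above.
labels = ["accept"] * 150 + ["reject"] * 75 + ["skip"] * 25
for value, (count, pct) in summarize(labels).items():
    print(f"{value}: {count} ({pct:.1f}%)")
print(f"Total: {len(labels)}")
```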
Option 2: Config-Based Filtering
Filter data automatically during server startup using configuration.
Configuration
In your config YAML, use a dict format for data_files with filter_by_prior_annotation:
data_files:
  - path: data/items.json
    filter_by_prior_annotation:
      annotation_dir: ../triage-task/annotation_output/
      schema: data_quality
      value: accept
Configuration Options
| Option | Type | Description |
|---|---|---|
| annotation_dir | string | Path to the annotation_output directory from the prior task |
| schema | string | Name of the annotation schema to filter by |
| value | string or list | Value(s) to filter for |
| invert | boolean | If true, return items that DON'T match (default: false) |
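Because value may be a single string or a list, and invert flips the result, the matching rule can be pictured as follows. This is a sketch of the semantics described in the table, not Potato's internal code; both helper names are hypothetical:

```python
def normalize_values(value):
    """Hypothetical helper: accept a single string or a list and return a set."""
    if isinstance(value, str):
        return {value}
    return set(value)

def keep(label, value, invert=False):
    """Decide whether an item annotated with `label` passes the filter."""
    matches = label in normalize_values(value)
    # With invert=True, keep exactly the items that do NOT match.
    return not matches if invert else matches

print(keep("good", ["good", "acceptable"]))
print(keep("rejected", "rejected", invert=True))
```

The string check comes first so that a value such as "accept" is treated as one label rather than as a sequence of characters.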
Examples
Filter for single value:
data_files:
  - path: data/all_items.json
    filter_by_prior_annotation:
      annotation_dir: ../triage/annotation_output/
      schema: triage
      value: accept
Filter for multiple values:
data_files:
  - path: data/all_items.json
    filter_by_prior_annotation:
      annotation_dir: ../quality-check/annotation_output/
      schema: quality
      value:
        - good
        - acceptable
Invert filter (exclude rejected items):
data_files:
  - path: data/all_items.json
    filter_by_prior_annotation:
      annotation_dir: ../review/annotation_output/
      schema: review_status
      value: rejected
      invert: true
Workflow Example: Triage -> Full Annotation
Step 1: Create Triage Task
# triage-task/config.yaml
annotation_task_name: "Data Quality Triage"
task_dir: .
data_files:
  - data/raw_items.json
item_properties:
  id_key: id
  text_key: text
annotation_schemes:
  - annotation_type: triage
    name: data_quality
    description: Is this data suitable for annotation?
    auto_advance: true
output_annotation_dir: annotation_output
Step 2: Run Triage
python potato/flask_server.py start triage-task/config.yaml -p 8000
Annotators rapidly accept/reject/skip items.
Step 3: Create Full Annotation Task
Option A: Using CLI to create filtered data file
# Filter accepted items
python -m potato.filter_by_annotation \
--annotations triage-task/annotation_output/ \
--data triage-task/data/raw_items.json \
--schema data_quality \
--value accept \
--output full-annotation-task/data/accepted_items.json
Then point the full annotation task at the filtered file:
# full-annotation-task/config.yaml
annotation_task_name: "Full Annotation"
task_dir: .
data_files:
  - data/accepted_items.json  # Pre-filtered data
annotation_schemes:
  - annotation_type: span
    name: entities
    description: Annotate named entities
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION
output_annotation_dir: annotation_output
Option B: Using config-based filtering
# full-annotation-task/config.yaml
annotation_task_name: "Full Annotation"
task_dir: .
data_files:
  - path: ../triage-task/data/raw_items.json
    filter_by_prior_annotation:
      annotation_dir: ../triage-task/annotation_output/
      schema: data_quality
      value: accept
annotation_schemes:
  - annotation_type: span
    name: entities
    description: Annotate named entities
    labels:
      - PERSON
      - ORGANIZATION
      - LOCATION
output_annotation_dir: annotation_output
Output Format
The filtered output preserves all original fields from the input data:
Input (raw_items.json):
[
{"id": "item_001", "text": "Hello world", "category": "greeting"},
{"id": "item_002", "text": "Bad data", "category": "noise"},
{"id": "item_003", "text": "Good data", "category": "content"}
]
Triage annotations:
- item_001: accept
- item_002: reject
- item_003: accept
Output (accepted_items.json):
[
{"id": "item_001", "text": "Hello world", "category": "greeting"},
{"id": "item_003", "text": "Good data", "category": "content"}
]
All original fields (id, text, category) are preserved, making the filtered output immediately usable as input to another Potato task.
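The example above can be reproduced with a few lines of plain Python. This is only an illustration of the behavior, with a hypothetical filter_items function standing in for the real tool, and it assumes items without a recorded decision are dropped:

```python
def filter_items(items, labels, wanted, id_key="id"):
    """Hypothetical sketch: keep whole items whose label is in `wanted`.

    Assumption: items with no recorded label are dropped.
    """
    # Each kept item is returned unchanged, so every original field survives.
    return [item for item in items if labels.get(item[id_key]) in wanted]

raw_items = [
    {"id": "item_001", "text": "Hello world", "category": "greeting"},
    {"id": "item_002", "text": "Bad data", "category": "noise"},
    {"id": "item_003", "text": "Good data", "category": "content"},
]
triage = {"item_001": "accept", "item_002": "reject", "item_003": "accept"}

accepted = filter_items(raw_items, triage, {"accept"})
print([item["id"] for item in accepted])  # ['item_001', 'item_003']
```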
Python API
For programmatic use:
from potato.filter_by_annotation import (
    filter_items_by_annotation,
    get_annotation_summary,
    load_annotations_from_dir,
)

# Filter items
filtered = filter_items_by_annotation(
    annotation_dir="annotation_output/",
    data_file="data/items.json",
    schema_name="triage",
    filter_value="accept",
    id_key="id",
)

# Get summary
summary = get_annotation_summary("annotation_output/", "triage")
print(summary)  # {'accept': 100, 'reject': 50, 'skip': 25}

# Load raw annotations
annotations = load_annotations_from_dir("annotation_output/")
# Returns: {'item_001': {'triage': {'name': 'accept', 'value': 'accept'}}, ...}
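When working with the raw annotations, it is often handy to reduce them to one decision per instance for a single schema. The sketch below operates on a hand-written dictionary in the shape shown in the comment above. The helper labels_for_schema is hypothetical, and reading the decision from the "value" key is an assumption that the sample suggests but does not confirm:

```python
def labels_for_schema(annotations, schema):
    """Hypothetical helper: map instance ID -> decision for one schema."""
    labels = {}
    for item_id, by_schema in annotations.items():
        entry = by_schema.get(schema)
        if entry is not None:
            # Assumption: the decision is stored under the "value" key.
            labels[item_id] = entry["value"]
    return labels

# Hand-written sample in the shape shown for load_annotations_from_dir.
annotations = {
    "item_001": {"triage": {"name": "accept", "value": "accept"}},
    "item_002": {"triage": {"name": "reject", "value": "reject"}},
    "item_003": {"quality": {"name": "good", "value": "good"}},
}
print(labels_for_schema(annotations, "triage"))
```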
Troubleshooting
No items after filtering
Check the annotation summary to see what values exist:
python -m potato.filter_by_annotation \
--annotations annotation_output/ \
--schema YOUR_SCHEMA \
--summary
Items not matching
Ensure the schema name matches exactly (case-sensitive) and the annotation_dir points to the correct location.
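One quick way to spot a case mismatch is to list the schema names that actually occur in the annotations and compare them with the name you requested. The sketch below assumes annotations are available as a dictionary keyed by instance ID and then by schema name, as in the Python API section; both helper names are hypothetical:

```python
def schema_names(annotations):
    """Hypothetical helper: every schema name that appears in the annotations."""
    return sorted({name for by_schema in annotations.values() for name in by_schema})

def case_mismatches(annotations, requested):
    """Names that differ from `requested` only by letter case."""
    return [
        name for name in schema_names(annotations)
        if name != requested and name.lower() == requested.lower()
    ]

# Hand-written sample data for illustration.
annotations = {
    "item_001": {"Data_Quality": {"name": "accept", "value": "accept"}},
    "item_002": {"Data_Quality": {"name": "reject", "value": "reject"}},
}
print(schema_names(annotations))
print(case_mismatches(annotations, "data_quality"))
```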
ID mismatch
Verify the id_key matches between your data file and annotations. Both should use the same field name (default: id).
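To see where the IDs diverge, compare the set of IDs in the data file with the set of annotated instance IDs. This is a diagnostic sketch under the assumption that annotations are keyed by instance ID; compare_ids is a hypothetical helper, not part of Potato:

```python
def compare_ids(items, annotations, id_key="id"):
    """Hypothetical helper: report how data IDs and annotated IDs overlap."""
    # Items lacking `id_key` contribute nothing, which exposes a wrong key name.
    data_ids = {item[id_key] for item in items if id_key in item}
    annotated_ids = set(annotations)
    return {
        "matched": sorted(data_ids & annotated_ids, key=str),
        "only_in_data": sorted(data_ids - annotated_ids, key=str),
        "only_in_annotations": sorted(annotated_ids - data_ids, key=str),
    }

# Hand-written sample: the data uses "doc_id" rather than the default "id".
items = [{"doc_id": "a"}, {"doc_id": "b"}]
annotations = {"a": {}, "c": {}}
print(compare_ids(items, annotations))                   # wrong key: nothing matches
print(compare_ids(items, annotations, id_key="doc_id"))  # correct key
```

IDs are compared without type conversion, so an integer ID in the data and a string ID in the annotations will show up as a mismatch rather than being silently reconciled.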