Choosing the Right Annotation Type
With 36 annotation schema types, Potato covers most annotation paradigms used in NLP research, LLM evaluation, survey methodology, and crowdsourcing. This guide helps you choose the right schema for your task.
Decision Flowchart
What kind of judgment do you need?
│
├─ CLASSIFY items into categories
│ ├─ One label per item?
│ │ ├─ Few options (2-10) → radio
│ │ ├─ Many options (10+) → select (dropdown)
│ │ └─ Quick accept/reject → triage
│ ├─ Multiple labels per item?
│ │ ├─ Flat list → multiselect
│ │ └─ Hierarchical tree → hierarchical_multiselect
│ └─ Per-item confidence? → Add confidence schema after primary
│
├─ RATE / SCORE items
│ ├─ Single dimension
│ │ ├─ Discrete points (1-7) → likert
│ │ ├─ Discrete slider with steps → slider
│ │ ├─ Continuous (no tick marks) → vas
│ │ └─ Acceptable range (min-max) → range_slider
│ ├─ Multiple dimensions
│ │ ├─ Multiple items, same scale → multirate
│ │ ├─ One item, multiple criteria → rubric_eval
│ │ └─ Bipolar adjective pairs → semantic_differential
│ └─ Per-rating justification? → Add text schema with target_schema
│
├─ COMPARE items
│ ├─ Two items → pairwise
│ ├─ Best + worst from set → bws
│ ├─ Full ordering → ranking
│ └─ Multi-attribute profiles → conjoint
│
├─ DISTRIBUTE / ALLOCATE
│ ├─ Probability across labels → soft_label
│ └─ Fixed budget of points → constant_sum
│
├─ ANNOTATE TEXT STRUCTURE
│ ├─ Label spans in text → span
│ ├─ Relationships between spans → span_link
│ ├─ Coreference chains → coreference
│ ├─ Event triggers + arguments → event_annotation
│ ├─ Answer a question from passage → extractive_qa
│ └─ Mark errors with type/severity → error_span
│
├─ PRODUCE / EDIT TEXT
│ ├─ Free text input → text
│ └─ Edit existing text with diff → text_edit
│
├─ ORGANIZE / GROUP
│ ├─ Sort items into groups → card_sort
│ ├─ Order items → ranking
│ └─ Select from hierarchy → hierarchical_multiselect
│
└─ ANNOTATE MEDIA
  ├─ Images (bbox, polygon, landmarks) → image_annotation
  ├─ Audio (segments, labels) → audio_annotation
  ├─ Video (temporal segments, tracking) → video_annotation
  └─ Multi-tier time-aligned → tiered_annotation
Quick Reference Table
| Type Key | Description | Output | Typical Use Case |
|---|---|---|---|
| radio | Single-choice radio buttons | {"label": "value"} | Sentiment, intent classification |
| multiselect | Multiple-choice checkboxes | {"label1": "val", "label2": "val"} | Multi-label classification |
| select | Dropdown selection | {"label": "value"} | Many categories (10+) |
| likert | Discrete point scale | {"label": "3"} | Agreement, quality rating |
| slider | Numeric slider with steps | {"label": "75"} | Bounded numeric judgments |
| vas | Continuous analog scale | {"label": "67.3"} | Fine-grained magnitude estimation |
| range_slider | Dual-thumb range | {"low": "30", "high": "70"} | Acceptable range annotation |
| text | Free text input / textarea | {"text_box": "..."} | Open-ended responses, rationales |
| number | Numeric input field | {"label": "42"} | Count, quantity annotation |
| span | Text span highlighting | via span API | NER, POS tagging |
| span_link | Relationships between spans | via span API | Relation extraction |
| coreference | Coreference chains | via span API | Entity coreference |
| event_annotation | Event triggers + arguments | via event API | Event extraction |
| extractive_qa | Answer span in passage | {"answer_text", "start", "end"} | Reading comprehension |
| error_span | Error spans with type/severity | {"errors": [...], "score": N} | MQM translation evaluation |
| pairwise | Compare two items | {"label": "A"} | Model comparison |
| bws | Best-worst from a set | {"best": "X", "worst": "Y"} | Relative scaling |
| ranking | Drag-and-drop ordering | {"order": "a,b,c"} | Preference ranking |
| conjoint | Choose from multi-attribute profiles | {"chosen_profile": 2} | Attribute importance |
| multirate | Rate multiple items on a scale | {"item1": "3", "item2": "5"} | Batch rating |
| rubric_eval | Multi-criteria rating grid | {"crit1": "4", "crit2": "5"} | LLM evaluation rubrics |
| semantic_differential | Bipolar adjective scales | {"pair1": "3"} | Connotative meaning |
| soft_label | Probability distribution sliders | {"label1": "60", "label2": "40"} | Uncertainty capture |
| confidence | Confidence meta-annotation | {"value": "4"} | Annotator certainty |
| constant_sum | Fixed-budget point allocation | {"label1": "30", "label2": "70"} | Relative importance |
| text_edit | Edit text with diff tracking | {"edited_text", "edit_distance"} | MT post-editing |
| card_sort | Drag items into groups | {"group1": ["a","b"]} | Taxonomy, IA testing |
| hierarchical_multiselect | Tree-structured label selection | {"selected": "path"} | Deep taxonomies |
| triage | Accept/reject/skip | {"decision": "accept"} | Rapid data curation |
| image_annotation | Bbox, polygon, landmarks | JSON annotation data | Object detection |
| audio_annotation | Audio segment labeling | JSON annotation data | Speech, music analysis |
| video_annotation | Video temporal annotation | JSON annotation data | Activity recognition |
| tiered_annotation | Multi-tier time-aligned | JSON annotation data | ELAN-style annotation |
| pure_display | Display-only content | (none) | Instructions, headers |
| video | Video player | (none) | Video display |
| tree_annotation | Conversation tree | JSON annotation data | Dialogue analysis |
By Research Goal
"I need to classify items"
| If you need... | Use | Why |
|---|---|---|
| One label per item, few options | radio | Simple, supports keyboard shortcuts |
| One label, many options | select | Dropdown saves space |
| Quick binary decision | triage | Accept/reject with one click |
| Multiple labels per item | multiselect | Checkboxes for independent labels |
| Labels from a deep hierarchy | hierarchical_multiselect | Tree with search and auto-propagation |
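To make the first and fourth rows concrete, here is a minimal configuration sketch. The `annotation_type`, `name`, and `labels` keys follow the radio examples under Combining Schemas; that `multiselect` accepts the same `labels` key is an assumption, and the schema names and label values are hypothetical.

```yaml
annotation_schemes:
  # One label per item, few options: radio
  - annotation_type: radio
    name: intent                 # hypothetical schema name
    labels: ["Question", "Request", "Complaint"]

  # Multiple independent labels per item: multiselect
  # Assumption: multiselect takes the same `labels` key as radio
  - annotation_type: multiselect
    name: topics                 # hypothetical schema name
    labels: ["Billing", "Shipping", "Returns", "Other"]
```

Both schemas can sit in the same task, since Potato supports multiple annotation schemas per task. Confirm each key against the schema's reference page before relying on it.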
"I need to rate or score items"
| If you need... | Use | Why |
|---|---|---|
| Discrete points (1-5, 1-7) | likert | Standard survey scale |
| Numeric value with steps | slider | Visual, bounded |
| Continuous, no discrete bins | vas | Psychophysical precision |
| An acceptable range | range_slider | Dual-thumb min/max |
| Rate on multiple criteria | rubric_eval | Grid layout, ideal for LLM eval |
| Rate multiple items on one scale | multirate | Items × options matrix |
| Bipolar adjective pairs | semantic_differential | Warm-Cold, Good-Bad scales |
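As a rough sketch of the first row, a 5-point likert schema might look like the following. The `size`, `min_label`, and `max_label` keys are assumptions that do not appear elsewhere in this guide, so verify them against the likert schema reference. The schema name and endpoint labels are hypothetical.

```yaml
annotation_schemes:
  # Single dimension, discrete points: likert
  - annotation_type: likert
    name: fluency                      # hypothetical schema name
    description: "How fluent is this response?"
    size: 5                            # assumption: number of scale points
    min_label: "Not fluent at all"     # assumption: label for the lowest point
    max_label: "Perfectly fluent"      # assumption: label for the highest point
```

If you need the same kind of scale across several criteria for one item, rubric_eval (shown under Combining Schemas) is likely the better fit than stacking several likert schemas.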
"I need to compare items"
| If you need... | Use | Why |
|---|---|---|
| Compare exactly 2 items | pairwise | A vs B (binary or scale) |
| Best and worst from a set | bws | Best-worst scaling |
| Full preference ordering | ranking | Drag-and-drop reorder |
| Choose among multi-attribute profiles | conjoint | Attribute importance estimation |
"I need to distribute or allocate"
| If you need... | Use | Why |
|---|---|---|
| Probability across labels | soft_label | Constrained sliders summing to 100% |
| Fixed budget of points | constant_sum | Allocate N points across categories |
"I need to annotate text structure"
| If you need... | Use | Why |
|---|---|---|
| Label spans (NER, POS) | span | Multi-label span highlighting |
| Relationships between spans | span_link | Directed edges between spans |
| Coreference chains | coreference | Group mentions of same entity |
| Event extraction | event_annotation | Triggers + typed arguments |
| Answer a question from text | extractive_qa | SQuAD-style QA |
| Mark errors with type/severity | error_span | MQM quality evaluation |
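The first row could be configured along these lines. This is a sketch only: it assumes the span schema takes a `labels` list in the same way the radio examples under Combining Schemas do, and the schema name and entity labels are hypothetical.

```yaml
annotation_schemes:
  # Label spans in text (NER-style): span
  - annotation_type: span
    name: entities                          # hypothetical schema name
    # Assumption: span accepts a flat `labels` list like radio
    labels: ["PERSON", "ORG", "LOCATION"]
```

For error marking with a type taxonomy and severity, prefer error_span, which has its own `error_types` structure (see Error Annotation + Post-Edit under Combining Schemas).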
"I need annotators to produce or edit text"
| If you need... | Use | Why |
|---|---|---|
| Free text response | text | Textarea with optional min_chars |
| Edit existing text with change tracking | text_edit | Diff visualization + edit distance |
| Justification for another annotation | text with target_schema | Visual grouping as rationale |
"I need to evaluate AI outputs"
| If you need... | Use | Why |
|---|---|---|
| Multi-criteria rubric | rubric_eval | MT-Bench-style evaluation |
| A vs B comparison | pairwise | Which response is better |
| Rank multiple outputs | ranking | Preference ordering |
| Identify specific errors | error_span | MQM-style error annotation |
| Post-edit model output | text_edit | Correction with diff |
| Rate confidence in evaluation | confidence | Meta-annotation |
"I need to capture uncertainty"
| If you need... | Use | Why |
|---|---|---|
| Per-annotation confidence | confidence | Meta-annotation schema |
| Probability distribution | soft_label | Label probability sliders |
| Acceptable range | range_slider | Min-max bounds |
| Fine-grained continuous rating | vas | No discrete anchoring |
Head-to-Head Comparisons
likert vs slider vs vas
- likert: Discrete points (1-7) with labeled buttons. Best for standard survey scales where distinct categories matter.
- slider: Discrete steps along a track with visible tick marks and value display. Best for bounded numeric values.
- vas: Continuous line with endpoint labels only, no tick marks. Best for magnitude estimation where you want maximum precision without anchoring to discrete values.
radio vs select vs triage
- radio: Visible buttons, supports keyboard shortcuts. Best for 2-10 options that annotators need to see.
- select: Dropdown, compact. Best for 10+ options where space matters.
- triage: Binary accept/reject with optional skip. Best for rapid data curation.
pairwise vs bws vs ranking vs conjoint
- pairwise: Compare exactly 2 items. Simplest, most reliable.
- bws: Select best + worst from 3-5 items. Efficient relative scaling.
- ranking: Full ordering of all items. Most informative but cognitively demanding.
- conjoint: Choose among multi-attribute profiles. Best for attribute importance.
multirate vs rubric_eval vs semantic_differential
- multirate: Rate multiple items (from data) on the same set of options. Rows = items, columns = options.
- rubric_eval: Rate one item on multiple criteria. Rows = criteria (from config), columns = scale points.
- semantic_differential: Rate on bipolar adjective pairs. Rows = adjective pairs, columns = scale points between poles.
span vs extractive_qa vs error_span
- span: General multi-label span annotation (NER, POS, etc.). Multiple spans, multiple categories.
- extractive_qa: Single answer span for a specific question. One span at a time, with "unanswerable" option.
- error_span: Error spans with type taxonomy and severity. Multiple spans, each with type + severity + quality score.
text vs text_edit
- text: Free text input for new content (responses, rationales, translations from scratch).
- text_edit: Edit existing text with change tracking (post-editing, correction, simplification).
soft_label vs constant_sum
- soft_label: Probability distribution. Sliders auto-normalize to 100%. For label uncertainty.
- constant_sum: Fixed budget allocation. Manual balancing. For relative importance judgments.
multiselect vs hierarchical_multiselect
- multiselect: Flat list of checkboxes. For independent labels without hierarchy.
- hierarchical_multiselect: Tree with expand/collapse, search, auto-propagation. For deep taxonomies.
ranking vs card_sort
- ranking: Order items along one dimension (preference, relevance).
- card_sort: Group items into categories (topic, similarity). No ordering within groups.
Combining Schemas
Potato supports multiple annotation schemas per task. Common patterns:
Classification + Confidence
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: ["Positive", "Negative", "Neutral"]
  - annotation_type: confidence
    name: confidence
    target_schema: sentiment
Classification + Rationale
annotation_schemes:
  - annotation_type: radio
    name: toxicity
    labels: ["Toxic", "Not toxic"]
  - annotation_type: text
    name: rationale
    description: "Why did you choose this label?"
    target_schema: toxicity
    min_chars: 10
    show_char_count: true
    collapsible: true
    multiline: true
    rows: 3
Multi-Dimensional LLM Evaluation
annotation_schemes:
  - annotation_type: rubric_eval
    name: quality
    criteria:
      - name: helpfulness
      - name: accuracy
      - name: safety
    scale_points: 5
    show_overall: true
  - annotation_type: confidence
    name: eval_confidence
    target_schema: quality
  - annotation_type: text
    name: justification
    description: "Explain your ratings"
    target_schema: quality
    min_chars: 20
    show_char_count: true
    multiline: true
    rows: 4
Error Annotation + Post-Edit
annotation_schemes:
  - annotation_type: error_span
    name: errors
    error_types:
      - name: Accuracy
        subtypes: ["Omission", "Mistranslation"]
      - name: Fluency
        subtypes: ["Grammar", "Spelling"]
    show_score: true
  - annotation_type: text_edit
    name: correction
    source_field: "mt_output"
    show_diff: true