Choosing the Right Annotation Type

With 36 annotation schema types, Potato covers virtually every annotation paradigm used in NLP research, LLM evaluation, survey methodology, and crowdsourcing. This guide helps you choose the right schema for your task.

Decision Flowchart

What kind of judgment do you need?
│
├─ CLASSIFY items into categories
│  ├─ One label per item?
│  │  ├─ Few options (2-10) → radio
│  │  ├─ Many options (10+) → select (dropdown)
│  │  └─ Quick accept/reject → triage
│  ├─ Multiple labels per item?
│  │  ├─ Flat list → multiselect
│  │  └─ Hierarchical tree → hierarchical_multiselect
│  └─ Per-item confidence? → Add confidence schema after primary
│
├─ RATE / SCORE items
│  ├─ Single dimension
│  │  ├─ Discrete points (1-7) → likert
│  │  ├─ Discrete slider with steps → slider
│  │  ├─ Continuous (no tick marks) → vas
│  │  └─ Acceptable range (min-max) → range_slider
│  ├─ Multiple dimensions
│  │  ├─ Same items, different criteria → multirate
│  │  ├─ Different criteria, same scales → rubric_eval
│  │  └─ Bipolar adjective pairs → semantic_differential
│  └─ Per-rating justification? → Add text schema with target_schema
│
├─ COMPARE items
│  ├─ Two items → pairwise
│  ├─ Best + worst from set → bws
│  ├─ Full ordering → ranking
│  └─ Multi-attribute profiles → conjoint
│
├─ DISTRIBUTE / ALLOCATE
│  ├─ Probability across labels → soft_label
│  └─ Fixed budget of points → constant_sum
│
├─ ANNOTATE TEXT STRUCTURE
│  ├─ Label spans in text → span
│  ├─ Relationships between spans → span_link
│  ├─ Coreference chains → coreference
│  ├─ Event triggers + arguments → event_annotation
│  ├─ Answer a question from passage → extractive_qa
│  └─ Mark errors with type/severity → error_span
│
├─ PRODUCE / EDIT TEXT
│  ├─ Free text input → text
│  └─ Edit existing text with diff → text_edit
│
├─ ORGANIZE / GROUP
│  ├─ Sort items into groups → card_sort
│  ├─ Order items → ranking
│  └─ Select from hierarchy → hierarchical_multiselect
│
└─ ANNOTATE MEDIA
   ├─ Images (bbox, polygon, landmarks) → image_annotation
   ├─ Audio (segments, labels) → audio_annotation
   ├─ Video (temporal segments, tracking) → video_annotation
   └─ Multi-tier time-aligned → tiered_annotation

Quick Reference Table

Type Key	Description	Output	Typical Use Case
`radio`	Single-choice radio buttons	`{"label": "value"}`	Sentiment, intent classification
`multiselect`	Multiple-choice checkboxes	`{"label1": "val", "label2": "val"}`	Multi-label classification
`select`	Dropdown selection	`{"label": "value"}`	Many categories (10+)
`likert`	Discrete point scale	`{"label": "3"}`	Agreement, quality rating
`slider`	Numeric slider with steps	`{"label": "75"}`	Bounded numeric judgments
`vas`	Continuous analog scale	`{"label": "67.3"}`	Fine-grained magnitude estimation
`range_slider`	Dual-thumb range	`{"low": "30", "high": "70"}`	Acceptable range annotation
`text`	Free text input / textarea	`{"text_box": "..."}`	Open-ended responses, rationales
`number`	Numeric input field	`{"label": "42"}`	Count, quantity annotation
`span`	Text span highlighting	via span API	NER, POS tagging
`span_link`	Relationships between spans	via span API	Relation extraction
`coreference`	Coreference chains	via span API	Entity coreference
`event_annotation`	Event triggers + arguments	via event API	Event extraction
`extractive_qa`	Answer span in passage	`{"answer_text", "start", "end"}`	Reading comprehension
`error_span`	Error spans with type/severity	`{"errors": [...], "score": N}`	MQM translation evaluation
`pairwise`	Compare two items	`{"label": "A"}`	Model comparison
`bws`	Best-worst from a set	`{"best": "X", "worst": "Y"}`	Relative scaling
`ranking`	Drag-and-drop ordering	`{"order": "a,b,c"}`	Preference ranking
`conjoint`	Choose from multi-attribute profiles	`{"chosen_profile": 2}`	Attribute importance
`multirate`	Rate multiple items on a scale	`{"item1": "3", "item2": "5"}`	Batch rating
`rubric_eval`	Multi-criteria rating grid	`{"crit1": "4", "crit2": "5"}`	LLM evaluation rubrics
`semantic_differential`	Bipolar adjective scales	`{"pair1": "3"}`	Connotative meaning
`soft_label`	Probability distribution sliders	`{"label1": "60", "label2": "40"}`	Uncertainty capture
`confidence`	Confidence meta-annotation	`{"value": "4"}`	Annotator certainty
`constant_sum`	Fixed-budget point allocation	`{"label1": "30", "label2": "70"}`	Relative importance
`text_edit`	Edit text with diff tracking	`{"edited_text", "edit_distance"}`	MT post-editing
`card_sort`	Drag items into groups	`{"group1": ["a","b"]}`	Taxonomy, IA testing
`hierarchical_multiselect`	Tree-structured label selection	`{"selected": "path"}`	Deep taxonomies
`triage`	Accept/reject/skip	`{"decision": "accept"}`	Rapid data curation
`image_annotation`	Bbox, polygon, landmarks	JSON annotation data	Object detection
`audio_annotation`	Audio segment labeling	JSON annotation data	Speech, music analysis
`video_annotation`	Video temporal annotation	JSON annotation data	Activity recognition
`tiered_annotation`	Multi-tier time-aligned	JSON annotation data	ELAN-style annotation
`pure_display`	Display-only content	(none)	Instructions, headers
`video`	Video player	(none)	Video display
`tree_annotation`	Conversation tree	JSON annotation data	Dialogue analysis

By Research Goal

"I need to classify items"

If you need...	Use	Why
One label per item, few options	`radio`	Simple, supports keyboard shortcuts
One label, many options	`select`	Dropdown saves space
Quick binary decision	`triage`	Accept/reject with one click
Multiple labels per item	`multiselect`	Checkboxes for independent labels
Labels from a deep hierarchy	`hierarchical_multiselect`	Tree with search and auto-propagation

"I need to rate or score items"

If you need...	Use	Why
Discrete points (1-5, 1-7)	`likert`	Standard survey scale
Numeric value with steps	`slider`	Visual, bounded
Continuous, no discrete bins	`vas`	Psychophysical precision
An acceptable range	`range_slider`	Dual-thumb min/max
Rate on multiple criteria	`rubric_eval`	Grid layout, ideal for LLM eval
Rate multiple items on one scale	`multirate`	Items × options matrix
Bipolar adjective pairs	`semantic_differential`	Warm-Cold, Good-Bad scales

"I need to compare items"

If you need...	Use	Why
Compare exactly 2 items	`pairwise`	A vs B (binary or scale)
Best and worst from a set	`bws`	Best-worst scaling
Full preference ordering	`ranking`	Drag-and-drop reorder
Choose among multi-attribute profiles	`conjoint`	Attribute importance estimation

"I need to distribute or allocate"

If you need...	Use	Why
Probability across labels	`soft_label`	Constrained sliders summing to 100%
Fixed budget of points	`constant_sum`	Allocate N points across categories

"I need to annotate text structure"

If you need...	Use	Why
Label spans (NER, POS)	`span`	Multi-label span highlighting
Relationships between spans	`span_link`	Directed edges between spans
Coreference chains	`coreference`	Group mentions of same entity
Event extraction	`event_annotation`	Triggers + typed arguments
Answer a question from text	`extractive_qa`	SQuAD-style QA
Mark errors with type/severity	`error_span`	MQM quality evaluation

"I need annotators to produce or edit text"

If you need...	Use	Why
Free text response	`text`	Textarea with optional min_chars
Edit existing text with change tracking	`text_edit`	Diff visualization + edit distance
Justification for another annotation	`text` with `target_schema`	Visual grouping as rationale

"I need to evaluate AI outputs"

If you need...	Use	Why
Multi-criteria rubric	`rubric_eval`	MT-Bench-style evaluation
A vs B comparison	`pairwise`	Which response is better
Rank multiple outputs	`ranking`	Preference ordering
Identify specific errors	`error_span`	MQM-style error annotation
Post-edit model output	`text_edit`	Correction with diff
Rate confidence in evaluation	`confidence`	Meta-annotation

"I need to capture uncertainty"

If you need...	Use	Why
Per-annotation confidence	`confidence`	Meta-annotation schema
Probability distribution	`soft_label`	Label probability sliders
Acceptable range	`range_slider`	Min-max bounds
Fine-grained continuous rating	`vas`	No discrete anchoring

Head-to-Head Comparisons

likert vs slider vs vas

likert: Discrete points (1-7) with labeled buttons. Best for standard survey scales where distinct categories matter.
slider: Discrete steps along a track with visible tick marks and value display. Best for bounded numeric values.
vas: Continuous line with endpoint labels only, no tick marks. Best for magnitude estimation where you want maximum precision without anchoring to discrete values.

radio vs select vs triage

radio: Visible buttons, supports keyboard shortcuts. Best for 2-10 options that annotators need to see.
select: Dropdown, compact. Best for 10+ options where space matters.
triage: Binary accept/reject with optional skip. Best for rapid data curation.

pairwise vs bws vs ranking vs conjoint

pairwise: Compare exactly 2 items. Simplest, most reliable.
bws: Select best + worst from 3-5 items. Efficient relative scaling.
ranking: Full ordering of all items. Most informative but cognitively demanding.
conjoint: Choose among multi-attribute profiles. Best for attribute importance.

multirate vs rubric_eval vs semantic_differential

multirate: Rate multiple items (from data) on the same set of options. Rows = items, columns = options.
rubric_eval: Rate one item on multiple criteria. Rows = criteria (from config), columns = scale points.
semantic_differential: Rate on bipolar adjective pairs. Rows = adjective pairs, columns = scale points between poles.

span vs extractive_qa vs error_span

span: General multi-label span annotation (NER, POS, etc.). Multiple spans, multiple categories.
extractive_qa: Single answer span for a specific question. One span at a time, with "unanswerable" option.
error_span: Error spans with type taxonomy and severity. Multiple spans, each with type + severity + quality score.

text vs text_edit

text: Free text input for new content (responses, rationales, translations from scratch).
text_edit: Edit existing text with change tracking (post-editing, correction, simplification).

soft_label vs constant_sum

soft_label: Probability distribution. Sliders auto-normalize to 100%. For label uncertainty.
constant_sum: Fixed budget allocation. Manual balancing. For relative importance judgments.

multiselect vs hierarchical_multiselect

multiselect: Flat list of checkboxes. For independent labels without hierarchy.
hierarchical_multiselect: Tree with expand/collapse, search, auto-propagation. For deep taxonomies.

ranking vs card_sort

ranking: Order items along one dimension (preference, relevance).
card_sort: Group items into categories (topic, similarity). No ordering within groups.

Combining Schemas

Potato supports multiple annotation schemas per task. Common patterns:

Classification + Confidence

annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: ["Positive", "Negative", "Neutral"]
  - annotation_type: confidence
    name: confidence
    target_schema: sentiment

Classification + Rationale

annotation_schemes:
  - annotation_type: radio
    name: toxicity
    labels: ["Toxic", "Not toxic"]
  - annotation_type: text
    name: rationale
    description: "Why did you choose this label?"
    target_schema: toxicity
    min_chars: 10
    show_char_count: true
    collapsible: true
    multiline: true
    rows: 3

Multi-Dimensional LLM Evaluation

annotation_schemes:
  - annotation_type: rubric_eval
    name: quality
    criteria:
      - name: helpfulness
      - name: accuracy
      - name: safety
    scale_points: 5
    show_overall: true
  - annotation_type: confidence
    name: eval_confidence
    target_schema: quality
  - annotation_type: text
    name: justification
    description: "Explain your ratings"
    target_schema: quality
    min_chars: 20
    show_char_count: true
    multiline: true
    rows: 4

Error Annotation + Post-Edit

annotation_schemes:
  - annotation_type: error_span
    name: errors
    error_types:
      - name: Accuracy
        subtypes: ["Omission", "Mistranslation"]
      - name: Fluency
        subtypes: ["Grammar", "Spelling"]
    show_score: true
  - annotation_type: text_edit
    name: correction
    source_field: "mt_output"
    show_diff: true