Ranking / Drag-and-Drop

The ranking schema lets annotators reorder a list of items by dragging them into their preferred sequence. Unlike Best-Worst Scaling (BWS), which presents subsets sampled from a larger pool, ranking shows all candidate items at once and elicits a complete ordering. It is suitable when the full item set is small enough to compare holistically (typically 3–8 items).

Overview

Annotators see a vertical list of items with drag handles. They reorder the list from best (top) to worst (bottom) by dragging each item to its desired position. When allow_ties is enabled, items judged equal can be grouped at the same rank.

Key differences from Best-Worst Scaling:

Feature              | Ranking        | BWS
Items per annotation | All at once    | Subset (tuple)
Suitable pool size   | 3–8 items      | Any size
Ties allowed         | Optional       | No
Output               | Complete order | Best/worst pair
Annotation effort    | Higher         | Lower

Research Basis

  • Kiritchenko, S., & Mohammad, S. M. (2017). "Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation." ACL 2017. Compares BWS with rating scales for sentiment intensity annotation and finds the comparative method more reliable. This supports comparative elicitation such as ranking for small item sets, while full ranking does not scale to large pools.
  • Thurstone, L. L. (1927). "A Law of Comparative Judgment." Psychological Review 34(4). Foundational comparative judgment theory underlying ranking and paired comparison methods.

Configuration

Options

Option                     | Default | Description
annotation_type            | -       | Must be ranking
name                       | -       | Schema identifier (required)
description                | -       | Task instruction
labels                     | -       | List of item names to rank (minimum 2; required unless items_key is set)
allow_ties                 | false   | Allow annotators to place items at equal rank
items_key                  | null    | Data field containing dynamic items (overrides labels)
sequential_key_binding     | false   | Enable keyboard shortcut reordering
label_requirement.required | false   | Require a complete ranking before proceeding

YAML Example — Static Item List

annotation_schemes:
  - annotation_type: ranking
    name: response_quality
    description: "Drag the responses to rank them from best (top) to worst (bottom)."
    allow_ties: false
    labels:
      - Response A
      - Response B
      - Response C
      - Response D
    label_requirement:
      required: true

YAML Example — Dynamic Items from Data

When the items to rank come from the instance data (different items per row):

annotation_schemes:
  - annotation_type: ranking
    name: translation_rank
    description: "Rank these machine translations from most to least fluent."
    items_key: translations
    allow_ties: true

With corresponding data:

{"id": "t001", "source": "The cat sat on the mat.", "translations": ["Le chat était assis sur le tapis.", "Le chat s'est assis sur le tapis.", "Chat sur tapis."]}
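Before loading such a file, it can help to confirm that every row actually carries a rankable list under the field named by items_key. The following is a minimal sketch, not part of the tool itself: the function name check_items is hypothetical, the input is assumed to be JSON Lines (one object per line), and applying the minimum of 2 items to dynamic lists is an assumption carried over from the labels option.

```python
import json


def check_items(jsonl_text, items_key="translations", min_items=2):
    """Return a list of human-readable problems; an empty list means all rows passed."""
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        row_id = row.get("id", f"line {line_no}")
        items = row.get(items_key)
        if not isinstance(items, list):
            problems.append(f"{row_id}: '{items_key}' is missing or not a list")
        elif len(items) < min_items:
            # Assumed threshold: a ranking needs at least two items.
            problems.append(f"{row_id}: only {len(items)} item(s) to rank")
    return problems


sample = "\n".join([
    '{"id": "t001", "translations": ["a", "b", "c"]}',
    '{"id": "t002", "translations": ["only one"]}',
    '{"id": "t003", "source": "no list here"}',
])
for problem in check_items(sample):
    print(problem)
# t002: only 1 item(s) to rank
# t003: 'translations' is missing or not a list
```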

Ties Configuration

annotation_schemes:
  - annotation_type: ranking
    name: summary_quality
    description: "Rank these summaries. Use the tie button to group equally good summaries."
    allow_ties: true
    labels:
      - Summary A
      - Summary B
      - Summary C

Output Format

The final rank order is stored as a comma-separated string of item identifiers, from rank 1 (best) to rank N (worst):

{
  "response_quality": {
    "rank_order": "B,D,A,C"
  }
}

When ties are enabled, tied items are grouped with a = separator:

{
  "response_quality": {
    "rank_order": "B,D=A,C"
  }
}

This means B is ranked 1st, D and A are tied at 2nd, and C is ranked 4th. The tie occupies ranks 2 and 3, so no item is ranked 3rd.
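For downstream analysis, the rank_order string can be turned into numeric ranks. The sketch below assumes the format described above (commas between ranks, = between tied items) and that item identifiers contain neither character. The function name parse_rank_order is hypothetical, not part of the tool.

```python
def parse_rank_order(rank_order):
    """Map each item identifier to its rank (1 = best).

    Tied items share a rank, and the next rank skips ahead by the size
    of the tie group (1, 2, 2, 4), matching the example above.
    """
    ranks = {}
    next_rank = 1
    for group in rank_order.split(","):
        items = [item.strip() for item in group.split("=")]
        for item in items:
            if not item:
                raise ValueError(f"empty item identifier in {rank_order!r}")
            if item in ranks:
                raise ValueError(f"duplicate item {item!r} in {rank_order!r}")
            ranks[item] = next_rank
        next_rank += len(items)
    return ranks


print(parse_rank_order("B,D,A,C"))  # {'B': 1, 'D': 2, 'A': 3, 'C': 4}
print(parse_rank_order("B,D=A,C"))  # {'B': 1, 'D': 2, 'A': 2, 'C': 4}
```

Raising on empty or duplicate identifiers is a design choice for this sketch: a malformed string is more likely a data problem than a valid ranking, so failing early seems safer than guessing.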

Use Cases

  • LLM response evaluation — rank multiple model outputs for quality, relevance, or safety
  • Translation ranking — order machine translation hypotheses by fluency or adequacy
  • Summarization evaluation — rank document summaries by informativeness or conciseness
  • Argument strength — order arguments from most to least persuasive
  • Search result relevance — annotate the relevance ranking of retrieved documents
  • RLHF preference data — collect full rankings as richer training signal than pairwise
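For preference-based use cases such as the last one, a full ranking can be expanded into the pairwise preferences it implies. This is a minimal sketch under the output format documented on this page. The function name ranking_to_pairs is hypothetical, and treating tied items as producing no preference pair is an assumption that a given training pipeline may handle differently.

```python
from itertools import combinations


def ranking_to_pairs(rank_order):
    """Return (preferred, rejected) pairs implied by a rank_order string.

    Items joined by '=' are tied, so no pair is emitted between them.
    """
    groups = [
        [item.strip() for item in group.split("=")]
        for group in rank_order.split(",")
    ]
    pairs = []
    # combinations() keeps list order, so `better` always outranks `worse`.
    for better, worse in combinations(groups, 2):
        for winner in better:
            for loser in worse:
                pairs.append((winner, loser))
    return pairs


print(len(ranking_to_pairs("B,D,A,C")))  # 6
print(ranking_to_pairs("B,D=A,C"))
# [('B', 'D'), ('B', 'A'), ('B', 'C'), ('D', 'C'), ('A', 'C')]
```

A strict ranking of n items yields n(n-1)/2 pairs, which is where the richer signal relative to a single pairwise judgment comes from.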

Troubleshooting

Annotators find full ranking of 6+ items tiring: Consider switching to Best-Worst Scaling. BWS shows only a small subset (tuple) per annotation and asks just for the best and worst item, so each judgment demands less effort, and it can yield comparably reliable item scores across a large pool.

Dynamic items from data have different lengths per instance: Use items_key with allow_ties: false and ensure your data preprocessing produces consistent list lengths if downstream analysis requires it.
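One way to check list lengths during preprocessing is to tally them and flag the rows that deviate from the most common length. The sketch below is illustrative only: the function name list_length_report is hypothetical, the data is assumed to be JSON Lines, and "translations" stands in for whatever field items_key names.

```python
import json
from collections import Counter


def list_length_report(jsonl_text, items_key="translations"):
    """Return (length counts, ids of rows whose length differs from the mode)."""
    lengths = {}
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        items = row.get(items_key) or []  # missing or null counts as length 0
        lengths[row.get("id", f"line {line_no}")] = len(items)
    counts = Counter(lengths.values())
    expected = counts.most_common(1)[0][0] if counts else None
    outliers = sorted(rid for rid, n in lengths.items() if n != expected)
    return counts, outliers


sample = "\n".join([
    '{"id": "t001", "translations": ["a", "b", "c"]}',
    '{"id": "t002", "translations": ["d", "e", "f"]}',
    '{"id": "t003", "translations": ["g", "h"]}',
])
counts, outliers = list_length_report(sample)
print(dict(counts))  # {3: 2, 2: 1}
print(outliers)      # ['t003']
```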