Best-Worst Scaling (BWS)
Best-Worst Scaling (also called MaxDiff) is an annotation method where annotators select the best and worst items from a set of K items (typically 4). This produces more reliable rankings than direct rating scales and is widely used in NLP for tasks like sentiment intensity measurement, word similarity, and semantic evaluation.
Overview
Potato's BWS implementation:
- Loads individual items from your data file (one item per row)
- Generates tuples by randomly sampling K items per tuple
- Presents tuples to annotators who select the best and worst items
- Computes scores using counting, Bradley-Terry, or Plackett-Luce methods
Quick Start
python potato/flask_server.py start examples/classification/best-worst-scaling/config.yaml -p 8000
Configuration
Data File
Each row is a single item with an ID and text:
{"id": "s001", "text": "I absolutely love this product!"}
{"id": "s002", "text": "It's okay, nothing special."}
{"id": "s003", "text": "Terrible experience, would not recommend."}
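Before starting the server it can help to check the pool file. The following is a minimal sketch, not part of Potato: `load_items` is a hypothetical helper, and the uniqueness check on IDs is an assumption rather than a documented requirement.

```python
import json

def load_items(lines, id_key="id", text_key="text"):
    """Parse JSONL rows and check that each has the configured ID and text keys."""
    items = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        for key in (id_key, text_key):
            if key not in row:
                raise ValueError(f"line {lineno}: missing key {key!r}")
        if row[id_key] in seen_ids:
            # Assumption: item IDs should be unique within the pool.
            raise ValueError(f"line {lineno}: duplicate id {row[id_key]!r}")
        seen_ids.add(row[id_key])
        items.append(row)
    return items

sample_lines = [
    '{"id": "s001", "text": "I absolutely love this product!"}',
    '{"id": "s003", "text": "Terrible experience, would not recommend."}',
]
items = load_items(sample_lines)
```

For a real file, pass an open file handle instead of the list, for example `load_items(open("data/sentiment_pool.jsonl", encoding="utf-8"))`.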
Config YAML
annotation_task_name: "Sentiment Intensity BWS"
task_dir: .
data_files:
  - data/sentiment_pool.jsonl
output_annotation_dir: annotation_output/bws/
item_properties:
  id_key: id
  text_key: text
# BWS tuple generation
bws_config:
  tuple_size: 4             # Items per tuple (2-26)
  num_tuples: null          # Auto-calculate if null
  seed: 42                  # Random seed
  min_item_appearances: 8   # Minimum appearances per item across tuples
  scoring:
    method: counting        # counting | bradley_terry | plackett_luce
# Annotation schema
annotation_schemes:
  - annotation_type: bws
    name: sentiment_bws
    description: "Sentiment Intensity"
    best_description: "Which sentence expresses the MOST positive sentiment?"
    worst_description: "Which sentence expresses the LEAST positive sentiment?"
    tuple_size: 4
    sequential_key_binding: true
user_config:
  allow_all_users: true
BWS Config Options
| Option | Default | Description |
|---|---|---|
| tuple_size | 4 | Number of items per tuple. Must be >= 2. |
| num_tuples | auto | Number of tuples to generate. If null, auto-calculated. |
| seed | 42 | Random seed for reproducible tuple generation. |
| min_item_appearances | 2 * tuple_size | Minimum times each item appears across tuples. Used for auto-calculation. |
| scoring.method | counting | Default scoring method: counting, bradley_terry, or plackett_luce. |
Schema Options
| Option | Default | Description |
|---|---|---|
| best_description | "Which is BEST?" | Question text for best selection. |
| worst_description | "Which is WORST?" | Question text for worst selection. |
| tuple_size | 4 | Must match bws_config.tuple_size. |
| sequential_key_binding | true | Enable keyboard shortcuts (1-9 for best, a-z for worst). |
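Because the schema's tuple_size must match bws_config.tuple_size, a quick consistency check on the parsed config can catch mismatches early. This is a sketch under assumptions: `check_tuple_sizes` is a hypothetical helper that works on a plain dict (for example the result of loading the YAML), and it falls back to the documented default of 4 when a value is absent.

```python
def check_tuple_sizes(config):
    """Return a list of messages for BWS schemas whose tuple_size differs
    from bws_config.tuple_size. An empty list means no mismatch was found."""
    expected = config.get("bws_config", {}).get("tuple_size", 4)
    problems = []
    for scheme in config.get("annotation_schemes", []):
        if scheme.get("annotation_type") != "bws":
            continue  # only BWS schemas carry a tuple_size
        actual = scheme.get("tuple_size", 4)
        if actual != expected:
            problems.append(
                f"{scheme.get('name', '<unnamed>')}: tuple_size {actual} "
                f"does not match bws_config.tuple_size {expected}"
            )
    return problems

good_config = {
    "bws_config": {"tuple_size": 4},
    "annotation_schemes": [
        {"annotation_type": "bws", "name": "sentiment_bws", "tuple_size": 4}
    ],
}
bad_config = {
    "bws_config": {"tuple_size": 5},
    "annotation_schemes": [
        {"annotation_type": "bws", "name": "sentiment_bws", "tuple_size": 4}
    ],
}
```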
Tuple Generation
How Tuples Are Generated
Items are sampled without replacement within each tuple (no duplicates in a single tuple) but with replacement across tuples (items appear in multiple tuples). This follows standard BWS methodology.
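The sampling scheme described above can be sketched as follows. This is an illustration of the idea, not Potato's actual code: `generate_tuples` is a hypothetical function, and plain random sampling like this reaches a target appearance count only on average, so any balancing Potato may apply is not reflected here.

```python
import random

def generate_tuples(item_ids, tuple_size, num_tuples, seed=42):
    """Draw num_tuples tuples of distinct items; items may recur across tuples."""
    if not 2 <= tuple_size <= len(item_ids):
        raise ValueError("tuple_size must be between 2 and the pool size")
    rng = random.Random(seed)  # dedicated generator keeps runs reproducible
    # random.sample draws without replacement, so a tuple never repeats an item.
    # Each call starts from the full pool, so items can appear in many tuples.
    return [rng.sample(item_ids, tuple_size) for _ in range(num_tuples)]

pool = [f"s{n:03d}" for n in range(1, 21)]  # 20 hypothetical item IDs
tuples = generate_tuples(pool, tuple_size=4, num_tuples=40, seed=42)
```

Calling the function again with the same seed returns the same tuples, which is the behaviour the seed option is meant to provide.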
Auto-Calculation Formula
When num_tuples is null, the number of tuples is calculated as:
num_tuples = ceil(pool_size * min_item_appearances / tuple_size)
With defaults (min_item_appearances = 2 * tuple_size):
| Pool Size | Tuple Size | Tuples Generated |
|---|---|---|
| 20 | 4 | 40 |
| 50 | 4 | 100 |
| 100 | 4 | 200 |
| 100 | 5 | 200 |
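The rows of the table follow directly from the formula. A small sketch that reproduces them (`auto_num_tuples` is a hypothetical name used only for illustration):

```python
def auto_num_tuples(pool_size, tuple_size=4, min_item_appearances=None):
    """Apply num_tuples = ceil(pool_size * min_item_appearances / tuple_size)."""
    if min_item_appearances is None:
        min_item_appearances = 2 * tuple_size  # documented default
    total_slots = pool_size * min_item_appearances
    # Integer ceiling division avoids floating-point rounding.
    return (total_slots + tuple_size - 1) // tuple_size

for pool_size, tuple_size in [(20, 4), (50, 4), (100, 4), (100, 5)]:
    print(pool_size, tuple_size, auto_num_tuples(pool_size, tuple_size))
```

With the default min_item_appearances the tuple size cancels out, so the result is always twice the pool size. The ceiling only matters for non-default values, for example a pool of 10 with tuple_size 4 and min_item_appearances 3 gives ceil(30 / 4) = 8.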
Controlling Annotator Overlap
Shared tuples — all annotators see the same tuples:
assignment_strategy:
  name: fixed_order
max_annotations_per_item: -1
Unique tuples — different annotators see different subsets:
bws_config:
  num_tuples: 200 # Generate more tuples
assignment_strategy:
  name: random
max_annotations_per_item: 1
Scoring Methods
1. Counting (Default)
score(item) = (best_count - worst_count) / appearance_count
- Range: [-1, 1]
- No dependencies required
- Simple, transparent, and deterministic
- Standard baseline in BWS literature
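A minimal sketch of the counting formula, assuming each annotation is available as a dict with the item IDs shown and the chosen best and worst IDs. That record shape and the name `counting_scores` are illustrative assumptions, not Potato's internal format.

```python
from collections import Counter

def counting_scores(annotations):
    """score(item) = (best_count - worst_count) / appearance_count."""
    best, worst, appearances = Counter(), Counter(), Counter()
    for ann in annotations:
        appearances.update(ann["items"])  # every shown item counts as an appearance
        best[ann["best"]] += 1
        worst[ann["worst"]] += 1
    return {
        item: (best[item] - worst[item]) / count
        for item, count in appearances.items()
    }

annotations = [
    {"items": ["s001", "s002", "s003", "s004"], "best": "s001", "worst": "s003"},
    {"items": ["s001", "s002", "s003", "s005"], "best": "s001", "worst": "s003"},
    {"items": ["s002", "s003", "s004", "s005"], "best": "s002", "worst": "s004"},
]
scores = counting_scores(annotations)
```

Here s001 is chosen as best in both of its appearances and scores 1.0, while s005 is never chosen and scores 0.0.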
2. Bradley-Terry
Converts each BWS annotation to pairwise comparisons:
- Best item beats every other item (K-1 comparisons per annotation)
- Every item beats the worst item (K-1 comparisons per annotation)
Fits a Bradley-Terry model via choix.ilsr_pairwise(). Produces log-scale strength parameters.
Requires: pip install choix
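The conversion to pairwise comparisons can be sketched as below. The helper names and the annotation record shape are hypothetical. Applying both rules literally lists the best-over-worst pair twice; whether Potato keeps or removes that duplicate is not stated here. The small `alpha` regularization value is likewise an illustrative choice.

```python
def bws_to_pairs(items, best, worst):
    """Return (winner, loser) pairs implied by one BWS annotation."""
    # Rule 1: the best item beats every other item (K-1 pairs).
    pairs = [(best, other) for other in items if other != best]
    # Rule 2: every other item beats the worst item (K-1 pairs).
    # (best, worst) is produced by both rules and is kept twice here.
    pairs += [(other, worst) for other in items if other != worst]
    return pairs

def fit_bradley_terry(annotations, alpha=0.01):
    """Fit log-scale strengths with choix.ilsr_pairwise (requires choix)."""
    import choix  # imported lazily so the conversion works without it

    ids = sorted({item for ann in annotations for item in ann["items"]})
    index = {item: n for n, item in enumerate(ids)}  # choix expects integer IDs
    data = [
        (index[winner], index[loser])
        for ann in annotations
        for winner, loser in bws_to_pairs(ann["items"], ann["best"], ann["worst"])
    ]
    params = choix.ilsr_pairwise(len(ids), data, alpha=alpha)
    return dict(zip(ids, params))

pairs = bws_to_pairs(["a", "b", "c", "d"], best="b", worst="d")
```

A non-zero `alpha` helps the fit converge when some items never win or never lose, which is common with small annotation sets.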
3. Plackett-Luce
Converts each BWS annotation to a partial ranking:
- [best] > [middle items] > [worst]
Fits a Plackett-Luce model to these partial rankings. This is the most statistically sophisticated of the three options.
Requires: pip install choix
Scoring via CLI
# Counting method (no dependencies)
python -m potato.bws_scoring --config config.yaml --method counting
# Bradley-Terry (requires choix)
python -m potato.bws_scoring --config config.yaml --method bradley_terry
# Custom output path
python -m potato.bws_scoring --config config.yaml --method counting --output scores.tsv
Output Format
The output file (bws_scores.tsv) contains:
item_id text score best_count worst_count appearances rank
s006 Amazing quality and great... 0.875000 7 0 8 1
s001 I absolutely love this... 0.625000 6 1 8 2
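To post-process the scores, the file can be read with the standard csv module. This sketch assumes the file is tab-separated with the header row shown above; `read_scores` is a hypothetical helper, and the sample text mirrors the two rows in the example.

```python
import csv
import io

def read_scores(handle):
    """Read a bws_scores.tsv-style stream into a list of typed dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "item_id": row["item_id"],
            "text": row["text"],
            "score": float(row["score"]),
            "best_count": int(row["best_count"]),
            "worst_count": int(row["worst_count"]),
            "appearances": int(row["appearances"]),
            "rank": int(row["rank"]),
        })
    return rows

sample_tsv = (
    "item_id\ttext\tscore\tbest_count\tworst_count\tappearances\trank\n"
    "s006\tAmazing quality and great...\t0.875000\t7\t0\t8\t1\n"
    "s001\tI absolutely love this...\t0.625000\t6\t1\t8\t2\n"
)
rows = read_scores(io.StringIO(sample_tsv))
```

For the counting method, each score should equal (best_count - worst_count) / appearances, which holds for both sample rows: (7 - 0) / 8 = 0.875 and (6 - 1) / 8 = 0.625.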
Scoring via Admin Dashboard
When BWS is configured, the admin dashboard shows a BWS Scoring tab with:
- Summary statistics (total items, annotations, method)
- A "Generate Scores" button
- Method selector (counting / Bradley-Terry / Plackett-Luce)
- Results table with item scores, ranks, and counts
Clicking "Generate Scores" computes scores and writes bws_scores.tsv to the output directory.
Keyboard Shortcuts
When sequential_key_binding: true:
| Key | Action |
|---|---|
| 1 | Select item A as best |
| 2 | Select item B as best |
| 3 | Select item C as best |
| 4 | Select item D as best |
| a | Select item A as worst |
| b | Select item B as worst |
| c | Select item C as worst |
| d | Select item D as worst |
Annotation Storage
BWS annotations are stored using Potato's standard label model:
{
"sentiment_bws": {
"best": "B",
"worst": "D"
}
}
Where "B" and "D" are position labels corresponding to items in the tuple.
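Mapping the stored position labels back to item IDs requires the tuple that was shown. A minimal sketch, assuming labels "A", "B", "C", ... correspond to positions 0, 1, 2, ... in display order (`resolve_positions` and the example IDs are hypothetical):

```python
import string

def resolve_positions(tuple_item_ids, label):
    """Translate {"best": "B", "worst": "D"} into the item IDs at those positions."""
    def item_at(letter):
        position = string.ascii_uppercase.index(letter)  # "A" -> 0, "B" -> 1, ...
        if position >= len(tuple_item_ids):
            raise ValueError(f"label {letter!r} is outside this tuple")
        return tuple_item_ids[position]

    return item_at(label["best"]), item_at(label["worst"])

shown = ["s010", "s002", "s007", "s005"]  # items in the order displayed
best_id, worst_id = resolve_positions(shown, {"best": "B", "worst": "D"})
```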