HuggingFace Datasets Integration

Load Potato annotations directly as HuggingFace DatasetDict or pandas DataFrame objects — no Hub round-trip required.

Installation

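pip install "potato-annotation[huggingface]"
# or install the datasets dependency directly
# (quoted so the shell does not treat ">=" as a redirect)
pip install "datasets>=2.14.0"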

Quick Start

from potato import load_as_dataset, load_annotations

# Load as HuggingFace DatasetDict
ds = load_as_dataset("path/to/config.yaml")
print(ds)
# DatasetDict({
#     annotations: Dataset({ features: [...], num_rows: 150 })
#     spans: Dataset({ features: [...], num_rows: 42 })
#     items: Dataset({ features: [...], num_rows: 50 })
# })

# Access individual splits
for row in ds["annotations"]:
    print(row["instance_id"], row["user_id"])

# Load as pandas DataFrame (lighter weight)
df = load_annotations("path/to/config.yaml")
print(df.head())

API Reference

`load_as_dataset(config_path, include_spans=True, include_items=True)`

Returns a datasets.DatasetDict with up to three splits:

Split	Description
`annotations`	One row per (instance, user) pair with label columns
`spans`	One row per span annotation (start, end, label, text)
`items`	One row per data item with all original fields
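
Since label columns live in the annotations split and the original fields live in the items split, it is often useful to combine the two. The following is a minimal sketch in plain Python, not part of the Potato API. It assumes the items split identifies each item in a column called `id` (a hypothetical name; check your own data) whose value matches `instance_id` in the annotations split. The helper name `attach_item_fields` is likewise made up for illustration.

```python
def attach_item_fields(annotation_rows, item_rows, item_key="id"):
    """Merge each annotation row with the fields of the item it refers to.

    item_key is an assumption: the column in the items split that holds
    the same identifier as "instance_id" in the annotations split.
    """
    # Index items by identifier so each lookup is a single dict access.
    items_by_id = {str(item[item_key]): item for item in item_rows}
    merged = []
    for ann in annotation_rows:
        item = items_by_id.get(str(ann["instance_id"]))
        if item is None:
            continue  # annotation points at an item missing from the items split
        # Item fields go first, so annotation columns win on a name clash.
        merged.append({**item, **ann})
    return merged


# Stand-in rows; with a real DatasetDict these would be
# list(ds["annotations"]) and list(ds["items"]).
annotation_rows = [
    {"instance_id": "1", "user_id": "alice", "sentiment": "positive"},
    {"instance_id": "1", "user_id": "bob", "sentiment": "negative"},
    {"instance_id": "2", "user_id": "alice", "sentiment": "neutral"},
]
item_rows = [
    {"id": "1", "text": "Great product."},
    {"id": "2", "text": "It arrived on Tuesday."},
]
merged_rows = attach_item_fields(annotation_rows, item_rows)
```

Each merged row then carries both the annotator's labels and the item's own fields, which is the shape most training code expects.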

Parameters:

config_path (str): Path to the Potato YAML config file
include_spans (bool): Include the spans split (default: True)
include_items (bool): Include the items split (default: True)

Raises:

ImportError if datasets is not installed
FileNotFoundError if config file does not exist
ValueError if no annotations are found
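
The three documented exceptions can be handled separately so that a script reports a useful message for each case. The wrapper below is a sketch only: `load_or_report` is a hypothetical helper, not something Potato provides, and it takes the loader as an argument so it works with either `load_as_dataset` or `load_annotations`.

```python
import sys


def load_or_report(loader, config_path):
    """Call a loader and report the failures listed under "Raises".

    Returns whatever the loader returns, or None if one of the
    documented errors occurred.
    """
    try:
        return loader(config_path)
    except ImportError:
        # Raised by load_as_dataset when the datasets package is absent.
        print("The datasets package is not installed.", file=sys.stderr)
    except FileNotFoundError:
        print(f"Config file not found: {config_path}", file=sys.stderr)
    except ValueError:
        print(f"No annotations found for {config_path}", file=sys.stderr)
    return None
```

Typical use would be `ds = load_or_report(load_as_dataset, "path/to/config.yaml")`, followed by a check for `None` before touching any split.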

`load_annotations(config_path)`

Returns a pandas.DataFrame with one row per (instance, user) annotation pair.

Columns: instance_id, user_id, plus one column per annotation schema. Complex values (dicts, lists) are JSON-serialized.

Parameters:

config_path (str): Path to the Potato YAML config file

Raises:

FileNotFoundError if config file does not exist
ValueError if no annotations are found
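
Because complex values arrive as JSON strings, they usually need decoding before analysis. The sketch below shows one cautious way to do that with the standard library. `decode_cell` is a hypothetical helper, and the rule of only parsing strings that start with `{` or `[` is a design choice made here so that plain labels such as `"3"` are not silently turned into numbers.

```python
import json


def decode_cell(value):
    """Parse JSON-encoded dicts and lists; return anything else unchanged."""
    if isinstance(value, str) and value.lstrip()[:1] in ("{", "["):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value  # looked like JSON but was not; leave it alone
    return value


# Stand-in cell values of the kinds a label column might contain.
cells = ["positive", '{"joy": 3, "anger": 1}', '["a", "b"]', "[unclear]", None]
decoded = [decode_cell(cell) for cell in cells]
```

On a DataFrame the same function can be applied per column, for example `df["emotions"] = df["emotions"].map(decode_cell)`, where `emotions` stands in for whichever schema column holds complex values.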

Example Workflow

from potato import load_as_dataset

# Load completed annotations
ds = load_as_dataset("examples/classification/single-choice/config.yaml")

# Count the items on which every annotator chose the same label
# (assumes the config defines an annotation schema named "sentiment")
annotations = ds["annotations"].to_pandas()
agreement = annotations.groupby("instance_id")["sentiment"].nunique()
print(f"Items with full agreement: {(agreement == 1).sum()}")

# Push to Hub for sharing
ds.push_to_hub("your-org/my-annotations", private=True)

# Or use with a training pipeline (assumes the split has a "text" column)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = ds["annotations"].map(
    lambda x: tokenizer(x["text"], truncation=True, padding=True),
    batched=True
)
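
The full-agreement count in the workflow only distinguishes unanimous items from the rest. A common next step is to resolve each item to its majority label and record how strongly annotators agreed. The following is a sketch under stated assumptions: `majority_labels` is a hypothetical helper, the rows are stand-in data, and `sentiment` is the assumed schema name from the example.

```python
from collections import Counter, defaultdict


def majority_labels(rows, label_column):
    """Return {instance_id: (label, share)} for the most common label per item.

    share is the fraction of that item's annotators who chose the label.
    On a tie, the label that was seen first wins.
    """
    votes = defaultdict(Counter)
    for row in rows:
        votes[row["instance_id"]][row[label_column]] += 1

    result = {}
    for instance_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        result[instance_id] = (label, count / sum(counter.values()))
    return result


# Stand-in rows; with a real DatasetDict this would be list(ds["annotations"]).
rows = [
    {"instance_id": "1", "user_id": "alice", "sentiment": "positive"},
    {"instance_id": "1", "user_id": "bob", "sentiment": "positive"},
    {"instance_id": "1", "user_id": "carol", "sentiment": "negative"},
    {"instance_id": "2", "user_id": "alice", "sentiment": "neutral"},
    {"instance_id": "2", "user_id": "bob", "sentiment": "neutral"},
]
summary = majority_labels(rows, "sentiment")
```

Items with a low share are natural candidates for adjudication or for exclusion from a training set.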

Relationship to HuggingFace Export

The load_as_dataset() function uses the same data extraction logic as the --format huggingface CLI export, but it returns the data in memory instead of pushing it to the Hub.

# CLI export (pushes to Hub)
python -m potato.export --config config.yaml --format huggingface --output your-org/dataset

# Python API (in-memory)
ds = load_as_dataset("config.yaml")

Export Formats — all available export formats
HuggingFace Export — push to Hub via CLI