HuggingFace Datasets Integration
Load Potato annotations directly as HuggingFace DatasetDict or pandas DataFrame objects — no Hub round-trip required.
Installation
pip install potato-annotation[huggingface]
# or
pip install datasets>=2.14.0
Quick Start
from potato import load_as_dataset, load_annotations
# Load as HuggingFace DatasetDict
ds = load_as_dataset("path/to/config.yaml")
print(ds)
# DatasetDict({
# annotations: Dataset({ features: [...], num_rows: 150 })
# spans: Dataset({ features: [...], num_rows: 42 })
# items: Dataset({ features: [...], num_rows: 50 })
# })
# Access individual splits
for row in ds["annotations"]:
print(row["instance_id"], row["user_id"])
# Load as pandas DataFrame (lighter weight)
df = load_annotations("path/to/config.yaml")
print(df.head())
API Reference
load_as_dataset(config_path, include_spans=True, include_items=True)
Returns a datasets.DatasetDict with up to three splits:
| Split | Description |
|---|---|
annotations |
One row per (instance, user) pair with label columns |
spans |
One row per span annotation (start, end, label, text) |
items |
One row per data item with all original fields |
Parameters:
config_path(str): Path to the Potato YAML config fileinclude_spans(bool): Include thespanssplit (default:True)include_items(bool): Include theitemssplit (default:True)
Raises:
ImportErrorifdatasetsis not installedFileNotFoundErrorif config file does not existValueErrorif no annotations are found
load_annotations(config_path)
Returns a pandas.DataFrame with one row per (instance, user) annotation pair.
Columns: instance_id, user_id, plus one column per annotation schema. Complex values (dicts, lists) are JSON-serialized.
Parameters:
config_path(str): Path to the Potato YAML config file
Raises:
FileNotFoundErrorif config file does not existValueErrorif no annotations are found
Example Workflow
from potato import load_as_dataset
# Load completed annotations
ds = load_as_dataset("examples/classification/single-choice/config.yaml")
# Compute inter-annotator agreement
from datasets import Features
annotations = ds["annotations"].to_pandas()
agreement = annotations.groupby("instance_id")["sentiment"].nunique()
print(f"Items with full agreement: {(agreement == 1).sum()}")
# Push to Hub for sharing
ds.push_to_hub("your-org/my-annotations", private=True)
# Or use with a training pipeline
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = ds["annotations"].map(
lambda x: tokenizer(x["text"], truncation=True, padding=True),
batched=True
)
Relationship to HuggingFace Export
The load_as_dataset() function uses the same data extraction logic as the --format huggingface CLI export, but returns data in-memory instead of pushing to the Hub.
# CLI export (pushes to Hub)
python -m potato.export --config config.yaml --format huggingface --output your-org/dataset
# Python API (in-memory)
ds = load_as_dataset("config.yaml")
Related Documentation
- Export Formats — all available export formats
- HuggingFace Export — push to Hub via CLI