Skip to content

Using HuggingFace Models in Potato

Potato calls a language model in three places, and all three use the same endpoint configuration shape. This guide shows how to point any of them at a model hosted on HuggingFace — whether through the serverless Inference API, a dedicated Inference Endpoint, a self-hosted TGI/vLLM server, or a fully local model.

Feature What the LLM does Config block
AI hints Suggests labels inline in the annotation UI ai_support
Solo mode Auto-labels items while you calibrate (human-in-the-loop) solo_mode.labeling_models[]
Judge calibration Acts as an LLM judge you align against blind human labels judge_calibration.models[]

See also: AI support · Solo mode · Judge calibration.


Two ways to reach a HuggingFace model

Uses huggingface_hub.InferenceClient, so it works with the HuggingFace serverless Inference API and with dedicated Inference Endpoints. An API token is required.

endpoint_type: huggingface
model: meta-llama/Llama-3.2-3B-Instruct   # any chat/instruct model on the Hub
api_key: ${HF_TOKEN}                       # required — your HF access token
temperature: 0.7
max_tokens: 150
timeout: 30

Path B — OpenAI-compatible base_url

HuggingFace (and TGI/vLLM/Ollama) expose OpenAI-compatible /v1 endpoints, so you can also use endpoint_type: openai with a custom base_url. Handy when you already run an inference server or want the HF router.

# HuggingFace router (OpenAI-compatible)
endpoint_type: openai
model: meta-llama/Llama-3.2-3B-Instruct
api_key: ${HF_TOKEN}
base_url: https://router.huggingface.co/v1

# Self-hosted TGI / vLLM
# endpoint_type: openai
# model: meta-llama/Llama-3.2-3B-Instruct
# api_key: EMPTY
# base_url: http://your-server:8080/v1

# Fully local (vLLM / Ollama OpenAI shim)
# endpoint_type: vllm
# model: Qwen/Qwen3-4B
# base_url: http://localhost:8000/v1

Both paths accept the same generation keys: model, api_key, base_url (Path B), temperature, max_tokens, timeout.

Get a token: create a read-scoped token at huggingface.co/settings/tokens and export it: export HF_TOKEN=hf_xxx. Potato expands ${HF_TOKEN} from the environment.


Wiring it into each feature

1. AI hints (ai_support)

ai_support:
  enabled: true
  endpoint_type: huggingface
  ai_config:
    model: meta-llama/Llama-3.2-3B-Instruct
    api_key: ${HF_TOKEN}
    temperature: 0.7
    max_tokens: 150
    include:
      all: true          # enable hints for every scheme
  cache_config:
    disk_cache:
      enabled: true
      path: annotation_output/ai_cache.json

You can keep the endpoint details in a separate gitignored file instead of inline:

ai_support:
  enabled: true
  ai_config_file: ai-config.yaml   # holds endpoint_type/model/api_key/base_url
  ai_config:
    temperature: 0.7
    max_tokens: 150
    include: {all: true}

2. Solo mode (solo_mode.labeling_models[])

Each entry is an endpoint; list more than one for fallback/ensembling.

solo_mode:
  enabled: true
  labeling_models:
    - endpoint_type: huggingface
      model: meta-llama/Llama-3.1-8B-Instruct
      api_key: ${HF_TOKEN}
      temperature: 0.1
      max_tokens: 1000
  uncertainty:
    strategy: direct_confidence

3. Judge calibration (judge_calibration.models[])

judge_calibration:
  enabled: true
  prompt: |
    You are an impartial expert annotator. Classify the sentiment as exactly one of:
    positive, negative, neutral.
  models:
    - endpoint_type: huggingface
      model: meta-llama/Llama-3.1-8B-Instruct
      api_key: ${HF_TOKEN}
      temperature: 0.7      # must be > 0 so repeated samples vary
      max_tokens: 1000
  k_samples: 5
  schemas: [sentiment]

Local vs. hosted — which path?

You want… Use
Zero infra, just a token Path A, serverless Inference API
Guaranteed throughput / a pinned model Path A against a dedicated Inference Endpoint URL
You already run TGI or vLLM Path B with that server's /v1 base_url
Fully offline / no data leaves the machine endpoint_type: ollama or vllm with a localhost base_url
A model not on HF OpenAI/Anthropic/Gemini endpoints (see AI support)

Running on HuggingFace Spaces

When Potato runs as a Space, set HF_TOKEN as a Space secret (Settings → Variables and secrets). The same ${HF_TOKEN} references above then resolve inside the container, so the AI-assisted demos work without any code change. See the Spaces deployment guide.


Troubleshooting

  • "Hugging Face API key is required" — the native endpoint needs api_key; set HF_TOKEN (Path A) or api_key: EMPTY against a local server (Path B).
  • 403 / model not served — not every model is available on the serverless API. Pick a served chat model, or stand up a dedicated Inference Endpoint / local server.
  • Slow first hint — serverless models cold-start. Enable cache_config.prefetch to warm upcoming items, or use a dedicated endpoint.
  • Malformed JSON from the model — prefer instruct/chat models; small base models often ignore the structured-output schema.