Model Arena

Send one prompt to N models, compare their responses side by side, and pick the best; each pick feeds a leaderboard. The arena is provider-agnostic: every model is built through Potato's AIEndpointFactory, so you can compare OpenAI, Anthropic, Ollama, vLLM, Gemini, OpenRouter, … in the same view rather than being limited to one vendor. (This generalizes the older Anthropic-vision-only live agent.)

Enabling

arena:
  enabled: true
  models:
    - {label: "GPT-4o",  endpoint_type: openai,    model: gpt-4o}
    - {label: "Claude",  endpoint_type: anthropic, model: claude-sonnet-4-6}
    - {label: "Llama",   endpoint_type: ollama,    model: llama3.2, base_url: http://localhost:11434}

Each entry maps to an AIEndpointFactory config (endpoint_type, model, base_url, temperature, and an optional ai_config for keys/params). When enabled, the admin dashboard shows an Arena link.
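
Before starting the server it can help to sanity-check the `models` list. The sketch below is not part of Potato: `check_arena_models` and `REQUIRED_KEYS` are hypothetical names, and it assumes that every entry needs `label`, `endpoint_type`, and `model` (as in the example above) and that labels should be unique because the leaderboard is reported per model.

```python
from collections import Counter

# Assumed minimum set of keys, inferred from the example entries above.
REQUIRED_KEYS = ("label", "endpoint_type", "model")

def check_arena_models(models):
    """Return a list of human-readable problems; an empty list means the entries look usable."""
    problems = []
    for i, entry in enumerate(models):
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            problems.append(f"entry {i}: missing {', '.join(missing)}")
    # Duplicate labels would make per-model leaderboard rows ambiguous (assumption).
    counts = Counter(e.get("label") for e in models if e.get("label"))
    for label, n in counts.items():
        if n > 1:
            problems.append(f"label {label!r} used {n} times")
    return problems

models = [
    {"label": "GPT-4o", "endpoint_type": "openai", "model": "gpt-4o"},
    {"label": "Claude", "endpoint_type": "anthropic", "model": "claude-sonnet-4-6"},
    {"label": "Llama", "endpoint_type": "ollama", "model": "llama3.2",
     "base_url": "http://localhost:11434"},
]
print(check_arena_models(models))
```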

How it works

Enter a prompt; it is sent to every model concurrently. A failing model (bad key, provider down) never blocks the others; its card shows the error instead.
Responses render side by side, each with per-model latency.
Click "Pick as best" to record a preference and update the leaderboard.
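
The fan-out with per-model error isolation can be sketched as follows. This is an illustration rather than Potato's code: `run_one` and `run_arena` are hypothetical names, a thread pool stands in for whatever concurrency the arena actually uses, and plain callables stand in for real endpoints. The card fields mirror the ones listed in the API section (`label`, `response`, `latency_ms`, `error`).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_one(label, call, prompt):
    """Call one model and always return a card, even when the call fails."""
    start = time.perf_counter()
    try:
        response, error = call(prompt), None
    except Exception as exc:  # keep the failure local to this model's card
        response, error = None, str(exc)
    latency_ms = round((time.perf_counter() - start) * 1000)
    return {"label": label, "response": response,
            "latency_ms": latency_ms, "error": error}

def run_arena(models, prompt):
    """models: dict of label -> callable(prompt) -> str (hypothetical stand-ins)."""
    if not models:
        return []
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(run_one, label, call, prompt)
                   for label, call in models.items()]
        # Collect in submission order so cards keep the configured order.
        return [future.result() for future in futures]

def broken(prompt):
    raise RuntimeError("invalid API key")

cards = run_arena({"Echo": lambda p: p.upper(), "Broken": broken}, "hello")
for card in cards:
    print(card["label"], card["response"], card["error"])
```

Because exceptions are caught inside `run_one`, a failing model produces a card with `error` set while the other cards are returned normally.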

Leaderboard: Bradley-Terry + Elo (not just win-rate)

A raw win-rate treats beating a weak model the same as beating a strong one. The arena instead ranks models by a Bradley-Terry score (a maximum-likelihood strength that accounts for who you beat) and also reports an Elo rating updated after every comparison. Win-rate is still shown for reference.
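
To make the two ratings concrete, here is a minimal sketch of each, working on a list of `(winner, loser)` pairs. It is not the arena's implementation: the function names are hypothetical, the Elo constants (start 1500, K = 32) are conventional defaults assumed for illustration, and Bradley-Terry is fitted with the standard minorization-maximization iteration.

```python
from collections import defaultdict

def elo_update(ratings, winner, loser, k=32.0, start=1500.0):
    """Apply one Elo update in place (k and start are assumed defaults)."""
    ra, rb = ratings.get(winner, start), ratings.get(loser, start)
    expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # P(winner beats loser)
    delta = k * (1.0 - expected)
    ratings[winner], ratings[loser] = ra + delta, rb - delta
    return ratings

def bradley_terry(pairs, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs; strengths sum to 1."""
    wins = defaultdict(float)
    games = defaultdict(float)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in pairs:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    models = sorted(models)
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        new = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in models if j != i)
            new[i] = wins[i] / denom if denom else strength[i]
        total = sum(new.values())
        strength = {m: value / total for m, value in new.items()}
    return strength

pairs = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "B")]
scores = bradley_terry(pairs)
print(sorted(scores, key=scores.get, reverse=True))
print(elo_update({}, "A", "B"))
```

Note that a model with no wins at all is driven to a strength of zero by this plain maximum-likelihood fit; a production leaderboard would typically add a small prior or pseudo-count to avoid that.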

A bare winner counts as that model beating every other model in the run.
A full ranking (["A", "B", "C"]) expands into all pairwise outcomes.
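
The two expansion rules above can be sketched in a few lines. `expand_preference` is a hypothetical helper, not a Potato function; it returns `(winner, loser)` pairs and assumes a ranking is ordered best first.

```python
from itertools import combinations

def expand_preference(models_in_run, winner=None, ranking=None):
    """Turn one recorded pick into (winner, loser) pairs."""
    if ranking:
        # Every model beats all models ranked after it.
        return list(combinations(ranking, 2))
    # A bare winner beats every other model in the run.
    return [(winner, other) for other in models_in_run if other != winner]

print(expand_preference(["A", "B", "C"], winner="B"))
print(expand_preference(["A", "B", "C"], ranking=["A", "B", "C"]))
```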

Neither metric needs extra config; both appear once you record preferences.

Export DPO preference data

Every "Pick as best" is a human preference, so the arena doubles as a DPO data-collection surface. Export DPO (button on the leaderboard, or GET /admin/arena/api/export_dpo) returns one {prompt, chosen, rejected} triple per winner-vs-loser pair where both response texts are available — ready for preference fine-tuning (DPO/KTO).
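
As a rough illustration of the export rule, the sketch below builds `{prompt, chosen, rejected}` rows from winner-vs-loser pairs and skips any pair where a response text is missing. The helper name `to_dpo_rows` and the in-memory data shapes are assumptions for this example; the actual endpoint may store and serialize preferences differently.

```python
import json

def to_dpo_rows(prompt, responses, pairs):
    """responses: label -> text (or None on error); pairs: (winner, loser) labels."""
    rows = []
    for winner, loser in pairs:
        chosen, rejected = responses.get(winner), responses.get(loser)
        if chosen and rejected:  # only emit pairs where both texts are available
            rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return rows

responses = {"A": "Answer from A", "B": "Answer from B", "C": None}  # C errored
rows = to_dpo_rows("Explain RLHF in one sentence.", responses, [("A", "B"), ("A", "C")])
for row in rows:
    print(json.dumps(row))  # one JSON object per line (JSONL)
```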

API

| Method | Path | Purpose |
| --- | --- | --- |
| POST | `/admin/arena/api/run` | `{prompt}` → per-model responses (`label, response, latency_ms, error`) |
| POST | `/admin/arena/api/preference` | `{prompt, winner, ranking?}` → record a pick |
| GET | `/admin/arena/api/leaderboard` | Bradley-Terry score + Elo + wins/comparisons/win-rate per model |
| GET | `/admin/arena/api/export_dpo` | human preferences as DPO `{prompt, chosen, rejected}` pairs |

curl -X POST localhost:8000/admin/arena/api/run -H "X-API-Key: <key>" \
  -H "Content-Type: application/json" -d '{"prompt": "Explain RLHF in one sentence."}'
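
The same call can be made from Python with only the standard library. This sketch mirrors the curl command above; `build_run_request` is a hypothetical helper, and the host, port, and `<key>` placeholder are taken from that command rather than from any documented default.

```python
import json
import urllib.request

def build_run_request(prompt, api_key, base_url="http://localhost:8000"):
    """Build the POST request for /admin/arena/api/run without sending it."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/admin/arena/api/run",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_run_request("Explain RLHF in one sentence.", "<key>")
print(req.get_method(), req.full_url)
```

With a server running, `urllib.request.urlopen(req)` sends the request and `json.load` on the returned response object decodes the body.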

Example

examples/agent-traces/model-arena/ is a runnable demo (configure real models in its config.yaml).

Related:
Live Agent Interaction: observe a single live browser agent
Datasets & Experiments: offline, dataset-scale comparison
Pairwise Comparison: annotate A/B preferences in the main flow