
CI Evaluation (pytest plugin)

Run Potato evaluations inside your own pytest suite and gate CI on aggregate score thresholds — so a prompt/model change that regresses quality fails the build, the same way a unit test does. This is the "operationalize eval" layer on top of the evaluators and datasets/experiments.

Install

The plugin ships with Potato and auto-loads once installed:

pip install -e .          # registers the `potato_eval` pytest plugin

Without installing, load it explicitly: pytest -p potato.testing.pytest_plugin.

Write eval tests

Mark a test with @pytest.mark.potato_eval and request the potato_eval fixture:

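import pytest
from potato.testing import expect

# CASES and my_agent are your own code: CASES is a list of dicts with "q" and
# "expected" keys, and my_agent(question) returns the agent's output string.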

@pytest.mark.potato_eval
@pytest.mark.parametrize("case", CASES, ids=[c["q"] for c in CASES])
def test_agent(case, potato_eval):
    out = my_agent(case["q"])
    potato_eval.log_inputs({"question": case["q"]})
    potato_eval.log_outputs(out)
    potato_eval.log_reference_outputs(case["expected"])

    potato_eval.log_feedback("correct", 1.0 if out == case["expected"] else 0.0)
    potato_eval.log_feedback("similarity", 1.0 - expect.edit_distance(out, case["expected"]).value)

    expect(out).to_contain(case["expected"])   # per-case hard assertion

  • potato_eval fixture: log_inputs, log_outputs, log_reference_outputs, log_feedback(key, score).
  • expect(...) fluent assertions: .to_equal, .to_contain, .to_be_less_than, .to_be_greater_than, .to_be_between, .to_be_close_to, plus expect.edit_distance(a, b) and expect.embedding_distance(a, b).
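The example above computes similarity as `1.0 - expect.edit_distance(...).value`, which only makes sense if the distance is normalized to the range 0 to 1. How Potato computes it is not documented here. The following is a minimal sketch of one common choice (Levenshtein distance divided by the longer string's length), and `normalized_edit_distance` is a hypothetical name, not part of the plugin.

```python
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1]: 0.0 is identical, 1.0 is maximally different.

    Assumption: this normalization is illustrative and may differ from Potato's.
    """
    if not a and not b:
        return 0.0
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(a), len(b))
```

Under this definition, `1.0 - normalized_edit_distance(out, expected)` is a similarity score where higher is better, which is the direction a minimum threshold expects.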

log_feedback scores are aggregated across all eval tests (mean per key); a failed expect fails that individual test as usual.
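The per-key mean can be sketched as follows. This only illustrates the arithmetic described above. `aggregate_feedback` and the `(key, score)` record shape are hypothetical, not the plugin's internals.

```python
from collections import defaultdict
from statistics import mean


def aggregate_feedback(records):
    """Return {key: mean score} over (key, score) pairs collected from all eval tests."""
    by_key = defaultdict(list)
    for key, score in records:
        by_key[key].append(float(score))
    return {key: mean(scores) for key, scores in by_key.items()}


# Three cases logged "correct"; two of them also logged "similarity".
records = [
    ("correct", 1.0), ("correct", 0.0), ("correct", 1.0),
    ("similarity", 0.9), ("similarity", 0.7),
]
means = aggregate_feedback(records)
```

Here `means["correct"]` is 2/3, so one wrong answer out of three pulls the aggregate well below a 0.8 threshold even though two tests scored 1.0.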

Gate the build

pytest tests/eval/ \
  --potato-threshold correct=0.8 \
  --potato-threshold similarity=0.7 \
  --potato-experiment agent-regression

Option                        Effect
--potato-threshold KEY=MIN    Fail the run if mean(KEY) < MIN. Repeatable.
--potato-experiment DATASET   Record the run as an Experiment (so it shows in the comparison view).
--potato-no-sync              Skip experiment recording.

A summary prints at the end:

============================== Potato evaluation ===============================
cases: 3
  correct: 1.000  (threshold 0.8)
  similarity: 1.000  (threshold 0.7)
recorded experiment: ci-0001

If a threshold is violated the run exits non-zero (failing the CI job) and prints THRESHOLD FAILED: <key> = <actual> < <min>.
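The gating rule amounts to parsing each `KEY=MIN` option and comparing it with the aggregated mean. The sketch below is an assumption-laden illustration: `parse_threshold` and `check_thresholds` are hypothetical helpers, the three-decimal formatting is a guess based on the summary output, and the source does not say how a key with no logged scores is treated (this sketch skips it).

```python
def parse_threshold(spec: str):
    """Split a 'KEY=MIN' string into (key, minimum)."""
    key, sep, minimum = spec.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=MIN, got {spec!r}")
    return key, float(minimum)


def check_thresholds(means, specs):
    """Return one failure message per key whose mean is below its minimum."""
    failures = []
    for spec in specs:
        key, minimum = parse_threshold(spec)
        actual = means.get(key)
        # Assumption: keys with no logged scores are skipped rather than failed.
        if actual is not None and actual < minimum:
            failures.append(f"THRESHOLD FAILED: {key} = {actual:.3f} < {minimum}")
    return failures


failures = check_thresholds(
    {"correct": 0.5, "similarity": 0.9},
    ["correct=0.8", "similarity=0.7"],
)
exit_code = 1 if failures else 0  # non-zero fails the CI job
```

The comparison is strict, so a mean exactly equal to the minimum passes, matching the `mean(KEY) < MIN` rule.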

Environment

Variable                  Meaning
POTATO_EVAL_STORE         Directory for recorded experiments (default ./eval_store).
POTATO_EVAL_STORAGE       Storage backend: file (default) or sqlite.
POTATO_EVAL_SUITE         Experiment name label.
POTATO_EVAL_EXPERIMENT    Default dataset for --potato-experiment.

With the default file storage, recorded experiments are plain files. Upload them as a CI artifact to track scores across runs, or set POTATO_EVAL_STORE to the same store your Potato server reads.
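A helper script (for example, one that locates records before uploading them) could resolve the same settings as shown below. This is a sketch that mirrors the variables and defaults documented above. `resolve_eval_settings` is a hypothetical function, and the plugin's own lookup logic may differ.

```python
import os


def resolve_eval_settings(env=None):
    """Read the documented POTATO_EVAL_* variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "store": env.get("POTATO_EVAL_STORE", "./eval_store"),
        "storage": env.get("POTATO_EVAL_STORAGE", "file"),
        # No defaults are documented for these two, so they stay None when unset.
        "suite": env.get("POTATO_EVAL_SUITE"),
        "experiment": env.get("POTATO_EVAL_EXPERIMENT"),
    }
```

Passing a plain dict instead of relying on `os.environ` keeps the helper easy to test without touching the real environment.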

GitHub Actions

A ready-to-copy workflow is in examples/agent-traces/ci-eval/ci_workflow_example.yml — it runs the suite on every PR, gates on thresholds, and uploads the experiment records as an artifact.

Example

pytest examples/agent-traces/ci-eval/test_agent_eval.py \
  -p potato.testing.pytest_plugin \
  --potato-threshold correct=0.8 --potato-experiment demo-suite