CI Evaluation (pytest plugin)¶
Run Potato evaluations inside your own pytest suite and gate CI on aggregate score thresholds — so a prompt/model change that regresses quality fails the build, the same way a unit test does. This is the "operationalize eval" layer on top of the evaluators and datasets/experiments.
Install¶
The plugin ships with Potato and auto-loads once installed:
pip install -e . # registers the `potato_eval` pytest plugin
Without installing, load it explicitly: pytest -p potato.testing.pytest_plugin.
Write eval tests¶
Mark a test @pytest.mark.potato_eval and request the potato_eval fixture:
import pytest
from potato.testing import expect
@pytest.mark.potato_eval
@pytest.mark.parametrize("case", CASES, ids=[c["q"] for c in CASES])
def test_agent(case, potato_eval):
out = my_agent(case["q"])
potato_eval.log_inputs({"question": case["q"]})
potato_eval.log_outputs(out)
potato_eval.log_reference_outputs(case["expected"])
potato_eval.log_feedback("correct", 1.0 if out == case["expected"] else 0.0)
potato_eval.log_feedback("similarity", 1.0 - expect.edit_distance(out, case["expected"]).value)
expect(out).to_contain(case["expected"]) # per-case hard assertion
potato_evalfixture:log_inputs,log_outputs,log_reference_outputs,log_feedback(key, score).expect(...)fluent assertions:.to_equal,.to_contain,.to_be_less_than,.to_be_greater_than,.to_be_between,.to_be_close_to, plusexpect.edit_distance(a, b)andexpect.embedding_distance(a, b).
log_feedback scores are aggregated across all eval tests (mean per key); a
failed expect fails that individual test as usual.
Gate the build¶
pytest tests/eval/ \
--potato-threshold correct=0.8 \
--potato-threshold similarity=0.7 \
--potato-experiment agent-regression
| Option | Effect |
|---|---|
--potato-threshold KEY=MIN |
Fail the run if mean(KEY) < MIN. Repeatable. |
--potato-experiment DATASET |
Record the run as an Experiment (so it shows in the comparison view). |
--potato-no-sync |
Skip experiment recording. |
A summary prints at the end:
============================== Potato evaluation ===============================
cases: 3
correct: 1.000 (threshold 0.8)
similarity: 1.000 (threshold 0.7)
recorded experiment: ci-0001
If a threshold is violated the run exits non-zero (failing the CI job) and prints
THRESHOLD FAILED: <key> = <actual> < <min>.
Environment¶
| Var | Meaning |
|---|---|
POTATO_EVAL_STORE |
Directory for recorded experiments (default ./eval_store). |
POTATO_EVAL_STORAGE |
file (default) or sqlite. |
POTATO_EVAL_SUITE |
Experiment name label. |
POTATO_EVAL_EXPERIMENT |
Default dataset for --potato-experiment. |
Recorded experiments are plain files — upload them as a CI artifact to track scores across runs, or point them at the same store your Potato server reads.
GitHub Actions¶
A ready-to-copy workflow is in
examples/agent-traces/ci-eval/ci_workflow_example.yml — it runs the suite on
every PR, gates on thresholds, and uploads the experiment records as an artifact.
Example¶
pytest examples/agent-traces/ci-eval/test_agent_eval.py \
-p potato.testing.pytest_plugin \
--potato-threshold correct=0.8 --potato-experiment demo-suite