Retrieval evaluation harness

AMX ships a small, offline-runnable evaluation harness for the docs RAG pipeline. Every retrieval-architecture change is gated against a committed baseline, so a code change that quietly degrades retrieval quality fails CI rather than landing on main.

What it measures

The harness ingests a synthetic six-document plain-text corpus (orders, customers, products, inventory, glossary, ETL) into a fresh RAGStore. The content is Markdown-formatted, but the files are saved as .txt so that ingest uses the dependency-free TextLoader. The harness then runs 20 gold-set questions through the live retrieval surface and reports source-level metrics:

Metric	Reads
`hit@1`, `hit@3`, `hit@5`	Did the right document show up in the top-N?
`MRR`	Mean reciprocal rank: 1/rank of the first correct document, averaged over all questions.
`precision@5`	What fraction of the top-5 unique sources are relevant?
`nDCG@5`	Rank-weighted relevance of the top-5 (logarithmic discount), normalised by the ideal ranking.
`keyword_recall`	Did the retrieved chunks contain the expected answer terms?
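
The source-level metrics in the table can be sketched as small pure functions over a ranked list of source filenames and a set of relevant filenames. This is a minimal illustration under stated assumptions, not the contents of `tests/eval/metrics.py`: the function names and signatures are hypothetical, relevance is treated as binary, the precision denominator is assumed to be the number of unique sources actually returned, and the filenames in the example are made up.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant source appears in the top-k, else 0.0."""
    return 1.0 if any(src in relevant for src in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant source (ranks start at 1); 0.0 if none.

    MRR is the mean of this value over all gold-set questions.
    """
    for rank, src in enumerate(ranked, start=1):
        if src in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k returned sources that are relevant (assumed denominator)."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(src in relevant for src in top) / len(top)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, src in enumerate(ranked[:k], start=1)
        if src in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking: the one relevant file surfaces at rank 2.
ranked = ["glossary.txt", "orders.txt", "customers.txt"]
relevant = {"orders.txt"}
print(hit_at_k(ranked, relevant, 1))      # 0.0
print(hit_at_k(ranked, relevant, 3))      # 1.0
print(reciprocal_rank(ranked, relevant))  # 0.5
```

Keeping the functions free of I/O means they can be unit-tested without building a store, which is presumably why the scoring lives in its own module.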

The hits-to-sources projection deduplicates by filename while preserving rank order — five chunks from the same file count once when judging whether the document surfaced, regardless of how many of its chunks landed in the window.
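
The projection described above amounts to an order-preserving deduplication. A minimal sketch follows, assuming each hit has already been reduced to its source filename; the function name and the example filenames are hypothetical, and the real hit objects probably carry more metadata.

```python
from typing import Iterable, List


def hits_to_sources(hit_filenames: Iterable[str]) -> List[str]:
    """Collapse ranked chunk hits to ranked unique filenames (first occurrence wins)."""
    seen = set()
    sources = []
    for name in hit_filenames:
        if name not in seen:
            seen.add(name)
            sources.append(name)
    return sources


# Five chunk hits, three of them from the same (hypothetical) file.
chunk_hits = ["orders.txt", "orders.txt", "glossary.txt", "orders.txt", "etl.txt"]
print(hits_to_sources(chunk_hits))  # ['orders.txt', 'glossary.txt', 'etl.txt']
```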

Where it lives

Path	Role
`tests/eval/metrics.py`	Pure scoring functions (`hit@k`, `MRR`, `nDCG@k`, …).
`tests/eval/runner.py`	End-to-end driver against the real `RAGStore`.
`tests/eval/fixtures/docs/`	Synthetic plain-text corpus (Markdown-formatted content, `.txt` extension).
`tests/eval/fixtures/docs_gold.jsonl`	Gold set: 20 question / expected-source / expected-content triples.
`tests/eval/baselines/docs_baseline.json`	The committed CI floor.
`tests/eval/test_baselines.py`	The CI gate.
`tests/eval/generate_baselines.py`	Regenerates the baseline JSON.

How to run

The harness runs as part of the standard test suite:

pytest tests/eval/

Two tests in particular:

test_docs_eval_runner_smoke — always on. Asserts the runner executes end-to-end and surfaces a relevant document for at least half of the gold-set questions. Catches a broken runner before the baseline test ever sees it.
test_docs_eval_no_regression_vs_baseline — skipped if the baseline file is missing (useful during initial scaffolding); otherwise asserts no regression vs the committed baseline.

Total runtime is ~6 s on the CI matrix. The harness uses the bundled MiniLM embedder so there is no network call and no model download — it honours AMX_NO_NETWORK=1.

CI gates

The baseline test fails the build if any of the following regress beyond their tolerance:

Metric	Rule	Tolerance
`hit@3`	new ≥ baseline	0 pp (hard floor)
`precision@5`	new ≥ baseline − 0.02	2 pp
`MRR`	new ≥ baseline − 0.03	3 pp
`nDCG@5`, `keyword_recall`	tracked, not gated	—

The hard floor on hit@3 is the load-bearing assertion: it does not tolerate any regression. The other tolerances absorb small variance from reranker tweaks without letting a real degradation slip through.
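
The gate rules in the table reduce to one comparison per metric. The sketch below is an illustration rather than the code in `tests/eval/test_baselines.py`: the dictionary keys, the function name, and the small epsilon used to absorb floating-point rounding at the boundary are all assumptions.

```python
from typing import Dict, List

# Tolerances copied from the table above. nDCG@5 and keyword_recall are
# tracked but not gated, so they are deliberately absent here.
TOLERANCES: Dict[str, float] = {"hit@3": 0.0, "precision@5": 0.02, "MRR": 0.03}
EPS = 1e-9  # assumption: guard against float rounding exactly at the floor


def find_regressions(new: Dict[str, float], baseline: Dict[str, float]) -> List[str]:
    """Return one message per gated metric that fell below baseline minus tolerance."""
    failures = []
    for metric, tol in TOLERANCES.items():
        floor = baseline[metric] - tol
        if new[metric] < floor - EPS:
            failures.append(f"{metric}: {new[metric]:.4f} < floor {floor:.4f}")
    return failures


# Hypothetical numbers: hit@3 drops (hard floor), the others stay within tolerance.
baseline = {"hit@3": 0.90, "precision@5": 0.27, "MRR": 0.80}
new = {"hit@3": 0.85, "precision@5": 0.26, "MRR": 0.78}
print(find_regressions(new, baseline))  # ['hit@3: 0.8500 < floor 0.9000']
```

Collecting every failure before asserting, rather than stopping at the first, lets a single CI run report all regressed metrics at once.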

Updating the baseline

When a PR changes retrieval behaviour intentionally — a new embedder, a new chunker, hybrid retrieval, a re-ranker — regenerate the baseline and commit the new JSON in the same PR so the metric delta is visible to reviewers:

python -m tests.eval.generate_baselines --print
git add tests/eval/baselines/docs_baseline.json
git commit -m "feat(retrieval): switch to hybrid FTS5+vector

eval delta:
  hit@3:        0.90 -> 0.95 (+0.05)
  MRR:          0.80 -> 0.86 (+0.06)
  precision@5:  0.27 -> 0.31 (+0.04)
"

The CI workflow does not auto-update the baseline. A baseline change is always an explicit author action with a reason in the commit message.
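
The "eval delta" block in the commit message can be produced from the old and new baseline values instead of typed by hand. This is a hypothetical helper, not part of the harness; it assumes the baseline values are available as flat dictionaries keyed by metric name, and it reuses the illustrative numbers from the commit message above.

```python
from typing import Dict, Sequence


def format_eval_delta(
    old: Dict[str, float],
    new: Dict[str, float],
    metrics: Sequence[str] = ("hit@3", "MRR", "precision@5"),
) -> str:
    """Render an 'eval delta' block suitable for pasting into a commit message."""
    lines = ["eval delta:"]
    for name in metrics:
        label = f"{name}:"
        delta = new[name] - old[name]
        # Pad the label so the numeric columns line up; always show the delta's sign.
        lines.append(f"  {label:<14}{old[name]:.2f} -> {new[name]:.2f} ({delta:+.2f})")
    return "\n".join(lines)


old = {"hit@3": 0.90, "MRR": 0.80, "precision@5": 0.27}
new = {"hit@3": 0.95, "MRR": 0.86, "precision@5": 0.31}
print(format_eval_delta(old, new))
```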

Adding to the gold set

The synthetic corpus is intentionally small so the harness runs in seconds. To strengthen coverage:

1. Add a plain-text file to tests/eval/fixtures/docs/ (.txt extension keeps CI offline; use .md only if your environment has the markdown PyPI package).
2. Add the relevant question rows to tests/eval/fixtures/docs_gold.jsonl. Each row is one JSON object with id, question, expected_sources (list of filenames), and optional expected_answer_contains (list of tokens).
3. Regenerate the baseline.
4. Commit the corpus addition, the gold-set addition, and the new baseline together: the baseline-gate test detects gold-set size changes and fails if they're not accompanied by a rebaseline.
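
Before committing new gold rows it can help to check that each line parses and carries the fields listed above. A minimal sketch follows; the helper name is hypothetical, the example question and filename are invented for illustration, and the harness may apply stricter or different validation.

```python
import json
from typing import Any, Dict


def parse_gold_row(line: str) -> Dict[str, Any]:
    """Parse one JSONL line and check the fields the gold set is described as using."""
    row = json.loads(line)
    for field in ("id", "question", "expected_sources"):
        if field not in row:
            raise ValueError(f"gold row missing required field: {field}")
    if not isinstance(row["expected_sources"], list):
        raise ValueError("expected_sources must be a list of filenames")
    # expected_answer_contains is optional; default it to an empty list.
    row.setdefault("expected_answer_contains", [])
    return row


# Invented example row; the id scheme and filename are assumptions.
example = json.dumps({
    "id": "q21",
    "question": "Which table stores refund amounts?",
    "expected_sources": ["refunds.txt"],
    "expected_answer_contains": ["refund", "amount"],
})
row = parse_gold_row(example)
print(row["id"], row["expected_sources"])  # q21 ['refunds.txt']
```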

Per-corpus evaluation

The bundled corpus is synthetic on purpose — it must run offline in CI. To evaluate against a real corpus locally:

from pathlib import Path
from tests.eval.runner import run_docs_eval

report = run_docs_eval(
    persist_dir=Path("/tmp/eval-chroma"),
    fixture_dir=Path("./my-docs"),
    gold_path=Path("./my-gold.jsonl"),
)
print(report.to_baseline_dict())

The metric logic is reusable; only the fixture_dir and gold_path arguments change. To compare embedding providers on the same corpus, swap the cfg.embedding_docs.kind setting between runs (or cfg.embedding_code.kind when evaluating code retrieval) and report per-query deltas alongside the aggregate. The docs and code embedding settings are independent, so a docs-side change does not affect code retrieval and vice versa.
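
Per-query deltas matter because an unchanged aggregate can hide queries that moved in opposite directions. The sketch below assumes per-query scores (here, reciprocal rank) have been collected into dictionaries keyed by gold-set id for each run; whether the report object exposes them in this shape is not specified here, and the helper name and ids are hypothetical.

```python
from typing import Dict, List, Tuple


def per_query_deltas(
    before: Dict[str, float], after: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (question id, after - before) for changed queries, worst regression first."""
    shared = before.keys() & after.keys()
    deltas = [(qid, after[qid] - before[qid]) for qid in shared]
    changed = [(qid, d) for qid, d in deltas if abs(d) > 1e-9]
    return sorted(changed, key=lambda item: (item[1], item[0]))


# Hypothetical reciprocal ranks for two embedding providers on the same gold set.
rr_a = {"q01": 1.0, "q02": 0.5, "q03": 1.0}
rr_b = {"q01": 1.0, "q02": 1.0, "q03": 0.5}

mrr_a = sum(rr_a.values()) / len(rr_a)
mrr_b = sum(rr_b.values()) / len(rr_b)
print(mrr_a == mrr_b)                  # True: the aggregate MRR is identical
print(per_query_deltas(rr_a, rr_b))    # [('q03', -0.5), ('q02', 0.5)]
```

In this example the two runs have the same MRR, yet one query improved and another regressed, which only the per-query view reveals.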