Retrieval evaluation harness
AMX ships a small, offline-runnable evaluation harness for the docs
RAG pipeline. Every retrieval-architecture change is gated against a
committed baseline so a code change that quietly degrades answer
quality fails CI rather than landing on main.
What it measures
The harness ingests a synthetic six-document plain-text corpus
covering orders, customers, products, inventory, a glossary, and ETL.
The content is Markdown-formatted, but the files are saved as .txt so
that ingest uses the dependency-free TextLoader. The corpus is loaded
into a fresh RAGStore, 20 gold-set questions are run through the live
retrieval surface, and the harness reports source-level metrics:
| Metric | Reads |
|---|---|
| hit@1, hit@3, hit@5 | Did the right document show up in the top-N? |
| MRR | At what rank did the first correct document appear? |
| precision@5 | What fraction of the top-5 unique sources are relevant? |
| nDCG@5 | Rank-weighted precision (logarithmic discount). |
| keyword_recall | Did the retrieved chunks contain the expected answer terms? |
The hits-to-sources projection deduplicates by filename while preserving rank order — five chunks from the same file count once when judging whether the document surfaced, regardless of how many of its chunks landed in the window.
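As a rough illustration of the projection and the rank-based metrics above, the sketch below deduplicates chunk hits by filename and scores the resulting source list. It is not the code in tests/eval/metrics.py: the function names, the hit dictionary shape (a `source` key), the filenames, and the binary-relevance form of nDCG are all assumptions made for the example.

```python
import math


def unique_sources(hits):
    """Project ranked chunk hits to ranked filenames, keeping the first occurrence only."""
    seen = set()
    ordered = []
    for hit in hits:
        name = hit["source"]  # assumed key; the real hit objects may differ
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered


def hit_at_k(sources, expected, k):
    """1.0 if any expected source appears in the first k unique sources, else 0.0."""
    return 1.0 if any(s in expected for s in sources[:k]) else 0.0


def reciprocal_rank(sources, expected):
    """1 / rank of the first expected source, or 0.0 if none was retrieved."""
    for rank, name in enumerate(sources, start=1):
        if name in expected:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(sources, expected, k):
    """Binary-relevance nDCG with a log2 rank discount (an assumed formulation)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, name in enumerate(sources[:k], start=1)
        if name in expected
    )
    ideal_hits = min(len(expected), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical retrieval result: three of the five chunks come from one file.
hits = [
    {"source": "glossary.txt"},
    {"source": "glossary.txt"},
    {"source": "orders.txt"},
    {"source": "glossary.txt"},
    {"source": "inventory.txt"},
]
expected = {"orders.txt"}

sources = unique_sources(hits)
print(sources)
print(hit_at_k(sources, expected, 1), hit_at_k(sources, expected, 3))
print(reciprocal_rank(sources, expected), ndcg_at_k(sources, expected, 5))
```

In this example the three glossary chunks collapse to a single entry, so the source list is glossary.txt, orders.txt, inventory.txt. The expected document sits at rank 2, which gives hit@1 = 0, hit@3 = 1, and a reciprocal rank of 0.5.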
Where it lives
| Path | Role |
|---|---|
| tests/eval/metrics.py | Pure scoring functions (hit@k, MRR, nDCG@k, …). |
| tests/eval/runner.py | End-to-end driver against the real RAGStore. |
| tests/eval/fixtures/docs/ | Synthetic plain-text corpus (Markdown-formatted content, .txt extension). |
| tests/eval/fixtures/docs_gold.jsonl | Gold set: 20 question / expected-source / expected-content triples. |
| tests/eval/baselines/docs_baseline.json | The committed CI floor. |
| tests/eval/test_baselines.py | The CI gate. |
| tests/eval/generate_baselines.py | Regenerates the baseline JSON. |
How to run
The harness runs as part of the standard test suite:
Two tests in particular:
- test_docs_eval_runner_smoke: always on. Asserts the runner executes end-to-end and surfaces a relevant document for at least half of the gold-set questions. Catches a broken runner before the baseline test ever sees it.
- test_docs_eval_no_regression_vs_baseline: skipped if the baseline file is missing (useful during initial scaffolding); otherwise asserts no regression vs the committed baseline.
Total runtime is ~6 s on the CI matrix. The harness uses the bundled
MiniLM embedder so there is no network call and no model download —
it honours AMX_NO_NETWORK=1.
CI gates
The baseline test fails the build if any of the following regress beyond their tolerance:
| Metric | Rule | Tolerance |
|---|---|---|
| hit@3 | new ≥ baseline | 0 pp (hard floor) |
| precision@5 | new ≥ baseline − 0.02 | 2 pp |
| MRR | new ≥ baseline − 0.03 | 3 pp |
| nDCG@5, keyword_recall | tracked, not gated | n/a |
The hard floor on hit@3 is the load-bearing assertion: it does not
tolerate any regression. The other tolerances absorb small variance
from reranker tweaks without letting a real degradation slip through.
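The gating rules can be read as a per-metric floor of baseline minus tolerance. The following is a minimal sketch of that comparison, not the contents of tests/eval/test_baselines.py: the function name, the dictionary keys, and the small epsilon for float noise are assumptions, and the nDCG figures are invented purely to show that an ungated metric does not trip the check.

```python
# Gated metric -> allowed absolute drop below the baseline (0..1 scale).
# Key names are hypothetical; the real baseline JSON may use different ones.
TOLERANCES = {"hit@3": 0.0, "precision@5": 0.02, "MRR": 0.03}


def find_regressions(baseline, new, tolerances=TOLERANCES, eps=1e-9):
    """Return one message per gated metric that fell below its floor."""
    failures = []
    for metric, tolerance in tolerances.items():
        floor = baseline[metric] - tolerance
        # eps guards against float noise when new equals the floor exactly
        if new[metric] < floor - eps:
            failures.append(f"{metric}: {new[metric]:.3f} < floor {floor:.3f}")
    return failures


baseline = {"hit@3": 0.90, "precision@5": 0.27, "MRR": 0.80, "nDCG@5": 0.82}

# Small dips inside tolerance, plus a large drop in the ungated nDCG@5.
within_tolerance = {"hit@3": 0.90, "precision@5": 0.26, "MRR": 0.78, "nDCG@5": 0.70}

# Any drop in hit@3 fails, even if everything else holds or improves.
hit3_regressed = {"hit@3": 0.85, "precision@5": 0.27, "MRR": 0.80, "nDCG@5": 0.90}

print(find_regressions(baseline, within_tolerance))
print(find_regressions(baseline, hit3_regressed))
```

The first run passes because precision@5 and MRR stay above their floors of 0.25 and 0.77, and nDCG@5 is only tracked. The second run fails on hit@3 alone, which is the hard-floor behaviour described above.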
Updating the baseline
When a PR changes retrieval behaviour intentionally — a new embedder, a new chunker, hybrid retrieval, a re-ranker — regenerate the baseline and commit the new JSON in the same PR so the metric delta is visible to reviewers:
```bash
python -m tests.eval.generate_baselines --print
git add tests/eval/baselines/docs_baseline.json
git commit -m "feat(retrieval): switch to hybrid FTS5+vector

eval delta:
hit@3: 0.90 -> 0.95 (+0.05)
MRR: 0.80 -> 0.86 (+0.06)
precision@5: 0.27 -> 0.31 (+0.04)
"
```
The CI workflow does not auto-update the baseline. A baseline change is always an explicit author action with a reason in the commit message.
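For authors who prefer not to compute the delta lines by hand, a small helper along these lines could format them from the old and new baseline values. This is only a sketch: `format_eval_delta` is a hypothetical name, and the dictionary keys are assumed rather than taken from the actual docs_baseline.json schema.

```python
def format_eval_delta(old, new, metrics=("hit@3", "MRR", "precision@5")):
    """Build the 'eval delta' block of a commit message from two metric dicts."""
    lines = ["eval delta:"]
    for metric in metrics:
        delta = new[metric] - old[metric]
        # :+.2f always prints the sign, so regressions show up as negative values
        lines.append(f"{metric}: {old[metric]:.2f} -> {new[metric]:.2f} ({delta:+.2f})")
    return "\n".join(lines)


# Values taken from the example commit message above.
old = {"hit@3": 0.90, "MRR": 0.80, "precision@5": 0.27}
new = {"hit@3": 0.95, "MRR": 0.86, "precision@5": 0.31}

message = format_eval_delta(old, new)
print(message)
```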
Adding to the gold set
The synthetic corpus is intentionally small so the harness runs in seconds. To strengthen coverage:
- Add a plain-text file to tests/eval/fixtures/docs/ (the .txt extension keeps CI offline; use .md only if your environment has the markdown PyPI package).
- Add the relevant question rows to tests/eval/fixtures/docs_gold.jsonl. Each row is one JSON object with id, question, expected_sources (list of filenames), and optional expected_answer_contains (list of tokens).
- Regenerate the baseline.
- Commit the corpus addition, the gold-set addition, and the new baseline together; the baseline-gate test detects gold-set size changes and fails if they're not accompanied by a rebaseline.
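To make the row format concrete, here is a sketch that builds one gold-set line and checks it for the fields listed above. The id, question, filename, and answer token are invented for illustration, and `parse_gold_row` is a hypothetical helper; whether the real runner rejects an empty expected_sources list is an assumption.

```python
import json

REQUIRED_FIELDS = {"id", "question", "expected_sources"}


def parse_gold_row(line):
    """Parse one JSONL line and check it against the documented gold-row fields."""
    row = json.loads(line)
    missing = REQUIRED_FIELDS - row.keys()
    if missing:
        raise ValueError(f"gold row missing fields: {sorted(missing)}")
    sources = row["expected_sources"]
    if not isinstance(sources, list) or not sources:
        # Assumed rule: a question with no expected source cannot be scored.
        raise ValueError("expected_sources must be a non-empty list of filenames")
    # expected_answer_contains is optional; default to no keyword checks.
    row.setdefault("expected_answer_contains", [])
    return row


# Hypothetical row for a newly added returns.txt fixture.
example_line = json.dumps(
    {
        "id": "returns-01",
        "question": "How long is the return window for an order?",
        "expected_sources": ["returns.txt"],
        "expected_answer_contains": ["30 days"],
    }
)

row = parse_gold_row(example_line)
print(row["id"], row["expected_sources"])
```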
Per-corpus evaluation
The bundled corpus is synthetic on purpose — it must run offline in CI. To evaluate against a real corpus locally:
```python
from pathlib import Path

from tests.eval.runner import run_docs_eval

report = run_docs_eval(
    persist_dir=Path("/tmp/eval-chroma"),
    fixture_dir=Path("./my-docs"),
    gold_path=Path("./my-gold.jsonl"),
)
print(report.to_baseline_dict())
```
The metric logic is re-usable; only the fixture_dir and gold_path
arguments change. To compare embedding providers on the same corpus,
swap the cfg.embedding_docs.kind setting in between runs — or
cfg.embedding_code.kind when evaluating code retrieval — and report
per-query deltas alongside the aggregate. The two sides are independent,
so a docs-side change does not affect code retrieval and vice versa.
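One way to report per-query deltas alongside the aggregate is sketched below. It assumes you have already extracted a per-question score (here, reciprocal rank keyed by gold-set id) from each run; how the report object exposes per-query results is not documented here, so the input dictionaries, the ids, and the scores are all hypothetical.

```python
def per_query_deltas(run_a, run_b):
    """Return (delta per shared query id, mean delta) for two score mappings."""
    shared = sorted(run_a.keys() & run_b.keys())
    deltas = {qid: run_b[qid] - run_a[qid] for qid in shared}
    mean_delta = sum(deltas.values()) / len(deltas) if deltas else 0.0
    return deltas, mean_delta


# Hypothetical reciprocal-rank scores from two runs with different embedders.
rr_first_embedder = {"q01": 1.0, "q02": 0.5, "q03": 0.0}
rr_second_embedder = {"q01": 1.0, "q02": 1.0, "q03": 0.5}

deltas, mean_delta = per_query_deltas(rr_first_embedder, rr_second_embedder)
for qid, delta in deltas.items():
    print(f"{qid}: {delta:+.2f}")
print(f"mean: {mean_delta:+.2f}")
```

Listing the per-query rows matters because an unchanged aggregate can hide one question improving while another regresses.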