/history & /usage¶
Every /run, /run-apply, /ask, and /apply lands in the local SQLite store at
~/.amx/history.db. /history is the read interface; /usage summarises token counts
over a window.
What's persisted¶
- `/analyze run` history (status, mode, duration, backend / provider / model, scope)
- Token usage (summary + per-step records)
- Approved / skipped metadata results
- Run failures (error text)
- App events (profile switches, run status, apply outcomes, …)
- All LLM-generated alternatives per column / table per run — every merged suggestion set is saved before human review so you can revisit and change your mind at any time.
- Apply events (`apply_events` table, AMX 0.13+): one row per successful COMMENT write with the prior text, the new text, the run id, and the user / host / profile. Powers `/history rollback` and Studio's Audit page.
/history namespace¶
| Command | Description |
|---|---|
| `/list [-n N]` | Recent runs (includes Duration(s) and Model(s)) |
| `/show <run_id>` | Full run JSON (scope, metrics, tokens, results, errors) |
| `/stats` | Aggregate stats + search lifecycle counts |
| `/events [-n N]` | App events (profile switches, run status, apply outcomes, …) |
| `/results <run_id>` | All saved LLM alternatives for a past run |
| `/review <run_id>` | Re-evaluate alternatives interactively |
| `/rollback <run_id>` | Restore the COMMENTs that this run overwrote |
| `/compare [RUN_IDS…] [flags]` | Pivot runs side-by-side |
Re-reviewing past runs¶
/history review <run_id> # walk every column again
/history review <run_id> --unevaluated-only # only columns you skipped
/history review <run_id> --apply # short-circuit to writing on accept
Useful when:
- You ran the agents weeks ago and your domain knowledge has improved.
- A column you skipped now has clearer evidence (new code / docs ingested since).
- You want to compare suggestions from two different LLM profiles side-by-side before committing.
/rollback¶
/history rollback <run_id> undoes a past /apply by restoring the
COMMENT each affected asset had immediately before the run wrote
to it.
amx /history rollback 42 # interactive (preview + confirm)
amx /history rollback 42 --yes # scripted; skip the prompt
Backed by the apply_events audit table — every successful
COMMENT write records the prior text alongside the new one, so
rollback restores whatever was on the asset before, not just
"what AMX wrote". The DBA's hand-typed comment, an export-tool's
default text, a previous AMX run's output — all valid sources to
roll back to.
What rollback shows you first¶
═══ Rollback run #42 ═══
Found 3 apply event(s); 3 restorable, 0 skipped (original unknown).
Will restore
Asset Current (will be replaced) Restoring to
core.transactions.posting Posting date encoded as YYYY… Posting date (manual; legacy)
core.transactions.amount Amount in transaction currency… Total amount in cents
core.transactions.eff_dt Effective date the row landed… Warehouse arrival date
Restore 3 comment(s) by overwriting current values? [y/N]:
Skipped rows (old_comment unknown)¶
Some rows surface as skipped: the audit row's old_comment is
NULL. Two situations produce this:
- The apply ran before the audit log started capturing pre-write values (anything before AMX 0.13).
- The active backend's adapter doesn't expose a comment-read API, so AMX could not capture the prior text.
Rollback never invents text — skipped rows are reported in the summary and left untouched. To recover them, restore from a DB backup or rerun the original DBA script.
Replay order¶
When a single run wrote to the same asset multiple times (rare but possible — e.g. a chained schema → table → column meta-apply), rollback replays in reverse time order. The last write unwinds first so the asset ends up holding whatever it had before the run started.
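The unwind order can be illustrated with a small sketch. It is not AMX's implementation: the `ApplyEvent` fields are hypothetical stand-ins for whatever `apply_events` actually stores, and a plain dict stands in for the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApplyEvent:
    # Hypothetical shape of one audit row; the real column names may differ.
    asset: str
    old_comment: Optional[str]  # None means the prior text was never captured
    new_comment: str
    applied_at: int  # any sortable timestamp


def plan_rollback(events):
    """Return (restores, skipped); restores are ordered newest write first."""
    restores, skipped = [], []
    for ev in sorted(events, key=lambda e: e.applied_at, reverse=True):
        if ev.old_comment is None:
            skipped.append(ev.asset)  # unknown original: report it, never invent text
        else:
            restores.append((ev.asset, ev.old_comment))
    return restores, skipped


events = [
    ApplyEvent("core.transactions.amount", "Total amount in cents", "draft", 1),
    ApplyEvent("core.transactions.amount", "draft", "Amount in transaction currency", 2),
    ApplyEvent("core.transactions.eff_dt", None, "Effective date", 3),
]

restores, skipped = plan_rollback(events)
comments = {"core.transactions.amount": "Amount in transaction currency"}
for asset, text in restores:  # the last write unwinds first
    comments[asset] = text

print(comments["core.transactions.amount"])  # Total amount in cents
print(skipped)  # ['core.transactions.eff_dt']
```

Because the newest write is undone first, the asset passes back through the intermediate `draft` text and ends on the comment it held before the run started.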
Failure handling¶
The rollback runs inside one engine.begin() transaction.
Per-row failures are reported but do not abort the rest:
✓ core.transactions.posting
✗ core.transactions.amount: COMMENT requires schema USAGE
✓ core.transactions.eff_dt
⚠ Restored 2 of 3; 1 failed.
The failed rows stay on the run's apply_events so a retry after
fixing the privilege grant resumes from the same audit trail.
/compare¶
/history compare pivots multiple runs side-by-side across four Rich tables — run
summary, run settings, per-column descriptions, aggregate metrics — and adds an academic
Quality metrics panel that scores actual correctness rather than the LLM's
self-confidence (which logprob_score alone measures, with well-known overconfidence
bias).
Tables¶
- Run summary: identity (profiles, model, duration, approval rate). Highlights the dimension that varies between runs.
- Run settings: prompt detail, language, batch size, n alternatives, dedup / missing-only flags, review strategy. Exactly which knobs you tuned between runs.
- Per-column results: top description + confidence band + `logprob_score` + tokens. Best logprob per row in green.
- Aggregate metrics: timing + tokens + confidence distribution. Best per row bolded.
- Quality metrics (new): chrF, ROUGE-L, schema grounding, length appropriateness, type-token ratio, optional embedding agreement and LLM-judge win-rate. See Quality framework below.
Studio modal¶
Clicking Compare in amx /studio opens the picker at /runs/compare:
- Paged picker: sticky-header page-size selector (10 / 20 / 50 / 100, persisted in `localStorage`), Prev / Next, "Clear selection". Search + kind filters (analyze / rerun / generate / ask) compose with paging.
- Modal results: pick at least 2 runs, click Compare, the comparison opens in a full-width Dialog over the picker. The picker stays visible underneath so you can iterate on the selection. Previous behaviour (result rendered below the picker) was removed in 0.15.
- Set baseline: once a run is picked, a small Set baseline button appears next to it. Click to pin that run as the academic ground-truth baseline for reference-based metrics (chrF, ROUGE-L, BERTScore). Click again to unpin. Mirrors `--ground-truth-run` on the CLI.
- Run deeper analysis: the modal footer carries a `Sparkles` button that triggers Tier 1 (sentence-transformer embeddings) + Tier 2 (LLM-as-judge tournament). A cost-preview Dialog confirms before any LLM token is spent. The result replaces the Tier 0 view in the same modal.
- Ask AMX: modal footer button. Closes the modal and seeds a chat at `/ask` with the comparison context preloaded; the LLM uses the new `compare_runs` tool to fetch detail itself if it needs more.
- Download PDF: landscape A4 dark-themed report, AMX logo on every page, warm-stone palette identical to the modal, `Methods` section with full bibliographic citations.
CLI flags¶
| Flag | Description |
|---|---|
| `--last N` | Compare the last N runs |
| `--schema NAME` | Restrict to one schema |
| `--table NAME` | Restrict to one table |
| `--column NAME` | Restrict to one column |
| `--command analyze.run\|search.ask\|all` | Filter by command type |
| `--by auto\|llm_profile\|doc_profile\|code_profile\|llm_model\|db_profile` | Group by dimension |
| `--diff` | Word-level highlights vs the leftmost run |
| `--csv FILE` | Also write the comparison as CSV |
| `--md FILE` | Also write as markdown |
| `--json FILE` | Also write as JSON |
| `--quality basic\|full\|none` | Quality metric tier (default basic = Tier 0) |
| `--ground-truth-run ID` | Pin one of the runs as the academic baseline |
JSON output pairs cleanly with pandas / Jupyter. The shape is documented in the AMX repo
under tests/eval/README.md. The keys schema_version, run_summary, per_column, and
aggregate_metrics are stable.
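As a rough sketch of consuming the export from Python using only the standard library, the helper below loads a file written with `--json` and checks that the four stable keys are present. The file name `compare.json` and the helper name `load_comparison` are illustrative, and nothing is assumed about the structure underneath each key; `tests/eval/README.md` remains the reference for that.

```python
import json
from pathlib import Path

# The four top-level keys this page describes as stable.
STABLE_KEYS = ("schema_version", "run_summary", "per_column", "aggregate_metrics")


def load_comparison(path):
    """Load a --json export and fail loudly if a stable key is missing."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in STABLE_KEYS if key not in data]
    if missing:
        raise KeyError(f"comparison export is missing keys: {missing}")
    return data


if __name__ == "__main__":
    export = Path("compare.json")  # hypothetical file produced by `--json compare.json`
    if export.exists():
        comparison = load_comparison(export)
        print(comparison["schema_version"])
```

Checking the keys up front keeps a notebook from failing halfway through an analysis if it is pointed at a file from an incompatible version.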
Quality metric framework¶
/history compare historically picked a "winner" by highest logprob_score. Logprob is
the LLM's self-confidence; it is prone to overconfidence bias and tells you nothing
about whether the description is actually correct. The Quality framework replaces that
with three tiers of academic metrics, opt-in by cost.
Reference resolution waterfall¶
Reference-based metrics (chrF, ROUGE-L, BERTScore, Levenshtein) need a ground truth. AMX walks four sources in order:
- User pin: `--ground-truth-run ID` on the CLI, "Set baseline" radio in Studio.
- Live DB COMMENT: `COMMENT ON COLUMN` / `COMMENT ON TABLE` from the active DB profile. Widely supported (PostgreSQL, Oracle and others), though not part of the SQL standard; the most authoritative ground-truth proxy when the team has already documented the column upstream.
- Catalog applied: most recent `apply_events` row for the same asset (the last description AMX wrote to the DB).
- None: reference-based metrics short-circuit cleanly. Reference-free metrics (length, type-token ratio, schema grounding, embedding agreement, LLM judge) still run.
The Studio modal Quality card shows a one-line resolution summary so you know whether the chrF / ROUGE numbers had a real ground truth or fell back to a baseline run.
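The waterfall amounts to "first non-empty source wins". A minimal sketch, assuming each source has already been fetched as an optional string; the function name, the source labels, and the choice to treat whitespace-only text as absent are illustrative assumptions, not AMX's API:

```python
from typing import Optional, Tuple


def resolve_reference(
    pinned: Optional[str],
    live_comment: Optional[str],
    catalog_applied: Optional[str],
) -> Tuple[str, Optional[str]]:
    """Return (source_label, reference_text) from the first non-empty source."""
    candidates = (
        ("user_pin", pinned),
        ("live_db_comment", live_comment),
        ("catalog_applied", catalog_applied),
    )
    for label, text in candidates:
        # Assumption: blank or whitespace-only text counts as "no reference".
        if text and text.strip():
            return label, text.strip()
    return "none", None  # reference-based metrics are skipped in this case


print(resolve_reference(None, "  Posting date (manual; legacy) ", "Posting date"))
# ('live_db_comment', 'Posting date (manual; legacy)')
print(resolve_reference(None, None, None))
# ('none', None)
```

Returning the label alongside the text is what makes a one-line resolution summary possible.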
Tier 0 — offline, deterministic, free¶
Always on with --quality basic (default).
| Metric | Reference required | Citation | Library |
|---|---|---|---|
| Length appropriateness | no | (heuristic) | stdlib |
| Type-token ratio (TTR) | no | Templin 1957 | stdlib |
| Schema grounding | no | Jaccard 1912 token containment | stdlib |
| chrF | yes | Popović 2015 | sacrebleu |
| ROUGE-L | yes | Lin 2004 | rouge-score |
| Levenshtein edit distance | yes | Levenshtein 1966 | difflib |
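To make the reference-free rows concrete, here is a standard-library sketch of type-token ratio and two token-overlap scores. The tokenizer and function names are assumptions for illustration. Since the table cites Jaccard but describes "token containment", both variants are shown; which one AMX computes is not stated here.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; dots and underscores act as separators."""
    return re.findall(r"[a-z0-9]+", text.lower())


def type_token_ratio(description):
    """Distinct tokens divided by total tokens (0.0 for empty text)."""
    toks = tokens(description)
    return len(set(toks)) / len(toks) if toks else 0.0


def jaccard(a, b):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def containment(schema_name, description):
    """Share of schema-name tokens that also occur in the description."""
    schema = set(tokens(schema_name))
    return len(schema & set(tokens(description))) / len(schema) if schema else 0.0


desc = "The posting date of the transaction"
asset = "core.transactions.posting_date"
print(type_token_ratio(desc))    # 5 distinct tokens out of 6
print(jaccard(asset, desc))      # 2 shared tokens, 7 in the union
print(containment(asset, desc))  # 2 of the 4 schema tokens appear
```

Note that `transactions` and `transaction` do not match under exact token comparison; a stemmer would change the scores.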
Tier 1 — local sentence embeddings (free, opt-in)¶
Fired by --quality full on the CLI or "Run deeper analysis" in Studio.
- Embedding agreement matrix: for each asset, pairwise cosine similarity between the runs' descriptions. High = the run agrees with the consensus; low = outlier.
- Semantic schema grounding: cosine similarity between the description embedding and a synthetic schema-anchor embedding (`table.column (dtype)`).
- BERTScore (optional): Zhang et al. 2020. Heavier (~400 MB model). Opt-in via the `bertscore` extra.
pip install amx-cli[quality,local-embeddings] # all-MiniLM-L6-v2 default
pip install amx-cli[quality,bertscore] # + BERTScore
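The agreement matrix itself is simple once embeddings exist. The sketch below uses toy 2-d vectors in place of real sentence embeddings, and reads "agrees with the consensus" as the mean similarity to the other runs; that aggregation and the function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agreement_matrix(embeddings):
    """Pairwise cosine similarity between the runs' embeddings for one asset."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def mean_agreement(matrix):
    """Average similarity of each run to the other runs (diagonal excluded)."""
    n = len(matrix)
    if n < 2:
        return [0.0] * n
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(matrix)
    ]


# Toy vectors standing in for the embeddings of three runs' descriptions.
runs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = mean_agreement(agreement_matrix(runs))
print(scores.index(min(scores)))  # 2: the third run is the outlier
```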
Tier 2 — LLM-as-judge (opt-in, consumes tokens on the active LLM)¶
G-Eval pairwise tournament (Liu et al. 2023; Prometheus 2 — Kim et al. 2024 — uses the
same evaluator family). For each asset and each pair (run_a, run_b), the active LLM
returns:
Per-run win-rate (wins / pairings) is the headline aggregate. Cost rolls into the
run's tokens_json. Cached by (run_a, run_b, asset) so duplicate calls don't re-bill.
A typical 50-column × 3-run comparison runs ~150 judge calls; on gpt-4o-mini that's
roughly $0.01–$0.02.
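The call count and the win-rate aggregate can be checked with a short sketch. The verdict format below (winning run id, or `None` for a tie) and the stand-in judge are assumptions for illustration; a real judge would call the active LLM.

```python
from collections import Counter
from itertools import combinations


def pairings(run_ids, assets):
    """Every (run_a, run_b, asset) triple; the same triple can serve as a cache key."""
    return [
        (a, b, asset)
        for asset in assets
        for a, b in combinations(sorted(run_ids), 2)
    ]


def win_rates(verdicts):
    """verdicts maps (run_a, run_b, asset) to the winning run id, or None for a tie."""
    wins, played = Counter(), Counter()
    for (run_a, run_b, _asset), winner in verdicts.items():
        played[run_a] += 1
        played[run_b] += 1
        if winner is not None:
            wins[winner] += 1
    return {run: wins[run] / played[run] for run in played}


calls = pairings([58, 59, 60], [f"col_{i}" for i in range(50)])
print(len(calls))  # 150: 3 pairs per column x 50 columns

# Stand-in judge: the lower run id always wins.
verdicts = {key: key[0] for key in calls}
print(win_rates(verdicts))  # {58: 1.0, 59: 0.5, 60: 0.0}
```

With n runs the number of pairs per asset is n(n-1)/2, so the call count grows quadratically in runs but only linearly in columns.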
Examples¶
Compare the last three runs with the default Tier 0 quality panel:
Pin run 60 as the ground truth and run the full Tier 1+2 pipeline:
Export to JSON for downstream analysis:
Ask AMX integration¶
Natural-language compare via the compare_runs LLM tool. Two examples:
The agent calls compare_runs(run_ids=[58, 59], quality_tier=1) and explains why each
run wins per metric — not just the numbers. Sample response style:
Run #58 wins on schema grounding (0.84 vs 0.52, Jaccard 1912) because its descriptions reference both the column name and dtype, whereas #59 stays generic. #59 wins on chrF (Popović 2015) by a small margin against the live DB COMMENT, which is closer to the existing wording the team agreed on.
The agent first calls list_past_runs(table="address") to resolve candidate IDs, then
compare_runs with the matching set.
Academic methods¶
Bibliographic references for the Quality framework. The Studio modal renders this as a
collapsed footnote under the Quality card; the PDF report prints it as a Methods
section at the bottom.
- chrF: Popović, M. (2015). chrF: character n-gram F-score for automatic MT evaluation. WMT 2015. https://aclanthology.org/W15-3049/
- ROUGE-L: Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL workshop. https://aclanthology.org/W04-1013/
- BERTScore: Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020. https://arxiv.org/abs/1904.09675
- G-Eval: Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. EMNLP 2023. https://arxiv.org/abs/2303.16634
- Prometheus 2: Kim, S., Suk, J., Longpre, S., et al. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. EMNLP 2024. https://arxiv.org/abs/2405.01535
- Type-token ratio: Templin, M. C. (1957). Certain Language Skills in Children. University of Minnesota Press.
- Levenshtein distance: Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8).
- Jaccard similarity (schema grounding): Jaccard, P. (1912). The Distribution of the Flora in the Alpine Zone. New Phytologist, 11(2), 37–50.
/usage¶
Reads from ~/.amx/history.db only — no network calls. The summary breaks down
prompt and completion tokens per LLM profile and per model, so you can see which models
your team uses most.
Where it lives on disk¶
The SQLite schema is part of the public contract — additive migrations within a major version, column types and meanings stable. See Python API for the full guarantees.
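Because the store is plain SQLite, it can be inspected without going through AMX. The sketch below opens a database read-only and lists each table with its column names; it assumes nothing about the schema beyond it being a SQLite file, and `list_tables` is an illustrative helper name.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: [column names]} without writing to the database."""
    uri = f"{Path(db_path).expanduser().resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)  # mode=ro opens the file read-only
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    history = Path("~/.amx/history.db").expanduser()
    if history.exists():
        for table, columns in list_tables(history).items():
            print(table, columns)
```

Opening with `mode=ro` means an exploratory script cannot modify the history by accident.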
Sharing history across a team¶
By default ~/.amx/history.db is per-machine. Enable shared mode
to dual-write every run, result, and event to a backend the team already owns. Reads still
come from local SQLite; cross-machine read views are slated for a follow-up minor release.