Benchmarks
Status — planned
AMX has not yet published external benchmark numbers for AI-generated column descriptions. This page is here for transparency: it explains the evaluation gap honestly, names the protocol AMX will adopt, and lays out a contribution path so the community can help close it.
The state of the field
There is no widely accepted public benchmark for AI-generated database descriptions. The two related academic benchmarks measure something else:
| Benchmark | Measures | Relevance to AMX |
|---|---|---|
| BIRD | Text-to-SQL execution accuracy across 12,751 questions / 95 databases | Tests querying a documented schema, not generating the documentation |
| Spider | Cross-domain text-to-SQL on 200 databases | Same — assumes documentation already exists |
Neither evaluates whether a generated description is faithful to the column's real meaning, which is what AMX produces. The closest existing protocol is the one Databricks published with its 2024 AI Comments preview: 62 schemas drawn from production workloads, two human reviewers plus an LLM judge, and the human preference rate of generated descriptions vs. the prior baseline as the metric. We treat that protocol as the de facto methodology.
What AMX does report internally
Every `/run` already records a per-column logprob-derived confidence band:
| Band | Definition |
|---|---|
| `verified` | Token-level confidence ≥ 0.95 across the entire description; cross-checked against profile + RAG + code evidence |
| `high_likelihood` | Confidence ≥ 0.85; majority of evidence sources agreed |
| `possible` | Confidence ≥ 0.50; at least one evidence source agreed |
| `weak_hypothesis` | Below 0.50; surfaced for review with no auto-accept |
These bands are internal calibration, not public benchmarks. They help the human reviewer triage what to read first; they don't tell an external buyer how AMX compares against, say, Snowflake Cortex on the same 62 schemas.
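As a rough illustration of how such a banding could work, the sketch below maps token logprobs and evidence agreement to the four bands in the table. Only the band names and thresholds come from the table. The rest is assumed for illustration: "confidence across the entire description" is read as the lowest per-token probability, the evidence source labels `profile`, `rag`, and `code` are hypothetical, and so are the function names. It is not AMX's actual implementation.

```python
import math
from typing import Sequence, Set

# Hypothetical labels for the three evidence sources named in the table.
EVIDENCE_SOURCES = ("profile", "rag", "code")


def description_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumption: the description is only as confident as its weakest token."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(min(token_logprobs))


def confidence_band(token_logprobs: Sequence[float], agreeing: Set[str]) -> str:
    """Return the first band whose threshold and evidence rule both hold."""
    confidence = description_confidence(token_logprobs)
    agreed = len(agreeing & set(EVIDENCE_SOURCES))
    total = len(EVIDENCE_SOURCES)

    if confidence >= 0.95 and agreed == total:
        return "verified"
    if confidence >= 0.85 and agreed * 2 > total:  # strict majority
        return "high_likelihood"
    if confidence >= 0.50 and agreed >= 1:
        return "possible"
    return "weak_hypothesis"  # surfaced for review, never auto-accepted


if __name__ == "__main__":
    print(confidence_band([-0.01, -0.02], {"profile", "rag", "code"}))
    print(confidence_band([-0.01, -0.10], {"profile", "rag"}))
```

In this sketch a description that clears a confidence threshold but lacks the matching evidence agreement falls through to the next band down, which is one possible reading of the table rather than documented behaviour.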
The planned evaluation
The Databricks-style protocol AMX will publish:
- Dataset: 50–80 production schemas, anonymised, drawn from at least four backends (PostgreSQL, Snowflake, BigQuery, Databricks) and three domains (e.g. e-commerce, healthcare, ERP-style enterprise). Open release of the schema-and-ground-truth pairs as a benchmark dataset in its own right.
- Baselines: at minimum, AMX (with `gpt-4o`, `claude-sonnet-4`, and `gemini-2.0-flash`), Snowflake Cortex `AI_GENERATE_TABLE_DESC` on its supported subset, Databricks AI Comments via Catalog Explorer, and BigQuery Gemini Insights.
- Reviewers: two domain experts per schema, blind to the source tool. A third tie-break reviewer when the first two disagree.
- LLM judge: a fixed `gpt-4o-2024-11-20` snapshot scoring each description on a 5-point rubric (faithfulness, specificity, clarity, actionability, no-hallucination).
- Metrics: preference rate vs. the original `COMMENT ON` text (where one exists) and pairwise win-rate against each baseline.
- Reproducibility: full prompts, model versions, and reviewer guidelines published alongside the numbers. Anyone with the dataset should be able to re-run the protocol against their own tool.
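The reviewer and metric steps above could be scored along the following lines. This is a minimal sketch under stated assumptions: the vote labels `"amx"`, `"baseline"`, and `"tie"`, the function names, and the choice to exclude ties from the win-rate denominator are all hypothetical, since the protocol has not been published yet.

```python
from collections import Counter
from typing import Optional, Sequence


def resolve_preference(first: str, second: str,
                       tie_break: Optional[str] = None) -> str:
    """Combine two blind reviewer votes for one description pair.

    If the two reviewers agree, their vote stands. If they disagree,
    the third (tie-break) reviewer's vote decides.
    """
    if first == second:
        return first
    if tie_break is None:
        raise ValueError("reviewers disagree; a tie-break vote is required")
    return tie_break


def win_rate(outcomes: Sequence[str], tool: str = "amx") -> float:
    """Share of decided comparisons won by `tool`.

    Assumption: comparisons resolved as "tie" are left out of the denominator.
    """
    counts = Counter(outcomes)
    decided = sum(n for label, n in counts.items() if label != "tie")
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts[tool] / decided


if __name__ == "__main__":
    # (reviewer 1, reviewer 2, tie-break reviewer or None), made-up votes
    votes = [
        ("amx", "amx", None),
        ("amx", "baseline", "baseline"),
        ("baseline", "amx", "amx"),
        ("tie", "tie", None),
    ]
    outcomes = [resolve_preference(*v) for v in votes]
    print(outcomes, win_rate(outcomes))
```

The same `win_rate` helper would apply to both metrics in the list: run it once per baseline for pairwise win-rate, and once against the original `COMMENT ON` text for preference rate.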
Help close the gap
If you'd like to contribute to AMX's first benchmark — schema donations, expert reviewers in a specific domain, or co-authoring the methodology paper — open a discussion at github.com/omeryasirkucuk/amx/discussions.