Benchmarks

Status — planned

AMX has not yet published external benchmark numbers for AI-generated column descriptions. In the interest of transparency, this page explains the evaluation gap, names the protocol AMX will adopt, and lays out a contribution path so the community can help close it.

The state of the field

There is no widely accepted public benchmark for AI-generated database descriptions. The two related academic benchmarks measure something else:

Benchmark	Measures	Relevance to AMX
BIRD	Text-to-SQL execution accuracy across 12,751 questions / 95 databases	Tests querying a documented schema, not generating the documentation
Spider	Cross-domain text-to-SQL on 200 databases	Same — assumes documentation already exists

Neither evaluates "is this generated description faithful to the column's real meaning?", which is the question AMX's output has to answer. The closest existing protocol is the one published by Databricks in their 2024 AI Comments preview: 62 schemas drawn from production workloads, two human reviewers plus an LLM judge, measuring the human preference rate of generated descriptions against the prior baseline. We treat that protocol as the de facto methodology.

What AMX does report internally

Every /run already records a per-column logprob-derived confidence band:

Band	Definition
`verified`	Token-level confidence ≥ 0.95 across the entire description; cross-checked against profile + RAG + code evidence
`high_likelihood`	Confidence ≥ 0.85; majority of evidence sources agreed
`possible`	Confidence ≥ 0.50; at least one evidence source agreed
`weak_hypothesis`	Below 0.50; surfaced for review with no auto-accept

These bands are internal calibration, not public benchmarks. They help the human reviewer triage what to read first; they don't tell an external buyer how AMX compares against, say, Snowflake Cortex on the same 62 schemas.
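The band table can be read as a threshold lookup. The sketch below is illustrative only and is not AMX's implementation: the names `BAND_THRESHOLDS`, `description_confidence`, and `confidence_band` are hypothetical, and it assumes the per-description confidence is the lowest per-token probability, which is one plausible reading of "across the entire description". It also ignores the evidence cross-checks (profile, RAG, code) that the table attaches to each band.

```python
import math

# Thresholds copied from the band table, ordered from strictest to loosest.
BAND_THRESHOLDS = [
    ("verified", 0.95),
    ("high_likelihood", 0.85),
    ("possible", 0.50),
]


def description_confidence(token_logprobs):
    """Assumed aggregation: the lowest per-token probability in the description."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return min(math.exp(lp) for lp in token_logprobs)


def confidence_band(confidence):
    """Return the first band whose threshold the confidence meets."""
    for name, threshold in BAND_THRESHOLDS:
        if confidence >= threshold:
            return name
    # Anything below the lowest threshold is surfaced for review.
    return "weak_hypothesis"
```

Taking the minimum rather than the mean means a single uncertain token pulls the whole description into a lower band, which is the conservative choice for a triage signal. A different aggregation would change which band a description lands in.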

The planned evaluation

The Databricks-style protocol AMX will publish:

Dataset — 50–80 production schemas, anonymised, drawn from at least four backends (PostgreSQL, Snowflake, BigQuery, Databricks) and three domains (e.g. e-commerce, healthcare, ERP-style enterprise). Open release of the schema-and-ground-truth pairs as a benchmark dataset in its own right.
Baselines — at minimum: AMX (with gpt-4o, claude-sonnet-4, and gemini-2.0-flash), Snowflake Cortex AI_GENERATE_TABLE_DESC on its supported subset, Databricks AI Comments via Catalog Explorer, BigQuery Gemini Insights.
Reviewers — two domain experts per schema, blind to the source tool. A third tie-break reviewer when the first two disagree.
LLM judge — a fixed gpt-4o-2024-11-20 snapshot scoring each description on a 5-point rubric (faithfulness, specificity, clarity, actionability, no-hallucination).
Metrics — preference rate vs. the original COMMENT ON text (where one exists) and pairwise win-rate against each baseline.
Reproducibility — full prompts, model versions, and reviewer guidelines published alongside the numbers. Anyone with the dataset should be able to re-run the protocol against their own tool.
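The reviewer and metric steps above could be computed roughly as follows. This is a minimal sketch under stated assumptions, not the published protocol: the labels `"amx"` and `"baseline"`, and the functions `resolve_preference` and `win_rate`, are hypothetical names chosen for illustration. A "no preference" verdict, if the protocol allows one, would count as a non-win here.

```python
from collections import Counter


def resolve_preference(reviewer_a, reviewer_b, tie_break=None):
    """Return the agreed label, or the tie-break reviewer's label on disagreement."""
    if reviewer_a == reviewer_b:
        return reviewer_a
    if tie_break is None:
        raise ValueError("reviewers disagree and no tie-break vote was given")
    return tie_break


def win_rate(outcomes, candidate="amx"):
    """Fraction of comparisons won by `candidate`; None when there is no data."""
    if not outcomes:
        return None
    return Counter(outcomes)[candidate] / len(outcomes)


# Made-up votes for four descriptions: (reviewer A, reviewer B, tie-break).
votes = [
    ("amx", "amx", None),
    ("amx", "baseline", "baseline"),
    ("baseline", "baseline", None),
    ("amx", "baseline", "amx"),
]
outcomes = [resolve_preference(a, b, t) for a, b, t in votes]
print(win_rate(outcomes))  # 0.5
```

The same `win_rate` call works for both metrics named above: pass outcomes from comparisons against the original COMMENT ON text for the preference rate, or against one baseline tool for the pairwise win-rate.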

Help close the gap

If you'd like to contribute to AMX's first benchmark — schema donations, expert reviewers in a specific domain, or co-authoring the methodology paper — open a discussion at github.com/omeryasirkucuk/amx/discussions.