BigQuery¶
Configure BigQuery as an AMX backend to introspect datasets, tables, views, materialized
views, and routines, and write reviewed descriptions back via
`ALTER TABLE … SET OPTIONS(description = '…')` and column `OPTIONS`. This page walks
through registering a project profile, picking between Application Default Credentials
and a service-account JSON, and applying your first batch of generated descriptions
without an unbounded byte scan.
Prerequisites¶
- AMX installed (`pip install amx-cli`). The `google-cloud-bigquery` and `sqlalchemy-bigquery` drivers are included by default.
- A GCP project with the BigQuery API enabled.
- One of:
  - ADC (recommended for laptops / GKE / Cloud Run) — `gcloud auth application-default login` already run, OR a workload-identity binding in place.
  - Service-account JSON — a key file with `BigQuery Data Viewer` (read), `BigQuery Job User` (run jobs), and `BigQuery Metadata Editor` (write descriptions) roles on the target dataset(s).
- A default dataset is optional but recommended — without one, every command pauses for the dataset picker.
- An active LLM profile (or skip ahead with `/add-llm-profile`).
Step-by-step¶
1. Open the AMX REPL¶
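The launch command isn't shown here; assuming the entry point matches the CLI binary used elsewhere on this page (`amx doctor`), it is simply:

```
$ amx
```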
2. Add a database profile¶
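Start the connection wizard from the REPL prompt. The command name below is an assumption, mirrored from the `/add-llm-profile` command in the prerequisites — check your build's command list if it differs:

```
> /add-db-profile
```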
Pick bigquery from the backend menu:
Select database backend (engine):
...
bigquery GCP project + dataset - table/column descriptions via OPTIONS
...
> bigquery
3. Answer the connection prompts¶
GCP project ID (e.g. my-company-prod): acme-analytics-prod
Default dataset (optional, e.g. analytics): sales_curated
Service account JSON path (optional, e.g. /etc/gcp/sa.json — uses ADC if empty):
Notes on each field:
- Project ID — required. The bare project ID (no `projects/` prefix, no number).
- Default dataset — optional. Filling it in scopes `/schemas` and `/run` so commands don't pause to ask. Leave blank for cross-dataset discovery.
- Service account JSON path — optional. Leave blank to use ADC (which picks up `gcloud` user creds, a Compute Engine service account, or a workload identity automatically). Fill in the path only if you need a specific key file — useful for CI runners.
ADC vs service-account JSON — which to pick
Use ADC when AMX runs on your laptop (gcloud auth application-default login once and you're done) or inside any Google-managed runtime that has a service account attached (Cloud Run, GKE with Workload Identity, Compute Engine).
Use a service-account JSON only when AMX runs in a non-Google CI runner (GitHub Actions, GitLab self-hosted) where you need to ship the credential as a secret. Mount it as a file, set the path in the wizard, and never check it into the repo.
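Both options map onto standard `google-cloud-bigquery` client construction. A minimal sketch of what each choice resolves to (illustrative only — not AMX's actual source):

```python
from google.cloud import bigquery
from google.oauth2 import service_account

# ADC: the client walks the default chain — gcloud user creds, the attached
# service account on Cloud Run/GKE/Compute Engine, or workload identity.
client = bigquery.Client(project="acme-analytics-prod")

# Key file: what a non-empty credentials_path resolves to.
creds = service_account.Credentials.from_service_account_file("/etc/gcp/sa.json")
client = bigquery.Client(project="acme-analytics-prod", credentials=creds)
```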
4. Activate and confirm¶
> /use-db prod-bq
✓ Active DB profile → prod-bq [bigquery] acme-analytics-prod / sales_curated
> /connect
Testing prod-bq... ✓ connected (project: acme-analytics-prod, ADC: gcloud user me@acme.com, latency: 184 ms)
If you provided a service-account JSON, the line reads ADC: file /etc/gcp/sa.json (sa-amx@acme-analytics-prod.iam.gserviceaccount.com).
5. Inspect datasets and tables¶
> /schemas
dataset tables views matviews routines
sales_curated 47 12 3 8
sales_raw 124 0 0 0
finance_curated 19 6 1 2
> /tables sales_curated
dataset name kind rows size_gb comment?
sales_curated customer TABLE 2,450,118 1.4 no
sales_curated customer_address TABLE 2,450,118 0.9 no
sales_curated inventory_daily TABLE 45,318,234 28.2 yes
size_gb is the billable on-disk size, not a function of row count. AMX uses it to refuse
full-mode profiling on tables that would scan more than your max_bytes_billed setting.
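AMX's exact inventory query isn't shown, but you can reproduce these numbers yourself from the dataset's `__TABLES__` metadata view — a sketch using the Python client with the example project from above:

```python
from google.cloud import bigquery

client = bigquery.Client(project="acme-analytics-prod")
query = """
    SELECT table_id, row_count, size_bytes / POW(1024, 3) AS size_gb
    FROM `acme-analytics-prod.sales_curated.__TABLES__`
    ORDER BY size_bytes DESC
"""
for row in client.query(query).result():
    print(f"{row.table_id:<30} {row.row_count:>12,} {row.size_gb:>8.1f}")
```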
6. Pick a profiling mode that fits your byte budget¶
> /db profiling-mode sampled
✓ Active mode → sampled (TABLESAMPLE SYSTEM, 5000 rows)
> /db inspect
Active backend: bigquery (acme-analytics-prod / sales_curated)
Profiling mode: sampled
Estimated bytes per /run sweep: ~120 MB across 47 tables
Max bytes billed (configured): 10 GB
Recommended modes for BigQuery:
- `metadata` — INFORMATION_SCHEMA only. Zero bytes scanned. Use for cross-dataset inventory.
- `sampled` — `TABLESAMPLE SYSTEM`. Bills the sampled bytes (typically a few MB per table). Default for description drafting (see the sketch below).
- `full` — full table scan. Bills at the per-table size. Reserve for tables under ~1 GB or when you genuinely need exact distinct counts.
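In `sampled` mode the profile query is shaped roughly like the sketch below — the exact SQL AMX generates isn't shown, and note that BigQuery's `TABLESAMPLE SYSTEM` takes a percentage, so a fixed row target such as 5000 has to be approximated with a `LIMIT` on top:

```python
from google.cloud import bigquery

client = bigquery.Client(project="acme-analytics-prod")
query = """
    SELECT *
    FROM `acme-analytics-prod.sales_curated.customer` TABLESAMPLE SYSTEM (1 PERCENT)
    LIMIT 5000
"""
# Only the sampled storage blocks are read, so billed bytes stay small.
sample = list(client.query(query).result())
```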
7. Run and apply¶
> /run sales_curated.customer
[Profile] TABLESAMPLE SYSTEM on sales_curated.customer ... ok (1.1 s, 1.2 MB billed)
[LLM] drafting 18 column descriptions ... ok (high: 12, medium: 4, low: 2)
> /apply
Write 18 description(s) to acme-analytics-prod.sales_curated? [y/N]: y
ALTER TABLE `acme-analytics-prod.sales_curated.customer` SET OPTIONS(description='Master customer record …');
ALTER TABLE `acme-analytics-prod.sales_curated.customer`
  ALTER COLUMN c_customer_sk SET OPTIONS(description='Surrogate key …');
... (18/18 written)
✓ /apply finished. Audit trail: /history show <run-id>
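The transcript writes descriptions via DDL; outside AMX, the same writes can also go through the Python client's update APIs — a sketch, not AMX's implementation:

```python
from google.cloud import bigquery

client = bigquery.Client(project="acme-analytics-prod")
table = client.get_table("acme-analytics-prod.sales_curated.customer")

# Table-level description.
table.description = "Master customer record …"
client.update_table(table, ["description"])

# Column-level descriptions: rebuild the schema with the new field description.
table.schema = [
    bigquery.SchemaField(
        f.name, f.field_type, mode=f.mode,
        description="Surrogate key …" if f.name == "c_customer_sk" else f.description,
    )
    for f in table.schema
]
client.update_table(table, ["schema"])
```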
Sample config¶
```yaml
db_profiles:
  prod-bq:
    backend: bigquery
    project: acme-analytics-prod
    dataset: sales_curated
    credentials_path: ""            # ADC; or "/etc/gcp/sa.json" for a key file
    profiling_mode: sampled
    profiling_sample_size: 5000
    max_bytes_billed: 10737418240   # 10 GB safety net
active_db_profile: prod-bq
```
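If you hit the same project from your own scripts, the `max_bytes_billed` ceiling corresponds to the per-query `maximum_bytes_billed` knob on the standard client (the mapping to AMX's internals is an assumption):

```python
from google.cloud import bigquery

client = bigquery.Client(project="acme-analytics-prod")
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)  # 10 GB

# BigQuery fails the job rather than billing past the cap.
job = client.query(
    "SELECT * FROM `acme-analytics-prod.sales_curated.customer`",
    job_config=job_config,
)
rows = job.result()
```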
Verify¶
- `/connect` — reports the active project and the resolved ADC source. If the line reads `ADC: none` you have neither user creds nor a key file; re-run `gcloud auth application-default login` or set `credentials_path:`.
- `/db inspect` — shows the configured `max_bytes_billed` ceiling so you know the safety net is in place before the first `/run`.
- `amx doctor` — checks that the BigQuery driver loaded and that the profile is reachable (a 401/403 there usually means the IAM grants are missing).
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| `google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials.` | No ADC and no service-account path | `gcloud auth application-default login`, or set `credentials_path:` to a service-account JSON |
| `403 Access Denied: Project … User does not have bigquery.jobs.create` | Account/SA lacks `BigQuery Job User` | Grant `roles/bigquery.jobUser` on the project (or use a different SA) |
| `403 Access Denied: Table … User does not have bigquery.tables.update` | Account/SA can read but not write descriptions | Grant `roles/bigquery.metadataEditor` on the dataset |
| `Quota exceeded: Your project exceeded quota for free query bytes scanned` | Default project has the free-tier ceiling | Switch to a billed project, or set `profiling_mode: metadata` for the discovery sweep |
| `400 Cannot read non-existent column …` mid-run | Sampled rows hit a partition with a different schema | Run with `profiling_mode: metadata` first to confirm column lists per partition, then re-run on the latest partition explicitly |
| `Bytes billed exceeded the maximum (10737418240)` | A full-mode `/run` hit your `max_bytes_billed` ceiling | Switch the table to `sampled`, or raise the ceiling for the duration of the sweep |
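For the two 403 rows, the project-level grant is typically a single command, e.g. `gcloud projects add-iam-policy-binding acme-analytics-prod --member="serviceAccount:sa-amx@acme-analytics-prod.iam.gserviceaccount.com" --role="roles/bigquery.jobUser"`; dataset-level grants such as `roles/bigquery.metadataEditor` are applied from the dataset's sharing panel or the `bq` CLI.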
What's next¶
- Profiling modes — `metadata` for inventory, `sampled` for description drafting, `full` only when you need exact stats.
- Batch mode for LLM calls — pair with BigQuery `metadata` mode for dirt-cheap whole-warehouse description sweeps.
- Run & Apply — the review wizard, including how `/apply` handles partial failures.