BigQuery¶
Configure BigQuery as an AMX backend to introspect datasets, tables, views, materialized
views, and routines, and write reviewed descriptions back via
ALTER TABLE … SET OPTIONS(description = '…') and column OPTIONS. This page walks
through registering a project profile, picking between Application Default Credentials
and a service-account JSON, and applying your first batch of generated descriptions
without an unbounded byte scan.
Prerequisites¶
- AMX installed (
pip install amx-cli). Thegoogle-cloud-bigqueryandsqlalchemy-bigquerydrivers are installed on first use (needs network); only DuckDB ships with the base install. - A GCP project with the BigQuery API enabled.
- One of:
- ADC (recommended for laptops / GKE / Cloud Run) —
gcloud auth application-default loginalready run, OR a workload-identity binding in place. - Service-account JSON — a key file with
BigQuery Data Viewer(read),BigQuery Job User(run jobs), andBigQuery Metadata Editor(write descriptions) roles on the target dataset(s).
- ADC (recommended for laptops / GKE / Cloud Run) —
- A default dataset is optional but recommended — without one, every command pauses for the dataset picker.
- An active LLM profile (or skip ahead with
/add-llm-profile).
Step-by-step¶
1. Open the AMX REPL¶
2. Add a database profile¶
Pick bigquery from the backend menu:
Select database backend (engine):
...
bigquery GCP project + dataset - table/column descriptions via OPTIONS
...
> bigquery
3. Answer the connection prompts¶
GCP project ID (e.g. my-company-prod): acme-analytics-prod
Default dataset (optional, e.g. analytics): sales_curated
Service account JSON path (optional, e.g. /etc/gcp/sa.json — uses ADC if empty):
Notes on each field:
- Project ID — required. The bare project ID (no
projects/prefix, no number). - Default dataset — optional. Filling it in scopes
/schemasand/runso commands don't pause to ask. Leave blank for cross-dataset discovery. - Service account JSON path — optional. Leave blank to use ADC (which picks up
gclouduser creds, a Compute Engine service account, or a workload identity automatically). Fill in the path only if you need a specific key file — useful for CI runners.
ADC vs service-account JSON — which to pick
Use ADC when AMX runs on your laptop (gcloud auth application-default login once and you're done) or inside any Google-managed runtime that has a service account attached (Cloud Run, GKE with Workload Identity, Compute Engine).
Use a service-account JSON only when AMX runs in a non-Google CI runner (GitHub Actions, GitLab self-hosted) where you need to ship the credential as a secret. Mount it as a file, set the path in the wizard, and never check it into the repo.
4. Activate and confirm¶
> /use-db prod-bq
✓ Active DB profile → prod-bq [bigquery] acme-analytics-prod / sales_curated
> /connect
Testing prod-bq... ✓ connected (project: acme-analytics-prod, ADC: gcloud user me@acme.com, latency: 184 ms)
If you provided a service-account JSON, the line reads ADC: file /etc/gcp/sa.json (sa-amx@acme-analytics-prod.iam.gserviceaccount.com).
5. Inspect datasets and tables¶
> /schemas
dataset tables views matviews routines
sales_curated 47 12 3 8
sales_raw 124 0 0 0
finance_curated 19 6 1 2
> /tables sales_curated
dataset name kind rows size_gb comment?
sales_curated customer TABLE 2,450,118 1.4 no
sales_curated customer_address TABLE 2,450,118 0.9 no
sales_curated inventory_daily TABLE 45,318,234 28.2 yes
size_gb is the on-disk billable size, not row count. Use it to decide which tables are
small enough for a full-mode scan and which should stay on sampled or metadata.
6. Pick a profiling mode that fits your byte budget¶
> /profiling sampled
✓ Active mode → sampled (TABLESAMPLE SYSTEM, 5000 rows)
> /inspect
Active backend: bigquery (acme-analytics-prod / sales_curated)
Profiling mode: sampled
Estimated bytes per /run sweep: ~120 MB across 47 tables
Recommended modes for BigQuery:
metadata— INFORMATION_SCHEMA only. Zero bytes scanned. Use for cross-dataset inventory.sampled—TABLESAMPLE SYSTEM. Bills the sampled bytes (typically a few MB per table). Default for description drafting.full— full table scan. Bill at the per-table size. Reserve for tables under ~1 GB or when you genuinely need exact distinct counts.
7. Run and apply¶
> /run sales_curated.customer
[Profile] TABLESAMPLE SYSTEM on sales_curated.customer ... ok (1.1 s, 1.2 MB billed)
[LLM] drafting 18 column descriptions ... ok (high: 12, medium: 4, low: 2)
> /apply
Write 18 description(s) to acme-analytics-prod.sales_curated? [y/N]: y
ALTER TABLE acme-analytics-prod.sales_curated.customer SET OPTIONS(description='Master customer record …');
ALTER TABLE acme-analytics-prod.sales_curated.customer
ALTER COLUMN c_customer_sk SET OPTIONS(description='Surrogate key …');
... (18/18 written)
✓ /apply finished. Audit trail: /history show <run-id>
Sample config¶
db_profiles:
prod-bq:
backend: bigquery
project: acme-analytics-prod
dataset: sales_curated
credentials_path: "" # ADC; or "/etc/gcp/sa.json" for a key file
profiling_mode: sampled
profiling_sample_size: 5000
active_db_profile: prod-bq
Verify¶
> /connect— reports the active project and the resolved ADC source. If the line readsADC: noneyou have neither user creds nor a key file; re-rungcloud auth application-default loginor setcredentials_path:.> /inspect— shows the active profiling mode and the estimated bytes per/runsweep so you know roughly how much will be scanned before the first/run.> /doctor— checks the BigQuery driver loaded and the profile is reachable (a 401/403 there usually means the IAM grants are missing).
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. |
No ADC and no service-account path | gcloud auth application-default login OR set credentials_path: to a service-account JSON |
403 Access Denied: Project … User does not have bigquery.jobs.create |
Account/SA lacks BigQuery Job User |
Grant roles/bigquery.jobUser on the project (or use a different SA) |
403 Access Denied: Table … User does not have bigquery.tables.update |
Account/SA can read but not write descriptions | Grant roles/bigquery.metadataEditor on the dataset |
Quota exceeded: Your project exceeded quota for free query bytes scanned |
Default project has the free-tier ceiling | Switch to a billed project, or set profiling_mode: metadata for the discovery sweep |
400 Cannot read non-existent column … mid-run |
Sampled rows from a partition with a different schema | Run with profiling_mode: metadata first to confirm column lists per partition; then re-run on the latest partition explicitly |
A full-mode /run scans far more bytes than expected |
Large table profiled in full mode |
Switch the table to sampled or metadata with /profiling before the sweep |