
Documents

The RAG Agent reads your documentation — PDFs, Word docs, Markdown handbooks, Excel data dictionaries — and feeds the relevant chunks to the LLM as evidence when drafting column descriptions. A column whose name is a cryptic abbreviation suddenly becomes documentable when the team's onboarding PDF defines what the abbreviation means. This page walks through registering a doc profile, ingesting files into the RAG index, searching the index, and recovering when results come back empty.

Prerequisites

  • AMX installed.
  • A folder, file, or list of files you want indexed (PDF / DOCX / MD / TXT / XLSX supported).
  • An active LLM profile (used for the embedding step, then for the answer step in /ask).

Step-by-step

1. Register a document profile

> /add-doc-profile
Profile name: data-handbook
Paths (one per line, blank to finish):
  /Users/me/Documents/data-platform-handbook.pdf
  /Users/me/Documents/data-warehouse-runbook.docx
  /Users/me/internal-docs/data-glossary/
[blank to finish]:
✓ Registered doc profile 'data-handbook' (3 source paths)

Folders are walked recursively for the supported extensions. Mix files and folders freely; AMX deduplicates internally.
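The discovery behaviour described above can be pictured with a short Python sketch. It is an illustration under assumptions, not AMX's actual code: the function name `discover_files`, the case-insensitive extension match, and de-duplication by resolved path are all hypothetical choices.

```python
from pathlib import Path

# Extensions taken from the Prerequisites list; how AMX matches them is an assumption.
SUPPORTED = {".pdf", ".docx", ".md", ".txt", ".xlsx"}

def discover_files(paths):
    """Expand a mix of files and folders into a sorted, de-duplicated file list."""
    seen = set()
    for raw in paths:
        p = Path(raw).expanduser()
        # Folders are walked recursively; a plain file is its own only candidate.
        candidates = p.rglob("*") if p.is_dir() else [p]
        for c in candidates:
            if c.is_file() and c.suffix.lower() in SUPPORTED:
                # resolve() collapses duplicates reached through overlapping paths.
                seen.add(c.resolve())
    return sorted(seen)
```

Because the result is keyed on the resolved path, listing a folder and one of its files in the same profile yields each file once.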

2. Ingest the files

> /index
[1/4] Walking sources ...........................  ok (3 paths, 47 files discovered)
[2/4] Extracting text ...........................  ok (1.4M chars across 47 files)
[3/4] Chunking ..................................  ok (412 chunks, 800-1200 tokens each)
[4/4] Embedding (openai/text-embedding-3-small)..  ok (412 chunks, $0.012, 6.2 s)
✓ /index finished. Profile 'data-handbook' is ready for /ask and /run RAG.

Chunks are stored in the same Chroma store as the database catalog (see Search catalog for the full data flow). Each chunk keeps its source path + page number so citations resolve to the original file.
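As a rough mental model of the per-chunk metadata, the sketch below pairs each chunk's text with its source path and optional page number and renders a citation from them. The `Chunk` class and `citation` helper are hypothetical names; the real storage schema is not documented here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Chunk:
    text: str
    source_path: str
    page: Optional[int] = None  # None for formats without page numbers

def citation(chunk):
    """Render a short citation: file name, plus page number when one exists."""
    name = chunk.source_path.rstrip("/").split("/")[-1]
    if chunk.page is not None:
        return f"{name}:p.{chunk.page}"
    return name

c = Chunk("Status values 1-7 map to ...",
          "/Users/me/Documents/data-platform-handbook.pdf", 34)
label = citation(c)  # "data-platform-handbook.pdf:p.34"
```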

3. Inspect what got chunked

> /index
Profile: data-handbook (active)
3 source paths · 47 files · 412 chunks · 1.2 MB embeddings

Chunks by source:
  data-platform-handbook.pdf       142 chunks (pp. 1–47)
  data-warehouse-runbook.docx       89 chunks
  data-glossary/                   181 chunks across 28 files

Top topics (from chunk titles):
  - "Customer Master — definition" — 12 chunks
  - "Order lifecycle" — 18 chunks
  - "Revenue recognition policy" — 9 chunks

If a topic you'd expect is missing, the file probably wasn't picked up — see "Empty search results" below.

4. Search the index directly

> /search-docs "what does x_legacy_status mean"
Top-5 results (hybrid score · vector + lexical, reranker + MMR applied):
  0.184  data-platform-handbook.pdf:p.34   "Legacy status mapping"
         "Status values 1–7 map to the v3 system's customer state machine. 1=active,
          2=pending, 3=frozen, 4=dormant, 5=closed, 6=fraud, 7=migration-pending. The
          mapping is preserved in marts/customer.sql for backward compat..."

  0.224  data-glossary/customer.md          "Status flags"
         "x_legacy_status (deprecated since v4) — see legacy mapping in handbook §4.2."

  ...

/search-docs is the lower-level command — /ask runs the same retrieval and then asks the LLM to answer using only the retrieved chunks.
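The "answer using only the retrieved chunks" step can be sketched as prompt assembly. This is a minimal illustration, assuming retrieval returns (citation, text) pairs; `build_prompt` and the instruction wording are hypothetical and not taken from AMX.

```python
def build_prompt(question, chunks):
    """Build a grounded prompt from (citation, text) pairs returned by retrieval."""
    evidence = "\n\n".join(
        f"[{i}] {cite}\n{text}"
        for i, (cite, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered excerpts below. "
        "If they do not contain the answer, say so.\n\n"
        f"{evidence}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "what does x_legacy_status mean",
    [("data-platform-handbook.pdf:p.34", "Status values 1-7 map to ...")],
)
```

Numbering the excerpts gives the model something stable to cite, which is what lets an answer point back to a file and page.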

Hybrid retrieval pipeline

Document retrieval runs in four stages rather than as a pure vector search:

  1. Lexical recall. An FTS5 sidecar maintained next to the Chroma index returns the top-K matches for the raw query string. This catches exact-string hits — model names, error codes, identifiers — that an embedding can miss.
  2. Vector recall. The same query is embedded and the top-K semantic matches come back from Chroma. This catches concept hits — synonyms, paraphrases.
  3. RRF fusion. Reciprocal Rank Fusion merges the two ranked lists into one. RRF is rank-only — it doesn't care about the underlying scores, so the FTS5 and vector scales never need to be reconciled.
  4. Cross-encoder rerank (opt-in). A small cross-encoder model re-scores the top of the fused list using query + chunk together. Slower but materially more accurate; enable it in config with rerank: cross-encoder. The reranker output then passes through MMR diversity reordering so the final list does not return five chunks of the same paragraph from the same file.
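Stages 3 and 4 can be sketched in a few lines of Python. The sketch below is illustrative only: `k=60` is the constant commonly used for RRF and may not be the value AMX uses, and the `mmr` function, its `lam` weight, and the caller-supplied `similarity` function are assumptions made for the example.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

def mmr(candidates, relevance, similarity, lam=0.7, top_n=5):
    """Maximal Marginal Relevance: trade relevance against redundancy with picks so far."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < top_n:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

lexical = ["c34", "g1", "c7"]   # FTS5 order (hypothetical chunk ids)
vector = ["g1", "c9", "c34"]    # Chroma order
fused = rrf_fuse([lexical, vector])  # ["g1", "c34", "c9", "c7"]

# "a" and "b" stand for near-duplicate chunks of the same paragraph.
rel = {"a": 1.0, "b": 0.95, "c": 0.6}
sim = lambda x, y: 1.0 if {x, y} == {"a", "b"} else 0.0
diverse = mmr(["a", "b", "c"], rel, sim)  # ["a", "c", "b"]
```

Only ranks enter the fusion, which is why the lexical and vector scores never need a common scale. In the MMR example the near-duplicate "b" drops below the less relevant but distinct "c".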

The chunker itself is format-dispatching: Markdown is split on header boundaries so chunks fall on natural sections rather than arbitrary character offsets; PDFs and DOCX fall back to a token-window chunker. A gold-set evaluation runs on every retrieval-config change and a CI gate blocks regressions on recall@5 and rerank quality.
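Header-boundary splitting for Markdown might look like the sketch below. It is a simplified stand-in for the built-in chunker: it recognises only ATX headers (`#` through `######`), does not skip `#` lines inside fenced code blocks, and applies no size limit; the name `split_markdown` is hypothetical.

```python
import re

# ATX headers only; Setext headers and fenced code blocks are not handled here.
HEADER = re.compile(r"^#{1,6}\s+(.*)$")

def split_markdown(text):
    """Return (title, body) pairs, one per header-delimited section."""
    sections = []
    title, lines = "", []

    def flush():
        # Emit a section if it has a title or any non-blank body text.
        if title or any(l.strip() for l in lines):
            sections.append((title, "\n".join(lines).strip()))

    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            flush()
            title, lines = m.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return sections

parts = split_markdown("intro\n# A\nalpha\n## B\nbeta\n")
# [("", "intro"), ("A", "alpha"), ("B", "beta")]
```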

5. Run with doc evidence

> /run sales.customer
[Profile] sampled scan on sales.customer ... ok
[RAG]     6 of 18 columns matched documentation chunks; embedding ... ok
[Code]    no code profile active — skipping
[LLM]     drafting 18 column descriptions with doc evidence ... ok
          confidence: high 17 · medium 1 · low 0

Compare to the same /run without the doc profile active — columns whose names match glossary entries jump straight to high confidence.

Empty search results — the recovery path

> /search-docs "retention policy"
No results for 'retention policy'.

Possible reasons (in order of likelihood):
  1. The phrase isn't in any indexed chunk. Try a broader query or check /index.
  2. The relevant file isn't indexed. /index lists files; add the missing path with /add-doc-profile.
  3. Embedding-model mismatch since last /index. Re-run /index.

For each case:

  • Case 1: rephrase. Vector recall matches concepts rather than exact strings, but a very abstract query still needs at least some lexical anchor.
  • Case 2: /index lists every indexed file. If your retention policy lives in a folder that isn't part of the profile, /add-doc-profile extends the profile and /index picks up the new files (only the new ones; indexing is incremental).
  • Case 3: when you change the embedding model, the existing index can't be queried with the new model. Re-run /index to re-embed everything.
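The model-mismatch case comes down to comparing the model that built the index with the model that is currently active. The sketch below assumes the index stores the model name under a hypothetical `embedding_model` metadata key; how AMX actually records or detects this is not documented here.

```python
def index_is_stale(index_meta, active_model):
    """True when the index was built with a different (or unknown) embedding model."""
    return index_meta.get("embedding_model") != active_model

meta = {"embedding_model": "text-embedding-3-small"}  # hypothetical metadata
stale = index_is_stale(meta, "text-embedding-3-large")  # True: re-run /index
```

Vectors from different embedding models live in different spaces, so similarity between a new-model query and old-model chunks is not meaningful even where the dimensions happen to match.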

Sample config

doc_profiles:
  data-handbook:
    paths:
      - /Users/me/Documents/data-platform-handbook.pdf
      - /Users/me/Documents/data-warehouse-runbook.docx
      - /Users/me/internal-docs/data-glossary/
active_doc_profile: data-handbook

Chunking and file-extension handling are built in; there are no per-profile chunk_size, chunk_overlap, or extensions keys.

Verify

  1. > /index: confirms the file and chunk counts and the top topics.
  2. > /search-docs "<known phrase>": confirms retrieval works for a phrase you definitely indexed.
  3. > /run <table>: the [RAG] log lines confirm the RAG agent ran and contributed evidence.

Troubleshooting

Symptom | Cause | Fix
/index skips PDFs silently | pypdf (or the chosen PDF lib) couldn't parse the file (encrypted or scanned-image PDF) | OCR the PDF first, then re-run /index to see the per-file results
/search-docs returns garbage on a specific phrase | Embedding model is too small for technical terms | Switch to text-embedding-3-large and re-run /index
Citations point to the wrong page in a PDF | PDF has unusual page numbering (front matter unnumbered) | Cosmetic: citations use the absolute page number; cross-check against the file's TOC
/index is slow on every run | Source folder is on a network mount with high stat() latency | Move sources locally or to a faster mount; AMX checks every file's mtime to decide what to re-ingest
Out-of-disk after several rebuilds | Chroma index isn't garbage-collected on incremental rebuilds | Re-run /index (drops the old index) or rm -rf ~/.amx/chroma and re-run /index
/ask answer doesn't use the docs even though they're indexed | Active doc profile isn't the one ingested | > /use-doc data-handbook then > /ask …
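The network-mount symptom follows from how mtime-based change detection works: every source file needs one stat() call per run, even when nothing changed. A minimal sketch, assuming a hypothetical manifest that maps each path to the mtime recorded at its last ingest:

```python
import os

def files_to_reingest(paths, manifest):
    """Return the paths whose mtime differs from the one recorded at last ingest."""
    changed = []
    for path in paths:
        mtime = os.stat(path).st_mtime_ns  # one stat() call per file, every run
        if manifest.get(path) != mtime:
            changed.append(path)
    return changed
```

The cost is linear in the number of files rather than in the number of changes, so per-call latency on a slow mount dominates the run.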