RAG architecture

AMX retrieves grounding evidence through three independent pipelines — Document RAG, Code RAG, and Catalog Search — that share a common four-layer shape: embed, chunk, retrieve, assemble.

Each pipeline answers a different question:

| Pipeline | Backing collection | What it answers |
| --- | --- | --- |
| Document RAG | `amx_docs` | "What do my ingested docs say about this column / report / concept?" |
| Code RAG | `amx_code` | "Where in the codebase is this column written, read, or transformed?" |
| Catalog Search | `amx_search_<profile>` | "Which tables / columns match this question?" Backs `/ask` and `/search`. |

The pipelines never see each other's chunks. They run in parallel and their results meet at the orchestrator (see Architecture).
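The fan-out described above can be pictured with a minimal sketch. The three pipeline functions and `gather_evidence` are hypothetical stand-ins, not AMX's orchestrator API; the only point illustrated is that each pipeline runs concurrently against its own collection and results are merged afterwards, keyed by pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: each pipeline only ever touches its own collection.
def document_rag(question: str) -> list[str]:
    return [f"amx_docs hit for: {question}"]

def code_rag(question: str) -> list[str]:
    return [f"amx_code hit for: {question}"]

def catalog_search(question: str) -> list[str]:
    return [f"amx_search hit for: {question}"]

PIPELINES = {"docs": document_rag, "code": code_rag, "catalog": catalog_search}

def gather_evidence(question: str) -> dict[str, list[str]]:
    """Run every pipeline concurrently; results stay separated by pipeline name."""
    with ThreadPoolExecutor(max_workers=len(PIPELINES)) as pool:
        futures = {name: pool.submit(fn, question) for name, fn in PIPELINES.items()}
        # Results only meet here, after each pipeline has finished on its own.
        return {name: fut.result() for name, fut in futures.items()}
```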

The four layers

flowchart LR
    Q[User question] --> Embed
    subgraph Pipeline
        direction LR
        Embed[Embedding layer] --> Chunk[Chunking]
        Chunk --> Retrieve[Retrieval + rerank]
        Retrieve --> Assemble[Prompt assembly]
    end
    Assemble --> LLM[LLM call]

1. Embedding layer

The embedder is pluggable per pipeline via /embeddings. Document RAG and Code RAG carry independent providers: cfg.embedding_docs drives docs ingestion and cfg.embedding_code drives Code RAG, so a code-specialised encoder can power code retrieval while a prose-tuned model serves the documentation side. Three provider kinds are supported:

| Kind | When to use |
| --- | --- |
| `minilm` | Document RAG default; offline, fast, 384-dim, good baseline English retrieval. |
| `openai_compatible` | Any OpenAI-style `/embeddings` endpoint (OpenAI, Azure, vLLM, llama.cpp). Best quality for general English prose. |
| `sentence_transformers` | Local HuggingFace model. Used by the code-specialised default below and for any other custom embedder. |

Switch a side from the CLI with /embeddings docs <kind> or /embeddings code <kind>, or from Studio → Settings → Embeddings.

Defaults per pipeline:

Document RAG — MiniLM-L6-v2 (384-dim), bundled, zero-config.
Code RAG — jinaai/jina-embeddings-v2-base-code (768-dim, ~161 MB, code-trained) when sentence-transformers is installed; falls back to MiniLM when it isn't. Identifier-heavy, snake_case, and CamelCase queries are measurably better on the code-trained encoder. Install the extra to opt in:
```
pip install "amx-cli[local-embeddings]"
```
Users without the extra get MiniLM plus a one-time WARNING in the log on first /code search naming the install command.
Catalog Search — same as Document RAG (reads cfg.embedding_docs).

Each collection records its embedding_provider, embedding_model, and embedding_dim in Chroma metadata on creation. Reopening with a different identity raises EmbeddingProviderMismatch (Document RAG), CodeEmbeddingMismatch (Code RAG), or CollectionIdentityMismatch (Catalog Search) — never silent re-embedding. Recovery commands per pipeline: /docs reindex, /code-refresh, /search rebuild.
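The identity check can be sketched as follows. This is an illustration under assumptions, not AMX's code: a plain dict stands in for the Chroma collection metadata, and `open_collection` is a hypothetical helper. Only the three metadata keys and the exception name come from the text above.

```python
class EmbeddingProviderMismatch(RuntimeError):
    """Raised when a collection is reopened with a different embedder."""

IDENTITY_KEYS = ("embedding_provider", "embedding_model", "embedding_dim")

def open_collection(metadata: dict, identity: dict) -> dict:
    """Stamp a new collection with `identity`, or verify an existing one.

    `metadata` is a plain dict standing in for stored collection metadata.
    """
    wanted = {key: identity[key] for key in IDENTITY_KEYS}
    stored = {key: metadata[key] for key in IDENTITY_KEYS if key in metadata}
    if not stored:
        # First open: record the identity at creation time.
        metadata.update(wanted)
        return metadata
    if stored != wanted:
        # Refuse to continue instead of silently re-embedding.
        raise EmbeddingProviderMismatch(
            f"collection was built with {stored} but opened with {wanted}; "
            "run /docs reindex to rebuild it"
        )
    return metadata
```

Failing loudly keeps vectors from two different embedding spaces out of the same collection, where cosine distances between them would be meaningless.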

2. Chunking

Document RAG dispatches by file extension via amx.docs.splitters.get_splitter:

| Extension | Splitter | Notes |
| --- | --- | --- |
| `.md` / `.markdown` | Markdown-header-aware (two-stage) | Splits by `#`/`##`/`###` headers; records heading path on each chunk's `h1`/`h2`/`h3` metadata. Long sections are further chunked to fit the budget; header metadata propagates onto every sub-chunk. The heading line stays in the chunk body so the LLM sees the structure too. |
| `.txt`, `.pdf`, `.csv`, `.docx`, `.html`, `.py`, ... | `RecursiveCharacterTextSplitter` | Default. Structural separator hierarchy `["\n\n", "\n", ". ", " ", ""]` (paragraph → line → sentence → word → character). 1000 chars per chunk, 200-char overlap. |
| Unknown extension | Default (fallback) | Never raises `KeyError`. |

The header metadata is the channel prompt assembly uses for citation strings ("orders.md → h2: total_amount") — see Prompt assembly below. Chunks from non-Markdown extensions never have h1/h2/h3 keys; downstream code that reads them treats absence as "no structural hint available."
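A rough sketch of the dispatch and of the first Markdown stage, under stated assumptions: the real `amx.docs.splitters.get_splitter` signature is not shown in this page, so the one below is hypothetical; `split_default` is a simplified fixed-window stand-in for the recursive splitter (no separator hierarchy); and the second stage that sub-chunks long sections is omitted.

```python
import re
from pathlib import Path

HEADER = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")

def split_markdown(text: str) -> list[dict]:
    """Stage one only: cut at #/##/### headers and record the heading path."""
    chunks: list[dict] = []
    path: dict[str, str] = {}
    body: list[str] = []

    def flush() -> None:
        if any(line.strip() for line in body):
            chunks.append({"text": "\n".join(body).strip(), "metadata": dict(path)})

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            body.clear()
            level = len(match.group(1))
            path[f"h{level}"] = match.group(2)
            # A new heading invalidates the deeper levels of the old path.
            for deeper in range(level + 1, 4):
                path.pop(f"h{deeper}", None)
        body.append(line)  # the heading line stays in the chunk body
    flush()
    return chunks

def split_default(text: str, size: int = 1000, overlap: int = 200) -> list[dict]:
    """Simplified stand-in: fixed windows with overlap, no h1/h2/h3 keys."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append({"text": text[start:start + size], "metadata": {}})
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks

SPLITTERS = {".md": split_markdown, ".markdown": split_markdown}

def get_splitter(filename: str):
    # dict.get with a default means an unknown extension never raises KeyError.
    return SPLITTERS.get(Path(filename).suffix.lower(), split_default)
```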

Code RAG is AST-aware for Python: one chunk per function or class, with start_line and end_line preserved so citations point at the exact lines. Jupyter notebooks chunk one cell at a time. Other source languages fall back to a 4000-character recursive splitter.
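AST-aware chunking of this kind can be approximated with the standard-library `ast` module. The sketch below is illustrative only: `chunk_python` is a hypothetical name, and it assumes that only top-level functions and classes become chunks (methods stay inside their class chunk), which this page does not confirm.

```python
import ast

def chunk_python(source: str) -> list[dict]:
    """One chunk per top-level function or class, with 1-based line spans."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available on Python 3.8+
            chunks.append({
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks
```

Because the span comes from the parser, a citation built from `start_line` and `end_line` always covers a syntactically complete definition.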

Catalog Search does not chunk in the document sense — each catalog entity (a table or column) is its own "chunk" with structured metadata.

3. Retrieval + rerank

| Pipeline | Vector | Lexical | Fusion | Rerank | Diversity |
| --- | --- | --- | --- | --- | --- |
| Document RAG | Chroma cosine, top-k over-fetched to `max(k, min(4k, 40))` | SQLite FTS5 (BM25), Porter unicode61 tokeniser, same top-k pool | Reciprocal Rank Fusion (k=60) over the two channels | Heuristic (default): `distance + token_overlap + explanatory_terms − header_penalty`. Opt-in: cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2` ~80 MB, English) or `BAAI/bge-reranker-v2-m3` (~568 MB, multilingual) when `cfg.docs.rerank.kind` is set. | MMR (λ=0.7) over the reranked pool, demotes near-duplicate chunks |
| Code RAG | Chroma cosine, top-k over-fetched | Identifier-token overlap | Additive weighted | `distance + 2.5 × keyword_overlap` | none |
| Catalog Search | Chroma cosine per profile | SQLite FTS5 (BM25) | Additive weighted | Hybrid + source-kind weighting (manual ≫ reviewed ≫ generated) + confidence bonus | none |
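Two entries in the Document RAG row lend themselves to a short sketch: the over-fetch pool size and the MMR diversity pass. The code below is a generic textbook MMR, not AMX's implementation; the function names, the candidate tuple layout, and the assumption that relevance scores sit on roughly the same 0 to 1 scale as cosine similarity are all illustrative.

```python
import math

def overfetch_size(k: int) -> int:
    """Candidate pool size from the table: max(k, min(4k, 40))."""
    return max(k, min(4 * k, 40))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(candidates: list[tuple[str, float, list[float]]], k: int, lam: float = 0.7) -> list[str]:
    """Greedy Maximal Marginal Relevance.

    candidates: (chunk_id, relevance, embedding) tuples in any order.
    """
    remaining = list(candidates)
    selected: list[tuple[str, float, list[float]]] = []

    def score(cand: tuple[str, float, list[float]]) -> float:
        # Penalise similarity to the closest chunk already selected.
        redundancy = max((cosine(cand[2], s[2]) for s in selected), default=0.0)
        return lam * cand[1] - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _, _ in selected]
```

With λ=0.7 relevance still dominates, so a near-duplicate is demoted rather than banned: it can still be picked once no more distinct candidates outscore it.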

The opt-in cross-encoder rerank requires pip install "amx-cli[local-embeddings]". It replaces the heuristic when active; on any failure (extra not installed, model download blocked, prediction error) it logs a structured warning and returns the heuristic's ordering unchanged. The cross-encoder is a quality upgrade, never a single point of failure.
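The fail-open behaviour described above follows a common pattern, sketched here with hypothetical names (`rerank_with_fallback`, the logger name, and the callable signatures are assumptions, not AMX's API). The rerankers are passed in as plain callables so the sketch does not depend on any model being installed.

```python
import logging
from typing import Callable, Optional, Sequence

log = logging.getLogger("amx.rerank.sketch")  # hypothetical logger name

Reranker = Callable[[str, Sequence[str]], list]

def rerank_with_fallback(
    query: str,
    chunks: Sequence[str],
    heuristic: Reranker,
    cross_encoder: Optional[Reranker] = None,
) -> list:
    """Use the cross-encoder when configured; on any failure keep the heuristic order."""
    if cross_encoder is None:
        return heuristic(query, chunks)
    try:
        return cross_encoder(query, chunks)
    except Exception as exc:  # missing extra, blocked download, prediction error
        log.warning("cross-encoder rerank failed, using heuristic order: %s", exc)
        return heuristic(query, chunks)
```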

For Document RAG specifically: every Chroma upsert mirrors the same chunk into a SQLite FTS5 sidecar at <persist_dir>/docs_fts.sqlite. Returning users get hybrid retrieval on next RAGStore open via a one-time backfill that seeds the FTS table from existing Chroma chunks; no manual reindex required. Queries that produce no alphanumeric tokens (or hit a sidecar error) fall back to vector-only — backward-compatible.
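As an illustration of how the two channels might be fused, here is a minimal Reciprocal Rank Fusion sketch with the vector-only fallback. `hybrid_search` and the two search callables are hypothetical; the ASCII-only token pattern is an assumption that fits an English corpus, and the real tokenisation may differ.

```python
import re
from typing import Callable

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def hybrid_search(
    query: str,
    vector_search: Callable[[str], list[str]],
    lexical_search: Callable[[list[str]], list[str]],
) -> list[str]:
    vector_hits = vector_search(query)
    tokens = re.findall(r"[A-Za-z0-9]+", query)
    if not tokens:
        return vector_hits  # nothing to send to the lexical channel
    try:
        lexical_hits = lexical_search(tokens)
    except Exception:
        return vector_hits  # sidecar error: degrade to vector-only
    return rrf([vector_hits, lexical_hits])
```

RRF uses only ranks, so cosine distances and BM25 scores never need to be normalised onto a common scale.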

The Catalog Search path also runs a two-pass LLM query planning step that classifies the question (question_class), surfaces entity hints, and may translate a non-English question into English search queries before retrieval.

4. Prompt assembly

The RAG Agent assembles the retrieved chunks into the user message sent to the LLM through three steps:

Truncate + edges-first reorder. Chunks arrive from rerank in descending relevance. assemble_chunks(chunks, k) takes the top-k and reorders them so the highest-relevance chunks anchor both ends of the prompt ([c1, c3, c5, c6, c4, c2] for k=6), leaving the lowest-ranked of the kept chunks in the middle, where model attention is weakest. This counters the "Lost in the Middle" failure documented in Liu et al. (2023). k is rag_max_chunks, set by the prompt-detail preset (5 / 8 / 12 / 15).
Citation header per chunk. Every chunk body gets a one-line prefix:
```
[orders.md | section=total_amount] (rel=1.34)
<chunk body>
```
The header reads source.basename + h2/h3/h1 from chunk metadata (produced by the Markdown-aware splitter) and the rerank score. Plain-text chunks degrade gracefully to [plain.txt]. The header gives the LLM a scannable summary even when attention is weakest mid-prompt and gives downstream citation extraction a stable channel to mine.
Per-model input budget. The input window is the smaller of litellm.get_model_info(model)["max_input_tokens"] - output - 256 (real per-provider window) and the legacy max(1000, max_tokens * 3) heuristic (used as a fallback for unknown / proxy models). This stops AMX from over-stuffing a small-context Mistral or under-using a 200k-token Claude.
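The first two steps can be sketched as below. The name `assemble_chunks(chunks, k)` comes from the text, but the body is a guess that merely reproduces the documented ordering. `citation_header`, the `score` key, the h2 → h3 → h1 preference order, and omitting `(rel=...)` when no score is present are all assumptions made for illustration.

```python
from pathlib import Path

def assemble_chunks(chunks: list[dict], k: int) -> list[dict]:
    """Keep the top-k, then alternate them between the front and the back."""
    top = chunks[:k]
    front = top[0::2]   # ranks 1, 3, 5, ... fill the start of the prompt
    back = top[1::2]    # ranks 2, 4, 6, ... fill the end, best one last
    return front + back[::-1]

def citation_header(chunk: dict) -> str:
    """One-line prefix such as '[orders.md | section=total_amount] (rel=1.34)'."""
    meta = chunk.get("metadata", {})
    label = Path(meta.get("source", "unknown")).name
    # Assumed preference order when several heading levels are present.
    section = next((meta[key] for key in ("h2", "h3", "h1") if meta.get(key)), None)
    if section:
        label += f" | section={section}"
    header = f"[{label}]"
    if "score" in chunk:
        header += f" (rel={chunk['score']:.2f})"
    return header
```

For k=6 the slices give `[c1, c3, c5]` and `[c6, c4, c2]`, so the two best chunks sit at the very start and the very end of the prompt.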

Per-chunk extractive compaction (the first ~1200 and last ~300 characters of each chunk) runs after these steps whenever the assembled context still exceeds the budget.

Defaults at a glance

| Knob | Default | Where to change |
| --- | --- | --- |
| Docs RAG embedder (`cfg.embedding_docs`) | `minilm-l6-v2` | `/embeddings docs <kind>` |
| Code RAG embedder (`cfg.embedding_code`) | `jinaai/jina-embeddings-v2-base-code` if `amx-cli[local-embeddings]` is installed; else `minilm-l6-v2` (with one-time warning) | install the extra, or `/embeddings code <kind>` |
| Catalog Search embedder | reads `cfg.embedding_docs` | `/embeddings docs <kind>` |
| Docs chunk size / overlap | 1000 chars / 200 chars | hardcoded today; configurable in the chunking roadmap |
| Code chunk strategy | AST for Python, 4000-char for other code | hardcoded |
| Top-k retrieved | 5 | `/docs search --results N` for ad-hoc; preset-driven for `/run` |
| Chunks fed to LLM | 8 (default preset) | prompt-detail preset |

Tuning recommendations

English prose corpora: default MiniLM is fine. For higher quality without leaving offline, switch to bge-small-en-v1.5 via /embeddings docs local BAAI/bge-small-en-v1.5 (the docs side; the code side is independent).
Code-heavy corpora: install amx-cli[local-embeddings] to pick up the code-specialised default. Once the extra is installed, Code RAG uses it automatically with no config change.
Long-form questions: bump rag_max_chunks via the detailed / full prompt preset.

Future tuning

The retrieval architecture above is considered complete for AMX's English-only data-catalog use case. Each retrieval-quality change was shipped against the retrieval evaluation harness; the committed baseline (hit@3 = 0.85, hit@5 = 0.95, MRR = 0.78) is the floor every subsequent change must clear.
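For readers unfamiliar with the baseline metrics, the sketch below shows the standard definitions of hit@k and MRR. It is not the AMX evaluation harness; the function names and the `(ranked_ids, relevant_ids)` input shape are assumptions, and the numbers in the usage note are toy data.

```python
def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> bool:
    """True when at least one relevant id appears in the top k results."""
    return any(doc_id in relevant for doc_id in ranked[:k])

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 when none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]], ks: tuple[int, ...] = (3, 5)) -> dict[str, float]:
    """Average the per-question metrics over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    report = {f"hit@{k}": sum(hit_at_k(r, rel, k) for r, rel in runs) / n for k in ks}
    report["mrr"] = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return report
```

On three toy questions where the relevant chunk is ranked first, ranked fifth, and missing, `evaluate` yields hit@3 = 1/3, hit@5 = 2/3, and MRR = (1 + 0.2 + 0) / 3 = 0.4.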

A small number of low-priority follow-ups remain available on the dispatcher seam for users with corpora that exercise them:

Token-counted chunk budgets. The default splitter still counts characters; switching to tiktoken-based length functions aligns chunk sizing with the LLM context budget downstream.
cfg.docs.chunking config knobs. Chunk size, overlap, and strategy are hardcoded today; exposing them would let users with code-heavy or PDF-heavy corpora tune without forking.
Chunker signature in collection metadata. A sibling of the embedding-identity check that forces a reingest when the chunker config changes.
Format-specific splitters for .py / .csv / .pdf. Reuse the existing AST chunker for .py files entering via /docs ingest, add a row-group splitter for CSV, and add a layout-aware splitter for PDF.

Two larger ideas were explicitly evaluated and dropped:

Query rewriting + metadata filters. Aimed at synonym / terminology gaps and non-English → English-corpus translation. AMX's data-catalog scope is English-only by design; adding an LLM call to every retrieval (+500-2000 ms per query) for a problem AMX doesn't have is the wrong trade-off.
Shared retrieval-core refactor. Pure refactor whose payoff is amortised across future retrieval PRs. With the rollout complete and no new retrieval features planned, the refactor has no payback window.