
Ollama and OpenAI-compatible local endpoints

For air-gapped use, on-prem deployments, or just keeping all data local, AMX supports two local-LLM paths:

  • Ollama native (provider: ollama) — uses the Ollama-specific API at http://localhost:11434.
  • OpenAI-compatible local (provider: local) — uses the OpenAI Chat Completions wire format against any local server (vLLM, LM Studio, Text Generation Inference, llama.cpp's server, Ollama's OpenAI-mode).

Ollama native

/add-llm-profile ollama_local

Fields:

  • Provider: ollama
  • Model id: the Ollama model name, e.g. llama3.1:70b, qwen2.5:32b, gemma2:27b
  • Base URL: http://localhost:11434 (no /v1)
  • API key: any non-empty string (Ollama ignores it)

Make sure the model is pulled first:

ollama pull llama3.1:70b
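To confirm the daemon is reachable and the model is actually present (assuming the default port), a quick check looks like this:

# list the models the local Ollama daemon has pulled
ollama list

# or query the native API directly
curl http://localhost:11434/api/tags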

OpenAI-compatible local

For vLLM / LM Studio / TGI / llama.cpp:

/add-llm-profile local_vllm

Fields:

  • Provider: local
  • Model id: the model name as the server exposes it
  • Base URL: the server's OpenAI-compatible root, e.g. http://localhost:8000/v1 for vLLM or http://localhost:11434/v1 for Ollama's OpenAI-mode
  • API key: any non-empty string
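As a rough sketch of the server side (vLLM here, with an example model name that is not prescriptive), start the OpenAI-compatible server and smoke-test it before adding the profile:

# start vLLM's OpenAI-compatible server (listens on port 8000 by default)
vllm serve Qwen/Qwen2.5-32B-Instruct

# the id reported here is what goes into the profile's "Model id" field
curl http://localhost:8000/v1/models

# minimal Chat Completions request in the OpenAI wire format
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-32B-Instruct",
       "messages": [{"role": "user", "content": "Reply with the word ok"}]}'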

This is the right choice when:

  • You need OpenAI-compatible logprobs (vLLM exposes them, Ollama native does not).
  • You want to use prompts and tools designed for the OpenAI Chat Completions schema unchanged.
  • You're running a fine-tuned model behind a vLLM serving stack.

For metadata generation, the model needs to be:

  • Good at structured (JSON) output.
  • Able to follow long-form prompts (the Profile Agent batch can be 4-8K tokens).
  • Large enough to make reasonable inferences (≥ 30B parameters in practice).

Tested combinations:

Model                  | Quality   | Notes
qwen2.5:32b            | Good      | Solid JSON adherence, fast
llama3.1:70b           | Very good | Slow on consumer hardware; use for high-stakes
gemma2:27b             | OK        | Decent baseline, occasional JSON drift
deepseek-coder-v2:16b  | Mixed     | Strong on code-heavy schemas, weaker on business semantics

Smaller models (≤ 13B) tend to invent foreign-key relationships and produce verbose, low-confidence output. They work for evaluation but aren't suitable for production metadata.

Logprobs

  • Ollama native does not return logprobs. AMX falls back to whole-response confidence for ollama-native profiles.
  • vLLM, LM Studio, and other OpenAI-compatible servers return logprobs when started with the right flags. Confidence calibration then works the same way as for OpenAI direct.
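For illustration, an OpenAI-compatible logprobs request against a local vLLM server (example model name, logprobs assumed enabled on the server) looks like this:

# ask for per-token logprobs in the OpenAI Chat Completions format
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-32B-Instruct",
       "messages": [{"role": "user", "content": "Is 7 prime? Answer yes or no."}],
       "logprobs": true,
       "top_logprobs": 3}'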

Embeddings

For a fully offline AMX deployment, also configure local embeddings:

/embeddings Local

This uses local sentence-transformers. Run /search rebuild after switching to re-embed the catalog.

Known gotchas

  • Local servers often have small default context windows. For wide-table profiling, raise the server's context window (vLLM --max-model-len, llama.cpp -c) and reduce /llm-batch-size until prompts fit; see the sketch after this list.
  • Ollama's native API doesn't support Anthropic-style tool calls. Use OpenAI-compatible mode if you want the Search Agent's tool-loop behaviour against an Ollama-served model.
  • Self-signed TLS on the local server: set REQUESTS_CA_BUNDLE to your CA bundle.
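A minimal sketch of the context-window and TLS gotchas, assuming vLLM (or llama.cpp's server) and a self-signed CA at a hypothetical path:

# give the server a larger context window for wide-table profiling (vLLM)
vllm serve Qwen/Qwen2.5-32B-Instruct --max-model-len 16384

# same idea with llama.cpp's server (model path is a placeholder)
llama-server -m ./model.gguf -c 16384

# trust a self-signed certificate when the local endpoint uses HTTPS
export REQUESTS_CA_BUNDLE=/path/to/local-ca.pem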