LLM Providers
Provider Interface
All providers return a uniform Stream of Event values — text deltas, thinking deltas, tool calls, and usage. The agent’s consumeStream function normalizes these into the internal Message format, making the agent completely provider-agnostic.
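A minimal sketch of what this contract might look like in Go. The Stream, Event, EventThinkingDelta, and Provider names come from this page, but the exact fields, constants, and method signatures are assumptions rather than the project's actual definitions:

```go
package llm

import "context"

// EventKind distinguishes the delta types carried on the stream.
type EventKind int

const (
	EventTextDelta EventKind = iota
	EventThinkingDelta
	EventToolCallDelta
	EventUsage
)

// Event is one streamed unit from a provider adapter.
type Event struct {
	Kind EventKind
	Text string // payload for text and thinking deltas
	// tool-call and usage payloads omitted in this sketch
}

// Stream is the uniform channel of events every adapter returns.
type Stream <-chan Event

// Provider is the contract each adapter in internal/llm/ satisfies
// (CompletionRequest and ProviderInfo are sketched in the sections below).
type Provider interface {
	Complete(ctx context.Context, req CompletionRequest) (Stream, error)
	Info() ProviderInfo
}
```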
CompletionRequest
The BeforeProviderRequest extension hook receives this struct as JSON and can modify any field before it is sent to the provider — useful for overriding temperature, trimming the tool list, or adjusting MaxTokens per request.
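A sketch of what such a hook might receive and do, written here as a Go function that transforms the JSON payload (using encoding/json), though the actual hook mechanism is not specified on this page. Only Temperature, MaxTokens, the tool list, and Thinking are named in this document; the remaining fields are assumptions:

```go
// Hypothetical JSON shape of CompletionRequest as seen by a
// BeforeProviderRequest hook. Only Temperature, MaxTokens, Tools, and
// Thinking are documented on this page; the other fields are assumed.
type CompletionRequest struct {
	Model       string            `json:"model"`
	Messages    []json.RawMessage `json:"messages"`
	Tools       []json.RawMessage `json:"tools,omitempty"`
	Temperature float64           `json:"temperature"`
	MaxTokens   int               `json:"maxTokens"`
	Thinking    string            `json:"thinking,omitempty"` // "", "medium", "high"
}

// Example hook body: cap MaxTokens and trim the tool list before the
// request reaches the provider.
func beforeProviderRequest(raw []byte) ([]byte, error) {
	var req CompletionRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return nil, err
	}
	if req.MaxTokens > 4096 {
		req.MaxTokens = 4096
	}
	if len(req.Tools) > 2 {
		req.Tools = req.Tools[:2]
	}
	return json.Marshal(req)
}
```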
ProviderInfo
Info() is called once at startup. The service uses ContextWindow to trigger compaction when the conversation grows too large. HasImages controls whether the TUI offers its image-attachment UI.
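A sketch of the struct this might be; ContextWindow and HasImages are the fields described above, while the Name field is an assumption:

```go
// Hypothetical ProviderInfo shape. ContextWindow drives compaction and
// HasImages gates the TUI's image-attachment UI, as described above.
type ProviderInfo struct {
	Name          string
	ContextWindow int  // tokens
	HasImages     bool
}
```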
ModelLister
All five adapters implement ModelLister. When --list-models is passed, the CLI type-asserts the active provider to ModelLister and prints the result (a sketch of this path follows the table). Each adapter queries the appropriate API:
| Provider | Query mechanism |
|---|---|
| ollama | GET /api/tags |
| llamacpp | GET /v1/models |
| openai | GET /v1/models |
| anthropic | GET /v1/models |
| google | Gemini model list API |
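A sketch of the --list-models path, using context and fmt from the standard library; the ListModels method name and signature are assumptions, not the project's actual interface:

```go
// Hypothetical ModelLister contract and the CLI-side type assertion that
// uses it.
type ModelLister interface {
	ListModels(ctx context.Context) ([]string, error)
}

func printModels(ctx context.Context, p Provider) error {
	lister, ok := p.(ModelLister)
	if !ok {
		return fmt.Errorf("the active provider does not implement ModelLister")
	}
	models, err := lister.ListModels(ctx)
	if err != nil {
		return err
	}
	for _, name := range models {
		fmt.Println(name)
	}
	return nil
}
```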
Supported Providers
| Provider | Backend |
|---|---|
| ollama | Local Ollama server (HTTP) |
| llamacpp | llama.cpp server (HTTP, OpenAI-compatible) |
| openai | OpenAI API or any OpenAI-compatible endpoint |
| anthropic | Anthropic Messages API |
| google | Google Gemini API |
Each adapter lives in internal/llm/ and translates the provider’s wire format into the uniform Stream abstraction.
Feature Matrix
| Provider | Tools | Images | Thinking | Context Window |
|---|---|---|---|---|
| ollama | ✓ | ✓ | model-dependent | 4096 (default) |
| llamacpp | ✓ | ✗ | ✗ | from server n_ctx |
| openai | ✓ | ✓ | reasoning models | model-dependent |
| anthropic | ✓ | ✓ | ✓ extended | model-dependent |
| google | ✓ | ✓ | ✗ | 1,000,000+ |
Per-Provider Notes
Ollama
The Ollama adapter uses the /api/chat endpoint with streaming enabled. Context window defaults to 4096 when not reported by the server. Thinking is supported on models that emit <think> tokens (e.g. qwq, deepseek-r1) — sharur surfaces these as EventThinkingDelta events by detecting the tag boundaries in the stream.
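A simplified sketch of the tag-boundary detection, reusing the Event types from the sketch above and the strings package. It assumes the <think> and </think> tags arrive unsplit within a single chunk, which the real adapter cannot assume:

```go
// Route a streamed chunk to text or thinking events depending on whether
// we are currently inside a <think> span. Simplification: tags are assumed
// not to be split across chunks.
func routeChunk(chunk string, inThinking *bool, out chan<- Event) {
	for chunk != "" {
		if *inThinking {
			i := strings.Index(chunk, "</think>")
			if i < 0 {
				out <- Event{Kind: EventThinkingDelta, Text: chunk}
				return
			}
			if i > 0 {
				out <- Event{Kind: EventThinkingDelta, Text: chunk[:i]}
			}
			*inThinking = false
			chunk = chunk[i+len("</think>"):]
			continue
		}
		i := strings.Index(chunk, "<think>")
		if i < 0 {
			out <- Event{Kind: EventTextDelta, Text: chunk}
			return
		}
		if i > 0 {
			out <- Event{Kind: EventTextDelta, Text: chunk[:i]}
		}
		*inThinking = true
		chunk = chunk[i+len("<think>"):]
	}
}
```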
llama.cpp
Uses the OpenAI-compatible /v1/chat/completions endpoint. The context window (n_ctx) is queried from the server at startup. Image attachments are not supported because llama.cpp’s OpenAI endpoint does not accept multipart vision payloads in the standard format.
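This page does not say which endpoint reports n_ctx. The sketch below assumes the llama.cpp server's /props endpoint and a payload containing default_generation_settings.n_ctx; both are assumptions about the server API and may differ between llama.cpp versions (uses net/http and encoding/json):

```go
// Hedged sketch: query the llama.cpp server for its context window at
// startup. The /props endpoint and the default_generation_settings.n_ctx
// field are assumptions, not taken from this page.
func fetchContextWindow(baseURL string) (int, error) {
	resp, err := http.Get(baseURL + "/props")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var props struct {
		DefaultGenerationSettings struct {
			NCtx int `json:"n_ctx"`
		} `json:"default_generation_settings"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&props); err != nil {
		return 0, err
	}
	return props.DefaultGenerationSettings.NCtx, nil
}
```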
OpenAI
Uses the standard /v1/chat/completions streaming endpoint. Any server implementing this API — vLLM, LM Studio, Groq, Together AI — can be used by setting openAIBaseURL. Reasoning models (o3, o4-mini) emit reasoning_content deltas that are surfaced as EventThinkingDelta.
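A sketch of how the streamed delta might be decoded. The reasoning_content field name comes from the paragraph above; the surrounding struct and routing function are a minimal assumption about the chat-completions chunk format, not the adapter's actual code:

```go
// Hypothetical per-chunk delta shape decoded from the SSE stream.
type chatDelta struct {
	Content          string `json:"content"`
	ReasoningContent string `json:"reasoning_content"`
}

// Map a decoded delta onto the uniform event stream.
func deltaToEvents(d chatDelta, out chan<- Event) {
	if d.ReasoningContent != "" {
		out <- Event{Kind: EventThinkingDelta, Text: d.ReasoningContent}
	}
	if d.Content != "" {
		out <- Event{Kind: EventTextDelta, Text: d.Content}
	}
}
```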
Anthropic
Uses the Messages API (/v1/messages) with streaming. Extended thinking is activated when req.Thinking is medium or high:
- medium — 10,000-token thinking budget
- high — 20,000-token thinking budget
The API requires temperature: 1.0 when extended thinking is enabled; the adapter sets this automatically and overrides any user-supplied temperature for that request.
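A sketch of that mapping, written against raw request fields rather than a specific SDK. The budget numbers are the ones listed above; the thinking/budget_tokens field names follow the Anthropic Messages API, and the helper names are illustrative:

```go
// Map req.Thinking onto a thinking budget. "medium" and "high" are the
// levels documented above; anything else disables extended thinking.
func thinkingParams(level string) (budget int, enabled bool) {
	switch level {
	case "medium":
		return 10000, true
	case "high":
		return 20000, true
	default:
		return 0, false
	}
}

// Apply the thinking block to the outgoing request body and force
// temperature to 1.0, overriding any user-supplied value for this request.
func applyThinking(body map[string]any, level string) {
	budget, enabled := thinkingParams(level)
	if !enabled {
		return
	}
	body["thinking"] = map[string]any{
		"type":          "enabled",
		"budget_tokens": budget,
	}
	body["temperature"] = 1.0
}
```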
Google
Uses the Gemini generateContent API via the google.golang.org/genai client library. Gemini 1.5 Pro and later have context windows of 1M+ tokens; compaction is rarely triggered in typical sessions.
Adding a Provider
Implement the Provider interface in internal/llm/yourprovider.go and register it in internal/config/factory.go. Implement ModelLister to enable --list-models. The adapter receives a fully-formed CompletionRequest; it is responsible for translating Message.ToolCalls and Message.Images into the target API’s format.
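A skeleton of what a new adapter might look like, built on the sketched interfaces from the earlier sections. The constructor, method bodies, and default values are illustrative only, and the registration call in internal/config/factory.go is not shown because its shape is not documented here:

```go
// Hypothetical adapter skeleton for internal/llm/yourprovider.go.
type YourProvider struct {
	baseURL string
	model   string
}

func NewYourProvider(baseURL, model string) *YourProvider {
	return &YourProvider{baseURL: baseURL, model: model}
}

func (p *YourProvider) Info() ProviderInfo {
	return ProviderInfo{Name: "yourprovider", ContextWindow: 8192, HasImages: false}
}

func (p *YourProvider) Complete(ctx context.Context, req CompletionRequest) (Stream, error) {
	out := make(chan Event)
	go func() {
		defer close(out)
		// Translate req.Messages (including Message.ToolCalls and
		// Message.Images) into the target API's wire format, stream the
		// response, and emit Events as deltas arrive.
		out <- Event{Kind: EventTextDelta, Text: "stub response"}
	}()
	return out, nil
}

// Optional: implement ModelLister to enable --list-models.
func (p *YourProvider) ListModels(ctx context.Context) ([]string, error) {
	return []string{p.model}, nil
}
```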