Ollama can serve models locally on your machine or proxy larger releases through the Ollama Cloud service. VT Code integrates with both deployment modes so you can keep lightweight workflows offline while bursting to the cloud for heavier jobs.
- Ollama installed and running locally (download)
- Optional: Ollama Cloud account with an API key for remote models
- At least one model pulled locally or in your cloud workspace (e.g., `ollama pull llama3:8b` or `ollama pull gpt-oss:120b-cloud`)
1. Install Ollama: Download from ollama.com and follow the platform-specific instructions
2. Start the Ollama server: Run `ollama serve` in a terminal
3. Pull a model: Choose and download a model to use:

   ```shell
   # Popular coding models
   ollama pull llama3:8b
   ollama pull codellama:7b
   ollama pull mistral:7b
   ollama pull qwen3:1.7b
   ollama pull deepseek-coder:6.7b
   ollama pull phind-codellama:34b

   # List available local models
   ollama list
   ```
- `OLLAMA_BASE_URL` (optional): Custom Ollama endpoint (defaults to `http://localhost:11434`). Set to `https://ollama.com` to send requests directly to Ollama Cloud.
- `OLLAMA_API_KEY` (optional): Required when connecting to Ollama Cloud. Not needed for purely local workloads.
Set up `vtcode.toml` in your project root:

```toml
[agent]
provider = "ollama"         # Ollama provider
default_model = "llama3:8b" # Any locally available model
# Note: API key only required when targeting Ollama Cloud

[tools]
default_policy = "prompt"   # Safety: "allow", "prompt", or "deny"

[tools.policies]
read_file = "allow"         # Always allow file reading
write_file = "prompt"       # Prompt before modifications
run_pty_cmd = "prompt"      # Prompt before commands
```

VT Code supports custom Ollama models through the interactive model picker or directly via the CLI:
```shell
# Using the interactive model picker (select "custom-ollama")
vtcode

# Direct CLI usage with custom models
vtcode --provider ollama --model mistral:7b ask "Review this code"
vtcode --provider ollama --model codellama:7b ask "Explain this function"
vtcode --provider ollama --model gpt-oss-20b ask "Help with this implementation"
vtcode --provider ollama --model gpt-oss:120b-cloud ask "Plan this large migration"
```

The /model picker now lists the core Ollama catalog so you can choose these models without typing IDs:
- `gpt-oss:20b` (local)
- `gpt-oss:120b-cloud`
- `deepseek-v3.1:671b-cloud`
- `kimi-k2.5:cloud`
- `qwen3:1.7b`
- `qwen3-coder:480b-cloud`
- `glm-4.6:cloud`
- `glm-4.7:cloud`
- `minimax-m2.7:cloud`
- `minimax-m2.5:cloud`
- `nemotron-3-super:cloud`
These entries appear beneath the Ollama provider section alongside the "Custom Ollama model" option.
VT Code includes support for OpenAI's open-source models that can be run via Ollama locally or through the cloud preview:
- `gpt-oss-20b`: Open-source 20B-parameter model from OpenAI (local)
- `gpt-oss:120b-cloud`: Cloud-hosted 120B-parameter model managed by Ollama
To use these models:
```shell
# Pull the model first (local or cloud)
ollama pull gpt-oss-20b
ollama pull gpt-oss:120b-cloud

# Use in VT Code
vtcode --provider ollama --model gpt-oss-20b ask "Code review this function"
vtcode --provider ollama --model gpt-oss:120b-cloud ask "Assist with this architecture review"
```

Ollama's API exposes OpenAI-compatible tool calling as well as the web search helpers. VT Code now forwards tool definitions to Ollama and surfaces any `tool_calls` responses from the model. A typical workflow looks like this:
- Define tools in `vtcode.toml` (or via slash commands) with JSON schemas that match your functions. For example, expose `web_search` and `web_fetch` so the agent can call Ollama's hosted knowledge tools.
- The agent will stream back `tool_calls` with structured arguments. VT Code automatically routes each call to the configured tool runner and includes the results as `tool` messages in the follow-up request.
- Ollama's responses can include multiple tool calls per turn. VT Code enforces `tool_call_id` requirements for reliability while still letting the model decide when to call a tool.
Because the provider now understands these payloads, you can mix Ollama's native utilities with your existing MCP toolchain.
Thinking-capable models such as `gpt-oss` and `qwen3` emit a dedicated thinking channel (docs). Set the reasoning effort to medium or high (e.g., `vtcode --reasoning high`) or configure `reasoning_effort = "high"` in `vtcode.toml`, and VT Code forwards the appropriate `think` parameter (`low`/`medium`/`high` for GPT-OSS, a boolean for Qwen). During streaming runs you will now see separate "Reasoning" lines followed by the final answer tokens, so you can inspect or hide the trace as needed.
Ollama continues to support incremental streaming (docs), and VT Code uses it by default. Combine reasoning with streaming to watch the model deliberate before it produces the final response.
When you have an Ollama API key you can target the managed endpoint without running a local server:
```shell
export OLLAMA_API_KEY="sk-..."
export OLLAMA_BASE_URL="https://ollama.com"
vtcode --provider ollama --model gpt-oss:120b-cloud ask "Summarize this spec"
```

VT Code automatically attaches the bearer token to requests when the API key is present.
- "Connection refused" errors: Ensure the Ollama server is running (`ollama serve`) or that `OLLAMA_BASE_URL` points to a reachable endpoint
- Model not found: Ensure the requested model has been pulled (`ollama pull MODEL_NAME`)
- Unauthorized (401) errors: Set `OLLAMA_API_KEY` when targeting Ollama Cloud
- Performance issues: Consider model size; larger models require more RAM
- Memory errors: For large local models like gpt-oss-120b, ensure sufficient RAM (64GB+ recommended)
Verify Ollama is working correctly:
```shell
# Test basic Ollama functionality
ollama run llama3:8b

# Test via API call
curl http://localhost:11434/api/tags
```

- Local models don't require an internet connection
- Performance varies significantly based on model size and local hardware
- Larger models (30B+) require substantial RAM (32GB+) for reasonable performance
- Smaller models (7B-13B) work well on consumer hardware with 16GB+ RAM