A production-ready LLM agent that combines Retrieval-Augmented Generation (RAG) with live Kubernetes cluster metrics. Built with FastAPI, Ollama, Qdrant, and Kubernetes integration.
- **RAG System**: Vector-based document retrieval using Qdrant and Ollama embeddings
- **K8s Integration**: Real-time cluster metrics (CPU, memory, pods, nodes)
- **Unified API**: Single FastAPI server combining RAG and Kubernetes queries
- **Container Ready**: Full Docker and Kubernetes deployment support
- **MCP Server**: Model Context Protocol server for Claude Desktop integration
- **Flexible Architecture**: Deploy as a unified server, sidecar, or standalone services
- Architecture
- Quick Start
- Installation
- Deployment Options
- Usage Examples
- API Reference
- Project Structure
- Configuration
- Contributing
- License
```
┌──────────────────────────────────────────────────────────────┐
│                     K8s-Aware RAG Agent                      │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐     ┌──────────────┐     ┌─────────────┐   │
│  │   FastAPI    │────▶│    Qdrant    │     │ Kubernetes  │   │
│  │   Server     │     │  Vector DB   │     │   Cluster   │   │
│  └──────┬───────┘     └──────────────┘     └──────┬──────┘   │
│         │                                         │          │
│         │       ┌──────────────┐                  │          │
│         └──────▶│    Ollama    │                  │          │
│                 │ (LLM/Embed)  │                  │          │
│                 └──────────────┘                  │          │
│                                                   │          │
│  RAG Queries ◀──────────┬─────────────────────────┘          │
│                         │                                    │
│  K8s Metrics ◀──────────┘                                    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
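The retrieval flow through this architecture (embed the prompt via Ollama, search Qdrant, then generate an answer) can be sketched as follows. This is an illustrative sketch, not the project's actual code: the endpoint paths and payload shapes are assumptions based on the public Ollama and Qdrant REST APIs, and the HTTP call is abstracted behind a `post` callable so the flow can be run offline with a stub.

```python
from typing import Callable


def rag_flow(post: Callable[[str, dict], dict], prompt: str) -> str:
    # 1. Embed the prompt (Ollama embeddings endpoint)
    embedding = post(
        "http://ollama:11434/api/embeddings",
        {"model": "all-minilm", "prompt": prompt},
    )["embedding"]
    # 2. Retrieve the nearest documents (Qdrant points search endpoint)
    hits = post(
        "http://qdrant:6333/collections/rag_memory/points/search",
        {"vector": embedding, "limit": 3, "with_payload": True},
    )["result"]
    context = " ".join(hit["payload"]["text"] for hit in hits)
    # 3. Generate an answer grounded in the retrieved context
    out = post(
        "http://ollama:11434/api/generate",
        {"model": "tinyllama", "stream": False,
         "prompt": f"Context: {context}\n\nQuestion: {prompt}"},
    )
    return out["response"]


def fake_post(url: str, payload: dict) -> dict:
    # Stubbed responses standing in for Ollama and Qdrant
    if "embeddings" in url:
        return {"embedding": [0.1] * 384}
    if "search" in url:
        return {"result": [{"score": 0.9,
                            "payload": {"text": "Kubernetes orchestrates containers."}}]}
    return {"response": "Kubernetes is a container orchestration platform."}


print(rag_flow(fake_post, "What is Kubernetes?"))
```

In production, `post` would wrap an HTTP client such as `httpx`; the stub keeps the example self-contained.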
- **LLM Agent** (`src/main.py`): Basic RAG implementation with FastAPI
- **Unified Server** (`src/unified_server.py`): Combined RAG + K8s metrics server
- **MCP Server** (`src/k8s_mcp_server.py`): Standalone K8s metrics via the MCP protocol
- **Examples** (`examples/k8s_rag_example.py`): Cluster-aware RAG query examples
- Python 3.11+
- Docker & Docker Compose (optional for local development)
- Kubernetes cluster (Minikube, Kind, or cloud provider)
- kubectl configured
- Metrics Server installed in your K8s cluster
```bash
git clone https://github.com/Jonsy13/ollama-k8s-rag.git
cd ollama-k8s-rag

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Start Ollama (in a separate terminal)
ollama serve

# Pull required models
ollama pull tinyllama
ollama pull all-minilm

# Start Qdrant (Docker)
docker run -d -p 6333:6333 qdrant/qdrant

# Run the agent
uvicorn src.main:app --reload --host 0.0.0.0 --port 8000
```

```bash
# Health check
curl http://localhost:8000/health

# Ingest a document
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"text": "Kubernetes is a container orchestration platform.", "metadata": {"topic": "k8s"}}'

# Query the RAG system
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Kubernetes?", "top_k": 3}'
```
```bash
# Install Python dependencies
pip install -r requirements.txt

# Set environment variables (optional)
export OLLAMA_URL="http://localhost:11434/api/generate"
export QDRANT_URL="http://localhost:6333"
```

```bash
# Coming soon - docker-compose.yml for the local stack
docker-compose up -d
```

See Deployment Options below.
**Best for:** Production deployments where RAG needs cluster context
```bash
# Apply RBAC
kubectl apply -f k8s/k8s-mcp-rbac.yaml

# Deploy the stack
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/vectorDB.yaml
kubectl apply -f k8s/ollama.yaml
kubectl apply -f k8s/llm-agent.yaml

# Test the deployment
kubectl port-forward -n llm-chaos svc/llm-agent 8000:8000
curl http://localhost:8000/k8s/cluster/cpu
```

**Best for:** Using Claude Desktop to query your cluster
```bash
# Run locally
python src/k8s_mcp_server.py
```

Configure Claude Desktop by editing `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "k8s-cluster": {
      "command": "python",
      "args": ["/path/to/src/k8s_mcp_server.py"]
    }
  }
}
```

**Best for:** Separate concerns with a shared pod lifecycle

```bash
kubectl apply -f k8s/llm-agent-with-mcp.yaml
```

**Best for:** Independent scaling of services

```bash
kubectl apply -f k8s/k8s-mcp-server.yaml
kubectl apply -f k8s/llm-agent.yaml
```

Detailed deployment instructions: see docs/DEPLOYMENT_STEPS.md
```python
import asyncio

import httpx


async def query_agent():
    # Use a context manager so the client is closed cleanly
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/query",
            json={"prompt": "Explain Python programming", "top_k": 3},
        )
    result = response.json()
    print(result["response"])


asyncio.run(query_agent())
```

```python
import asyncio

from examples.k8s_rag_example import enhanced_rag_query


async def main():
    # Automatically includes K8s metrics when relevant
    result = await enhanced_rag_query(
        "What's my cluster CPU usage right now?"
    )
    print(result["context"])


asyncio.run(main())
```

```bash
# CPU usage
curl http://localhost:8000/k8s/cluster/cpu

# Memory usage
curl http://localhost:8000/k8s/cluster/memory

# List pods (quote the URL so the shell does not expand `?`)
curl "http://localhost:8000/k8s/pods?namespace=default"

# Cluster info
curl http://localhost:8000/k8s/cluster/info
```
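The JSON these endpoints return is straightforward to post-process. As a sketch, a hypothetical helper like the one below turns the `/k8s/cluster/cpu` payload (shape taken from the API reference further down) into a one-line summary:

```python
def summarize_cluster_cpu(payload: dict) -> str:
    # Payload shape follows the /k8s/cluster/cpu response in the API reference
    cpu = payload["cluster_cpu"]
    return (f"CPU: {cpu['total_usage_cores']:.1f}/{cpu['total_capacity_cores']:.1f} cores "
            f"({cpu['utilization_percent']:.1f}% used)")


# Sample payload mirroring the documented response
sample = {
    "cluster_cpu": {"total_usage_cores": 2.5, "total_capacity_cores": 8.0,
                    "utilization_percent": 31.25},
    "nodes": [],
}
print(summarize_cluster_cpu(sample))  # CPU: 2.5/8.0 cores (31.2% used)
```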
```bash
# Demo cluster-aware queries
python examples/k8s_rag_example.py 1

# Ingest cluster documentation
python examples/k8s_rag_example.py 2

# Single custom query
python examples/k8s_rag_example.py 3
```

**POST /ingest**

Ingest a document into the vector database.
Request Body:

```json
{
  "text": "Your document text here",
  "metadata": {
    "category": "programming",
    "topic": "python"
  }
}
```

Response:

```json
{
  "message": "Document ingested",
  "id": "uuid-here",
  "text_length": 150
}
```

**POST /query**

Query the RAG system.
Request Body:

```json
{
  "prompt": "What is Kubernetes?",
  "top_k": 3
}
```

Response:

```json
{
  "query": "What is Kubernetes?",
  "matches": [...],
  "response": "Kubernetes is..."
}
```

**GET /health**

Health check endpoint.
Response:

```json
{
  "status": "ok",
  "k8s_enabled": true
}
```

**GET /k8s/cluster/cpu**

Get cluster-wide CPU usage.
Response:

```json
{
  "cluster_cpu": {
    "total_usage_cores": 2.5,
    "total_capacity_cores": 8.0,
    "utilization_percent": 31.25
  },
  "nodes": [...]
}
```

**GET /k8s/cluster/memory**

Get cluster-wide memory usage.
Response:

```json
{
  "cluster_memory": {
    "total_usage_gi": 4.2,
    "total_capacity_gi": 16.0,
    "utilization_percent": 26.25
  },
  "nodes": [...]
}
```

**GET /k8s/pods**

List pods with optional filtering.
Query Parameters:

- `namespace` (string): Namespace to query (default: "all")
- `label_selector` (string): Label selector (e.g., "app=nginx")

Response:

```json
{
  "count": 5,
  "pods": [
    {
      "name": "pod-name",
      "namespace": "default",
      "status": "Running",
      "node": "node-1",
      "ip": "10.244.0.5"
    }
  ]
}
```

**GET /k8s/cluster/info**

Get general cluster information.
Response:

```json
{
  "version": "v1.28.0",
  "nodes_count": 3,
  "namespaces_count": 12,
  "k8s_enabled": true
}
```

```
k8s-rag-agent/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Container image definition
├── .gitignore                   # Git ignore rules
│
├── src/                         # Source code
│   ├── __init__.py
│   ├── main.py                  # Basic RAG agent
│   ├── unified_server.py        # Unified RAG + K8s server
│   └── k8s_mcp_server.py        # Standalone MCP server
│
├── examples/                    # Usage examples
│   └── k8s_rag_example.py       # Cluster-aware RAG demo
│
├── k8s/                         # Kubernetes manifests
│   ├── namespace.yaml           # llm-chaos namespace
│   ├── pvc.yaml                 # Persistent volume claims
│   ├── vectorDB.yaml            # Qdrant deployment
│   ├── ollama.yaml              # Ollama deployment
│   ├── llm-agent.yaml           # LLM agent deployment
│   ├── k8s-mcp-rbac.yaml        # RBAC permissions
│   ├── k8s-mcp-server.yaml      # Standalone MCP server
│   └── llm-agent-with-mcp.yaml  # Agent + MCP sidecar
│
└── docs/                        # Documentation
    └── DEPLOYMENT_STEPS.md      # Detailed deployment guide
```
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_URL` | `http://ollama:11434/api/generate` | Ollama generation endpoint |
| `OLLAMA_EMBED_URL` | `http://ollama:11434/api/embeddings` | Ollama embeddings endpoint |
| `QDRANT_URL` | `http://qdrant:6333` | Qdrant vector database URL |
| `COLLECTION_NAME` | `rag_memory` | Qdrant collection name |
The agent requires the following permissions:
- `get`, `list`, `watch` on nodes
- `get`, `list`, `watch` on pods (all namespaces)
- Access to the metrics.k8s.io API group
See k8s/k8s-mcp-rbac.yaml for full RBAC configuration.
Required models:
- `tinyllama`: LLM for text generation
- `all-minilm`: Embedding model (384 dimensions)
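Retrieval works by comparing these 384-dimensional embedding vectors. As an illustrative sketch, here is cosine similarity, a common choice for this comparison (the actual distance metric is configured on the Qdrant collection, which supports several):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product normalised by the two vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Identical vectors score 1.0; orthogonal vectors score 0.0
v = [0.5] * 384
print(round(cosine_similarity(v, v), 6))  # 1.0
```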
```bash
ollama pull tinyllama
ollama pull all-minilm
```

```bash
# Unit tests
pytest tests/

# Integration tests
kubectl port-forward -n llm-chaos svc/llm-agent 8000:8000
python examples/k8s_rag_example.py 1
```

```bash
# Check all pods are running
kubectl get pods -n llm-chaos

# Check services
kubectl get svc -n llm-chaos

# Test health endpoint
kubectl port-forward -n llm-chaos svc/llm-agent 8000:8000
curl http://localhost:8000/health

# Test K8s integration
curl http://localhost:8000/k8s/cluster/info
```

**Cause:** RBAC not configured or kubeconfig missing
**Fix:**

```bash
kubectl apply -f k8s/k8s-mcp-rbac.yaml
kubectl rollout restart deployment/llm-agent -n llm-chaos
```

**Cause:** Metrics Server not installed in the cluster

**Fix:**

```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl wait --for=condition=available --timeout=60s deployment/metrics-server -n kube-system
```

**Cause:** Ollama service not ready or wrong URL
**Fix:**

```bash
# Check the Ollama pod
kubectl get pods -n llm-chaos -l app=ollama

# Check logs
kubectl logs -n llm-chaos -l app=ollama

# Verify models are loaded
kubectl exec -n llm-chaos -it <ollama-pod> -- ollama list
```

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Run linters
black src/ examples/
flake8 src/ examples/
mypy src/

# Run tests
pytest tests/ -v
```

This project is licensed under the MIT License - see the LICENSE file for details.
- FastAPI - Modern Python web framework
- Ollama - Local LLM inference
- Qdrant - Vector database
- Kubernetes - Container orchestration
- MCP Protocol - Model Context Protocol
If you find this project useful, please consider giving it a star! ⭐

Built with ❤️ for the Kubernetes and AI community