A production-ready LLM agent that combines Retrieval-Augmented Generation (RAG) with live Kubernetes cluster metrics. Built with FastAPI, Ollama, Qdrant, and Kubernetes integration.
- **RAG System**: Vector-based document retrieval using Qdrant and Ollama embeddings
- **K8s Integration**: Real-time cluster metrics (CPU, memory, pods, nodes)
- **Unified API**: Single FastAPI server combining RAG and Kubernetes queries
- **Container Ready**: Full Docker and Kubernetes deployment support
- **MCP Server**: Model Context Protocol server for Claude Desktop integration
- **Flexible Architecture**: Deploy as a unified server, sidecar, or standalone services
- Architecture
- Quick Start
- Installation
- Deployment Options
- Usage Examples
- API Reference
- Project Structure
- Configuration
- Contributing
- License
```
┌──────────────────────────────────────────────────────────────┐
│                     K8s-Aware RAG Agent                      │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐     ┌──────────────┐     ┌─────────────┐   │
│  │   FastAPI    │────▶│    Qdrant    │     │ Kubernetes  │   │
│  │   Server     │     │  Vector DB   │     │   Cluster   │   │
│  └──────┬───────┘     └──────────────┘     └──────┬──────┘   │
│         │                                         │          │
│         │       ┌──────────────┐                  │          │
│         └──────▶│    Ollama    │                  │          │
│                 │ (LLM/Embed)  │                  │          │
│                 └──────────────┘                  │          │
│                                                   │          │
│  RAG Queries ◀──────────┬─────────────────────────┘          │
│                         │                                    │
│  K8s Metrics ◀──────────┘                                    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
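The retrieval flow through this architecture (embed the prompt via Ollama, search Qdrant, then generate an answer) can be sketched as follows. This is an illustrative sketch, not the project's actual code: the endpoint paths and payload shapes are assumptions based on the public Ollama and Qdrant REST APIs, and the HTTP call is abstracted behind a `post` callable so the flow can be run offline with a stub.

```python
from typing import Callable


def rag_flow(post: Callable[[str, dict], dict], prompt: str) -> str:
    # 1. Embed the prompt (Ollama embeddings endpoint)
    embedding = post(
        "http://ollama:11434/api/embeddings",
        {"model": "all-minilm", "prompt": prompt},
    )["embedding"]
    # 2. Retrieve the nearest documents (Qdrant points search endpoint)
    hits = post(
        "http://qdrant:6333/collections/rag_memory/points/search",
        {"vector": embedding, "limit": 3, "with_payload": True},
    )["result"]
    context = " ".join(hit["payload"]["text"] for hit in hits)
    # 3. Generate an answer grounded in the retrieved context
    out = post(
        "http://ollama:11434/api/generate",
        {"model": "tinyllama", "stream": False,
         "prompt": f"Context: {context}\n\nQuestion: {prompt}"},
    )
    return out["response"]


def fake_post(url: str, payload: dict) -> dict:
    # Stubbed responses standing in for Ollama and Qdrant
    if "embeddings" in url:
        return {"embedding": [0.1] * 384}
    if "search" in url:
        return {"result": [{"score": 0.9,
                            "payload": {"text": "Kubernetes orchestrates containers."}}]}
    return {"response": "Kubernetes is a container orchestration platform."}


print(rag_flow(fake_post, "What is Kubernetes?"))
```

In production, `post` would wrap an HTTP client such as `httpx`; the stub keeps the example self-contained.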
- **LLM Agent** (`src/main.py`): Basic RAG implementation with FastAPI
- **Unified Server** (`src/unified_server.py`): Combined RAG + K8s metrics server
- **MCP Server** (`src/k8s_mcp_server.py`): Standalone K8s metrics via the MCP protocol
- **Examples** (`examples/k8s_rag_example.py`): Cluster-aware RAG query examples
- Python 3.11+
- Docker & Docker Compose (optional for local development)
- Kubernetes cluster (Minikube, Kind, or cloud provider)
- kubectl configured
- Metrics Server installed in your K8s cluster
```bash
git clone https://github.com/Jonsy13/ollama-k8s-rag.git
cd ollama-k8s-rag

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Start Ollama (in a separate terminal)
ollama serve

# Pull required models
ollama pull tinyllama
ollama pull all-minilm

# Start Qdrant (Docker)
docker run -d -p 6333:6333 qdrant/qdrant

# Run the agent
uvicorn src.main:app --reload --host 0.0.0.0 --port 8000
```

```bash
# Health check
curl http://localhost:8000/health

# Ingest a document
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"text": "Kubernetes is a container orchestration platform.", "metadata": {"topic": "k8s"}}'

# Query the RAG system
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Kubernetes?", "top_k": 3}'
```
```bash
# Install Python dependencies
pip install -r requirements.txt

# Set environment variables (optional)
export OLLAMA_URL="http://localhost:11434/api/generate"
export QDRANT_URL="http://localhost:6333"
```

```bash
# Coming soon - docker-compose.yml for the local stack
docker-compose up -d
```

See Deployment Options below.
**Best for:** Production deployments where RAG needs cluster context
```bash
# Apply RBAC
kubectl apply -f k8s/k8s-mcp-rbac.yaml

# Deploy the stack
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/vectorDB.yaml
kubectl apply -f k8s/ollama.yaml
kubectl apply -f k8s/llm-agent.yaml

# Test the deployment
kubectl port-forward -n llm-chaos svc/llm-agent 8000:8000
curl http://localhost:8000/k8s/cluster/cpu
```

**Best for:** Using Claude Desktop to query your cluster
```bash
# Run locally
python src/k8s_mcp_server.py
```

Configure Claude Desktop by editing `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "k8s-cluster": {
      "command": "python",
      "args": ["/path/to/src/k8s_mcp_server.py"]
    }
  }
}
```

**Best for:** Separate concerns with a shared pod lifecycle

```bash
kubectl apply -f k8s/llm-agent-with-mcp.yaml
```

**Best for:** Independent scaling of services

```bash
kubectl apply -f k8s/k8s-mcp-server.yaml
kubectl apply -f k8s/llm-agent.yaml
```

Detailed deployment instructions: see docs/DEPLOYMENT_STEPS.md
```python
import asyncio

import httpx


async def query_agent():
    # Use a context manager so the client is closed cleanly
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/query",
            json={"prompt": "Explain Python programming", "top_k": 3},
        )
    result = response.json()
    print(result["response"])


asyncio.run(query_agent())
```

```python
import asyncio

from examples.k8s_rag_example import enhanced_rag_query


async def main():
    # Automatically includes K8s metrics when relevant
    result = await enhanced_rag_query(
        "What's my cluster CPU usage right now?"
    )
    print(result["context"])


asyncio.run(main())
```

```bash
# CPU usage
curl http://localhost:8000/k8s/cluster/cpu

# Memory usage
curl http://localhost:8000/k8s/cluster/memory

# List pods (quote the URL so the shell does not expand `?`)
curl "http://localhost:8000/k8s/pods?namespace=default"

# Cluster info
curl http://localhost:8000/k8s/cluster/info
```
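The JSON these endpoints return is straightforward to post-process. As a sketch, a hypothetical helper like the one below turns the `/k8s/cluster/cpu` payload (shape taken from the API reference further down) into a one-line summary:

```python
def summarize_cluster_cpu(payload: dict) -> str:
    # Payload shape follows the /k8s/cluster/cpu response in the API reference
    cpu = payload["cluster_cpu"]
    return (f"CPU: {cpu['total_usage_cores']:.1f}/{cpu['total_capacity_cores']:.1f} cores "
            f"({cpu['utilization_percent']:.1f}% used)")


# Sample payload mirroring the documented response
sample = {
    "cluster_cpu": {"total_usage_cores": 2.5, "total_capacity_cores": 8.0,
                    "utilization_percent": 31.25},
    "nodes": [],
}
print(summarize_cluster_cpu(sample))  # CPU: 2.5/8.0 cores (31.2% used)
```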
```bash
# Demo cluster-aware queries
python examples/k8s_rag_example.py 1

# Ingest cluster documentation
python examples/k8s_rag_example.py 2

# Single custom query
python examples/k8s_rag_example.py 3
```

**POST /ingest**

Ingest a document into the vector database.
Request Body:

```json
{
  "text": "Your document text here",
  "metadata": {
    "category": "programming",
    "topic": "python"
  }
}
```

Response:

```json
{
  "message": "Document ingested",
  "id": "uuid-here",
  "text_length": 150
}
```

**POST /query**

Query the RAG system.
Request Body:

```json
{
  "prompt": "What is Kubernetes?",
  "top_k": 3
}
```

Response:

```json
{
  "query": "What is Kubernetes?",
  "matches": [...],
  "response": "Kubernetes is..."
}
```

**GET /health**

Health check endpoint.
Response:

```json
{
  "status": "ok",
  "k8s_enabled": true
}
```

**GET /k8s/cluster/cpu**

Get cluster-wide CPU usage.
Response:

```json
{
  "cluster_cpu": {
    "total_usage_cores": 2.5,
    "total_capacity_cores": 8.0,
    "utilization_percent": 31.25
  },
  "nodes": [...]
}
```

**GET /k8s/cluster/memory**

Get cluster-wide memory usage.
Response:

```json
{
  "cluster_memory": {
    "total_usage_gi": 4.2,
    "total_capacity_gi": 16.0,
    "utilization_percent": 26.25
  },
  "nodes": [...]
}
```

**GET /k8s/pods**

List pods with optional filtering.
Query Parameters:

- `namespace` (string): Namespace to query (default: "all")
- `label_selector` (string): Label selector (e.g., "app=nginx")

Response:

```json
{
  "count": 5,
  "pods": [
    {
      "name": "pod-name",
      "namespace": "default",
      "status": "Running",
      "node": "node-1",
      "ip": "10.244.0.5"
    }
  ]
}
```

**GET /k8s/cluster/info**

Get general cluster information.
Response:

```json
{
  "version": "v1.28.0",
  "nodes_count": 3,
  "namespaces_count": 12,
  "k8s_enabled": true
}
```

```
k8s-rag-agent/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Container image definition
├── .gitignore                   # Git ignore rules
│
├── src/                         # Source code
│   ├── __init__.py
│   ├── main.py                  # Basic RAG agent
│   ├── unified_server.py        # Unified RAG + K8s server
│   └── k8s_mcp_server.py        # Standalone MCP server
│
├── examples/                    # Usage examples
│   └── k8s_rag_example.py       # Cluster-aware RAG demo
│
├── k8s/                         # Kubernetes manifests
│   ├── namespace.yaml           # llm-chaos namespace
│   ├── pvc.yaml                 # Persistent volume claims
│   ├── vectorDB.yaml            # Qdrant deployment
│   ├── ollama.yaml              # Ollama deployment
│   ├── llm-agent.yaml           # LLM agent deployment
│   ├── k8s-mcp-rbac.yaml        # RBAC permissions
│   ├── k8s-mcp-server.yaml      # Standalone MCP server
│   └── llm-agent-with-mcp.yaml  # Agent + MCP sidecar
│
└── docs/                        # Documentation
    └── DEPLOYMENT_STEPS.md      # Detailed deployment guide
```
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_URL` | `http://ollama:11434/api/generate` | Ollama generation endpoint |
| `OLLAMA_EMBED_URL` | `http://ollama:11434/api/embeddings` | Ollama embeddings endpoint |
| `QDRANT_URL` | `http://qdrant:6333` | Qdrant vector database URL |
| `COLLECTION_NAME` | `rag_memory` | Qdrant collection name |
The agent requires the following permissions:
- `get`, `list`, `watch` on nodes
- `get`, `list`, `watch` on pods (all namespaces)
- Access to the metrics.k8s.io API group
See k8s/k8s-mcp-rbac.yaml for full RBAC configuration.
Required models:
- `tinyllama`: LLM for text generation
- `all-minilm`: Embedding model (384 dimensions)
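Retrieval works by comparing these 384-dimensional embedding vectors. As an illustrative sketch, here is cosine similarity, a common choice for this comparison (the actual distance metric is configured on the Qdrant collection, which supports several):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product normalised by the two vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Identical vectors score 1.0; orthogonal vectors score 0.0
v = [0.5] * 384
print(round(cosine_similarity(v, v), 6))  # 1.0
```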
```bash
ollama pull tinyllama
ollama pull all-minilm
```

```bash
# Unit tests
pytest tests/

# Integration tests
kubectl port-forward -n llm-chaos svc/llm-agent 8000:8000
python examples/k8s_rag_example.py 1
```

```bash
# Check all pods are running
kubectl get pods -n llm-chaos

# Check services
kubectl get svc -n llm-chaos

# Test health endpoint
kubectl port-forward -n llm-chaos svc/llm-agent 8000:8000
curl http://localhost:8000/health

# Test K8s integration
curl http://localhost:8000/k8s/cluster/info
```

**Cause:** RBAC not configured or kubeconfig missing
**Fix:**

```bash
kubectl apply -f k8s/k8s-mcp-rbac.yaml
kubectl rollout restart deployment/llm-agent -n llm-chaos
```

**Cause:** Metrics Server not installed in the cluster

**Fix:**

```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl wait --for=condition=available --timeout=60s deployment/metrics-server -n kube-system
```

**Cause:** Ollama service not ready or wrong URL
**Fix:**

```bash
# Check the Ollama pod
kubectl get pods -n llm-chaos -l app=ollama

# Check logs
kubectl logs -n llm-chaos -l app=ollama

# Verify models are loaded
kubectl exec -n llm-chaos -it <ollama-pod> -- ollama list
```

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Run linters
black src/ examples/
flake8 src/ examples/
mypy src/

# Run tests
pytest tests/ -v
```

This project is licensed under the MIT License - see the LICENSE file for details.
- FastAPI - Modern Python web framework
- Ollama - Local LLM inference
- Qdrant - Vector database
- Kubernetes - Container orchestration
- MCP Protocol - Model Context Protocol
If you find this project useful, please consider giving it a star! ⭐

Built with ❤️ for the Kubernetes and AI community