LineageGraph: Semantic Data Lineage Engine

A local-first, zero-cost AI-powered system for querying and understanding data lineage using natural language. Built with local LLMs, vector search, and graph databases.

🎯 Overview

LineageGraph enables users to ask natural language questions about data dependencies and lineage, such as:

  • "What feeds into the revenue dashboard?"
  • "Which tables are upstream dependencies of revenue_daily?"
  • "Can you trace the complete data flow from orders to revenue?"

The system combines:

  • Vector Search (DuckDB) for semantic similarity matching
  • Graph Database (PostgreSQL) for dependency traversal
  • Local LLM (Ollama) for natural language understanding
  • LangGraph Agent for intelligent query planning and execution
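
At a high level, a query passes through these layers in sequence: embed the question, find candidate tables by semantic similarity, traverse their dependencies in the graph, and synthesize an answer with the LLM. A minimal sketch of that flow, using hypothetical stand-in functions rather than the project's actual APIs:

```python
# Illustrative sketch of the query flow. embed(), vector_search(),
# traverse_upstream(), and synthesize() are hypothetical stand-ins,
# not the repository's real functions.

def embed(text: str) -> list[float]:
    # Stand-in for a sentence-transformers embedding.
    return [float(len(text))]

def vector_search(query_vec: list[float]) -> list[str]:
    # Stand-in for DuckDB semantic search over table metadata.
    return ["revenue_daily"]

def traverse_upstream(table: str) -> list[str]:
    # Stand-in for a PostgreSQL recursive dependency query.
    return ["orders", "payments"]

def synthesize(question: str, tables: list[str], deps: list[str]) -> str:
    # Stand-in for the Ollama LLM synthesis step.
    return f"{', '.join(deps)} feed into {tables[0]}."

def answer(question: str) -> str:
    vec = embed(question)
    tables = vector_search(vec)
    deps = traverse_upstream(tables[0])
    return synthesize(question, tables, deps)
```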

✨ Features

  • πŸ” Natural Language Queries: Ask questions about data lineage in plain English
  • 🧠 Intelligent Agent: LangGraph-based agent that plans, investigates, and synthesizes answers
  • πŸ“Š GraphRAG: Grounds answers in structured graph data for accuracy
  • πŸ”Ž Vector Search: Semantic search over table descriptions and metadata
  • 🎨 Interactive Frontend: React-based UI for querying and visualization
  • πŸ“ˆ Evaluation Suite: Comprehensive test harness with golden dataset
  • πŸ”¬ Observability: OpenTelemetry tracing for debugging and monitoring
  • πŸ’° Zero Cost: Runs entirely locally, no API costs

πŸ—οΈ Architecture

graph TB
    subgraph Frontend["Frontend Layer"]
        UI[React UI<br/>Query Interface<br/>Port 5173]
    end
    
    subgraph Backend["Backend Layer"]
        API[FastAPI<br/>REST API<br/>Port 8000]
    end
    
    subgraph Agent["Agent Layer"]
        LangGraph[LangGraph Agent<br/>State Machine]
    end
    
    subgraph Services["Services"]
        LLM[Ollama LLM<br/>Mistral 7B]
        Vector[DuckDB<br/>Vector Search]
        Graph[PostgreSQL<br/>Graph Database]
    end
    
    UI -->|HTTP REST| API
    API --> LangGraph
    LangGraph --> LLM
    LangGraph --> Vector
    LangGraph --> Graph
    
    style Frontend fill:#e1f5ff
    style Backend fill:#fff4e1
    style Agent fill:#ffe1f5
    style Services fill:#e1ffe1

For detailed architecture documentation, see the docs/ directory.

πŸš€ Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • PostgreSQL 15+
  • Ollama (for local LLM)
  • Homebrew (macOS) or equivalent package manager

Installation

  1. Clone the repository:

    git clone https://github.com/yxshwanth/LineageGraph.git
    cd LineageGraph
  2. Install Python dependencies:

    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  3. Install frontend dependencies:

    cd frontend
    npm install
    cd ..
  4. Set up services:

    # Start PostgreSQL
    brew services start postgresql@15
    
    # Start Ollama
    brew services start ollama
    
    # Download Mistral model
    ollama pull mistral
  5. Load sample data:

    source venv/bin/activate
    python src/graph/loader.py
    python src/vector/loader.py

Running the Application

Option 1: Using the management script (recommended)

# Start infrastructure services
./scripts/manage.sh start

# Start backend (in terminal 1)
source venv/bin/activate
python src/main.py

# Start frontend (in terminal 2)
cd frontend
npm run dev

Option 2: Using Make

# Start infrastructure
make start

# Start backend
make backend

# Start frontend (in another terminal)
make frontend

Option 3: Manual

# Terminal 1: Backend
source venv/bin/activate
python src/main.py

# Terminal 2: Frontend
cd frontend
npm run dev

The application will be available at:

  • Frontend: http://localhost:5173
  • Backend API: http://localhost:8000

πŸ“– Usage

API Endpoint

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What feeds into the revenue dashboard?",
    "depth": 3
  }'
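
The same request can be made from Python. This sketch builds the request body shown in the curl example above; the actual POST (commented out) would use the `requests` library and needs the backend running on port 8000:

```python
import json

def build_query_payload(query: str, depth: int = 3) -> dict:
    """Build the request body shown in the curl example above."""
    return {"query": query, "depth": depth}

payload = build_query_payload("What feeds into the revenue dashboard?")
print(json.dumps(payload))

# To send it against a running backend:
#   import requests
#   resp = requests.post("http://localhost:8000/api/query", json=payload)
#   print(resp.json())
```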

Agent API

from src.agents.graph import run_agent

result = run_agent("What feeds into the revenue dashboard?", verbose=True)
print(result["final_answer"])

Frontend

Open http://localhost:5173 in your browser and use the query interface to ask questions about data lineage.

πŸ§ͺ Testing

Run the test suite:

# All tests
pytest tests/ -v

# Unit tests
make test-unit

# Integration tests
make test-integration

# Evaluation pipeline
pytest tests/test_evaluation_pipeline.py -v

πŸ“ Project Structure

LineageGraph/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ agents/          # LangGraph agent implementation
β”‚   β”‚   β”œβ”€β”€ graph.py      # Agent graph definition
β”‚   β”‚   β”œβ”€β”€ nodes.py      # Agent nodes (plan, investigate, synthesize)
β”‚   β”‚   β”œβ”€β”€ tools.py      # Agent tools (vector search, graph queries)
β”‚   β”‚   └── state.py      # Agent state management
β”‚   β”œβ”€β”€ graph/            # Graph database layer
β”‚   β”‚   β”œβ”€β”€ schema.py     # PostgreSQL schema and queries
β”‚   β”‚   └── loader.py     # Sample data loader
β”‚   β”œβ”€β”€ vector/           # Vector search layer
β”‚   β”‚   β”œβ”€β”€ database.py   # DuckDB vector store
β”‚   β”‚   β”œβ”€β”€ embeddings.py # Sentence-transformers embedder
β”‚   β”‚   └── loader.py     # Sample data loader
β”‚   └── main.py           # FastAPI application
β”œβ”€β”€ frontend/              # React frontend
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ App.jsx       # Main app component
β”‚   β”‚   └── components/   # UI components
β”œβ”€β”€ tests/                 # Test suite
β”‚   β”œβ”€β”€ test_agent_*.py   # Agent tests
β”‚   β”œβ”€β”€ test_evaluation_pipeline.py  # Evaluation tests
β”‚   └── data/             # Golden dataset
β”œβ”€β”€ docs/                  # Documentation
β”œβ”€β”€ scripts/               # Utility scripts
β”‚   └── manage.sh         # Service management
└── requirements.txt       # Python dependencies

πŸ”§ Configuration

Environment Variables

# Database connection
export DATABASE_URL="postgresql://postgres:postgres@localhost/semantic_lineage"

# Enable OpenTelemetry tracing
export TRACING_ENABLED=true
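
In Python these can be read with sensible fallbacks. A sketch (the variable names come from the block above; the defaults are assumptions):

```python
import os

# Fall back to the local development connection string when unset.
DATABASE_URL = os.environ.get(
    "DATABASE_URL",
    "postgresql://postgres:postgres@localhost/semantic_lineage",
)

# Tracing is off unless explicitly enabled.
TRACING_ENABLED = os.environ.get("TRACING_ENABLED", "false").lower() == "true"
```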

Service Management

See SERVICE_MANAGEMENT.md for detailed service management instructions.

πŸ“Š Evaluation

The project includes a comprehensive evaluation harness:

  • Golden Dataset: 20+ test cases covering various query types
  • Metrics: Pass rate, node recall, answer relevance
  • Thresholds: 70% pass rate, 70% node recall, 65% answer relevance
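
Node recall, for example, measures the fraction of expected lineage nodes the agent actually returned. A minimal illustration (the metric name comes from the list above; this implementation is an assumption, not the harness's actual code):

```python
def node_recall(predicted: set[str], expected: set[str]) -> float:
    """Fraction of expected lineage nodes present in the agent's answer."""
    if not expected:
        return 1.0
    return len(predicted & expected) / len(expected)

# e.g. the agent found 2 of the 3 expected upstream tables:
recall = node_recall({"orders", "payments"},
                     {"orders", "payments", "customers"})
# recall == 2/3, below a 70% node-recall threshold
```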

Run evaluation:

pytest tests/test_evaluation_pipeline.py -v

πŸ› οΈ Development

Adding New Tools

  1. Define the tool in src/agents/tools.py:

    from typing import Any, Dict

    @tool("my_new_tool")
    def my_new_tool(param: str) -> Dict[str, Any]:
        """Tool description"""
        # Implementation
        return {"success": True, "result": ...}
  2. Add to ALL_TOOLS in src/agents/tools.py

  3. The agent will automatically discover and use it
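
A complete hypothetical example, with a minimal stand-in for the decorator so the snippet is self-contained (in the repository, the real @tool lives in src/agents/tools.py, and the tool below is illustrative):

```python
from typing import Any, Callable, Dict

def tool(name: str) -> Callable:
    """Minimal stand-in for the project's @tool decorator."""
    def wrap(fn: Callable) -> Callable:
        fn.tool_name = name  # tag the function so a registry can find it
        return fn
    return wrap

@tool("count_upstream")
def count_upstream(table: str) -> Dict[str, Any]:
    """Hypothetical tool: count direct upstream dependencies of a table."""
    upstream = {"revenue_daily": ["orders", "payments"]}  # stand-in data
    deps = upstream.get(table, [])
    return {"success": True, "result": len(deps)}
```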

Adding New Data

  1. Graph data: Use src/graph/loader.py as a template
  2. Vector data: Use src/vector/loader.py as a template
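
A loader script ultimately boils down to inserting nodes and edges. A data-shaping sketch (the table names and edge format are illustrative, not the actual schema in src/graph/schema.py):

```python
# Turn a child -> parents mapping into (parent, child) edge rows,
# ready for a loader to INSERT into the dependency graph.
lineage = {
    "revenue_daily": ["orders", "payments"],
    "revenue_dashboard": ["revenue_daily"],
}

edges = [
    (parent, child)
    for child, parents in lineage.items()
    for parent in parents
]
```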

🚒 CI/CD

GitHub Actions automatically runs:

  • Unit tests
  • Integration tests
  • Evaluation pipeline (optional, slow)

See .github/workflows/test.yml for details.

πŸ“š Documentation

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is open source and available under the MIT License.

πŸ™ Acknowledgments

πŸ“§ Support

For questions, issues, or contributions, please open an issue on GitHub.


Built with ❀️ for zero-cost AI systems
