- System Overview
- Architecture Layers
- Component Details
- Data Flow
- Agent System
- Storage Layer
- API Design
- Performance Characteristics
LineageGraph is a semantic data lineage engine that enables natural language queries about data dependencies. The system is designed to run entirely locally with zero external API costs.
- Local-First: All components run locally (LLM, databases, services)
- Zero Cost: No paid APIs or cloud services required
- Modular: Components can be swapped (e.g., Ollama → OpenAI, DuckDB → Pinecone)
- Observable: Built-in OpenTelemetry tracing for debugging
graph TB
subgraph Presentation["Presentation Layer"]
Frontend[React Frontend<br/>Vite Dev Server<br/>Port 5173]
end
subgraph Application["Application Layer"]
API[FastAPI Backend<br/>REST API<br/>Port 8000]
end
subgraph Agent["Agent Orchestration Layer"]
AgentGraph[LangGraph Agent<br/>State Machine]
Plan[Plan Node]
Investigate[Investigate Node]
Tool[Tool Node]
Synthesize[Synthesize Node]
AgentGraph --> Plan
Plan --> Investigate
Investigate --> Tool
Tool --> Synthesize
end
subgraph Services["Service Layer"]
LLM[Ollama LLM<br/>Mistral 7B<br/>Port 11434]
Vector[DuckDB<br/>Vector Search<br/>File-based]
Graph[PostgreSQL<br/>Graph Database<br/>Port 5432]
Tracing[Jaeger<br/>OpenTelemetry<br/>Port 16686]
end
Frontend -->|HTTP REST| API
API --> AgentGraph
AgentGraph --> LLM
AgentGraph --> Vector
AgentGraph --> Graph
AgentGraph -.->|Traces| Tracing
style Frontend fill:#e1f5ff
style API fill:#fff4e1
style AgentGraph fill:#ffe1f5
style LLM fill:#e1ffe1
style Vector fill:#f5e1ff
style Graph fill:#ffe1e1
sequenceDiagram
participant User
participant Frontend
participant API
participant Agent
participant VectorDB
participant GraphDB
participant LLM
User->>Frontend: Natural Language Query
Frontend->>API: POST /api/query
API->>Agent: Initialize Agent State
Agent->>Agent: Plan Node (Analyze Query)
Agent->>LLM: Generate Investigation Plan
Agent->>Agent: Investigate Node (Select Tools)
Agent->>LLM: Determine Tool Selection
Agent->>VectorDB: search_vector_db(query)
VectorDB-->>Agent: Relevant Tables
Agent->>GraphDB: get_table_dependencies(table_id)
GraphDB-->>Agent: Dependency Tree
Agent->>Agent: Synthesize Node
Agent->>LLM: Generate Final Answer
LLM-->>Agent: Natural Language Response
Agent-->>API: Agent State with Answer
API-->>Frontend: JSON Response
Frontend-->>User: Display Results
Technology: React 18 + Vite 5
Components:
QueryInterface: Natural language query input with validationResultsDisplay: Formatted answer display with confidence scoresLineageVisualization: Graph visualization (planned)
Responsibilities:
- User interface for querying
- Displaying results in readable format
- Visualizing lineage graphs
- Error handling and loading states
Location: frontend/src/
Technology: FastAPI 0.109.0 (Python)
Endpoints:
GET /health: Health check endpointPOST /api/query: Main query endpoint with request validation
Responsibilities:
- HTTP request/response handling
- Input validation using Pydantic models
- Error handling and exception management
- CORS configuration for development
Location: src/main.py
Technology: LangGraph 0.0.32
Agent Graph Structure:
stateDiagram-v2
[*] --> START
START --> Plan: Initialize State
Plan --> Investigate: Plan Created
Investigate --> Tool: Tool Selected
Tool --> CheckContinue: Tool Executed
CheckContinue --> Investigate: More Tools Needed
CheckContinue --> Synthesize: Enough Information
Synthesize --> Done: Answer Generated
Done --> [*]
note right of Plan
Analyzes query
Creates execution plan
Determines required tools
end note
note right of Investigate
Selects appropriate tools
Prepares tool inputs
Decides investigation strategy
end note
note right of Tool
Executes selected tools
Handles tool results
Updates agent state
end note
note right of Synthesize
Combines tool results
Generates final answer
Calculates confidence score
end note
Nodes:
-
Plan Node (
plan_node)- Analyzes user query using LLM
- Creates execution plan
- Determines required tools and investigation strategy
-
Investigate Node (
investigate_node)- Selects appropriate tools based on plan
- Prepares tool inputs
- Decides investigation strategy
-
Tool Execution (
tool_node)- Executes selected tools
- Handles tool results
- Updates agent state with results
-
Synthesize Node (
synthesize_node)- Combines tool results
- Generates final answer using LLM
- Calculates confidence score
State Management:
AgentState: Typed state object with all agent information- Fields:
user_query,current_step,plan,next_tool,tool_results,final_answer,confidence_score,step_count,tool_calls_made
Location: src/agents/
The agent has access to six specialized tools:
graph LR
subgraph Tools["Agent Tools"]
T1[search_vector_db<br/>Semantic Search]
T2[get_table_dependencies<br/>Graph Traversal]
T3[validate_lineage_path<br/>Path Validation]
T4[get_node_metadata<br/>Node Information]
T5[trace_data_flow<br/>Flow Tracing]
T6[check_data_freshness<br/>Freshness Check]
end
subgraph DataStores["Data Stores"]
VDB[(DuckDB<br/>Vector DB)]
GDB[(PostgreSQL<br/>Graph DB)]
end
T1 --> VDB
T2 --> GDB
T3 --> GDB
T4 --> GDB
T5 --> GDB
T6 --> GDB
style T1 fill:#e1f5ff
style T2 fill:#fff4e1
style T3 fill:#ffe1f5
style T4 fill:#e1ffe1
style T5 fill:#f5e1ff
style T6 fill:#ffe1e1
Tool Details:
-
search_vector_db
- Semantic search over table descriptions
- Returns relevant tables with similarity scores
- Uses DuckDB vector search with cosine similarity
-
get_table_dependencies
- Recursive graph query for upstream dependencies
- Returns dependency tree with depth information
- Uses PostgreSQL recursive CTEs
-
validate_lineage_path
- Validates if a path exists between two nodes
- Returns boolean result with path length
-
get_node_metadata
- Retrieves metadata for a specific node
- Returns node information including type and description
-
trace_data_flow
- Traces complete data flow path
- Returns path with intermediate nodes
-
check_data_freshness
- Checks data freshness score
- Returns freshness metrics and last updated timestamp
Location: src/agents/tools.py
Technology: DuckDB 1.4.3 + Sentence Transformers 5.2.0
Components:
VectorStore: DuckDB-based vector databaseLocalEmbedder: Sentence-transformers embedder (all-MiniLM-L6-v2, 384 dimensions)
Schema:
CREATE TABLE embeddings (
id VARCHAR PRIMARY KEY,
text VARCHAR,
table_name VARCHAR,
column_names VARCHAR,
source_type VARCHAR,
created_at TIMESTAMP
);
CREATE TABLE vectors (
id VARCHAR PRIMARY KEY,
embedding DOUBLE[],
FOREIGN KEY (id) REFERENCES embeddings(id)
);Operations:
add_embedding(): Store text and embeddingsearch(): Cosine similarity search with configurable limit
Location: src/vector/
Technology: PostgreSQL 15
Schema:
CREATE TABLE nodes (
id VARCHAR PRIMARY KEY,
node_type VARCHAR, -- 'Table', 'Dashboard', 'Query'
name VARCHAR,
description TEXT,
metadata JSONB,
created_at TIMESTAMP
);
CREATE TABLE edges (
id SERIAL PRIMARY KEY,
source_id VARCHAR REFERENCES nodes(id),
target_id VARCHAR REFERENCES nodes(id),
edge_type VARCHAR, -- 'FEEDS_INTO', 'DEPENDS_ON'
strength FLOAT,
created_at TIMESTAMP
);
-- Indexes for performance
CREATE INDEX idx_edges_source ON edges(source_id);
CREATE INDEX idx_edges_target ON edges(target_id);Operations:
add_node(): Add a node to the graphadd_edge(): Add a relationship between nodesget_dependencies(): Recursive upstream query using WITH RECURSIVE
Location: src/graph/
Technology: Ollama
Model: Mistral 7B (default)
API: HTTP REST API at http://localhost:11434/api/generate
Configuration:
- Temperature: 0.3 (for factual answers)
- Model: mistral
- Timeout: 60 seconds
- Inference Speed: ~40 tokens/second on M2
Location: src/main.py (call_local_llm function)
flowchart TD
Start([User Query]) --> Frontend[Frontend Receives Query]
Frontend --> API[FastAPI /api/query Endpoint]
API --> AgentInit[Agent Initialization]
AgentInit --> Plan[Plan Node]
Plan --> PlanLLM[LLM: Generate Plan]
PlanLLM --> PlanResult[Plan Created]
PlanResult --> Investigate[Investigate Node]
Investigate --> InvLLM[LLM: Select Tools]
InvLLM --> ToolSelect[Tools Selected]
ToolSelect --> ToolExec[Tool Execution]
ToolExec --> VectorSearch[Vector Search Tool]
ToolExec --> GraphQuery[Graph Query Tool]
VectorSearch --> DuckDB[(DuckDB)]
GraphQuery --> PostgreSQL[(PostgreSQL)]
DuckDB --> VectorResults[Vector Results]
PostgreSQL --> GraphResults[Graph Results]
VectorResults --> StateUpdate[Update Agent State]
GraphResults --> StateUpdate
StateUpdate --> CheckConf{Confidence<br/>Threshold?}
CheckConf -->|Low| Investigate
CheckConf -->|High| Synthesize
Synthesize --> SynLLM[LLM: Generate Answer]
SynLLM --> FinalAnswer[Final Answer]
FinalAnswer --> Response[API Response]
Response --> Frontend
Frontend --> Display([Display Results])
style Start fill:#e1f5ff
style Plan fill:#fff4e1
style Investigate fill:#ffe1f5
style ToolExec fill:#e1ffe1
style Synthesize fill:#f5e1ff
style Display fill:#ffe1e1
graph TB
subgraph AgentState["Agent State"]
Query[User Query]
Plan[Execution Plan]
Results[Tool Results]
end
subgraph VectorTool["Vector Search Tool"]
Embed[Embed Query Text]
Search[Search DuckDB]
Rank[Rank by Similarity]
end
subgraph GraphTool["Graph Query Tool"]
Parse[Parse Table ID]
Recursive[Recursive CTE Query]
Build[Build Dependency Tree]
end
Query --> Embed
Embed --> Search
Search --> Rank
Rank --> Results
Query --> Parse
Parse --> Recursive
Recursive --> Build
Build --> Results
Results --> AgentState
style AgentState fill:#e1f5ff
style VectorTool fill:#fff4e1
style GraphTool fill:#ffe1f5
The agent uses a state machine pattern with LangGraph:
States:
planning: Creating execution planinvestigating: Selecting and preparing toolsexecuting: Running toolssynthesizing: Generating final answerdone: Query complete
State Transitions:
stateDiagram-v2
[*] --> Planning
Planning --> Investigating: Plan Created
Investigating --> Executing: Tool Selected
Executing --> Investigating: More Tools Needed
Executing --> Synthesizing: Sufficient Information
Synthesizing --> Done: Answer Generated
Done --> [*]
Executing --> Done: Max Steps Reached
Executing --> Done: Max Tools Reached
The agent uses LLM-based tool selection:
- Query analyzed in
investigate_node - LLM determines which tools are needed based on query intent
- Tools executed in sequence based on plan
- Results accumulated in agent state
- Confidence threshold determines if more investigation is needed
- Maximum recursion depth: 10 steps
- Maximum tool calls: 3 per query
- Prevents infinite loops
- Graceful degradation on recursion limit
File: semantic_lineage.duckdb
Characteristics:
- File-based (no server needed)
- In-memory operations for fast queries
- Fast cosine similarity search
- Suitable for <10K vectors
- Zero network latency
Backup: File can be copied for backup
Database: semantic_lineage
Characteristics:
- ACID compliance
- Recursive CTEs for graph traversal
- Indexes on source_id and target_id for performance
- JSONB for flexible metadata storage
- Supports complex graph queries
Connection: Configurable via DATABASE_URL environment variable
Base URL: http://localhost:8000
Endpoints:
-
GET /health
- Purpose: Health check
- Response:
{"status": "ok"}
-
POST /api/query
- Purpose: Execute lineage query
- Request Body:
{ "query": "What feeds into revenue dashboard?", "depth": 3 } - Response:
{ "query": "...", "answer": "...", "context_docs": [...], "lineage_path": {...}, "confidence": 0.85 }
Direct Function Call:
from src.agents.graph import run_agent
result = run_agent("What feeds into revenue?", verbose=False)Response Structure:
{
"user_query": "...",
"current_step": "done",
"final_answer": "...",
"confidence_score": 0.85,
"tool_calls_made": [...],
"tool_results": {...}
}| Component | Typical Latency | Notes |
|---|---|---|
| Vector Search | <10ms | DuckDB in-memory |
| Graph Query | <50ms | PostgreSQL with indexes |
| LLM Inference | 2-5 seconds | Ollama on M2 (40 tokens/sec) |
| Total Query | 3-6 seconds | End-to-end |
gantt
title Query Processing Timeline
dateFormat X
axisFormat %Ls
section Request
HTTP Request :0, 10
API Validation :10, 20
section Agent
Plan Node :20, 500
Investigate Node :500, 1000
Tool Execution :1000, 1100
Synthesize Node :1100, 3500
section Response
Format Response :3500, 3510
HTTP Response :3510, 3520
Current Limits:
- Vector DB: ~10K embeddings (DuckDB file-based)
- Graph DB: ~100K nodes/edges (PostgreSQL)
- Concurrent Queries: Limited by Ollama (single instance)
Scaling Options:
- Vector: Migrate to Pinecone or pgvector
- Graph: Add read replicas or use Neo4j
- LLM: Use OpenAI API or multiple Ollama instances
Memory:
- Ollama: ~4-8GB (Mistral 7B model)
- PostgreSQL: ~100MB (sample data)
- DuckDB: ~50MB (sample data)
- Python: ~200MB
Disk:
- Ollama model: ~4.4GB
- DuckDB file: ~10MB
- PostgreSQL data: ~50MB
Setup:
export TRACING_ENABLED=true
docker run -p 16686:16686 jaegertracing/all-in-one:latestTrace Structure:
graph TD
Root[Agent Execution] --> PlanSpan[Plan Node Span]
Root --> InvSpan[Investigate Node Span]
Root --> ToolSpan[Tool Execution Span]
Root --> SynSpan[Synthesize Node Span]
PlanSpan --> PlanLLM[LLM Inference Span]
InvSpan --> InvLLM[LLM Inference Span]
ToolSpan --> VectorTool[Vector Search Tool Span]
ToolSpan --> GraphTool[Graph Query Tool Span]
SynSpan --> SynLLM[LLM Inference Span]
style Root fill:#e1f5ff
style PlanSpan fill:#fff4e1
style InvSpan fill:#ffe1f5
style ToolSpan fill:#e1ffe1
style SynSpan fill:#f5e1ff
Traces:
- Agent execution spans
- Tool call spans with inputs/outputs
- LLM inference spans with prompts/responses
View: http://localhost:16686
Location: src/agents/tracing.py
- CORS enabled for local development (
allow_origins=["*"]) - No authentication (local-only deployment)
- Database credentials in environment variables
- Input validation using Pydantic models
- Implement authentication (JWT, OAuth)
- Restrict CORS origins to specific domains
- Use connection pooling for databases
- Add rate limiting to prevent abuse
- Encrypt sensitive data at rest
- Use HTTPS for all communications
- Implement API key authentication
- Neo4j Integration: Native graph database for complex queries
- Caching Layer: Redis for query result caching
- Batch Processing: Support for bulk queries
- Advanced Visualization: Interactive lineage graphs with D3.js
- Multi-Model Support: Support for multiple LLM providers
- Query History: Store and replay previous queries
- User Authentication: Multi-user support with role-based access
- Real-time Updates: WebSocket support for live lineage updates