This document provides an overview of the Docling RAG system's code structure, component relationships, and architecture.
The Docling RAG system follows a layered architecture pattern:
- Interface Layer: CLI and API interfaces (main.py, fastapi_app.py, server.py)
- Orchestration Layer: System coordination (rag_orchestrator_api.py, docling_rag_orchestrator.py)
- Processing Layer: Document processing and search (docling_processor.py, vector_indexer.py, etc.)
- Storage Layer: Document and vector storage management
User Interfaces (CLI, API)
↓ ↑
Orchestrators
↓ ↑
Component Services/Processors
↓ ↑
Data Management
docling-rag/
├── config.py # Configuration settings
├── config.json # JSON configuration file
├── main.py # Command-line interface
├── fastapi_app.py # FastAPI application
├── server.py # Standalone server script
├── rag_orchestrator_api.py # API orchestrator
├── doclingroc/ # Document processing components
│ ├── __init__.py
│ ├── docling_processor.py # Document processing with Docling
│ └── docling_rag_orchestrator.py # Main RAG orchestrator
├── vectorproc/ # Vector processing components
│ ├── __init__.py
│ ├── vector_indexer.py # Vector database management
│ └── semantic_search.py # Semantic search implementation
├── searchproc/ # Search processing components
│ ├── __init__.py
│ └── hybrid_search.py # Hybrid search implementation
├── utils/ # Utility functions
│ ├── __init__.py
│ ├── document_registry_fix.py # Document registry repair utility
│ └── common.py # Common utility functions
├── data/ # Document storage
├── processed_data/ # Processed document storage
│ ├── chunks/ # Chunked documents
│ └── docling/ # Docling output
└── vector_db/ # Vector database storage
- Purpose: Provides command-line interface
- Functions: Process, index, search, ask, system status
- Relationships: Uses RAGOrchestratorAPI
- Purpose: Provides REST API
- Endpoints: Document processing, indexing, search, QA
- Relationships: Uses RAGOrchestratorAPI
- Purpose: High-level API for the RAG system
- Functions: API-friendly methods for document processing, search, QA
- Relationships: Uses DoclingEnhancedRAGOrchestrator
- Purpose: Core RAG system orchestration
- Functions: Document processing, indexing, search, QA
- Relationships: Uses processors, indexers, and search engines
- Purpose: Document processing and chunking
- Functions: Process documents, chunk text, clean text
- Relationships: Used by DoclingEnhancedRAGOrchestrator
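The "clean text" and "chunk text" functions can be illustrated with a small sliding-window chunker. This is only a sketch: the function names, the character-based window, and the default sizes are assumptions, and the real processor may chunk by tokens or by document structure instead.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Collapse runs of whitespace so chunk boundaries are predictable."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    text = clean_text(text)
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.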
- Purpose: Vector database management
- Functions: Index documents, manage vector registry
- Relationships: Used by DoclingEnhancedRAGOrchestrator
- Purpose: Semantic search implementation
- Functions: Search for documents by meaning
- Relationships: Uses VectorIndexer
- Purpose: Combine semantic and lexical search
- Functions: Hybrid search, reranking
- Relationships: Uses SemanticSearchEngine
- Purpose: Rerank search results based on contextual relevance
- Functions: Rerank search results
- Relationships: Used by search engines
- Purpose: System configuration
- Functions: Load and manage configuration settings
- Relationships: Used by all components
- Purpose: Common utility functions
- Functions: File operations, timing, error handling
- Relationships: Used by all components
- Document Processing Flow: DoclingProcessor → Document Registry → Chunks
- Indexing Flow: Chunks → VectorIndexer → Vector Database → Vector Registry
- Search Flow: Query → SemanticSearch/HybridSearch → Results → Reranker → Ranked Results
- QA Flow: Question → Search → Context Retrieval → LLM → Answer
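The QA flow can be sketched as a single function that threads a question through search, context assembly, and a language model call. Everything here is hypothetical: answer_question, the hit dictionaries, the prompt wording, and the two stand-in callables are illustrative and not taken from the codebase.

```python
from typing import Callable, Dict, List

# Assumed shapes: a search callable returns hits with "text" and "score" keys,
# and an LLM callable maps a prompt string to an answer string.
SearchFn = Callable[[str, int], List[Dict]]
LlmFn = Callable[[str], str]


def answer_question(question: str, search: SearchFn, llm: LlmFn, top_k: int = 3) -> Dict:
    """Question -> Search -> Context Retrieval -> LLM -> Answer."""
    hits = search(question, top_k)                      # Search
    context = "\n\n".join(hit["text"] for hit in hits)  # Context retrieval
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"answer": llm(prompt), "sources": hits}     # LLM -> Answer


def fake_search(query: str, top_k: int) -> List[Dict]:
    """Stand-in for the semantic or hybrid search engine."""
    corpus = [
        {"text": "Docling converts documents into structured text.", "score": 0.9},
        {"text": "Chunks are stored under processed_data/chunks/.", "score": 0.4},
    ]
    return corpus[:top_k]


def fake_llm(prompt: str) -> str:
    """Stand-in for the language model; it only reports the prompt size."""
    return f"stub answer ({len(prompt)} prompt characters)"


result = answer_question("What does Docling do?", fake_search, fake_llm, top_k=1)
```

Returning the retrieved hits alongside the answer lets a caller show which chunks the answer was grounded in.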
FastAPI App / CLI
↓
RAGOrchestratorAPI
↓
DoclingEnhancedRAGOrchestrator
↓
┌───────┬─────────┬──────────┐
↓ ↓ ↓ ↓
DoclingProcessor VectorIndexer SemanticSearch HybridSearch
↑ ↑ ↑
└──────────────┘ │
│
ContextualReranker
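One way to read the diagram is as constructor-level composition, where each layer holds a reference to the layer below it. The class names come from this document, but the constructor signatures and attribute names in this sketch are assumptions made for illustration.

```python
class VectorIndexer:
    def __init__(self) -> None:
        self.vectors = {}  # stand-in for a real vector store


class SemanticSearchEngine:
    def __init__(self, indexer: VectorIndexer) -> None:
        self.indexer = indexer  # "Uses VectorIndexer"


class HybridSearch:
    def __init__(self, semantic: SemanticSearchEngine) -> None:
        self.semantic = semantic  # "Uses SemanticSearchEngine"


class DoclingEnhancedRAGOrchestrator:
    def __init__(self) -> None:
        # The orchestrator owns one indexer and shares it with the search engines.
        self.indexer = VectorIndexer()
        self.semantic = SemanticSearchEngine(self.indexer)
        self.hybrid = HybridSearch(self.semantic)


class RAGOrchestratorAPI:
    def __init__(self, orchestrator: DoclingEnhancedRAGOrchestrator = None) -> None:
        # Accepting an existing orchestrator makes the API layer easy to test.
        self.orchestrator = orchestrator or DoclingEnhancedRAGOrchestrator()


api = RAGOrchestratorAPI()
```

Sharing a single indexer instance means documents indexed through the orchestrator are immediately visible to both search paths.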
Key methods:
- process_document(file_path): Process a single document
- process_directory(directory): Process all documents in a directory
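As a rough illustration, process_directory could be a loop over process_document that isolates per-file failures. The supported extensions, the result dictionaries, and fake_process_document are assumptions; the real processor delegates to Docling.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Assumed set of extensions; the real processor may accept others.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".md", ".txt"}


def process_directory(directory: str, process_document: Callable[[Path], Dict]) -> Dict[str, Dict]:
    """Apply process_document to every supported file under directory."""
    results: Dict[str, Dict] = {}
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        key = path.relative_to(directory).as_posix()
        try:
            results[key] = process_document(path)
        except Exception as exc:  # one bad file should not abort the batch
            results[key] = {"status": "error", "error": str(exc)}
    return results


def fake_process_document(path: Path) -> Dict:
    """Stand-in for the real Docling-backed processing step."""
    text = path.read_text(encoding="utf-8")
    if not text:
        raise ValueError("empty document")
    return {"status": "ok", "characters": len(text)}


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("hello docling", encoding="utf-8")
    Path(tmp, "empty.md").write_text("", encoding="utf-8")
    Path(tmp, "image.png").write_bytes(b"\x89PNG")
    summary = process_directory(tmp, fake_process_document)
```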
Key methods:
- index_document(document_id): Index a specific document
- index_all_documents(): Index all processed documents
- get_stats(): Get vector database statistics
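The three methods can be mimicked with an in-memory toy indexer. This is not the real implementation: ToyVectorIndexer, the character-count embed function, and the shape of the statistics are all assumptions standing in for an embedding model and a FAISS or Chroma store.

```python
from typing import Dict, List, Tuple


def embed(text: str, dims: int = 8) -> List[float]:
    """Toy bag-of-characters embedding; a real system would call an embedding model."""
    vector = [0.0] * dims
    for ch in text.lower():
        vector[ord(ch) % dims] += 1.0
    return vector


class ToyVectorIndexer:
    def __init__(self, chunks_by_document: Dict[str, List[str]]) -> None:
        self.chunks_by_document = chunks_by_document
        self.vectors: Dict[Tuple[str, int], List[float]] = {}
        self.registry: Dict[str, int] = {}  # document_id -> chunk count

    def index_document(self, document_id: str) -> int:
        chunks = self.chunks_by_document[document_id]
        for number, chunk in enumerate(chunks):
            self.vectors[(document_id, number)] = embed(chunk)
        self.registry[document_id] = len(chunks)
        return len(chunks)

    def index_all_documents(self) -> int:
        return sum(self.index_document(doc_id) for doc_id in self.chunks_by_document)

    def get_stats(self) -> Dict[str, int]:
        return {"documents": len(self.registry), "vectors": len(self.vectors)}


indexer = ToyVectorIndexer({"doc-1": ["alpha", "beta"], "doc-2": ["gamma"]})
total = indexer.index_all_documents()
stats = indexer.get_stats()
```

Keying vectors by document ID and chunk number makes re-indexing a document overwrite its earlier entries instead of duplicating them.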
Key methods:
- search(query, top_k): Perform semantic search
- get_document_context(document_id): Get context for a document
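At its core, a semantic search ranks stored vectors by similarity to the query vector and keeps the best top_k. The sketch below uses cosine similarity over plain Python lists; the function names, the dictionary index, and the choice of cosine similarity are assumptions, since the document does not state which metric the vector store uses.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: List[float], index: Dict[str, List[float]], top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the top_k (chunk_id, score) pairs, best match first."""
    scored = [(chunk_id, cosine(query_vector, vector)) for chunk_id, vector in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


index = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
hits = search([1.0, 0.1], index, top_k=2)
```

A dedicated vector database performs the same ranking with approximate nearest-neighbour structures rather than a full scan.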
Key methods:
- search(query, top_k): Perform hybrid search
- _combine_search_results(): Combine semantic and lexical results
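The document does not say how _combine_search_results merges the two result lists. One common option is reciprocal rank fusion, sketched below as an assumption rather than a description of the actual method; the function name and the constant k=60 are illustrative.

```python
from typing import Dict, List


def combine_search_results(semantic_ids: List[str], lexical_ids: List[str],
                           k: int = 60, top_k: int = 5) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = {}
    for ranking in (semantic_ids, lexical_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]


combined = combine_search_results(["a", "b", "c"], ["b", "d", "a"])
```

Rank-based fusion avoids having to normalise semantic and lexical scores onto a common scale, which is why it is a popular default for hybrid search.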
Key methods:
- process_documents(source_paths): Process documents
- index_documents(force_reindex): Index documents
- search(query, top_k): Search for documents
- ask(question, top_k): Answer questions using document context
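The force_reindex parameter suggests that indexing normally skips documents that are already in the vector registry. The sketch below assumes that behaviour; the standalone function, its extra parameters, and the set-based registry are hypothetical simplifications of whatever the orchestrator actually tracks.

```python
from typing import Callable, Iterable, List, Set


def index_documents(document_ids: Iterable[str], already_indexed: Set[str],
                    index_one: Callable[[str], None],
                    force_reindex: bool = False) -> List[str]:
    """Index each document, skipping known ones unless force_reindex is set."""
    processed: List[str] = []
    for document_id in document_ids:
        if document_id in already_indexed and not force_reindex:
            continue  # assumed default: unchanged documents are left alone
        index_one(document_id)
        already_indexed.add(document_id)
        processed.append(document_id)
    return processed


calls: List[str] = []
indexed = {"doc-1"}
first = index_documents(["doc-1", "doc-2"], indexed, calls.append)
second = index_documents(["doc-1", "doc-2"], indexed, calls.append, force_reindex=True)
```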
Key methods:
- process_document(file_path): Process a document (async)
- index_documents(force_reindex): Index documents (async)
- search(query, top_k): Search for documents (async)
- ask(question, top_k): Answer questions (async)
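An async method that wraps a blocking orchestrator call can hand the work to a worker thread so the event loop stays responsive. The sketch below uses asyncio.to_thread (Python 3.9+); SyncOrchestrator and AsyncFacade are hypothetical stand-ins, and the real API layer may use a different mechanism.

```python
import asyncio
from typing import List


class SyncOrchestrator:
    """Stand-in for a blocking orchestrator."""

    def search(self, query: str, top_k: int = 5) -> List[str]:
        return [f"result {i} for {query!r}" for i in range(top_k)]


class AsyncFacade:
    """Exposes the blocking orchestrator through awaitable methods."""

    def __init__(self, orchestrator: SyncOrchestrator) -> None:
        self._orchestrator = orchestrator

    async def search(self, query: str, top_k: int = 5) -> List[str]:
        # Run the blocking call in a worker thread instead of on the event loop.
        return await asyncio.to_thread(self._orchestrator.search, query, top_k)


results = asyncio.run(AsyncFacade(SyncOrchestrator()).search("docling", top_k=2))
```

Inside a FastAPI endpoint the facade would be awaited directly; asyncio.run is used here only to make the example self-contained.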
The system is configured through multiple files:
- config.py: Core configuration implementation
- config.json: User-editable configuration
- .env: Environment-specific settings
Key configuration sections:
- paths: Data and storage directories
- embedding: Model settings
- vector_db: Vector database settings
- chunking: Document chunking parameters
- search: Search parameters
- llm: Language model settings
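A loader that layers a user-editable JSON file over built-in defaults might look like the sketch below. The section names match the list above, but every key and value inside each section (chunk_size, backend, the model names, and so on) is a placeholder assumption, as are the merge and load_config helpers.

```python
import json
from copy import deepcopy
from typing import Any, Dict

# Placeholder defaults; only the top-level section names come from the document.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "paths": {"data": "data", "processed": "processed_data", "vector_db": "vector_db"},
    "embedding": {"model": "example-embedding-model"},
    "vector_db": {"backend": "faiss"},
    "chunking": {"chunk_size": 500, "chunk_overlap": 50},
    "search": {"top_k": 5},
    "llm": {"model": "example-llm"},
}


def merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively overlay override on a copy of base."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(json_text: str) -> Dict[str, Any]:
    """Parse user JSON and fill in anything it leaves out from DEFAULTS."""
    return merge(DEFAULTS, json.loads(json_text))


config = load_config('{"chunking": {"chunk_size": 800}, "vector_db": {"backend": "chroma"}}')
```

Merging section by section lets config.json override a single key without having to restate the rest of its section.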
Document registry:
- Maps document paths to metadata
- Tracks document IDs, hashes, and chunk files
Vector registry:
- Tracks vector database metadata
- Records indexed documents and chunk counts
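A registry that maps document paths to IDs, hashes, and chunk files can be sketched as a plain dictionary keyed by path. The field names, the use of SHA-256, the ID derived from the hash, and the example file names are assumptions for illustration only.

```python
import hashlib
from typing import Dict, List

Registry = Dict[str, Dict]


def register_document(registry: Registry, path: str, content: bytes,
                      chunk_files: List[str]) -> bool:
    """Record a document and report whether it is new or has changed."""
    digest = hashlib.sha256(content).hexdigest()
    previous = registry.get(path)
    changed = previous is None or previous["hash"] != digest
    if changed:
        registry[path] = {
            "document_id": digest[:12],  # assumed: ID derived from the content hash
            "hash": digest,
            "chunk_files": list(chunk_files),
        }
    return changed


registry: Registry = {}
chunk_paths = ["processed_data/chunks/report_0.json"]  # hypothetical file name
first = register_document(registry, "data/report.pdf", b"version 1", chunk_paths)
repeat = register_document(registry, "data/report.pdf", b"version 1", chunk_paths)
edited = register_document(registry, "data/report.pdf", b"version 2", chunk_paths)
```

Comparing content hashes lets the pipeline skip reprocessing files that have not changed since the last run.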
- data/: Raw documents
- processed_data/chunks/: Processed document chunks
- processed_data/docling/: Docling-specific output
- vector_db/: Vector database files
- vector_db/index.faiss: FAISS index (if using FAISS)
- vector_db/chroma/: Chroma database (if using Chroma)
The system is designed to be extended in several ways:
- New Document Processors: Add new processors in doclingroc/
- New Search Methods: Add new search implementations in searchproc/
- New Vector Stores: Add support for additional vector databases in vectorproc/
- New LLM Integrations: Add support for different LLMs in the orchestrator
To add extensions, follow the existing patterns and interfaces, then update the orchestrator to use your new components.
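As an example of this pattern, a new search method could implement a shared interface and then be registered with the orchestrator. The sketch is hypothetical throughout: the SearchMethod protocol, KeywordSearch, PluggableOrchestrator, and register_search do not appear in the codebase described here and only show one way the wiring could work.

```python
from typing import Dict, List, Protocol


class SearchMethod(Protocol):
    """Assumed common interface for search implementations."""

    def search(self, query: str, top_k: int) -> List[Dict]: ...


class KeywordSearch:
    """Example extension: ranks chunks by how often the query terms occur."""

    def __init__(self, chunks: Dict[str, str]) -> None:
        self.chunks = chunks

    def search(self, query: str, top_k: int = 5) -> List[Dict]:
        terms = query.lower().split()
        hits = []
        for chunk_id, text in self.chunks.items():
            score = sum(text.lower().count(term) for term in terms)
            if score:
                hits.append({"chunk_id": chunk_id, "score": score})
        hits.sort(key=lambda hit: (-hit["score"], hit["chunk_id"]))
        return hits[:top_k]


class PluggableOrchestrator:
    """Stand-in orchestrator that dispatches to registered search methods."""

    def __init__(self) -> None:
        self.search_methods: Dict[str, SearchMethod] = {}

    def register_search(self, name: str, method: SearchMethod) -> None:
        self.search_methods[name] = method

    def search(self, query: str, top_k: int = 5, method: str = "keyword") -> List[Dict]:
        return self.search_methods[method].search(query, top_k)


orchestrator = PluggableOrchestrator()
orchestrator.register_search("keyword", KeywordSearch({
    "c1": "Docling parses PDF files",
    "c2": "Vector search with FAISS",
    "c3": "PDF tables and PDF figures",
}))
hits = orchestrator.search("pdf", top_k=2, method="keyword")
```

Because the protocol is structural, a new implementation only needs a matching search method and does not have to inherit from a base class.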