This project implements a multimodal PDF RAG pipeline with two stages: indexing and question answering. During indexing, each PDF page is converted to an image, described by an OpenAI vision model, embedded, and stored in Qdrant with page metadata. During chat, the user query is rewritten from conversation history, top-k pages are retrieved, retrieved page images are sent to the answer model, and the final response is streamed.
- Indexing flow: PDF page -> page image -> vision description -> embedding -> Qdrant upsert.
- Page-level multimodal representation: each vectorized unit is one PDF page with an LLM-generated full-page description and metadata (`source`, `page_number`, `image_path`).
- History-aware query rewrite: chat history + user message are transformed into a retrieval-ready rewritten query before search.
- Top-k vector retrieval from Qdrant: rewritten query embedding is used to fetch the most relevant pages for grounding.
- Image-grounded answer generation: answer generation runs on retrieved page images plus user query and rewritten query context.
- Streaming response protocol: FastAPI SSE stream emits structured events (`session`, `retrieval`, `token`, `end_of_response`, `error`).
- Deterministic re-indexing: `POST /index` clears previous vectors and rendered page images before rebuilding the index.
- Index inspection endpoints: page-level debugging via `/index/stats` and `/index/pages/{page_number}` for `page_content` and metadata validation.
- Remote chat memory: async PostgreSQL persistence with connection pooling for thread-scoped multi-turn chat state.
- Built-in Web UI: chat interface is served from `GET /web`.
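The streaming events listed above follow standard SSE framing, so any SSE client can consume them. Below is a minimal client-side parsing sketch; the JSON payload fields (`session_id`, `text`) are illustrative assumptions, not the project's documented schema.

```python
import json

def parse_sse_events(raw: str):
    """Parse a raw SSE stream into (event, data) pairs.

    Assumes each message is framed as `event: <name>` followed by
    `data: <json>` and a blank line, which is standard SSE framing.
    """
    events = []
    event_name, data_lines = None, []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and event_name is not None:
            payload = "\n".join(data_lines)
            events.append((event_name, json.loads(payload) if payload else None))
            event_name, data_lines = None, []
    return events

# Example stream using the event types above (payload fields are illustrative).
stream = (
    "event: session\ndata: {\"session_id\": \"abc\"}\n\n"
    "event: token\ndata: {\"text\": \"Hel\"}\n\n"
    "event: token\ndata: {\"text\": \"lo\"}\n\n"
    "event: end_of_response\ndata: {}\n\n"
)
# Reassemble the streamed answer from `token` events.
answer = "".join(d["text"] for name, d in parse_sse_events(stream) if name == "token")
```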
- API and lifecycle layer (`multimodal/server.py`): initializes core services in lifespan, validates pooled Postgres connectivity, and serves index/chat/debug routes.
- Indexing layer (`multimodal/services/indexing_service.py`): loads the single PDF from `uploads/`, creates page images, generates page descriptions, and prepares vector documents.
- Model layer (`multimodal/services/openai_service.py`): runs page description generation, history-aware query rewrite, and image-grounded answer generation.
- Retrieval layer (`multimodal/services/qdrant_service.py`): manages collection reset, embedding upsert, top-k similarity retrieval, and page inspection reads.
- Memory layer (`multimodal/db/*`, `multimodal/services/postgres_db_service.py`): stores and fetches conversation history from remote PostgreSQL using async pooled connections.
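The indexing layer's output can be pictured as one vector document per page. The sketch below shows a plausible payload shape using the metadata fields named above (`source`, `page_number`, `image_path`); the builder function and file paths are hypothetical, not the project's actual API.

```python
def build_page_document(source, page_number, image_path, description):
    """Assemble one page-level vector document: the LLM-generated page
    description is the text that gets embedded, and the metadata carries
    provenance so answers can be grounded back to a page image."""
    return {
        "page_content": description,      # text sent to the embedding model
        "metadata": {
            "source": source,             # originating PDF file
            "page_number": page_number,   # 1-based page index
            "image_path": image_path,     # rendered page image on disk (hypothetical path below)
        },
    }

doc = build_page_document(
    source="uploads/extracted.pdf",
    page_number=1,
    image_path="page_images/page_1.png",
    description="A table of quarterly revenue figures next to a bar chart...",
)
```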
- Multimodal PDF Q&A where answers must be grounded in page images (not only extracted text).
- Validation of LLM page descriptions using page-wise inspection (`/index/pages/{page_number}`).
- Multi-turn assistant workflows that require query rewriting before retrieval.
- Reference implementation for OpenAI multimodal RAG with FastAPI streaming, Qdrant retrieval, and remote PostgreSQL memory.
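As a sketch of the query-rewrite step mentioned above: the rewrite model typically receives the prior turns plus the new message and must return a self-contained search query. The prompt wording and helper below are illustrative assumptions, not the project's actual prompt.

```python
def build_rewrite_messages(history, user_message):
    """Build a chat-completion message list that asks the model to rewrite
    the latest user message into a standalone, retrieval-ready query."""
    system = (
        "Rewrite the user's latest message into a standalone search query, "
        "resolving pronouns and references using the conversation history. "
        "Return only the rewritten query."
    )
    messages = [{"role": "system", "content": system}]
    messages.extend(history)  # prior turns: [{"role": ..., "content": ...}, ...]
    messages.append({"role": "user", "content": user_message})
    return messages

# A follow-up like "And the year before?" is unanswerable by vector search
# until the history resolves what "the year before" refers to.
history = [
    {"role": "user", "content": "What does page 3 say about revenue?"},
    {"role": "assistant", "content": "Page 3 reports Q2 revenue of $1.2M."},
]
msgs = build_rewrite_messages(history, "And the year before?")
```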
- Create environment file: `cp .env.example .env`
- Set required variables in `.env`: `OPENAI_API_KEY`, `POSTGRESQL_URL`
- Put one PDF into `uploads/` (for example `uploads/extracted.pdf`).
- Start dev stack: `docker-compose -f docker-compose.dev.yml up --build`
- Build index: `curl -X POST "http://localhost:8000/index"`
- Use the app: open `http://localhost:8000/web` for the chat UI, check `http://localhost:8000/health` for health status, and browse the API docs at `http://localhost:8000/docs`.
- Stop containers: `docker-compose -f docker-compose.dev.yml down`
- Start again (without rebuild): `docker-compose -f docker-compose.dev.yml up`
- Rebuild and start: `docker-compose -f docker-compose.dev.yml up --build`
- Optional full cleanup (removes volumes): `docker-compose -f docker-compose.dev.yml down -v`
- Index stats: `curl "http://localhost:8000/index/stats"`
- Inspect one indexed page: `curl "http://localhost:8000/index/pages/1"` returns `status`, `page_number`, `page_content` (the embedded text), `metadata`, and `token_usage`.
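Based on the fields listed above, a response from `/index/pages/1` has roughly this shape (all values below are illustrative placeholders, not real output):

```json
{
  "status": "found",
  "page_number": 1,
  "page_content": "The page shows a title slide with ...",
  "metadata": {
    "source": "uploads/extracted.pdf",
    "page_number": 1,
    "image_path": "..."
  },
  "token_usage": {"prompt_tokens": 0, "completion_tokens": 0}
}
```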