An end-to-end, two-stage multimodal AI pipeline for zero-shot semantic video retrieval.
This system ingests raw video files, intelligently chunks them using scene detection, and allows users to search for highly specific visual moments using natural language. Instead of returning entire video files, the pipeline retrieves video segments with start and end timestamps.
This system bypasses simple keyword matching by relying entirely on dense vector representations and generative visual logic. It uses a Two-Stage Retrieval Architecture to balance extreme speed with high precision.
- The Concept: We use Google's `SigLIP-base` to project both text queries and video frames into a shared 768-dimensional contrastive latent space.
- The Implementation: Videos are processed into distinct visual chunks. Frames are extracted at 1 FPS, embedded, and stored in a CPU-bound Faiss `IndexFlatIP` database (see the sketch after this list).
- Why this works: Faiss allows us to search tens of thousands of frames in milliseconds. By keeping the index on the CPU, we reserve critical VRAM for the Generative VLM in Stage 2.
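A minimal sketch of the Stage 1 embed-and-index flow, assuming the Hugging Face `SiglipModel` API and the `google/siglip-base-patch16-224` checkpoint; the exact checkpoint and the structure of `embedder.py` / `vector_store.py` may differ:

```python
# Sketch of the Stage 1 embed-and-index flow. The checkpoint name and helper
# structure are assumptions; the repo's embedder.py / vector_store.py may differ.
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/siglip-base-patch16-224"  # assumed SigLIP-base checkpoint
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

index = faiss.IndexFlatIP(768)  # exact inner-product search, kept on the CPU

@torch.no_grad()
def embed_frames(frames: list[Image.Image]) -> np.ndarray:
    """Project frames into the shared 768-d space and L2-normalize them."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs).cpu().numpy().astype("float32")
    faiss.normalize_L2(feats)  # normalized inner product == cosine similarity
    return feats

@torch.no_grad()
def embed_query(text: str) -> np.ndarray:
    """Project a text query into the same space as the frames."""
    inputs = processor(text=[text], padding="max_length", return_tensors="pt")
    feats = model.get_text_features(**inputs).cpu().numpy().astype("float32")
    faiss.normalize_L2(feats)
    return feats

# Usage: index.add(embed_frames(frames)); scores, ids = index.search(embed_query("hockey goalie"), 50)
```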
- The Concept: Contrastive models (like SigLIP) are excellent at identifying objects but struggle with complex spatial relationships. Because `Qwen2-VL-7B-Instruct` processes video frames natively, it acts as a logical gatekeeper capable of understanding both spatial layout and temporal sequence.
- The Engineering: Instead of prompting the VLM to generate a string of text (which is slow and hard to parse), we force it to output a single token ("Yes" or "No"). We extract the raw logit of the "Yes" token with PyTorch, convert it to a probability via softmax, and use that probability as a continuous scoring function (see the sketch after this list).
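A minimal sketch of the single-token scoring idea, assuming the Hugging Face `Qwen2VLForConditionalGeneration` interface; the prompt wording, chat-template plumbing, and dtype/device choices are illustrative, and `reranker.py` may differ in detail:

```python
# Sketch of the single-token ("Yes"/"No") scoring idea. Prompt wording, chat-template
# plumbing, and dtype/device choices are illustrative; reranker.py may differ.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
).eval()

@torch.no_grad()
def yes_probability(frames, query: str) -> float:
    """Return P("Yes") for 'Does this clip show <query>?' from a single forward pass."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": f"Does this clip show: {query}? Answer Yes or No."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], videos=[frames], return_tensors="pt").to(model.device)

    next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token
    yes_id = processor.tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = processor.tokenizer.encode("No", add_special_tokens=False)[0]
    probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)  # softmax over just Yes/No
    return probs[0].item()
```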
- Score Fusion: The final confidence score is a convex combination of the normalized Faiss inner-product score (Stage 1) and the VLM's logical probability (Stage 2):

  $S_{\text{final}} = \alpha \, S_{\text{stage1}} + (1 - \alpha) \, S_{\text{stage2}}$
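A minimal sketch of the fusion step, assuming a min-max normalization of the Stage 1 similarities (the normalization scheme is not specified here) and an illustrative `alpha` of 0.5:

```python
# Sketch only: the normalization scheme and the alpha value are assumptions.
def min_max_normalize(scores: list[float]) -> list[float]:
    """Squash raw Faiss inner-product scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + 1e-8) for s in scores]

def fuse(s_stage1: float, s_stage2: float, alpha: float = 0.5) -> float:
    """Convex combination of the Stage 1 similarity and the Stage 2 Yes-probability."""
    return alpha * s_stage1 + (1.0 - alpha) * s_stage2
```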
- The Concept: Returning a 5-minute video for a 3-second action is a poor user experience.
- The Implementation: The ingestion pipeline uses `PySceneDetect` to semantically slice videos based on visual content changes, combined with a bounded sliding-window mechanism that caps the maximum chunk duration. Overlapping these bounded chunks ensures continuous temporal coverage and guarantees no fast-moving action is lost across boundaries (see the sketch after this list).
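A minimal sketch of scene-aware, bounded chunking, assuming PySceneDetect's `detect` / `ContentDetector` API; the window and overlap values are illustrative defaults, not necessarily those used by `video_processor.py`:

```python
# Sketch of scene-aware, bounded chunking. The window and overlap values are
# illustrative defaults, not necessarily those used by video_processor.py.
from scenedetect import ContentDetector, detect

MAX_CHUNK_SEC = 10.0  # assumed cap on chunk duration
OVERLAP_SEC = 2.0     # assumed overlap between adjacent windows

def chunk_video(path: str) -> list[tuple[float, float]]:
    """Return (start_sec, end_sec) chunks: scene cuts first, then bounded sliding windows."""
    scenes = detect(path, ContentDetector())  # content-based scene boundaries
    chunks = []
    for start_tc, end_tc in scenes:
        start, end = start_tc.get_seconds(), end_tc.get_seconds()
        cursor = start
        while cursor < end:
            chunk_end = min(cursor + MAX_CHUNK_SEC, end)
            chunks.append((cursor, chunk_end))
            if chunk_end >= end:
                break
            cursor = chunk_end - OVERLAP_SEC  # overlap so actions spanning a boundary survive
    return chunks
```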
multimodal-video-search/
├── data/
│ ├── raw_videos/ # Place raw .mp4 files here
│ ├── processed_frames/ # Auto-generated visual chunks
│ │ └── metadata.json # Auto-generated chunk-to-video relationship mapping
│ ├── faiss_index.bin # Auto-generated disk-backed vector store
│ └── faiss_index.bin.map.json # Auto-generated persistent ID-to-chunk Faiss mapping
├── src/
│ ├── video_processor.py # Video chunking & frame extraction
│ ├── embedder.py # SigLIP tensor processing
│ ├── vector_store.py # Faiss indexing & persistence
│ └── reranker.py # Qwen2-VL logit extraction
├── main.py # CLI Orchestrator (Build/Search)
├── app.py # Streamlit Web UI
├── requirements.txt
└── README.md
Note: An NVIDIA GPU with at least 16GB VRAM is strongly recommended for CUDA acceleration.
First, install PyTorch configured for your specific CUDA version (Example for CUDA 12.9):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu129
Then, install the remaining pipeline dependencies:
pip install -r requirements.txt
Place your .mp4 files into data/raw_videos/. Then, run the build command to chunk the videos, embed the frames, and save the Faiss index to disk:
python main.py --build
You can interact with the search engine using either the web interface or the terminal. Both interfaces will automatically load the saved Faiss index from disk without re-embedding the videos.
Option A: Web UI. Launch the Streamlit web interface for a rich, visual search experience with interactive video playback:
streamlit run app.py
Option B: Command Line Interface (CLI). Run the search engine directly in your terminal for fast, text-based interactive querying:
python main.py --query
The system handles zero-shot, highly descriptive queries. A few examples of top retrievals from a dataset of ~130 videos are shown below. All of the retrieved video segments except the rightmost one are relevant to the queries listed beneath them; the incorrect retrieval shows an Adidas backpack, while the query asked for a Nike backpack.
riding on a street | shovel | wine salute | hockey goalie | a person is playing the piano | A Nike backpack
The UI returns the top 5 ranked segments, dynamically rendering a video player clipped to the exact `start_sec` and `end_sec` of the identified action. A screenshot of the web UI is shown below.
- Text-in-Image Hallucinations: While SigLIP is highly robust, it can occasionally struggle with exact OCR. For example, searching for a specific brand logo might yield a visually similar logo of a different brand if the VLM reranker's confidence is not high enough to correct it.
- Complex Spatial Negation: Queries involving negation (e.g., "A street with no cars") often trip up contrastive models like SigLIP, as the embedding of the word "cars" forces the vector closer to images of cars. The Stage 2 Qwen2-VL reranker mitigates this, but an exceptionally high Stage 1 score can occasionally overpower the fusion equation.
While this architecture performs exceptionally well on datasets of thousands of video clips, scaling to millions of videos requires modifications to the underlying ingestion and retrieval mechanics:
- Overcoming RAM Exhaustion (Ingestion): Currently, the `IndexFlatIP` Faiss backend accumulates vectors in system RAM before writing to disk. At 10+ million chunks, this would trigger an Out-Of-Memory (OOM) crash. To scale, the pipeline could be upgraded to use `OnDiskInvertedLists` or index sharding, allowing vectors to stream directly to NVMe SSDs during the `add()` operation.
- Vector Compression (Search): Storing 10 million 768-dimensional uncompressed vectors requires ~30 GB of RAM. The index should be migrated to `IndexIVFPQ` (Inverted File with Product Quantization) to compress the latent space, reducing the memory footprint by roughly 30x with negligible impact on retrieval recall (see the sketch after this list).
- Audio Modality Integration: The underlying architecture can be extended with audio processing to capture conversational or environmental audio context alongside the visual data.
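A minimal sketch of the proposed IVF-PQ migration using Faiss's `index_factory`; the `nlist`, PQ code size, training-sample size, and `nprobe` values are illustrative and would need tuning against recall on the real corpus:

```python
# Sketch of the proposed IVF-PQ migration via the Faiss index factory. The nlist,
# PQ code size, training-sample size, and nprobe values are illustrative only.
import faiss
import numpy as np

d = 768  # SigLIP embedding dimension
# IVF4096: 4096 coarse clusters; PQ96: 96 sub-quantizers -> 96 bytes/vector vs. 3072 raw.
index = faiss.index_factory(d, "IVF4096,PQ96", faiss.METRIC_INNER_PRODUCT)

train = np.random.rand(100_000, d).astype("float32")  # placeholder for a real training sample
faiss.normalize_L2(train)
index.train(train)   # IVF-PQ must be trained before vectors can be added
index.add(train)

index.nprobe = 32    # clusters probed per query: recall vs. latency trade-off
```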