An end-to-end, two-stage multimodal AI pipeline for zero-shot semantic video retrieval.
This system ingests raw video files, intelligently chunks them using scene detection, and allows users to search for highly specific visual moments using natural language. Instead of returning entire video files, the pipeline retrieves video segments with start and end timestamps.
This system bypasses simple keyword matching by relying entirely on dense vector representations and generative visual logic. It uses a Two-Stage Retrieval Architecture to balance extreme speed with high precision.
- The Concept: We use Google's `SigLIP-base` to project both text queries and video frames into a shared 768-dimensional contrastive latent space.
- The Implementation: Videos are processed into distinct visual chunks. Frames are extracted at 1 FPS, embedded, and stored in a CPU-bound Faiss `IndexFlatIP` database (see the sketch after this list).
- Why this works: Faiss allows us to search tens of thousands of frames in milliseconds. By keeping the index on the CPU, we reserve critical VRAM for the Generative VLM in Stage 2.
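A minimal sketch of the Stage 1 embed-and-index flow, assuming the Hugging Face `SiglipModel` API and the `google/siglip-base-patch16-224` checkpoint; the exact checkpoint and the structure of `embedder.py` / `vector_store.py` may differ:

```python
# Sketch of the Stage 1 embed-and-index flow. The checkpoint name and helper
# structure are assumptions; the repo's embedder.py / vector_store.py may differ.
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/siglip-base-patch16-224"  # assumed SigLIP-base checkpoint
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

index = faiss.IndexFlatIP(768)  # exact inner-product search, kept on the CPU

@torch.no_grad()
def embed_frames(frames: list[Image.Image]) -> np.ndarray:
    """Project frames into the shared 768-d space and L2-normalize them."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs).cpu().numpy().astype("float32")
    faiss.normalize_L2(feats)  # normalized inner product == cosine similarity
    return feats

@torch.no_grad()
def embed_query(text: str) -> np.ndarray:
    """Project a text query into the same space as the frames."""
    inputs = processor(text=[text], padding="max_length", return_tensors="pt")
    feats = model.get_text_features(**inputs).cpu().numpy().astype("float32")
    faiss.normalize_L2(feats)
    return feats

# Usage: index.add(embed_frames(frames)); scores, ids = index.search(embed_query("hockey goalie"), 50)
```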
- The Concept: Contrastive models (like SigLIP) are excellent at identifying objects but struggle with complex spatial relationships. Because `Qwen2-VL-7B-Instruct` processes video frames natively, it acts as a logical gatekeeper capable of understanding both spatial layout and temporal sequence.
- The Engineering: Instead of prompting the VLM to generate a string of text (which is slow and hard to parse), we force it to output a single token ("Yes" or "No"). We extract the raw logit of the "Yes" token with PyTorch, convert it to a probability via softmax, and use that probability as a continuous scoring function (see the sketch after this list).
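A minimal sketch of the single-token scoring idea, assuming the Hugging Face `Qwen2VLForConditionalGeneration` interface; the prompt wording, chat-template plumbing, and dtype/device choices are illustrative, and `reranker.py` may differ in detail:

```python
# Sketch of the single-token ("Yes"/"No") scoring idea. Prompt wording, chat-template
# plumbing, and dtype/device choices are illustrative; reranker.py may differ.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
).eval()

@torch.no_grad()
def yes_probability(frames, query: str) -> float:
    """Return P("Yes") for 'Does this clip show <query>?' from a single forward pass."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": f"Does this clip show: {query}? Answer Yes or No."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], videos=[frames], return_tensors="pt").to(model.device)

    next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token
    yes_id = processor.tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = processor.tokenizer.encode("No", add_special_tokens=False)[0]
    probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)  # softmax over just Yes/No
    return probs[0].item()
```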
- Score Fusion: The final confidence score is a convex combination of the normalized Faiss inner-product score (Stage 1) and the VLM's logical probability (Stage 2):

  $S_{\text{final}} = \alpha \, S_{\text{stage1}} + (1 - \alpha) \, S_{\text{stage2}}$
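A minimal sketch of the fusion step, assuming a min-max normalization of the Stage 1 similarities (the normalization scheme is not specified here) and an illustrative `alpha` of 0.5:

```python
# Sketch only: the normalization scheme and the alpha value are assumptions.
def min_max_normalize(scores: list[float]) -> list[float]:
    """Squash raw Faiss inner-product scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + 1e-8) for s in scores]

def fuse(s_stage1: float, s_stage2: float, alpha: float = 0.5) -> float:
    """Convex combination of the Stage 1 similarity and the Stage 2 Yes-probability."""
    return alpha * s_stage1 + (1.0 - alpha) * s_stage2
```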
- The Concept: Returning a 5-minute video for a 3-second action is a poor user experience.
- The Implementation: The ingestion pipeline uses `PySceneDetect` to semantically slice videos based on visual content changes, combined with a bounded sliding-window mechanism that caps the maximum chunk duration. Overlapping these bounded chunks ensures continuous temporal coverage and guarantees no fast-moving action is lost across boundaries (see the sketch after this list).
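A minimal sketch of scene-aware, bounded chunking, assuming PySceneDetect's `detect` / `ContentDetector` API; the window and overlap values are illustrative defaults, not necessarily those used by `video_processor.py`:

```python
# Sketch of scene-aware, bounded chunking. The window and overlap values are
# illustrative defaults, not necessarily those used by video_processor.py.
from scenedetect import ContentDetector, detect

MAX_CHUNK_SEC = 10.0  # assumed cap on chunk duration
OVERLAP_SEC = 2.0     # assumed overlap between adjacent windows

def chunk_video(path: str) -> list[tuple[float, float]]:
    """Return (start_sec, end_sec) chunks: scene cuts first, then bounded sliding windows."""
    scenes = detect(path, ContentDetector())  # content-based scene boundaries
    chunks = []
    for start_tc, end_tc in scenes:
        start, end = start_tc.get_seconds(), end_tc.get_seconds()
        cursor = start
        while cursor < end:
            chunk_end = min(cursor + MAX_CHUNK_SEC, end)
            chunks.append((cursor, chunk_end))
            if chunk_end >= end:
                break
            cursor = chunk_end - OVERLAP_SEC  # overlap so actions spanning a boundary survive
    return chunks
```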
multimodal-video-search/
├── data/
│ ├── raw_videos/ # Place raw .mp4 files here
│ ├── processed_frames/ # Auto-generated visual chunks
│ │ └── metadata.json # Auto-generated chunk-to-video relationship mapping
│ ├── faiss_index.bin # Auto-generated disk-backed vector store
│ └── faiss_index.bin.map.json # Auto-generated persistent ID-to-chunk Faiss mapping
├── src/
│ ├── video_processor.py # Video chunking & frame extraction
│ ├── embedder.py # SigLIP tensor processing
│ ├── vector_store.py # Faiss indexing & persistence
│ └── reranker.py # Qwen2-VL logit extraction
├── main.py # CLI Orchestrator (Build/Search)
├── app.py # Streamlit Web UI
├── requirements.txt
└── README.md
Note: An NVIDIA GPU with at least 16GB VRAM is strongly recommended for CUDA acceleration.
First, install PyTorch configured for your specific CUDA version (Example for CUDA 12.9):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu129
Then, install the remaining pipeline dependencies:
pip install -r requirements.txt
Place your .mp4 files into data/raw_videos/. Then, run the build command to chunk the videos, embed the frames, and save the Faiss index to disk:
python main.py --build
You can interact with the search engine using either the web interface or the terminal. Both interfaces will automatically load the saved Faiss index from disk without re-embedding the videos.
Option A: Web UI. Launch the Streamlit web interface for a rich, visual search experience with interactive video playback:
streamlit run app.py
Option B: Command Line Interface (CLI). Run the search engine directly in your terminal for fast, text-based interactive querying:
python main.py --query
The system handles zero-shot, highly descriptive queries. A few examples of top retrievals from a dataset of ~130 videos are shown below. All of the retrieved video segments except the rightmost one are relevant to the queries listed beneath them; the incorrect retrieval shows an Adidas backpack, while the query asked for a Nike backpack.
riding on a street | shovel | wine salute | hockey goalie | a person is playing the piano | A Nike backpack
The UI returns the top 5 ranked segments, dynamically rendering a video player clipped to the exact `start_sec` and `end_sec` of the identified action. A screenshot of the web UI is shown below.
- Text-in-Image Hallucinations: While SigLIP is highly robust, it can occasionally struggle with exact OCR. For example, searching for a specific brand logo might yield a visually similar logo of a different brand if the VLM reranker's confidence is not high enough to correct it.
- Complex Spatial Negation: Queries involving negation (e.g., "A street with no cars") often trip up contrastive models like SigLIP, as the embedding of the word "cars" forces the vector closer to images of cars. The Stage 2 Qwen2-VL reranker mitigates this, but an exceptionally high Stage 1 score can occasionally overpower the fusion equation.
While this architecture performs exceptionally well on datasets of thousands of video clips, scaling to millions of videos requires modifications to the underlying ingestion and retrieval mechanics:
- Overcoming RAM Exhaustion (Ingestion): Currently, the `IndexFlatIP` Faiss backend accumulates vectors in system RAM before writing to disk. At 10+ million chunks, this would trigger an Out-Of-Memory (OOM) crash. To scale, the pipeline could be upgraded to use `OnDiskInvertedLists` or index sharding, allowing vectors to stream directly to NVMe SSDs during the `add()` operation.
- Vector Compression (Search): Storing 10 million 768-dimensional uncompressed vectors requires ~30 GB of RAM. The index should be migrated to `IndexIVFPQ` (Inverted File with Product Quantization) to compress the latent space, reducing the memory footprint by roughly 30x with negligible impact on retrieval recall (see the sketch after this list).
- Audio Modality Integration: The underlying architecture can be extended with audio processing to capture conversational or environmental audio context alongside the visual data.
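A minimal sketch of the proposed IVF-PQ migration using Faiss's `index_factory`; the `nlist`, PQ code size, training-sample size, and `nprobe` values are illustrative and would need tuning against recall on the real corpus:

```python
# Sketch of the proposed IVF-PQ migration via the Faiss index factory. The nlist,
# PQ code size, training-sample size, and nprobe values are illustrative only.
import faiss
import numpy as np

d = 768  # SigLIP embedding dimension
# IVF4096: 4096 coarse clusters; PQ96: 96 sub-quantizers -> 96 bytes/vector vs. 3072 raw.
index = faiss.index_factory(d, "IVF4096,PQ96", faiss.METRIC_INNER_PRODUCT)

train = np.random.rand(100_000, d).astype("float32")  # placeholder for a real training sample
faiss.normalize_L2(train)
index.train(train)   # IVF-PQ must be trained before vectors can be added
index.add(train)

index.nprobe = 32    # clusters probed per query: recall vs. latency trade-off
```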