AI-powered video analysis tool — search inside any video using natural language.
Upload any video and describe a scene in plain English — the system finds the exact timestamp where that scene appears.
Example queries:
"a boat on the ocean at sunset""person walking in a forest""city street with traffic"
The system processes video frames using OpenAI's CLIP model (zero-shot image-text alignment) and returns the best-matching timestamp; playback then jumps directly to that moment.
```
User uploads video + types scene description
        ↓
OpenCV extracts 1 frame/sec from video
        ↓
CLIP encodes each frame → image embeddings
CLIP encodes text query → text embedding
        ↓
Cosine similarity computed between text ↔ all frames
        ↓
Top matching frame returned with timestamp
        ↓
React frontend seeks video to that timestamp
```
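The matching step above is a normalized dot product over embeddings. A minimal sketch with toy 2-D vectors standing in for CLIP's real embeddings (`best_frame` and `to_timestamp` are illustrative helpers, not the repo's actual code):

```python
import numpy as np

def to_timestamp(sec):
    """Format seconds as HH:MM:SS, matching the API's timestamp field."""
    sec = int(sec)
    return "%02d:%02d:%02d" % (sec // 3600, (sec % 3600) // 60, sec % 60)

def best_frame(text_emb, frame_embs):
    """Return (frame_index, cosine_score) for the closest frame embedding."""
    # L2-normalize so the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    scores = f @ t
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])

# Toy embeddings; CLIP ViT-B/32 would produce 512-D vectors instead
frames = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.6, 0.8])
idx, score = best_frame(query, frames)
timestamp = to_timestamp(idx)  # at 1 frame/sec, frame index == seconds
```

Because frames are sampled at one per second, the winning frame's index converts directly to a playback timestamp.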
| Layer | Technology |
|---|---|
| Frontend | React.js, Bootstrap, react-router-dom |
| Backend | Python, Flask, Flask-CORS |
| AI / ML | OpenAI CLIP (ViT-B/32), Sentence Transformers |
| Video Processing | OpenCV (1 frame/sec extraction) |
| Similarity | Cosine Similarity (normalized dot product) |
```
Video-Scene-Classification-System/
├── Backend/
│   ├── app.py                     # Flask API server
│   ├── process_video_frames.py    # OpenCV frame extraction
│   ├── clip_model.py              # CLIP image & text embeddings
│   ├── nlp_model.py               # Sentence Transformer integration
│   └── requirements.txt
├── Frontend/
│   └── my-project/
│       ├── src/
│       │   ├── App.js             # Root component + routing
│       │   └── components/
│       │       ├── Home.js        # Landing page
│       │       ├── Upload.js      # Video upload + scene search (core)
│       │       ├── Team.js
│       │       └── FAQ.js
│       └── package.json
├── start-servers.bat              # One-click start (Windows)
└── README.md
```
- Python 3.8+
- Node.js 16+
```bash
git clone https://github.com/RishabThapliyal/Video-Scene-Classification-System.git
cd Video-Scene-Classification-System
cd Backend
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Mac/Linux
pip install -r requirements.txt
python app.py
```
Backend runs at http://localhost:5000
```bash
cd Frontend/my-project
npm install
npm start
```
Frontend runs at http://localhost:3000
```bash
# Starts both servers at once:
start-servers.bat
```

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/upload_video` | Upload video, receive `video_id` |
| POST | `/api/search_scene` | Query with `video_id` + query string |
| GET | `/api/thumbnail/<video_id>/<ts>` | Get frame thumbnail at timestamp |
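A minimal Flask sketch of the search endpoint — a stub, not the repo's actual `app.py`; the `SEARCH_INDEX` store stands in for real CLIP similarity results:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stub results keyed by video_id; the real app would run CLIP similarity here
SEARCH_INDEX = {"abc123": ("00:00:26", 0.84)}

@app.route("/api/search_scene", methods=["POST"])
def search_scene():
    data = request.get_json()
    ts, score = SEARCH_INDEX.get(data["video_id"], ("00:00:00", 0.0))
    return jsonify({"timestamp": ts, "confidence_score": score})
```

The request and response bodies below show the JSON shapes this route would accept and return.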
Search request:
```json
{
  "video_id": "abc123",
  "query": "a boat on calm water"
}
```
Response:
```json
{
  "timestamp": "00:00:26",
  "confidence_score": 0.84
}
```

Processing every frame at 30fps with CLIP is computationally expensive. Sampling at 1 frame/sec gives a 30x reduction in CLIP inference calls while preserving sufficient semantic coverage for scene-level search.
| Video (10 min @ 30fps) | Frames processed |
|---|---|
| Full processing | 18,000 |
| 1 frame/sec sampling | 600 ✅ |
- Not deployed — requires local setup (ML models too large for free hosting)
- GPU recommended for videos longer than 5 minutes; CPU inference is slow
- 1 frame/sec sampling can miss very short (sub-second) events
- Purely visual — no audio analysis
Built as B.Tech Major Project at Graphic Era Hill University, Dehradun (June 2025).
| Name | Roll No |
|---|---|
| Rishab Thapliyal | 2119013 |
| Shubham Singh Karki | 2119234 |
| Vimal Singh Panwar | 2119423 |
| Yugraj | 2119460 |
Guide: Dr. Amrish Sharma, Professor of Practice, CSE Dept.
MIT License — open source, free to use.