🎬 Video Scene Classification System

AI-powered video analysis tool — search inside any video using natural language.



📌 What It Does

Upload any video and describe a scene in plain English — the system finds the exact timestamp where that scene appears.

Example queries:

  • "a boat on the ocean at sunset"
  • "person walking in a forest"
  • "city street with traffic"

The system processes video frames with OpenAI's CLIP model (zero-shot image–text alignment) and returns the best-matching timestamp; the video player then jumps directly to that moment.


🧠 How It Works

```
User uploads video + types scene description
        ↓
OpenCV extracts 1 frame/sec from video
        ↓
CLIP encodes each frame → image embeddings
CLIP encodes text query → text embedding
        ↓
Cosine similarity computed between text ↔ all frames
        ↓
Top matching frame returned with timestamp
        ↓
React frontend seeks video to that timestamp
```
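The matching step above can be sketched in a few lines. This is an illustrative example, not the repository's actual code: it uses random stand-in vectors in place of real CLIP embeddings (`best_match` is a hypothetical helper), but the similarity logic — normalize, dot product, argmax — is the same.

```python
import numpy as np

def best_match(frame_embs, text_emb):
    """Return (frame_index, cosine_score) of the frame closest to the query.

    frame_embs: (n_frames, dim) image embeddings, one per sampled second.
    text_emb:   (dim,) embedding of the text query.
    """
    # Normalize so a plain dot product equals cosine similarity.
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = f @ t                      # (n_frames,) cosine similarities
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])

# Stand-in embeddings: at 1 frame/sec, frame index == timestamp in seconds.
rng = np.random.default_rng(0)
frames = rng.normal(size=(600, 512))    # 10-minute video, ViT-B/32 embedding dim
query = frames[26] + 0.1 * rng.normal(size=512)  # query resembling frame 26
idx, score = best_match(frames, query)
print(idx)  # 26 → timestamp 00:00:26
```

Because frame index maps directly to seconds at 1 frame/sec, no separate timestamp bookkeeping is needed.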

🛠️ Tech Stack

| Layer | Technology |
|---|---|
| Frontend | React.js, Bootstrap, react-router-dom |
| Backend | Python, Flask, Flask-CORS |
| AI / ML | OpenAI CLIP (ViT-B/32), Sentence Transformers |
| Video Processing | OpenCV (1 frame/sec extraction) |
| Similarity | Cosine similarity (normalized dot product) |

📁 Project Structure

```
Video-Scene-Classification-System/
├── Backend/
│   ├── app.py                    # Flask API server
│   ├── process_video_frames.py   # OpenCV frame extraction
│   ├── clip_model.py             # CLIP image & text embeddings
│   ├── nlp_model.py              # Sentence Transformer integration
│   └── requirements.txt
├── Frontend/
│   └── my-project/
│       ├── src/
│       │   ├── App.js            # Root component + routing
│       │   └── components/
│       │       ├── Home.js       # Landing page
│       │       ├── Upload.js     # Video upload + scene search (core)
│       │       ├── Team.js
│       │       └── FAQ.js
│       └── package.json
├── start-servers.bat             # One-click start (Windows)
└── README.md
```

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • Node.js 16+

1. Clone the repo

```bash
git clone https://github.com/RishabThapliyal/Video-Scene-Classification-System.git
cd Video-Scene-Classification-System
```

2. Backend setup

```bash
cd Backend
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Mac/Linux

pip install -r requirements.txt
python app.py
```

Backend runs at http://localhost:5000

3. Frontend setup

```bash
cd Frontend/my-project
npm install
npm start
```

Frontend runs at http://localhost:3000

Quick Start (Windows)

```
# Starts both servers at once:
start-servers.bat
```

🔌 API Reference

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/upload_video | Upload a video, receive a video_id |
| POST | /api/search_scene | Search with video_id + query string |
| GET | /api/thumbnail/&lt;video_id&gt;/&lt;ts&gt; | Get the frame thumbnail at a timestamp |

Search request:

```json
{
  "video_id": "abc123",
  "query": "a boat on calm water"
}
```

Response:

```json
{
  "timestamp": "00:00:26",
  "confidence_score": 0.84
}
```
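Since the API returns the timestamp as an `HH:MM:SS` string while a video player's seek position is a number of seconds, a client needs a small conversion step. A minimal sketch (the `to_seconds` helper is hypothetical, not part of the repository):

```python
def to_seconds(ts):
    """Convert an 'HH:MM:SS' timestamp string (as returned by
    /api/search_scene) into the seconds value a player's seek expects."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s

print(to_seconds("00:00:26"))  # 26
print(to_seconds("01:02:03"))  # 3723
```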

⚙️ Key Design Decision — 1 Frame/sec Sampling

Running CLIP on every frame of a 30 fps video is computationally expensive. Sampling at 1 frame/sec cuts CLIP inference calls by 30× while preserving enough semantic coverage for scene-level search.

| Video (10 min @ 30 fps) | Frames processed |
|---|---|
| Full processing | 18,000 |
| 1 frame/sec sampling | 600 ✅ |
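The arithmetic behind the table is straightforward; a small sketch (the `frames_processed` helper is illustrative, not from the repository):

```python
def frames_processed(duration_s, fps, sample_every_s=1.0):
    """Frames CLIP must embed with and without sampling."""
    full = duration_s * fps                     # every frame
    sampled = int(duration_s / sample_every_s)  # one frame per sample interval
    return full, sampled

full, sampled = frames_processed(duration_s=10 * 60, fps=30)
print(full, sampled)    # 18000 600
print(full // sampled)  # 30x fewer CLIP inference calls

# In OpenCV terms, 1 frame/sec amounts to keeping frame indices 0, fps, 2*fps, ...
indices = list(range(0, full, 30))
print(len(indices))     # 600
```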

⚠️ Limitations

  • Not deployed — requires local setup (ML models too large for free hosting)
  • GPU recommended for videos longer than 5 minutes; CPU inference is slow
  • 1 frame/sec sampling can miss very short (sub-second) events
  • Purely visual — no audio analysis

👥 Team

Built as a B.Tech Major Project at Graphic Era Hill University, Dehradun (June 2025).

| Name | Roll No |
|---|---|
| Rishab Thapliyal | 2119013 |
| Shubham Singh Karki | 2119234 |
| Vimal Singh Panwar | 2119423 |
| Yugraj | 2119460 |

Guide: Dr. Amrish Sharma, Professor of Practice, CSE Dept.


📄 License

MIT License — open source, free to use.
