This project involves web scraping, data preprocessing, database storage, and visualization of IMDb movie data from the last decade (2014-2024). The dataset includes details of 10,000 movies such as name, release year, genre, ratings, actors, directors, metascore, and more. The project culminates in an interactive Power BI dashboard for in-depth insights and reporting.
- Web Scraping: BeautifulSoup, Selenium, Requests, Time, Random
- Data Processing: Pandas, NumPy
- Machine Learning: Scikit-learn (RandomForestClassifier for genre prediction), TfidfVectorizer
- Database Management: MySQL, SQLAlchemy, MySQL Connector
- Visualization: Power BI
- Used Selenium for automated navigation (clicking, entering data, filtering, and loading more results).
- Extracted 10,000 movie links from IMDb using BeautifulSoup.
- Iterated through each movie link to collect metadata such as genres, ratings, directors, actors, and metascore.
- Implemented exception handling to manage missing values and errors.
- Created a structured dataset and saved it as a CSV file.
- Handled missing values and inconsistencies (e.g., correcting misplaced values in duration and rated columns).
- Filled missing movie names based on actual movie titles.
- Used RandomForestClassifier to predict missing genres using movie descriptions.
- Finalized the cleaned dataset and stored it in a MySQL database.
- Created an
imdbdatabase and imported the cleaned dataset. - Executed various SQL queries for insights, such as:
- Movies released in 2024
- Top 5 highest-rated movies
- Top 10 horror movies
- Movies directed by Christopher Nolan
- Highest-rated movie per genre
- Ranking movies by Metascore and ratings
- Most reviewed movies
- Longest-duration movie per genre
- Best-rated movie for each content rating
The Power BI dashboard consists of four interactive pages:
- Slicer for selecting a movie name.
- Display of key metrics such as duration, ratings, number of ratings, release year, genre, description, user reviews, critic reviews, and metascore.
- TreeMap showing total ratings by genre.
- Slicers for selecting genre and release year.
- Bar chart showing total movies by genre.
- Gauge chart displaying average Metascore.
- Donut chart visualizing content rating breakdown.
- Line chart illustrating the trend of total movie releases per year.
- Key performance indicators (KPIs): total movies, average duration, average ratings, total number of ratings.
- Slicers for genre and release year.
- Measures displaying best-rated movie and best Metascore movie, with images.
- Scatter plot showing the relationship between user ratings and Metascore.
- Table ranking movies by Metascore.
- Slicers for genre and release year.
- Measure displaying the longest movie by duration, with an image.
- Bar chart showing top actors by the number of movies they appeared in.
- Line chart comparing average user reviews vs. critic reviews by genre.
- Column charts:
- Top 5 movies by ratings.
- Top 10 directors by the number of movies directed.
- Longest duration movies by genre.
- Filters for genre and release year applied across all pages.
- Buttons to clear all slicer filters.
- Navigation buttons to move between different pages.
- Open and run
Scrapping_IMDb_movies.ipynbin Jupyter Notebook to extract movie data. - Run
Preprocessing-Modelling-Data Ingestion.ipynbto clean, preprocess, and store data in MySQL.
- Open
IMDb Dashboard.pbixin Power BI Desktop to explore interactive visualizations.
- Expand the dataset beyond 10,000 movies for deeper analysis.
- Improve genre prediction using advanced NLP techniques.
- Automate data updates to keep the dashboard current.
- Deploy the dashboard online for wider accessibility.
This project was created by Rafi Qamar. For any inquiries or collaborations, feel free to reach out!
