Welcome to my GitHub repository showcasing a collection of my university projects in the field of Data Science! This assortment reflects my commitment, passion, and dedication in the realm of data analysis and machine learning throughout my academic journey.
The repository contains one folder for each examination project, all of which were carried out as group work. Each folder may include scripts, reports or presentations, and in some cases the data used for the analysis.
The analyses were mainly carried out using Python and R, on cloud services such as Colab or Deepnote to exploit the collaborative environment.
Due to computational limitations, in some cases it was not possible to use more complex models with hyperparameter optimisation or k-fold cross-validation.
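As a reminder of why k-fold cross-validation is costly, the sketch below shows the index split in plain Python. The helper name `k_fold_indices` is hypothetical and not taken from the project code, which may well have relied on a library routine instead.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not part of the project code.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Each sample lands in exactly one validation fold.
folds = list(k_fold_indices(n_samples=10, k=3))
```

Every candidate model has to be trained once per fold, so combining this with a hyperparameter search multiplies the training cost by k for each configuration tried.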
The other tools and platforms used include MongoDB, SQLite, Tableau, OpenRefine and Knime.
Data Management for Informed Decision-Making: analyzing Milan’s rental property market with integrated news data
In recent years, growing demand for housing in Milan has driven rental prices up. Navigating the city's real estate market can be confusing, especially for newcomers. To address this, my colleagues and I created a database of apartment rental listings from popular websites such as Immobiliare.it and Subito.it. Using web scraping and APIs, we extracted data on contract prices, listing titles, descriptions, and property characteristics. We also recovered the required address information by developing algorithms that parse descriptions and titles. Because the quality of a neighborhood matters, we enriched the dataset with links to local news, events, and parades, scraped from MilanToday via the Google News portal. Finally, we merged the datasets into a NoSQL database, giving users easy access to organized, cleaned data for informed decisions about Milan's listings and neighborhoods.
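The address-parsing step can be illustrated with a small regular expression. This is only a sketch under stated assumptions: the list of street types, the pattern, and the function name `extract_address` are hypothetical and are not the algorithm actually used in the project. It only finds addresses that include a civic number.

```python
import re

# Assumed street-type prefixes; real listings would need a longer list.
STREET_TYPES = r"(?:via|viale|corso|piazza|piazzale|largo)"

# Street type, one to four name words, then an optional comma and a civic number.
ADDRESS_RE = re.compile(
    rf"\b(?P<street>{STREET_TYPES}(?:\s+[^\W\d_][\w'.]*){{1,4}}?)"
    r"\s*,?\s*(?P<number>\d{1,4})\b",
    re.IGNORECASE,
)


def extract_address(text):
    """Return (street, civic_number) for the first address found, else None."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    street = " ".join(match.group("street").split())  # normalise whitespace
    return street, match.group("number")


# Invented listing titles, for illustration only.
first = extract_address("Bilocale arredato in Via Padova 120, vicino metro")
second = extract_address("Trilocale Corso Buenos Aires, 33")
third = extract_address("Monolocale luminoso zona Navigli")
```

Listings that mention a street without a number, or only a neighbourhood, return `None` here and would need a fallback such as a gazetteer lookup.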
The main goal of this work is to analyse the Airbnb market in the city of Milan. Starting from assumptions based on common knowledge, my colleagues and I set out to verify whether the data at our disposal support them. We want to convey the information through immediately comprehensible graphs that let users answer their own questions, which can range from the distribution of listings (“Which neighbourhoods have the most listings?”) to their respective prices (“Which is the most expensive neighbourhood?”), and from host information (“What was the year with the most new host registrations?”) to user reviews (“Which neighbourhood has the best reviews?”). Each dashboard, built in Tableau, is designed to be self-explanatory in order to enable targeted research. The target audience is ordinary users and prospective hosts who, before registering and making their accommodation available on the website, prefer to inform themselves about market trends by studying prices and availability for each neighbourhood in the city.
The PhysioNet 2017 dataset consists of 8528 electrocardiogram (ECG) recordings, collected using the AliveCor device, sampled at 300 Hz and divided by a group of experts into four different classes. The aim of the project is to build a neural network that can assign each ECG to its class with a good degree of accuracy.
Diabetes is a chronic disease that depends on a multitude of factors. Based on a set of behavioral risk factors, Machine Learning models were built using the Knime Analytics Platform to solve a binary classification problem in order to predict whether a person has diabetes or not. These models were carefully evaluated and compared to select the best solution in terms of performance and reliability. The results obtained will serve as a basis for the construction of a useful tool to support medical diagnosis and prevention activities.
Analysis and development of marketing strategies based on e-commerce data. In particular, the project involved the development of the following strategies:
- Customer Focus, to prevent churn of high-value customers with a marketing campaign for customer retention [RFM + CHURN and REPURCHASE MODELS];
- Product Focus, to increase profit through a marketing campaign for product cross-selling [MBA];
- Feedback Focus, to identify detractor and promoter customers with a loyalty marketing campaign that reduces the negative impact of detractors and incentivises the positive effect of promoters [SENTIMENT ANALYSIS].
In addition, 4 interactive dashboards on Tableau were developed to get an overview of the analyses performed and are available here.
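The cross-selling strategy refers to market basket analysis (MBA), which scores item pairs by support, confidence and lift. The sketch below computes these three measures for pairs only, on invented toy baskets. The function name `pair_rules` and the `min_support` threshold are assumptions for illustration and do not reproduce the project's actual pipeline.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.2):
    """Compute support, confidence and lift for item pairs.

    baskets: list of sets of product identifiers.
    Returns a dict keyed by (antecedent, consequent).
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # P(y in basket | x in basket)
            lift = confidence / (item_counts[y] / n)  # > 1 means positive association
            rules[(x, y)] = {"support": support, "confidence": confidence, "lift": lift}
    return rules


# Toy baskets, invented for illustration only.
baskets = [
    {"pasta", "sauce"},
    {"pasta", "sauce", "cheese"},
    {"pasta", "cheese"},
    {"bread"},
]
rules = pair_rules(baskets)
```

A rule with lift above 1, such as `("sauce", "pasta")` in this toy data, marks a pair bought together more often than chance would suggest, which makes it a candidate for a cross-selling offer.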
In today’s fast-paced world of online shopping, customer reviews can play a significant role in influencing purchasing decisions. Platforms like Amazon have made it easier than ever for customers to share their feedback, helping other shoppers make informed choices. Our dataset fits into this context: it consists of fine food reviews from Amazon shared over a period of more than 10 years. The project addresses two primary tasks, text classification and text summarization. Through review classification, we aim to automatically separate positive feedback from negative feedback, giving businesses transparent insights into customer perceptions. Text summarization, in turn, condenses lengthy reviews into informative summaries while preserving critical opinions and information. This simplifies the review analysis process for both sellers and potential buyers, making it easier to grasp the overall attitude toward a product. In this way, we transform unstructured reviews into structured, actionable data.
Financial literacy is remarkably important in today’s complex world. Italy is at a disadvantage because its citizens are not as financially literate as those of other major economies. This puts both individuals and the whole country at risk when it comes to financial matters. Using the answers to a questionnaire administered by the Bank of Italy in 2017, our aim was to carry out a detailed analysis and to build machine learning models that, by classifying individuals, identify the main factors influencing the level of financial knowledge of Italian adults. This work is intended as a tool to support national and international institutions, as well as policy makers and financial analysts, who can use the results to intervene in a targeted manner and contribute to the development of financial education in the country.
I extend my gratitude to everyone who contributed or inspired these projects, as well as the open-source community that makes sharing knowledge possible.
I hope this repository inspires you and helps you explore the vast world of Data Science. Thank you for visiting!