This project implements an end-to-end churn prediction pipeline. It simulates a telecom company's relational database to perform feature engineering using SQL Window Functions and trains a CatBoost classifier to identify high-risk customers.
The primary goal is to demonstrate data transformation at the database level (ELT) rather than relying solely on in-memory processing, followed by gradient-boosted tree modeling with CatBoost.
- Data Simulation: Generates a relational database (SQLite) with realistic patterns for customers, call logs, and complaints.
- SQL Feature Engineering: Uses CTEs, Aggregations, and Window Functions (e.g., calculating trend changes over time) to extract features directly from the raw database tables.
- Machine Learning: Trains a CatBoost Classifier to predict customer churn, optimizing for Recall to capture as many potential churners as possible.
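The SQL feature engineering step can be illustrated with a minimal sketch. It is not the project's actual query: the `calls` table, its columns (`customer_id`, `call_month`, `n_calls`), and the sample rows are all assumptions made up for illustration. It shows a CTE, a `LAG` window function, and an aggregation producing per-customer usage-trend features inside SQLite.

```python
import sqlite3

# Hypothetical schema and rows, invented for this example only.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calls (customer_id INTEGER, call_month TEXT, n_calls INTEGER);
INSERT INTO calls VALUES
  (1, '2024-01', 40), (1, '2024-02', 38), (1, '2024-03', 12),
  (2, '2024-01', 20), (2, '2024-02', 22), (2, '2024-03', 25);
""")

query = """
WITH monthly AS (
    -- LAG fetches the previous month's call count for the same customer
    SELECT customer_id,
           call_month,
           n_calls,
           LAG(n_calls) OVER (
               PARTITION BY customer_id ORDER BY call_month
           ) AS prev_calls
    FROM calls
)
SELECT customer_id,
       SUM(n_calls)              AS total_calls,
       -- Most negative month-over-month change; NULLs (first month) are ignored
       MIN(n_calls - prev_calls) AS min_monthly_change
FROM monthly
GROUP BY customer_id
ORDER BY customer_id;
"""

rows = conn.execute(query).fetchall()
for customer_id, total_calls, min_monthly_change in rows:
    print(customer_id, total_calls, min_monthly_change)
```

In this toy data, customer 1 shows a sharp drop in usage (a strongly negative `min_monthly_change`), while customer 2's usage only grows. The result set could be loaded into Pandas with `pandas.read_sql_query(query, conn)` before training.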
- Language: Python 3.x
- Database: SQLite
- Data Manipulation: SQL (Window Functions, Joins), Pandas
- Machine Learning: CatBoost, Scikit-learn
- Version Control: Git
On the simulated data, the model distinguishes well between loyal and churning customers:
- ROC-AUC Score: 0.9367
- Recall (Churn Class): 0.78 (Correctly identified 78% of leaving customers)
- Precision (Churn Class): 0.70 (70% of customers flagged as churners actually left)
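For readers unfamiliar with these metrics, here is a small sketch of how precision and recall for the churn class are computed, and how lowering the decision threshold trades precision for recall. The labels and scores are toy values invented for the example, and the helper `precision_recall` is hypothetical. Threshold adjustment is only one common way to favor recall; the source does not say how this project does it. Scikit-learn's `precision_score` and `recall_score` compute the same quantities.

```python
def precision_recall(y_true, scores, threshold=0.5):
    """Return (precision, recall) for the positive class (1 = churn)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumed): 4 churners, 6 loyal customers, with predicted churn scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.45, 0.2, 0.2, 0.1, 0.1]

p_default, r_default = precision_recall(y_true, scores, threshold=0.5)
p_low, r_low = precision_recall(y_true, scores, threshold=0.35)

print(f"threshold 0.50: precision={p_default:.2f}, recall={r_default:.2f}")
print(f"threshold 0.35: precision={p_low:.2f}, recall={r_low:.2f}")
```

With the lower threshold, every churner in the toy data is caught, at the cost of one more loyal customer being flagged.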
Top Predictive Features:
- calls_last_30_days (Derived via SQL: Significant drop in usage)
- total_complaints (Customer dissatisfaction signal)
- total_calls (General usage volume)
├── data/                 # Stores generated database and model artifacts
├── src/
│   ├── db_generator.py   # Simulates customers, calls, and complaints data
│   ├── feature_store.py  # Extracts features using complex SQL queries
│   └── train_model.py    # Trains, evaluates, and saves the CatBoost model
├── requirements.txt      # Project dependencies
└── README.md             # Project documentation