Fraud detection is a real-world and highly challenging problem due to the massive class imbalance. This project was an opportunity for me to apply machine learning to a practical case, from exploratory data analysis all the way to deploying a final model ready for inference.
The dataset used is the well-known Kaggle Credit Card Fraud Detection dataset (creditcard.csv).
🔗 Source: Kaggle - Credit Card Fraud Detection Dataset
It contains anonymized transaction features (PCA-transformed) plus the original Amount and Time variables.
- Size: 284,807 transactions × 31 features
- Target: `Class` (0 = legitimate, 1 = fraud)
- Imbalance: only 492 frauds (0.173%) vs 284,315 legitimate (99.827%)
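The imbalance is easy to verify once the CSV is loaded. A minimal sketch (the helper name `class_balance` is illustrative, not part of the project code):

```python
import pandas as pd

def class_balance(df: pd.DataFrame, target: str = "Class") -> pd.DataFrame:
    """Absolute and relative counts of each class label."""
    counts = df[target].value_counts()
    ratios = df[target].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "ratio": ratios})

# Usage on the Kaggle dataset (not run here):
# df = pd.read_csv("creditcard.csv")
# print(class_balance(df))  # Class 1 is roughly 0.173% of all rows
```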
The project follows a hybrid approach:
- `src/`: Source code modules (data loading, preprocessing, training, evaluation) for the automated pipeline.
- `main.py`: Main script to execute the entire pipeline from the terminal.
- `notebooks/fraud_detection.ipynb`: The original comprehensive notebook for exploration, visualization, and model experimentation.
I first explored the dataset to understand its structure:
- No missing values
- The `Amount` feature is highly skewed (skewness = 16.98)
- The target distribution is extremely imbalanced
Target distribution
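The skewness figure can be reproduced with pandas' sample skewness (a small sketch; loading the CSV is left commented out, and `amount_skew` is just an illustrative wrapper):

```python
import pandas as pd

def amount_skew(df: pd.DataFrame) -> float:
    """Sample skewness of the transaction Amount column."""
    return float(df["Amount"].skew())

# Usage (not run here):
# df = pd.read_csv("creditcard.csv")
# print(round(amount_skew(df), 2))  # ~16.98 on the Kaggle data
```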
The preprocessing was kept minimal:
- Dropped `Time` (uninformative for fraud detection)
- Separated the target `Class`
- Standardization applied only to `Amount` (using `StandardScaler`); the PCA components `V1`...`V28` are already scaled
- Data split with `train_test_split(..., stratify=y)` to preserve the fraud ratio
- Pipelines (`ColumnTransformer` + estimator) used to ensure reproducibility
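The preprocessing steps above can be sketched as a scikit-learn pipeline. This is an illustrative reconstruction, not the project's exact code: the column names match the Kaggle dataset, and the estimator is a placeholder.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_pipeline(estimator) -> Pipeline:
    """Scale Amount only; pass the already-scaled V1..V28 through."""
    preprocess = ColumnTransformer(
        transformers=[("amount", StandardScaler(), ["Amount"])],
        remainder="passthrough",  # PCA components stay untouched
    )
    return Pipeline([("preprocess", preprocess), ("model", estimator)])

def split_xy(df: pd.DataFrame):
    """Drop Time, separate Class, stratified split to keep the fraud ratio."""
    X = df.drop(columns=["Time", "Class"])
    y = df["Class"]
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
```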
I compared several models:
- Logistic Regression with class balancing and hyperparameter tuning (`GridSearchCV`)
- Random Forest with `RandomizedSearchCV` and further tuning (tested depth, number of trees, sampling)
- XGBoost with both random and grid search, optimizing directly for PR-AUC (better suited for imbalanced data); final training included early stopping on a validation set using PR-AUC
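As one concrete instance of this tuning setup, the logistic-regression search can be sketched in scikit-learn. The grid below is illustrative rather than the one actually used; `scoring="average_precision"` is scikit-learn's PR-AUC estimate, which suits the imbalance better than accuracy.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def tune_logistic(X, y) -> GridSearchCV:
    """Grid-search a class-balanced logistic regression on PR-AUC."""
    search = GridSearchCV(
        LogisticRegression(class_weight="balanced", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
        scoring="average_precision",  # PR-AUC, robust to class imbalance
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    )
    return search.fit(X, y)
```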
Validation curve (Random Forest - max_depth vs Recall)
I evaluated all models on the test set. Here are the main metrics (ROC-AUC, PR-AUC, Recall, Precision, F1):
| Model | ROC-AUC | PR-AUC | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Logistic Regression | 0.9724 | 0.6929 | 0.8862 | 0.0637 | 0.1188 |
| Random Forest (tuned) | 0.9738 | 0.8321 | 0.7724 | 0.9048 | 0.8333 |
| XGBoost (baseline) | 0.9831 | 0.8608 | 0.8374 | 0.8306 | 0.8340 |
The results show that while Logistic Regression catches many frauds, its precision collapses under the imbalance; Random Forest trades some recall for far higher precision, and XGBoost achieves the best overall trade-off.
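The metrics in the table can be computed from predicted probabilities with scikit-learn. A sketch (the `summarize` helper is illustrative; thresholded metrics use the 0.5 default):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def summarize(y_true, y_proba, threshold: float = 0.5) -> dict:
    """ROC-AUC / PR-AUC from scores; precision, recall, F1 at a threshold."""
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, y_proba),
        "pr_auc": average_precision_score(y_true, y_proba),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```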
For illustration, here are the diagnostic plots of Logistic Regression:
- Confusion matrix
- ROC curve
- Precision-Recall curve
The final model chosen is XGBoost, trained with early stopping:
- Best iteration: 954
- Best validation PR-AUC: 0.8397
At the default threshold (0.5), the model achieves:
- Precision = 0.933
- Recall = 0.797
- F1 = 0.860
At the optimized threshold (0.855), tuned for the best F1-score, the model achieves:
- Precision = 0.967
- Recall = 0.797
- F1 = 0.874
This shows that by adjusting the threshold, XGBoost provides a strong balance between catching frauds (recall) and avoiding false alarms (precision).
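The threshold search can be sketched with `precision_recall_curve` (illustrative: this picks the threshold that maximizes F1 on a set of validation scores, which is the criterion described above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_proba) -> float:
    """Return the decision threshold that maximizes F1 on the given scores."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # precision/recall carry one extra trailing point with no threshold
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)  # avoid division by zero
    return float(thresholds[np.argmax(f1)])
```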
The final pipeline was exported with joblib:
models/xgb_final_model.pkl
Example of loading and using the model:
```python
import joblib
import pandas as pd

# Load the trained pipeline
model = joblib.load("models/xgb_final_model.pkl")

# Example new transactions
X_new = pd.DataFrame([...])  # must have the same columns as training

# Get predicted probabilities
probas = model.predict_proba(X_new)[:, 1]

# Apply the decision threshold
threshold = 0.855
preds = (probas >= threshold).astype(int)
```

To reproduce the results or train the model:
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Run the automated pipeline. This script loads data, trains all models (using the modular code in `src/`), and saves the final XGBoost model:

   ```bash
   python main.py
   ```

3. Explore via notebook. You can also run the original comprehensive notebook for a step-by-step analysis:

   ```bash
   jupyter notebook notebooks/fraud_detection.ipynb
   ```
This project highlighted the challenge of detecting fraud in a highly imbalanced dataset. After testing multiple models, XGBoost proved to be the best compromise. At the optimized threshold, it reaches 96.7% precision and 79.7% recall, making it both robust and practical for real-world fraud detection. This balance allows the model to catch most fraudulent transactions while minimizing false alarms, which is crucial in real banking systems.




