Fraud detection is a real-world and highly challenging problem due to the massive class imbalance. This project was an opportunity for me to apply machine learning to a practical case, from exploratory data analysis all the way to deploying a final model ready for inference.
The dataset used is the well-known Kaggle Credit Card Fraud Detection dataset (creditcard.csv).
🔗 Source: Kaggle - Credit Card Fraud Detection Dataset
It contains anonymized transaction features (PCA-transformed) plus the original Amount and Time variables.
- Size: 284,807 transactions × 31 features
- Target: `Class` (0 = legitimate, 1 = fraud)
- Imbalance: only 492 frauds (0.173%) vs 284,315 legitimate (99.827%)
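The imbalance is easy to verify once the CSV is loaded. A minimal sketch (the helper name `class_balance` is illustrative, not part of the project code):

```python
import pandas as pd

def class_balance(df: pd.DataFrame, target: str = "Class") -> pd.DataFrame:
    """Absolute and relative counts of each class label."""
    counts = df[target].value_counts()
    ratios = df[target].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "ratio": ratios})

# Usage on the Kaggle dataset (not run here):
# df = pd.read_csv("creditcard.csv")
# print(class_balance(df))  # Class 1 is roughly 0.173% of all rows
```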
The project follows a hybrid approach:
- `src/`: Source code modules (data loading, preprocessing, training, evaluation) for the automated pipeline.
- `main.py`: Main script to execute the entire pipeline from the terminal.
- `notebooks/fraud_detection.ipynb`: The original comprehensive notebook for exploration, visualization, and model experimentation.
I first explored the dataset to understand its structure:
- No missing values
- The `Amount` feature is highly skewed (skewness = 16.98)
- The target distribution is extremely imbalanced
Target distribution
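The skewness figure can be reproduced with pandas' sample skewness (a small sketch; loading the CSV is left commented out, and `amount_skew` is just an illustrative wrapper):

```python
import pandas as pd

def amount_skew(df: pd.DataFrame) -> float:
    """Sample skewness of the transaction Amount column."""
    return float(df["Amount"].skew())

# Usage (not run here):
# df = pd.read_csv("creditcard.csv")
# print(round(amount_skew(df), 2))  # ~16.98 on the Kaggle data
```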
The preprocessing was kept minimal:
- Dropped `Time` (uninformative for fraud detection)
- Separated the target `Class`
- Standardization applied only to `Amount` (using `StandardScaler`); the PCA components `V1`...`V28` are already scaled
- Data split with `train_test_split(..., stratify=y)` to preserve the fraud ratio
- Pipelines (`ColumnTransformer` + estimator) used to ensure reproducibility
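The preprocessing steps above can be sketched as a scikit-learn pipeline. This is an illustrative reconstruction, not the project's exact code: the column names match the Kaggle dataset, and the estimator is a placeholder.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_pipeline(estimator) -> Pipeline:
    """Scale Amount only; pass the already-scaled V1..V28 through."""
    preprocess = ColumnTransformer(
        transformers=[("amount", StandardScaler(), ["Amount"])],
        remainder="passthrough",  # PCA components stay untouched
    )
    return Pipeline([("preprocess", preprocess), ("model", estimator)])

def split_xy(df: pd.DataFrame):
    """Drop Time, separate Class, stratified split to keep the fraud ratio."""
    X = df.drop(columns=["Time", "Class"])
    y = df["Class"]
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
```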
I compared several models:
- Logistic Regression with class balancing and hyperparameter tuning (`GridSearchCV`)
- Random Forest with `RandomizedSearchCV` and further tuning (tested depth, number of trees, sampling)
- XGBoost with both random and grid search, optimizing directly for PR-AUC (better suited for imbalanced data); final training included early stopping on a validation set using PR-AUC
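As one concrete instance of this tuning setup, the logistic-regression search can be sketched in scikit-learn. The grid below is illustrative rather than the one actually used; `scoring="average_precision"` is scikit-learn's PR-AUC estimate, which suits the imbalance better than accuracy.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def tune_logistic(X, y) -> GridSearchCV:
    """Grid-search a class-balanced logistic regression on PR-AUC."""
    search = GridSearchCV(
        LogisticRegression(class_weight="balanced", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
        scoring="average_precision",  # PR-AUC, robust to class imbalance
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    )
    return search.fit(X, y)
```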
Validation curve (Random Forest - max_depth vs Recall)
I evaluated all models on the test set. Here are the main metrics (ROC-AUC, PR-AUC, Recall, Precision, F1):
| Model | ROC-AUC | PR-AUC | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Logistic Regression | 0.9724 | 0.6929 | 0.8862 | 0.0637 | 0.1188 |
| Random Forest (tuned) | 0.9738 | 0.8321 | 0.7724 | 0.9048 | 0.8333 |
| XGBoost (baseline) | 0.9831 | 0.8608 | 0.8374 | 0.8306 | 0.8340 |
The results show that while Logistic Regression catches many frauds, its precision collapses under the imbalance; Random Forest trades some recall for far higher precision, and XGBoost achieves the best overall trade-off.
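The metrics in the table can be computed from predicted probabilities with scikit-learn. A sketch (the `summarize` helper is illustrative; thresholded metrics use the 0.5 default):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def summarize(y_true, y_proba, threshold: float = 0.5) -> dict:
    """ROC-AUC / PR-AUC from scores; precision, recall, F1 at a threshold."""
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, y_proba),
        "pr_auc": average_precision_score(y_true, y_proba),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```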
For illustration, here are the diagnostic plots of Logistic Regression:
- Confusion matrix
- ROC curve
- Precision-Recall curve
The final model chosen is XGBoost, trained with early stopping:
- Best iteration: 954
- Best validation PR-AUC: 0.8397
At the default threshold (0.5), the model achieves:
- Precision = 0.933
- Recall = 0.797
- F1 = 0.860
At the optimized threshold (0.855), tuned for the best F1-score, the model achieves:
- Precision = 0.967
- Recall = 0.797
- F1 = 0.874
This shows that by adjusting the threshold, XGBoost provides a strong balance between catching frauds (recall) and avoiding false alarms (precision).
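The threshold search can be sketched with `precision_recall_curve` (illustrative: this picks the threshold that maximizes F1 on a set of validation scores, which is the criterion described above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_proba) -> float:
    """Return the decision threshold that maximizes F1 on the given scores."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # precision/recall carry one extra trailing point with no threshold
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)  # avoid division by zero
    return float(thresholds[np.argmax(f1)])
```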
The final pipeline was exported with joblib:
models/xgb_final_model.pkl
Example of loading and using the model:
```python
import joblib
import pandas as pd

# Load the trained pipeline
model = joblib.load("models/xgb_final_model.pkl")

# Example new transactions
X_new = pd.DataFrame([...])  # must have the same columns as training

# Get predicted probabilities
probas = model.predict_proba(X_new)[:, 1]

# Apply the decision threshold
threshold = 0.855
preds = (probas >= threshold).astype(int)
```

To reproduce the results or train the model:
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Run the automated pipeline. This script loads data, trains all models (using the modular code in `src/`), and saves the final XGBoost model:

   ```bash
   python main.py
   ```

3. Explore via notebook. You can also run the original comprehensive notebook for a step-by-step analysis:

   ```bash
   jupyter notebook notebooks/fraud_detection.ipynb
   ```
This project highlighted the challenge of detecting fraud in a highly imbalanced dataset. After testing multiple models, XGBoost proved to be the best compromise. At the optimized threshold, it reaches 96.7% precision and 79.7% recall, making it both robust and practical for real-world fraud detection. This balance allows the model to catch most fraudulent transactions while minimizing false alarms, which is crucial in real banking systems.




