💳 Credit Card Fraud Detection in R
📌 Goal:
Develop a robust fraud detection system using machine learning techniques on highly imbalanced financial transaction data to accurately identify fraudulent activity.
🧠 Domain: Financial Analytics & Machine Learning
🎯 Task: Classification (Fraud vs. Genuine)
📂 Dataset: Kaggle Credit Card Fraud Dataset (284,807 transactions, 492 fraud cases, 0.172%)
The Goal:
Financial fraud detection remains a critical challenge for banks and payment systems due to its rarity and evolving attack patterns. This project focuses on building and comparing multiple machine learning models in R to detect fraudulent transactions from anonymized PCA-transformed features. The primary goal is to improve recall and F1-score for minority (fraudulent) cases while minimizing false negatives, which carry significant financial risk.
The Challenge:
With only 492 fraud cases among 284,807 transactions (0.172%), a naive classifier can reach over 99% accuracy while missing nearly every fraud. Accuracy alone is therefore misleading, and the minority class must be handled explicitly through resampling and recall-oriented evaluation.
Methodology & Process
Data Preprocessing:
Removed non-informative features (e.g., Time).
Standardized Amount for consistent scale.
Addressed severe class imbalance using multiple strategies:
Down-sampling of majority class
Up-sampling of minority class
ROSE (Random Over-Sampling Examples) for synthetic balance
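A minimal sketch of these preprocessing and resampling steps, assuming the Kaggle CSV has been read into a data frame named `cc` with columns `Time`, `V1`–`V28`, `Amount`, and `Class` (variable names and the 70/30 split are illustrative choices, not fixed by the project):

```r
library(caret)   # createDataPartition(), downSample(), upSample()
library(ROSE)    # ROSE() synthetic balancing

cc$Time   <- NULL                       # drop non-informative Time column
cc$Amount <- scale(cc$Amount)           # standardize Amount (mean 0, sd 1)
cc$Class  <- factor(cc$Class, levels = c(0, 1),
                    labels = c("genuine", "fraud"))

set.seed(42)
train_idx <- createDataPartition(cc$Class, p = 0.7, list = FALSE)
train <- cc[train_idx, ]
test  <- cc[-train_idx, ]

# Three balancing strategies, applied to the training split only so the
# test set keeps the true class distribution
down_train <- downSample(x = train[, -ncol(train)], y = train$Class,
                         yname = "Class")
up_train   <- upSample(x = train[, -ncol(train)], y = train$Class,
                       yname = "Class")
rose_train <- ROSE(Class ~ ., data = train, seed = 42)$data

table(rose_train$Class)   # classes should now be roughly balanced
```

Balancing only the training split matters: resampling before the split would leak duplicated or synthetic minority rows into the test set and inflate every metric.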
Exploratory Analysis:
Distribution and imbalance visualization.
t-SNE and PCA clustering to assess feature separability.
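The separability check can be sketched as follows, assuming a frame `cc` with features `V1`–`V28` and a factor `Class` with levels `genuine`/`fraud`; t-SNE is run on a fraud-enriched subsample because it is too slow for all 284,807 rows (the 5,000-row sample size is an illustrative choice):

```r
library(Rtsne)   # Barnes-Hut t-SNE

feats <- cc[, grep("^V", names(cc))]    # the 28 anonymized PCA components

# PCA: plot the first two principal components, colored by class
pc <- prcomp(feats, scale. = TRUE)
plot(pc$x[, 1:2], col = ifelse(cc$Class == "fraud", "red", "grey"),
     pch = 20, main = "PCA: fraud vs. genuine")

# t-SNE: all fraud rows plus a random sample of genuine rows
set.seed(42)
idx <- c(which(cc$Class == "fraud"),
         sample(which(cc$Class == "genuine"), 5000))
ts <- Rtsne(as.matrix(feats[idx, ]), perplexity = 30,
            check_duplicates = FALSE)
plot(ts$Y, col = ifelse(cc$Class[idx] == "fraud", "red", "grey"),
     pch = 20, main = "t-SNE: fraud vs. genuine")
```

If fraud points form visible clusters in either projection, the features carry enough signal for a classifier to exploit.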
Model Development:
Implemented Decision Tree (CART), Random Forest, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM) classifiers.
Evaluated across original and ROSE-balanced datasets.
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
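A sketch of the training and evaluation loop for one of the models (CART shown; the same pattern applies to the others), assuming a ROSE-balanced training frame `rose_train` and a held-out `test` frame whose `Class` factor has levels `genuine`/`fraud` — these names are assumptions for illustration:

```r
library(rpart)          # CART decision tree
library(randomForest)   # random forest
library(e1071)          # SVM
library(caret)          # confusionMatrix()
library(pROC)           # roc(), auc()

cart    <- rpart(Class ~ ., data = rose_train, method = "class")
rf      <- randomForest(Class ~ ., data = rose_train, ntree = 100)
svm_fit <- svm(Class ~ ., data = rose_train, probability = TRUE)

# Hard-label metrics: precision, recall, F1 with fraud as the positive class
pred <- predict(cart, newdata = test, type = "class")
cm   <- confusionMatrix(pred, test$Class, positive = "fraud")
print(cm$byClass[c("Precision", "Recall", "F1")])

# Threshold-free metric: ROC-AUC from predicted fraud probabilities
prob    <- predict(cart, newdata = test, type = "prob")[, "fraud"]
roc_obj <- roc(response = test$Class, predictor = prob)
auc(roc_obj)
```

Setting `positive = "fraud"` is essential: with the default positive class, precision and recall would describe the majority (genuine) class and look deceptively high.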
Results & Findings
| Model | Dataset | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|---|
| Decision Tree (CART) | Imbalanced | 99.8% | High | Low | Low | 0.912 |
| Decision Tree (CART) | ROSE-balanced | ↑ | ↑ | ↑ | ↑ | 0.968 |
| KNN (k = 3) | ROSE-balanced | 99.9% | 0.92 | 0.54 | 0.68 | – |
| SVM | ROSE-balanced | – | Strong | Moderate | Balanced | – |
The Result
✅ Key Insights:
Resampling with ROSE substantially improved detection of minority (fraudulent) cases.
SVM achieved the best trade-off between precision and recall, while KNN exhibited strong accuracy but weaker recall, highlighting the difficulty of identifying rare fraud cases.
Models with overly high precision but low recall can miss critical frauds, underscoring the need for cost-sensitive approaches.
🔮 Future Scope:
Integrate ensemble methods (XGBoost, LightGBM, CatBoost) to enhance generalization.
Apply cost-sensitive learning to penalize missed fraud detections.
Explore real-time deployment pipelines for live transaction monitoring.