💳 Credit Card Fraud Detection in R
📌 Goal:
Develop a robust fraud detection system using machine learning techniques on highly imbalanced financial transaction data to accurately identify fraudulent activity.
🧠 Domain: Financial Analytics & Machine Learning
🎯 Task: Classification (Fraud vs. Genuine)
📂 Dataset: Kaggle Credit Card Fraud Dataset (284,807 transactions, 492 fraud cases, 0.172%)
The Goal:
Financial fraud detection remains a critical challenge for banks and payment systems due to its rarity and evolving attack patterns. This project focuses on building and comparing multiple machine learning models in R to detect fraudulent transactions from anonymized PCA-transformed features. The primary goal is to improve recall and F1-score for minority (fraudulent) cases while minimizing false negatives, which carry significant financial risk.
The Challenge:
With only 492 fraud cases among 284,807 transactions (0.172%), a naive classifier can reach over 99% accuracy while missing nearly every fraud. Accuracy alone is therefore misleading, and the minority class must be handled explicitly through resampling and recall-oriented evaluation.
Methodology & Process
Data Preprocessing:
Removed non-informative features (e.g., Time).
Standardized Amount for consistent scale.
Addressed severe class imbalance using multiple strategies:
Down-sampling of majority class
Up-sampling of minority class
ROSE (Random Over-Sampling Examples) for synthetic balance
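A minimal sketch of these preprocessing and resampling steps, assuming the Kaggle CSV has been read into a data frame named `cc` with columns `Time`, `V1`–`V28`, `Amount`, and `Class` (variable names and the 70/30 split are illustrative choices, not fixed by the project):

```r
library(caret)   # createDataPartition(), downSample(), upSample()
library(ROSE)    # ROSE() synthetic balancing

cc$Time   <- NULL                       # drop non-informative Time column
cc$Amount <- scale(cc$Amount)           # standardize Amount (mean 0, sd 1)
cc$Class  <- factor(cc$Class, levels = c(0, 1),
                    labels = c("genuine", "fraud"))

set.seed(42)
train_idx <- createDataPartition(cc$Class, p = 0.7, list = FALSE)
train <- cc[train_idx, ]
test  <- cc[-train_idx, ]

# Three balancing strategies, applied to the training split only so the
# test set keeps the true class distribution
down_train <- downSample(x = train[, -ncol(train)], y = train$Class,
                         yname = "Class")
up_train   <- upSample(x = train[, -ncol(train)], y = train$Class,
                       yname = "Class")
rose_train <- ROSE(Class ~ ., data = train, seed = 42)$data

table(rose_train$Class)   # classes should now be roughly balanced
```

Balancing only the training split matters: resampling before the split would leak duplicated or synthetic minority rows into the test set and inflate every metric.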
Exploratory Analysis:
Distribution and imbalance visualization.
t-SNE and PCA clustering to assess feature separability.
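The separability check can be sketched as follows, assuming a frame `cc` with features `V1`–`V28` and a factor `Class` with levels `genuine`/`fraud`; t-SNE is run on a fraud-enriched subsample because it is too slow for all 284,807 rows (the 5,000-row sample size is an illustrative choice):

```r
library(Rtsne)   # Barnes-Hut t-SNE

feats <- cc[, grep("^V", names(cc))]    # the 28 anonymized PCA components

# PCA: plot the first two principal components, colored by class
pc <- prcomp(feats, scale. = TRUE)
plot(pc$x[, 1:2], col = ifelse(cc$Class == "fraud", "red", "grey"),
     pch = 20, main = "PCA: fraud vs. genuine")

# t-SNE: all fraud rows plus a random sample of genuine rows
set.seed(42)
idx <- c(which(cc$Class == "fraud"),
         sample(which(cc$Class == "genuine"), 5000))
ts <- Rtsne(as.matrix(feats[idx, ]), perplexity = 30,
            check_duplicates = FALSE)
plot(ts$Y, col = ifelse(cc$Class[idx] == "fraud", "red", "grey"),
     pch = 20, main = "t-SNE: fraud vs. genuine")
```

If fraud points form visible clusters in either projection, the features carry enough signal for a classifier to exploit.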
Model Development:
Implemented Decision Tree (CART), Random Forest, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM) classifiers.
Evaluated across original and ROSE-balanced datasets.
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
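A sketch of the training and evaluation loop for one of the models (CART shown; the same pattern applies to the others), assuming a ROSE-balanced training frame `rose_train` and a held-out `test` frame whose `Class` factor has levels `genuine`/`fraud` — these names are assumptions for illustration:

```r
library(rpart)          # CART decision tree
library(randomForest)   # random forest
library(e1071)          # SVM
library(caret)          # confusionMatrix()
library(pROC)           # roc(), auc()

cart    <- rpart(Class ~ ., data = rose_train, method = "class")
rf      <- randomForest(Class ~ ., data = rose_train, ntree = 100)
svm_fit <- svm(Class ~ ., data = rose_train, probability = TRUE)

# Hard-label metrics: precision, recall, F1 with fraud as the positive class
pred <- predict(cart, newdata = test, type = "class")
cm   <- confusionMatrix(pred, test$Class, positive = "fraud")
print(cm$byClass[c("Precision", "Recall", "F1")])

# Threshold-free metric: ROC-AUC from predicted fraud probabilities
prob    <- predict(cart, newdata = test, type = "prob")[, "fraud"]
roc_obj <- roc(response = test$Class, predictor = prob)
auc(roc_obj)
```

Setting `positive = "fraud"` is essential: with the default positive class, precision and recall would describe the majority (genuine) class and look deceptively high.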
Results & Findings
| Model | Dataset | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|---|
| Decision Tree (CART) | Imbalanced | 99.8% | High | Low | Low | 0.912 |
| Decision Tree (CART) | ROSE-balanced | ↑ | ↑ | ↑ | ↑ | 0.968 |
| KNN (k = 3) | ROSE-balanced | 99.9% | 0.92 | 0.54 | 0.68 | – |
| SVM | ROSE-balanced | – | Strong | Moderate | Balanced | – |
The Result
✅ Key Insights:
Resampling with ROSE substantially improved detection of minority (fraudulent) cases.
SVM achieved the best trade-off between precision and recall, while KNN exhibited strong accuracy but weaker recall, highlighting the difficulty of identifying rare fraud cases.
Models with overly high precision but low recall can miss critical frauds, underscoring the need for cost-sensitive approaches.
🔮 Future Scope:
Integrate ensemble methods (XGBoost, LightGBM, CatBoost) to enhance generalization.
Apply cost-sensitive learning to penalize missed fraud detections.
Explore real-time deployment pipelines for live transaction monitoring.