🩺 Diabetes Prediction using CatBoost and Ensemble Models

📌 Goal:
Predict the likelihood of diabetes based on demographic, lifestyle, and clinical factors using advanced ensemble learning models.

🧠 Domain: Healthcare Analytics & Machine Learning
🎯 Task: Binary Classification (Diabetic vs Non-Diabetic)
📂 Dataset: Kaggle Diabetes Prediction Dataset (100,000 records → 96,146 after cleaning)

The Goal:

Sugar is a Problem here..?

Diabetes prediction is vital for early intervention and disease management. This project leverages ensemble-based machine learning algorithms to identify individuals at risk of diabetes using structured health records. The primary objective was to compare the performance of multiple models—ranging from traditional classifiers to advanced gradient-boosting frameworks—and determine the most reliable predictor for clinical risk assessment.

The Challenge:

Methodology & Process

  • Data Preparation:

    • Removed 3,854 duplicate entries.

    • Handled class imbalance via SMOTE oversampling (before resampling: Non-Diabetic 87,664 | Diabetic 8,482).

    • Removed outliers using IQR filtering (≈43K rows).

    • Applied Label Encoding for categorical features (gender, smoking_history).

    • Train–test split: 80/20, with feature scaling where appropriate.

  • Exploratory Analysis:

    • Distribution plots for Age, BMI, HbA1c, and Blood Glucose.

    • Correlation heatmap confirmed strong associations between HbA1c and glucose with diabetes prevalence.

  • Models Implemented:
    Logistic Regression, KNN, Decision Tree, Random Forest, Gradient Boosting, LightGBM, XGBoost, and CatBoost.
    All models tuned using GridSearchCV and validated via 5-fold cross-validation.
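The steps above can be sketched end to end. This is a minimal illustration, not the project's actual code: column names are taken from the write-up, the parameter grid is invented for illustration, and scikit-learn's GradientBoostingClassifier stands in for CatBoost so the sketch runs without extra dependencies (CatBoostClassifier drops into the same GridSearchCV call). SMOTE (imblearn.over_sampling.SMOTE) would be fit on the training split only, to keep synthetic samples out of the test set.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

def run_pipeline(df: pd.DataFrame):
    df = df.drop_duplicates().copy()          # remove duplicate entries

    # Label-encode the categorical features named in the write-up
    for col in ("gender", "smoking_history"):
        df[col] = LabelEncoder().fit_transform(df[col])

    # IQR outlier filtering, shown here on BMI
    q1, q3 = df["bmi"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["bmi"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # 80/20 stratified train-test split
    X, y = df.drop(columns="diabetes"), df["diabetes"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Grid search with 5-fold cross-validation (illustrative grid)
    grid = {"n_estimators": [20, 50], "learning_rate": [0.05, 0.1]}
    search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                          grid, cv=5, scoring="accuracy")
    search.fit(X_tr, y_tr)
    return search.best_estimator_, search.score(X_te, y_te)
```

The same GridSearchCV protocol applies to each of the eight classifiers; only the estimator and grid change.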

The Result

Results & Findings

| Model               | Accuracy | F1-Score | AUC  |
|---------------------|----------|----------|------|
| Logistic Regression | 0.89     | 0.89     | 0.89 |
| KNN                 | 0.93     | 0.93     | 0.94 |
| Decision Tree       | 0.97     | 0.97     | 0.97 |
| LightGBM            | 0.97     | 0.97     | 0.97 |
| XGBoost             | 0.98     | 0.98     | 0.98 |
| CatBoost (Best)     | 0.98     | 0.98     | 1.00 |

✅ Key Insight:
The CatBoost Classifier outperformed all other models with near-perfect accuracy (98%) and AUC (1.00), validated through 5-fold cross-validation (mean accuracy ≈ 98.1%).
Its ability to handle categorical features natively and model complex nonlinear interactions makes it exceptionally suitable for healthcare prediction tasks.
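The table's metrics follow directly from a fitted model's held-out predictions. A small sketch, with variable names (y_test, y_pred, y_prob) assumed rather than taken from the project's code:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def score_model(y_test, y_pred, y_prob):
    """Compute the three metrics reported per model.

    y_pred holds hard class labels; y_prob holds the predicted
    probability of the positive (diabetic) class, as needed for AUC.
    """
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "F1-Score": f1_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
    }
```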

📊 Visuals:

  • ROC Curves across all models

  • Confusion Matrix for CatBoost

  • Feature Importance Visualization (HbA1c and glucose ranked highest)
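The ranking behind the feature-importance visualization can be sketched as below. GradientBoostingClassifier again stands in for CatBoost (whose get_feature_importance() plays the same role); on the real data, the write-up reports HbA1c and blood glucose ranking highest.

```python
import pandas as pd

def rank_features(fitted_model, feature_names):
    """Return features sorted by importance, highest first."""
    return (pd.Series(fitted_model.feature_importances_, index=feature_names)
              .sort_values(ascending=False))
```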

🏁 Conclusion:
The project demonstrates that ensemble-based approaches—especially CatBoost—can serve as reliable diagnostic tools for diabetes risk prediction, combining high accuracy with interpretability and robust generalization.

Let's Connect

Let's Work Together

Project Collaboration

Projects in Generative AI, ML and Imaging using advanced computational methods

Mentorship and Guidance

Open to joining ongoing publications, supervision, and interdisciplinary projects exploring deep learning and scientific computing
