🩺 Diabetes Prediction using CatBoost and Ensemble Models

📌 Goal:
Predict the likelihood of diabetes based on demographic, lifestyle, and clinical factors using advanced ensemble learning models.

🧠 Domain: Healthcare Analytics & Machine Learning
🎯 Task: Binary Classification (Diabetic vs Non-Diabetic)
📂 Dataset: Kaggle Diabetes Prediction Dataset (100,000 records → 96,146 after cleaning)

The Goal:

Sugar is a Problem here..?

Diabetes prediction is vital for early intervention and disease management. This project leverages ensemble-based machine learning algorithms to identify individuals at risk of diabetes using structured health records. The primary objective was to compare the performance of multiple models—ranging from traditional classifiers to advanced gradient-boosting frameworks—and determine the most reliable predictor for clinical risk assessment.

The Challenge:

Methodology & Process

  • Data Preparation:

    • Removed 3,854 duplicate entries.

    • Handled class imbalance via SMOTE oversampling (before resampling: Non-Diabetic 87,664 | Diabetic 8,482).

    • Removed outliers using IQR filtering (≈43K rows).

    • Applied Label Encoding for categorical features (gender, smoking_history).

    • Train–test split: 80/20, with feature scaling where appropriate.

  • Exploratory Analysis:

    • Distribution plots for Age, BMI, HbA1c, and Blood Glucose.

    • Correlation heatmap confirmed strong associations between HbA1c and glucose with diabetes prevalence.

  • Models Implemented:
    Logistic Regression, KNN, Decision Tree, Random Forest, Gradient Boosting, LightGBM, XGBoost, and CatBoost.
    All models tuned using GridSearchCV and validated via 5-fold cross-validation.
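The steps above can be sketched end to end. This is a minimal illustration, not the project's actual code: column names are taken from the write-up, the parameter grid is invented for illustration, and scikit-learn's GradientBoostingClassifier stands in for CatBoost so the sketch runs without extra dependencies (CatBoostClassifier drops into the same GridSearchCV call). SMOTE (imblearn.over_sampling.SMOTE) would be fit on the training split only, to keep synthetic samples out of the test set.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

def run_pipeline(df: pd.DataFrame):
    df = df.drop_duplicates().copy()          # remove duplicate entries

    # Label-encode the categorical features named in the write-up
    for col in ("gender", "smoking_history"):
        df[col] = LabelEncoder().fit_transform(df[col])

    # IQR outlier filtering, shown here on BMI
    q1, q3 = df["bmi"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["bmi"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # 80/20 stratified train-test split
    X, y = df.drop(columns="diabetes"), df["diabetes"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Grid search with 5-fold cross-validation (illustrative grid)
    grid = {"n_estimators": [20, 50], "learning_rate": [0.05, 0.1]}
    search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                          grid, cv=5, scoring="accuracy")
    search.fit(X_tr, y_tr)
    return search.best_estimator_, search.score(X_te, y_te)
```

The same GridSearchCV protocol applies to each of the eight classifiers; only the estimator and grid change.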

The Result

Results & Findings

| Model               | Accuracy | F1-Score | AUC  |
|---------------------|----------|----------|------|
| Logistic Regression | 0.89     | 0.89     | 0.89 |
| KNN                 | 0.93     | 0.93     | 0.94 |
| Decision Tree       | 0.97     | 0.97     | 0.97 |
| LightGBM            | 0.97     | 0.97     | 0.97 |
| XGBoost             | 0.98     | 0.98     | 0.98 |
| CatBoost (Best)     | 0.98     | 0.98     | 1.00 |

✅ Key Insight:
The CatBoost Classifier outperformed all other models with near-perfect accuracy (98%) and AUC (1.00), validated through 5-fold cross-validation (mean accuracy ≈ 98.1%).
Its ability to handle categorical features natively and model complex nonlinear interactions makes it exceptionally suitable for healthcare prediction tasks.
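The table's metrics follow directly from a fitted model's held-out predictions. A small sketch, with variable names (y_test, y_pred, y_prob) assumed rather than taken from the project's code:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def score_model(y_test, y_pred, y_prob):
    """Compute the three metrics reported per model.

    y_pred holds hard class labels; y_prob holds the predicted
    probability of the positive (diabetic) class, as needed for AUC.
    """
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "F1-Score": f1_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
    }
```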

📊 Visuals:

  • ROC Curves across all models

  • Confusion Matrix for CatBoost

  • Feature Importance Visualization (HbA1c and glucose ranked highest)
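The ranking behind the feature-importance visualization can be sketched as below. GradientBoostingClassifier again stands in for CatBoost (whose get_feature_importance() plays the same role); on the real data, the write-up reports HbA1c and blood glucose ranking highest.

```python
import pandas as pd

def rank_features(fitted_model, feature_names):
    """Return features sorted by importance, highest first."""
    return (pd.Series(fitted_model.feature_importances_, index=feature_names)
              .sort_values(ascending=False))
```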

🏁 Conclusion:
The project demonstrates that ensemble-based approaches—especially CatBoost—can serve as reliable diagnostic tools for diabetes risk prediction, combining high accuracy with interpretability and robust generalization.

Let's Connect

Let's Work Together

Project Collaboration

Projects in Generative AI, ML and Imaging using advanced computational methods

Mentorship and Guidance

Open to joining ongoing publications, supervision, and interdisciplinary projects exploring deep learning and scientific computing
