🩺 Multiclass Obesity Prediction using LightGBM

📌 Goal:
Predict individual obesity levels (7 classes) from lifestyle, demographic, and dietary features using advanced machine learning classifiers, with LightGBM achieving the best overall accuracy.

🧠 Domain: Health Informatics & Predictive Analytics
🎯 Task: Multiclass Classification
📂 Dataset: UCI/Kaggle Obesity Dataset (~21K records, 17 features)

Project Domain

Machine Learning

Task

Classification and Prediction

The Goal:

Gain or Cut?

This study treats obesity prediction as a health analytics problem that integrates behavioral, demographic, and physiological factors. Traditional measures like BMI, taken alone, often fail to represent multidimensional risk. The goal was to design a robust ML pipeline that classifies individuals into seven obesity levels, from Insufficient Weight to Obesity Class III, to support precision healthcare and personalized intervention design.

Methodology & Process

Data Preparation:
Cleaned 21,758 samples from the UCI/Kaggle datasets: handled missing data, treated outliers with the IQR method, and applied SMOTE to balance the classes.
Derived BMI from height and weight, then normalized numerical features with Min-Max scaling.
Feature Engineering:
Encoded categorical variables (gender, activity, diet, transport) with one-hot and label encoding.
Models Implemented:
Logistic Regression, KNN, SVM, Naive Bayes, Decision Tree, Random Forest, XGBoost, CatBoost, and LightGBM.
Optimization:
Tuned hyperparameters with Optuna; evaluated models on accuracy, precision, recall, F1-score, and AUC.
Implementation Stack: Python | Scikit-learn | Optuna | LightGBM | Matplotlib
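
The preparation and encoding steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the six-row sample is made up, the column names (Gender, Age, Height, Weight, MTRANS, NObeyesdad) are assumed to follow the UCI obesity data, and the helper drop_iqr_outliers is hypothetical. SMOTE is left out; it would need the imbalanced-learn package.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Made-up sample; column names are assumptions modeled on the UCI obesity data.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [21.0, 35.0, 28.0, 44.0, 19.0, 31.0],
    "Height": [1.75, 1.62, 1.68, 1.80, 1.55, 1.70],      # metres
    "Weight": [70.0, 85.0, 52.0, 110.0, 48.0, 400.0],    # kilograms; 400 is an outlier
    "MTRANS": ["Walking", "Automobile", "Bike", "Automobile", "Walking", "Automobile"],
    "NObeyesdad": ["Normal_Weight", "Obesity_Type_I", "Normal_Weight",
                   "Obesity_Type_II", "Insufficient_Weight", "Obesity_Type_III"],
})


def drop_iqr_outliers(frame, column, k=1.5):
    """Hypothetical helper: keep rows within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame.loc[mask].reset_index(drop=True)


# 1. Remove outliers with the IQR rule (drops the 400 kg row).
clean = drop_iqr_outliers(df, "Weight")

# 2. Derive BMI = weight (kg) / height (m) squared.
clean["BMI"] = clean["Weight"] / clean["Height"] ** 2

# 3. Min-Max scale the numeric columns into [0, 1].
num_cols = ["Age", "Height", "Weight", "BMI"]
clean[num_cols] = MinMaxScaler().fit_transform(clean[num_cols])

# 4. One-hot encode the categorical features, label-encode the target.
encoded = pd.get_dummies(clean, columns=["Gender", "MTRANS"], dtype=int)
y = LabelEncoder().fit_transform(encoded.pop("NObeyesdad"))
X = encoded

print(X.shape, list(y))
```

In a real pipeline the scaler and any resampling step such as SMOTE would normally be fitted on the training split only, so that statistics from the test data do not leak into training.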

Results & Findings

Model	Accuracy	Key Insight
Logistic Regression	86.7%	Strong, interpretable baseline
SVM	88.6%	Captured non-linear relations
Random Forest	89.2%	Robust to noise; provides feature ranking
LightGBM	90%+	Best balance of speed, scalability, and accuracy
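
A comparison like the one in the table could be produced along these lines. This is a sketch under stated assumptions: it uses a synthetic 7-class dataset from scikit-learn rather than the obesity data, covers only three of the listed models, and the choice of macro averaging and one-vs-rest AUC is an assumption, since the write-up does not say how the multiclass metrics were averaged. The scores it prints will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the obesity data: 7 classes, 16 features.
X, y = make_classification(n_samples=1400, n_features=16, n_informative=10,
                           n_redundant=2, n_classes=7, n_clusters_per_class=1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Scale inside a pipeline so the scaler is fitted on the training split only.
models = {
    "Logistic Regression": make_pipeline(MinMaxScaler(),
                                         LogisticRegression(max_iter=2000)),
    "SVM": make_pipeline(MinMaxScaler(),
                         SVC(kernel="rbf", probability=True, random_state=42)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)  # needed for multiclass AUC
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, average="macro", zero_division=0),
        "recall": recall_score(y_test, pred, average="macro", zero_division=0),
        "f1": f1_score(y_test, pred, average="macro", zero_division=0),
        "auc": roc_auc_score(y_test, proba, multi_class="ovr", average="macro"),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```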

✅ Key Outcomes:

LightGBM achieved the highest overall accuracy and cross-validation score, excelling in multiclass generalization.
The model revealed BMI, age, and activity frequency as dominant predictors.
Demonstrated the feasibility of scalable ML systems for personalized obesity risk assessment.

🔮 Future Work:
Enhance interpretability via SHAP/LIME, extend to real-time wellness dashboards, and explore hybrid stacking with neural networks for improved multiclass sensitivity.