🩺 Multiclass Obesity Prediction using LightGBM
📌 Goal:
Predict individual obesity levels (7 classes) from lifestyle, demographic, and dietary features using advanced machine learning classifiers, with LightGBM achieving the best overall accuracy.
🧠 Domain: Health Informatics & Predictive Analytics
🎯 Task: Multiclass Classification
📂 Dataset: UCI/Kaggle Obesity Dataset (~21K records, 17 features)
Project Domain: Machine Learning
Task: Classification and Prediction
The Goal:
This study explores obesity prediction as a complex health analytics problem integrating behavioral, demographic, and physiological factors. Traditional measures like BMI often fail to represent multidimensional risk. The goal was to design a robust ML pipeline that classifies individuals into seven obesity levels — from Insufficient Weight to Obesity Class III — supporting precision healthcare and personalized intervention design.
The Challenge:
Methodology & Process
Data Preparation:
Cleaned 21,758 samples from the UCI/Kaggle datasets, handled missing data and outliers (IQR method), and applied SMOTE for class balance.
Derived BMI and normalized numerical features with Min-Max scaling.
Feature Engineering:
Encoded categorical variables (gender, activity, diet, transport) using one-hot and label encoders.
Models Implemented:
Logistic Regression, KNN, SVM, Naive Bayes, Decision Tree, Random Forest, XGBoost, CatBoost, and LightGBM.
Optimization:
Tuned hyperparameters with Optuna and evaluated each model on Accuracy, Precision, Recall, F1-score, and AUC.
Implementation Stack: Python | Scikit-learn | Optuna | LightGBM | Matplotlib
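The numeric preparation steps named above (deriving BMI, IQR-based outlier handling, and Min-Max scaling) can be illustrated with a small dependency-free sketch. The function names, the 1.5 × IQR fence multiplier, the choice to drop rather than cap outliers, and the toy records are all assumptions made for illustration; the source does not state these details, and SMOTE and categorical encoding are left out here.

```python
from statistics import quantiles


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical (weight_kg, height_m) records; the last weight is a deliberate outlier.
records = [(50.0, 1.60), (60.0, 1.70), (70.0, 1.75), (80.0, 1.80), (250.0, 1.70)]

# 1) Drop rows whose weight falls outside the IQR fences (assumed policy).
weights = [w for w, _ in records]
low, high = iqr_bounds(weights)
kept = [(w, h) for w, h in records if low <= w <= high]

# 2) Derive BMI for the remaining rows, then 3) scale it to [0, 1].
bmis = [bmi(w, h) for w, h in kept]
scaled_bmis = min_max_scale(bmis)
```

In a real pipeline the scaling range would normally be fitted on the training split only and reused on validation and test data, so that no information leaks across splits.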
The Result
Results & Findings
| Model | Accuracy | Key Insight |
|---|---|---|
| Logistic Regression | 86.7% | Strong baseline, interpretable |
| SVM | 88.6% | Captured non-linear relations |
| Random Forest | 89.2% | Robust to noise, feature ranking |
| LightGBM | 90%+ | Best balance of speed, scalability, and accuracy |
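For readers who want to see how the accuracy column and the precision, recall, and F1 metrics relate, here is a minimal dependency-free sketch that computes them from paired true and predicted labels. Macro averaging (an unweighted mean over classes), the function names, and the three-class toy labels are assumptions for illustration; the source does not say which averaging scheme was used.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical labels, reduced to three classes to keep the example short.
y_true = ["Normal", "Normal", "Overweight", "Overweight", "Obesity", "Obesity"]
y_pred = ["Normal", "Overweight", "Overweight", "Overweight", "Obesity", "Normal"]

acc = accuracy(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred)
```

Accuracy alone can hide weak performance on individual classes, which is why per-class metrics are usually reported alongside it in a seven-class problem like this one.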
✅ Key Outcomes:
LightGBM achieved the highest overall accuracy and cross-validation score, excelling in multiclass generalization.
The model revealed BMI, age, and activity frequency as dominant predictors.
The project demonstrated the feasibility of scalable ML systems for personalized obesity risk assessment.
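The cross-validation score mentioned in the outcomes depends on how samples are split into folds. The source does not say which scheme was used; the sketch below assumes stratified k-fold, a common choice for multiclass problems because each fold keeps roughly the same class proportions. The function name and toy labels are hypothetical, and shuffling is omitted for brevity.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical imbalanced labels: six samples of class "A", three of class "B".
labels = ["A"] * 6 + ["B"] * 3
folds = stratified_folds(labels, k=3)

# Each fold serves once as the validation set; the remaining folds form the training set.
splits = []
for i, val_idx in enumerate(folds):
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    splits.append((train_idx, val_idx))
```

A model's cross-validation score would then be the mean of its metric across the `k` validation folds.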
🔮 Future Work:
Enhance interpretability via SHAP/LIME, extend to real-time wellness dashboards, and explore hybrid stacking with neural networks for improved multi-class sensitivity.