Chemicals in Cosmetics – Predictive Analytics with Regression Models
Predict chemical presence and concentration patterns in cosmetic formulations using machine learning regression models on the California Safe Cosmetics Program dataset.
🧠 Domain: Data Science & Predictive Analytics
🎯 Task: Regression Modeling
📂 Dataset: California Safe Cosmetics Program (≈114K entries, 20 features)
The Goal:
This project analyzes the California Safe Cosmetics Program dataset to predict the number of chemicals present in cosmetic products. The motivation was to assess potential chemical exposure risks through predictive analytics. By leveraging regression-based machine learning, the study investigates how factors like product type, category, and brand relate to chemical usage intensity, offering insight into product safety and formulation trends.
The Challenge:
The dataset (≈114K entries, 20 features) underwent thorough preprocessing: duplicate removal, null handling, label encoding, and year-based feature extraction from report dates. Outliers were detected with Z-score filtering, after which feature correlations were analyzed and the encoded features were prepared for model training.
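The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`BrandName`, `PrimaryCategory`, `InitialDateReported`, `ChemicalCount`), the `preprocess` helper, the threshold of 3 standard deviations, and the toy rows are all assumptions made for the example.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    """Hypothetical cleaning helper; column names are assumed, not confirmed."""
    # Duplicate removal and null handling (here: simply drop incomplete rows).
    out = df.drop_duplicates().dropna().copy()

    # Year-based feature extracted from the report date.
    out["ReportYear"] = pd.to_datetime(out["InitialDateReported"]).dt.year

    # Label encoding: map each category to an integer code.
    for col in ["BrandName", "PrimaryCategory"]:
        out[col] = out[col].astype("category").cat.codes

    # Z-score filtering: keep rows whose target lies within z_threshold
    # population standard deviations of the mean.
    target = out["ChemicalCount"]
    z = (target - target.mean()) / target.std(ddof=0)
    return out[z.abs() <= z_threshold].reset_index(drop=True)


# Toy data standing in for the real dataset (illustrative values only).
rows = [
    {
        "BrandName": f"Brand{i % 5}",
        "PrimaryCategory": "Makeup" if i % 2 == 0 else "Hair Care",
        "InitialDateReported": "06/17/2009" if i < 10 else "03/02/2015",
        "ChemicalCount": 1,
    }
    for i in range(20)
]
rows[-1]["ChemicalCount"] = 50               # one extreme value for the Z-score filter
rows.append(dict(rows[0]))                   # an exact duplicate
rows.append({**rows[1], "BrandName": None})  # a row with a missing value

raw = pd.DataFrame(rows)
clean = preprocess(raw)
```

Dropping incomplete rows is only one way to handle nulls; imputation may suit the real data better, and the source does not say which was used.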
Multiple regression algorithms were tested:
Linear, Lasso, Ridge, and ElasticNet as baseline linear models.
Random Forest, LightGBM, and CatBoost Regressors for non-linear ensemble learning.
Each model was trained with an 80/20 split, tuned via cross-validation, and evaluated using MAE, MSE, and R² metrics.
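A rough sketch of this evaluation loop, using scikit-learn on synthetic data, might look like the following. The data, the two models shown, and their hyperparameters are assumptions chosen for illustration. Cross-validation here only scores each model on the training split; it does not reproduce whatever tuning the project performed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data with a non-linear target (NOT the cosmetics dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 3))
y = np.sin(X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=500)

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One linear baseline and one non-linear ensemble (illustrative settings).
models = {
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validated R² on the training split only.
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "cv_r2": cv_r2,
        "mae": mean_absolute_error(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

for name, s in scores.items():
    print(f"{name}: R²={s['r2']:.2f}, MAE={s['mae']:.2f}, MSE={s['mse']:.2f}")
```

The same loop would extend to Lasso, ElasticNet, LightGBM, or CatBoost by adding entries to `models`, provided the corresponding libraries are installed.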
The Result:
Linear models achieved limited explanatory power (R² ≈ 0.11), while ensemble models performed significantly better:
Random Forest: R² ≈ 0.68, MAE ≈ 0.17
LightGBM: R² ≈ 0.81, MAE ≈ 0.13
CatBoost: R² ≈ 0.83, MAE ≈ 0.12
| Model | R² | MAE | Remarks |
|---|---|---|---|
| Linear / Ridge | 0.11 | 0.38 | Weak linear relationships |
| Random Forest | 0.68 | 0.17 | Captured moderate non-linear trends |
| LightGBM | 0.81 | 0.13 | Efficient and high accuracy |
| CatBoost | 0.83 | 0.12 | Best performer with strong categorical handling |
CatBoost Regressor emerged as the most accurate and computationally efficient model, thanks to its native handling of categorical variables and strong generalization ability.
Conclusion: Gradient boosting methods, particularly CatBoost, effectively model chemical-composition variability across cosmetic categories, demonstrating the potential of machine learning in predictive product safety analytics.








