Chemicals in Cosmetics – Predictive Analytics with Regression Models

Predict chemical presence and concentration patterns in cosmetic formulations using machine learning regression models on the California Safe Cosmetics Program dataset.

🧠 Domain: Data Science & Predictive Analytics
🎯 Task: Regression Modeling
📂 Dataset: California Safe Cosmetics Program (≈114K entries, 20 features)

Project Domain

Machine Learning

Task

Prediction and Classification

The Goal:

How harmful are the chemicals in cosmetics?

This project analyzes the California Safe Cosmetics Program dataset to predict the number of chemicals present in cosmetic products. The motivation was to assess potential chemical exposure risks through predictive analytics. By leveraging regression-based machine learning, the study investigates how factors like product type, category, and brand relate to chemical usage intensity, offering insight into product safety and formulation trends.

The Challenge:

The dataset (≈114K entries, 20 features) underwent thorough preprocessing: duplicate removal, null handling, label encoding, and extraction of a year feature from report dates. Outliers were detected and removed using Z-score filtering, after which feature correlations were examined and the encoded features were prepared for model training.
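The preprocessing steps above can be sketched roughly as follows with pandas. This is a minimal illustration, not the project's actual code: the column names (brand, reported, chemical_count), the choice to drop rows with a missing date or target, the "Unknown" fill value, and the Z-score cutoff of 3 are all assumptions made for the example.

```python
import pandas as pd


def preprocess(df, categorical_cols, date_col, target_col, z_max=3.0):
    """Illustrative cleaning pipeline; all column names are hypothetical."""
    # 1. Remove exact duplicate rows.
    out = df.drop_duplicates().copy()

    # 2. Null handling (assumed policy): drop rows missing the date or the
    #    target, and fill missing categorical values with a placeholder.
    out = out.dropna(subset=[date_col, target_col])
    for col in categorical_cols:
        out[col] = out[col].fillna("Unknown")

    # 3. Year-based feature extracted from the report date.
    out["report_year"] = pd.to_datetime(out[date_col]).dt.year

    # 4. Label encoding: map each category to an integer code.
    for col in categorical_cols:
        out[col] = out[col].astype("category").cat.codes

    # 5. Z-score filtering on the target: keep rows within z_max standard
    #    deviations of the mean (a constant column would yield NaN scores).
    z = (out[target_col] - out[target_col].mean()) / out[target_col].std(ddof=0)
    return out[z.abs() <= z_max].reset_index(drop=True)


# Tiny synthetic frame: one duplicated row and one extreme value.
raw = pd.DataFrame({
    "brand": list("abcdefghij") + ["j"],
    "reported": ["01/15/2010"] * 11,
    "chemical_count": [1] * 9 + [100, 100],
})
clean = preprocess(raw, ["brand"], "reported", "chemical_count", z_max=2.0)
```

With very small samples a Z-score can never get far above 3, which is why the toy call lowers z_max to 2.0; on a dataset of this size a cutoff of 3 is the more common choice.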

Multiple regression algorithms were tested:

Linear, Lasso, Ridge, and ElasticNet as baseline linear models.
Random Forest, LightGBM, and CatBoost Regressors for non-linear ensemble learning.

Each model was trained on an 80/20 train/test split, tuned via cross-validation, and evaluated using MAE, MSE, and R² metrics.
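A hedged sketch of that training and evaluation loop, using scikit-learn on synthetic data, is shown here. The evaluate helper, the parameter grids, the 5-fold setting, and the scoring choice are assumptions for illustration; the project's actual hyperparameters are not stated. LightGBM and CatBoost regressors expose a scikit-learn-style fit/predict interface and could be passed to the same helper, but they are left out to keep the example dependency-light.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def evaluate(model, param_grid, X, y, seed=42):
    """Hypothetical helper: 80/20 split, CV tuning, then held-out metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # 5-fold cross-validation over the grid, run on the training portion only.
    search = GridSearchCV(
        model, param_grid, cv=5, scoring="neg_mean_absolute_error"
    )
    search.fit(X_train, y_train)
    pred = search.predict(X_test)  # uses the refit best estimator
    return {
        "best_params": search.best_params_,
        "mae": mean_absolute_error(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }


# Synthetic, deliberately non-linear target (not the cosmetics data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

results = {
    "Ridge": evaluate(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, X, y),
    "RandomForest": evaluate(
        RandomForestRegressor(n_estimators=50, random_state=0),
        {"max_depth": [4, 8]},
        X,
        y,
    ),
}
```

Tuning only on the training portion keeps the 20% test set untouched until the final metrics are computed, so the reported scores are not inflated by the hyperparameter search.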

The Result:

Linear models achieved limited explanatory power (R² ≈ 0.11), while ensemble models performed substantially better:

Random Forest: R² ≈ 0.68, MAE ≈ 0.17
LightGBM: R² ≈ 0.81, MAE ≈ 0.13
CatBoost: R² ≈ 0.83, MAE ≈ 0.12

Model	R²	MAE	Remarks
Linear / Ridge	0.11	0.38	Weak linear relationships
Random Forest	0.68	0.17	Captured moderate non-linear trends
LightGBM	0.81	0.13	Efficient and high accuracy
CatBoost	0.83	0.12	Best performer with strong categorical handling
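For readers less familiar with the metrics in the table, the following plain-Python sketch shows how MAE, MSE, and R² are conventionally computed. The function names and the toy numbers are illustrative only and are not taken from the project.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [1, 2, 3]
predicted = [1, 2, 4]
scores = (mae(actual, predicted), mse(actual, predicted), r2(actual, predicted))
```

Read this way, an R² of 0.11 means the linear models explain about 11% of the variance in the target, while 0.83 for CatBoost corresponds to about 83%. A model that always predicts the mean of the target scores an R² of 0.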

CatBoost Regressor emerged as the most accurate and computationally efficient model, thanks to its native handling of categorical variables and strong generalization ability.

Conclusion: Gradient boosting methods, particularly CatBoost, effectively model chemical-composition variability across cosmetic categories, demonstrating the potential of machine learning in predictive product safety analytics.