Chemicals in Cosmetics – Predictive Analytics with Regression Models

Predict chemical presence and concentration patterns in cosmetic formulations using machine learning regression models on the California Safe Cosmetics Program dataset.

🧠 Domain: Data Science & Predictive Analytics
🎯 Task: Regression Modeling
📂 Dataset: California Safe Cosmetics Program (≈114K entries, 20 features)

Project Domain

Machine Learning

Task

Prediction and Classification


The Goal:

How are chemicals in cosmetics harmful?


This project analyzes the California Safe Cosmetics Program dataset to predict the number of chemicals present in cosmetic products. The motivation was to assess potential chemical exposure risks through predictive analytics. By leveraging regression-based machine learning, the study investigates how factors like product type, category, and brand relate to chemical usage intensity, offering insight into product safety and formulation trends.


The Challenge:

The dataset (≈114K entries, 20 features) underwent thorough preprocessing: duplicate removal, null handling, label encoding, and year-based feature extraction from report dates. Outliers were detected using Z-score filtering, followed by feature-correlation analysis and encoding of the features for model training.
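The preprocessing steps above can be sketched roughly as follows. This is a dependency-free illustration, not the project's actual code: the column names (`brand`, `category`, `report_date`, `chemical_count`), the date format, the `"Unknown"` fill value, and the Z-score threshold of 3 are all assumptions, and in practice a library such as pandas would do the same work.

```python
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical column names; the real dataset's schema may differ.
COLUMNS = ("brand", "category", "report_date", "chemical_count")
CATEGORICAL = ("brand", "category")

def preprocess(rows, z_threshold=3.0):
    # 1. Duplicate removal: keep the first occurrence of each identical row.
    seen, unique = set(), []
    for row in rows:
        key = tuple(row[c] for c in COLUMNS)
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not mutated

    # 2. Null handling: fill missing categorical values with a placeholder.
    for row in unique:
        for col in CATEGORICAL:
            if row[col] is None:
                row[col] = "Unknown"

    # 3. Year-based feature extraction from the report date (assumed MM/DD/YYYY).
    for row in unique:
        row["report_year"] = datetime.strptime(row["report_date"], "%m/%d/%Y").year

    # 4. Z-score filtering: drop rows whose target lies far from the mean.
    values = [row["chemical_count"] for row in unique]
    mu, sigma = mean(values), pstdev(values)
    if sigma > 0:
        unique = [r for r in unique
                  if abs((r["chemical_count"] - mu) / sigma) <= z_threshold]

    # 5. Label encoding: map each category to an integer code (sorted order).
    encoders = {}
    for col in CATEGORICAL:
        codes = {v: i for i, v in enumerate(sorted({r[col] for r in unique}))}
        encoders[col] = codes
        for r in unique:
            r[col + "_code"] = codes[r[col]]
    return unique, encoders

# Toy data: 20 ordinary rows, one exact duplicate, one missing brand, one outlier.
rows = [
    {"brand": f"Brand{i % 4}", "category": "Makeup" if i % 2 else "Hair Care",
     "report_date": f"01/{i + 1:02d}/2010", "chemical_count": 1 + i % 3}
    for i in range(20)
]
rows.append(dict(rows[0]))
rows.append({"brand": None, "category": "Makeup",
             "report_date": "05/05/2015", "chemical_count": 2})
rows.append({"brand": "Brand1", "category": "Makeup",
             "report_date": "07/09/2018", "chemical_count": 60})

clean, encoders = preprocess(rows)
print(len(rows), "->", len(clean), "rows after cleaning")
```

The order of steps here (filter outliers before fitting the label encoders) is one reasonable choice, so that codes are only assigned to categories that survive filtering; the original pipeline may have ordered them differently.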

Multiple regression algorithms were tested:

  • Linear, Lasso, Ridge, and ElasticNet as baseline linear models.

  • Random Forest, LightGBM, and CatBoost Regressors for non-linear ensemble learning.

Each model was trained with an 80/20 train/test split, tuned via cross-validation, and evaluated using MAE, MSE, and R².
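As a rough illustration of this evaluation protocol, the sketch below implements an 80/20 shuffle split, k-fold index generation, and the three metrics in plain Python. The function names, the fixed seed, the choice of five folds, and the toy numbers are assumptions for illustration; a real pipeline would normally use library routines such as scikit-learn's `train_test_split`, `KFold`, and `r2_score`.

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# 80/20 split over 100 hypothetical samples, then 5 folds on the training part.
train_idx, test_idx = train_test_split_indices(100)
cv_folds = list(k_fold_indices(train_idx, k=5))

# Toy predictions to exercise the metrics.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(len(train_idx), len(test_idx), len(cv_folds))
print(mae(y_true, y_pred), mse(y_true, y_pred), r2(y_true, y_pred))
```

Keeping the test indices out of the cross-validation folds means hyperparameters are tuned only on the 80% training portion, and the held-out 20% is used once for the final scores.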


The Result:

Linear models achieved limited explanatory power (R² ≈ 0.11), while ensemble models performed significantly better:

  • Random Forest: R² ≈ 0.68, MAE ≈ 0.17

  • LightGBM: R² ≈ 0.81, MAE ≈ 0.13

  • CatBoost: R² ≈ 0.83, MAE ≈ 0.12

Model          | R²   | MAE  | Remarks
Linear / Ridge | 0.11 | 0.38 | Weak linear relationships
Random Forest  | 0.68 | 0.17 | Captured moderate non-linear trends
LightGBM       | 0.81 | 0.13 | Efficient and high accuracy
CatBoost       | 0.83 | 0.12 | Best performer with strong categorical handling

CatBoost Regressor emerged as the most accurate and computationally efficient model, thanks to its native handling of categorical variables and strong generalization ability.
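To give a feel for what native handling of categorical variables can mean, here is a simplified sketch of ordered target statistics, the idea CatBoost builds on to turn a category into a number: each row is encoded from the targets of rows that precede it, so a row's own target never leaks into its feature. This is an illustration only, not CatBoost's implementation; the prior weight, the use of the given row order in place of random permutations, and the toy data are assumptions.

```python
def ordered_target_statistics(categories, targets, prior_weight=1.0):
    """Encode each category using only the targets of earlier rows."""
    prior = sum(targets) / len(targets)  # global target mean used as the prior
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of previously seen targets for this category.
        encoded.append((s + prior_weight * prior) / (n + prior_weight))
        # Only now add the current row's target to the running statistics.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

# Hypothetical brands and chemical counts.
brands = ["A", "B", "A", "A", "B"]
chemical_counts = [1.0, 3.0, 2.0, 3.0, 1.0]
encoded = ordered_target_statistics(brands, chemical_counts)
print(encoded)
```

The first time a category appears it is encoded as the prior alone; later rows drift toward that category's running mean, which is what lets a tree model use high-cardinality fields such as brand without one-hot encoding.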

Conclusion: Gradient boosting methods, particularly CatBoost, effectively model chemical-composition variability across cosmetic categories, demonstrating the potential of machine learning in predictive product safety analytics.


Let's Connect

Let's Work Together

Project Collaboration

Projects in Generative AI, ML and Imaging using advanced computational methods

Mentorship and Guidance

Open to joining ongoing publications, supervision, and interdisciplinary projects exploring deep learning and scientific computing
