🫁 Lung Cancer Detection Using Deep Learning

📌 Goal:
Develop an automated model for early lung cancer detection using CT scans and patient attributes to assist clinicians in rapid, accurate diagnosis.

🧠 Domain: Medical Imaging & Predictive Analytics
🎯 Task: Classification (Cancer vs. Non-Cancer)
📂 Dataset: Kaggle – Lung Cancer Prediction Dataset (284 samples, 16 attributes)

Project Domain

Machine Learning

Task

Classification and Prediction

The Goal:

Why? and What?

Early detection of lung cancer greatly improves survival rates, but manual diagnosis from CT scans and patient history is time-intensive. This project applies classical ML models and gradient-boosted ensembles to predict lung cancer risk from demographic, behavioral, and symptom-based features. The objective was to design a system that automatically classifies patients as cancerous or non-cancerous, giving radiologists a data-driven second opinion.

The Challenge:

Methodology & Process

Data Source: 284 patient records (16 clinical attributes: age, smoking, coughing, chest pain, shortness of breath, etc.).
Preprocessing: Duplicate removal, categorical encoding, outlier filtering (IQR), and class rebalancing via SMOTE to address the imbalance (Yes: 238 / No: 38, which sums to 276 labeled records, roughly a 6:1 ratio).
Modeling Approaches:
- Classical ML: KNN, SVC, Decision Tree, Random Forest, XGBoost, LightGBM, Gradient Boosting.
- Advanced ensemble: CatBoost Classifier tuned with GridSearchCV and 5-fold cross-validation.
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, AUC.
Tools: Python (Sklearn, CatBoost, LightGBM, Matplotlib, Seaborn).
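
The IQR filter and SMOTE step named above can be illustrated with a small standard-library sketch. This is not the project's code: the project presumably relied on library implementations, and the helper names (`iqr_bounds`, `drop_outliers`, `smote_like`), the 1.5 multiplier, the neighbour count, and the toy data are all assumptions made for illustration. The oversampler shows only SMOTE's core idea, interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column):
    """Keep only rows whose value in `column` lies inside the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows])
    return [row for row in rows if low <= row[column] <= high]


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours by squared Euclidean distance, excluding base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical toy data, not the Kaggle records.
patients = [{"age": a} for a in (55, 58, 60, 61, 62, 63, 64, 66, 70, 21)]
kept = drop_outliers(patients, "age")  # the age-21 row falls below the fence

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
```

Because interpolation produces fractional values, applying it to binary symptom flags yields features that are no longer strictly 0 or 1. Oversampling is also normally applied to the training split only, so that synthetic points do not leak into evaluation.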

Model	Accuracy	Precision	Recall	F1-Score	AUC
KNN	0.93	0.93	0.92	0.93	0.93
SVC	0.95	0.95	0.95	0.95	0.95
Decision Tree / Random Forest	0.96	0.96	0.96	0.96	0.96
XGBoost / LightGBM	0.94 – 0.95	0.94	0.94	0.94	0.95
CatBoost (Best)	0.97	0.97	0.97	0.97	0.97
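
For readers less familiar with the columns in the table, the following sketch shows how each metric is derived from predictions. The confusion counts and scores are hypothetical and are not taken from the project's results; the function names are illustrative only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores for illustration only.
m = classification_metrics(tp=45, fp=3, fn=2, tn=50)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In a screening context recall deserves particular attention, since a false negative corresponds to a missed cancer case.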

The Result

✅ Key Insight:
The CatBoost Classifier achieved the top performance (96.9% accuracy, F1 ≈ 0.97), and 5-fold cross-validation gave a mean accuracy of ≈ 94.6%, indicating that the result generalizes reasonably well across splits.
Its native handling of categorical features and built-in regularization against overfitting make it a strong candidate for clinical decision-support scenarios.
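
To make the 5-fold figure easier to interpret, here is a minimal standard-library sketch of stratified k-fold evaluation. It is an illustration under assumptions, not the project's pipeline: `stratified_folds`, `cross_val_mean`, and `majority_baseline` are hypothetical helpers, and the placeholder scorer simply predicts the majority class. Only the class counts (238 / 38) come from the write-up.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so each fold keeps similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)  # deal indices round-robin
    return folds


def cross_val_mean(labels, score_fold, n_folds=5):
    """Call score_fold(train_idx, test_idx) once per fold; return the mean."""
    folds = stratified_folds(labels, n_folds)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(score_fold(train_idx, test_idx))
    return sum(scores) / len(scores)


# Class counts as reported in the write-up: 238 positive, 38 negative.
labels = [1] * 238 + [0] * 38
folds = stratified_folds(labels)


def majority_baseline(train_idx, test_idx):
    """Placeholder scorer: accuracy of always predicting the majority class."""
    positives = sum(labels[i] for i in train_idx)
    majority = 1 if 2 * positives >= len(train_idx) else 0
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


baseline = cross_val_mean(labels, majority_baseline)
```

On the unbalanced labels, this trivial baseline already reaches roughly 86% accuracy, which is why precision, recall, and AUC are worth reading alongside accuracy when judging the reported scores.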

📊 Visuals:
Age vs. Cancer Density Plots • Correlation Heatmap • ROC Curves for All Models • Confusion Matrix Analysis.

🏁 Conclusion:
This study demonstrates how ensemble-based machine learning methods, especially the gradient-boosted CatBoost classifier, can effectively model patient-level cancer risk and enable interpretable, automated detection pipelines for medical diagnostics.