🫁 Lung Cancer Detection Using Deep Learning
📌 Goal:
Develop an automated model for early lung cancer detection using CT scans and patient attributes to assist clinicians in rapid, accurate diagnosis.
🧠 Domain: Medical Imaging & Predictive Analytics
🎯 Task: Classification (Cancer vs. Non-Cancer)
📂 Dataset: Kaggle – Lung Cancer Prediction Dataset (284 samples, 16 attributes)
Project Domain: Machine Learning
Task: Classification and Prediction
The Goal:
Early detection of lung cancer greatly improves survival rates, but manual diagnosis from CT scans and patient history is time-intensive. This project leverages deep learning and traditional ML methods to predict lung cancer risk using demographic, behavioral, and symptom-based features. The objective was to design a system that automatically classifies patients into cancerous or non-cancerous categories, supporting radiologists with data-driven decision-making.
The Challenge:
Methodology & Process
Data Source: 284 patient records (16 clinical attributes: age, smoking, coughing, chest pain, shortness of breath, etc.).
Preprocessing: Duplicate removal, categorical encoding, outlier filtering (IQR), and class rebalancing via SMOTE to handle imbalance (Yes: 238 / No: 38).
Modeling Approaches:
Classical ML: KNN, SVC, Decision Tree, Random Forest, XGBoost, LightGBM, Gradient Boosting.
Advanced ensemble: CatBoost Classifier tuned with GridSearchCV and 5-fold cross-validation.
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, AUC.
Tools: Python (Sklearn, CatBoost, LightGBM, Matplotlib, Seaborn).
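As a rough illustration of the outlier-filtering and rebalancing steps listed above, the sketch below uses only the Python standard library. The function names `iqr_bounds` and `smote_like`, the 1.5 IQR multiplier, the neighbour count, and all toy values are assumptions made for illustration, not the project's code; a real pipeline would more likely call `SMOTE` from the imbalanced-learn package.

```python
import random
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a random
    minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy data (hypothetical): one numeric column with an obvious outlier.
ages = [50, 55, 60, 62, 64, 66, 70, 72, 150]
low, high = iqr_bounds(ages)
kept = [a for a in ages if low <= a <= high]

# Hypothetical minority-class rows; 238 - 38 synthetic rows would be
# needed to match the class counts quoted above.
minority = [[60.0, 1.0], [62.0, 0.0], [65.0, 1.0], [70.0, 1.0]]
synthetic = smote_like(minority, n_new=238 - 38)
print(len(kept), len(synthetic))
```

Because each synthetic row lies on a segment between two real minority rows, encoded categorical columns can take fractional values; how the project handled that is not stated.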
| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| KNN | 0.93 | 0.93 | 0.92 | 0.93 | 0.93 |
| SVC | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
| Decision Tree / Random Forest | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 |
| XGBoost / LightGBM | 0.94 – 0.95 | 0.94 | 0.94 | 0.94 | 0.95 |
| CatBoost (Best) | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 |
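For readers unfamiliar with the columns above, the following minimal sketch shows how accuracy, precision, recall, and F1 are derived from a binary confusion matrix. The function name `classification_metrics` and the counts passed to it are hypothetical and are not the project's results; AUC is omitted because it needs predicted scores rather than counts, and the table may report averaged per-class values, which this sketch does not attempt.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=45, fp=2, fn=3, tn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```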
The Result
✅ Key Insight:
The CatBoost Classifier achieved the top performance (96.9% accuracy, F1 ≈ 0.97), and 5-fold cross-validation (mean ≈ 94.6%) indicates that it generalizes well.
Its robust handling of categorical features and reduced overfitting make it highly suitable for clinical deployment scenarios.
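The cross-validation figure depends on how the five folds are drawn. The sketch below shows one common choice, a stratified split that keeps the class ratio roughly equal in every fold. Whether the project used stratified folds is not stated; the function name `stratified_kfold`, the seed, and the toy labels (which simply reuse the 238 / 38 class counts from the preprocessing step) are assumptions. In scikit-learn, `StratifiedKFold` plays this role.

```python
import random


def stratified_kfold(labels, n_splits=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs in which
    each test fold holds a near-equal share of every class."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into folds.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    splits = []
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        splits.append((train, test))
    return splits


# Toy labels mirroring the quoted class counts (1 = Yes, 0 = No).
labels = [1] * 238 + [0] * 38
for train, test in stratified_kfold(labels):
    n_minority = sum(1 for i in test if labels[i] == 0)
    print(len(train), len(test), n_minority)
```

A model would be fitted on each `train` index set and scored on the matching `test` set, with the five scores averaged to give the reported mean.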
📊 Visuals:
Age vs. Cancer Density Plots • Correlation Heatmap • ROC Curves for All Models • Confusion Matrix Analysis.
🏁 Conclusion:
This study demonstrates how ensemble-based machine learning methods, especially CatBoost, can effectively model patient-level cancer risk and enable interpretable, automated detection pipelines for medical diagnostics.








