This project involves data preparation, data balancing, feature selection, and classification model building to predict cancer diagnoses using the 2019 Behavioral Risk Factor Surveillance System (BRFSS) survey data from the CDC. The objective is to develop machine learning models that classify individuals based on their health attributes and determine the best-performing model for cancer prediction.
- Introduction
- Key Results Summary
- Data Preprocessing
- Data Balancing
- Feature Selection
- Classification Models
- Performance Metrics
- Hyperparameter Tuning
- Ensemble Method
- Conclusion & Future Improvements
- Predict cancer diagnoses using machine learning classification models.
- Evaluate different feature selection and data balancing techniques to improve model performance.
- Use ensemble methods to further refine predictions.
- Source: 2019 Behavioral Risk Factor Surveillance System (BRFSS)
- Target Variable: Whether an individual was ever told they had cancer (Y = Yes, N = No).
- Best Model: Naive Bayes on the `ros_chi_sq` dataset (Random Over-Sampling + Chi-Square Feature Selection).
- True Positive Rate (TPR):
  - Class 0 (No Cancer): 67%
  - Class 1 (Cancer Diagnosis): 64%
- Weighted AUC: 0.72
Data preprocessing involved multiple steps:
1️⃣ Data Cleaning
- Removed unnecessary columns.
- Handled missing values: `KNN` imputation for numeric, mode for categorical, and median for ordinal attributes.
- Scaled numeric attributes (e.g., weight, height) to [0, 1] using Min-Max Scaling.
- Encoded the target variable (`Y = 1`, `N = 0`). A sketch of these steps follows below.
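A minimal R sketch of these cleaning steps, assuming a caret-based workflow and hypothetical names (`df`, `numeric_cols`, `educa`); this is illustrative, not the project's exact code:

```r
library(caret)

# KNN imputation for numeric attributes (caret's knnImpute also centers
# and scales the columns as part of its distance computation)
num_pre <- preProcess(df[numeric_cols], method = "knnImpute")
df[numeric_cols] <- predict(num_pre, df[numeric_cols])

# Min-Max scaling to [0, 1] for numeric attributes such as weight and height
rng_pre <- preProcess(df[numeric_cols], method = "range")
df[numeric_cols] <- predict(rng_pre, df[numeric_cols])

# Mode imputation for a categorical attribute (hypothetical column name)
impute_mode <- function(x) replace(x, is.na(x), names(which.max(table(x))))
df$educa <- impute_mode(df$educa)

# Encode the target variable: Y = 1, N = 0
df$cancer <- ifelse(df$cancer == "Y", 1L, 0L)
```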
2️⃣ Outlier Treatment
- The Interquartile Range (IQR) method was used to cap outliers, as sketched below.
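A minimal sketch of IQR capping (winsorizing at the conventional 1.5 × IQR fences; the project's exact multiplier is an assumption):

```r
# Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the fence values
cap_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

df$weight <- cap_iqr(df$weight)  # hypothetical column name
```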
3️⃣ Data Splitting
- 80/20 split into training and testing datasets, as sketched below.
- Verified no missing values or duplicated records after splitting.
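A sketch of a stratified 80/20 split with caret, plus the post-split checks (the seed is an assumption):

```r
library(caret)

set.seed(42)  # assumed seed for reproducibility
idx   <- createDataPartition(factor(df$cancer), p = 0.8, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]

# Verify: no missing values and no duplicated records after splitting
stopifnot(!anyNA(train), !anyNA(test))
stopifnot(!any(duplicated(rbind(train, test))))
```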
To address class imbalance, two techniques were applied (a sketch follows the class counts below):
- Random Over-Sampling (ROS): Increases the minority class to match the majority class.
- Random Under-Sampling (RUS): Decreases the majority class to match the minority class.
- ROS: Class 0 = 3315, Class 1 = 3315
- RUS: Class 0 = 685, Class 1 = 685
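A sketch using caret's `upSample`/`downSample` as stand-ins for ROS/RUS (the package choice is an assumption; recoding the target to `No`/`Yes` also gives caret valid class names for later probability estimates):

```r
library(caret)

y <- factor(train$cancer, levels = c(0, 1), labels = c("No", "Yes"))
x <- train[, setdiff(names(train), "cancer")]

ros <- upSample(x = x, y = y, yname = "cancer")    # both classes at 3315
rus <- downSample(x = x, y = y, yname = "cancer")  # both classes at 685
table(ros$cancer)
table(rus$cancer)
```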
To improve classification performance, multiple feature selection methods were used (a chi-squared example follows the list):
- Recursive Feature Elimination (RFE)
- Random Forest Importance
- Information Gain
- Boruta
- Chi-Squared Test
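As one example of the filters above, a chi-squared ranking with the FSelector package; the package and the cutoff `k = 50` (51 columns including the target, matching the table below) are assumptions:

```r
library(FSelector)

# Rank features by their chi-squared statistic against the target
weights <- chi.squared(cancer ~ ., data = ros)

# Keep the top k features plus the target column
top_features <- cutoff.k(weights, k = 50)
ros_chi_sq <- ros[, c(top_features, "cancer")]
```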
Resulting dataset dimensions:

| Dataset | Rows | Columns |
|---|---|---|
| ros_boruta_features | 6630 | 140 |
| ros_chi_sq_features | 6630 | 51 |
| rus_chi_sq_features | 1370 | 51 |
Six machine learning classification models were implemented (a training sketch follows the list):
- Logistic Regression (GLM)
- Decision Tree (RPART)
- Support Vector Machine (SVM)
- Random Forest (RF)
- Naive Bayes (NB)
- K-Nearest Neighbors (KNN)
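All six can be fit through caret's common `train()` interface, which the tuning parameters later in this README (`mtry`, `fL`, `kmax`, ...) suggest was used here; a sketch for three of the models, assuming a `No`/`Yes` target factor:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# The same interface fits every model; only `method` changes
nb_fit  <- train(cancer ~ ., data = ros_chi_sq, method = "nb",
                 metric = "ROC", trControl = ctrl)
rf_fit  <- train(cancer ~ ., data = ros_chi_sq, method = "rf",
                 metric = "ROC", trControl = ctrl)
glm_fit <- train(cancer ~ ., data = ros_chi_sq, method = "glm",
                 family = "binomial", metric = "ROC", trControl = ctrl)
```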
Performance metrics were stored as CSV files for future analysis.
Best model (Naive Bayes on the `ros_chi_sq` dataset):

| Metric | Class 0 | Class 1 | Weighted Avg |
|---|---|---|---|
| TPR | 0.613 | 0.712 | 0.628 |
| FPR | 0.287 | 0.387 | 0.303 |
| Precision | 0.922 | 0.250 | 0.819 |
| Recall | 0.613 | 0.712 | 0.629 |
| F1-Score | 0.737 | 0.370 | 0.681 |
| AUC | 0.717 | 0.717 | 0.717 |
Confusion matrix (rows = predicted, columns = actual; this orientation matches the metrics above, e.g., class 0 precision = 519 / (519 + 44) ≈ 0.922 and class 1 TPR = 109 / (44 + 109) ≈ 0.712):

| | Actual Class 0 | Actual Class 1 |
|---|---|---|
| Predicted Class 0 | 519 | 44 |
| Predicted Class 1 | 327 | 109 |
📈 ROC Curve for Naive Bayes (`ros_chi_sq` dataset)
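A sketch of how such a curve can be produced with the pROC package (plotting approach assumed, not taken from the project):

```r
library(pROC)

# Predicted probabilities for the positive class on the held-out test set;
# assumes test$cancer was recoded to No/Yes like the training data
probs <- predict(nb_fit, newdata = test, type = "prob")[, "Yes"]

roc_obj <- roc(response = test$cancer, predictor = probs,
               levels = c("No", "Yes"))
plot(roc_obj, print.auc = TRUE, main = "Naive Bayes ROC (ros_chi_sq)")
```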
To optimize model performance, 10-fold cross-validation was used.
🔹 Tuned Models & Parameters (a tuning sketch follows the list):
- Random Forest (`mtry`)
- Naive Bayes (`fL`, `usekernel`, `adjust`)
- Support Vector Machine (`C`, `sigma`)
- K-Nearest Neighbors (`kmax`, `distance`)
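A sketch of the tuning setup for two of the models; the grid values are assumptions, not the project's exact grids:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)

# Naive Bayes: Laplace smoothing, kernel-density option, bandwidth adjustment
nb_grid <- expand.grid(fL = c(0, 0.5, 1),
                       usekernel = c(TRUE, FALSE),
                       adjust = c(0.5, 1, 1.5))
nb_tuned <- train(cancer ~ ., data = ros_chi_sq, method = "nb",
                  trControl = ctrl, tuneGrid = nb_grid)

# Random Forest: number of variables sampled at each split
rf_tuned <- train(cancer ~ ., data = ros_chi_sq, method = "rf",
                  trControl = ctrl,
                  tuneGrid = expand.grid(mtry = c(2, 5, 10)))
```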
📉 Impact: Despite tuning, no significant improvements were observed, suggesting the initial configurations were already near-optimal.
An ensemble (stacking) approach was employed with Naive Bayes, Random Forest, and Logistic Regression as base learners; a sketch follows.
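A sketch of one way to build such a stack with the caretEnsemble package; the package choice and the GLM meta-learner are assumptions consistent with the meta-model results below:

```r
library(caret)
library(caretEnsemble)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     savePredictions = "final")

# Base learners: Naive Bayes, Random Forest, Logistic Regression
base_models <- caretList(cancer ~ ., data = ros_chi_sq,
                         trControl = ctrl,
                         methodList = c("nb", "rf", "glm"))

# Meta-model stacked on the base learners' out-of-fold predictions
meta <- caretStack(base_models, method = "glm",
                   trControl = trainControl(method = "cv", number = 10))
```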
🚀 Meta-Model Performance
| Metric | Class 0 | Class 1 | Weighted Avg |
|---|---|---|---|
| TPR | 0.667 | 0.641 | 0.663 |
| Precision | 0.911 | 0.258 | 0.811 |
| AUC | 0.705 | 0.705 | 0.705 |
🛑 Conclusion: The ensemble model did not outperform the standalone Naive Bayes model (weighted AUC 0.705 vs. 0.717), indicating that the simpler model worked best for this dataset.
✅ Key Takeaways
- The Naive Bayes model performed best (AUC = 0.72).
- Feature selection & data balancing significantly impacted model performance.
- Hyperparameter tuning & ensemble learning did not yield substantial improvements.
🚀 Next Steps
- Explore deep learning models (Neural Networks).
- Investigate non-linear feature transformations.
- Apply feature aggregation techniques to improve predictive power.
📌 Challenge: Finding the optimal model is like climbing a mountain—you don’t know which path leads to the peak until you try!
```bash
git clone https://github.com/ericylc23/Predicting-Cancer-ML.git
cd Predicting-Cancer-ML
pip install -r requirements.txt
jupyter notebook
```
For questions or collaborations, please reach out to me via:
📧 Email: [email protected]
🔗 LinkedIn: https://www.linkedin.com/in/eric-yuanlc/
🌟 If you found this project useful, don’t forget to ⭐ star the repo!