Problem Statement
Goal & Objective
Tools: Python, JupyterLab, Git
Libraries: Pandas, NumPy, Feature-engine, Scikit-learn, Imbalanced-learn, SHAP, Gain & Lift Analysis
Dataset: Predicting Churn for Bank Customers [source]
Summary of the analysis
- This dataset has 10,000 observations and 14 variables: 11 numerical and 3 categorical, one of which (Exited) is the target variable.
- All numerical variables are right-skewed and contain many outliers.
- Exited is the target variable, labeled 0 (no churn) or 1 (churn). About 20% of customers in the dataset have churned.
- Exploratory data analysis shows that customers with more than 2 products tend to churn, and that the churn rate rises with customer age.
- Given these data characteristics, a tree-based or ensemble algorithm was selected to build the classification model. The XGBoost model correctly predicts 75% of customers who churn.
- Age, NumOfProduct, Gender Male, Geography France and IsActiveMember have the biggest impact on churn rate.
- With the model, the estimated cost saving is 69%.
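The gain and lift analysis mentioned above can be illustrated with a small, dependency-free sketch. It is not the notebook's code: the function name `gain_lift_table`, the `n_bins` parameter and the toy labels and scores are all assumptions made for illustration. The idea is to rank customers by predicted churn score, split them into equal-sized bins, and compare the share of churners captured so far with the share of customers contacted.

```python
from typing import Dict, List, Sequence


def gain_lift_table(
    y_true: Sequence[int],
    y_score: Sequence[float],
    n_bins: int = 10,
) -> List[Dict[str, float]]:
    """Cumulative gain and lift per bin (hypothetical helper, not from the notebooks)."""
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    if len(y_true) < n_bins:
        raise ValueError("need at least one observation per bin")
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("at least one positive (churn) label is required")

    # Rank customers from highest to lowest predicted churn score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)

    rows: List[Dict[str, float]] = []
    cum_pos = 0
    start = 0
    for b in range(1, n_bins + 1):
        end = round(b * n / n_bins)  # cumulative number of customers up to this bin
        cum_pos += sum(label for _, label in ranked[start:end])
        start = end
        pct_customers = end / n
        cum_gain = cum_pos / total_pos  # share of all churners captured so far
        rows.append(
            {
                "bin": b,
                "pct_customers": pct_customers,
                "cum_gain": cum_gain,
                "cum_lift": cum_gain / pct_customers,  # improvement over random targeting
            }
        )
    return rows


# Toy data (made up): 10 customers, 2 of whom churned (a 20% churn rate).
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
table = gain_lift_table(y_true, y_score, n_bins=5)
for row in table:
    print(row)
```

A cumulative lift above 1 in the top bins means that contacting the highest-scored customers first reaches more churners than contacting the same number of customers at random. The last bin always has a lift of 1.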
What I have learned
- Framing the business problem.
- Creating a machine learning model and extracting insights from it to make actionable recommendations for the business team.
- Building a business simulation from insights that decrease the churn rate.
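As a rough illustration of what such a business simulation might compute, the sketch below estimates a percentage cost saving from confusion-matrix counts. It shows one possible definition, not necessarily the one used in the notebooks. The function `cost_saving_pct`, its parameters, and every number in the example (customer counts, loss per churner, offer cost, retention rate) are hypothetical.

```python
def cost_saving_pct(
    tp: int,
    fp: int,
    fn: int,
    loss_per_churner: float,
    offer_cost: float,
    retention_rate: float,
) -> float:
    """Percentage cost saved by targeting flagged customers (hypothetical definition).

    Assumptions: without a model every churner is lost; with a model every
    flagged customer (tp + fp) receives a retention offer, and a fraction
    `retention_rate` of the correctly flagged churners stays.
    """
    baseline_cost = (tp + fn) * loss_per_churner
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")

    offers_cost = (tp + fp) * offer_cost
    # Churners still lost: those the model missed plus those the offer did not retain.
    still_lost_cost = (fn + tp * (1 - retention_rate)) * loss_per_churner
    model_cost = offers_cost + still_lost_cost
    return 100 * (baseline_cost - model_cost) / baseline_cost


# Made-up scenario: 200 churners, 150 of them flagged (75% recall), 100 false alarms.
saving = cost_saving_pct(
    tp=150, fp=100, fn=50,
    loss_per_churner=1000.0, offer_cost=100.0, retention_rate=0.5,
)
print(f"Estimated cost saving: {saving:.1f}%")
```

Under this definition the saving depends heavily on the assumed offer cost and retention rate, so those inputs would need to come from the business team, not from the model.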
File Dictionary
- EDA_2Pendo (1).ipynb: this notebook contains the project details, such as business understanding, exploratory data analysis, and insights from the dataset and external data.
- Supervised_2pendo.ipynb: data preprocessing, modeling, lift & gain analysis, feature importance with SHAP, business recommendation.
- 2pendo-presentation_final_project.pdf: summary of the project.