Exploratory Data Analysis (EDA)

Description

A gym is looking to better understand why its customers choose to cancel their memberships. They want to model this phenomenom (i.e., churn) and predict which future customers are most likely to cancel their memberships based on their gym-going habits and characteristics. A study of the gym's dataset is provided below.

Data source: https://www.kaggle.com/datasets/adrianvinueza/gym-customers-features-and-churn/data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('data/gym_churn_us.csv')

Visualize Data

df.head()

	gender	Near_Location	Partner	Promo_friends	Phone	Contract_period	Group_visits	Age	Avg_additional_charges_total	Month_to_end_contract	Lifetime	Avg_class_frequency_total	Avg_class_frequency_current_month
0	1	1	1	1	0	6	1	29	14.227470	5.0	3	0.020398	0.000000
1	0	1	0	0	1	12	1	31	113.202938	12.0	7	1.922936	1.910244
2	0	1	1	0	1	1	0	28	129.448479	1.0	2	1.859098	1.736502
3	0	1	1	1	1	12	1	33	62.669863	12.0	2	3.205633	3.357215
4	1	1	1	1	1	1	0	26	198.362265	1.0	3	1.113884	1.120078

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   gender                             4000 non-null   int64  
 1   Near_Location                      4000 non-null   int64  
 2   Partner                            4000 non-null   int64  
 3   Promo_friends                      4000 non-null   int64  
 4   Phone                              4000 non-null   int64  
 5   Contract_period                    4000 non-null   int64  
 6   Group_visits                       4000 non-null   int64  
 7   Age                                4000 non-null   int64  
 8   Avg_additional_charges_total       4000 non-null   float64
 9   Month_to_end_contract              4000 non-null   float64
 10  Lifetime                           4000 non-null   int64  
 11  Avg_class_frequency_total          4000 non-null   float64
 12  Avg_class_frequency_current_month  4000 non-null   float64
 13  Churn                              4000 non-null   int64  
dtypes: float64(4), int64(10)
memory usage: 437.6 KB

# Check for duplicates
df.duplicated().sum()

# Lowercase column names
df = df.rename(columns=lambda x: x.lower())

Descriptive Statistics

df.describe().T.drop(['25%', '50%', '75%'], axis=1)

	count	mean	std	min	max
gender	4000.0	0.510250	0.499957	0.000000	1.000000
near_location	4000.0	0.845250	0.361711	0.000000	1.000000
partner	4000.0	0.486750	0.499887	0.000000	1.000000
promo_friends	4000.0	0.308500	0.461932	0.000000	1.000000
phone	4000.0	0.903500	0.295313	0.000000	1.000000
contract_period	4000.0	4.681250	4.549706	1.000000	12.000000
group_visits	4000.0	0.412250	0.492301	0.000000	1.000000
age	4000.0	29.184250	3.258367	18.000000	41.000000
avg_additional_charges_total	4000.0	146.943728	96.355602	0.148205	552.590740
month_to_end_contract	4000.0	4.322750	4.191297	1.000000	12.000000
lifetime	4000.0	3.724750	3.749267	0.000000	31.000000
avg_class_frequency_total	4000.0	1.879020	0.972245	0.000000	6.023668
avg_class_frequency_current_month	4000.0	1.767052	1.052906	0.000000	6.146783
churn	4000.0	0.265250	0.441521	0.000000	1.000000

As we can see, 6 of the 13 features are binary, and the other 7 are continuous:

Binary:
- gender
- near_location
- partner
- promo_friends
- phone
- group_visits
Continuous:
- contract_period
- age
- avg_additional_charges_total
- month_to_end_contract
- lifetime
- avg_class_frequency_total
- avg_class_frequency_current_month

Target Variable

The target variable for this problem is the churn variable, which indicates whether a customer has cancelled their gym membership (0 for non-cancellation, 1 for cancellation). This is the variable that we will be attempting to predict later on.

Data Distributions

def plot_distributions(df, hue, alpha):
    # Number of subplots
    nrows = int(np.ceil(len(df.columns) - 1 / 3))
    ncols = 3

    # Create figure
    fig = plt.figure(constrained_layout=True, figsize=(ncols*3.5, nrows*3.5))
    gs = fig.add_gridspec(nrows, ncols, wspace=0.1, hspace=0.1)

    for i, col in enumerate(df.columns):
        if col == 'churn':
            break

        # Create subplot
        subfig = fig.add_subfigure(gs[i // 3, i % 3])
        axs = subfig.subplots(2, 1)

        # Plot
        sns.histplot(data=df, x=col, bins='auto', palette=['green', 'red'], 
                     ax=axs[0], hue=hue, alpha=alpha)
        sns.boxplot(data=df, y=hue, palette=['green', 'red'], x=col, ax=axs[1], 
                    orient='h')

        # Style
        axs[0].set_xlabel('')
        subfig.suptitle(col.upper())
        subfig.set_facecolor('lightgrey')
    
    plt.show()

# Binary features
binary_features = ['gender', 'near_location', 'partner', 'promo_friends', 
                   'phone', 'group_visits', 'churn']
plot_distributions(df[binary_features], hue='churn', alpha=0.4)

The above histograms and boxplots show the distributions of each binary feature, grouped by their relationship to the target variable, churn. There doesn't seem to be anything very unusual here, but it must be noted that the classes are quite imbalanced (i.e., there are roughly 3x as many non-churn data points as there are churn). To address this, we might need to balance the classes (e.g., downsampling and upsamling) later on.

# Continuous features
continuous_features = ['contract_period', 'age', 'avg_additional_charges_total',
                       'month_to_end_contract', 'lifetime',
                       'avg_class_frequency_total', 
                       'avg_class_frequency_current_month', 'churn']
plot_distributions(df[continuous_features], hue='churn', alpha=0.4)

As with the binary variables, the distributions of the continuous features are shown above. At first glance, it appears that those most likely to cancel their memberships have a lower average age, attend fewer fitness classes, and sign up for shorter membership contracts. These insights will be explored further in a later section. It should be noted that some of these distributions contain fairly significant outliers, which might need to be addressed if model performance is inadequate.

Based on the distributions of binary and continuous features, the following observations can be made.

Clients who cancel their memberships are...

Typically under 35 years old
Have 1 month left on their contract
Have shorter contract periods than non-cancelers
Spend less on additional services than non-cancelers
Have been members for fewer than 5 months
Have attended fewer classes than non-cancelers

Feature Correlations

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), linewidths=.5, cmap='inferno')
plt.show()

The correlation matrix looks good, but there are two pairs of features that are very highly correlated (i.e., collinear):

contract_period, month_to_end_contract
avg_class_frequency_current_month, avg_class_frequency_total

I will fix this by removing avg_class_frequency_total and contract_period from the dataset.

df_cleaned = df.drop(['contract_period', 'avg_class_frequency_total'], axis = 1)

plt.figure(figsize=(8, 6))
sns.heatmap(df_cleaned.corr(), linewidths=.5, cmap='inferno')
plt.show()

Clustering

Generate Clusters

To further understand the different characteristics of gym-goers who cancel and those who don't, I will group them into clusters. These clusters will concretely show how cancelers and non-cancelers differ along the different dimensions/features of our dataset.

from sklearn.decomposition import PCA 
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply K-Means
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df_scaled)
labels = km.predict(df_scaled)

Plot Clusters

pca = PCA(n_components=2).fit_transform(df_scaled)
df_red = pd.DataFrame(pca, columns=['x', 'y'])
df_red['cluster']=labels
plt.scatter(df_red[df_red['cluster']==0]['x'], df_red[df_red['cluster']==0]['y'], c='green')
plt.scatter(df_red[df_red['cluster']==1]['x'], df_red[df_red['cluster']==1]['y'], c='red')
plt.scatter(df_red[df_red['cluster']==2]['x'], df_red[df_red['cluster']==2]['y'], c='blue')
plt.show()

df['cluster'] = labels
cluster_feature_means = df.groupby('cluster').mean()
cluster_feature_means.round(2).T

cluster	0	1	2
gender	0.51	0.50	0.52
near_location	0.76	0.83	0.94
partner	0.35	0.36	0.77
promo_friends	0.18	0.19	0.57
phone	0.91	0.90	0.90
contract_period	1.61	2.25	10.51
group_visits	0.26	0.41	0.55
age	26.93	30.17	29.87
avg_additional_charges_total	115.15	157.12	161.41
month_to_end_contract	1.56	2.13	9.57
lifetime	1.05	4.84	4.61
avg_class_frequency_total	1.41	2.12	1.98
avg_class_frequency_current_month	1.00	2.12	1.97
churn	0.95	0.00	0.01

Cluster Insights

These clusters reveal some important information that builds on the EDA performed earlier. Clusters 1 and 2 reflect the gym-goers who did not cancel their memberships, while Cluster 0 mostly reflects gym-goers who did cancel. As such, the differences between Cluster 0 and the other two Clusters represent the differences between those who cancel and those who don't.

Cluster 0 (churn) vs Clusters 1, 2 (no churn):

Cluster 0 lives farther away from the gym
Cluster 0 signs up for shorter contract period
Cluster 0 makes fewer group visits
Cluster 0 roughly three years younger on average
Cluster 0 spends less on additional purchases
Cluster 0 has less time remaining on contract
Cluster 0 has been a member for 3 months less than Clusters 1, 2
Cluster 0 has attended 1 fewer classes in current month than Clusters 1, 2

These insights give us a profile of the customers who are most likely to cancel their memberships.

Predictive Modeling

Model Selection

For this binary classification problem, I will test the following models:

Logistic Regression
Linear SVM
XGBoost

Since the gym is interested in identifying all customers who are likely to cancel their memberships, we must place extra emphasis on recall scores.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from xgboost import XGBClassifier

seed = 42

# Splitting into features and target
X = df_cleaned.drop('churn', axis=1)
y = df_cleaned['churn']

# Splitting into train and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)

# Scaling to 0 mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Instiating models
models = {'Logistic Regression': LogisticRegression(random_state=seed),
          'Linear SVM': LinearSVC(random_state=seed),
          'XGBoost': XGBClassifier(seed=seed)}

# Testing models
def test_model(name, model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    print(name)
    print('Accuracy:', accuracy_score(y_te, y_pred))
    print('Precision:', precision_score(y_te, y_pred))
    print('Recall:', recall_score(y_te, y_pred))

    fig, axs = plt.subplots(1, 2, figsize=(12, 4), gridspec_kw={'wspace': 0.5})
    
    # Display Confusion Matrix
    cm = ConfusionMatrixDisplay.from_estimator(model, X_te, y_te,
                                                            normalize=None,
                                                            ax=axs[0])
    cm.ax_.set_title(f'Confusion Matrix ({name})')

    # Display ROC
    roc = RocCurveDisplay.from_estimator(model, X_te, y_te, ax=axs[1])
    roc.ax_.set_title(f'ROC Curve ({name})')
    plt.show()
    print()

# Logistic Regression
test_model('Logistic Regression', models['Logistic Regression'], X_tr_scaled, y_tr, X_te_scaled, y_te)

Logistic Regression
Accuracy: 0.905
Precision: 0.8315789473684211
Recall: 0.7821782178217822

# Linear SVM
test_model('Linear SVM', models['Linear SVM'], X_tr_scaled, y_tr, X_te_scaled, y_te)

Linear SVM
Accuracy: 0.90625
Precision: 0.8290155440414507
Recall: 0.7920792079207921

# XGBoost
test_model('XGBoost', models['XGBoost'], X_tr_scaled, y_tr, X_te_scaled, y_te)

XGBoost
Accuracy: 0.885
Precision: 0.7925531914893617
Recall: 0.7376237623762376

Hyperparameter Tuning

As expected, the most complex models doesn't perform well out of the box, likely requiring hyperparameter tuning. I will attempt to tune XGBoost's hyperparameters to see if performance can be improved before selecting a model.

def tune_hyperparameters():
    learning_rate = np.linspace(.4, .6, 4)
    min_child_weight = np.linspace(1, 3, 4)
    subsample = np.linspace(.5, 1, 4)
    n_estimators = np.linspace(100, 400, 4)
    r = 0
    for lr in learning_rate:
        for mcw in min_child_weight:
            for ss in subsample:
                for ne in n_estimators:
                    xgb = XGBClassifier(objective='binary:logistic', 
                                      learning_rate=lr, 
                                      max_depth=1,
                                      min_child_weight=int(mcw), 
                                      subsample=ss, 
                                      n_estimators=int(ne),
                                      random_state=seed)
                    xgb.fit(X_tr_scaled, y_tr)
                    y_pred = xgb.predict(X_te_scaled)
                    rec = recall_score(y_te, y_pred)
                    if rec > r:
                        r = rec
                        print('Recall:', r, 
                            'Learning rate:', lr, 
                            'Min child weight:', mcw, 
                            'Subsample:', ss, 
                            'N estimators:', ne)

tune_hyperparameters()

Recall: 0.7772277227722773 Learning rate: 0.4 Min child weight: 1.0 Subsample: 0.5 N estimators: 100.0
Recall: 0.7821782178217822 Learning rate: 0.4 Min child weight: 1.0 Subsample: 0.5 N estimators: 300.0
Recall: 0.7871287128712872 Learning rate: 0.4 Min child weight: 1.0 Subsample: 0.5 N estimators: 400.0
Recall: 0.7920792079207921 Learning rate: 0.4 Min child weight: 1.0 Subsample: 0.6666666666666666 N estimators: 100.0
Recall: 0.7970297029702971 Learning rate: 0.4 Min child weight: 1.0 Subsample: 1.0 N estimators: 100.0
Recall: 0.806930693069307 Learning rate: 0.4666666666666667 Min child weight: 1.0 Subsample: 0.5 N estimators: 100.0

models['XGBoost'] = XGBClassifier(objective='binary:logistic', 
                                  learning_rate=0.467, 
                                  max_depth=1,
                                  min_child_weight=1, 
                                  subsample=0.5, 
                                  n_estimators=100,
                                  random_state=seed)

test_model('XGBoost', models['XGBoost'], X_tr_scaled, y_tr, X_te_scaled, y_te)

XGBoost
Accuracy: 0.90875
Precision: 0.8274111675126904
Recall: 0.806930693069307

After hyperparameter tuning, XGBoost performs much better than Logistic Regression and slightly better than Linear SVM in terms of recall.

Feature Importance

fi = dict(zip(df_cleaned.columns[:-1], models['XGBoost'].feature_importances_))
fi_df = pd.DataFrame(data={'feature': fi.keys(), 'importance': fi.values()})
fi_df.sort_values(by='importance', ascending=False, inplace=True, ignore_index=True)
fi_df

	feature	importance
0	lifetime	0.358408
1	month_to_end_contract	0.196666
2	age	0.095170
3	avg_class_frequency_current_month	0.090184
4	promo_friends	0.063519
5	near_location	0.049971
6	group_visits	0.044614
7	avg_additional_charges_total	0.042625
8	gender	0.035744
9	phone	0.023100
10	partner	0.000000

Conclusions

Summary

As revealed by the EDA, Clustering, and Predictive Modeling, the following features are most relevant for predicting churn:

lifetime: how long person has been a member
month_to_end_contract: how many months left on contract
age: customer's age
avg_class_frequency_current_month: how many classes customer has attended in current month
promo_friends: whether customer signed up through a referral or not
contract_period: the length of customer's membership contract
avg_additional_charges_total: how much a customer has spent on additional services at the gym
group_visits: how many group activities a customer is enrolled in

Who is likeliest to cancel?

History: Customers are most likely to cancel when they are still new the most do not work in a company associated with the gym nor do they obtain a discount for enrolling a to the gym. The longer a person stays, the less likely they are to cancel.
Proximity: Customers are more likely to cancel if they live farther away from the gym.
Friends: Customers are less likely to cancel their memberships if they enrolled through a friend's promo code. It seems that customers who have friends at the gym tend to cancel less.
Contract: Customers with shorter contracts tend to cancel more frequently. This might be because customers who plan to stay longer may opt for longer contracts, while customers who simply want to "try out" the gym opt for shorter contracts.
Group Sessions: Participating in group sessions seems to encourage and motivate customers to continue at the gym.
Age: Customers who cancel are, on average, 3 three years younger than those who don't. This might be because young customers are more likely to move away from the gym or have less established routines.
Additional services: Customers who spend more on additional services tend to cancel less frequently, suggesting that as customers spend more, they become more associated with the gym.
Remaining period: Customers are most likely to cancel in the final two months of their contract periods.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
README_files		README_files
data		data
.gitignore		.gitignore
README.md		README.md
churn-prediction.ipynb		churn-prediction.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Table of Contents

Exploratory Data Analysis (EDA)

Description

Visualize Data

Descriptive Statistics

Target Variable

Data Distributions

Feature Correlations

Clustering

Generate Clusters

Plot Clusters

Cluster Insights

Predictive Modeling

Model Selection

Hyperparameter Tuning

Feature Importance

Conclusions

Summary

Who is likeliest to cancel?

About

Releases

Packages

Languages

markbotros1/churn-prediction

Folders and files

Latest commit

History

Repository files navigation

Table of Contents

Exploratory Data Analysis (EDA)

Description

Visualize Data

Descriptive Statistics

Target Variable

Data Distributions

Feature Correlations

Clustering

Generate Clusters

Plot Clusters

Cluster Insights

Predictive Modeling

Model Selection

Hyperparameter Tuning

Feature Importance

Conclusions

Summary

Who is likeliest to cancel?

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages