Skip to content

A study of churn using data analytics, visualization, and predictive modeling

Notifications You must be signed in to change notification settings

markbotros1/churn-prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Table of Contents

  1. Exploratory Data Analysis (EDA)
  2. Clustering
  3. Predictive Modeling
  4. Conclusions

Exploratory Data Analysis (EDA)

Description

A gym is looking to better understand why its customers choose to cancel their memberships. They want to model this phenomenom (i.e., churn) and predict which future customers are most likely to cancel their memberships based on their gym-going habits and characteristics. A study of the gym's dataset is provided below.

Data source: https://www.kaggle.com/datasets/adrianvinueza/gym-customers-features-and-churn/data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('data/gym_churn_us.csv')

Visualize Data

df.head()
gender Near_Location Partner Promo_friends Phone Contract_period Group_visits Age Avg_additional_charges_total Month_to_end_contract Lifetime Avg_class_frequency_total Avg_class_frequency_current_month Churn
0 1 1 1 1 0 6 1 29 14.227470 5.0 3 0.020398 0.000000 0
1 0 1 0 0 1 12 1 31 113.202938 12.0 7 1.922936 1.910244 0
2 0 1 1 0 1 1 0 28 129.448479 1.0 2 1.859098 1.736502 0
3 0 1 1 1 1 12 1 33 62.669863 12.0 2 3.205633 3.357215 0
4 1 1 1 1 1 1 0 26 198.362265 1.0 3 1.113884 1.120078 0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   gender                             4000 non-null   int64  
 1   Near_Location                      4000 non-null   int64  
 2   Partner                            4000 non-null   int64  
 3   Promo_friends                      4000 non-null   int64  
 4   Phone                              4000 non-null   int64  
 5   Contract_period                    4000 non-null   int64  
 6   Group_visits                       4000 non-null   int64  
 7   Age                                4000 non-null   int64  
 8   Avg_additional_charges_total       4000 non-null   float64
 9   Month_to_end_contract              4000 non-null   float64
 10  Lifetime                           4000 non-null   int64  
 11  Avg_class_frequency_total          4000 non-null   float64
 12  Avg_class_frequency_current_month  4000 non-null   float64
 13  Churn                              4000 non-null   int64  
dtypes: float64(4), int64(10)
memory usage: 437.6 KB
# Check for duplicates
df.duplicated().sum()
0
# Lowercase column names
df = df.rename(columns=lambda x: x.lower())

Descriptive Statistics

df.describe().T.drop(['25%', '50%', '75%'], axis=1)
count mean std min max
gender 4000.0 0.510250 0.499957 0.000000 1.000000
near_location 4000.0 0.845250 0.361711 0.000000 1.000000
partner 4000.0 0.486750 0.499887 0.000000 1.000000
promo_friends 4000.0 0.308500 0.461932 0.000000 1.000000
phone 4000.0 0.903500 0.295313 0.000000 1.000000
contract_period 4000.0 4.681250 4.549706 1.000000 12.000000
group_visits 4000.0 0.412250 0.492301 0.000000 1.000000
age 4000.0 29.184250 3.258367 18.000000 41.000000
avg_additional_charges_total 4000.0 146.943728 96.355602 0.148205 552.590740
month_to_end_contract 4000.0 4.322750 4.191297 1.000000 12.000000
lifetime 4000.0 3.724750 3.749267 0.000000 31.000000
avg_class_frequency_total 4000.0 1.879020 0.972245 0.000000 6.023668
avg_class_frequency_current_month 4000.0 1.767052 1.052906 0.000000 6.146783
churn 4000.0 0.265250 0.441521 0.000000 1.000000

As we can see, 6 of the 13 features are binary, and the other 7 are continuous:

  • Binary:
    • gender
    • near_location
    • partner
    • promo_friends
    • phone
    • group_visits
  • Continuous:
    • contract_period
    • age
    • avg_additional_charges_total
    • month_to_end_contract
    • lifetime
    • avg_class_frequency_total
    • avg_class_frequency_current_month

Target Variable

The target variable for this problem is the churn variable, which indicates whether a customer has cancelled their gym membership (0 for non-cancellation, 1 for cancellation). This is the variable that we will be attempting to predict later on.

Data Distributions

def plot_distributions(df, hue, alpha):
    # Number of subplots
    nrows = int(np.ceil(len(df.columns) - 1 / 3))
    ncols = 3

    # Create figure
    fig = plt.figure(constrained_layout=True, figsize=(ncols*3.5, nrows*3.5))
    gs = fig.add_gridspec(nrows, ncols, wspace=0.1, hspace=0.1)

    for i, col in enumerate(df.columns):
        if col == 'churn':
            break

        # Create subplot
        subfig = fig.add_subfigure(gs[i // 3, i % 3])
        axs = subfig.subplots(2, 1)

        # Plot
        sns.histplot(data=df, x=col, bins='auto', palette=['green', 'red'], 
                     ax=axs[0], hue=hue, alpha=alpha)
        sns.boxplot(data=df, y=hue, palette=['green', 'red'], x=col, ax=axs[1], 
                    orient='h')

        # Style
        axs[0].set_xlabel('')
        subfig.suptitle(col.upper())
        subfig.set_facecolor('lightgrey')
    
    plt.show()
# Binary features
binary_features = ['gender', 'near_location', 'partner', 'promo_friends', 
                   'phone', 'group_visits', 'churn']
plot_distributions(df[binary_features], hue='churn', alpha=0.4)

png

The above histograms and boxplots show the distributions of each binary feature, grouped by their relationship to the target variable, churn. There doesn't seem to be anything very unusual here, but it must be noted that the classes are quite imbalanced (i.e., there are roughly 3x as many non-churn data points as there are churn). To address this, we might need to balance the classes (e.g., downsampling and upsamling) later on.

# Continuous features
continuous_features = ['contract_period', 'age', 'avg_additional_charges_total',
                       'month_to_end_contract', 'lifetime',
                       'avg_class_frequency_total', 
                       'avg_class_frequency_current_month', 'churn']
plot_distributions(df[continuous_features], hue='churn', alpha=0.4)

png

As with the binary variables, the distributions of the continuous features are shown above. At first glance, it appears that those most likely to cancel their memberships have a lower average age, attend fewer fitness classes, and sign up for shorter membership contracts. These insights will be explored further in a later section. It should be noted that some of these distributions contain fairly significant outliers, which might need to be addressed if model performance is inadequate.

Based on the distributions of binary and continuous features, the following observations can be made.

Clients who cancel their memberships are...

  • Typically under 35 years old
  • Have 1 month left on their contract
  • Have shorter contract periods than non-cancelers
  • Spend less on additional services than non-cancelers
  • Have been members for fewer than 5 months
  • Have attended fewer classes than non-cancelers

Feature Correlations

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), linewidths=.5, cmap='inferno')
plt.show()

png

The correlation matrix looks good, but there are two pairs of features that are very highly correlated (i.e., collinear):

  • contract_period, month_to_end_contract
  • avg_class_frequency_current_month, avg_class_frequency_total

I will fix this by removing avg_class_frequency_total and contract_period from the dataset.

df_cleaned = df.drop(['contract_period', 'avg_class_frequency_total'], axis = 1)
plt.figure(figsize=(8, 6))
sns.heatmap(df_cleaned.corr(), linewidths=.5, cmap='inferno')
plt.show()

png

Clustering

Generate Clusters

To further understand the different characteristics of gym-goers who cancel and those who don't, I will group them into clusters. These clusters will concretely show how cancelers and non-cancelers differ along the different dimensions/features of our dataset.

from sklearn.decomposition import PCA 
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply K-Means
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df_scaled)
labels = km.predict(df_scaled)

Plot Clusters

pca = PCA(n_components=2).fit_transform(df_scaled)
df_red = pd.DataFrame(pca, columns=['x', 'y'])
df_red['cluster']=labels
plt.scatter(df_red[df_red['cluster']==0]['x'], df_red[df_red['cluster']==0]['y'], c='green')
plt.scatter(df_red[df_red['cluster']==1]['x'], df_red[df_red['cluster']==1]['y'], c='red')
plt.scatter(df_red[df_red['cluster']==2]['x'], df_red[df_red['cluster']==2]['y'], c='blue')
plt.show()

png

df['cluster'] = labels
cluster_feature_means = df.groupby('cluster').mean()
cluster_feature_means.round(2).T
cluster 0 1 2
gender 0.51 0.50 0.52
near_location 0.76 0.83 0.94
partner 0.35 0.36 0.77
promo_friends 0.18 0.19 0.57
phone 0.91 0.90 0.90
contract_period 1.61 2.25 10.51
group_visits 0.26 0.41 0.55
age 26.93 30.17 29.87
avg_additional_charges_total 115.15 157.12 161.41
month_to_end_contract 1.56 2.13 9.57
lifetime 1.05 4.84 4.61
avg_class_frequency_total 1.41 2.12 1.98
avg_class_frequency_current_month 1.00 2.12 1.97
churn 0.95 0.00 0.01

Cluster Insights

These clusters reveal some important information that builds on the EDA performed earlier. Clusters 1 and 2 reflect the gym-goers who did not cancel their memberships, while Cluster 0 mostly reflects gym-goers who did cancel. As such, the differences between Cluster 0 and the other two Clusters represent the differences between those who cancel and those who don't.

Cluster 0 (churn) vs Clusters 1, 2 (no churn):

  • Cluster 0 lives farther away from the gym
  • Cluster 0 signs up for shorter contract period
  • Cluster 0 makes fewer group visits
  • Cluster 0 roughly three years younger on average
  • Cluster 0 spends less on additional purchases
  • Cluster 0 has less time remaining on contract
  • Cluster 0 has been a member for 3 months less than Clusters 1, 2
  • Cluster 0 has attended 1 fewer classes in current month than Clusters 1, 2

These insights give us a profile of the customers who are most likely to cancel their memberships.

Predictive Modeling

Model Selection

For this binary classification problem, I will test the following models:

  • Logistic Regression
  • Linear SVM
  • XGBoost

Since the gym is interested in identifying all customers who are likely to cancel their memberships, we must place extra emphasis on recall scores.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from xgboost import XGBClassifier

seed = 42

# Splitting into features and target
X = df_cleaned.drop('churn', axis=1)
y = df_cleaned['churn']

# Splitting into train and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)

# Scaling to 0 mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)
# Instiating models
models = {'Logistic Regression': LogisticRegression(random_state=seed),
          'Linear SVM': LinearSVC(random_state=seed),
          'XGBoost': XGBClassifier(seed=seed)}

# Testing models
def test_model(name, model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    print(name)
    print('Accuracy:', accuracy_score(y_te, y_pred))
    print('Precision:', precision_score(y_te, y_pred))
    print('Recall:', recall_score(y_te, y_pred))

    fig, axs = plt.subplots(1, 2, figsize=(12, 4), gridspec_kw={'wspace': 0.5})
    
    # Display Confusion Matrix
    cm = ConfusionMatrixDisplay.from_estimator(model, X_te, y_te,
                                                            normalize=None,
                                                            ax=axs[0])
    cm.ax_.set_title(f'Confusion Matrix ({name})')

    # Display ROC
    roc = RocCurveDisplay.from_estimator(model, X_te, y_te, ax=axs[1])
    roc.ax_.set_title(f'ROC Curve ({name})')
    plt.show()
    print()
# Logistic Regression
test_model('Logistic Regression', models['Logistic Regression'], X_tr_scaled, y_tr, X_te_scaled, y_te)
Logistic Regression
Accuracy: 0.905
Precision: 0.8315789473684211
Recall: 0.7821782178217822

png

# Linear SVM
test_model('Linear SVM', models['Linear SVM'], X_tr_scaled, y_tr, X_te_scaled, y_te)
Linear SVM
Accuracy: 0.90625
Precision: 0.8290155440414507
Recall: 0.7920792079207921

png

# XGBoost
test_model('XGBoost', models['XGBoost'], X_tr_scaled, y_tr, X_te_scaled, y_te)
XGBoost
Accuracy: 0.885
Precision: 0.7925531914893617
Recall: 0.7376237623762376

png

Hyperparameter Tuning

As expected, the most complex models doesn't perform well out of the box, likely requiring hyperparameter tuning. I will attempt to tune XGBoost's hyperparameters to see if performance can be improved before selecting a model.

def tune_hyperparameters():
    learning_rate = np.linspace(.4, .6, 4)
    min_child_weight = np.linspace(1, 3, 4)
    subsample = np.linspace(.5, 1, 4)
    n_estimators = np.linspace(100, 400, 4)
    r = 0
    for lr in learning_rate:
        for mcw in min_child_weight:
            for ss in subsample:
                for ne in n_estimators:
                    xgb = XGBClassifier(objective='binary:logistic', 
                                      learning_rate=lr, 
                                      max_depth=1,
                                      min_child_weight=int(mcw), 
                                      subsample=ss, 
                                      n_estimators=int(ne),
                                      random_state=seed)
                    xgb.fit(X_tr_scaled, y_tr)
                    y_pred = xgb.predict(X_te_scaled)
                    rec = recall_score(y_te, y_pred)
                    if rec > r:
                        r = rec
                        print('Recall:', r, 
                            'Learning rate:', lr, 
                            'Min child weight:', mcw, 
                            'Subsample:', ss, 
                            'N estimators:', ne)

tune_hyperparameters()
Recall: 0.7772277227722773 Learning rate: 0.4 Min child weight: 1.0 Subsample: 0.5 N estimators: 100.0
Recall: 0.7821782178217822 Learning rate: 0.4 Min child weight: 1.0 Subsample: 0.5 N estimators: 300.0
Recall: 0.7871287128712872 Learning rate: 0.4 Min child weight: 1.0 Subsample: 0.5 N estimators: 400.0
Recall: 0.7920792079207921 Learning rate: 0.4 Min child weight: 1.0 Subsample: 0.6666666666666666 N estimators: 100.0
Recall: 0.7970297029702971 Learning rate: 0.4 Min child weight: 1.0 Subsample: 1.0 N estimators: 100.0
Recall: 0.806930693069307 Learning rate: 0.4666666666666667 Min child weight: 1.0 Subsample: 0.5 N estimators: 100.0
models['XGBoost'] = XGBClassifier(objective='binary:logistic', 
                                  learning_rate=0.467, 
                                  max_depth=1,
                                  min_child_weight=1, 
                                  subsample=0.5, 
                                  n_estimators=100,
                                  random_state=seed)

test_model('XGBoost', models['XGBoost'], X_tr_scaled, y_tr, X_te_scaled, y_te)
XGBoost
Accuracy: 0.90875
Precision: 0.8274111675126904
Recall: 0.806930693069307

png

After hyperparameter tuning, XGBoost performs much better than Logistic Regression and slightly better than Linear SVM in terms of recall.

Feature Importance

fi = dict(zip(df_cleaned.columns[:-1], models['XGBoost'].feature_importances_))
fi_df = pd.DataFrame(data={'feature': fi.keys(), 'importance': fi.values()})
fi_df.sort_values(by='importance', ascending=False, inplace=True, ignore_index=True)
fi_df
feature importance
0 lifetime 0.358408
1 month_to_end_contract 0.196666
2 age 0.095170
3 avg_class_frequency_current_month 0.090184
4 promo_friends 0.063519
5 near_location 0.049971
6 group_visits 0.044614
7 avg_additional_charges_total 0.042625
8 gender 0.035744
9 phone 0.023100
10 partner 0.000000

Conclusions

Summary

As revealed by the EDA, Clustering, and Predictive Modeling, the following features are most relevant for predicting churn:

  • lifetime: how long person has been a member
  • month_to_end_contract: how many months left on contract
  • age: customer's age
  • avg_class_frequency_current_month: how many classes customer has attended in current month
  • promo_friends: whether customer signed up through a referral or not
  • contract_period: the length of customer's membership contract
  • avg_additional_charges_total: how much a customer has spent on additional services at the gym
  • group_visits: how many group activities a customer is enrolled in

Who is likeliest to cancel?

  • History: Customers are most likely to cancel when they are still new the most do not work in a company associated with the gym nor do they obtain a discount for enrolling a to the gym. The longer a person stays, the less likely they are to cancel.
  • Proximity: Customers are more likely to cancel if they live farther away from the gym.
  • Friends: Customers are less likely to cancel their memberships if they enrolled through a friend's promo code. It seems that customers who have friends at the gym tend to cancel less.
  • Contract: Customers with shorter contracts tend to cancel more frequently. This might be because customers who plan to stay longer may opt for longer contracts, while customers who simply want to "try out" the gym opt for shorter contracts.
  • Group Sessions: Participating in group sessions seems to encourage and motivate customers to continue at the gym.
  • Age: Customers who cancel are, on average, 3 three years younger than those who don't. This might be because young customers are more likely to move away from the gym or have less established routines.
  • Additional services: Customers who spend more on additional services tend to cancel less frequently, suggesting that as customers spend more, they become more associated with the gym.
  • Remaining period: Customers are most likely to cancel in the final two months of their contract periods.

About

A study of churn using data analytics, visualization, and predictive modeling

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published