A project that develops a customer segmentation and then a predictive model to optimise company mail-out processes, as part of the Udacity Data Science Nanodegree

Phoebe-Macdonald/Arvato-Customer-Segmentation-Udacity-Capstone

Arvato-Customer-Segmentation-Udacity-Capstone

A full explanation of the project motivation, methodology and results can be found here: https://medium.com/@phoebe.macdonald/who-are-arvato-banks-most-valuable-customers-7504f6bf3c77

Motivation:

With the rise of disruptive fin-techs and general public uncertainty, it is more important than ever for banks to understand who their customers are and how best to communicate with them. This project aims to solve this for Arvato Financial Solutions by applying both unsupervised and supervised machine learning techniques. The project had two aims:

  • Identify who Arvato customers are and the qualities that distinguish them from the general population
  • Understand how Arvato can optimise the efficacy and efficiency of marketing communication with its customers, with a particular focus on a direct mail-out campaign

Summary of results

To achieve the first objective, a k-modes clustering algorithm was developed and refined. Results were visualised after t-SNE dimensionality reduction and evaluated using the silhouette method. The distributions of the customer dataset and the general population were then compared across clusters. The general population was described by one cluster, while the customer dataset contained 5 additional clusters. Individuals in these clusters differed from the general population in age and wealth: they were generally older, higher earners and of a higher social class.
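The comparison of cluster distributions described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `cluster_shares` and `representation_ratio` are hypothetical, and the label lists are made-up stand-ins for the cluster assignments a fitted k-modes model would produce.

```python
from collections import Counter


def cluster_shares(labels):
    """Return the fraction of rows assigned to each cluster label."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: n / total for cluster, n in counts.items()}


def representation_ratio(customer_labels, population_labels):
    """Customer share divided by population share, per cluster.

    A ratio above 1 means the cluster is over-represented among customers;
    a ratio below 1 means it is under-represented.
    """
    cust = cluster_shares(customer_labels)
    pop = cluster_shares(population_labels)
    ratios = {}
    for cluster in sorted(set(cust) | set(pop)):
        cust_share = cust.get(cluster, 0.0)
        pop_share = pop.get(cluster, 0.0)
        # A cluster absent from the population has no finite ratio.
        ratios[cluster] = cust_share / pop_share if pop_share > 0 else float("inf")
    return ratios


# Made-up cluster labels, for illustration only (not project results).
population_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
customer_labels = [0, 1, 1, 1, 1, 2, 2, 2, 2, 2]

ratios = representation_ratio(customer_labels, population_labels)
for cluster, ratio in ratios.items():
    print(f"cluster {cluster}: customer/population share ratio = {ratio:.2f}")
```

In this toy data, clusters 1 and 2 are over-represented among customers, which is the kind of signal used to describe how customers differ from the general population.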

To achieve the second objective, three gradient-boosting algorithms (CatBoost, LightGBM and XGBoost) were trained, evaluated and compared. The best model achieved an area under the ROC curve (ROC AUC) of 0.57 and an area under the precision-recall curve of 0.02. An ROC AUC above 0.5 indicates that the model was modestly better than a random classifier at identifying customers likely to respond to a direct mail-out. The model was then used to make predictions on a new dataset, and the results were uploaded to a Kaggle competition: https://www.kaggle.com/c/udacity-arvato-identify-customers
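To make the "better than random" reading of ROC AUC concrete, here is a small sketch that computes ROC AUC from its pairwise definition: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The function `roc_auc` and the toy labels and scores are illustrative assumptions, not the project's evaluation code, which imports `sklearn.metrics` instead.

```python
def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores, for illustration only (not project data).
y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

example_auc = roc_auc(y_true, scores)

# An uninformative model that gives every row the same score.
constant_auc = roc_auc(y_true, [0.5] * len(y_true))

print(f"example ROC AUC:  {example_auc:.2f}")
print(f"constant ROC AUC: {constant_auc:.2f}")
```

The constant scorer lands at exactly 0.5, the reference point for the ROC AUC. For the precision-recall curve, the equivalent reference point is the share of positives in the data rather than 0.5, so a precision-recall AUC is best read against the response rate of the dataset.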

Prerequisites:

Data preparation:

  • import numpy as np
  • import pandas as pd
  • import re

Data visualisations:

  • import matplotlib.pyplot as plt
  • import seaborn as sns
  • from matplotlib import pyplot
  • from mpl_toolkits.mplot3d import Axes3D

Data processing:

  • from sklearn import preprocessing
  • from sklearn.preprocessing import StandardScaler
  • from sklearn.decomposition import PCA
  • from sklearn.manifold import TSNE

Customer segmentation:

  • from sklearn.cluster import KMeans
  • from kmodes.kmodes import KModes
  • from sklearn.metrics import silhouette_score
  • from sklearn.metrics import silhouette_samples

Predictive models:

  • from catboost import CatBoostClassifier
  • import lightgbm as lgb
  • import xgboost as xgb

Model evaluation:

  • from sklearn.model_selection import train_test_split
  • from sklearn.model_selection import GridSearchCV
  • from sklearn import metrics
  • from sklearn.metrics import roc_curve
  • from sklearn.metrics import precision_recall_curve

Credits:

My fantastic mentor Marom and manager Jamie and the following links:

Files:
