---
title: "Customer Lifetime Value (CLV) Modeling with Linear Regression - OLS and Bayesian"
output: html_notebook
---
## Load and inspect data
```{r Dataset}
#browseURL("https://archive.ics.uci.edu/ml/datasets/Online+Retail+II")
```
**Data Set Information:**
This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.
**Attribute Information:**
**InvoiceNo:** Invoice number. Nominal.
A 6-digit integral number uniquely assigned to each transaction.
If this code starts with the letter 'c', it indicates a cancellation.
**StockCode:** Product (item) code. Nominal.
A 5-digit integral number uniquely assigned to each distinct product.
**Description:** Product (item) name. Nominal.
**Quantity:** The quantities of each product (item) per transaction. Numeric.
**InvoiceDate:** Invoice date and time. Numeric.
The day and time when a transaction was generated.
**UnitPrice:** Unit price. Numeric. Product price per unit in sterling (£).
**CustomerID:** Customer number. Nominal.
A 5-digit integral number uniquely assigned to each customer.
**Country:** Country name. Nominal.
The name of the country where a customer resides.
```{r Install and load packages}
# Install pacman if needed
if (!require("pacman")) install.packages("pacman")
# load packages
pacman::p_load(pacman,
tidyverse, rpart, psych, corrplot, cowplot, tree, VIM, GGally, lubridate, car)
```
```{r Import dataset}
#import file
dataset_2009 <- read.csv("datasets/online_retail_2009.csv")
head(dataset_2009) #check results
dataset_2010 <- read.csv("datasets/online_retail_2010.csv")
head(dataset_2010) #check results
```
```{r Any missing data}
#any missing data? Using aggr function from VIM package
aggr(dataset_2009, prop = F, numbers = T) # no red - no missing values
aggr(dataset_2010, prop = F, numbers = T) # no red - no missing values
```
Some Customer IDs are missing (they show up either as the string "NA" or as true NA values); the cleaning step below filters these rows out.
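A quick, hedged check of how many rows lack a Customer ID in each year's file (this assumes read.csv brought the missing IDs in either as the literal string "NA" or as true NA values, consistent with the filter used later):
```{r Count rows without a Customer ID}
# Rows with a missing Customer ID in each year's file
sum(dataset_2009$Customer.ID == "NA" | is.na(dataset_2009$Customer.ID))
sum(dataset_2010$Customer.ID == "NA" | is.na(dataset_2010$Customer.ID))
```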
```{r Combine datasets}
df_combined <- rbind(dataset_2009, dataset_2010)
dim(df_combined)
```
```{r Explore combined dataset}
(df_summary_stats <- df_combined %>%
summary())
```
1,067,371 total rows.
There are lots of negative values for Quantity and Price.
These rows will need to be removed.
```{r Remove negative value orders and those with no customer ids and test orders}
df <- df_combined %>%
filter(Quantity > 0) %>%
filter(Price > 0) %>%
filter(Customer.ID != "NA") %>%
filter(!grepl("TEST", StockCode))
dim(df)
```
After removing negative-quantity and zero/negative-price orders, test orders, and rows with no Customer ID, we have 805,539 rows of data.
```{r Reformat variables - InvoiceDate}
str(df$InvoiceDate)
#Date is character type, but needs to be date type
#Get summary stats on date column
df %>%
  select(InvoiceDate) %>%
  summarize_all(list(min = min, max = max)) # funs() is deprecated, so use a named list
#Convert to date type - this works swimmingly!
df$InvoiceDate <- lubridate::mdy_hm(df$InvoiceDate)
#check results
str(df$InvoiceDate)
```
```{r Add a date anchor to dataframe}
#We will use the most recent purchase date in the data as our time_now variable
df <- df %>%
mutate(time_now = max(df$InvoiceDate))
#check results
str(df$time_now)
```
```{r Add days_since column and format to numeric time interval}
# df <- df %>%
# mutate(days_since = as.numeric(InvoiceDate - time_now),
# purchase_amount = Quantity * Price)
df_01 <- df %>%
mutate(days_since = round(as.numeric(difftime(time_now, InvoiceDate, units = "days"))),
purchase_amount = Quantity * Price)
str(df_01$days_since)
str(df_01$purchase_amount)
head(df_01)
```
```{r Rename some columns}
#This will make our data manipulation easier
df_02 <- df_01 %>%
rename(
customer_id = Customer.ID,
stock_code = StockCode,
invoice_date = InvoiceDate
)
names(df_02)
```
## Customer Lifetime Value (CLV) Modeling with Linear Regression (OLS)
```{r Get our data in the right format}
#load zoo library
library(zoo)
#We will use df_02
head(df_02)
#We need to add a quarter & month column to dataframe from our invoice_date
df_03 <- df_02 %>%
  mutate(year_quarter = as.yearqtr(invoice_date, format = "%Y-%m-%d"),
         year_month = as.yearmon(invoice_date))
#check results
head(df_03)
#Get summary stats on year_quarter column
df_03 %>%
  select(year_quarter) %>%
  summarize_all(list(min = min, max = max)) # funs() is deprecated, so use a named list
#Get summary stats on year_month column
df_03 %>%
  select(year_month) %>%
  summarize_all(list(min = min, max = max))
```
```{r Filter on the last quarterly data that we have}
#filter on most recent customers only
#We don't have a full 3 months of data for the 4th quarter of 2011, so instead we take the last 90 days of data
#Testing the date filter logic to use in the next step
as.Date(max(df_03$invoice_date)) - 90
df_filtered <- df_03 %>%
filter(invoice_date >= as.Date(max(df_03$invoice_date)) - 90)
#check results
glimpse(df_filtered)
table(df_filtered$year_month)
```
```{r Group sales data by customer}
df_grouped <- df_filtered %>%
  group_by(customer_id) %>%
  summarize(sales_last_3mon = sum(purchase_amount),
            avg_sales = round(mean(purchase_amount), 2),
            avg_item_price = mean(Price),
            n_purchases = n(),
            days_since_last_purch = round(min(days_since), 2),
            customer_duration = round(max(days_since), 2))
head(df_grouped)
```
```{r Check Correlation}
# Visualization of correlations
df_grouped %>% select_if(is.numeric) %>%
select(-customer_id) %>%
cor() %>%
corrplot(method = "circle", type = "upper", insig = "blank", diag = FALSE, addCoef.col = "grey")
```
Strong correlation between sales for the last 3 months and mean sales.
Assumptions of the linear regression model (we will check several of these with diagnostic plots after fitting the first model below):
1. linear relationship between X and y
2. no measurement error in X (weak exogeneity)
3. independence of errors, with an expected error of 0
4. constant variance of the prediction errors (homoscedasticity)
5. normality of errors
```{r We need to split data into test train}
# Determine row to split on: split
split <- round(nrow(df_grouped) * 0.80)
# Create train
train <- df_grouped[1:split,]
# Create test
test <- df_grouped[(split + 1):nrow(df_grouped),]
```
```{r We want to run the linear regression model}
# Conduct regression on training set
sales_model <- train %>%
lm(
sales_last_3mon ~ avg_sales + avg_item_price + n_purchases + days_since_last_purch + customer_duration, # "as a function of"
data = .
)
# Show summary
sales_model %>% summary()
```
We have quite a few significant variables, a decent R-squared, and a low p-value for the overall model.
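Before moving on, a minimal sketch for checking several of the assumptions listed earlier (linearity, homoscedasticity, normality of errors) using base R's built-in diagnostic plots for lm objects:
```{r Diagnostic plots for the first model}
# Residuals vs fitted (linearity, homoscedasticity), normal Q-Q (normality of errors),
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(sales_model)
par(mfrow = c(1, 1))
```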
```{r Create the predictions}
# Predict on test set
preds <- predict(sales_model, test, type = "response")
# Compute errors
error <- preds - test$sales_last_3mon
# Calculate RMSE
sqrt(mean((preds - test$sales_last_3mon)^2))
```
RMSE = 6944.564
We check the VIF (variance inflation factor) to guard against multicollinearity.
There are no high VIFs to worry about.
```{r Check Variance Inflation Factor}
car::vif(sales_model)
```
```{r Can we predict future sales}
# Calculating mean of future sales
mean(preds, na.rm = TRUE)
```
Our mean predicted sales figure is 1143.
Mean actual sales for the past 3 months are 1196.
The forecast is not too far from the actual amount.
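For reference, a quick check of where the actual figure comes from (assuming the 1196 quoted above is the mean of the actual test-set sales):
```{r Compare predicted and actual mean sales}
# Mean of the actual sales in the test set, for comparison with mean(preds)
mean(test$sales_last_3mon, na.rm = TRUE)
```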
We need to look for outliers or high-leverage observations. We will use the broom package to do this.
```{r Leverage computations from sales_model}
sales_model %>%
broom::augment() %>%
arrange(desc(.hat)) %>%
select(sales_last_3mon, n_purchases, avg_sales, .fitted, .resid, .hat) %>%
head(10)
```
The leverage scores (the .hat column) show that the first three observations have the highest leverage, meaning these observations are far from the mean of the explanatory variables.
Upon closer inspection of df_grouped, one customer has a single very large purchase (customer_id 16446). The second observation (customer_id 16742) had one order with a very high average price. The third observation (customer_id 14096) has a very high number of orders (5,000+). We will remove these observations, since they have a big effect on our model, and then re-fit it.
```{r Remove outliers}
# && is not vectorized (it only evaluates the first element), so use %in% instead
df_outliers_removed <- df_grouped %>%
  filter(!customer_id %in% c(16446, 16742, 14096))
head(df_outliers_removed)
```
```{r Remove and clean up objects from first model}
# To clean up the memory
rm(split, train, test, preds, error)
```
```{r 2nd iteration - We need to split data into test train}
# Determine row to split on: split
split <- round(nrow(df_outliers_removed) * 0.80)
# Create train
train <- df_outliers_removed[1:split,]
# Create test
test <- df_outliers_removed[(split + 1):nrow(df_outliers_removed),]
#check results
dim(train)
dim(train)/dim(df_outliers_removed) #proportions
dim(test)
dim(test)/dim(df_outliers_removed) # proportions
```
```{r 2nd iteration - We want to run the linear regression model}
# Conduct regression on training set - we will remove insignificant variable avg_item_price
sales_model_02 <- train %>%
lm(
sales_last_3mon ~ avg_sales + n_purchases + days_since_last_purch + customer_duration, # "as a function of"
data = .
)
# Show summary
sales_model_02 %>% summary()
```
Our R-squared did not move much with the outliers removed, but we now get different coefficients, which changes our interpretations and our CLV forecasts.
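To make the coefficient shift concrete, a small sketch putting both sets of estimates side by side with broom::tidy() (broom is installed along with the tidyverse and is already used namespaced above via broom::augment(); the model labels here are just illustrative):
```{r Compare coefficients across models}
# Coefficient estimates before and after removing the high-leverage customers
dplyr::bind_rows(
  broom::tidy(sales_model) %>% dplyr::mutate(model = "with outliers"),
  broom::tidy(sales_model_02) %>% dplyr::mutate(model = "outliers removed")
) %>%
  dplyr::select(model, term, estimate, std.error, p.value) %>%
  dplyr::arrange(term, model)
```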
```{r Run predictions on the new model}
# Predict on test set
preds <- predict(sales_model_02, test, type = "response")
# Compute errors
error <- preds - test$sales_last_3mon
# Calculate RMSE
sqrt(mean((preds - test$sales_last_3mon)^2))
```
This is great! We lowered the RMSE, which means the model gives better forecasts.
```{r}
cor(sales_model_02$fitted.values,train$sales_last_3mon) # Computes the correlation between the fitted values and the actual ones
plot(train$sales_last_3mon,sales_model_02$fitted.values) # Plot the fitted values vs. the actual ones
data_mod <- data.frame(Predicted = predict(sales_model_02),  # fitted values on the training set
                       Observed = train$sales_last_3mon)     # actual training-set sales
ggplot(data_mod,
aes(x = Predicted,
y = Observed)) +
geom_point() +
geom_abline(intercept = 0,
slope = 1,
color = "red",
size = 1)
```
## Interpretation of the coefficients & learnings from the OLS model:
1. 71% of the variation in CLV can be explained by our independent variables; the remaining variation goes unexplained. The intercept of -121 is the predicted CLV when all of the variables are 0, which has no useful managerial interpretation.
2. The model as a whole is significant, given its low p-value.
3. The effects of avg_sales, n_purchases, days_since_last_purch, and customer_duration are all statistically significant.
4. A one-unit increase in avg_sales leads to an increase of about 1 British pound in CLV.
5. Each additional purchase order per customer increases CLV by about 13 British pounds.
6. Each additional day since the customer's last order reduces CLV by about 12 British pounds.
7. Each additional day of customer duration (time since the customer's first purchase in the window) increases CLV by about 15 British pounds.
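To make these coefficient readings concrete, a small sketch scoring one made-up customer with the fitted model; the feature values below are purely illustrative assumptions, not taken from the data:
```{r Score a hypothetical customer}
# Hypothetical customer: average sale of 20 pounds, 10 purchases in the window,
# last purchase 14 days ago, first purchase in the window 80 days ago
new_customer <- data.frame(
  avg_sales = 20,
  n_purchases = 10,
  days_since_last_purch = 14,
  customer_duration = 80
)
predict(sales_model_02, newdata = new_customer)
```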
## Customer Lifetime Value (CLV) Modeling with Bayesian Linear Regression
We will next try a Bayesian approach.
A Bayesian analysis answers the question, "Given these data, how plausible are different values of the effects?"
We will compare against the classic frequentist OLS model from above.
```{r Bayes LM}
library(rstanarm)
# Conduct regression
sales_bayes_model <- train %>%
stan_glm(
sales_last_3mon ~ avg_sales + n_purchases + days_since_last_purch + customer_duration,
data = .,
seed = 123
)
# Show summary
sales_bayes_model %>% summary()
```
The Bayesian model leads to the same interpretation as our OLS lm, since the coefficients are nearly the same. The Rhat values are all at 1.0 for each coefficient, indicating the chains have converged.
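As a quick sketch of what the Bayesian fit adds beyond point estimates, we can pull 90% posterior credible intervals with posterior_interval() and set them next to the OLS confidence intervals from confint() (both models were fit on the same training set):
```{r Compare Bayesian and OLS intervals}
# 90% posterior credible intervals for the Bayesian coefficients
posterior_interval(sales_bayes_model, prob = 0.90)
# 90% confidence intervals from the OLS model, for comparison
confint(sales_model_02, level = 0.90)
```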
## Final Summary
Key takeaways for CLV modeling, and what we can suggest to the sales and marketing teams:
1. We can use the CLV to allocate marketing resources between customers to maximize future profits (see the sketch below).
2. We can prevent profitable customers from churning.
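As a minimal sketch of the first takeaway, assuming we simply rank customers by the model's predicted sales (the 10% cutoff is arbitrary and only for illustration):
```{r Rank customers by predicted CLV}
# Score every customer with the outlier-free model and keep the top 10% by predicted CLV
clv_scores <- df_outliers_removed %>%
  mutate(predicted_clv = predict(sales_model_02, newdata = df_outliers_removed)) %>%
  arrange(desc(predicted_clv))
top_customers <- clv_scores %>%
  slice_head(prop = 0.10) %>%
  select(customer_id, predicted_clv, sales_last_3mon, days_since_last_purch)
head(top_customers)
```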