# Regularization
```{r knitr-setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.path = "./figs/",
                      fig.align = "center",
                      fig.asp = 0.618,
                      out.width = "80%")

eval_chnk <- FALSE
```
Regularization is a common topic in machine learning and Bayesian statistics. In this chapter, we describe the three most common regularized linear models in the machine learning literature and introduce them in the context of the PISA data set. At the end of the chapter you'll find exercises that will put your knowledge to the test. Most of the material builds upon @boehmke2019 and @james2013, which can be used as references.
## Ridge regularization
Do not let others fool you into thinking that ridge regression is a fancy artificial intelligence algorithm. Are you familiar with linear regression? If you are, then ridge regression is just an adaptation of linear regression.
The whole aim of linear regression, or Ordinary Least Squares (OLS), is to minimize the sum of the squared residuals. In other words, fit many candidate regression lines to the data and keep only the one with the lowest sum of squared residuals. In formula terms, OLS tries to **minimize** this:
\begin{equation}
RSS = \sum_{i = 1}^n (actual_i - predicted_i)^2
\end{equation}
For each fitted regression line, you compare the predicted value ($predicted_i$) to the actual value ($actual_i$), square the difference, and add it all up. Each fitted regression line then has an associated Residual Sum of Squares (RSS) and the linear model chooses the line with the lowest RSS.
> Note: Social scientists are familiar with the RSS and call it just by its name. However, be aware that in machine learning jargon, the RSS belongs to a general family called **loss functions**. Loss functions are metrics that evaluate the **fit** of your model, and there are many around (such as the AIC, BIC or $R^2$).
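To make the formula concrete, here is a minimal sketch (not part of the original analysis) that computes the RSS by hand for a linear model fitted on R's built-in `mtcars` data, used here only as a placeholder:

```{r, eval = FALSE}
# Minimal illustration: compute the RSS of a fitted OLS model by hand.
# `mtcars` is only a placeholder data set for this example.
fit <- lm(mpg ~ wt + hp, data = mtcars)

actual    <- mtcars$mpg
predicted <- predict(fit)

# Residual Sum of Squares: squared differences, added up
rss <- sum((actual - predicted)^2)
rss

# The same quantity, computed from the residuals R stores internally
sum(residuals(fit)^2)
```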
Ridge regression takes the previous RSS loss function and adds one term:
\begin{equation}
RSS + \lambda \sum_{j = 1}^p \beta^2_j
\end{equation}
The term on the right (where $j$ indexes the $p$ predictors in the model) is called a *shrinkage penalty* because penalizing the squared coefficients pushes each $\beta_j$ closer to zero. The shrinkage part becomes clearer once you think of this term as forcing each coefficient to be as small as possible without compromising the Residual Sum of Squares (RSS). In other words, we want the smallest coefficients that don't affect the fit of the line (RSS).
An intuitive way to think about it is to treat the RSS and $\sum_{j = 1}^p \beta^2_j$ as two separate things. The RSS measures how well the model fits the data and $\sum_{j = 1}^p \beta^2_j$ limits how much you overfit the data. Finally, the $\lambda$ (lambda) between these two terms can be interpreted as a "weight": the higher the lambda, the more weight is given to the shrinkage term of the equation. If $\lambda$ is 0, then multiplying 0 by $\sum_{j = 1}^p \beta^2_j$ always returns zero, reducing the equation to the single term $RSS$.
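To make the role of $\lambda$ concrete, here is a small sketch with made-up residuals and coefficients (purely illustrative, not part of the PISA analysis) showing that the penalty term vanishes when $\lambda = 0$ and grows with larger values of $\lambda$:

```{r, eval = FALSE}
# Purely illustrative numbers: residuals from some fitted line and its coefficients
resids <- c(1.5, -0.3, 2.1, -1.2)
betas  <- c(0.8, -2.5, 1.1)

# The ridge objective: RSS plus lambda times the sum of squared coefficients
ridge_objective <- function(resids, betas, lambda) {
  sum(resids^2) + lambda * sum(betas^2)
}

ridge_objective(resids, betas, lambda = 0)   # reduces to the plain RSS
ridge_objective(resids, betas, lambda = 1)   # penalty starts to matter
ridge_objective(resids, betas, lambda = 10)  # penalty dominates the objective
```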
Why is there a need to "limit" how well the model fits the data? Because we, social scientists and data scientists, very commonly **overfit** the data. The plot below shows a simulation from [Simon Jackson](https://drsimonj.svbtle.com/ridge-regression-with-glmnet) where we can see that, on the training set, both OLS and ridge fit the data very well. However, when evaluated on the test data, ridge regression has lower out-of-sample error, as its $R^2$ is higher across models with different numbers of training observations.
```{r, echo = FALSE}
# This chunk was taken and adapted from https://drsimonj.svbtle.com/ridge-regression-with-glmnet
library(broom)
library(glmnet)
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)

# Compute R^2 from true and predicted values
rsquare <- function(true, predicted) {
  sse <- sum((predicted - true)^2)
  sst <- sum((true - mean(true))^2)
  rsq <- 1 - sse / sst

  # For this post, impose floor...
  if (rsq < 0) rsq <- 0

  return(rsq)
}

# Train ridge and OLS regression models on simulated data set with `n_train`
# observations and a number of features as a proportion to `n_train`,
# `p_features`. Return R squared for both models on:
#  - y values of the training set
#  - y values of a simulated test data set of `n_test` observations
#  - The beta coefficients used to simulate the data
ols_vs_ridge <- function(n_train, p_features, n_test = 200) {
  ## Simulate datasets
  n_features <- floor(n_train * p_features)
  betas <- rnorm(n_features)

  x <- matrix(rnorm(n_train * n_features), nrow = n_train)
  y <- x %*% betas + rnorm(n_train)
  train <- data.frame(y = y, x)

  x <- matrix(rnorm(n_test * n_features), nrow = n_test)
  y <- x %*% betas + rnorm(n_test)
  test <- data.frame(y = y, x)

  ## OLS
  lm_fit <- lm(y ~ ., train)

  # Match to beta coefficients
  lm_betas <- tidy(lm_fit) %>%
    filter(term != "(Intercept)") %>%
    {.$estimate}
  lm_betas_rsq <- rsquare(betas, lm_betas)

  # Fit to training data
  lm_train_rsq <- glance(lm_fit)$r.squared

  # Fit to test data
  lm_test_yhat <- predict(lm_fit, newdata = test)
  lm_test_rsq <- rsquare(test$y, lm_test_yhat)

  ## Ridge regression
  lambda_vals <- 10^seq(3, -2, by = -.1)  # Lambda values to search
  cv_glm_fit <- cv.glmnet(as.matrix(train[, -1]), train$y,
                          alpha = 0, lambda = lambda_vals, nfolds = 5)
  opt_lambda <- cv_glm_fit$lambda.min  # Optimal lambda
  glm_fit <- cv_glm_fit$glmnet.fit

  # Match to beta coefficients
  glm_betas <- tidy(glm_fit) %>%
    filter(term != "(Intercept)", lambda == opt_lambda) %>%
    {.$estimate}
  glm_betas_rsq <- rsquare(betas, glm_betas)

  # Fit to training data
  glm_train_yhat <- predict(glm_fit, s = opt_lambda, newx = as.matrix(train[, -1]))
  glm_train_rsq <- rsquare(train$y, glm_train_yhat)

  # Fit to test data
  glm_test_yhat <- predict(glm_fit, s = opt_lambda, newx = as.matrix(test[, -1]))
  glm_test_rsq <- rsquare(test$y, glm_test_yhat)

  data.frame(
    model = c("OLS", "Ridge"),
    betas_rsq = c(lm_betas_rsq, glm_betas_rsq),
    train_rsq = c(lm_train_rsq, glm_train_rsq),
    test_rsq = c(lm_test_rsq, glm_test_rsq)
  )
}

# Function to run `ols_vs_ridge()` `n_replications` times
repeated_comparisons <- function(n_train, p_features, n_replications = 1) {
  map(seq(n_replications), ~ ols_vs_ridge(n_train, p_features)) %>%
    map2(seq(.), ~ mutate(.x, replicate = .y)) %>%
    reduce(rbind)
}

set.seed(23141)
d <- purrr::cross_df(list(
  n_train = seq(20, 200, 20),
  p_features = .95
))

d <-
  d %>%
  mutate(results = map2(n_train, p_features, repeated_comparisons))

d %>%
  unnest() %>%
  group_by(model, n_train) %>%
  summarise(
    train_rsq = mean(train_rsq),
    test_rsq = mean(test_rsq)) %>%
  gather(data, rsq, contains("rsq")) %>%
  ungroup() %>%
  mutate(data = factor(gsub("_rsq", "", data), levels = c("train", "test"))) %>%
  ggplot(aes(n_train, rsq, color = model)) +
  geom_line() +
  geom_point(size = 4, alpha = .3) +
  facet_wrap(~ data) +
  theme_minimal() +
  labs(x = "Number of training observations",
       y = "R squared")
```
The strength of ridge regression comes from the fact that it trades off fitting the training data really well for improved generalization. In other words, we accept some extra **bias** (because we force the coefficients to be smaller) in exchange for lower **variance** (making our predictions more robust). The whole gist behind ridge regression is penalizing very large coefficients for better generalization on new data.
With that intuition in mind, there is one important thing to keep in mind: the predictors of a ridge regression need to be standardized. Why is this the case? Because, depending on its scale, a predictor's coefficient can be penalized more than those of other predictors. Suppose that you have a person's income (measured in thousands per month) and the time they spend with their family (measured in seconds), and you're trying to predict happiness. The coefficient for salary could be penalized much more than the coefficient for family time **just** because of the scale in which each variable is measured.
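The short sketch below (with simulated numbers, not the PISA data) illustrates the point: after scaling, both predictors have a standard deviation of one, so the penalty treats their coefficients comparably. Note that `scale()` also centers the variables, whereas the `step_scale()` used later only divides by the standard deviation, but the idea is the same:

```{r, eval = FALSE}
# Illustrative only: two simulated predictors on very different scales
set.seed(1)
income      <- rnorm(100, mean = 3000, sd = 800)   # thousands per month
family_time <- rnorm(100, mean = 7200, sd = 1800)  # seconds per day

# Before scaling, a "one unit" change means very different things
sd(income)
sd(family_time)

# After scaling, both predictors have a standard deviation of 1
scaled_x <- scale(cbind(income, family_time))
apply(scaled_x, 2, sd)
```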
In R, you can fit a ridge regression using `tidymodels` and `tidyflow`. Let's load the packages that we will work with and read the data:
```{r, message = FALSE}
library(tidymodels)
library(tidyflow)
data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
```
We will construct our `tidyflow` step by step. We begin with the data and then separate the training and test data. All of our modeling will be performed on the training data and the test data is saved for later (the test data must be completely ignored until you have your final tuned model). The second step is specifying the variables in the model and scaling all of them: as explained above, we want to standardize all predictors so that no variable gets penalized more than another due to its scale.
```{r}
# Specify all variables and scale
rcp <-
  # Define dependent (math_score) and independent variables
  ~ recipe(math_score ~ MISCED + FISCED + HISEI + REPEAT + IMMIG + DURECEC + BSMJ, data = .) %>%
  # Scale all predictors (the recipe already knows which variables are predictors)
  step_scale(all_predictors())

tflow <-
  tidyflow(seed = 231141) %>%
  plug_data(pisa) %>%
  plug_split(initial_split, prop = .7) %>%
  # Add the recipe with all variables and scale
  plug_recipe(rcp)

tflow
```
The argument `prop` controls the proportion of the sample that will be in the training data. Here we specify it to be `.7`, 70% of the data. The third step is specifying the **tuning** parameters. The ridge regression has a parameter called `penalty` which needs to be set by us. `penalty` is the "weight" term in the ridge equation, which controls how much weight we want to give to the "shrinkage penalty" (this is the $\lambda$ from the equation). If this penalty is set to 0, we attach **no** weight to the penalty term and we should get essentially the same result as OLS. Let's try that:
```{r}
############################# Ridge regression ################################
###############################################################################
regularized_reg <-
  set_engine(
    # mixture specifies the type of penalized regression: 0 is ridge regression
    linear_reg(penalty = 0, mixture = 0),
    "glmnet"
  )

model1 <-
  tflow %>%
  plug_model(regularized_reg) %>%
  fit()

# Get ridge coefficients
mod <- model1 %>% pull_tflow_fit() %>% .[["fit"]]
ridge_coef <- predict(mod, s = 0, type = "coefficients")

############################# Linear model ####################################
###############################################################################
model2 <-
  tflow %>%
  plug_model(set_engine(linear_reg(), "lm")) %>%
  fit()

lm_coef <- model2 %>% pull_tflow_fit() %>% .[["fit"]] %>% coef()

############################# Comparing models ################################
###############################################################################
comparison <-
  data.frame(coefs = names(lm_coef),
             `Linear coefficients` = unname(round(lm_coef, 2)),
             `Ridge coefficients` = round(as.vector(ridge_coef), 2))

knitr::kable(comparison)
```
Coming from a social science background, it might seem counterintuitive that the researcher has to specify tuning parameters for the model. In traditional social science statistics, models usually estimate similar values internally and the user doesn't have to think about them. However, there are strategies already implemented to explore many possible values and pick the best combination. With our previous example, we only have to set the `penalty` argument to `tune()` and add a grid of values for the model to search over:
```{r}
# Here we add the cross-validation and grid
tflow <-
  tflow %>%
  # Cross-validation
  plug_resample(vfold_cv, v = 5) %>%
  # Grid
  plug_grid(grid_regular)

regularized_reg <- update(regularized_reg, penalty = tune())

res <-
  tflow %>%
  # Update the model to specify that `penalty` will be tuned
  plug_model(regularized_reg) %>%
  fit()

final_ridge <- complete_tflow(res, metric = "rmse")

final_ridge %>%
  pull_tflow_fit() %>%
  .[["fit"]] %>%
  plot(xvar = "lambda", label = TRUE)
```
Here we can see how the coefficients are affected as the weight of the `penalty` parameter increases. Each line traces the coefficient of one variable. The `x` axis contains the penalty values (on the log scale), and as the penalty increases, the coefficients shrink closer and closer to zero. By a log `penalty` of around 8, nearly all coefficients have been shrunk very close to zero. This plot is just an exercise to understand how ridge regression works; in practice, we can figure out the best lambda automatically:
```{r}
best_tune <-
  res %>%
  pull_tflow_fit_tuning() %>%
  select_best(metric = "rmse")

best_tune
```
However, there's no need to calculate this, as `complete_tflow` figures it out for you (as you can see in the code chunk above, `complete_tflow` extracts this automatically and fits the best model). We can calculate the $RMSE$ of the training data from the best model and compare it to the predictions on the testing data:
```{r}
train_rmse_ridge <-
  final_ridge %>%
  predict_training() %>%
  rmse(math_score, .pred)

holdout_ridge <-
  final_ridge %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_ridge$type <- "training"
holdout_ridge$type <- "testing"

ridge <- as.data.frame(rbind(train_rmse_ridge, holdout_ridge))
ridge$model <- "ridge"
ridge
```
The testing error (RMSE) is higher than the training error, as expected, since the model nearly always **memorizes** the training data to some degree and therefore fits it better than new data.
## Lasso regularization
Lasso regularization is very similar to ridge regularization; only one thing changes: the penalty term. Instead of squaring the coefficients, the lasso penalty takes the absolute value of each coefficient.
\begin{equation}
RSS + \lambda \sum_{j = 1}^p |\beta_j|
\end{equation}
Although it might not be self-evident from the formula, lasso regularization has an important distinction: it can force a coefficient to be exactly zero. This means that the lasso performs variable selection, keeping the variables whose coefficients are large enough to be worth their penalty while not compromising the RSS of the model. The problem with ridge regression is that, as the number of variables increases, the training error will almost always improve but the test error will not.
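As a quick illustration of this property, the sketch below (on simulated data, not the PISA data, and with an arbitrary $\lambda$) fits a lasso and a ridge with `glmnet` directly and counts how many coefficients end up exactly at zero in each case:

```{r, eval = FALSE}
# Illustrative sketch on simulated data where only three predictors matter
library(glmnet)

set.seed(42)
n <- 200
p <- 20
x <- matrix(rnorm(n * p), nrow = n)
true_betas <- c(3, -2, 1.5, rep(0, p - 3))
y <- x %*% true_betas + rnorm(n)

# alpha = 1 is the lasso, alpha = 0 is the ridge; lambda = 0.5 is arbitrary
lasso_fit <- glmnet(x, y, alpha = 1, lambda = 0.5)
ridge_fit <- glmnet(x, y, alpha = 0, lambda = 0.5)

# The lasso sets many coefficients exactly to zero; the ridge does not
sum(coef(lasso_fit)[-1] == 0)
sum(coef(ridge_fit)[-1] == 0)
```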
For example, if we define the same model from above using a lasso, you'll see that it forces coefficients to be **exactly zero** if they don't add anything relative to the RSS of the model. This means that variables which do not add anything to the model will be excluded unless they add enough explanatory power to compensate for the size of their coefficient. Here's the same model fitted with a lasso:
```{r}
regularized_reg <- update(regularized_reg, mixture = 1)

res <-
  tflow %>%
  plug_model(regularized_reg) %>%
  fit()

final_lasso <- complete_tflow(res, metric = "rmse")

final_lasso %>%
  pull_tflow_fit() %>%
  .[["fit"]] %>%
  plot(xvar = "lambda", label = TRUE)
```
In contrast to ridge regression, where coefficients are forced to be close to zero, the lasso penalty actually forces some coefficients **to be exactly zero**. This property means that the lasso makes a **selection of the variables with the strongest coefficients** and eliminates those which do not have a strong relationship. The lasso is usually better for model interpretation because it removes redundant variables, while ridge can be useful if you want to keep a number of variables in the model even if they are weak predictors (as controls, for example).
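If you want to see which variables were dropped in our PISA lasso, a hedged sketch (reusing the extraction pattern shown earlier, and assuming the `res` and `final_lasso` objects from the previous chunks are still available) could look like this:

```{r, eval = FALSE}
# Sketch: inspect the lasso coefficients at the penalty chosen during tuning
best_penalty <-
  res %>%
  pull_tflow_fit_tuning() %>%
  select_best(metric = "rmse") %>%
  .[["penalty"]]

lasso_fit <- final_lasso %>% pull_tflow_fit() %>% .[["fit"]]

# Coefficients printed as "." (exactly zero) were effectively excluded
predict(lasso_fit, s = best_penalty, type = "coefficients")
```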
To check the final model and its error, we can recycle the code from above and adapt it to the lasso:
```{r}
train_rmse_lasso <-
  final_lasso %>%
  predict_training() %>%
  rmse(math_score, .pred)

holdout_lasso <-
  final_lasso %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_lasso$type <- "training"
holdout_lasso$type <- "testing"

lasso <- as.data.frame(rbind(train_rmse_lasso, holdout_lasso))
lasso$model <- "lasso"
lasso
```
So far, we can check which model is performing better:
```{r}
model_comparison <- rbind(ridge, lasso)
model_comparison
```
Currently, the ridge regression has a very minor advantage over the lasso, yet the difference is probably within the margin of error. Depending on your aim, you might want to choose either of the models. For example, if our models contained a lot of variables, the lasso might be more interpretable as it reduces the number of variables. However, if you have reasons to believe that keeping all variables in the model is important, then ridge provides an advantage.
## Elastic Net regularization
If you're familiar with ridge and lasso, then elastic net regularization is a logical next step. Elastic net (the name sounds fancy, but it is also an adaptation of OLS) combines both penalties into one single equation.
Here we define our ridge penalty:
$$ridge = \lambda \sum_{j = 1}^p \beta_j^2$$
And here we define our lasso penalty:
$$lasso = \lambda \sum_{j = 1}^p |\beta_j|$$
Elastic net regularization simply adds these two penalties to the RSS:
$$RSS + lasso + ridge$$
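To be a bit more precise, the two penalties are usually weighted against each other rather than simply summed. As far as I understand `glmnet`'s parameterization (where the `mixture` argument plays the role of $\alpha$), the combined penalty can be written as:

$$RSS + \lambda \left( \alpha \sum_{j = 1}^p |\beta_j| + \frac{1 - \alpha}{2} \sum_{j = 1}^p \beta_j^2 \right)$$

so that $\alpha = 1$ recovers the lasso, $\alpha = 0$ recovers the ridge, and anything in between mixes the two.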
I think the best explanation for elastic net regularization comes from @boehmke2019:
> Although lasso models perform feature selection, when two strongly correlated features are pushed towards zero, one may be pushed fully to zero while the other remains in the model. Furthermore, the process of one being in and one being out is not very systematic. In contrast, the ridge regression penalty is a little more effective in systematically handling correlated features together. Consequently, the advantage of the elastic net penalty is that it enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty.
Essentially, you now have two tuning parameters. In the grid of values, instead of specifying a `mixture` of `0` (ridge) or `1` (lasso), `tidyflow` will slide through several values of `mixture` ranging from 0 to 1 and cross them with several values of the `penalty` ($\lambda$). This is formally called a **grid search**.
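To get a feel for what such a grid looks like, here is a small sketch using the `dials` helpers that ship with `tidymodels` (the number of levels is arbitrary; as far as I can tell, this is the kind of grid that `plug_grid(grid_regular)` constructs behind the scenes):

```{r, eval = FALSE}
# Sketch: a regular grid over the two tuning parameters
library(dials)

grid_regular(
  penalty(),  # lambda, explored on a log scale by default
  mixture(),  # 0 = ridge, 1 = lasso, values in between = elastic net
  levels = 5  # 5 values per parameter -> 25 combinations
)
```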
We can recycle the same code from above:
```{r}
regularized_reg <- update(regularized_reg, mixture = tune())

res <-
  tflow %>%
  plug_model(regularized_reg) %>%
  fit()

final_elnet <- complete_tflow(res, metric = "rmse")

train_rmse_elnet <-
  final_elnet %>%
  predict_training() %>%
  rmse(math_score, .pred)

holdout_elnet <-
  final_elnet %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_elnet$type <- "training"
holdout_elnet$type <- "testing"

elnet <- as.data.frame(rbind(train_rmse_elnet, holdout_elnet))
elnet$model <- "elnet"
elnet
```
The RMSE of the elastic net is somewhat lower than that of the ridge and lasso, but the difference is also probably within the margin of error. Let's compare them visually:
```{r}
model_comparison <- rbind(model_comparison, elnet)

model_comparison %>%
  ggplot(aes(model, .estimate, color = type, group = type)) +
  geom_point(position = "dodge") +
  geom_line() +
  scale_y_continuous(name = "RMSE") +
  scale_x_discrete(name = "Models") +
  theme_minimal()
```
## Exercises
The [Fragile Families Challenge](https://www.fragilefamilieschallenge.org/) is a study that aimed to predict a series of indicators of children at age 15 using only data from ages 0 to 9. With this challenge, the principal investigators wanted to test whether skills such as cognitive and non-cognitive abilities could be predicted correctly. With that idea in mind, they were interested in following up children that beat the predictions: those children that exceeded the model's prediction given their initial conditions.
Using a similarly constructed proxy, I've created a non-cognitive index from the PISA 2018 data for the United States, which is the average of these questions:
- ST182Q03HA - I find satisfaction in working as hard as I can.
- ST182Q04HA - Once I start a task, I persist until it is finished.
- ST182Q05HA - Part of the enjoyment I get from doing things is when I improve on my past performance.
- ST182Q06HA - If I am not good at something, I would rather keep struggling to master it than move on to something I may [...]
The index ranges from 1 to 4, where 4 means the student strongly agrees and 1 means they strongly disagree. In other words, the higher the value, the higher the non-cognitive skills. You can check out the complete PISA codebook [here](https://docs.google.com/spreadsheets/d/12--3vD737rcu6olviKutRLEiyKNZ2bynXcJ4CpwtNsQ/edit?usp=sharing).
In this series of exercises you will fit different models that predict this index of non-cognitive skills, perform a grid search for each of the three models, and compare their predictions.
First, read in the data with:
```{r, eval = FALSE}
library(tidymodels)
library(tidyflow)
data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
```
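For reference, here is a minimal sketch of how an average index like `noncogn` could be constructed from the four items listed above. This is only illustrative: it assumes the raw items appear under those column names, and the downloaded data already contains the finished index.

```{r, eval = FALSE}
# Illustrative only: averaging the four ST182 items into an index.
# The data read in above already contains the finished `noncogn` variable.
items <- c("ST182Q03HA", "ST182Q04HA", "ST182Q05HA", "ST182Q06HA")
noncogn_manual <- rowMeans(pisa[, items], na.rm = TRUE)
```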
#### 1. Create a `tidyflow` with a split {-#ex1}
* Begin with the data `pisa`
* To plug a split, use `initial_split`
Remember to set the seed to `2341` so that everyone can compare their results.
<details>
<summary><strong>> Answer </strong></summary>
```{r, eval = FALSE}
tflow <-
  pisa %>%
  tidyflow(seed = 2341) %>%
  plug_split(initial_split)

tflow
```
</details>
#### 2. Run a ridge regression with non-cognitive as the dependent variable {-#ex2}
* Plug in a formula (hint: look at `?plug_formula`) and use as many variables as you want (you can reuse the variables from the previous examples or pick all of them). A formula like `noncogn ~ .` will regress `noncogn` on all variables.
* Plug in the ridge regression with `penalty` set to `0.001` (hint: remember to set `mixture` to the value corresponding to the ridge regression)
* Fit the ridge model (with `fit`)
* Predict on the training data with `predict_training` and explore the $R^2$ (`rsq`) and $RMSE$ (`rmse`).
<details>
<summary><strong>> Answer </strong></summary>
```{r, eval = FALSE}
ridge_mod <- set_engine(linear_reg(penalty = 0.001, mixture = 0), "glmnet")

tflow <-
  tflow %>%
  plug_formula(noncogn ~ .) %>%
  plug_model(ridge_mod)

m1 <- fit(tflow)
m1_rsq <- predict_training(m1) %>% rsq(noncogn, .pred)
m1_rmse <- predict_training(m1) %>% rmse(noncogn, .pred)
```
</details>
#### 3. Add a recipe to scale all of the predictors and rerun the previous model {-#ex3}
* Drop the formula from the `tidyflow` with `drop_formula`
* Add a recipe with the same formula you had, but including the `step_scale` for all predictors
* Rerun the model and extract the $R^2$ and $RMSE$
How do the $R^2$ and $RMSE$ change? Does scaling have a noticeable impact?
<details>
<summary><strong>> Answer </strong></summary>
```{r, eval = FALSE}
rcp <-
  ~ recipe(noncogn ~ ., data = .) %>%
  step_scale(all_predictors())

tflow <-
  tflow %>%
  drop_formula() %>%
  plug_recipe(rcp)

m2 <- fit(tflow)
m2_rsq <- predict_training(m2) %>% rsq(noncogn, .pred)
m2_rmse <- predict_training(m2) %>% rmse(noncogn, .pred)
```
</details>
#### 4. Adapt the previous model to do a grid search of `penalty` values {-#ex4}
* Add a cross-validation resample (`vfold_cv`)
* Add a tuning grid (`grid_regular`) and specify `levels = 10`. This will create a tuning grid of 10 values
* Update the `penalty` parameter to be `tune`d
* Run the grid search (`fit`)
* Extract the tuning grid (`pull_tflow_fit_tuning`) and visualize `autoplot`
Is there a pattern in how the fit metrics improve or worsen as the `penalty` changes?
<details>
<summary><strong>> Answer </strong></summary>
```{r, eval = FALSE}
ridge_mod <- update(ridge_mod, penalty = tune())

tflow <-
  tflow %>%
  replace_model(ridge_mod) %>%
  plug_resample(vfold_cv) %>%
  plug_grid(grid_regular, levels = 10)

m3 <- fit(tflow)

m3 %>%
  pull_tflow_fit_tuning() %>%
  autoplot()
```
</details>
#### 5. Run a lasso regression with the same specification as above {-#ex5}
* Update the model to have a mixture of `1` (this is specifying that we want a lasso)
* Run the grid search (`fit`)
* Extract the tuning grid (`pull_tflow_fit_tuning`) and visualize `autoplot`
Which model is performing better? Ridge or Lasso? Can you comment on the pattern of the penalty between ridge and lasso?
<details>
<summary><strong>> Answer </strong></summary>
```{r, eval = FALSE}
lasso_mod <- update(ridge_mod, mixture = 1)

m4 <-
  tflow %>%
  replace_model(lasso_mod) %>%
  fit()

m4 %>%
  pull_tflow_fit_tuning() %>%
  autoplot()
```
</details>
#### 6. Run an elastic net regression on non cognitive skills {-#ex6}
* Update the model to have a `tune`d mixture
* Replace the model in the `tidyflow` with the elastic net model
* Run the grid search (`fit`)
* Extract the tuning grid (`pull_tflow_fit_tuning`) and visualize `autoplot`
<details>
<summary><strong>> Answer </strong></summary>
```{r, eval = FALSE}
elnet_mod <- update(lasso_mod, mixture = tune())

m5 <-
  tflow %>%
  replace_model(elnet_mod) %>%
  fit()

m5 %>%
  pull_tflow_fit_tuning() %>%
  autoplot()

# Additional plot with standard error
library(tidyr)

m5 %>%
  pull_tflow_fit_tuning() %>%
  collect_metrics() %>%
  pivot_longer(penalty:mixture) %>%
  mutate(low = mean - (std_err * 2),
         high = mean + (std_err * 2)) %>%
  ggplot(aes(value, mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = low, ymax = high)) +
  facet_grid(.metric ~ name)
```
</details>
#### 7. Compare the three models {-#ex7}
* Finalize the three models (the ridge model, the lasso model and the elastic net model) with `complete_tflow`. Remember to set the `metric`!
* Use the three finalized models (the ones that were produced by `complete_tflow`) to `predict_training` and `predict_testing` on each one
* Calculate the `rmse` of the three models on both training and testing
* Visualize the three models and their error for training/testing
* Comment on which model is better in terms of out-of-sample fit
* Is it better to keep the most accurate model or a model that includes relevant confounders (even if their relationship is somewhat weak)?
<details>
<summary><strong>> Answer </strong></summary>
```{r, eval = FALSE}
# Since we will be repeating the same process many times
# let's write a function to predict on the training/testing
# and combine them. This function will accept a single
# model and produce a data frame with the RMSE error for
# training and testing. This way, we can reuse the code
# without having to copy everything many times
calculate_err <- function(final_model, type_model = NULL) {
  final_model <- complete_tflow(final_model, metric = "rmse")

  err_train <-
    final_model %>%
    predict_training() %>%
    rmse(noncogn, .pred)

  err_test <-
    final_model %>%
    predict_testing() %>%
    rmse(noncogn, .pred)

  err_train$type <- "train"
  err_test$type <- "test"

  res <- as.data.frame(rbind(err_train, err_test))
  res$model <- type_model
  res
}

final_res <-
  rbind(
    calculate_err(m3, "ridge"),
    calculate_err(m4, "lasso"),
    calculate_err(m5, "elnet")
  )

final_res %>%
  ggplot(aes(model, .estimate, color = type)) +
  geom_point() +
  theme_minimal()

## BONUS
## Fit a linear regression and compare the four models
## What is the best model to pick considering both accuracy and simplicity?
```
</details>