# Tree-based methods
```{r, echo = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
In this chapter we will touch upon the most popular tree-based methods used in machine learning. Haven't heard of the term "tree-based methods"? Do not panic. The idea behind tree-based methods is very simple and we'll explain how they work step by step, starting from the basics. Most of the material in this chapter builds upon @boehmke2019 and @james2013.
Before we begin, let's load `tidyflow` and `tidymodels` and read the data that we'll be using.
```{r, message = FALSE}
library(tidymodels)
library(tidyflow)
library(rpart.plot)
library(vip)
library(baguette)
data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
```
## Decision trees
Decision trees are simple models. In fact, they are even simpler than linear models. They require little statistical background and are among the simplest models to communicate to a general audience. In particular, the visualizations used for decision trees are very powerful in conveying information and can even serve as an exploratory avenue for social research.
Throughout this chapter, we'll be using the PISA data set from the regularization chapter. In this example we'll focus on predicting the `math_score` of students in the United States based on the socio-economic status of the parents (named `HISEI` in the data; the higher the `HISEI` variable, the higher the socio-economic status), the father's education (named `FISCED` in the data; coded in several categories from 0 to 6, where 6 is high education) and whether the child repeated a grade (named `REPEAT` in the data). `REPEAT` is a dummy variable where `1` means the child repeated a grade and `0` means no repetition.
Decision trees, as their name conveys, are tree-like diagrams. They work by defining `yes-or-no` rules based on the data and assigning each respondent the average (or, for categorical outcomes, the most common) value of their final branch. The best way to learn about decision trees is by looking at one. Let's do that:
```{r, echo = FALSE}
mod1 <- set_engine(decision_tree(mode = "regression"), "rpart")
tflow <-
pisa %>%
tidyflow(seed = 23151) %>%
plug_split(initial_split) %>%
plug_formula(math_score ~ MISCED + FISCED + HISEI + REPEAT + IMMIG + DURECEC + BSMJ) %>%
plug_model(mod1)
vanilla_fit <- fit(tflow)
tree <- pull_tflow_fit(vanilla_fit)$fit
cols <- c("black", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
In this example the top-most box, which says `HISEI < 56`, is the **root node**. This is the most important variable that predicts `math_score`. Inside the blue box you can see two numbers: $100\%$, which means that the entire sample is present in this **node**, and the number `474`, the average mathematics test score for the entire sample.
On both sides of the root node (`HISEI < 56`) there is a `yes` and a `no`. Decision trees work by **partitioning** variables into `yes-or-no` branches. The `yes` branch satisfies the condition of the **root node** (`HISEI < 56`) and always branches out to the left:
```{r, echo = FALSE}
cols <- c("black", "black", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
In contrast, the `no` branch always branches out to the right:
```{r, echo = FALSE}
cols <- c("black", "grey", "grey", "grey", "black", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
The criterion for separating into `yes-or-no` branches is that respondents must be very similar within branches and very different between branches (later in this chapter I will explain in detail which criterion is used and how). The decision tree figures out that respondents with an `HISEI` below $56$ and above $56$ are the most different with respect to the mathematics score. The left branch (where there is a `yes` in the **root node**) contains those with a `HISEI` below $56$ and the right branch (where there is a `no`) contains those with a `HISEI` above $56$. Let's call these two groups the low and high SES groups, respectively. If we look at the two boxes that come down from these branches, the low SES branch has an average math score of $446$ while the high SES branch has an average test score of $501$:
```{r, echo = FALSE}
cols <- c("black", "black", "grey", "grey", "black", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
For the sake of simplicity, let's focus now on the branch of the low SES group (the left branch). The second node coming out of the low SES branch contains 50\% of the sample and an average math score of $446$. This is the node with the rule `REPEAT >= 0.5`:
```{r, echo = FALSE}
cols <- c("black", "black", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
This 'intermediate' node is called an **internal node**. To calculate this **internal node**, the decision tree algorithm limits the entire data set to only those with low SES (literally, the decision tree does something like `pisa[pisa$HISEI < 56, ]`) and asks the same question it asked in the **root node**: of all the variables in the model, which one separates respondents into two branches such that they are very similar within each branch but very different between branches with respect to `math_score`?
For those with a low SES background, this variable is whether the child repeated a grade or not. In particular, those coming from a low SES background who repeated a grade had an average math score of $387$, whereas those who didn't had an average math score of $456$:
```{r, echo = FALSE}
cols <- c("black", "black", "black", "black", "grey", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
These two nodes at the bottom are called **leaf nodes** because they are like the 'leaves' of the tree. **Leaf nodes** are of particular importance because they are the ones that dictate what the final value of `math_score` will be. Any new data predicted with this model will always get an average `math_score` of $456$ for those of low SES background who didn't repeat a grade:
```{r, echo = FALSE}
cols <- c("black", "black", "grey", "black", "grey", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
Similarly, any respondent from high SES background, with a highly educated father who didn't repeat a grade, will get assigned a `math_score` of $527$:
```{r, echo = FALSE}
cols <- c("black", "grey", "grey", "grey", "black", "grey", "grey", "grey", "black", "grey", "black")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
That is it. That is a decision tree in its simplest form. It contains a **root node** and several **internal** and **leaf nodes**, and it can be interpreted just as we did above. The right branch of the tree can be summarized with the same interpretation. For example, for high SES respondents, father's education (`FISCED`) is more important than `REPEAT` for separating math scores:
```{r, echo = FALSE}
cols <- c("black", "grey", "grey", "grey", "black", "black", "grey", "grey", "black", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
This is the case because it comes first in the tree. Substantively, this might be due to the fact that there is higher variation in education credentials for parents of high SES background than for those of low SES background. We can see that for those with the highest father's education (`FISCED` above $5.5$), the average math score is $524$, whereas those with father's education below $5.5$ have an average math score of $478$.
Did you notice that we haven't interpreted any coefficients? That's right. Decision trees have no coefficients, and many other machine learning algorithms don't produce coefficients either. Although in the case of decision trees this is because the model conveys information in another way (through the visualization of trees), the lack of coefficients is common in machine learning models because they are too complex to produce coefficients for single predictors. These models are non-linear and non-parametric in nature, producing very complex relationships that are difficult to summarize as coefficients. Instead, they produce predictions. We'll delve into this topic in detail in future sections.
These examples show that decision trees are a great tool for exploratory analysis and I strongly believe they have an immense potential for exploring interactions in social science research. In case you didn't notice it, we literally just interpreted an interaction term that social scientists would routinely use in linear models. Without having to worry about statistical significance or plotting marginal effects, social scientists can use decision trees as an exploratory medium to understand interactions in an intuitive way.
You might be asking yourself, how do we fit these models and visualize them? `tidyflow` and `tidymodels` have got you covered. For example, for fitting the model from above, we can begin our `tidyflow`, add a split, a formula and define the decision tree:
```{r}
# Define the decision tree and tell it the the dependent
# variable is continuous ('mode' = 'regression')
mod1 <- set_engine(decision_tree(mode = "regression"), "rpart")
tflow <-
# Plug the data
pisa %>%
# Begin the tidyflow
tidyflow(seed = 23151) %>%
# Separate the data into training/testing
plug_split(initial_split) %>%
# Plug the formula
plug_formula(math_score ~ FISCED + HISEI + REPEAT) %>%
# Plug the model
plug_model(mod1)
vanilla_fit <- fit(tflow)
tree <- pull_tflow_fit(vanilla_fit)$fit
rpart.plot(tree)
```
All `plug_*` functions serve to build your machine learning workflow and `decision_tree` serves to define the decision tree and all of its arguments. `rpart.plot`, on the other hand, is a function used specifically for plotting decision trees (that is why we loaded the package `rpart.plot` at the beginning). No need to delve much into this function; it just works if you pass it a decision tree model. That is why we `pull` the model fit before calling it.
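To make the idea of **leaf nodes** concrete, here is a minimal sketch (not run here) of how predictions work for new data. The student's values below are made up for illustration; the prediction is simply the average `math_score` of the leaf node this hypothetical student falls into.
```{r, eval = FALSE}
# A hypothetical new student: low SES, intermediate father's education and
# no grade repetition (illustrative values only)
new_student <- data.frame(HISEI = 40, FISCED = 3, REPEAT = 0)
# The fitted tidyflow assigns the average `math_score` of the leaf node
# this student ends up in
predict(vanilla_fit, new_data = new_student)
```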
I've told you all the good things about decision trees, but they have important disadvantages; we'll discuss two of them in this chapter. The first one is that decision trees tend to overfit a lot. Just for the sake of exemplifying this, let's switch to another example. Let's say we're trying to understand which variables are related to whether teachers set goals in the classroom. Substantively, this example might not make a lot of sense, but let's follow along just to show how much trees can overfit the data. This variable is named `ST102Q01TA`. Let's plug it into our `tidyflow` and visualize the tree:
```{r}
# We can recycle the entire `tflow` from above and just
# replace the formula:
tflow <-
tflow %>%
replace_formula(ST102Q01TA ~ .)
fit_complex <- fit(tflow)
tree <- pull_tflow_fit(fit_complex)$fit
rpart.plot(tree)
```
The tree is quite big compared to our previous example, which makes the interpretation more difficult. Equally important, some **leaf nodes** are very small. Decision trees can capture a lot of noise and mimic the data very closely. Six **leaf nodes** have less than $3\%$ of the sample. These are **leaf nodes** with very weak statistical power:
```{r, echo = FALSE}
cols <- c(rep("grey", 10), "black", "grey", "grey", "grey", "grey", "grey", "grey", "black", "black", "grey", "grey", "black", "black", "grey", "grey", "grey", "black")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
What would happen if a tiny $1\%$ of those **leaf nodes** responded **slightly** differently? It is possible we would get a completely different tree. Decision trees are not well known for being robust. In fact, this is one of their main weaknesses. However, decision trees have an argument called `min_n` that forces the tree to discard any **node** with a number of observations below your specified minimum. Let's run the model above and set the minimum number of observations per **node** to $200$:
```{r}
dectree <- update(mod1, min_n = 200)
tflow <-
tflow %>%
replace_model(dectree)
fit_complex <- fit(tflow)
tree <- pull_tflow_fit(fit_complex)$fit
rpart.plot(tree)
```
The tree is now considerably smaller. There are fewer **leaf nodes** and all nodes have a greater sample size than before.
You might be wondering: what should the minimum sample size be? There is no easy answer. The rule of thumb should be relative to your data and research question. In particular, the identification of small nodes should be analyzed with care. Perhaps there **is** a group of outliers that constitutes a node and it's not a problem of statistical noise. By increasing the minimum sample size for each node you would be destroying that statistical finding.
For example, suppose we are studying welfare social expenditure as the dependent variable and among the independent variables we have country names. Scandinavian countries (Denmark, Norway, Sweden and Finland) might group pretty well into a single node because they are superpowers in welfare spending. If we increased the minimum sample size to $10$, we might group them with Germany and France, which are completely different in substantive terms. The best rule of thumb I can recommend is simply to study your problem at hand with great care and make decisions accordingly. It might make sense to increase the minimum sample size or it might not, depending on the research question, the sample size and whether you're exploring the data or predicting on new data.
Despite `min_n` helping to make the tree more robust, there are still several nodes with low sample sizes. Another way to approach this problem is through the depth of the tree. As can be seen from the previous plot, decision trees can create **leaf nodes** which are very small. In other more complicated scenarios, your tree might get huge. Yes, huge:
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("./img/large_tree.png")
```
More often than not, these huge trees are just overfitting the data. They are creating very small nodes that capture noise from the data, and when you're predicting on new data, they perform terribly. In addition to the `min_n` argument, decision trees have another argument called `tree_depth`. This argument forces the tree to stop growing if it passes the maximum depth of the tree as measured in nodes. Let's run our previous example with a maximum depth of three:
```{r }
dectree <- update(mod1, min_n = 200, tree_depth = 3)
tflow <-
tflow %>%
replace_model(dectree)
fit_complex <- fit(tflow)
tree <- pull_tflow_fit(fit_complex)$fit
rpart.plot(tree)
```
In combination with the minimum number of respondents within each node, the tree was reduced considerably. In fact, there is only one node with a sample size lower than $3\%$. The `min_n` and `tree_depth` arguments can help you reduce the overfitting of your tree, but don't think these are easy fixes. Decision trees simply overfit the data too easily and, as we'll see, there are more advanced tree-based methods that can help fix this.
Note that we've been interpreting decision trees in a 'subjective' fashion. That is, we've been trimming the nodes of the trees based on subjective criteria that make sense for our research problem. This is how we social scientists would analyze the data. The tree should model our theoretical problem and make substantive sense. However, for machine learning we have another criterion: how well the model predicts. Let's check how our model predicts at this point:
```{r }
fit_complex %>%
predict_training() %>%
rmse(ST102Q01TA, .pred)
```
Our predictions of the teachers-set-goals variable are off by around $.5$ on a scale of $1$ through $4$. This is not terribly bad. For example, it means that for every child that answered a $2$, on average, we have an error of around $.5$. Any prediction of a single number thus runs the risk of wrongly predicting the number above or below it (a child with a $2$ might get a wrong prediction of $3$ or $1$, but hardly a $4$). To improve prediction, we can allow `tidyflow` to search for the combination of `min_n` and `tree_depth` that maximizes predictive performance. Let's perform a grid search for these two tuning values, setting the range of tuning values ourselves:
```{r}
tune_mod <- update(dectree, min_n = tune(), tree_depth = tune())
tflow <-
tflow %>%
plug_resample(vfold_cv, v = 5) %>%
plug_grid(
expand.grid,
tree_depth = c(1, 3, 9),
min_n = c(50, 100)
) %>%
replace_model(tune_mod)
fit_tuned <- fit(tflow)
fit_tuned %>%
pull_tflow_fit_tuning() %>%
show_best(metric = "rmse")
```
It seems that our predictions on the training data were slightly overfitting the data, as the best error from the cross-validation search is centered around `0.459` with a standard error of `0.01`. Let's explore whether the error changes with the minimum sample size and the tree depth:
```{r}
tree_depth_lvl <- paste0("Tree depth: ", c(1, 3, 9))
fit_tuned %>%
pull_tflow_fit_tuning() %>%
collect_metrics() %>%
mutate(ci_low = mean - (1.96 * std_err),
ci_high = mean + (1.96 * std_err),
tree_depth = factor(paste0("Tree depth: ", tree_depth), levels = tree_depth_lvl),
min_n = factor(min_n, levels = c("50", "100"))) %>%
filter(.metric == "rmse") %>%
ggplot(aes(min_n, mean)) +
geom_point() +
geom_errorbar(aes(ymin = ci_low, ymax = ci_high), width = .1) +
scale_x_discrete("Minimum sample size per node") +
scale_y_continuous("Average RMSE") +
facet_wrap(~ tree_depth, nrow = 1) +
theme_minimal()
```
On the `x` axis we have the minimum sample size per node (these are the values for `min_n`) and on the `y` axis we have the error of the model from cross-validation (the $RMSE$). The lower each point is on the plot, the better, since it means that the error is lower.
Let's begin with the first panel on the left. The points represent the error of the model with different minimum sample sizes for the nodes and a fixed tree depth of $1$. For a tree depth of $1$, the error of the model is around `.65`. However, as the tree depth increases (the additional *panels* to the right), the error comes down to nearly `.47` with a `tree_depth` of 9. It seems that the simplest model with the lowest $RMSE$ has a `tree_depth` of 9 and a minimum sample size of 50. We figured this out ourselves for this example, but `complete_tflow` can calculate it for you:
```{r}
final_model <-
fit_tuned %>%
complete_tflow(metric = "rmse",
tree_depth,
method = "select_by_one_std_err")
train_err <-
final_model %>%
predict_training() %>%
rmse(ST102Q01TA, .pred)
test_err <-
final_model %>%
predict_testing() %>%
rmse(ST102Q01TA, .pred)
c("testing error" = test_err$.estimate, "training error" = train_err$.estimate)
```
Our testing error and our training error differ by only $0.01$, not bad. The cross-validation tuning seems to have helped avoid a great deal of overfitting.
Before we go to the next section, I want to briefly mention an alternative to `tree_depth` and `min_n`. A technique called 'tree pruning' is also very common when modeling decision trees. It first grows a very large and complex tree and **then** starts pruning back the leaves. This technique is also very useful but, due to lack of time, we won't cover it in the course. You can check out material on this technique in the resources outlined in the first paragraph of this section.
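For the curious, here is a rough sketch of where pruning fits in with the tools we've been using: `decision_tree()` exposes a `cost_complexity` penalty, which `rpart` uses to prune the tree after growing it. The value `0.01` below is just an illustrative placeholder; like `min_n` and `tree_depth`, it can also be tuned.
```{r, eval = FALSE}
# A pruned decision tree: larger `cost_complexity` values prune more
# aggressively. The value used here is only illustrative.
pruned_mod <-
  decision_tree(mode = "regression", cost_complexity = 0.01) %>%
  set_engine("rpart")
```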
### Advanced: how do trees choose where to split? {#advancedsplit}
Throughout most of the chapter we've seen that trees find optimal 'splits' that make the respondents very different between the splits and very similar within them. But how do decision trees make these splits? Let's work out a simple example using the `HISEI` variable from the first model in this section.
```{r, echo = FALSE}
pisa_ex <- pisa %>% select(math_score, HISEI, FISCED)
calculate_random_split <- function(dt, var1) {
var1 <- enquo(var1)
random_split <-
dt %>%
distinct({{var1}}) %>%
filter({{var1}} != min({{var1}}), {{var1}} != max({{var1}})) %>%
sample_n(1) %>%
pull({{var1}})
res1 <-
dt %>%
mutate(random_split = random_split,
dummy = if_else({{var1}} >= random_split, 1, 0)) %>%
group_by(random_split, dummy) %>%
summarise(avg = mean(math_score),
rss = sum((math_score - avg)^2),
.groups = "drop")
res1 %>%
mutate(x = c((unique(random_split) + min(dt[[quo_name(var1)]])) / 2, (unique(random_split) + max(dt[[quo_name(var1)]])) / 2),
y = 350,
y2 = 310,
label = paste0("Average math \n score of ", round(avg, 0)),
label2 = "Calculate RSS",
total_rss = paste0("Total RSS: ", round(sum(rss), 0)))
}
set.seed(23151)
line1 <- calculate_random_split(pisa_ex, HISEI)
set.seed(23153)
linen <- lapply(1:6, function(x) {
res <- calculate_random_split(pisa_ex, HISEI)
res$iter <- paste0("Random split: ", x)
res
})
res_linen <- do.call(rbind, linen)
res_linen$iter <- factor(res_linen$iter, levels = paste0("Random split: ", 1:6))
p1 <-
pisa_ex %>%
ggplot(aes(HISEI)) +
geom_histogram(bins = 100, alpha = 1/3) +
scale_x_continuous("Socio-economic status index (higher -> richer)") +
theme_minimal()
p2 <-
p1 +
geom_vline(xintercept = line1$random_split, color = "red") +
geom_text(data = line1, aes(x = x, y = y, label = label))
p3 <-
p2 +
geom_text(data = line1, aes(x = x, y = y2, label = label2))
p4 <-
p3 +
annotate("text", x = 52, y = 200, label = paste0("Total RSS: ", round(sum(line1$rss), 0)))
p1_many <-
pisa_ex %>%
ggplot(aes(HISEI)) +
geom_histogram(bins = 30, alpha = 1/4) +
scale_x_continuous("Socio-economic status index (higher -> richer)") +
theme_minimal()
p_many <-
p1_many +
geom_vline(data = res_linen, aes(xintercept = random_split), color = "red") +
geom_text(data = res_linen, x = 52, y = 200, aes(label = total_rss), size = 2.5) +
facet_wrap(~ iter, nrow = 2) +
theme(aspect.ratio = 0.7, panel.border = element_rect(color = "black", fill = NA))
```
`HISEI` is an index for the socio-economic status of families. It's continuous and has a distribution like this:
```{r, echo = FALSE}
p1
```
As we saw in the first tree of this section, `HISEI` is the **root node**. To decide on the **root node**, the decision tree algorithm chooses a random location in the distribution of `HISEI` and draws a split:
```{r, echo = FALSE}
p2
```
The two sides of the split each have an average `math_score`, which serves as the baseline for how different these two groups are. At this point, the algorithm does something very simple: for each side of the split, it calculates the **R**esidual **S**um of **S**quares (RSS). This is just the sum of the squared differences between the `math_score` of each respondent ($math_i$) and the average `math_score` of that side ($\hat{math}$). In other words, it applies the $RSS$ to each side of the split:
\begin{equation}
RSS = \sum_{i = 1}^n(math_i - \hat{math})^2
\end{equation}
Each side of the split then has a corresponding $RSS$:
```{r, echo = FALSE}
p3
```
After that, it calculates the total $RSS$ of the split by adding up the two $RSS$ values:
```{r, echo = FALSE}
p4
```
So far we have a single random split with an associated $RSS$ for $HISEI$. The decision tree algorithm is called *recursive binary splitting* because it is recursive: it repeats itself many times. It repeats the strategy of $Split$ -> $RSS_{split}$ -> $RSS_{total}$ many times such that we get a distribution of splits and their $RSS$ for $HISEI$:
```{r, echo = FALSE}
p_many
```
This produces a distribution of random splits with an associated metric of fit ($RSS$) for $HISEI$. *Recursive binary splitting* applies this same logic to every single variable in the model such that you have a distribution of splits for every single variable:
```{r, echo = FALSE}
line1$iter <- 0
res <- rbind(line1, res_linen)
res$variable <- "HISEI"
res_hisei <-
res %>%
select(variable, random_split, total_rss) %>%
distinct()
```
```{r, echo = FALSE}
set.seed(23159)
linen_pred2 <- lapply(1:3, function(x) {
res <- calculate_random_split(pisa_ex, FISCED)
res$iter <- x
res
})
res_fisced <- do.call(rbind, linen_pred2)
res_fisced <-
res_fisced %>%
mutate(variable = "FISCED") %>%
select(variable, random_split, total_rss) %>%
distinct()
dummy_df <- data.frame(variable = '', random_split = "...", total_rss = '')
rbind(res_hisei[1:3, ], dummy_df, res_fisced, dummy_df)
```
With such a distribution, the algorithm can objectively ask: which random split best separates the data into two branches with the lowest total $RSS$? Based on that answer, the first **node** is chosen. After this first node is chosen, two branches grow to both sides. The algorithm then applies exactly the same set of rules *recursively* to each branch until a maximum depth is reached.
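To make this concrete, here is a minimal sketch of the split-selection logic for a single variable. It is not how `rpart` is implemented internally (the candidate cutoffs below are just deciles of `HISEI`, chosen for illustration), but it shows the $RSS$ bookkeeping described above.
```{r, eval = FALSE}
# Candidate cutoffs for HISEI (deciles, for illustration only)
candidate_splits <- quantile(pisa$HISEI, probs = seq(0.1, 0.9, 0.1), na.rm = TRUE)

# Total RSS of each candidate split: RSS of the left branch + RSS of the right
total_rss <- sapply(candidate_splits, function(cutoff) {
  left <- pisa$math_score[pisa$HISEI < cutoff]
  right <- pisa$math_score[pisa$HISEI >= cutoff]
  sum((left - mean(left, na.rm = TRUE))^2, na.rm = TRUE) +
    sum((right - mean(right, na.rm = TRUE))^2, na.rm = TRUE)
})

# The cutoff with the lowest total RSS would be picked as the split
candidate_splits[which.min(total_rss)]
```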
Although this process will in nearly all cases be invisible to you, this intuition can help you better understand which criterion is used for choosing a split. For example, understanding how the splitting is done gives you insight into how outliers do not drive the selection of splits, because the candidate splits are random and cover the entire distribution.
In addition, there might be cases where you want to swap the $RSS$ for another loss function because it makes more sense for your problem. For example, using decision trees with binary dependent variables merits another type of loss function: *Gini impurity*. We won't delve into this, but it serves as an example that these things are not fixed. These are decisions that depend on your research problem and it might make sense to experiment with them if needed.
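As a small, self-contained illustration of that alternative loss (not something we use later in the chapter), the Gini impurity of a binary branch can be written in a few lines: the purer the branch, the lower the value.
```{r, eval = FALSE}
# Gini impurity for a binary outcome within one branch:
# 1 - p^2 - (1 - p)^2, which simplifies to 2 * p * (1 - p)
gini_impurity <- function(y) {
  p <- mean(y)
  2 * p * (1 - p)
}

# A branch where 80% of respondents repeated a grade is 'purer' (lower Gini)
# than a branch split 50/50
gini_impurity(c(1, 1, 1, 1, 0)) # 0.32
gini_impurity(c(1, 1, 0, 0))    # 0.5
```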
## Bagging
The problem with decision trees is that, even if you work really hard to avoid overfitting, they can be very susceptible to the exact composition of the data. In some extreme cases, you might even get completely different trees every time you run your model. Quite literally, running the same model might produce very different trees if some part of the sample changes. The small simulation below predicts `math_score` with all variables in the `pisa` data set but, in each iteration, takes a different random sample from the total dataset:
```{r manydtrees, echo = FALSE, warning = FALSE, message = FALSE, out.width = "99%", fig.cap = "Visualization of many trees from the sample with varying compositions"}
mod1 <- decision_tree(mode = "regression") %>% set_engine("rpart")
tflow <-
pisa %>%
tidyflow(seed = 23151) %>%
plug_split(initial_split) %>%
plug_formula(math_score ~ . - scie_score - read_score) %>%
plug_model(mod1)
all_mods <-
lapply(c(.15, .55, .3, .75, .65, .5), function(size) {
set.seed(23511)
pisa_sim <- pisa %>% sample_frac(size, replace = TRUE)
res <- tflow %>% replace_data(pisa_sim) %>% fit()
pull_tflow_fit(res)$fit
})
par(mfrow=c(2,3))
for (i in all_mods) rpart.plot(i)
```
These drastic differences between iterations arise because decision trees have a lot of variance and very little bias. They learn the current data very well (little bias) but if you generalize them to new data, they can perform very badly (a lot of variance). This is where bagging, or **B**ootstrap **Agg**regation, comes in.
Before we explain what bagging is all about, let's spend a minute explaining what bootstrapping is. Let's work out a manual example and limit our `pisa` dataset to only five rows, keep a few selected columns and add a unique id for each row:
```{r}
sel_cols <- c("math_score", "HISEI", "REPEAT", "IMMIG", "read_score")
pisa_small <- pisa[1:5, sel_cols]
pisa_small$id <- 1:5
pisa_small
```
This is the same `pisa` dataset but with only the first five rows, an id column for each respondent and a few selected columns. Bootstrapping is a statistical technique that randomly picks observations from the sample with replacement. This means that some observations might get picked while others might not. In fact, some observations might even get picked many times! We can do this manually in R:
```{r}
# Sample from the number of rows in `pisa_small`
# and allow certain numbers to be replaced.
set.seed(23551)
row_index <- sample(nrow(pisa_small), replace = TRUE)
pisa_small[row_index, ]
```
We randomly sampled from the data and got respondent number four twice. We can run this many times and get many **resamples** of our data:
```{r}
lapply(1:2, function(x) {
row_index <- sample(nrow(pisa_small), replace = TRUE)
pisa_small[row_index, ]
})
```
Since the rows are chosen at random, in some instances we might obtain the same row ten times, in others only once and in others not at all! This is what bootstrapping is all about. If we run $10$ bootstraps, it just means we have $10$ datasets where, in each one, some rows are repeated many times and others are randomly left out.
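As a side note, the `rsample` package (loaded with `tidymodels`) automates exactly this idea; a quick sketch of what that might look like for our small dataset:
```{r, eval = FALSE}
# Ten bootstrap resamples of the small dataset; each one picks rows at
# random with replacement
boot_sets <- bootstraps(pisa_small, times = 10)
boot_sets

# Each resample can be materialized as a data frame with `analysis()`
analysis(boot_sets$splits[[1]])
```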
Bootstrapping is mainly used to calculate statistics such as standard errors and standard deviations because it has very nice properties for estimating uncertainty in situations where it's impossible to calculate it analytically. However, it also has advantages for reducing the variance of models such as decision trees.
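For instance, a minimal sketch of the standard-error use just mentioned: the standard deviation of a statistic across many bootstrap resamples is an estimate of its standard error.
```{r, eval = FALSE}
# Bootstrap the mean math score 1000 times
boot_means <- sapply(1:1000, function(i) {
  mean(sample(pisa$math_score, replace = TRUE), na.rm = TRUE)
})

# The standard deviation of the bootstrapped means approximates the
# standard error of the mean
sd(boot_means)
```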
Let's get back to how bagging works. Bagging works by bootstrapping your data $N$ times and fitting $N$ decision trees. Each of these decision trees has a lot of variance because we allow the trees to overfit the data. The trick with bagging is that we **average** over the predictions of all $N$ decision trees, compensating for the high variability of each single decision tree.
In the same spirit as before, let's work out a manual example just so you can truly grasp the intuition. Don't worry, though: there are functions inside `tidymodels` and `tidyflow` that will perform all of this for you.
Let's adapt the code from above to use the original `pisa` data, sample only 60\% of the data in each bootstrap and generate 20 copies of our data with random picks of rows in each iteration:
```{r}
pisa$id <- 1:nrow(pisa)
bootstrap_pisa <-
lapply(1:20, function(x) {
# Pick 60% of the row indices at random, with replacement
row_index <- sample(nrow(pisa), size = floor(nrow(pisa) * .6), replace = TRUE)
pisa[row_index, ]
})
```
The result is named `bootstrap_pisa` and is a list of 20 data frames. You can inspect the first two with `bootstrap_pisa[[1]]` and `bootstrap_pisa[[2]]`. Each element is a data frame with 60\% of the original number of rows of the `pisa` data where each row was randomly picked. Some rows might be repeated many times, others might appear just once and others might not be there at all.
Let's now loop over these 20 datasets, fit a decision tree to each one and predict on the original `pisa` data. The result of this loop should be 20 data frames each with a prediction for every respondent:
```{r }
tflow <-
tidyflow() %>%
plug_formula(math_score ~ .) %>%
plug_model(decision_tree(mode = "regression") %>% set_engine("rpart"))
all_pred_models <-
lapply(bootstrap_pisa, function(x) {
small_model <-
tflow %>%
plug_data(x) %>%
fit()
cbind(
pisa["id"],
predict(small_model, new_data = pisa)
)
})
```
The first slot contains predictions for all respondents. Let's confirm that:
```{r}
head(all_pred_models[[1]])
```
Here we only show the first few rows, but you can check that it matches the number of rows of the original `pisa` with `nrow(all_pred_models[[1]])` and `nrow(pisa)`. Let's confirm the same thing for the second slot:
```{r}
head(all_pred_models[[2]])
```
The second slot also contains predictions for all respondents, but they are different from the first slot because they are based on a different random sample. This same logic is repeated 20 times such that every respondent has 20 predictions. The trick behind bagging is that it **averages** each respondent's predictions over the 20 bootstraps.
This averaging has two effects. On the one hand, it allows each single tree to grow as much as possible, giving it little bias but a lot of variance. On the other, bagging compensates for this high variance by averaging the predictions of all the single trees:
```{r}
# Combine all the 20 predictions into one data frame
all_combined <- all_pred_models[[1]]
for (i in seq_along(all_pred_models)[-1]) {
all_combined <- cbind(all_combined, all_pred_models[[i]][-1])
}
# Average over the 20 predictions
res <- data.frame(id = all_combined[1], final_pred = rowMeans(all_combined[-1]))
# Final prediction for each respondent
head(res)
```
We get a final prediction for each respondent. If we wanted to, we could calculate the standard deviation of these 20 predictions for each respondent and generate uncertainty intervals around each respondent's prediction. More often than not, this is a good idea.
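A rough sketch of that idea, reusing the `all_combined` data frame from the previous chunk (the normal-approximation interval below is only one of several reasonable choices):
```{r, eval = FALSE}
# Matrix of the 20 bootstrap predictions (dropping the id column)
pred_matrix <- as.matrix(all_combined[-1])

res_uncertainty <-
  data.frame(
    id = all_combined$id,
    final_pred = rowMeans(pred_matrix),
    pred_sd = apply(pred_matrix, 1, sd)
  ) %>%
  # A simple interval around each bagged prediction
  mutate(ci_low = final_pred - 1.96 * pred_sd,
         ci_high = final_pred + 1.96 * pred_sd)

head(res_uncertainty)
```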
In the previous example we used 20 bootstraps for the sake of simplicity but, generally speaking, the more trees we fit, the less variance we have in the final prediction and thus the stronger the model. We can see the power of combining many trees more clearly in the simulation below:
```{r, eval = FALSE, echo = FALSE}
library(furrr)
plan(multiprocess)
res <- future_map(1:100, function(x) {
mod1 <-
set_engine(
bag_tree(mode = "regression"),
"rpart",
times = x
)
tflow <-
pisa %>%
tidyflow(seed = 23151) %>%
plug_split(initial_split) %>%
plug_formula(math_score ~ .) %>%
plug_model(mod1) %>%
fit()
tflow %>%
predict_training() %>%
rmse(math_score, .pred) %>%
mutate(iter = x)
}, .progress = TRUE)
res_final <- bind_rows(res)
p1 <-
res_final %>%
ggplot(aes(iter, .estimate)) +
geom_line(group = 1) +
scale_y_continuous(name = "Average RMSE") +
scale_x_continuous(name = "Number of bagged trees") +
theme_minimal()
ggsave("./img/bagging_sim.png", p1)
```
```{r, echo = FALSE, out.width = "80%", fig.align = 'center'}
knitr::include_graphics("./img/bagging_sim.png")
```
The `x` axis shows the number of bootstraps (or fitted decision trees, which is the same thing) and the `y` axis shows the average $RMSE$ in `math_score` for each of these bagged trees. As we increase the number of decision trees, there is a substantial reduction in the error rate of `math_score`. This is an impressive improvement relative to our initial single decision tree.
Having seen the power of increasing the number of trees, how many trees should your model use? For models which exhibit reasonable levels of variability (like our `math_score` example), $100$-$200$ bootstraps is often enough to stabilize the error in the predictions. However, very unstable models might require up to $500$ bootstraps.
Let's fit the same model we just implemented manually above using `tidymodels` and `tidyflow`. Bagged trees can be implemented with the function `bag_tree` from the package `baguette`. With this package we can control the number of bootstraps with the argument `times`. We can define our model as usual using `tidyflow`:
```{r}
btree <- bag_tree(mode = "regression") %>% set_engine("rpart", times = 50)
tflow <-
pisa %>%
tidyflow(seed = 566521) %>%
plug_split(initial_split) %>%
plug_formula(math_score ~ .) %>%
plug_model(btree)
tflow
```
You might be asking yourself: why don't we define `bootstraps` inside `plug_resample`? After all, `bootstraps` **is** a resampling technique. We could do that but it doesn't make sense in this context. `plug_resample` is aimed more towards doing a grid search of tuning values together with `plug_grid`. Since `bag_tree` is not performing any type of grid search but rather fitting a model many times and averaging predictions, it incorporates the bootstrapping procedure internally. If instead we were doing a grid search of, let's say, `min_n` and `tree_depth` for `bag_tree`, using `plug_resample` with `bootstraps` would be perfectly reasonable.
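For reference, a sketch of what that alternative setup might look like (the tuning values below are only illustrative and we don't run this here):
```{r, eval = FALSE}
# Bagged tree with a tuned `min_n`, evaluated over bootstrap resamples
btree_tune <-
  bag_tree(mode = "regression", min_n = tune()) %>%
  set_engine("rpart", times = 50)

tflow_tune <-
  pisa %>%
  tidyflow(seed = 566521) %>%
  plug_split(initial_split) %>%
  plug_resample(bootstraps, times = 10) %>%
  plug_grid(expand.grid, min_n = c(50, 100)) %>%
  plug_formula(math_score ~ .) %>%
  plug_model(btree_tune)
```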
Let's fit both a simple decision tree and the bagged decision tree, predict on the training set and record the average $RMSE$ for both:
```{r }
res_btree <- tflow %>% fit()
res_dtree <- tflow %>% replace_model(decision_tree() %>% set_engine("rpart")) %>% fit()
rmse_dtree <-
res_dtree %>%
predict_training() %>%
rmse(math_score, .pred)
rmse_btree <-
res_btree %>%
predict_training() %>%
rmse(math_score, .pred)
c("Decision tree" = rmse_dtree$.estimate, "Bagged decision tree" = rmse_btree$.estimate)
```
The bagged decision tree improves the error rate from $33$ math test points to $11$. That is a $66\%$ reduction in the error rate! An impressive improvement for such a simple extension of decision trees. We might decide to improve upon our model by tweaking `min_n` and `tree_depth`, but you'll be doing that in the exercises. Instead, let's predict on the testing data to check how much our final model is overfitting the data.
```{r }
rmse_dtree_test <-
res_dtree %>%
predict_testing() %>%
rmse(math_score, .pred)
rmse_btree_test <-
res_btree %>%
predict_testing() %>%
rmse(math_score, .pred)
c("Decision tree" = rmse_dtree_test$.estimate, "Bagged decision tree" = rmse_btree_test$.estimate)
```
It looks like our model is better than the decision tree but it was overfitting the training data considerably.
As with all other models, bagging also has limitations. First, although bagged decision trees offer improved predictions over single decision trees, they do so at the expense of interpretability. Unfortunately, there is no equivalent of an 'average' tree that we can visualize. Remember, we have $50$ predictions from $50$ different trees; it is neither possible nor advisable to visualize all of them. Instead, we can look at the average variable importance. Bagging records the 'contribution' of each variable using loss functions. For continuous variables it uses the $RSS$ (which we have described and used throughout this chapter) and for binary variables it uses the Gini index. We can look at the importance of the variables to get a notion of which ones contribute the most to predictions:
```{r}
res_btree %>%
pull_tflow_fit() %>%
.[['fit']] %>%
var_imp()
```
Secondly, bagging might seem like a silver bullet for **any** type of model (you can apply it to any model, such as logistic regression, regularized regression, etc.), but it works well only for models which are very unstable. For example, linear regression and logistic regression are models with very little variance. With a big enough sample size, running a bagged linear regression should return very similar estimates to a single fitted model.
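A quick sketch of that last point (the formula below is just an illustration): bootstrapping a linear regression many times and averaging the coefficients gives nearly the same answer as fitting it once, precisely because linear models have little variance.
```{r, eval = FALSE}
# Fit the linear model once
single_fit <- coef(lm(math_score ~ HISEI + REPEAT, data = pisa))

# 'Bag' the same linear model over 100 bootstrap resamples
boot_coefs <- sapply(1:100, function(i) {
  boot_rows <- sample(nrow(pisa), replace = TRUE)
  coef(lm(math_score ~ HISEI + REPEAT, data = pisa[boot_rows, ]))
})

# The averaged coefficients should be very close to the single fit
cbind(single = single_fit, bagged = rowMeans(boot_coefs))
```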
## Random Forests
In section \@ref(advancedsplit) we worked out a simple example of how decision trees choose where to split. If you remember correctly, decision trees evaluate several cutoff points for all the variables in the data. This is done recursively such that within each split, the same evaluation is performed again for all variables. This strategy is clever but fails spectacularly whenever some variables are very strongly correlated with the outcome of the decision tree.
I did not present the code for the simulation in figure \@ref(fig:manydtrees) but I purposely excluded the variables `scie_score` and `read_score` from the data. That's why they never appear in any of the trees. Why did I do that? Because they are extremely correlated with `math_score` and dominate the entire tree. Here's the same simulation **including** `scie_score` and `read_score`:
```{r, echo = FALSE}
mod1 <- decision_tree(mode = "regression") %>% set_engine("rpart")
tflow <-
pisa %>%
tidyflow(seed = 23151) %>%
plug_split(initial_split) %>%
plug_formula(math_score ~ .) %>%
plug_model(mod1)
all_mods <-
lapply(c(.15, .55, .3, .75, .65, .5), function(size) {
set.seed(23511)
pisa_sim <- pisa %>% sample_frac(size, replace = TRUE)
res <- tflow %>% replace_data(pisa_sim) %>% fit()
pull_tflow_fit(res)$fit
})
par(mfrow=c(2,3))
for (i in all_mods) rpart.plot(i)
```
Regardless of how much the data composition changes between decision trees, `scie_score` and `read_score` are virtually the only two variables present. This problem is known as **correlated trees**. The trees are correlated because the predictions from these different trees will be strongly correlated, due to the same main variables dominating every tree.
When performing bagged decision trees this can be a problem, given that the whole idea behind bagging is to average predictions from very different trees. If we have a variable that is constantly repeated in every single tree, then the predictions will be very similar. Random forests are an extension of bagged decision trees that randomly samples $N$ variables at each split. More specifically, instead of considering all variables in the data when calculating a given split, random forests pick a random sample of $N$ variables to be considered for that split.
This intuition can be much more accessible through a manual example. Let's refer back to the first plot of this chapter:
```{r, echo = FALSE}
mod1 <- set_engine(decision_tree(mode = "regression"), "rpart")
tflow <-
pisa %>%
tidyflow(seed = 23151) %>%
plug_split(initial_split) %>%
plug_formula(math_score ~ MISCED + FISCED + HISEI + REPEAT + IMMIG + DURECEC + BSMJ) %>%
plug_model(mod1)
vanilla_fit <- fit(tflow)
tree <- pull_tflow_fit(vanilla_fit)$fit
cols <- c("black", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
For estimating the split of `HISEI < 56`, decision trees evaluate many random splits in the distribution of `HISEI` and record the total $RSS$ of each split such that you get a distribution of splits:
```{r, echo = FALSE}
p_many
```
This is repeated for **every** variable and then the best split out of all the variables is chosen as the split. Random Forests differ from decision trees in that instead of evaluating cutoff splits for every variable, they evaluate cutoff splits only for a **random sample of $N$ variables**. The number of variables $N$ used for each split can be chosen by the user and often it is chosen through a grid search. For this manual example, let's just assume we'll use half of all the variables in `pisa`.
For the **root node** split, we randomly select half of the variables in `pisa` (`pisa` has `r ncol(pisa)` variables, so we randomly choose `r floor(ncol(pisa) / 2)` of them) and allow the decision tree to work as usual. It will calculate many random cutoff points for these `r floor(ncol(pisa) / 2)` variables and choose the most optimal cutoff point among them. Once the **root node** is determined, two branches are grown to both sides:
```{r, echo = FALSE}
cols <- c("black", "black", "grey", "grey", "black", "grey", "grey", "grey", "grey", "grey", "grey")
rpart.plot(tree, col = cols, branch.col = cols, split.col = cols)
```
Random forests repeat exactly the same logic for **every** split. For example, to determine the best split for the left branch, they randomly sample `r floor(ncol(pisa) / 2)` variables from the total of `r ncol(pisa)` but this time limit the sample to those with `HISEI < 56`. Similarly, for the right branch, they randomly select `r floor(ncol(pisa) / 2)` variables for those with `HISEI > 56` and determine the best cutoff split from this limited set of variables. This is then repeated recursively for every split until the usual tree stopping criteria are reached (the `min_n` per node or the `tree_depth`).
The number of columns to be considered at each split is called `mtry` and there are standard rules of thumb for the number of variables to be randomly picked. The two most common are $\sqrt{Total\text{ }number\text{ }of\text{ }variables}$ and $\frac{Total\text{ }number\text{ }of\text{ }variables}{3}$. For example, using the total number of columns from the `pisa` dataset (`r ncol(pisa)`) and the first rule, it would be $\sqrt{`r ncol(pisa)`}$, which is equal to `r sqrt(ncol(pisa))`, always rounded down to `r floor(sqrt(ncol(pisa)))`. This means that at every split, the random forest will be picking `r floor(sqrt(ncol(pisa)))` variables at random. As you might expect, if you set `mtry` to the total number of columns in the data, you're effectively performing a bagged decision tree rather than a random forest. This would be the equivalent of setting `mtry` to `r ncol(pisa)` for the `pisa` example.
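Just to make those two rules of thumb explicit in code (nothing more than the formulas above):
```{r, eval = FALSE}
p <- ncol(pisa)

# Square-root rule: the usual default for `mtry`
floor(sqrt(p))

# One-third rule: an alternative often used for regression
floor(p / 3)
```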
This approach of sampling random columns at each split was created to **uncorrelate** the distribution of decision trees. But can we be sure that all variables will be used? Well, random sampling is by definition **random**: on average, all variables will be present in some splits, which allows variables that are not as strongly correlated with the outcome to play a role in the tree construction.
Random Forests have gained a lot of fame because they often achieve incredible accuracy with just standard values for the tuning parameters (for example, the default `mtry` of $\sqrt{Total\text{ }number\text{ }of\text{ }variables}$ is often enough, without having to try multiple values in a grid search). Let's try running the same `math_score` example but with a random forest, predict on the training set and calculate the $RMSE$ of the model:
```{r}
# Define the random forest
rf_mod <-
rand_forest(mode = "regression") %>%
set_engine("ranger", importance = "impurity")
# Define the `tidyflow` with the random forest model
# and include all variables (including scie_score and read_score)
tflow <-
pisa %>%
tidyflow(seed = 23151) %>%
plug_formula(math_score ~ .) %>%
plug_split(initial_split) %>%
plug_model(rf_mod)
rf_fitted <- tflow %>% fit()
rf_fitted %>%
predict_training() %>%
rmse(math_score, .pred)
```
It performs slightly worse than the bagged decision trees (the bagged decision tree had an error of `r round(rmse_btree$.estimate, 2)` math test points). Why is that? Let's look at which variables are important in the random forest:
```{r }
rf_fitted %>%
pull_tflow_fit() %>%
.[['fit']] %>%
vip() +
theme_minimal()
```
`scie_score` and `read_score` seem to be the most relevant variables. But not only are they the most relevant ones, they are disproportionately the strongest: they are both about **seven times** more important than the next strongest variable.
When there are **only** a few very strong predictors (let's say, two), then you might have a lot of trees which don't offer good predictions. In the model we ran above, the total number of variables used at each split was `r floor(sqrt(ncol(pisa)))`, meaning that if `scie_score` and `read_score` are the only important variables that predict `math_score`, they might be excluded from a lot of the splits. This means that out of the `500` default trees, you might get trees which are no better at predicting `math_score` than random luck.
Based on this intuition, if we increase the number of variables used at each split, we should see a decrease in the predictive error. Why? Because `scie_score` and `read_score` will have a greater probability of being included at each split. Let's try something close to $1/3$ of the number of variables (this is `150` variables):
```{r}
rf_mod <-
rand_forest(mode = "regression", mtry = 150) %>%
set_engine("ranger")
rf_fitted <- tflow %>% replace_model(rf_mod) %>% fit()
rf_fitted %>%
predict_training() %>%
rmse(math_score, .pred)
```
The predictive error is reduced to be about the same as the one from the bagged decision tree. However, time-wise this model is superior to `bag_tree` because each decision tree uses fewer variables in total.
You might be asking yourself: if bagged decision trees have a lot of correlated trees and the random forest decorrelates the trees, why is it performing just as well and not better? It's not entirely clear. Random forests are considered to be a bit of a 'black box' and they might work well in certain cases and badly in others. However, having the intuition of how random forests work can help us approximate a likely explanation.
If `scie_score` and `read_score` are the most important variables for predicting `math_score` and no other variables have strong correlations with `math_score`, then **excluding** these two variables from a tree might produce trees which just don't predict `math_score` well. The value of a random forest over bagged decision trees is that each individual tree must have some sort of **predictive power** despite excluding the strongest predictors.
We can check whether our intuition is correct by running a bagged decision tree and a random forest without `scie_score` and `read_score`. If we exclude `scie_score` and `read_score`, the remaining variables are only weakly correlated with `math_score` and thus the random forest should produce trees which have richer variability than the bagged trees in predictive performance:
```{r }
rf_mod <-
rand_forest(mode = "regression", mtry = 150) %>%
set_engine("ranger", importance = "impurity")
bt_mod <-
bag_tree(mode = "regression") %>%
set_engine("rpart")
tflow <- tflow %>% replace_formula(math_score ~ . - scie_score - read_score)
rf_res <- tflow %>% replace_model(rf_mod) %>% fit()
bt_res <- tflow %>% replace_model(bt_mod) %>% fit()
bt_rmse <-
bt_res %>%
predict_training() %>%
rmse(math_score, .pred) %>%
pull(.estimate)
rf_rmse <-
rf_res %>%
predict_training() %>%
rmse(math_score, .pred) %>%
pull(.estimate)
c("Bagged decision trees" = bt_rmse, "Random Forest" = rf_rmse)
```
The random forest is performing better than the bagged decision tree, as our reasoning predicted. That said, since we excluded the two most important predictors, the overall error is **higher** than in the previous models. This is expected; we fitted this model just to understand why the random forest was not doing better than the bagged decision tree.
Let's predict on the testing data to check whether any of the two is overfitting more than the other:
```{r }
bt_rmse <-
bt_res %>%
predict_testing() %>%
rmse(math_score, .pred) %>%
pull(.estimate)
rf_rmse <-
rf_res %>%
predict_testing() %>%
rmse(math_score, .pred) %>%
pull(.estimate)
c("Bagged decision trees" = bt_rmse, "Random Forest" = rf_rmse)
```
Both models seem to have been overfitting the data considerably, but they were both overfitting to the same degree: the difference between them is around 4 points in both the training and testing data.
Random forests have several advantages, among which is that they are considerably faster than bagged decision trees because they use only a fraction of the variables at each split. However, as we have just shown, their performance gain is not straightforward when only a few predictors are strongly correlated with the outcome and the rest are not. There has to be some type of gradient in the correlations between the predictors and the outcome for the random forest to show a considerable improvement.
Our model also offers a good example of how tuning the parameters can improve accuracy. If we had stayed with the default `mtry` of $\sqrt{Total\text{ }number\text{ }of\text{ }variables}$, the trees would have consistently included uncorrelated variables in each decision tree, making many of our trees nearly useless for prediction. Instead, whenever we're running random forests, we should try several values of `mtry` and empirically check which number of columns improves the error rate. One important note, which has sometimes aroused discussion in machine learning, is whether increasing the number of trees increases the chance of overfitting. Increasing the number of trees in a random forest does not lead to increased overfitting because the random forest averages the predictions of all the different trees. Single trees can overfit the data, and that's ok, because we take the average over all these hundreds of trees.
Aside from the argument `mtry`, random forests also have other parameters which can be tuned for their optimal value in a grid search; `min_n` and `trees` are the most important ones. Chapter 11 of @boehmke2019 offers good examples of default values you can explore in a random forest and some general intuitions on which parameters have the biggest impact on computational resources, including the time spent running models. The next example can serve as an introductory template if you want to tune some values for your random forest. Beware: searching a 10 x 10 grid of candidate values over 10 cross-validated sets can take a lot of time (even a few hours), since you are effectively running 1000 models:
```{r, eval = FALSE}
rf_mod <-
rand_forest(mode = "regression",
mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger")
tflow <-
pisa %>%
tidyflow(seed = 2151) %>%
plug_split(initial_split) %>%
plug_resample(vfold_cv) %>%
# `grid_random` draws random candidate combinations of the tuning values;
# 100 candidates evaluated over 10 folds amounts to 1000 models
plug_grid(grid_random, size = 100) %>%
plug_formula(math_score ~ .) %>%
plug_model(rf_mod)
res <- tflow %>% fit()
res
```
The problem with random forests is mostly about computing time. Although they often do really well with the default values, a user often has to perform a grid search that takes up a lot of time. This can be very time consuming, and thus it's always recommended to start with simple models and elaborate from there.
## Boosting
So far, the tree-based methods we've seen use decision trees as the baseline model and an ensemble approach to average the predictions of many decision trees. Boosting also uses decision trees as the baseline model but the ensemble strategy is fundamentally different.
The name boosting comes from the notion that we fit a weak decision tree and 'boost' it iteratively. This strategy is fundamentally different from bagging and random forests because, instead of relying on hundreds of **independent** decision trees, boosting works by recursively boosting the result of the same decision tree. How does it do that? Let's work it out manually with our `math_score` example.
Let's fit a very weak decision tree of `math_score` regressed on `scie_score`.
```{r}
dt_tree <-
decision_tree(mode = "regression", tree_depth = 1, min_n = 10) %>%
set_engine("rpart")
pisa_tr <- training(initial_split(pisa))
tflow <-
pisa_tr %>%
tidyflow(seed = 51231) %>%
plug_formula(math_score ~ scie_score) %>%
plug_model(dt_tree)
mod1 <- fit(tflow)
mod1 %>% pull_tflow_fit() %>% .[['fit']] %>% rpart.plot()
```
This decision tree is very weak. It is only allowed to have a tree depth of 1, which probably means that it is underfitting the data. We can check out the $RMSE$ of the model:
```{r }
res_mod1 <-
pisa_tr %>%
cbind(., predict(mod1, new_data = .))
res_mod1 %>%
rmse(math_score, .pred)
```
This is not a good model. The $RMSE$ is much bigger than in our previous models. In fact, if we look at the residuals, we should see a very strong pattern, something that signals that a model is not properly specified:
```{r }
res_mod1 <-
res_mod1 %>%
mutate(.resid = math_score - .pred)
res_mod1 %>%
ggplot(aes(scie_score, .resid)) +
geom_point(alpha = 1/3) +
scale_x_continuous(name = "Science scores") +
scale_y_continuous(name = "Residuals") +
theme_minimal()
```
Boosting works by predicting the residuals of the previous decision tree. In our example, we just fitted our first model and calculated its residuals. Boosting proceeds by fitting a second model where the dependent variable is now the residuals of the first model rather than `math_score`. Let's do that:
```{r }
# Convert `math_score` to be the residuals of model 1
res_mod1 <- mutate(res_mod1, math_score = .resid)
# Replace the new data in our `tflow`
# In the data `res_mod1`, `math_score`
# is now the residuals of the first model
mod2 <-
tflow %>%
replace_data(res_mod1) %>%
fit()
mod2 %>% pull_tflow_fit() %>% .[['fit']] %>% rpart.plot()
```
This second model is exactly the same as the first, with the only difference that the dependent variable is now the residuals of the first model (the test scores left unexplained by the first model). You can actually see a hint of this in the average value of the dependent variable in the two leaf nodes: they are both on the scale of the residuals rather than the scale of `math_score` (the average `math_score` is between 300 and 600 points, whereas the average of each leaf node is very far from this). Let's visualize the residuals from the **second** model:
```{r }
res_mod2 <-
pisa_tr %>%
cbind(., predict(mod2, new_data = .)) %>%
mutate(.resid = math_score - .pred)
res_mod2 %>%
ggplot(aes(scie_score, .resid)) +
geom_point(alpha = 1/3) +
scale_x_continuous(name = "Science scores") +
scale_y_continuous(name = "Residuals") +
theme_minimal()
```
The pattern seems to have changed, although it's not clear that it's closer to a random pattern. In the small-scale example we just did, we only fitted two models (or, in machine learning jargon, **two trees**). If we fitted 20 trees, each using the previous model's residuals as the dependent variable, we would start to see the residuals converging towards randomness:
```{r, echo = FALSE}
library(tidyr)
library(patchwork)
dt_tree <-
decision_tree(mode = "regression", tree_depth = 1, min_n = 10) %>%
set_engine("rpart")
pisa_tr <- training(initial_split(pisa))
tflow <-
pisa_tr %>%
tidyflow(51231) %>%
plug_formula(math_score ~ scie_score) %>%
plug_model(dt_tree)
n <- 20
res <- vector("list", n)
df_resid <- data.frame(resid_1 = rep(0, nrow(pisa_tr)))
res[[1]] <- fit(tflow)
for (i in 1:(n-1)) {
df_pred <- predict(res[[i]], new_data = pull_tflow_rawdata(res[[i]]))[[1]]
pred_data <-