---
title: 'tidyflow: a prototype for statistical workflows'
author: Jorge Cimentada
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---
```{r knitr-setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.path = "../figs/",
                      fig.align = "center",
                      fig.asp = 0.618,
                      out.width = "80%",
                      eval = FALSE)
```
## Introduction
The idea of a statistical workflow should be simple: it should hold only the minimum steps needed to perform a statistical analysis, namely the data, the formula and the model. In particular, the data should play the main role in the analysis: you should specify it once and let the workflow take care of all the details. For example:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

simple_tflow %>%
  fit()
```
The steps are declarative: start with the data and a workflow, specify the formula, then specify the model. `fit` takes care of running the steps. In more convoluted examples, iteration is a necessary part of the workflow (create a new column, refit the data, try a new model, refit the data, etc...). Statistical iteration is somewhat at odds with some `tidyverse` packages where the order of operations is not important (`mutate` can be used at many different points of an analysis, as can `pivot_longer`, etc...). For statistical analysis, the order of operations is paramount: you simply cannot ask for a column to be created after a model has been run. For example:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

simple_tflow %>%
  fit() %>%
  # complement_recipe will pass the already existing recipe to
  # the main argument of the function
  complement_recipe(~ .x %>% step_mutate(cyl = log10(cyl)))
```
The operation over `cyl` is specified after `fit`ting the model, so it simply won't take effect before the `fit`. This tension suggests that there are basic operations for statistical modelling that **need** to happen before fitting a model. In particular, there are some heuristics that force analysts to perform steps in a particular order. For example:
`data` -> `split training` -> `preprocessing` -> `model`
Preprocessing needs to happen before the modelling. You can iterate and alter columns after fitting the model, but you'll need to run the model again. Having said that, if there is a somewhat established order of operations and the workflow knows about it, then the order in which the user declares the operations is not important:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm")) %>%
  # New step
  complement_recipe(~ .x %>% step_mutate(cyl = log10(cyl)))

simple_tflow %>%
  # Split data into training/testing
  plug_split(rsample::initial_split, prop = 0.75) %>%
  fit()
```
Two new steps have been added: creating a new `cyl` column and splitting the data into training and testing. In this case, the order of declaration is unimportant: `cyl` was recreated after setting the model, and the training/testing split was declared after the data, the model specification, etc... I don't think it's a good idea to define a workflow this way, but it illustrates the point that, in theory, the order of the steps doesn't matter as long as `fit` knows the correct order. The correct order in this case is:
`data` -> `data split` -> `formula` -> `data wrangling` -> `model definition` -> `fit`
`tflow` is a placeholder (or plan) of all these steps and reorders them for `fit` to execute. The benefit of the placeholder strategy is that it allows both an ordered set of steps and interactivity. For example, we can fit a model with the initial workflow and add a new `step_*` for newer iterations:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  # Split data into training/testing
  plug_split(rsample::initial_split, prop = 0.75) %>%
  # `plug_recipe` receives your recipe and attaches the
  # data already provided in `.x`
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Fit the workflow
simple_tflow %>%
  fit()

# Refit model with one more step
simple_tflow %>%
  # Add more steps to the recipe. `.x` is the supplied
  # recipe above.
  complement_recipe(~ .x %>% step_mutate(cyl = log10(cyl))) %>%
  fit()
```
This applies to any of the placeholders in `tflow`. For example, we can add a few more steps and also change the model:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  # Split data into training/testing
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Fit the workflow
simple_tflow %>%
  fit()

# Refit model with newer steps and a new model
simple_tflow <-
  simple_tflow %>%
  complement_recipe(~ {
    .x %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  # Replaces the model altogether
  replace_model(set_engine(rand_forest(), "ranger"))

simple_tflow %>%
  fit()
```
`tflow` has a predefined set of placeholders:
* `data`: contains the data and split
* `recipe`: contains the formula and steps
* `modelling`: contains the model
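To make these placeholders concrete, here is a minimal sketch of how the plan could be represented internally. It assumes the `tflow` is little more than a named list of captured steps; nothing here is implemented and the slot layout is only illustrative.

```{r }
# Sketch only: a tflow as a plain named list of quoted steps that
# `fit()` reorders and evaluates at execution time (hypothetical).
tflow_skeleton <- list(
  data     = NULL, # the raw data, specified once
  split    = NULL, # e.g. a quoted call to rsample::initial_split
  resample = NULL, # e.g. a quoted call to rsample::vfold_cv
  recipe   = NULL, # formula plus preprocessing steps
  model    = NULL, # a parsnip model specification
  grid     = NULL, # tuning grid, if any
  metrics  = NULL  # metric set, if any
)

# Under this sketch, `plug_*` fills an empty slot, `replace_*` overwrites a
# filled one, and `complement_recipe` modifies the existing recipe slot.
```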
If you specify a recipe, you can either update it with `complement_recipe` or provide a completely new recipe with `replace_recipe`. If you supply a new dataset, it replaces the old one; if you supply a new model, it replaces the old one. Now, let's think about how this workflow works with more challenging examples.
# Abstracting the three placeholders
### Updating the data
The data should play the main role during the analysis. We risk a great deal by having many parts of our data lying around in our environment: the raw data, the training data, the testing data and the cross-validated set. Having this many pieces around can induce mistakes, such as adding the testing data to the cross-validation by mistake, or specifying the training data for the model but mistakenly passing the testing data to the predict method. `tflow` aims to solve that by specifying the data only once. You can extract all of the pieces from the `tflow`, such as the training/testing sets, but that is opt-in: `tflow` will never touch your testing data and will always pass it correctly to where it needs to go.
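As a sketch of what that opt-in extraction could look like (the `pull_tflow_*` accessor names are an assumption for illustration, not an implemented API):

```{r }
# Hypothetical accessors: the split pieces live inside the tflow and are
# only materialised when explicitly requested.
simple_tflow %>%
  pull_tflow_training()

simple_tflow %>%
  pull_tflow_testing()
```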
A `tflow` should always define the data first, but it also offers the possibility of updating the data. This can be useful if you've defined a complete pipeline with a particular data source but an updated version of your data has just arrived:
```{r }
# Define main workflow with old data
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

simple_tflow %>%
  # Replaces the old data and reruns everything
  replace_data(mtcars2) %>%
  fit()
```
### Updating the preprocessing step
A formula is always tied to your preprocessing. If you specify a new variable in your formula, you don't need to specify a new preprocessing step for it. However, if you specify a new step for a variable, that variable needs to be in the formula. Since `tflow` is a placeholder of steps, adding new terms to the formula is just as easy with `step_add`.
```{r }
# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Run the model
simple_tflow %>%
  fit()

# Update the recipe to include one more term
simple_tflow %>%
  complement_recipe(~ .x %>% step_add(am)) %>%
  fit()

# Update the recipe to include one more term and add a step
simple_tflow %>%
  complement_recipe(~ .x %>% step_add(am) %>% step_dummy(disp)) %>%
  fit()

# Update the recipe to include many terms and add a step
simple_tflow %>%
  complement_recipe(~ {
    .x %>%
      step_add(am, carb, hp, drat) %>%
      step_dummy(disp)
  }) %>%
  fit()
```
If the user's data exploration process is short, they can scroll up to the initial recipe and redefine it there. For more complicated workflows, `step_add` can make it easy to iteratively update the initial specification and add subsequent steps to the workflow^[`recipes` has an equivalent `step_rm` to remove columns from the recipe. However, this has a problem: adding new variables adds them to the formula, but `step_rm` removes the final column from the already prepped recipe. In other words, `step_rm` is trickier because you're not removing something from the formula, you're removing it from the final prepped recipe. If you have `step_*`'s that already use the term you want to remove, those steps won't be removed and will be computed before the actual column is removed.].
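For completeness, removing a term would presumably look like the sketch below; how `step_rm` interacts with `complement_recipe` is an assumption here, subject to the caveat in the footnote.

```{r }
# Hypothetical: drop `gear` from the prepped recipe rather than from the
# formula. Any earlier step that uses `gear` (e.g. step_dummy(gear)) would
# still be computed before the column is removed.
simple_tflow %>%
  complement_recipe(~ .x %>% step_rm(gear)) %>%
  fit()
```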
Since this step is interactive, the user should be able to specify a workflow and extract the prepped recipe in order to explore the data:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Visualize your changes and iterate until happy
simple_tflow %>%
  # `trained` will extract the recipe prepped on the data.
  # If `plug_split` is specified, it will use the training data
  # for everything.
  trained() %>%
  ggplot(aes(cyl)) +
  geom_histogram()
```
Another scenario is one in which the user wants to replace the complete formula. `replace_recipe` will always accept a function with one argument, and that argument will be passed the data.
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Replace the complete recipe rather than only adding new steps
simple_tflow %>%
  replace_recipe(~ {
    # In replace_recipe, `.x` will always be the data;
    # in complement_recipe, `.x` is the already defined recipe.
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_center(cyl) %>%
      step_log(cyl, base = 10) %>%
      step_dummy(gear)
  })
```
All of the above also includes the option of adding a resampling specification. This means that you can add a cross-validation function to the plan. For example:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  # Adds cross-validation
  plug_resample(rsample::vfold_cv) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Run model and look at results
simple_tflow %>%
  fit()

# Didn't like it? Rerun updating with another function
simple_tflow %>%
  # Switch to Monte Carlo cross-validation
  replace_resample(rsample::mc_cv) %>%
  fit()
```
### Updating the modelling workflow
Setting models is somewhat easy, as was shown in the examples above. However, there are more complicated layers such as adding tuning parameters and grids of tuning parameters. Just as a formula is tied to the preprocessing step, tuning parameters are usually tied to a model. This means that we can think of tuning parameters/grids as part of a model. Why have them separate? For example:
```{r}
# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Create the model with some tuning parameters
mod1 <- set_engine(linear_reg(penalty = 0.000001, mixture = 0.00002),
                   engine = "glmnet")

simple_tflow %>%
  replace_model(mod1) %>%
  fit()

# Redefine the model with empty parameters for the tuning grid
mod2 <- set_engine(linear_reg(penalty = tune(), mixture = tune()),
                   engine = "glmnet")

simple_tflow %>%
  replace_model(mod2) %>%
  # Add grid of values for empty parameters
  plug_grid_regular(levels = 10) %>%
  fit()

mod3 <- set_engine(rand_forest(trees = 2000), "ranger")

simple_tflow %>%
  replace_model(mod3) %>%
  # Fill unfilled params with `tune()`, such as min_n and mtry
  plug_all_params(tune()) %>%
  plug_grid_regular(levels = 10) %>%
  fit()
```
I haven't thought through the tuning/model integration very well, but the above should be more flexible, allowing you to create tuning grids with minimums/maximums along the grid space.
There are scenarios in which you'd like to specify tuning parameters in your recipe as well as in your model. By specifying `tune()` in all empty parameters, as long as you specify a `grid_*` function, it should create a grid of values that is fed to `fit`.
```{r }
mod4 <- set_engine(rand_forest(trees = tune(), mtry = 10, min_n = tune()),
                   "ranger")

# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_resample(rsample::bootstraps) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear + disp, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear) %>%
      # NOTE: see the tune() in the recipe?
      step_ns(disp, deg_free = tune())
  }) %>%
  plug_model(mod4)

# This should create a grid of values for `deg_free`, `trees` and `min_n`
simple_tflow %>%
  plug_grid_regular(levels = 3) %>%
  fit()
```
The workflow shown so far takes care of returning all default metrics associated with a particular outcome. That is, if no metric of fit is specified (root mean squared error, R squared, etc...), it returns the defaults when tuning the grid (this is controlled automatically by `tune_grid`). However, this workflow allows you to specify any metric for the model, as it will be saved together with the model.
```{r }
mod5 <- set_engine(rand_forest(trees = tune(), mtry = 10, min_n = tune()),
                   "ranger")

# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_resample(rsample::vfold_cv) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear + disp, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear) %>%
      step_ns(disp, deg_free = tune())
  }) %>%
  plug_model(mod5) %>%
  # Adds the metrics
  plug_metric(metric_set(rmse, rsq))

simple_tflow %>%
  plug_grid_regular(levels = 3) %>%
  fit()
```
Thanks to the fact that `tidymodels` defines pretty much every step of the workflow (recipe, metrics, cross-validation, grids, etc...) with separate classes, each of these classes has a slot in the workflow, and you can explicitly add/update them with the `plug_*` or `replace_*` functions. It's still not clear how the `plug_*` and `replace_*` functions will work, since I want to make them memorable: they should all work the same way and accept similar things. I need to think about this more.
```{r }
mod6 <- set_engine(rand_forest(trees = tune(), mtry = 10, min_n = tune()),
                   "ranger")

# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_resample(rsample::vfold_cv) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear + disp, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear) %>%
      step_ns(disp, deg_free = tune())
  }) %>%
  plug_model(mod6) %>%
  # Adds the metrics
  plug_metric(metric_set(rmse, rsq))

mod7 <- set_engine(linear_reg(penalty = tune(), mixture = tune()), "glmnet")

simple_tflow %>%
  # Replaces old model
  replace_model(mod7) %>%
  plug_grid_regular(levels = 3) %>%
  # Replace previous grid
  replace_grid_regular(levels = 10) %>%
  fit()
```
Finally, the process behind `tflow` should be somewhat transparent, and the user should be able to print the workflow before fitting to see the order of the steps. In particular, there should be a method that, when run on the object, returns a data frame along the lines of:
```{r eval = TRUE, echo = FALSE}
library(tibble)
library(rlang)

tflow <- tribble(
  ~ "step", ~ "execution",
  "data", mtcars,
  "split", quo(initial_split(mtcars)),
  "vfold", quo(vfold_cv(training(initial_split(mtcars)))),
  "recipe", quo(recipe(mpg ~ cyl + gear + disp, data = training(initial_split(mtcars))) %>%
                  step_mutate(cyl = log10(cyl)) %>%
                  step_ns(disp, deg_free = tune())),
  "model", quo(set_engine(rand_forest(trees = tune(), mtry = 10, min_n = tune()), "ranger")),
  "grid", quo(grid_regular(levels = 10))
)

tflow
```
where each step is accompanied by the captured expression to which it refers. In addition, functions to extract specific objects, such as the executed grid, the executed recipe, etc., should be available to check whether the specified steps produced the expected results.
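A sketch of what those extractors could look like (the `pull_tflow_*` names are assumptions for illustration, not an implemented API):

```{r }
# Hypothetical extractors for inspecting individual slots of the plan
simple_tflow %>%
  pull_tflow_recipe()    # the recipe as specified (or prepped)

simple_tflow %>%
  pull_tflow_grid()      # the expanded tuning grid

simple_tflow %>%
  pull_tflow_resample()  # the resampling specification
```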
Now, finally, what does running `fit` return? As for printing, it should print something along the lines of:
```{r}
mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split) %>%
  plug_resample(rsample::vfold_cv) %>%
  plug_recipe(~ recipe(mpg ~ cyl, data = .x) %>% step_log(cyl)) %>%
  plug_model(set_engine(linear_reg(penalty = tune(), mixture = tune()), "glmnet"))
```
```
══ Workflow ════════════════════════════════════════════════════════════════════
Data: 32 rows x 11 columns; 0% missing values
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
Split: initial_split w/ default args
Resample: vfold_cv w/ default args
1 Recipe Step
● step_log()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
```
When `fit` is run, the same printout is returned together with the result of the fitting (coefficients, lambdas, etc.; this captures the exact printout of the engine being run behind the scenes). However, the user will be able to extract four things out of the workflow: (1) the data frame with the metrics of fit for the training and cross-validation sets, (2) the best model fit on the complete training data, (3) the predictions of the best model (fit on the training data) for the test set, and (4) the coefficients (if available; this is passed to `broom`).
In particular, it will have methods to extract the `vfold` tibble and explore metrics, an automatic method for plotting the metrics across cross-validation sets, a method to extract the best tuning parameters and another one to train the best tuning parameters on the whole training data and return the model.
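A rough sketch of that post-fit interface, assuming hypothetical accessor names (none of these functions exist yet):

```{r }
# Sketch only: possible post-fit accessors for a fitted tflow
res <- fit(simple_tflow)

res %>% pull_tflow_metrics()    # (1) metrics for training and resamples
res %>% pull_tflow_best_model() # (2) best model refit on the full training data
res %>% predict_testing()       # (3) predictions of the best model on the test set
res %>% pull_tflow_coefs()      # (4) tidy coefficients, via broom, when available
```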
# Final thoughts
This description is mostly based on a workflow for doing machine learning, simply because the core objective of `tidymodels` is machine learning. However, this workflow should also be useful for other types of statistical analysis. For example, the statistical analysis workflow is very similar to the one described above, just without separating into training/testing, adding cross-validation or performing a grid search. You could in principle do the same for statistical inference, as long as the result is focused not on choosing the best model but on visualizing coefficients:
```{r }
# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Run first model and explore
simple_tflow %>%
  fit()
```
Initiatives such as [Zelig](http://docs.zeligproject.org/index.html) could be adapted to `tflow` such that more statistical models can be used (as well as other standard statistical software such as `brms`, `rstanarm`, etc...).
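For instance, swapping in a Bayesian engine should, in principle, only require changing the model slot; `parsnip` already exposes a `"stan"` engine (backed by `rstanarm`) for `linear_reg()`. A minimal sketch, using the hypothetical `replace_model` from above:

```{r }
# Sketch: the same tflow refit with a Bayesian engine instead of "lm"
simple_tflow %>%
  replace_model(set_engine(linear_reg(), "stan")) %>%
  fit()
```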
All of the above is a prototype idea and none of it has been implemented.
<!-- ## Thoughts -->
<!-- - If no `initial_split` is set, everything is performed on the main data. This is ok, as statistical analysis doesn't usually split into testing/training. However, if the user tries to use the method to test the analysis on the test data, an error should be raised saying that no initial split was performed. This shouldn't be very compromising, as everything is done on the training data and no leak to the test is ever done. The user just needs to remake the above including the initial split. -->
<!-- - If vfold is called and initial_split is not set, a warning should be raised saying: are you sure you want that? -->
<!-- ```{r } -->
<!-- library(AmesHousing) -->
<!-- library(tidymodels) -->
<!-- ames <- make_ames() -->
<!-- ml_tflow <- -->
<!-- ames %>% -->
<!-- workflow() %>% -->
<!-- initial_split(prop = .75) %>% -->
<!-- recipe(Sale_Price ~ Longitude + Latitude + Neighborhood, data = ames) %>% -->
<!-- step_log(Sale_Price, base = 10) %>% -->
<!-- step_other(Neighborhood, threshold = 0.05) %>% -->
<!-- step_dummy(recipes::all_nominal()) %>% -->
<!-- step_scale(Sale_Price) %>% -->
<!-- step_scale(Latitude) -->
<!-- mod1 <- -->
<!-- linear_reg(penalty = tune(), mixture = tune()) %>% -->
<!-- set_engine("glmnet") %>% -->
<!-- grid_regular(levels = 10) -->
<!-- mod2 <- -->
<!-- rand_forest(mtry = tune(), trees = tune()) %>% -->
<!-- set_engine("glmnet") %>% -->
<!-- grid_regular(levels = 10) -->
<!-- ## The combination of empty parameter specification with any grid_* -->
<!-- ## should throw an error. -->
<!-- ml_tflow %>% -->
<!-- plug_model(mod1) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- # Add new recipes -->
<!-- final_res <- -->
<!-- ml_tflow %>% -->
<!-- replace_rcp( -->
<!-- .recipe %>% -->
<!-- step_scale(Sale_Price) %>% -->
<!-- step_center(Sale_Price) -->
<!-- ) %>% -->
<!-- plug_model(mod2) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- ## Add new variable -->
<!-- ml_tflow %>% -->
<!-- replace_frm( -->
<!-- ~ . + x2 -->
<!-- ) %>% -->
<!-- replace_rcp( -->
<!-- .rcp %>% step_dummy(x2) -->
<!-- ) %>% -->
<!-- plug_model(mod2) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- ## Replace y -->
<!-- ml_tflow %>% -->
<!-- replace_frm( -->
<!-- Latitude ~ . -->
<!-- ) %>% -->
<!-- replace_rcp( -->
<!-- .rcp %>% step_log(Latitude) -->
<!-- ) %>% -->
<!-- plug_model(mod2) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- ## Update whole formula -->
<!-- ml_tflow %>% -->
<!-- replace_frm( -->
<!-- Latitude ~ Neighborhood + whatever -->
<!-- ) %>% -->
<!-- replace_rcp( -->
<!-- .rcp %>% -->
<!-- step_log(Latitude) %>% -->
<!-- step_dummy(Neighborhood) %>% -->
<!-- step_BoxCox(Whatever), -->
<!-- # Remove recipe and accept new one -->
<!-- new = TRUE -->
<!-- ) %>% -->
<!-- plug_model(mod2) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- final_res %>% -->
<!-- replace_rcp( -->
<!-- -step_scale, -->
<!-- ) -->
<!-- # ml_tflow shouldn't run anything -- it's just a specification -->
<!-- # of all the different steps. `fit` should run everything -->
<!-- ml_tflow <- fit(ml_tflow) -->
<!-- # Plot results of tuning parameters -->
<!-- ml_tflow %>% -->
<!-- autoplot() -->
<!-- # Automatically extract best parameters and fit to the training data -->
<!-- final_model <- -->
<!-- ml_tflow %>% -->
<!-- fit_best_model(metrics = metric_set(rmse)) -->
<!-- # Predict on the test data using the last model -->
<!-- # Everything is bundled into a workflow object -->
<!-- # and everything can be extracted with separate -->
<!-- # functions with the same verb -->
<!-- final_model %>% -->
<!-- holdout_error() -->
<!-- ``` -->
<!-- ```{r } -->
<!-- library(AmesHousing) -->
<!-- # devtools::install_github("tidymodels/tidymodels") -->
<!-- library(tidymodels) -->
<!-- ames <- make_ames() -->
<!-- ############################# Data Partitioning ############################### -->
<!-- ############################################################################### -->
<!-- ames_split <- rsample::initial_split(ames, prop = .7) -->
<!-- ames_train <- rsample::training(ames_split) -->
<!-- ames_cv <- rsample::vfold_cv(ames_train) -->
<!-- ############################# Preprocessing ################################### -->
<!-- ############################################################################### -->
<!-- mod_rec <- -->
<!-- recipes::recipe(Central_Air ~ Longitude + Latitude + Neighborhood, -->
<!-- data = ames_train) %>% -->
<!-- recipes::step_other(Neighborhood, threshold = 0.05) %>% -->
<!-- ## recipes::step_dummy(all_predictors(), recipes::all_nominal()) -->
<!-- ############################# Model Training/Tuning ########################### -->
<!-- ############################################################################### -->
<!-- ## Define a regularized regression and explicitly leave the tuning parameters -->
<!-- ## empty for later tuning. -->
<!-- lm_mod <- -->
<!-- parsnip::rand_forest(mtry = 2, trees = 2000, min_n = 5) %>% -->
<!-- set_mode("classification") %>% -->
<!-- parsnip::set_engine("ranger") -->
<!-- ## Construct a workflow that combines your recipe and your model -->
<!-- ml_tflow <- -->
<!-- workflows::workflow() %>% -->
<!-- workflows::plug_recipe(mod_rec) %>% -->
<!-- workflows::plug_model(lm_mod) -->
<!-- # Find best tuned model -->
<!-- res <- -->
<!-- ml_tflow %>% -->
<!-- fit(data = ames_train) -->
<!-- ############################# Validation ###################################### -->
<!-- ############################################################################### -->
<!-- # Select best parameters -->
<!-- best_params <- -->
<!-- res %>% -->
<!-- tune::select_best(metric = "rmse", maximize = FALSE) -->
<!-- # Refit using the entire training data -->
<!-- reg_res <- -->
<!-- ml_tflow %>% -->
<!-- tune::finalize_workflow(best_params) %>% -->
<!-- parsnip::fit(data = ames_train) -->
<!-- # Predict on test data -->
<!-- ames_test <- rsample::testing(ames_split) -->
<!-- reg_res %>% -->
<!-- parsnip::predict(new_data = recipes::bake(mod_rec, ames_test)) %>% -->
<!-- bind_cols(ames_test, .) %>% -->
<!-- mutate(Sale_Price = log10(Sale_Price)) %>% -->
<!-- select(Sale_Price, .pred) %>% -->
<!-- rmse(Sale_Price, .pred) -->
<!-- ``` -->