---
title: 'tidyflow: a prototype for statistical workflows'
author: Jorge Cimentada
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---
```{r knitr-setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.path = "../figs/",
                      fig.align = "center",
                      fig.asp = 0.618,
                      out.width = "80%",
                      eval = FALSE)
```
## Introduction
The idea of a statistical workflow should be simple: it should hold only the minimum steps needed to perform a statistical analysis, namely the data, the formula and the model. In particular, the data should play the main role in the analysis: you should specify it once and let the workflow take care of all the details. For example:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

simple_tflow %>%
  fit()
```
The steps are declarative: start with the data and a workflow, specify the formula, then specify the model. `fit` takes care of running the steps. In more convoluted examples, iteration is a necessary part of the workflow (create a new column, refit the data, try a new model, refit the data, etc...). Statistical iteration is somewhat at odds with some `tidyverse` packages where the order of operations is not important (`mutate` can be used at many different points of an analysis, as can `pivot_longer`, etc...). For statistical analysis, the order of operations is paramount: you simply cannot ask for a column to be created after a model has been run. For example:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

simple_tflow %>%
  fit() %>%
  # complement_recipe will pass the already existing recipe to
  # the main argument of the function
  complement_recipe(~ .x %>% step_mutate(cyl = log10(cyl)))
```
The operation over `cyl` is specified after `fit`ting the model, so it simply won't take effect before the `fit`. This tension suggests that there are basic operations for statistical modelling that **need** to happen before fitting a model. In particular, there are some heuristics that force analysts to perform steps in a particular order. For example:
`data` -> `split training` -> `preprocessing` -> `model`
Preprocessing needs to happen before the modelling. You can iterate and alter columns after fitting the model, but you'll need to run the model again. Having said that, if there is a somewhat established order of operations and the workflow knows about it, then the order in which the user declares the operations is not important:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm")) %>%
  # New step
  complement_recipe(~ .x %>% step_mutate(cyl = log10(cyl)))

simple_tflow %>%
  # Split data into training/testing
  plug_split(rsample::initial_split, prop = 0.75) %>%
  fit()
```
Two new steps have been added: creating a new `cyl` column and splitting the data into training and testing. In this case, the order of declaration is unimportant: `cyl` was recreated after setting the model, and the training/testing split was declared after the data, the model specification, etc... I don't think it's a good idea to define a workflow this way, but it illustrates the point that, in theory, the order of the steps doesn't matter as long as `fit` knows the correct order. The correct order in this case is:
`data` -> `data split` -> `formula` -> `data wrangling` -> `model definition` -> `fit`
`tflow` is a placeholder (or plan) of all these steps and reorders them for `fit` to execute. The benefit of the placeholder strategy is that it allows both an ordered set of steps and interactivity. For example, we can fit a model with the initial workflow and add a new `step_*` for newer iterations:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  # Split data into training/testing
  plug_split(rsample::initial_split, prop = 0.75) %>%
  # `plug_recipe` receives your recipe and attaches the
  # data already provided in `.x`
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Fit the workflow
simple_tflow %>%
  fit()

# Refit model with one more step
simple_tflow %>%
  # Add more steps to the recipe. `.x` is the supplied
  # recipe above.
  complement_recipe(~ .x %>% step_mutate(cyl = log10(cyl))) %>%
  fit()
```
This applies to any of the placeholders in `tflow`. For example, we can add a few more steps and also change the model:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  # Split data into training/testing
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Fit the workflow
simple_tflow %>%
  fit()

# Refit model with newer steps and a new model
simple_tflow <-
  simple_tflow %>%
  complement_recipe(~ {
    .x %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  # Replaces the model altogether
  replace_model(set_engine(rand_forest(), "ranger"))

simple_tflow %>%
  fit()
```
`tflow` has a predefined set of placeholders:
* `data`: contains the data and split
* `recipe`: contains the formula and steps
* `modelling`: contains the model
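To make these placeholders concrete, here is a minimal sketch of how the plan could be represented internally. It assumes the `tflow` is little more than a named list of captured steps; nothing here is implemented and the slot layout is only illustrative.

```{r }
# Sketch only: a tflow as a plain named list of quoted steps that
# `fit()` reorders and evaluates at execution time (hypothetical).
tflow_skeleton <- list(
  data     = NULL, # the raw data, specified once
  split    = NULL, # e.g. a quoted call to rsample::initial_split
  resample = NULL, # e.g. a quoted call to rsample::vfold_cv
  recipe   = NULL, # formula plus preprocessing steps
  model    = NULL, # a parsnip model specification
  grid     = NULL, # tuning grid, if any
  metrics  = NULL  # metric set, if any
)

# Under this sketch, `plug_*` fills an empty slot, `replace_*` overwrites a
# filled one, and `complement_recipe` modifies the existing recipe slot.
```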
If you specify a recipe, you can either update it with `complement_recipe` or provide a completely new recipe with `replace_recipe`. If you supply a new dataset, it replaces the old one; if you supply a new model, it replaces the old one. Now, let's think about how this workflow works with more challenging examples.
# Abstracting the three placeholders
### Updating the data
The data should play the main role during the analysis. We risk a great deal by having many parts of our data lying around in our environment: the raw data, the training data, the testing data and the cross-validated set. Having this many pieces around can induce mistakes, such as adding the testing data to the cross-validation by mistake, or specifying the training data for the model but mistakenly passing the testing data to the predict method. `tflow` aims to solve that by specifying the data only once. You can extract all of the pieces from the `tflow`, such as the training/testing sets, but that is opt-in: `tflow` will never touch your testing data and will always pass it correctly to where it needs to go.
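As a sketch of what that opt-in extraction could look like (the `pull_tflow_*` accessor names are an assumption for illustration, not an implemented API):

```{r }
# Hypothetical accessors: the split pieces live inside the tflow and are
# only materialised when explicitly requested.
simple_tflow %>%
  pull_tflow_training()

simple_tflow %>%
  pull_tflow_testing()
```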
A `tflow` should always define the data first, but it also offers the possibility of updating the data. This can be useful if you've defined a complete pipeline with a particular data source but an updated version of your data has just arrived:
```{r }
# Define main workflow with old data
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

simple_tflow %>%
  # Replaces the old data and reruns everything
  replace_data(mtcars2) %>%
  fit()
```
### Updating the preprocessing step
A formula is always tied to your preprocessing. If you specify a new variable in your formula, you don't need to specify a new preprocessing step for it. However, if you specify a new step for a variable, that variable needs to be in the formula. Since `tflow` is a placeholder of steps, adding new terms to the formula is just as easy with `step_add`.
```{r }
# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Run the model
simple_tflow %>%
  fit()

# Update the recipe to include one more term
simple_tflow %>%
  complement_recipe(~ .x %>% step_add(am)) %>%
  fit()

# Update the recipe to include one more term and add a step
simple_tflow %>%
  complement_recipe(~ .x %>% step_add(am) %>% step_dummy(disp)) %>%
  fit()

# Update the recipe to include many terms and add a step
simple_tflow %>%
  complement_recipe(~ {
    .x %>%
      step_add(am, carb, hp, drat) %>%
      step_dummy(disp)
  }) %>%
  fit()
```
If the user's data exploration process is short, they can scroll up to the initial recipe and redefine it there. For more complicated workflows, `step_add` can make it easy to iteratively update the initial specification and add subsequent steps to the workflow^[`recipes` has an equivalent `step_rm` to remove columns from the recipe. However, this has a problem: adding new variables adds them to the formula, but `step_rm` removes the final column from the already prepped recipe. In other words, `step_rm` is trickier because you're not removing something from the formula, you're removing it from the final prepped recipe. If you have `step_*`'s that already use the term you want to remove, those steps won't be removed and will be computed before the actual column is removed.].
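For completeness, removing a term would presumably look like the sketch below; how `step_rm` interacts with `complement_recipe` is an assumption here, subject to the caveat in the footnote.

```{r }
# Hypothetical: drop `gear` from the prepped recipe rather than from the
# formula. Any earlier step that uses `gear` (e.g. step_dummy(gear)) would
# still be computed before the column is removed.
simple_tflow %>%
  complement_recipe(~ .x %>% step_rm(gear)) %>%
  fit()
```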
Since this step is interactive, the user should be able to specify a workflow and extract the prepped recipe in order to explore the data:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Visualize your changes and iterate until happy
simple_tflow %>%
  # `trained` will extract the recipe prepped on the data.
  # If `plug_split` is specified, it will use the training data
  # for everything.
  trained() %>%
  ggplot(aes(cyl)) +
  geom_histogram()
```
Another scenario is one in which the user wants to replace the complete formula. `replace_recipe` will always accept a function with one argument, and that argument will be passed the data.
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Replace the complete recipe rather than only adding new steps
simple_tflow %>%
  replace_recipe(~ {
    # In replace_recipe, `.x` will always be the data;
    # in complement_recipe, `.x` is the already defined recipe.
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_center(cyl) %>%
      step_log(cyl, base = 10) %>%
      step_dummy(gear)
  })
```
All of the above also includes the option of adding a resampling specification. This means that you can add a cross-validation function to the plan. For example:
```{r }
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  # Adds cross-validation
  plug_resample(rsample::vfold_cv) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Run model and look at results
simple_tflow %>%
  fit()

# Didn't like it? Rerun updating with another function
simple_tflow %>%
  # Switch to Monte Carlo cross-validation
  replace_resample(rsample::mc_cv) %>%
  fit()
```
### Updating the modelling workflow
Setting models is somewhat easy, as was shown in the examples above. However, there are more complicated layers such as adding tuning parameters and grids of tuning parameters. Just as a formula is tied to the preprocessing step, tuning parameters are usually tied to a model. This means that we can think of tuning parameters/grids as part of a model. Why have them separate? For example:
```{r}
# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear)
  }) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Create the model with some tuning parameters
mod1 <- set_engine(linear_reg(penalty = 0.000001, mixture = 0.00002),
                   engine = "glmnet")

simple_tflow %>%
  replace_model(mod1) %>%
  fit()

# Redefine the model with empty parameters for the tuning grid
mod2 <- set_engine(linear_reg(penalty = tune(), mixture = tune()),
                   engine = "glmnet")

simple_tflow %>%
  replace_model(mod2) %>%
  # Add grid of values for empty parameters
  plug_grid_regular(levels = 10) %>%
  fit()

mod3 <- set_engine(rand_forest(trees = 2000), "ranger")

simple_tflow %>%
  replace_model(mod3) %>%
  # Fill unfilled params with `tune()`, such as min_n and mtry
  plug_all_params(tune()) %>%
  plug_grid_regular(levels = 10) %>%
  fit()
```
I haven't thought through the tuning/model integration very well, but the above should be more flexible, allowing you to create tuning grids with minimums/maximums along the grid space.
There are scenarios in which you'd like to specify tuning parameters in your recipe as well as in your model. By specifying `tune()` in all empty parameters, as long as you specify a `grid_*` function, it should create a grid of values that is fed to `fit`.
```{r }
mod4 <- set_engine(rand_forest(trees = tune(), mtry = 10, min_n = tune()),
                   "ranger")

# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_resample(rsample::bootstraps) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear + disp, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear) %>%
      # NOTE: see the tune() in the recipe?
      step_ns(disp, deg_free = tune())
  }) %>%
  plug_model(mod4)

# This should create a grid of values for `deg_free`, `trees` and `min_n`
simple_tflow %>%
  plug_grid_regular(levels = 3) %>%
  fit()
```
The workflow shown so far takes care of returning all default metrics associated with a particular outcome. That is, if no metric of fit is specified (root mean squared error, R squared, etc...), it returns the defaults when tuning the grid (this is controlled automatically by `tune_grid`). However, this workflow allows you to specify any metric for the model, as it will be saved together with the model.
```{r }
mod5 <- set_engine(rand_forest(trees = tune(), mtry = 10, min_n = tune()),
                   "ranger")

# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_resample(rsample::vfold_cv) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear + disp, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear) %>%
      step_ns(disp, deg_free = tune())
  }) %>%
  plug_model(mod5) %>%
  # Adds the metrics
  plug_metric(metric_set(rmse, rsq))

simple_tflow %>%
  plug_grid_regular(levels = 3) %>%
  fit()
```
Thanks to the fact that `tidymodels` defines pretty much every step of the workflow (recipe, metrics, cross-validation, grids, etc...) with separate classes, each of these classes has a slot in the workflow, and you can explicitly add/update them with the `plug_*` or `replace_*` functions. It's still not clear how the `plug_*` and `replace_*` functions will work, since I want to make them memorable: they should all work the same way and accept similar things. I need to think about this more.
```{r }
mod6 <- set_engine(rand_forest(trees = tune(), mtry = 10, min_n = tune()),
                   "ranger")

# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split, prop = 0.75) %>%
  plug_resample(rsample::vfold_cv) %>%
  plug_recipe(~ {
    recipe(mpg ~ cyl + gear + disp, data = .x) %>%
      step_mutate(cyl = log10(cyl)) %>%
      step_dummy(gear) %>%
      step_ns(disp, deg_free = tune())
  }) %>%
  plug_model(mod6) %>%
  # Adds the metrics
  plug_metric(metric_set(rmse, rsq))

mod7 <- set_engine(linear_reg(penalty = tune(), mixture = tune()), "glmnet")

simple_tflow %>%
  # Replaces old model
  replace_model(mod7) %>%
  plug_grid_regular(levels = 3) %>%
  # Replace previous grid
  replace_grid_regular(levels = 10) %>%
  fit()
```
Finally, the process behind `tflow` should be somewhat transparent, and the user should be able to print the workflow before fitting to see the order of the steps. In particular, there should be a method that, when run on the object, returns a data frame along the lines of:
```{r eval = TRUE, echo = FALSE}
library(tibble)
library(rlang)

tflow <- tribble(
  ~ "step", ~ "execution",
  "data", mtcars,
  "split", quo(initial_split(mtcars)),
  "vfold", quo(vfold_cv(training(initial_split(mtcars)))),
  "recipe", quo(recipe(mpg ~ cyl + gear + disp, data = training(initial_split(mtcars))) %>%
                  step_mutate(cyl = log10(cyl)) %>%
                  step_ns(disp, deg_free = tune())),
  "model", quo(set_engine(rand_forest(trees = tune(), mtry = 10, min_n = tune()), "ranger")),
  "grid", quo(grid_regular(levels = 10))
)

tflow
```
where each step is accompanied by the captured expression to which it refers. In addition, functions to extract specific objects, such as the executed grid, the executed recipe, etc., should be available to check whether the specified steps produced the expected results.
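A sketch of what those extractors could look like (the `pull_tflow_*` names are assumptions for illustration, not an implemented API):

```{r }
# Hypothetical extractors for inspecting individual slots of the plan
simple_tflow %>%
  pull_tflow_recipe()    # the recipe as specified (or prepped)

simple_tflow %>%
  pull_tflow_grid()      # the expanded tuning grid

simple_tflow %>%
  pull_tflow_resample()  # the resampling specification
```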
Now, finally, what does running `fit` return? As for printing, it should print something along the lines of:
```{r}
mtcars %>%
  tflow() %>%
  plug_split(rsample::initial_split) %>%
  plug_resample(rsample::vfold_cv) %>%
  plug_recipe(~ recipe(mpg ~ cyl, data = .x) %>% step_log(cyl)) %>%
  plug_model(set_engine(linear_reg(penalty = tune(), mixture = tune()), "glmnet"))
```
```
══ Workflow ════════════════════════════════════════════════════════════════════
Data: 32 rows x 11 columns; 0% missing values
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
Split: initial_split w/ default args
Resample: vfold_cv w/ default args
1 Recipe Step
● step_log()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
```
When `fit` is run, the same printout is returned together with the result of the fitting (coefficients, lambdas, etc.; this captures the exact printout of the engine being run behind the scenes). However, the user will be able to extract four things out of the workflow: (1) the data frame with the metrics of fit for the training and cross-validation sets, (2) the best model fit on the complete training data, (3) the predictions of the best model (fit on the training data) for the test set, and (4) the coefficients (if available; this is passed to `broom`).
In particular, it will have methods to extract the `vfold` tibble and explore metrics, an automatic method for plotting the metrics across cross-validation sets, a method to extract the best tuning parameters and another one to train the best tuning parameters on the whole training data and return the model.
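A rough sketch of that post-fit interface, assuming hypothetical accessor names (none of these functions exist yet):

```{r }
# Sketch only: possible post-fit accessors for a fitted tflow
res <- fit(simple_tflow)

res %>% pull_tflow_metrics()    # (1) metrics for training and resamples
res %>% pull_tflow_best_model() # (2) best model refit on the full training data
res %>% predict_testing()       # (3) predictions of the best model on the test set
res %>% pull_tflow_coefs()      # (4) tidy coefficients, via broom, when available
```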
# Final thoughts
This description is mostly based on a workflow for doing machine learning, simply because the core objective of `tidymodels` is machine learning. However, this workflow should also be useful for other types of statistical analysis. For example, the statistical analysis workflow is very similar to the one described above, just without separating into training/testing, adding cross-validation or performing a grid search. You could in principle do the same for statistical inference, as long as the result is focused not on choosing the best model but on visualizing coefficients:
```{r }
# Define main workflow
simple_tflow <-
  mtcars %>%
  tflow() %>%
  plug_recipe(~ recipe(mpg ~ ., data = .x)) %>%
  plug_model(set_engine(linear_reg(), "lm"))

# Run first model and explore
simple_tflow %>%
  fit()
```
Initiatives such as [Zelig](http://docs.zeligproject.org/index.html) could be adapted to `tflow` such that more statistical models can be used (as well as other standard statistical software such as `brms`, `rstanarm`, etc...).
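For instance, swapping in a Bayesian engine should, in principle, only require changing the model slot; `parsnip` already exposes a `"stan"` engine (backed by `rstanarm`) for `linear_reg()`. A minimal sketch, using the hypothetical `replace_model` from above:

```{r }
# Sketch: the same tflow refit with a Bayesian engine instead of "lm"
simple_tflow %>%
  replace_model(set_engine(linear_reg(), "stan")) %>%
  fit()
```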
All of the above is a prototype idea and none of it has been implemented.
<!-- ## Thoughts -->
<!-- - If no `initial_split` is set, everything is performed on the main data. This is ok, as statistical analysis doesn't usually split into testing/training. However, if the user tries to use the method to test the analysis on the test data, an error should be raised saying that no initial split was performed. This shouldn't be very compromising, as everything is done on the training data and no leak to the test is ever done. The user just needs to remake the above including the initial split. -->
<!-- - If vfold is called and initial_split is not set, a warning should be raised saying: are you sure you want that? -->
<!-- ```{r } -->
<!-- library(AmesHousing) -->
<!-- library(tidymodels) -->
<!-- ames <- make_ames() -->
<!-- ml_tflow <- -->
<!-- ames %>% -->
<!-- workflow() %>% -->
<!-- initial_split(prop = .75) %>% -->
<!-- recipe(Sale_Price ~ Longitude + Latitude + Neighborhood, data = ames) %>% -->
<!-- step_log(Sale_Price, base = 10) %>% -->
<!-- step_other(Neighborhood, threshold = 0.05) %>% -->
<!-- step_dummy(recipes::all_nominal()) %>% -->
<!-- step_scale(Sale_Price) %>% -->
<!-- step_scale(Latitude) -->
<!-- mod1 <- -->
<!-- linear_reg(penalty = tune(), mixture = tune()) %>% -->
<!-- set_engine("glmnet") %>% -->
<!-- grid_regular(levels = 10) -->
<!-- mod2 <- -->
<!-- rand_forest(mtry = tune(), trees = tune()) %>% -->
<!-- set_engine("glmnet") %>% -->
<!-- grid_regular(levels = 10) -->
<!-- ## The combination of empty parameter specification with any grid_* -->
<!-- ## should throw an error. -->
<!-- ml_tflow %>% -->
<!-- plug_model(mod1) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- # Add new recipes -->
<!-- final_res <- -->
<!-- ml_tflow %>% -->
<!-- replace_rcp( -->
<!-- .recipe %>% -->
<!-- step_scale(Sale_Price) %>% -->
<!-- step_center(Sale_Price) -->
<!-- ) %>% -->
<!-- plug_model(mod2) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- ## Add new variable -->
<!-- ml_tflow %>% -->
<!-- replace_frm( -->
<!-- ~ . + x2 -->
<!-- ) %>% -->
<!-- replace_rcp( -->
<!-- .rcp %>% step_dummy(x2) -->
<!-- ) %>% -->
<!-- plug_model(mod2) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- ## Replace y -->
<!-- ml_tflow %>% -->
<!-- replace_frm( -->
<!-- Latitude ~ . -->
<!-- ) %>% -->
<!-- replace_rcp( -->
<!-- .rcp %>% step_log(Latitude) -->
<!-- ) %>% -->
<!-- plug_model(mod2) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- ## Update whole formula -->
<!-- ml_tflow %>% -->
<!-- replace_frm( -->
<!-- Latitude ~ Neighborhood + whatever -->
<!-- ) %>% -->
<!-- replace_rcp( -->
<!-- .rcp %>% -->
<!-- step_log(Latitude) %>% -->
<!-- step_dummy(Neighborhood) %>% -->
<!-- step_BoxCox(Whatever), -->
<!-- # Remove recipe and accept new one -->
<!-- new = TRUE -->
<!-- ) %>% -->
<!-- plug_model(mod2) %>% -->
<!-- vfold_cv() %>% -->
<!-- fit() -->
<!-- final_res %>% -->
<!-- replace_rcp( -->
<!-- -step_scale, -->
<!-- ) -->
<!-- # ml_tflow shouldn't run anything -- it's just a specification -->
<!-- # of all the different steps. `fit` should run everything -->
<!-- ml_tflow <- fit(ml_tflow) -->
<!-- # Plot results of tuning parameters -->
<!-- ml_tflow %>% -->
<!-- autoplot() -->
<!-- # Automatically extract best parameters and fit to the training data -->
<!-- final_model <- -->
<!-- ml_tflow %>% -->
<!-- fit_best_model(metrics = metric_set(rmse)) -->
<!-- # Predict on the test data using the last model -->
<!-- # Everything is bundled into a workflow object -->
<!-- # and everything can be extracted with separate -->
<!-- # functions with the same verb -->
<!-- final_model %>% -->
<!-- holdout_error() -->
<!-- ``` -->
<!-- ```{r } -->
<!-- library(AmesHousing) -->
<!-- # devtools::install_github("tidymodels/tidymodels") -->
<!-- library(tidymodels) -->
<!-- ames <- make_ames() -->
<!-- ############################# Data Partitioning ############################### -->
<!-- ############################################################################### -->
<!-- ames_split <- rsample::initial_split(ames, prop = .7) -->
<!-- ames_train <- rsample::training(ames_split) -->
<!-- ames_cv <- rsample::vfold_cv(ames_train) -->
<!-- ############################# Preprocessing ################################### -->
<!-- ############################################################################### -->
<!-- mod_rec <- -->
<!-- recipes::recipe(Central_Air ~ Longitude + Latitude + Neighborhood, -->
<!-- data = ames_train) %>% -->
<!-- recipes::step_other(Neighborhood, threshold = 0.05) %>% -->
<!-- ## recipes::step_dummy(all_predictors(), recipes::all_nominal()) -->
<!-- ############################# Model Training/Tuning ########################### -->
<!-- ############################################################################### -->
<!-- ## Define a regularized regression and explicitly leave the tuning parameters -->
<!-- ## empty for later tuning. -->
<!-- lm_mod <- -->
<!-- parsnip::rand_forest(mtry = 2, trees = 2000, min_n = 5) %>% -->
<!-- set_mode("classification") %>% -->
<!-- parsnip::set_engine("ranger") -->
<!-- ## Construct a workflow that combines your recipe and your model -->
<!-- ml_tflow <- -->
<!-- workflows::workflow() %>% -->
<!-- workflows::plug_recipe(mod_rec) %>% -->
<!-- workflows::plug_model(lm_mod) -->
<!-- # Find best tuned model -->
<!-- res <- -->
<!-- ml_tflow %>% -->
<!-- fit(data = ames_train) -->
<!-- ############################# Validation ###################################### -->
<!-- ############################################################################### -->
<!-- # Select best parameters -->
<!-- best_params <- -->
<!-- res %>% -->
<!-- tune::select_best(metric = "rmse", maximize = FALSE) -->
<!-- # Refit using the entire training data -->
<!-- reg_res <- -->
<!-- ml_tflow %>% -->
<!-- tune::finalize_workflow(best_params) %>% -->
<!-- parsnip::fit(data = ames_train) -->
<!-- # Predict on test data -->
<!-- ames_test <- rsample::testing(ames_split) -->
<!-- reg_res %>% -->
<!-- parsnip::predict(new_data = recipes::bake(mod_rec, ames_test)) %>% -->
<!-- bind_cols(ames_test, .) %>% -->
<!-- mutate(Sale_Price = log10(Sale_Price)) %>% -->
<!-- select(Sale_Price, .pred) %>% -->
<!-- rmse(Sale_Price, .pred) -->
<!-- ``` -->