Let's verify that bagging, the technique that turns individual decision trees into a random forest, actually improves on plain decision trees.
In this exercise we work with the California Housing dataset. Each data point consists of 8 census attributes (e.g. average income, building size, location) together with the median house value.
We will examine how random forests can be used for regularization and compare them to decision trees. Open the file src/ex1_random_forests.py and go to the __main__ function.
1. Import the California Housing dataset (sklearn.datasets.fetch_california_housing).
2. Split the dataset into train and test sets. Use a test size of 10% and random_state=21 (see the loading sketch after this list).
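A minimal sketch of these two steps could look as follows; the variable names are our own choice, not prescribed by the exercise:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the 8 census attributes and the median house value targets.
housing = fetch_california_housing()

# 90% training data, 10% test data, fixed seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.1, random_state=21
)
```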
We will now implement the train_dt_and_rf function to train decision trees and random forests by following these steps:
3. Create a decision tree regressor (sklearn.tree.DecisionTreeRegressor) with a maximum tree depth of 1 and fit it to your training data.
4. Make predictions with your regressor on the test data and calculate the mean squared error between the predictions and the ground-truth targets. For convenience you can use sklearn.metrics.mean_squared_error. Print the result.
5. Repeat steps 3 and 4 for a maximum depth of 2, 3, ..., 30. Save the resulting MSEs in an array.
6. Repeat steps 3-5, but this time using a random forest regressor (sklearn.ensemble.RandomForestRegressor). Reduce the number of estimators the random forest uses to 10 to speed up the training.
7. Make sure your function returns the two lists of MSEs in a dictionary (a possible structure is sketched after this list).
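One possible skeleton for train_dt_and_rf is sketched below. The exact signature and the dictionary keys are assumptions, so adapt them to whatever the exercise file expects:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def train_dt_and_rf(x_train, y_train, x_test, y_test, max_depth=30):
    """Train trees and forests of increasing depth and collect the test MSEs."""
    dt_mses, rf_mses = [], []
    for depth in range(1, max_depth + 1):
        # Single decision tree limited to the current depth.
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(x_train, y_train)
        dt_mse = mean_squared_error(y_test, tree.predict(x_test))
        print(f"Decision tree, max depth {depth}: MSE {dt_mse:.4f}")
        dt_mses.append(dt_mse)

        # Random forest with the same depth limit; only 10 estimators to keep training fast.
        forest = RandomForestRegressor(max_depth=depth, n_estimators=10)
        forest.fit(x_train, y_train)
        rf_mses.append(mean_squared_error(y_test, forest.predict(x_test)))

    return {"dt_mses": dt_mses, "rf_mses": rf_mses}
```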
Take a look at your results and analyse them:
8. Call your train_dt_and_rf function and get both the MSE curve of the decision trees from step 5 and the MSE curve of the random forests from step 6.
9. Plot both MSE curves together in one figure (x-axis: maximum depth, y-axis: MSE); a plotting sketch follows after this list.
10. Look at the curve of the decision trees and how the MSE changes as the maximum depth increases. What do you observe? Why do you think this is happening? How does the curve of the random forests differ from the previous one? Why is this the case?
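Getting the two curves into one figure could look like the following sketch, which assumes the dictionary keys from the sketch above:

```python
import matplotlib.pyplot as plt

results = train_dt_and_rf(x_train, y_train, x_test, y_test)
depths = range(1, len(results["dt_mses"]) + 1)

plt.plot(depths, results["dt_mses"], label="decision tree")
plt.plot(depths, results["rf_mses"], label="random forest (10 estimators)")
plt.xlabel("maximum depth")
plt.ylabel("test MSE")
plt.legend()
plt.show()
```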
For 2D data it is possible to directly visualize how a classifier divides the plane into classes. Luckily, scikit-learn provides such a DecisionBoundaryDisplay for its estimators. Like in the previous exercise, we will create decision trees and random forests and compare their performance on some synthetic datasets. This time, the datasets are passed to your function as parameters. In the file src/ex2_decision_boundaries.py, implement the train_and_visualize_decision_boundaries() function by following these steps:
1. Create a scatter plot of the dataset provided via the function parameters. Colorize the points according to their class membership (c=targets).
2. Fit a decision tree classifier on the whole dataset using the sklearn.tree module and plot the tree. Look at the sklearn.tree module for help.
3. Create a DecisionBoundaryDisplay using the sklearn.inspection.DecisionBoundaryDisplay.from_estimator function and use vmax=2/0.29 and cmap=plt.cm.tab10. To show the data points in the same plot, you can call the ax_.scatter() method of the display you created and use it like plt.scatter() before you call plt.show(). In ax_.scatter(), set vmax=2/0.29 and cmap=plt.cm.tab10 as well. This way, all the plots should use the same colors. (A sketch of these steps follows after this list.)
4. If you run the script with python ./src/ex2_decision_boundaries.py, you will see that your function is called with five different datasets: vertical lines, diagonal lines, nested circles, half-moons and spirals. Do the decision trees trained on these datasets have different complexities? If yes, why do you think that is the case?
5. Now train a random forest classifier from sklearn.ensemble on the whole data.
6. Repeat step 3 using the classifier from step 5. How do the decision boundaries of the random forest classifier differ from the ones described by the decision tree classifier?
7. Make sure your function returns the decision tree classifier and the random forest classifier you created, again using a dictionary.
8. Play around with different values for n_samples and noise in make_circles and make_moons.
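A sketch of steps 1-3, together with the forest from steps 5-7, is given below. It assumes the function receives the 2D points and integer class targets as arrays, and the return keys are our own naming rather than something prescribed by the skeleton:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier, plot_tree


def train_and_visualize_decision_boundaries(points, targets):
    # Step 1: scatter plot of the raw data, colored by class membership.
    plt.scatter(points[:, 0], points[:, 1], c=targets, vmax=2 / 0.29, cmap=plt.cm.tab10)
    plt.show()

    # Step 2: fit a decision tree on the whole dataset and draw it.
    tree = DecisionTreeClassifier()
    tree.fit(points, targets)
    plot_tree(tree)
    plt.show()

    # Step 3: decision boundary of the tree, with the data points drawn on top.
    display = DecisionBoundaryDisplay.from_estimator(
        tree, points, vmax=2 / 0.29, cmap=plt.cm.tab10
    )
    display.ax_.scatter(points[:, 0], points[:, 1], c=targets, vmax=2 / 0.29, cmap=plt.cm.tab10)
    plt.show()

    # Steps 5 and 6: the random forest is handled analogously.
    forest = RandomForestClassifier()
    forest.fit(points, targets)

    return {"decision_tree": tree, "random_forest": forest}
```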
Now we will implement our own random forest for classification that can handle missing values. Navigate to the file src/ex3_my_forest.py.
1. Implement the entropy() function.
2. Now use your entropy() function to implement the information_gain() function (a reference sketch of both follows after this list).
3. Look at the RandomForest class and use the build_tree() function to implement the fit() function, including bootstrapping and random feature selection.
4. Finally, implement the predict() function, which predicts with every resulting tree and returns a majority vote.
5. You can now compare your results to the sklearn implementation of the random forest algorithm.
6. If you now uncomment the commented part in the main() function, you can experiment with missing values.
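For steps 1, 2 and 4, a minimal reference sketch is shown below. The argument lists are assumptions (the skeleton may, for example, pass a split mask to information_gain() instead of the two child label arrays), and the class labels are assumed to be non-negative integers:

```python
import numpy as np


def entropy(labels):
    """Shannon entropy H(Y) = -sum_c p_c * log2(p_c) of a 1D label array."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return -np.sum(probabilities * np.log2(probabilities))


def information_gain(parent_labels, left_labels, right_labels):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent_labels)
    weighted_child_entropy = (
        len(left_labels) / n * entropy(left_labels)
        + len(right_labels) / n * entropy(right_labels)
    )
    return entropy(parent_labels) - weighted_child_entropy


def majority_vote(per_tree_predictions):
    """Column-wise majority vote over an array of shape (n_trees, n_samples)."""
    votes = np.asarray(per_tree_predictions, dtype=int)
    return np.array([np.bincount(column).argmax() for column in votes.T])
```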