{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.3"
    },
    "colab": {
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/danielbauer1979/CAS_PredMod/blob/main/pa_pynb_sess7_BaggingAndBoosting.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Bagging and Boosting\n",
        "\n",
        "Dani Bauer, 2022\n",
        "\n",
        "In this tutorial, we discuss approaches to improve on the predictive porformance of CARTs via *aggregation*. We first consider a basic bootstrap aggregation approach in the context of a simple example with a single predictor, illustrating key aspects and pitfalls. We then use random forests and boosted trees in our case study examples, analyzing whether they can improve on the learners considered so far.\n",
        "\n",
        "As usually, let's start with loading the relevant libaries."
      ],
      "metadata": {
        "id": "_Dgf3D6YiA0C"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7vCC-SQb7VRR"
      },
      "source": [
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "import pandas as pd\n",
        "import statsmodels.formula.api as smf\n",
        "import statsmodels.api as sm\n",
        "import seaborn as sns \n",
        "\n",
        "from sklearn.model_selection import train_test_split, cross_val_score\n",
        "from io import StringIO\n",
        "from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz\n",
        "from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor\n",
        "from sklearn.metrics import mean_squared_error,confusion_matrix, classification_report, roc_curve, auc"
      ],
      "execution_count": 1,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "And this function, which creates images of tree models using pydot, as the packagesklearn doesn't offer graphs of the trees"
      ],
      "metadata": {
        "id": "Mc9fHrS03ydi"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import pydot\n",
        "from IPython.display import Image\n",
        "def print_tree(estimator, features, class_names=None, filled=True):\n",
        "  tree = estimator\n",
        "  names = features\n",
        "  color = filled\n",
        "  classn = class_names\n",
        "  dot_data = StringIO()\n",
        "  export_graphviz(estimator, out_file=dot_data, feature_names=features, class_names=classn, filled=filled)\n",
        "  graph = pydot.graph_from_dot_data(dot_data.getvalue())\n",
        "  return(graph)"
      ],
      "metadata": {
        "id": "M1PCEHYw3y7w"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "And:"
      ],
      "metadata": {
        "id": "CIxS7A-w4GGl"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def sigmoid(x):\n",
        "    return(1 / (1 + np.exp(-x)))"
      ],
      "metadata": {
        "id": "QGnILMfU4HaI"
      },
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Improvement via Aggregation\n",
        "\n",
        "## Review of Concepts and Maths\n",
        "\n",
        "As we have discussed throughout this class, there is a key tradeoff between *bias* and *variance*:  A more complex learner may perform marvelously *in-sample*, but the *out-of-sample* performance is poor due to *overfitting*.  *Ensemble* learning techniques use the input from basic learners trained on different data sets (or also the input from different learners trained on the same data set -- this is referred to as *stacking*), to improve on the predictive performance of the basic learner.\n",
        "\n",
        "In *Bagging* -- short for *bootstrap-aggregating* -- the idea is to sample from the original dataset to obtain $M$ bootstrap samples, which are in turn used for training the predictive model yielding predictions $\\hat{Y}_j,$ $j=1,\\ldots,M.$  The final prediction $\\hat{Y}$ then is based on the average of the different predictions: $\\hat{Y} = \\frac{1}{M} \\sum_{j=1}^M \\hat{Y}_j$.  \n",
        "\n",
        "For instance, if each of the $M$ predictions is *unbiased*, $\\mathbb{E}_x[\\hat{Y}_i] = Y,$ then of course the aggregated prediction will be unbiased as well: $\\mathbb{E}_x[\\hat{Y}] = Y.$ However, we generally have for the standard deviation:\n",
        "$$\n",
        "\\text{StDev}_x(\\hat{Y}) = \\frac{1}{N} \\text{StDev}_x\\left(\\sum_{j=1}^M \\hat{Y}_j\\right) \\leq \\frac{1}{N} \\sum_{j=1}^M \\text{StDev}_x(Y_j),\n",
        "$$\n",
        "where the inequality is sharp if the predictions are not perfectly positively correlated.  Hence, aggregating can reduce the variance! *Random Forests* rely on bagging, but additionally sample from the set of features to control correlation between the ensemble predictions.\n",
        "\n",
        "In *Boosting*, different learners are fit stage-wise focusing on the residual.  Hence, the learners at later stages attempt to improve the predictions by focusing on the portion that is not fit well by learners in earlier stages.  Hence, aggregating can reduce the bias!\n",
        "\n",
        "## A Simulated Example with One Predictor\n",
        "\n",
        "Let us revisit the simple example from the previous tutorial on trees.  As a reminder, there we used the so-called *sigmoid* function that can depict highly linear as well as highly non-linear relationships by different choices of its parameter.  We used different parameters to generate two data sets, and compared the predictive preformance of trees vs. that of OLS regression.  Our conclusion was that trees work well in the non-linear situation whereas (linear) regression works well in the linear situation:\n"
      ],
      "metadata": {
        "id": "oFco9P3jzMD4"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "1. Generate the datasets:"
      ],
      "metadata": {
        "id": "7jE6cihh4bpC"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "np.random.seed(42)\n",
        "x = 3 * np.random.normal(0, 1, 150)\n",
        "eps = 0.25 * np.random.normal(0, 1, 150)\n",
        "y_1 = sigmoid(0.5 * x) + eps\n",
        "y_2 = sigmoid(5 * x) + eps\n",
        "mydata1 = pd.DataFrame({'y_1':y_1,'x':x})\n",
        "mytraindata1 = mydata1[0:100]\n",
        "mytestdata1 = mydata1[100:150]\n",
        "mydata2 = pd.DataFrame({'y_2':y_2,'x':x})\n",
        "mytraindata2 = mydata2[0:100]\n",
        "mytestdata2 = mydata2[100:150]\n"
      ],
      "metadata": {
        "id": "8ijLyBLP4Sgk"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "2. Fit an OLS regression to the first dataset:"
      ],
      "metadata": {
        "id": "o9WH3oR_4Q1Q"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "lmfit1 = smf.ols(formula=\"y_1 ~ x\", data=mytraindata1).fit()\n",
        "yhat_OOS1 = lmfit1.predict(mytestdata1)\n",
        "OLS_OOS_MSE1 = sum((mytestdata1['y_1'] - yhat_OOS1)**2)/50\n",
        "OLS_OOS_MSE1"
      ],
      "metadata": {
        "id": "1xqsIIJL4hXb"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "3. Fit a tree:"
      ],
      "metadata": {
        "id": "M3ZE0KNN4Rq9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        " tree1 = DecisionTreeRegressor(max_leaf_nodes=2)\n",
        "X = mytraindata1['x'].values.reshape(-1, 1)\n",
        "y = mytraindata1['y_1'].values\n",
        "tree1.fit(X, y)\n",
        "ytreehat1 = tree1.predict(mytestdata1['x'].values.reshape(-1, 1))\n",
        "TREE_OOS_MSE1 = sum((mytestdata1['y_1'] - ytreehat1)**2)/50\n",
        "TREE_OOS_MSE1"
      ],
      "metadata": {
        "id": "2UstDPtB5DEA"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now, rather than generating one tree, let's contemplate an alternative.  Let us draw new data sets from sampling from the original data set (*bootstrapping*), let's fit an (unpruned) tree to each of the sampled data sets, and let's predict by averaging over the predictions of these new trees."
      ],
      "metadata": {
        "id": "cBNwtI1-5E76"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "ybaggedtreehat1 = np.zeros(mytestdata1['y_1'].shape)\n",
        "atree = DecisionTreeRegressor()\n",
        "for i in range(0, 100):\n",
        "  subset = np.random.choice(len(mytraindata1), 25, replace=True)\n",
        "  X = mytraindata1['x'][subset].values.reshape(-1, 1)\n",
        "  y = mytraindata1['y_1'][subset].values\n",
        "  atree.fit(X, y)\n",
        "  ybaggedtreehat1 = ybaggedtreehat1 + atree.predict(mytestdata1['x'].values.reshape(-1, 1))\n",
        "ybaggedtreehat1 = ybaggedtreehat1/100\n",
        "BAGGED_MSE1 = sum((mytestdata1['y_1'] - ybaggedtreehat1)**2)/50\n",
        "BAGGED_MSE1"
      ],
      "metadata": {
        "id": "t_cfdEX85FOJ"
      },
      "execution_count": null,
      "outputs": []
    },
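    {
      "cell_type": "markdown",
      "source": [
        "As an aside, `sklearn`'s `BaggingRegressor` (imported above) packages this same loop; a minimal sketch of the equivalent fit, with `max_samples=25` mirroring the bootstrap sample size used above:"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sketch: the same bagging procedure via sklearn's BaggingRegressor.\n",
        "bag1 = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, max_samples=25, random_state=1)\n",
        "bag1.fit(mytraindata1['x'].values.reshape(-1, 1), mytraindata1['y_1'].values)\n",
        "ybaghat1 = bag1.predict(mytestdata1['x'].values.reshape(-1, 1))\n",
        "sum((mytestdata1['y_1'] - ybaghat1)**2)/50"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },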
    {
      "cell_type": "markdown",
      "source": [
        "We notice that by aggregating, the tree-based predictions perform notably better than the single tree.  What is going on?  There are two aspects worth noting:\n",
        "\n",
        "1. As explained above, by aggregating the individual trees we potentially reduce the variance of the prediction.\n",
        "\n",
        "2. We are not pruning the individual trees which leads to lower bias.  While this may lead to overfitting by any individual tree, we control the variance by subsequently averaging.  Thus,  we potentially reduce the bias of the prediction.\n",
        "\n",
        "Let's compare the predictions:"
      ],
      "metadata": {
        "id": "iTOoQgZK4RYz"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "plt.scatter(mytestdata1['x'], mytestdata1['y_1'], c = 'k')\n",
        "plt.plot(mytestdata1['x'], yhat_OOS1, c = 'k')\n",
        "plt.scatter(mytestdata1['x'], ytreehat1, c = 'r')\n",
        "plt.scatter(mytestdata1['x'], ybaggedtreehat1, c = 'b')"
      ],
      "metadata": {
        "id": "xRRtyuyZ5lBZ"
      },
      "execution_count": null,
      "outputs": []
    },
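    {
      "cell_type": "markdown",
      "source": [
        "What happens if we instead draw bootstrap samples of the full training-set size (100)?"
      ],
      "metadata": {}
    },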
    {
      "cell_type": "code",
      "source": [
        "ybaggedtreehat1_100_1 = np.zeros(mytestdata1['y_1'].shape)\n",
        "for i in range(0, 100):\n",
        "  subset = np.random.choice(len(mytraindata1), 100, replace=True)\n",
        "  X = mytraindata1['x'][subset].values.reshape(-1, 1)\n",
        "  y = mytraindata1['y_1'][subset].values\n",
        "  atree.fit(X, y)\n",
        "  ybaggedtreehat1_100_1 = ybaggedtreehat1_100_1 + atree.predict(mytestdata1['x'].values.reshape(-1, 1))\n",
        "  ybaggedtreehat1_100_1 = ybaggedtreehat1_100_1/100\n",
        "  BAGGED_MSE1_100_1 = sum((mytestdata1['y_1'] - ybaggedtreehat1_100_1)**2)/50\n",
        "BAGGED_MSE1_100_1"
      ],
      "metadata": {
        "id": "L-nED2Ud52nB"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here the MSE even increases relative to the simple tree.  Why?  Because we are not pruning, so the individual trees are overfitting the data and, due to the high positive correlation between the predictions originating from the similarities in  datasets, the overfitting here is not mitigated by the variance reduction. "
      ],
      "metadata": {
        "id": "kM9BDzUn6Fc2"
      }
    },
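    {
      "cell_type": "markdown",
      "source": [
        "We can check this explanation with a small sketch (an addition to the original analysis): estimate the average pairwise correlation between the trees' test-set predictions for bootstrap sample sizes 25 and 100."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sketch: average pairwise correlation of the 100 trees' test-set predictions,\n",
        "# for bootstrap sample sizes 25 and 100. Higher correlation means less variance reduction.\n",
        "Xtest1 = mytestdata1['x'].values.reshape(-1, 1)\n",
        "for size in [25, 100]:\n",
        "  preds = []\n",
        "  for i in range(0, 100):\n",
        "    subset = np.random.choice(len(mytraindata1), size, replace=True)\n",
        "    atree.fit(mytraindata1['x'][subset].values.reshape(-1, 1), mytraindata1['y_1'][subset].values)\n",
        "    preds.append(atree.predict(Xtest1))\n",
        "  corr = np.corrcoef(np.array(preds))\n",
        "  print(f\"bootstrap size {size}: mean pairwise correlation = {corr[np.triu_indices(100, k=1)].mean():.3f}\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },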
    {
      "cell_type": "markdown",
      "source": [
        "Let's also try the second dataset:\n",
        "\n",
        "1. Tree:"
      ],
      "metadata": {
        "id": "JdD4221k5j99"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "tree2 = DecisionTreeRegressor(max_leaf_nodes=2)\n",
        "X = mytraindata2['x'].values.reshape(-1, 1)\n",
        "y = mytraindata2['y_2'].values\n",
        "tree2.fit(X, y)\n",
        "ytreehat2 = tree2.predict(mytestdata2['x'].values.reshape(-1, 1))\n",
        "TREE_OOS_MSE2 = sum((mytestdata2['y_2'] - ytreehat2)**2)/50\n",
        "TREE_OOS_MSE2"
      ],
      "metadata": {
        "id": "fw0OPS2C66PO"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "2. Bagging:"
      ],
      "metadata": {
        "id": "iHBjbElU5jwB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "ybaggedtreehat2 = np.zeros(mytestdata2['y_2'].shape)\n",
        "atree = DecisionTreeRegressor()\n",
        "for i in range(0, 100):\n",
        "  subset = np.random.choice(len(mytraindata2), 30, replace=True)\n",
        "  X = mytraindata2['x'][subset].values.reshape(-1, 1)\n",
        "  y = mytraindata2['y_2'][subset].values\n",
        "  atree.fit(X, y)\n",
        "  ybaggedtreehat2 = ybaggedtreehat2 + atree.predict(mytestdata2['x'].values.reshape(-1, 1))\n",
        "ybaggedtreehat2 = ybaggedtreehat2/100\n",
        "BAGGED_MSE2 = sum((mytestdata2['y_2'] - ybaggedtreehat2)**2)/50\n",
        "BAGGED_MSE2"
      ],
      "metadata": {
        "id": "YBWlLq5U7CyG"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "So it turns out here the aggregated sample beats even the basic tree model, which performed quite well.  But why?  Let's look at the predictions:"
      ],
      "metadata": {
        "id": "yrofpLI-7WBE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "lmfit2 = smf.ols(formula=\"y_2 ~ x\", data=mytraindata2).fit()\n",
        "yhat_OOS2 = lmfit2.predict(mytestdata2)\n",
        "plt.scatter(mytestdata2['x'], mytestdata2['y_2'], c = 'k')\n",
        "plt.plot(mytestdata2['x'], yhat_OOS2, c = 'k')\n",
        "plt.scatter(mytestdata2['x'], ytreehat2, c = 'r')\n",
        "plt.scatter(mytestdata2['x'], ybaggedtreehat2, c = 'b')"
      ],
      "metadata": {
        "id": "UdnpfKG75jdY"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "So in the extremal areas, the predictions are similar, but the predictions of the aggregated estimator are smooth around the cutoff area.  A single tree that has one precise cutoff is likely to get it (slightly) wrong, so the smoothed transition (reflecting some ambiguity of whether the points in this area should be \"up\" or \"down\") generally performs better.\n",
        "\n",
        "Now contemplate the (more realistic) situation where it is not ex ante clear of whether the relationship is more linear or more non-linear.  In this case, the aggregated estimator may be a conservative choice -- and it appears to definitely outperform the tree!"
      ],
      "metadata": {
        "id": "TOGZ0MLE8i8A"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Case Study: Caravan Insurance Purchases\n",
        "\n",
        "Let's go back to the `Caravan` insurance data:"
      ],
      "metadata": {
        "id": "zNnvu0I48wYg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!git clone https://github.com/danielbauer1979/CAS_PredMod.git"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Gb1EdVnT8xy-",
        "outputId": "38aca4fa-909a-4c3e-c8a6-8630285b498a"
      },
      "execution_count": 20,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Cloning into 'CAS_PredMod'...\n",
            "remote: Enumerating objects: 96, done.\u001b[K\n",
            "remote: Counting objects: 100% (18/18), done.\u001b[K\n",
            "remote: Compressing objects: 100% (18/18), done.\u001b[K\n",
            "remote: Total 96 (delta 6), reused 0 (delta 0), pack-reused 78\u001b[K\n",
            "Unpacking objects: 100% (96/96), done.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "Caravan = pd.read_csv('CAS_PredMod/pa_data_caravan.csv', index_col=0)"
      ],
      "metadata": {
        "id": "zhmtJl7D82OM"
      },
      "execution_count": 23,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's split the dataset:"
      ],
      "metadata": {
        "id": "v2IifQXX84Rs"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "Caravan.Purchase = Caravan.Purchase=='Yes'\n",
        "test = Caravan.iloc[0:1000]\n",
        "train = Caravan.iloc[1000:len(Caravan)]\n",
        "X = train.drop(columns = ['Purchase'])\n",
        "y = train.Purchase\n",
        "Xtest = test.drop(columns = ['Purchase'])\n",
        "ytest = test.Purchase"
      ],
      "metadata": {
        "id": "YmQbQjam9BWB"
      },
      "execution_count": 24,
      "outputs": []
    },
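    {
      "cell_type": "markdown",
      "source": [
        "Since purchases are rare, it is worth recording the base rates before modeling (a quick added check); they put the confusion matrices below into perspective:"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Quick check: purchase rates in the training and test splits.\n",
        "print(f\"train purchase rate: {y.mean():.3f}\")\n",
        "print(f\"test purchase rate:  {ytest.mean():.3f}\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },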
    {
      "cell_type": "markdown",
      "source": [
        "### Tree\n",
        "\n",
        "First, let's redo a tree"
      ],
      "metadata": {
        "id": "x6wwpLz79U-X"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "TREE_OOS_MSE = []\n",
        "cv_score = []\n",
        "for i in range(2,40):\n",
        "  tree = DecisionTreeClassifier(max_leaf_nodes=i)\n",
        "  tree.fit(X, y)\n",
        "  ytreehat = tree.predict(Xtest)\n",
        "  TREE_OOS_MSE.append(np.mean((ytest == ytreehat)))\n",
        "plt.plot(range(2,40),TREE_OOS_MSE)"
      ],
      "metadata": {
        "id": "KJmVihnO9UCp"
      },
      "execution_count": null,
      "outputs": []
    },
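    {
      "cell_type": "markdown",
      "source": [
        "One caveat: this sweep picks the tree size using the test set.  A cleaner alternative, sketched below with `cross_val_score` (imported above), selects the size by cross-validation on the training data only; one would then pick the size with the highest cross-validated accuracy."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sketch: choose max_leaf_nodes by 5-fold cross-validation on the training data\n",
        "# instead of the test set (scoring defaults to accuracy for classifiers).\n",
        "cv_scores = []\n",
        "for i in range(2,40):\n",
        "  tree = DecisionTreeClassifier(max_leaf_nodes=i)\n",
        "  cv_scores.append(cross_val_score(tree, X, y, cv=5).mean())\n",
        "plt.plot(range(2,40), cv_scores)\n",
        "plt.xlabel('max_leaf_nodes')\n",
        "plt.ylabel('CV accuracy')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },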
    {
      "cell_type": "markdown",
      "source": [
        "So let's use the insight to build an appropriate tree:"
      ],
      "metadata": {
        "id": "8T2RXHyP_BjU"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "optimal_nodes = TREE_OOS_MSE.index(min(TREE_OOS_MSE))+2\n",
        "Car_tree = DecisionTreeClassifier(max_leaf_nodes=optimal_nodes)\n",
        "Car_tree.fit(X, y)\n",
        "graph, = print_tree(Car_tree, features=X.columns)\n",
        "Image(graph.create_png())"
      ],
      "metadata": {
        "id": "DPUJ1Uru9qTl"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's check the predictions -- first, let's check the confusion matrix:"
      ],
      "metadata": {
        "id": "-hTN6jd9_Gqv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "pred_tree = Car_tree.predict_proba(Xtest)\n",
        "def Extract(lst):\n",
        "  return [item[1] for item in lst]\n",
        "pred_tree = Extract(pred_tree)\n",
        "table = pd.DataFrame({'Purchase':ytest,'pred':(np.array(pred_tree) > 0.5)})\n",
        "table.groupby(['Purchase','pred']).size().unstack('Purchase')"
      ],
      "metadata": {
        "id": "NvPQIulF9sZf"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "And the ROC curve and AUC:"
      ],
      "metadata": {
        "id": "MbeYEct0_Prj"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "fpr, tpr, threshold = roc_curve(ytest, pred_tree)\n",
        "roc_auc = auc(fpr, tpr)\n",
        "plt.title('Receiver Operating Characteristic')\n",
        "plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)\n",
        "plt.legend(loc = 'lower right')\n",
        "plt.plot([0, 1], [0, 1],'r--')\n",
        "plt.xlim([0, 1])\n",
        "plt.ylim([0, 1])\n",
        "plt.ylabel('True Positive Rate')\n",
        "plt.xlabel('False Positive Rate')\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "uNNvpsjB91--"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Boosting"
      ],
      "metadata": {
        "id": "lvzknaCb96qh"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's run a boosting model:"
      ],
      "metadata": {
        "id": "xlQu_oRg_V6R"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "boost = GradientBoostingRegressor(n_estimators=5000, learning_rate=0.01,random_state=1)\n",
        "boost.fit(X, y)"
      ],
      "metadata": {
        "id": "_PX2C-Kw-Hmu"
      },
      "execution_count": null,
      "outputs": []
    },
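    {
      "cell_type": "markdown",
      "source": [
        "Boosting performance depends on the number of stages.  As an added diagnostic (not part of the original lab), `staged_predict` traces the test-set squared error as trees are added:"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sketch: test-set MSE as a function of the number of boosting stages.\n",
        "test_errors = [mean_squared_error(ytest, yhat) for yhat in boost.staged_predict(Xtest)]\n",
        "plt.plot(np.arange(1, len(test_errors) + 1), test_errors)\n",
        "plt.xlabel('Number of boosting stages')\n",
        "plt.ylabel('Test MSE')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },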
    {
      "cell_type": "markdown",
      "source": [
        "To appraise what features matter, let's consider feature importance scores:"
      ],
      "metadata": {
        "id": "w9UBr7X8_YPt"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "feature_importance = boost.feature_importances_*100\n",
        "rel_imp = pd.Series(feature_importance, index=X.columns).sort_values(ascending=False, inplace=False)\n",
        "rel_imp = rel_imp[0:20]\n",
        "print(rel_imp)\n",
        "rel_imp.plot(kind='barh', color='b', ).invert_yaxis()\n",
        "plt.xlabel('Variable Importance')"
      ],
      "metadata": {
        "id": "jp4Bjexf-NRM"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The predictions are:"
      ],
      "metadata": {
        "id": "EyovqtSR_hqL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "pred_boost = boost.predict(Xtest)"
      ],
      "metadata": {
        "id": "VddIufzC-Voe"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Resulting in the following ROC curve and AUC:"
      ],
      "metadata": {
        "id": "kKsmeqdN_l-J"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "fpr, tpr, threshold = roc_curve(ytest, pred_boost)\n",
        "roc_auc = auc(fpr, tpr)\n",
        "plt.title('Receiver Operating Characteristic')\n",
        "plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)\n",
        "plt.legend(loc = 'lower right')\n",
        "plt.plot([0, 1], [0, 1],'r--')\n",
        "plt.xlim([0, 1])\n",
        "plt.ylim([0, 1])\n",
        "plt.ylabel('True Positive Rate')\n",
        "plt.xlabel('False Positive Rate')\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "Lzc8IX1J-aKc"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "So quite an improvement"
      ],
      "metadata": {
        "id": "iMPyXoxw_ps8"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "###Random Forest"
      ],
      "metadata": {
        "id": "II5Xb1RT-a-l"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's also run a random forest:"
      ],
      "metadata": {
        "id": "SxOEFk5Q_tGJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "rf = RandomForestRegressor(max_features=20, n_estimators=500, random_state=1)\n",
        "rf.fit(X, y)"
      ],
      "metadata": {
        "id": "aOXgXBgU-dd6"
      },
      "execution_count": null,
      "outputs": []
    },
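    {
      "cell_type": "markdown",
      "source": [
        "Since each tree in the forest is grown on a bootstrap sample, roughly a third of the observations are left out of each tree; these *out-of-bag* observations provide a built-in validation estimate.  A small sketch, refitting with the standard `oob_score=True` option:"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sketch: out-of-bag R^2 estimate from the observations each tree did not see.\n",
        "rf_oob = RandomForestRegressor(max_features=20, n_estimators=500, oob_score=True, random_state=1)\n",
        "rf_oob.fit(X, y)\n",
        "rf_oob.oob_score_"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },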
    {
      "cell_type": "markdown",
      "source": [
        "Feature importance scores are:"
      ],
      "metadata": {
        "id": "OP9MaM0h_wFr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "Importance_ = pd.DataFrame({'Importance':rf.feature_importances_*100}, index=X.columns)\n",
        "Importance = Importance_.sort_values('Importance', axis=0, ascending=False)[0:20]\n",
        "Importance.plot(kind='barh', color='b', ).invert_yaxis()\n",
        "plt.xlabel('Variable Importance')\n",
        "plt.gca().legend_ = None"
      ],
      "metadata": {
        "id": "BN8FUuQ1-hay"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "So quite a difference.\n",
        "\n",
        "Let's look at the predictions:"
      ],
      "metadata": {
        "id": "f6t_rsaH_zsZ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "pred_rf = rf.predict(Xtest)"
      ],
      "metadata": {
        "id": "1Ff_lxHt-un9"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "And ROC curve/AUC:"
      ],
      "metadata": {
        "id": "y01n0qIH_5mA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "fpr, tpr, threshold = roc_curve(ytest, pred_rf)\n",
        "roc_auc = auc(fpr, tpr)\n",
        "plt.title('Receiver Operating Characteristic')\n",
        "plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)\n",
        "plt.legend(loc = 'lower right')\n",
        "plt.plot([0, 1], [0, 1],'r--')\n",
        "plt.xlim([0, 1])\n",
        "plt.ylim([0, 1])\n",
        "plt.ylabel('True Positive Rate')\n",
        "plt.xlabel('False Positive Rate')\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "UDlvRmp7-yaz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "So not quite the same performance as the boosted model."
      ],
      "metadata": {
        "id": "Ri3NA_kp_7fY"
      }
    }
  ]
}