This lab on Decision Trees in R is an abbreviated version of p. 324-331 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Fall 2016 in tidyverse format by Amelia McNamara and R. Jordan Crouser at Smith College. Want to follow along on your own machine? Download the .Rmd or Jupyter Notebook version.

The tree library is useful for constructing classification and regression trees. We'll start by using classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, and so we begin by converting it to a binary variable. We use the ifelse() function to create a variable, called High, which takes on a value of Yes if the Sales variable exceeds 8, and takes on a value of No otherwise. We now use the tree() function to fit a classification tree in order to predict High using all variables but Sales (that would be a little silly). The syntax of the tree() function is quite similar to that of the lm() function.

The summary() function lists the variables that are used as internal nodes (forming decision points) in the tree, the number of terminal nodes, and the (training) error rate. We see that the training error rate is 9%. For classification trees, the deviance reported in the output of summary() is given by

$$-2 \sum_m \sum_k n_{mk} \log \hat{p}_{mk},$$

where $n_{mk}$ is the number of observations in the $m^{th}$ terminal node that belong to the $k^{th}$ class. A small deviance indicates a tree that provides a good fit to the (training) data. The residual mean deviance reported is simply the deviance divided by $n - |T_0|$.

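The lab's own code chunks are not reproduced here. The following is a minimal sketch of these steps, assuming the Carseats data come from the ISLR package; the lab itself uses a tidyverse workflow, whereas this sketch sticks to base R for brevity:

```r
library(tree)   # classification and regression trees
library(ISLR)   # assumed source of the Carseats data set

# Create the binary response: High = "Yes" if Sales > 8, "No" otherwise
carseats <- Carseats
carseats$High <- as.factor(ifelse(carseats$Sales > 8, "Yes", "No"))

# Fit a classification tree predicting High from everything except Sales
tree_carseats <- tree(High ~ . - Sales, data = carseats)

# Variables used as internal nodes, number of terminal nodes, training error
summary(tree_carseats)
```
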
One of the most attractive properties of trees is that they can be graphically displayed. We use the plot() function to display the tree structure, and the text() function to display the node labels. The argument pretty = 0 instructs R to include the category names for any qualitative predictors, rather than simply displaying a letter for each category. The most important indicator of High sales appears to be shelving location, since the first branch differentiates Good locations from Bad and Medium locations.

If we just type the name of the tree object, R prints output corresponding to each branch of the tree. R displays the split criterion (e.g. $Price<142$), the number of observations in that branch, the deviance, the overall prediction for the branch (Yes or No), and the fraction of observations in that branch that take on values of Yes and No. Branches that lead to terminal nodes are indicated using asterisks.

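A sketch of the plotting and printing steps, reusing the tree_carseats object from the sketch above:

```r
# Display the tree structure and label the nodes;
# pretty = 0 keeps full category names for qualitative predictors
plot(tree_carseats)
text(tree_carseats, pretty = 0)

# Print the branch-by-branch output (split criterion, n, deviance,
# overall prediction, class proportions; terminal nodes marked with *)
tree_carseats
```
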
In order to properly evaluate the performance of a classification tree on the data, we must estimate the test error rather than simply computing the training error. We first split the observations into a training set and a test set; we then fit the tree on the training set and, finally, evaluate the tree's performance on the test data. The predict() function can be used for this purpose. In the case of a classification tree, the argument type="class" instructs R to return the actual class prediction. This approach leads to correct predictions for around 77% of the test data set.

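A sketch of this train/test evaluation, assuming a 200-observation test set (as the fractions quoted in this lab suggest); the seed is an assumption, so the accuracy from this sketch may not match the 77% above exactly:

```r
set.seed(1)  # assumed seed; the lab's own seed may differ

# Hold out 200 observations as a test set
train <- sample(1:nrow(carseats), nrow(carseats) - 200)
carseats_test <- carseats[-train, ]
High_test <- carseats$High[-train]

# Refit the tree using only the training observations
tree_carseats <- tree(High ~ . - Sales, data = carseats, subset = train)

# type = "class" returns the predicted class labels
tree_pred <- predict(tree_carseats, carseats_test, type = "class")

# Confusion matrix and overall test accuracy
table(tree_pred, High_test)
mean(tree_pred == High_test)
```
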
Next, we consider whether pruning the tree might lead to improved results. The function cv.tree() performs cross-validation in order to determine the optimal level of tree complexity; cost complexity pruning is used in order to select a sequence of trees for consideration. We use the argument FUN = prune.misclass in order to indicate that we want the classification error rate as our cost function to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance. The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used ($k$, which corresponds to $\alpha$ in the equation we saw in lecture). Note that, despite the name, the dev field corresponds to the cross-validation error rate in this instance. The 7-node tree is selected by cross-validation. Let's plot the error rate as a function of size: we see from this plot that the tree with 7 terminal nodes results in the lowest cross-validation error rate, with 59 cross-validation errors.

We now apply the prune.misclass() function in order to prune the tree to obtain the seven-node tree by setting the parameter best = 7. How well does this pruned tree perform on the test data set? Once again, we can apply the predict() function to find out. Now $\frac{(96+54)}{200} =$ 75% of the test observations are correctly classified, so the pruning process produced a more interpretable tree, but at a slight cost in classification accuracy.

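A sketch of the cross-validation and pruning steps, continuing from the objects defined in the previous sketch (the cross-validation seed is an assumption):

```r
set.seed(3)  # assumed seed for the cross-validation folds

# Cross-validate, using classification error (not deviance) to guide pruning
cv_carseats <- cv.tree(tree_carseats, FUN = prune.misclass)
cv_carseats  # reports size, dev (CV errors), and k (cost-complexity parameter)

# Plot cross-validation error against tree size
plot(cv_carseats$size, cv_carseats$dev, type = "b",
     xlab = "Number of terminal nodes", ylab = "CV classification errors")

# Prune back to the tree size chosen by cross-validation
prune_carseats <- prune.misclass(tree_carseats, best = 7)
plot(prune_carseats)
text(prune_carseats, pretty = 0)

# Evaluate the pruned tree on the test set
prune_pred <- predict(prune_carseats, carseats_test, type = "class")
table(prune_pred, High_test)
mean(prune_pred == High_test)
```
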
Now let's try fitting a regression tree to the Boston data set from the MASS library. First, we create a training set, and fit the tree to the training data using medv (median home value) as our response. Notice that the output of summary() indicates that only three of the variables have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of squared errors for the tree. Let's plot the tree: the variable lstat measures the percentage of individuals with lower socioeconomic status, and the tree indicates that lower values of lstat correspond to more expensive houses. The tree predicts a median house price of \$46,380 for larger homes ($rm>=7.437$) in suburbs in which residents have high socioeconomic status ($lstat<9.715$).

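A sketch of the regression tree fit, assuming a simple 50/50 train/test split of the Boston data (the lab's own split and seed may differ):

```r
library(MASS)  # Boston housing data
set.seed(1)    # assumed seed

# Split the Boston data in half for training and testing
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

# Fit a regression tree for medv on the training observations
tree_boston <- tree(medv ~ ., data = Boston, subset = train)
summary(tree_boston)  # note that only a few variables are actually used

# Plot the fitted tree
plot(tree_boston)
text(tree_boston, pretty = 0)
```
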
Now we use the cv.tree() function to see whether pruning the tree will improve performance. We can prune the tree using the prune.tree() function as before. Now we'll use the pruned tree to make predictions on the test set: the test set MSE associated with the regression tree is 154.4729. The square root of the MSE is therefore around 12.428, indicating that this model leads to test predictions that are within around \$12,428 of the true median home value for the suburb.

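A sketch of the pruning and test-set evaluation for the regression tree; the tree size passed to prune.tree() here is purely illustrative, since the lab's cross-validation output is not reproduced:

```r
# Cross-validation to see whether pruning improves performance
cv_boston <- cv.tree(tree_boston)
plot(cv_boston$size, cv_boston$dev, type = "b",
     xlab = "Number of terminal nodes", ylab = "Deviance")

# Prune to a chosen size (best = 5 is an illustrative value only)
prune_boston <- prune.tree(tree_boston, best = 5)

# Predict medv for the held-out observations and compute the test MSE
yhat <- predict(prune_boston, newdata = Boston[-train, ])
boston_test <- Boston[-train, "medv"]
mean((yhat - boston_test)^2)        # test MSE
sqrt(mean((yhat - boston_test)^2))  # typical size of the prediction error
```
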
Let's see if we can improve on this result using bagging and random forests. Recall that bagging is simply a special case of a random forest with $m = p$. Therefore, the randomForest() function can be used to perform both random forests and bagging. (The exact results obtained in this section may depend on the version of R and the version of the randomForest package installed on your computer, so don't stress out if you don't match up exactly with the book.) Let's start with bagging; a code sketch for this and the random forest fit appears below. The argument mtry = 13 indicates that all 13 predictors should be considered for each split of the tree -- in other words, that bagging should be done. How well does this bagged model perform on the test set? The test set MSE associated with the bagged regression tree is dramatically smaller than that obtained using an optimally-pruned single tree! We can change the number of trees grown by randomForest() using the ntree argument.

We can grow a random forest in exactly the same way, except that we'll use a smaller value of the mtry argument. By default, randomForest() uses $p/3$ variables when building a random forest of regression trees, and $\sqrt{p}$ variables when building a random forest of classification trees. Here we'll use mtry = 6. The test set MSE is even lower; this indicates that random forests yielded an improvement over bagging in this case.

Using the importance() function, we can view the importance of each variable. Two measures of variable importance are reported. The former is based upon the mean decrease of accuracy in predictions on the out-of-bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees. In the case of regression trees, the node impurity is measured by the training RSS, and for classification trees by the deviance. Plots of these importance measures can be produced using the varImpPlot() function. The results indicate that across all of the trees considered in the random forest, the wealth level of the community (lstat) and the house size (rm) are by far the two most important variables.

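A sketch of the bagging and random forest fits described above, reusing the train indices and boston_test values from the regression tree sketches; Boston has 13 predictors besides medv, so mtry = 13 corresponds to bagging:

```r
library(randomForest)
set.seed(1)  # assumed seed

# Bagging: consider all 13 predictors at every split (mtry = p)
bag_boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
yhat_bag <- predict(bag_boston, newdata = Boston[-train, ])
mean((yhat_bag - boston_test)^2)  # test MSE for bagging

# The number of trees can be controlled with ntree
bag_boston_25 <- randomForest(medv ~ ., data = Boston, subset = train,
                              mtry = 13, ntree = 25)

# Random forest: consider only 6 predictors at each split
rf_boston <- randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = 6, importance = TRUE)
yhat_rf <- predict(rf_boston, newdata = Boston[-train, ])
mean((yhat_rf - boston_test)^2)  # test MSE for the random forest

# Variable importance: accuracy-based and node-purity-based measures
importance(rf_boston)
varImpPlot(rf_boston)
```
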
Now we'll use the gbm package, and within it the gbm() function, to fit boosted regression trees to the Boston data set. We run gbm() with the option distribution="gaussian" since this is a regression problem; if it were a binary classification problem, we would use distribution="bernoulli". The argument n.trees=5000 indicates that we want 5000 trees, and the option interaction.depth=4 limits the depth of each tree. The summary() function produces a relative influence plot and also outputs the relative influence statistics. We see that lstat and rm are again the most important variables by far. We can also produce partial dependence plots for these two variables. These plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables. In this case, as we might expect, median house prices are increasing with rm and decreasing with lstat.

Now let's use the boosted model to predict medv on the test set. The test MSE obtained is similar to the test MSE for random forests. If we want to, we can perform boosting with a different value of the shrinkage parameter $\lambda$. The default value is 0.001, but this is easily modified. Here we take $\lambda = 0.1$. In this case, using $\lambda = 0.1$ leads to a slightly lower test MSE than $\lambda = 0.001$.

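A sketch of the boosting workflow, again reusing the train/test split from the regression tree sketches; the seed is an assumption:

```r
library(gbm)
set.seed(1)  # assumed seed

# Boosted regression trees: gaussian loss for a continuous response
boost_boston <- gbm(medv ~ ., data = Boston[train, ],
                    distribution = "gaussian", n.trees = 5000,
                    interaction.depth = 4)
summary(boost_boston)  # relative influence plot and statistics

# Partial dependence plots for the two most influential variables
plot(boost_boston, i = "rm")
plot(boost_boston, i = "lstat")

# Test-set predictions and MSE
yhat_boost <- predict(boost_boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat_boost - boston_test)^2)

# Refit with shrinkage = 0.1 (the lab contrasts this with a default of 0.001)
boost_boston_2 <- gbm(medv ~ ., data = Boston[train, ],
                      distribution = "gaussian", n.trees = 5000,
                      interaction.depth = 4, shrinkage = 0.1, verbose = FALSE)
yhat_boost_2 <- predict(boost_boston_2, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat_boost_2 - boston_test)^2)
```
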
To get credit for this lab, post your responses to the following questions to Moodle: https://moodle.smith.edu/mod/quiz/view.php?id=264671

What's one real-world scenario where you might try using Bagging?

What's one real-world scenario where you might try using Random Forests?

What's one real-world scenario where you might try using Boosting?