Topics in Tree-Based Methods

Degree type
Doctor of Philosophy (PhD)
Classification and Regression Trees
Decision Trees
Exploratory Data Analysis
Machine Learning
Model Visualization
Statistics and Probability

This work introduces methods and associated software for enhancing the interpretability of fitted models, with emphasis on classification and regression trees. We begin in Chapter 1 by describing novel techniques for growing classification and regression trees designed to induce visually interpretable trees. This is achieved by penalizing splits that extend the subset of features used in a particular branch of the tree. After a brief motivation, we summarize existing methods and introduce new ones, providing illustrative examples throughout. Using a number of real classification and regression datasets, we find that these procedures can offer more interpretable fits than the CART methodology with very modest increases in out-of-sample loss.

These techniques are implemented in the R package itree, described in Chapter 2. In addition to the procedures introduced in Chapter 1, itree implements a method for visualizing the out-of-sample risk as well as the usual classification and regression tree methodologies. Chapter 2 presents illustrative examples and demonstrates itree's usage for aspects of the software that are novel or unique to itree.

Whereas Chapters 1 and 2 relate to tree-based methods, Chapter 3 describes Individual Conditional Expectation (ICE) plots, a methodology for visualizing the model estimated by any supervised learning algorithm. Classical partial dependence plots (PDPs) help visualize the average partial relationship between the predicted response and one or more features. In the presence of substantial interaction effects, the partial response relationship can be heterogeneous. Thus, an average curve, such as the PDP, can obfuscate the complexity of the modeled relationship. Accordingly, ICE plots refine the partial dependence plot by graphing the functional relationship between the predicted response and the feature for individual observations. ICE plots highlight the variation in the fitted values across the range of a covariate, suggesting where and to what extent heterogeneities might exist. In addition to providing a plotting suite for exploratory analysis, we include a visual test for additive structure in the data-generating model. The procedures outlined in Chapter 3 are available in the R package ICEbox.
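The relationship between ICE curves and the PDP described above can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use or reproduce the ICEbox API. The names ice_curves, partial_dependence, and toy_model are hypothetical, and the toy model f(x) = x0 * x1 is an assumption chosen only because it contains a pure interaction effect.

```python
from typing import Callable, List, Sequence

Row = List[float]


def ice_curves(
    predict: Callable[[Row], float],
    X: Sequence[Row],
    feature: int,
    grid: Sequence[float],
) -> List[List[float]]:
    """Return one curve per observation.

    For each row, the chosen feature is swept over `grid` while all other
    feature values stay at their observed values, and the model prediction
    is recorded at every grid point.
    """
    curves = []
    for row in X:
        curve = []
        for value in grid:
            modified = list(row)       # copy so the data are not mutated
            modified[feature] = value  # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(curves: Sequence[Sequence[float]]) -> List[float]:
    """Pointwise average of the ICE curves, i.e. the PDP on the same grid."""
    n = len(curves)
    return [sum(column) / n for column in zip(*curves)]


# Hypothetical "fitted model" with a pure interaction (an assumption for
# illustration only): the effect of x0 changes sign with x1.
def toy_model(row: Row) -> float:
    return row[0] * row[1]


X = [[0.0, -1.0], [0.0, 1.0]]  # two observations with opposite x1
grid = [-1.0, 0.0, 1.0]        # values of x0 to sweep over

curves = ice_curves(toy_model, X, feature=0, grid=grid)
pdp = partial_dependence(curves)
```

In this toy setting the two ICE curves slope in opposite directions, while their average, the PDP, is flat at zero across the grid. This is the kind of heterogeneity that an average curve can hide and that plotting the individual curves reveals.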

Andreas Buja