Topics in Tree-Based Methods

Goldstein, Alex Lauf

Files

Goldstein_upenngdas_0175C_11077.pdf (2.68 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Statistics

Subject

Classification and Regression Trees
Decision Trees
Exploratory Data Analysis
Machine Learning
Model Visualization
Statistics and Probability

Copyright date

2015-11-16T20:14:00-08:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/28083

Abstract

This work introduces methods and associated software for enhancing the interpretability of fitted models, with emphasis on classification and regression trees. We begin in Chapter 1 by describing novel techniques for growing classification and regression trees designed to induce visually interpretable trees. This is achieved by penalizing splits that extend the subset of features used in a particular branch of the tree. After a brief motivation, we summarize existing methods and introduce new ones, providing illustrative examples throughout. Using a number of real classification and regression datasets, we find that these procedures can offer more interpretable fits than the CART methodology with very modest increases in out-of-sample loss. These techniques are implemented in the R package itree, described in Chapter 2. In addition to the procedures introduced in Chapter 1, itree implements a method for visualizing the out-of-sample risk as well as the usual classification and regression tree methodologies. Chapter 2 presents illustrative examples and demonstrates itree's usage for aspects of the software that are novel or unique to itree. Whereas Chapters 1 and 2 relate to tree-based methods, Chapter 3 describes Individual Conditional Expectation (ICE) plots, a methodology for visualizing the model estimated by any supervised learning algorithm. Classical partial dependence plots (PDPs) help visualize the average partial relationship between the predicted response and one or more features. In the presence of substantial interaction effects, the partial response relationship can be heterogeneous. Thus, an average curve, such as the PDP, can obfuscate the complexity of the modeled relationship. Accordingly, ICE plots refine the partial dependence plot by graphing the functional relationship between the predicted response and the feature for individual observations. ICE plots highlight the variation in the fitted values across the range of a covariate, suggesting where and to what extent heterogeneities might exist. In addition to providing a plotting suite for exploratory analysis, we include a visual test for additive structure in the data-generating model. The procedures outlined in Chapter 3 are available in the R package ICEbox.
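
The relationship between ICE curves and the PDP described in the abstract can be illustrated with a minimal Python sketch. It is not the ICEbox implementation and does not use its API. The names `ice_curves` and `toy_predict`, the toy model f(x) = x0 * x1, and the four data points are all assumptions chosen for illustration. Each ICE curve is obtained by fixing one observation's other features and varying the feature of interest over a grid; the PDP is taken here as the pointwise average of those curves.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """Return an (n_obs, n_grid) array; row i is observation i's ICE curve.

    `predict` is any function mapping a 2-D feature array to predictions
    (a stand-in for a fitted model; hypothetical, not a library API).
    """
    X = np.asarray(X, dtype=float)
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value      # set the feature to the grid value for every row
        curves[:, j] = predict(X_mod)  # other features keep their observed values
    return curves

def toy_predict(X):
    # Hypothetical "fitted model" with a pure interaction: f(x) = x0 * x1
    return X[:, 0] * X[:, 1]

# Illustrative data: x1 is -1 for two observations and +1 for the other two.
X = np.array([[0.0, -1.0], [0.5, -1.0], [0.0, 1.0], [0.5, 1.0]])
grid = np.linspace(-1.0, 1.0, 5)

curves = ice_curves(toy_predict, X, feature=0, grid=grid)
pdp = curves.mean(axis=0)  # PDP as the average of the ICE curves

print(curves)
print(pdp)
```

In this toy setting the individual curves slope downward for the observations with x1 = -1 and upward for those with x1 = +1, so their average is flat at zero. The averaged curve alone would suggest that x0 has no effect, which is the kind of heterogeneity the abstract says a PDP can hide.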

Advisor

Andreas Buja

Date of degree

2014-01-01

Collection

Dissertations and Theses