
Statistics Papers
Document Type
Journal Article
Date of this Version
2004
Publication Source
Journal of the American Statistical Association
Volume
99
Issue
466
Start Page
303
Last Page
313
DOI
10.1198/016214504000000287
Abstract
We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our dataset of 2.9 million months of credit-card activity. We use stepwise selection to find predictors of these from a mix of payment history, debt load, demographics, and their interactions. This combination of rare responses and over 67,000 possible predictors leads to a challenging modeling question: How does one separate coincidental from useful predictors? We show that three modifications turn stepwise regression into an effective methodology for predicting bankruptcy. Our version of stepwise regression (1) organizes calculations to accommodate interactions, (2) exploits modern decision theoretic criteria to choose predictors, and (3) conservatively estimates p-values to handle sparse data and a binary response. Omitting any one of these leads to poor performance. A final step in our procedure calibrates regression predictions. With these modifications, stepwise regression predicts bankruptcy as well as, if not better than, recently developed data-mining tools. When sorted, the largest 14,000 resulting predictions hold 1,000 of the 1,800 bankruptcies hidden in a validation sample of 2.3 million observations. If the cost of missing a bankruptcy is 200 times that of a false positive, our predictions incur less than 2/3 of the costs of classification errors produced by the tree-based classifier C4.5.
Copyright/Permission Statement
This is an Accepted Manuscript of an article published by Taylor & Francis in Journal of the American Statistical Association on 31 Dec 2011, available online: http://www.tandfonline.com/10.1198/016214504000000287.
Keywords
AIC, Bonferroni, Cp, calibration, hard thresholding, risk inflation criterion (RIC), step-down testing, stepwise regression
Recommended Citation
Foster, D. P. (2004). Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy. Journal of the American Statistical Association, 99(466), 303-313. http://dx.doi.org/10.1198/016214504000000287
Date Posted: 27 November 2017