Discrete Methods in Statistics: Feature Selection and Fairness-Aware Data Mining

Johnson, Kory

Files

Johnson_upenngdas_0175C_12231.pdf (1.44 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Statistics

Subject

Fairness Aware Data Mining
Feature Selection
Forward Stepwise
Post-Selection Inference
Sequential Testing
Submodular
Statistics and Probability

Copyright date

2016-11-29T00:00:00-08:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/28636

Abstract

This dissertation investigates issues that arise in models that change discretely. Models are often constructed by including or excluding features according to some criterion. These discrete changes are challenging to analyze because of correlation between features. Feature selection is the problem of identifying an appropriate set of features to include in a model, while fairness-aware data mining is the problem of removing the influence of protected features from a model. This dissertation provides a framework for understanding each problem and algorithms for accomplishing the desired goal. The feature selection problem is addressed through the framework of sequential hypothesis testing. We elucidate the statistical challenges of repeatedly using inference in this domain and demonstrate how current methods fail to address them. Our algorithms build on classically motivated multiple testing procedures to control measures of false rejections when using hypothesis testing during forward stepwise regression. Furthermore, these methods have much higher power than recent proposals from the conditional inference literature. The fairness-aware data mining community is grappling with fundamental questions concerning fairness in statistical modeling. Tension exists between identifying explainable differences between groups and discriminatory ones. We provide a framework for understanding the connections between fairness and the use of protected information in modeling. With this discussion in hand, generating fair estimates is straightforward.

Advisor

Robert A. Stine
Dean P. Foster

Date of degree

2016-01-01

Collection

Dissertations and Theses