Date of Award
Doctor of Philosophy (PhD)
Linda H. Zhao
We consider several statistical approaches to binary classification and multiple hypothesis testing problems. Situations in which a binary choice must be made are common in science. Such choices usually involve uncertainty, and many statistical techniques have been proposed to help researchers separate signal from noise in reasonable ways. For example, in genetic studies, one may want to identify the genes that affect a certain biological process from among a larger set of genes. In such examples, costs are attached to making incorrect choices, and many choices must be made at the same time. Reasonable ways of modeling the cost structure and choosing appropriate criteria for evaluating the performance of statistical techniques are needed. The three chapters of this dissertation propose Bayesian methods that address these issues.
In the first chapter, we focus on an empirical Bayes approach to a popular formulation of the binary classification problem. In this framework, observations are treated as independent draws from a hierarchical model with a mixture prior distribution. The mixture prior combines prior distributions for the ``noise'' and for the ``signal'' observations. In the literature, parametric assumptions are usually made about the prior distribution from which the ``signal'' observations come. We suggest a Bayes classification rule that minimizes the expectation of a flexible and easily interpretable mixture loss function, which combines constant penalties for false positive misclassifications with $L_2$ penalties for false negative misclassifications. Due in part to the form of the loss function, empirical Bayes techniques can then be used to construct the Bayes classification rule without specifying the ``signal'' part of the mixture prior distribution. The proposed classification technique builds directly on the nonparametric mixture prior approach proposed by Raykar and Zhao (2010, 2011).
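As a rough illustration of how such a rule could be built from the marginal density alone, the sketch below compares the posterior expected losses of the two decisions. It rests on assumptions that are not stated in the abstract: observations follow $x \mid \theta \sim N(\theta, 1)$, ``noise'' means $\theta = 0$ with prior weight `p0`, a false positive costs a constant `c`, and a false negative costs $\theta^2$. Under these assumptions, $E[\theta^2 \mid x] = 1 + x^2 + 2x\,m'(x)/m(x) + m''(x)/m(x)$, where $m$ is the marginal density of $x$, so the ``signal'' prior never has to be written down. All function names are hypothetical, and the closed-form marginal is included only so the identity can be checked; in practice $m$ and its derivatives would have to be estimated from the data.

```python
import numpy as np
from scipy.stats import norm


def posterior_second_moment(x, m, dm, d2m):
    """E[theta^2 | x] when x | theta ~ N(theta, 1) (assumed model).

    Uses only the marginal density m and its first two derivatives.
    """
    return 1.0 + x**2 + 2.0 * x * dm(x) / m(x) + d2m(x) / m(x)


def classify_signal(x, p0, c, m, dm, d2m):
    """Return True where declaring 'signal' has the smaller posterior risk.

    Assumed loss: constant c for a false positive, theta^2 for a false negative.
    """
    post_null = p0 * norm.pdf(x) / m(x)          # P(theta = 0 | x)
    risk_declare_signal = c * post_null          # expected false positive cost
    risk_declare_noise = posterior_second_moment(x, m, dm, d2m)
    return risk_declare_noise > risk_declare_signal


def make_test_marginal(p0, tau):
    """Hypothetical check case: prior p0 * delta_0 + (1 - p0) * N(0, tau^2).

    The marginal is then p0 * N(0, 1) + (1 - p0) * N(0, 1 + tau^2).
    """
    v = 1.0 + tau**2
    sd = np.sqrt(v)

    def m(x):
        return p0 * norm.pdf(x) + (1 - p0) * norm.pdf(x, scale=sd)

    def dm(x):
        return -x * p0 * norm.pdf(x) - (x / v) * (1 - p0) * norm.pdf(x, scale=sd)

    def d2m(x):
        return ((x**2 - 1) * p0 * norm.pdf(x)
                + (x**2 / v**2 - 1 / v) * (1 - p0) * norm.pdf(x, scale=sd))

    return m, dm, d2m


def exact_second_moment(x, p0, tau):
    """Closed-form E[theta^2 | x] for the hypothetical check case above."""
    v = 1.0 + tau**2
    s = tau**2 / v                               # posterior shrinkage factor
    sig = (1 - p0) * norm.pdf(x, scale=np.sqrt(v))
    w1 = sig / (p0 * norm.pdf(x) + sig)          # P(signal | x)
    return w1 * (s + (s * x) ** 2)
```

With `p0 = 0.9`, `tau = 3` and `c = 1`, an observation near zero is left as ``noise'' while a large observation is declared ``signal'', which matches the intuition that the $L_2$ term penalizes missing large effects more heavily than missing small ones.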
Many different criteria can be used to judge the success of a classification procedure. One very useful criterion, the False Discovery Rate (FDR), was introduced by Benjamini and Hochberg (1995). The FDR is defined as the expected proportion of false positive results among the observations declared to be ``signal'', and it is a reasonable criterion to target in many applications. Bayesian versions of the false discovery rate, the so-called positive false discovery rate (pFDR) and local false discovery rate, were proposed by Storey (2002, 2003) and Efron and coauthors (2001), respectively. There is an interesting connection between the local false discovery rate and the nonparametric mixture prior approach for binary classification problems. The second chapter focuses on this link and compares various approaches for estimating Bayesian false discovery rates.
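For readers unfamiliar with these quantities, the following minimal sketch shows the standard Benjamini-Hochberg step-up procedure, which controls the FDR at level `q` for independent p-values, together with the two-groups expression for the local false discovery rate, $p_0 f_0(x) / m(x)$, where $f_0$ is the ``noise'' density and $m$ is the marginal density. The function and argument names are illustrative choices, and the sketch is not the estimation approach compared in the dissertation.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean array marking the hypotheses declared 'signal'.
    """
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                        # indices of p-values, ascending
    thresholds = q * np.arange(1, n + 1) / n     # k * q / n for k = 1..n
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest k with p_(k) <= k*q/n
        reject[order[:k + 1]] = True             # reject everything up to rank k
    return reject


def local_fdr(f0_x, m_x, p0):
    """Two-groups local false discovery rate p0 * f0(x) / m(x).

    f0_x and m_x are the null and marginal densities evaluated at x.
    The result is clipped to [0, 1] because estimated densities can
    push the ratio slightly above 1.
    """
    return np.clip(p0 * np.asarray(f0_x) / np.asarray(m_x), 0.0, 1.0)
```

Note that the procedure is step-up: a p-value that fails its own threshold is still rejected if a larger p-value passes, as with `benjamini_hochberg([0.02, 0.03, 0.04])`, where all three hypotheses are rejected at `q = 0.05`.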
The third chapter describes a connection between the celebrated Neyman-Pearson lemma and the area under the receiver operating characteristic (ROC) curve (AUC) when the observations to be classified come from a pair of normal distributions. Using this connection, it is possible to derive a classification rule that maximizes the AUC for binormal data.
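One way to see this connection numerically is sketched below, under the assumption that both normal distributions are known. By the Neyman-Pearson lemma, ranking observations by the likelihood ratio gives the highest true positive rate at every false positive rate, so its ROC curve lies on or above that of any other score and its AUC is at least as large. The sketch compares the likelihood-ratio score with the raw observation, whose AUC has the closed form $\Phi\big((\mu_1 - \mu_0)/\sqrt{\sigma_0^2 + \sigma_1^2}\big)$. The helper names and parameter values are illustrative and are not taken from the dissertation.

```python
import numpy as np
from scipy.stats import norm, rankdata


def empirical_auc(scores_neg, scores_pos):
    """Mann-Whitney estimate of P(score_pos > score_neg)."""
    scores = np.concatenate([scores_neg, scores_pos])
    ranks = rankdata(scores)                     # average ranks, 1-based
    n0, n1 = len(scores_neg), len(scores_pos)
    rank_sum_pos = ranks[n0:].sum()
    return (rank_sum_pos - n1 * (n1 + 1) / 2.0) / (n0 * n1)


def log_likelihood_ratio(x, mu0, sd0, mu1, sd1):
    """Log of f1(x) / f0(x) for two normal densities.

    When sd0 != sd1 this is quadratic in x, so the induced ranking
    differs from the ranking by x itself.
    """
    return norm.logpdf(x, mu1, sd1) - norm.logpdf(x, mu0, sd0)


def binormal_auc_raw(mu0, sd0, mu1, sd1):
    """Closed-form AUC when the raw observation x is used as the score."""
    return norm.cdf((mu1 - mu0) / np.sqrt(sd0**2 + sd1**2))


def compare_scores(mu0, sd0, mu1, sd1, n=20000, seed=0):
    """Simulate binormal data; return (AUC of raw score, AUC of LR score)."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(mu0, sd0, size=n)            # 'noise' observations
    x1 = rng.normal(mu1, sd1, size=n)            # 'signal' observations
    auc_raw = empirical_auc(x0, x1)
    auc_lr = empirical_auc(
        log_likelihood_ratio(x0, mu0, sd0, mu1, sd1),
        log_likelihood_ratio(x1, mu0, sd0, mu1, sd1),
    )
    return auc_raw, auc_lr
```

When the two variances are equal, the log likelihood ratio is monotone in the observation and the two scores give the same ranking. The gain appears when the variances differ, for example with `compare_scores(0.0, 1.0, 0.5, 3.0)`.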
Fuki, Igar, "Bayesian Aspects of Classification Procedures" (2013). Publicly Accessible Penn Dissertations. 863.