Date of Award
Doctor of Philosophy (PhD)
Edward I. George
A core focus of statistics is determining how much of the variation in data may be attributed to the signal of interest, and how much to noise. When the sources of variation are many and complex, a Bayesian approach to data analysis offers a number of advantages. In this thesis, we propose and implement new Bayesian methods for modeling variation in two general settings. The first setting is high-dimensional linear regression where the unknown error variance is also of interest. Here, we show that a commonly used class of conjugate shrinkage priors can lead to underestimation of the error variance. We then extend the Spike-and-Slab Lasso (SSL, Rockova and George, 2018) to the unknown variance case, using an alternative, independent prior framework. This extended procedure outperforms both the fixed variance approach and alternative penalized likelihood methods on both simulated and real data.
For the second setting, we move from univariate response data where the predictors are known, to multivariate response data in which potential predictors are unobserved. In this setting, we first consider the problem of biclustering, where a motivating example is to find subsets of genes which have similar expression in a subset of patients. For this task, we propose a new biclustering method called Spike-and-Slab Lasso Biclustering (SSLB). SSLB utilizes the SSL prior to find a doubly-sparse factorization of the data matrix via a fast EM algorithm. Applied to both a microarray dataset and a single-cell RNA-sequencing dataset, SSLB recovers biologically meaningful signal in the data.
The second problem we consider in this setting is nonlinear factor analysis. The goal here is to find low-dimensional, unobserved ``factors'' which drive the variation in the high-dimensional observed data in a potentially nonlinear fashion. For this purpose, we develop factor analysis BART (faBART), an MCMC algorithm which alternates sampling from the posterior of (a) the factors and (b) a functional approximation to the mapping from the factors to the data. The latter step utilizes Bayesian Additive Regression Trees (BART, Chipman et al., 2010). On a variety of simulation settings, we demonstrate that with only the observed data as the input, faBART is able to recover both the unobserved factors and the nonlinear mapping.
Moran, Gemma Elyse, "Bayesian Approaches For Modeling Variation" (2019). Publicly Accessible Penn Dissertations. 3350.