Community Membership Testing And Missing Value Imputation: Theory And Methods

Yezheng Li, University of Pennsylvania


Modern machine learning methods have been widely applied in genomics and metagenomics data analysis. This dissertation focuses on unsupervised machine learning and discusses community membership testing, matrix completion, and generative adversarial nets, with applications to several problems in genomics. While the analysis of singular subspaces based on principal component analysis has a long history, the first chapter focuses on recent theory for the statistical distribution of the singular subspace in the setting of weighted stochastic block models. The theoretical results yield the statistical distribution of a test statistic for a two-sample test of membership assignments.

Chapter two of this dissertation deals with the problem of estimating bacterial composition from sparse count data. A nuclear-norm penalized likelihood estimator based on a multinomial model is proposed to estimate the centered log-ratio (clr) matrix, and an efficient optimization algorithm using the generalized accelerated proximal gradient method is developed. In microbiome studies, the clr transformation is most commonly applied, for downstream statistical analysis, after bacterial composition has been estimated from the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization then yields many zero proportions, which makes the clr transformation infeasible. Our method estimates the clr transformation directly, taking into account its low-rank property. Theoretical upper bounds are established, and simulation studies and a real data study demonstrate that the proposed estimator outperforms the naive estimators.
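To make the zero-count problem concrete, the following minimal sketch shows why the clr transformation breaks down under naive count normalization. It is an illustration only, not the estimator proposed in this dissertation: the function name `clr` and the toy count matrix are assumptions introduced here, and the standard definition of the clr (log proportions centered by their row mean) is used.

```python
import numpy as np

def clr(composition):
    """Centered log-ratio transform applied to each row (sample)."""
    log_comp = np.log(composition)
    # Subtract each row's mean log-proportion (log of the geometric mean).
    return log_comp - log_comp.mean(axis=1, keepdims=True)

# Hypothetical read counts: 2 samples x 4 taxa.
# The second sample has a zero count for one rare taxon.
counts = np.array([[10.0, 20.0, 30.0, 40.0],
                   [5.0, 0.0, 15.0, 30.0]])

# Naive composition estimate: divide each row by its total read count.
naive = counts / counts.sum(axis=1, keepdims=True)

# Silence the expected log(0) warnings so the result can be inspected.
with np.errstate(divide="ignore", invalid="ignore"):
    clr_naive = clr(naive)

print(clr_naive)
```

The first row is finite and sums to zero, as a clr vector should. In the second row a single zero proportion sends the row mean of the logs to negative infinity, so no entry of that row remains finite.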

Chapter three presents a deep learning method using a generative adversarial net (GAN) for missing data imputation of gene expressions in the GTEx dataset. A fundamental biological question is to what extent the gene expression of a subset of tissues can be used to recover the full transcriptome of other tissues. To address this question, we present a method for tissue-level gene expression imputation based on generative adversarial imputation networks. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We show that generative adversarial nets outperform the comparison methods in terms of predictive performance.