Community Membership Testing and Missing Value Imputation: Theory and Methods

Yezheng Li, University of Pennsylvania


Modern machine learning methods are widely applied in genomic and metagenomic data analysis. This dissertation focuses on unsupervised machine learning and discusses community membership testing, matrix completion, and generative adversarial nets, with applications to several problems in genomics. While the analysis of singular subspaces based on principal component analysis has a long history, the first chapter focuses on recent theory for the statistical distribution of singular subspaces in the setting of weighted stochastic block models. The theoretical results yield the distribution of a test statistic for a two-sample test of membership assignments. The second chapter deals with estimating bacterial composition from sparse count data. In microbiome studies, the centered log-ratio (CLR) transformation is most commonly applied, for downstream statistical analysis, after bacterial composition has been estimated from the sequencing read counts. Because of limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation by count normalization then produces many zero proportions, which makes the CLR transformation infeasible. The proposed method instead estimates the CLR matrix directly through nuclear-norm penalized likelihood estimation based on a multinomial model, taking into account the low-rank property of the matrix. An efficient optimization algorithm using the generalized accelerated proximal gradient is developed. Theoretical upper bounds are established, and simulation studies and a real data study demonstrate that the proposed estimator outperforms the naive estimators.
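To make the zero-count problem concrete, the following minimal Python sketch shows why count normalization followed by the CLR transformation breaks down when a taxon has zero reads. It is an illustration only, not the dissertation's estimator: the count matrix and the names `clr`, `counts`, and `naive` are hypothetical, and the CLR is taken in its standard form (the log of each proportion minus the mean of the logs within a sample).

```python
import numpy as np

def clr(composition):
    """Centered log-ratio transform applied to each row (sample)."""
    log_comp = np.log(composition)
    return log_comp - log_comp.mean(axis=1, keepdims=True)

# Hypothetical read counts: 2 samples x 4 taxa; sample 0 has one zero count.
counts = np.array([[10.0, 5.0, 0.0, 85.0],
                   [20.0, 30.0, 10.0, 40.0]])

# Naive composition estimate: divide each row by its total read count.
naive = counts / counts.sum(axis=1, keepdims=True)

# log(0) is -inf, so the CLR of a row containing a zero is not finite.
with np.errstate(divide="ignore", invalid="ignore"):
    clr_naive = clr(naive)

print(np.isfinite(clr_naive).all(axis=1))  # which samples have a usable CLR
```

In this toy example only the sample without zero counts has a finite CLR vector, which is the situation that motivates estimating the CLR matrix directly rather than transforming naive proportions.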

Subject Area

Applied Mathematics|Artificial intelligence|Genetics

Recommended Citation

Li, Yezheng, "Community Membership Testing and Missing Value Imputation: Theory and Methods" (2020). Dissertations available from ProQuest. AAI28258408.