Community Membership Testing And Missing Value Imputation: Theory And Methods

Degree type
Doctor of Philosophy (PhD)
Graduate group
Applied Mathematics
Discipline
Subject
Applied Mathematics
Copyright date
2021-08-31T20:20:00-07:00
Author
Li, Yezheng
Abstract

Modern machine learning methods have been widely applied in genomics and metagenomics data analysis. This dissertation focuses on the area of unsupervised machine learning and discusses community membership testing, matrix completion, and generative adversarial nets, with applications to several problems in genomics. While the analysis of singular subspaces based on principal component analysis has a long history, the first chapter focuses on the recent theory of the statistical distribution of singular subspaces in the setting of weighted stochastic block models. The theoretical results lead to the statistical distribution of a test statistic in a two-sample test of membership assignments. Chapter two of this dissertation deals with the problem of estimating bacterial composition from sparse count data, where a nuclear-norm penalized likelihood estimation based on a multinomial model is proposed in order to estimate the centered log-ratio (clr) matrix. An efficient optimization algorithm using the generalized accelerated proximal gradient is developed. In microbiome studies, the clr transformation is most commonly applied for downstream statistical analysis after bacterial composition is estimated from the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which makes the clr transformation infeasible. Our method estimates the clr transformation directly, taking into account its low-rank property. Theoretical upper bounds are established, and simulation studies and a real data study demonstrate that the proposed estimator outperforms the naive estimators. Chapter three presents a deep learning method using a generative adversarial net (GAN) for missing data imputation of gene expressions in the GTEx dataset.
A fundamental biological question is to what extent the gene expression of a subset of tissues can be used to recover the full transcriptome of other tissues. To address this challenge, we present a method for tissue-level gene expression imputation based on generative adversarial imputation networks. To increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We show that generative adversarial nets outperform the compared methods in predictive performance.

Advisor
Hongzhe Li
Date of degree
2020-01-01