Degree Name

Doctor of Philosophy (PhD)

First Advisor

Hongzhe Li


Next-generation sequencing (NGS) technologies make very large-scale studies of microbiomes possible without in vitro cultivation. One approach to sequencing-based microbiome studies is to sequence specific genes (often the 16S rRNA gene) to produce a profile of the diversity of bacterial taxa. Alternatively, the NGS-based sequencing strategy known as shotgun metagenomics provides further insights at the molecular level, such as species/strain quantification, gene function analysis, and association studies. Such studies generate large-scale, high-dimensional count and compositional data, which are the focus of this dissertation.

In microbiome studies, the taxa composition is often estimated from the sparse counts of sequencing reads in order to account for the large variability in the total number of reads. The first part of this thesis deals with the problem of estimating the bacterial composition from sparse count data. A penalized likelihood of a multinomial model is proposed that estimates the composition by regularizing the nuclear norm of the compositional matrix. Under the assumption that the observed composition is approximately low rank, a nearly optimal theoretical upper bound on the estimation error under the Kullback-Leibler divergence and the Frobenius norm is obtained. Simulation studies demonstrate that the penalized likelihood-based estimator outperforms the commonly used naive estimator in terms of the estimation error of the composition matrix and various bacterial diversity measures. An analysis of a microbiome dataset is used to illustrate the methods.
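To make the first part concrete, the following is a minimal sketch of the two ingredients named above: the naive estimator (row-normalized counts) and a nuclear-norm-penalized multinomial objective. The function names, the scaling of the likelihood by the total read count, and the tuning parameter `lam` are assumptions made for illustration; the abstract does not state the exact form of the objective or the algorithm used to minimize it.

```python
import numpy as np

def naive_composition(counts):
    """Naive estimator: divide each sample's counts by its total read count."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def penalized_objective(counts, X, lam):
    """Scaled multinomial negative log-likelihood plus a nuclear norm penalty.

    counts : (n samples, p taxa) matrix of read counts
    X      : candidate composition matrix, each row on the simplex
    lam    : penalty weight (hypothetical tuning parameter)
    """
    counts = np.asarray(counts, dtype=float)
    X = np.asarray(X, dtype=float)
    total = counts.sum()
    observed = counts > 0  # zero counts contribute nothing to the likelihood
    nll = -np.sum(counts[observed] * np.log(X[observed])) / total
    nuclear_norm = np.linalg.norm(X, ord="nuc")  # sum of singular values
    return nll + lam * nuclear_norm

counts = np.array([[5, 0, 5], [2, 2, 6], [0, 1, 9]])
X_naive = naive_composition(counts)
X_uniform = np.full(counts.shape, 1.0 / counts.shape[1])
print(penalized_objective(counts, X_naive, lam=0.1))
print(penalized_objective(counts, X_uniform, lam=0.1))
```

With `lam = 0` the naive estimator minimizes this objective, but it assigns exactly zero abundance to every unobserved taxon. A positive `lam` favors low-rank composition matrices, which is one way to share information across samples when counts are sparse.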

Understanding the dependence structure among microbial taxa within a community, including co-occurrence and co-exclusion relationships between microbial taxa, is another important problem in microbiome research. However, the compositional nature of the data complicates the investigation of the dependence structure, since there are no known multivariate distributions that are flexible enough to model such dependence. The second part of the thesis develops a composition-adjusted thresholding (COAT) method to estimate the sparse covariance matrix of the latent log-basis components. The method is based on a decomposition of the variation matrix into a rank-2 component and a sparse component. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence scales to large covariance matrix estimation problems based on compositional data. The identifiability of the covariance parameters is rigorously characterized. In addition, the rate of convergence under the spectral norm is derived, and the procedure is shown to have theoretical guarantees on support recovery under certain assumptions. In an application to gut microbiome data, the COAT method leads to more stable and more biologically interpretable results when comparing the dependence structures of lean and obese microbiomes.
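The idea of thresholding the sample centered log-ratio covariance matrix can be sketched as follows. This is a simplified illustration, not the COAT procedure itself: it assumes strictly positive compositions (for example, after adding a pseudo-count), uses a single universal soft threshold `tau` on the off-diagonal entries, and the function names are hypothetical. The abstract does not specify the thresholding rule or how the threshold is chosen.

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform: log of each row minus that row's mean log."""
    logX = np.log(np.asarray(X, dtype=float))
    return logX - logX.mean(axis=1, keepdims=True)

def thresholded_clr_covariance(X, tau):
    """Soft-threshold the off-diagonal entries of the sample clr covariance.

    X   : (n samples, p taxa) matrix of strictly positive compositions
    tau : universal threshold level (an assumption of this sketch)
    """
    Z = clr(X)
    S = np.cov(Z, rowvar=False)  # p x p sample covariance of clr components
    T = np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)
    np.fill_diagonal(T, np.diag(S))  # leave the variances untouched
    return T

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(10), size=40)  # simulated compositions
Sigma_hat = thresholded_clr_covariance(X, tau=0.2)
print(np.count_nonzero(Sigma_hat))
```

Because the estimator only needs the clr sample covariance and an entrywise operation, its cost is dominated by forming a p x p covariance matrix, which is what makes this kind of approach practical when the number of taxa is large.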

The third part of the thesis considers the two-sample testing problem for high-dimensional compositional data and formulates a testable hypothesis of compositional equivalence for the means of two latent log-basis vectors. A test of compositional equivalence based on the centered log-ratio transformation of the compositions is proposed and is shown to have an asymptotic type I extreme value distribution under the null. The power of the test against sparse alternatives is derived. Simulations demonstrate that the proposed tests can be significantly more powerful than existing tests applied to the raw and log-transformed compositional data. The usefulness of the proposed tests is illustrated by applications that test for differences in gut microbiome composition between lean and obese individuals and for changes in the gut microbiome between time points during treatment in Crohn's disease patients.
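A test of this kind can be sketched as a maximum of standardized mean differences across centered log-ratio components, calibrated against a type I extreme value (Gumbel-type) limit. The sketch below is illustrative only: the unpooled variance estimate, the centering constants `2 log p - log log p`, and the limiting distribution function `exp(-exp(-x/2) / sqrt(pi))` are assumptions borrowed from the form commonly used for max-type mean tests, and the function names are hypothetical. The abstract does not give the exact statistic.

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform of strictly positive compositions."""
    logX = np.log(np.asarray(X, dtype=float))
    return logX - logX.mean(axis=1, keepdims=True)

def max_clr_test(X1, X2):
    """Max-type two-sample test on clr-transformed compositions.

    Returns the test statistic and an asymptotic p-value. The calibration
    constants below are an assumption of this sketch.
    """
    Z1, Z2 = clr(X1), clr(X2)
    n1, p = Z1.shape
    n2 = Z2.shape[0]
    diff = Z1.mean(axis=0) - Z2.mean(axis=0)
    se2 = Z1.var(axis=0, ddof=1) / n1 + Z2.var(axis=0, ddof=1) / n2
    stat = np.max(diff ** 2 / se2)
    x = stat - 2.0 * np.log(p) + np.log(np.log(p))
    p_value = 1.0 - np.exp(-np.exp(-x / 2.0) / np.sqrt(np.pi))
    return stat, p_value

# Simulate latent log-basis vectors, then close them to compositions.
rng = np.random.default_rng(0)
log_w1 = rng.normal(size=(50, 20))
log_w2 = rng.normal(size=(50, 20))
log_w2[:, 0] += 5.0  # sparse alternative: one taxon shifts
X1 = np.exp(log_w1) / np.exp(log_w1).sum(axis=1, keepdims=True)
X2 = np.exp(log_w2) / np.exp(log_w2).sum(axis=1, keepdims=True)
print(max_clr_test(X1, X2))
```

Taking the maximum over components is what targets sparse alternatives: a large shift in a few taxa dominates the statistic, whereas a sum-of-squares statistic would dilute it across all p components.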