Statistical Methods for Compositional and Tree-Structured Count Data in Human Microbiome Studies

Shi, Pixu

Files

Shi_upenngdas_0175C_12212.pdf (1.14 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Epidemiology & Biostatistics

Subject

compositional data
high-dimensional regression
hypothesis testing
microbiome
paired count data
taxonomic tree
Biostatistics

Copyright date

2016-11-29T00:00:00-08:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/28886

Author

Shi, Pixu

Abstract

In human microbiome studies, sequencing read data are often summarized as counts of bacterial taxa at various taxonomic levels. In this thesis, we develop statistical methods for analyzing such count data.

We first consider regression analysis in which bacterial counts, normalized into compositions, serve as covariates. To ensure that the resulting model is subcompositionally coherent, we introduce linear models with a set of linear constraints on the regression coefficients. We develop a penalized estimation procedure that estimates the regression coefficients and selects variables under these linear constraints. We also propose a method for obtaining de-biased estimates of the regression coefficients that are asymptotically unbiased and have a joint asymptotic multivariate normal distribution. This yields valid confidence intervals for the regression coefficients and can be used to obtain p-values. Simulation results show that the confidence intervals are valid and that the de-biased estimates have smaller variances when the linear constraints are imposed. Applied to a gut microbiome data set, the proposed methods identify four bacterial genera that are associated with body mass index after adjusting for total fat and caloric intake.

We then consider the problem of testing for a difference between two repeated microbiome measurements from the same subjects. Multiple microbiome measurements are often obtained from the same subject to assess differences in microbial composition across body sites or time points. Existing models for analyzing such data are limited in modeling the covariance structure of the counts and in handling paired multinomial data. We propose a new probability distribution for paired multinomial count data, which allows a flexible covariance structure for the counts and can be used to model repeatedly measured multivariate counts. Based on this new distribution, we develop a test statistic for differences in the compositions of paired multinomial count data. The proposed test can be applied to count data observed on taxonomic trees in order to test for differences in microbiome compositions and to identify subtrees with different subcompositions. Simulation results show that the proposed test has correct type I error rates and increased power compared with some commonly used methods. An analysis of an upper respiratory tract microbiome data set illustrates the proposed methods.
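
As a rough illustration of why linear constraints on the coefficients matter for compositional covariates, the sketch below assumes a log-linear predictor with a single zero-sum constraint on the coefficients. The abstract does not state the exact form of the constraints, so the zero-sum choice, the function names, and all numeric values here are assumptions for illustration only. Under that assumption, the predictor is the same whether it is computed from raw counts or from the normalized composition, because the normalizing total cancels out.

```python
import math


def log_contrast_predictor(abundances, beta):
    """Return sum_j beta_j * log(x_j) for strictly positive abundances."""
    if any(x <= 0 for x in abundances):
        # Zero counts need separate handling (for example a pseudocount),
        # which this sketch does not attempt.
        raise ValueError("abundances must be strictly positive")
    return sum(b * math.log(x) for b, x in zip(beta, abundances))


def to_composition(counts):
    """Normalize counts so that they sum to one."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical coefficients chosen to sum to zero (assumed constraint).
beta = [0.5, -0.2, -0.3]
# Hypothetical taxon counts for a single sample.
counts = [10, 30, 60]

raw = log_contrast_predictor(counts, beta)
normalized = log_contrast_predictor(to_composition(counts), beta)

# With zero-sum coefficients, the term sum(beta) * log(total) vanishes,
# so both values agree up to floating-point error.
print(math.isclose(raw, normalized, abs_tol=1e-9))
```

Without the zero-sum constraint, the two values would differ by the sum of the coefficients times the log of the total count, so the fitted model would depend on an arbitrary normalization.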

Advisor

Hongzhe Li

Date of degree

2016-01-01

Collection

Dissertations and Theses