Statistical Methods for Compositional and Tree-Structured Count Data in Human Microbiome Studies

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology & Biostatistics
Discipline
Subject
compositional data
high-dimensional regression
hypothesis testing
microbiome
paired count data
taxonomic tree
Biostatistics
Copyright date
2016-11-29T00:00:00-08:00
Abstract

In human microbiome studies, sequencing read data are often summarized as counts of bacterial taxa at various taxonomic levels. In this thesis, we develop statistical methods for analyzing such count data. We first consider regression analysis with bacterial counts normalized into compositions as covariates. To ensure that the resulting model is subcompositionally coherent, we introduce linear models with a set of linear constraints on the regression coefficients. We develop a penalized estimation procedure that estimates the regression coefficients and selects variables under these constraints. We also propose a method for obtaining de-biased estimates of the regression coefficients that are asymptotically unbiased and have a joint asymptotic multivariate normal distribution. This provides valid confidence intervals for the regression coefficients and can be used to obtain p-values. Simulation results show that the confidence intervals are valid and that the de-biased estimates have smaller variances when the linear constraints are imposed. Applied to a gut microbiome data set, the proposed methods identify four bacterial genera that are associated with body mass index after adjusting for total fat and caloric intake.

We then consider the problem of testing for a difference between two repeated microbiome measurements from the same subjects. Multiple microbiome measurements are often obtained from the same subject to assess differences in microbial composition across body sites or time points. Existing models for such data are limited in how they model the covariance structure of the counts and in how they handle paired multinomial data. We propose a new probability distribution for paired multinomial count data that allows a flexible covariance structure and can be used to model repeatedly measured multivariate counts. Based on this distribution, we develop a test statistic for the difference in compositions of paired multinomial count data. The proposed test can be applied to count data observed on taxonomic trees, both to test for differences in microbiome composition and to identify subtrees with different subcompositions. Simulation results show that the proposed test has the correct type I error rate and greater power than some commonly used methods. An analysis of an upper respiratory tract microbiome data set illustrates the proposed methods.
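As a rough illustration of the linear constraints mentioned in the abstract, the sketch below assumes a single zero-sum constraint on the coefficients of log-transformed compositions; the abstract does not state the form of the constraints, so this is an assumption. The penalty and the de-biasing step are omitted, and the data, variable names, and the helper `constrained_least_squares` are hypothetical. The sketch only shows why such a constraint makes the fit insensitive to how each sample is normalized.

```python
import numpy as np

def constrained_least_squares(Z, y, C):
    """Minimize ||y - Z b||^2 subject to C.T @ b = 0 by solving the KKT system.

    Assumes Z has full column rank and C has full column rank.
    """
    p = Z.shape[1]
    k = C.shape[1]
    kkt = np.block([[Z.T @ Z, C], [C.T, np.zeros((k, k))]])
    rhs = np.concatenate([Z.T @ y, np.zeros(k)])
    return np.linalg.solve(kkt, rhs)[:p]

rng = np.random.default_rng(0)
n, p = 100, 5

# Simulated compositions (rows sum to 1); purely illustrative data.
X = rng.dirichlet(np.ones(p), size=n)
Z = np.log(X)

# Hypothetical true coefficients that satisfy the assumed zero-sum constraint.
b_true = np.array([1.0, -0.8, 0.0, 0.0, -0.2])
y = Z @ b_true + 0.1 * rng.standard_normal(n)

# Assumed constraint: the coefficients sum to zero.
C = np.ones((p, 1))
b_hat = constrained_least_squares(Z, y, C)

# Rescale every sample by an arbitrary positive factor, as if the
# compositions had been normalized differently, and refit.
scale = rng.uniform(0.5, 2.0, size=(n, 1))
b_scaled = constrained_least_squares(np.log(X * scale), y, C)

print("sum of coefficients:", b_hat.sum())
print("max change after rescaling:", np.abs(b_hat - b_scaled).max())
```

Under the zero-sum constraint, adding a per-sample constant to the log-covariates leaves the linear predictor unchanged, so both fits solve the same problem; an unconstrained fit would not have this property.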

Advisor
Hongzhe Li
Date of degree
2016-01-01