Statistical Methods for Human Microbiome Data Analysis

Chen, Jun

Files

Chen_upenngdas_0175C_10257.pdf (3.56 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Genomics & Computational Biology

Subject

High-dimensional statistics
Metagenomics
Microbiome
Variable selection
Bioinformatics
Biostatistics
Microbiology

Copyright date

2014-08-19T00:00:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/31829

Author

Chen, Jun

Abstract

The human microbiome is the totality of the microbes, their genetic elements and the interactions they have with surrounding environments throughout the human body. Studies have implicated the human microbiome in health and disease. Two central themes of human microbiome studies are to identify potential factors influencing the microbiome composition, and to define the relationship between microbiome features and biological or clinical outcomes. With the development of next-generation sequencing technologies, the human microbiome composition can be interrogated using high-throughput DNA sequencing. One strategy sequences the bacterial 16S ribosomal RNA gene for species identification. These 16S sequences are usually clustered into Operational Taxonomic Units (OTUs). Analysis of such OTU data raises several important statistical challenges, including taking into account the phylogenetic relationship among OTUs and modeling high-dimensional overdispersed count data. This dissertation presents three statistical methods developed specifically for 16S data analysis, centered on these two themes. To test the association between overall microbiome composition and a covariate or an outcome, a testing procedure based on a generalized UniFrac distance was developed. The generalized UniFrac distance corrects the undue weight that the classic UniFrac distances place on either highly abundant or rare lineages, and was shown to be more powerful than the classic UniFrac distances. Under the framework of canonical correlation analysis (CCA), a structure-constrained sparse CCA was proposed to select the OTUs and their correlated covariates. A phylogenetic structure-constrained penalty function was imposed to induce certain smoothness on the linear coefficients according to the OTU phylogenetic relationship. Structure-constrained sparse CCA performed much better than sparse CCA in selecting relevant OTUs.
Finally, a sparse Dirichlet-multinomial regression (SDMR) model was developed to link the microbiome composition to environmental covariates and to select the most important covariates and their affected OTUs. SDMR accounts for the overdispersion of OTU counts and uses a sparse group L1 penalty function to facilitate selection of covariates and OTUs simultaneously. These methods were illustrated using simulations as well as a real human gut microbiome data set from a study of dietary effects on gut microbiome composition.
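
As an illustration of the overdispersed count model mentioned above, the following is a minimal sketch of the standard Dirichlet-multinomial log-probability for a single sample's OTU counts. It is not the dissertation's implementation: the function name `dm_log_pmf` is hypothetical, the concentration parameters are passed in directly rather than linked to covariates, and the sparse group penalty is omitted.

```python
from math import lgamma, exp


def dm_log_pmf(counts, alpha):
    """Log-probability of one count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts, one per OTU (hypothetical input).
    alpha:  positive concentration parameters, one per OTU. A smaller
            total sum(alpha) corresponds to stronger overdispersion
            relative to a plain multinomial.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a = sum(alpha)
    # Terms that depend only on the total count and total concentration.
    log_p = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    # Per-OTU terms.
    for x, al in zip(counts, alpha):
        log_p += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return log_p


if __name__ == "__main__":
    # Two OTUs with alpha = (1, 1): every split of the total count
    # is equally likely under this model.
    print(exp(dm_log_pmf([3, 2], [1.0, 1.0])))
```

In a regression setting, each concentration parameter would typically be modeled as a function of the covariates; that link, and the penalty used for selection, are outside the scope of this sketch.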

Advisor

Hongzhe Li

Date of degree

2012-01-01

Collection

Dissertations and Theses