Microbiome and Metagenomics: Statistical Methods, Computation and Applications
multi-sample Poisson model
random effect model
zero-inflated beta regression model
Human microbial communities are associated with many human diseases, such as obesity, diabetes and inflammatory bowel disease. High-throughput sequencing technology has been widely used to profile these communities in order to understand their impact on human health. In the first part of this dissertation, we use shotgun metagenomic sequencing to analyze fecal samples from a prospective cohort of pediatric Crohn's disease patients who started therapy with enteral nutrition or anti-TNF-alpha antibodies. The results reveal the full complement and dynamics of bacteria and fungi during treatment. Bacterial community membership was associated independently with dysbiosis, intestinal inflammation, antibiotic use, and therapy. Motivated by problems encountered in this real data analysis, the dissertation also presents two novel statistical models for microbiome data analysis. One important aspect of metagenomic data analysis is quantifying bacterial abundances from the sequencing data. To account for systematic differences in read coverage along the genome, we propose a multi-sample Poisson model that quantifies microbial abundances based on read counts assigned to species-specific taxonomic markers. The model takes marker-specific effects into account when normalizing the sequencing count data, in order to obtain more accurate estimates of species abundances. The second model addresses longitudinal microbiome data analysis. A key question in longitudinal microbiome studies is which microbes are associated with clinical outcomes or environmental factors. We develop a zero-inflated Beta regression model with random effects for testing the association between microbial abundance and clinical covariates in longitudinal microbiome data. 
The model includes a logistic regression component for the presence or absence of a microbe in a sample and a Beta regression component for its non-zero abundance; each component includes a random effect to account for the correlation among repeated measurements on the same subject. The statistical methods were evaluated using simulations as well as real data from the Penn microbiome study of pediatric Crohn's disease.
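
As a rough illustration of the two-part structure described above, the following Python sketch evaluates a zero-inflated Beta log-likelihood for one subject, conditional on that subject's random intercepts. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the names `zib_loglik` and `subject_loglik`, the single covariate `x`, the logit links, and the mean/precision parameterization of the Beta distribution (`mu`, `phi`) are all hypothetical choices made here for illustration. Fitting the full model would also require integrating over the random effects, which this sketch does not attempt.

```python
import math


def expit(x):
    """Numerically stable inverse-logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def zib_loglik(y, x, alpha, beta, phi, a_i=0.0, b_i=0.0):
    """Log-likelihood of one relative abundance y in [0, 1), given random intercepts.

    Assumed (hypothetical) parameterization:
      alpha = (intercept, slope) of the logistic presence/absence component
      beta  = (intercept, slope) of the Beta mean component (logit link)
      phi   = Beta precision, so the shape parameters are mu*phi and (1-mu)*phi
      a_i, b_i = subject-level random intercepts for the two components
    """
    # Logistic component: probability that the microbe is present in the sample.
    p = expit(alpha[0] + alpha[1] * x + a_i)
    if y == 0:
        return math.log(1.0 - p)

    # Beta component: mean of the non-zero abundance.
    mu = expit(beta[0] + beta[1] * x + b_i)
    s1, s2 = mu * phi, (1.0 - mu) * phi
    log_beta_pdf = (
        math.lgamma(phi) - math.lgamma(s1) - math.lgamma(s2)
        + (s1 - 1.0) * math.log(y)
        + (s2 - 1.0) * math.log(1.0 - y)
    )
    return math.log(p) + log_beta_pdf


def subject_loglik(ys, xs, alpha, beta, phi, a_i=0.0, b_i=0.0):
    """Sum over repeated measurements on one subject, sharing the same a_i and b_i."""
    return sum(zib_loglik(y, x, alpha, beta, phi, a_i, b_i) for y, x in zip(ys, xs))
```

Sharing `a_i` and `b_i` across all of a subject's time points is what induces the within-subject correlation in this sketch; a test of association with the covariate would then concern the slope terms `alpha[1]` and `beta[1]`.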