Statistical Methods for Modeling Complex Dependency Structures in Zero-inflated Metagenomic Sequencing Data

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology and Biostatistics
Discipline
Biology
Statistics and Probability
Subject
Metagenomics
Microbiome
Mixture models
Network analysis
Two-stage estimation
Zero-inflated models
Copyright date
2023
Author
Deek, Rebecca, Ann
Abstract

Advances in high-throughput sequencing technologies have enabled large-scale metagenomic sequencing studies of microbial compositions. As a result, there is growing scientific interest in understanding the human microbiome, defined as all the microorganisms and their genes in, or on, the body. Of particular interest is its functional role in human-host health. Nevertheless, there remains a statistical and computational bottleneck in effectively analyzing data from 16S rRNA and metagenomic sequencing studies. This is due to the characteristic excessive zeros, sequencing depth constraints, and high dimensionality of such data. Motivated by numerous microbiome studies, this dissertation aims to narrow this gap by developing novel statistical methods specifically designed to capture the excessive zeros of the data. The specific aims are to develop statistical models, inference procedures, and computationally fast algorithms to (1) identify distinct microbial communities in a given data set, as well as each community’s important bacterial taxa, and (2) build microbial covariation networks based upon the estimated covariation between a pair of zero-inflated variables. To this end, three methodological advances are proposed. The first is a generative latent mixture model of microbial counts that distinguishes between structural and sampling zeros. The second is a mixture margin copula model and two-stage inference procedure for microbial covariation networks in cross-sectional studies. The third is an extension to random-effects mixture margin copula models, together with a corresponding Monte Carlo EM algorithm and likelihood ratio test, to build temporally conserved covariation networks from longitudinal data. The performance and utility of these methods are demonstrated using simulations and several publicly available microbiome data sets.
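
The distinction between structural and sampling zeros mentioned in the abstract can be illustrated with a minimal sketch. It assumes a zero-inflated Poisson as a stand-in; this is not the dissertation's model, and the function names and the parameters `pi_structural` and `lam` are hypothetical. A structural zero means the taxon is absent, while a sampling zero means it is present but was not observed at the given sequencing depth.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_zero_inflated_poisson(pi_structural, lam, n, seed=0):
    """Draw n counts from a zero-inflated Poisson (illustrative assumption).

    With probability pi_structural the count is a structural zero;
    otherwise it is a Poisson(lam) draw, which may itself be a sampling zero.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        if rng.random() < pi_structural:
            counts.append(0)  # structural zero: taxon absent
        else:
            counts.append(poisson_draw(rng, lam))  # may be a sampling zero
    return counts


def prob_structural_given_zero(pi_structural, lam):
    """Posterior probability that an observed zero is structural (Bayes' rule)."""
    p_sampling_zero = (1.0 - pi_structural) * math.exp(-lam)
    return pi_structural / (pi_structural + p_sampling_zero)


counts = sample_zero_inflated_poisson(pi_structural=0.3, lam=2.0, n=1000)
zero_fraction = sum(c == 0 for c in counts) / len(counts)
posterior = prob_structural_given_zero(0.3, 2.0)
```

Under these assumptions, the posterior shows how an observed zero is split between the two sources: the larger the Poisson mean, the less likely a zero is to be a sampling artifact.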

Advisor
Li, Hongzhe
Date of degree
2023