Statistical Methods for Modeling Complex Dependency Structures in Zero-inflated Metagenomic Sequencing Data

Deek, Rebecca, Ann

Files

Deek_upenngdas_0175C_15786.pdf (13.56 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Epidemiology and Biostatistics

Discipline

Biology
Statistics and Probability

Subject

Metagenomics
Microbiome
Mixture models
Network analysis
Two-stage estimation
Zero-inflated models

Copyright date

2023

Permalink

https://repository.upenn.edu/handle/20.500.14332/59102

Abstract

Advances in high-throughput sequencing technologies have enabled large-scale metagenomic sequencing studies of microbial compositions. As a result, there is growing scientific interest in understanding the human microbiome, defined as all the microorganisms and their genes in, or on, the body. Of particular interest is its functional role in human-host health. Nevertheless, there remains a statistical and computational bottleneck in effectively analyzing data from 16S rRNA and metagenomic sequencing studies. This is due to the characteristic excessive zeros, sequencing depth constraints, and high dimensionality of such data. Motivated by numerous microbiome studies, this dissertation aims to narrow the gap by developing novel statistical methods specifically designed to capture the excessive zeros of the data. The specific aims are to develop statistical models, inference procedures, and computationally fast algorithms to (1) identify distinct microbial communities in a given data set, as well as each community’s important bacterial taxa, and (2) build microbial covariation networks based upon the estimated covariation between a pair of zero-inflated variables. To this end, three methodological advances are proposed. The first is a generative latent mixture model of microbial counts that distinguishes between structural and sampling zeros. The second is a mixture margin copula model and two-stage inference procedure for microbial covariation networks in cross-sectional studies. The third is an extension to random-effects mixture margin copula models, together with a corresponding Monte Carlo EM algorithm and likelihood ratio test, to build temporally conserved covariation networks from longitudinal data. The performance and utility of these methods are demonstrated using simulations and several publicly available microbiome data sets.
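The distinction between structural and sampling zeros mentioned in the abstract can be illustrated with a textbook zero-inflated Poisson distribution. This is only a minimal sketch under assumed notation, not the dissertation's latent mixture model: the function names and the parameters `pi` (probability of a structural zero) and `lam` (Poisson mean when the taxon is present) are hypothetical and chosen for illustration.

```python
import math


def zip_pmf(k, pi, lam):
    """Probability of observing count k under a zero-inflated Poisson.

    Assumed parameters (illustrative only):
      pi  -- probability that the taxon is truly absent (structural zero)
      lam -- Poisson mean of the count when the taxon is present
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    # A zero can arise from either mixture component; positive counts
    # can only come from the Poisson component.
    return pi * (1.0 if k == 0 else 0.0) + (1.0 - pi) * poisson


def structural_zero_posterior(pi, lam):
    """P(structural zero | observed count is 0), by Bayes' rule."""
    sampling_zero = (1.0 - pi) * math.exp(-lam)  # present but not sampled
    return pi / (pi + sampling_zero)
```

Under this simplified model, an observed zero is more likely to be structural when the mean count `lam` is large, because a present and abundant taxon would rarely be missed by sampling alone.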

Advisor

Li, Hongzhe

Date of degree

2023

Collection

Dissertations and Theses