Date of Award
Doctor of Philosophy (PhD)
Nancy R. Zhang
Finding interpretable targets within the genome for diseases is a primary goal of biomedical research. This thesis focuses on developing statistical models and methods for the analysis of high-throughput genomic and transcriptomic sequencing data, with the goal of finding actionable targets of two types: disease-associated genes and disease-implicated cell types.
Traditional genome-wide association studies (GWAS) focus on finding associations between genetic variants and diseases. However, GWAS results are often difficult to interpret, and they do not directly lead to an understanding of the true biological mechanism of diseases. Following GWAS findings, we can study causal effects by Mendelian randomization (MR), which uses segregating genomic loci as instrumental variables to estimate the causal effect of a given exposure on a disease outcome. In this thesis, we introduce the concept of "localizable exposures", which are exposures that can be localized, or mapped, to a specific region in the genome, such as the expression of a single gene or the methylation of a specific locus. With sequencing technology, allele-specific reads are observable for localizable exposures, which allows these exposures to be quantified in an allele-specific manner. In the first part of this thesis, we present a new model, ASMR, that uses allele-specific information for Mendelian randomization.
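To make the instrumental-variable idea behind MR concrete, the following is a minimal sketch of the standard single-instrument ratio (Wald) estimator on simulated data. It is not the ASMR model and uses no allele-specific information; the function name `wald_ratio`, the simulated effect sizes, and the allele frequency are all assumptions chosen for illustration.

```python
import numpy as np

def wald_ratio(g, x, y):
    """Ratio (Wald) estimate of the effect of exposure x on outcome y,
    using genotype g as the instrument: slope(y ~ g) / slope(x ~ g)."""
    var_g = np.var(g, ddof=1)
    beta_gx = np.cov(g, x)[0, 1] / var_g   # instrument -> exposure
    beta_gy = np.cov(g, y)[0, 1] / var_g   # instrument -> outcome
    return beta_gy / beta_gx

# Simulated data (all values hypothetical).
rng = np.random.default_rng(0)
n = 200_000
g = rng.binomial(2, 0.3, size=n)           # genotype: 0, 1, or 2 copies of an allele
u = rng.normal(size=n)                     # unobserved confounder
x = 0.5 * g + u + rng.normal(size=n)       # exposure, e.g. expression of one gene
true_effect = 0.3
y = true_effect * x + u + rng.normal(size=n)

iv_estimate = wald_ratio(g, x, y)                        # uses the instrument
ols_estimate = np.cov(x, y)[0, 1] / np.var(x, ddof=1)    # naive regression of y on x

print(f"IV estimate:  {iv_estimate:.3f}")
print(f"OLS estimate: {ols_estimate:.3f}")
```

In this simulation the confounder `u` affects both exposure and outcome, so the naive regression is biased upward, while the ratio estimate stays close to the simulated effect because the genotype is independent of `u`.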
This thesis also develops methods for finding cell types implicated in disease through the joint analysis of bulk and single-cell RNA sequencing data. Bulk tissue sequencing is often used to probe genes that have tissue-level expression changes between biological cohorts. However, tissues are usually a mixture of multiple distinct cell types, and tissue-level changes are due to shifts in cell type proportions as well as cell-type-specific expression changes. Single-cell RNA sequencing (scRNA-seq) allows the investigation of the roles of individual cell types during disease initiation and development. We present MuSiC, a method that uses cell-type-specific gene expression from scRNA-seq data to characterize cell type compositions from bulk RNA-seq data in complex tissues. When applied to pancreatic islet and whole kidney expression data from humans, mice, and rats, MuSiC outperforms existing methods, especially for tissues with closely related cell types. With MuSiC-estimated cell type proportions, we propose a reverse estimation procedure that can detect cell-type-specific differential expression, allowing for the elucidation of the roles of genes and cell types, as well as their interactions, in disease phenotypes.
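As a rough illustration of what estimating cell type composition from bulk data involves, the sketch below treats bulk expression as a non-negative mixture of cell-type-specific expression profiles and solves for the mixing proportions by plain non-negative least squares. This is a generic formulation, not MuSiC's actual procedure; the signature matrix, the proportions, and the function name `nnls_projected_gradient` are made up for the example.

```python
import numpy as np

def nnls_projected_gradient(A, b, n_iter=2000):
    """Minimize 0.5 * ||A p - b||^2 subject to p >= 0
    by projected gradient descent with a fixed step size."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / largest eigenvalue of A^T A
    p = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ p - b)
        p = np.maximum(p - step * grad, 0.0)  # project onto the non-negative orthant
    return p

# Hypothetical signature matrix: rows are genes, columns are cell types.
# In practice these profiles would be derived from scRNA-seq data.
signature = np.array([
    [10.0, 1.0, 0.0],
    [0.0,  8.0, 1.0],
    [2.0,  0.0, 9.0],
    [5.0,  5.0, 5.0],
])

true_props = np.array([0.6, 0.3, 0.1])       # simulated cell type proportions
bulk = signature @ true_props                 # noise-free simulated bulk expression

est = nnls_projected_gradient(signature, bulk)
est_props = est / est.sum()                   # normalize so proportions sum to one

print(np.round(est_props, 3))
```

With noise-free input and clearly distinct profiles the proportions are recovered almost exactly; real tissues with closely related cell types have nearly collinear profiles, which is the harder setting the abstract refers to.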
Wang, Xuran, "Mendelian Randomization And Single Cell Deconvolution: Two Problems In Statistics Genetics" (2019). Publicly Accessible Penn Dissertations. 3325.