Identifying Complex Trait-Related Genes Via Regulation-Informed Gene-Based Analyses

Binglan Li, University of Pennsylvania


Understanding dysregulated genes is essential for improving clinical care, yet the majority of complex trait-associated genetic variants (>90%) are located in noncoding regions of the human genome, and connecting these noncoding variants to the genes they affect is challenging. Noncoding elements can, however, regulate genes. Regulatory elements such as expression quantitative trait loci (eQTLs) provide a potential means to link noncoding genetic variants to affected genes and to explore complex disease mechanisms.

The transcriptome-wide association study (TWAS) is a popular approach that exploits eQTLs to prioritize transcriptionally regulated genes from genome-wide association studies (GWAS). Transcriptional regulation is tissue-specific, yet it was unclear how the biological properties of eQTLs and gene expression levels affect the power of different TWAS methods. To answer this question, I designed and developed a novel data simulation framework that efficiently simulates variant, gene, and disease data according to designed relationships across multiple tissues simultaneously. The simulation showed that TWAS performance differed between tissue-specific genes and genes expressed across all tissues. I therefore put forth a tissue specificity-aware TWAS (TSA-TWAS) framework, validated its utility in clinical trial data, and proposed recommendations for future TWAS under varied scenarios.
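The variant-gene-trait relationship described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the simulation framework or the TSA-TWAS method developed in the thesis: all function names, parameter values, and the simple correlation-threshold expression model are hypothetical, and the same simulated individuals are used for both training and association purely for brevity.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=2000, n_snps=4, n_tissues=3, causal_tissue=0, maf=0.3,
             eqtl_effect=0.8, gene_effect=0.5, seed=1):
    """Simulate a tissue-specific gene (all values are illustrative assumptions)."""
    rng = random.Random(seed)
    # Genotype dosages (0, 1, or 2 minor alleles); snps[j][i] is SNP j, person i.
    snps = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
            for _ in range(n_snps)]
    # SNP 0 is an eQTL in the causal tissue only; other tissues are pure noise.
    expression = []
    for t in range(n_tissues):
        beta = eqtl_effect if t == causal_tissue else 0.0
        expression.append([beta * snps[0][i] + rng.gauss(0, 1) for i in range(n)])
    # The trait depends on expression in the causal tissue.
    trait = [gene_effect * e + rng.gauss(0, 1) for e in expression[causal_tissue]]
    return snps, expression, trait


def twas_per_tissue(snps, expression, trait, min_abs_r=0.15):
    """Return one gene-trait correlation per tissue, or None if no eQTL is found."""
    results = []
    for expr in expression:
        weights = {}
        for j, g in enumerate(snps):
            r = pearson(g, expr)
            if abs(r) >= min_abs_r:
                # Single-SNP least-squares slope used as the eQTL weight.
                weights[j] = r * statistics.pstdev(expr) / statistics.pstdev(g)
        if not weights:
            results.append(None)  # no expression model in this tissue
            continue
        # Genetically predicted expression, then its association with the trait.
        predicted = [sum(w * snps[j][i] for j, w in weights.items())
                     for i in range(len(trait))]
        results.append(pearson(predicted, trait))
    return results


snps, expression, trait = simulate()
print(twas_per_tissue(snps, expression, trait))
```

In this toy setting the gene can only be tested in the tissue where it has an eQTL; tissues without eQTL signal yield no expression model, which is one simple way tissue specificity can shape what a TWAS is able to detect.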

Centralized biobanks, such as the Penn Medicine Biobank (PMBB) and the Electronic Medical Records and Genomics (eMERGE) network, have collected a wealth of biospecimens and disease diagnoses and have recruited participants of varied genetic ancestries. However, it is unclear how disease susceptibility genes differ across genetic ancestries and disease categories. Building on the simulation results from the first part of the thesis, I designed a framework that applies TWAS and other data integrative methods to multi-ancestry electronic health record (EHR)-linked biobanks to identify ancestry-specific and cross-ancestry gene-disease associations under a discovery (eMERGE III network) and replication (PMBB) study design. This study characterized a multi-ancestry gene-disease connection landscape.

This thesis contributes (1) a novel multi-tissue variant-gene-trait simulation framework and a comprehensive evaluation of TWAS, and (2) a multi-ancestry gene-disease connection landscape. Together, these contributions improve the understanding of genetically regulated genes underlying complex diseases and promote the translation of basic science discoveries into clinical health care.