Statistical Models for Alternative Splicing with Applications to Heterogeneous Disease

Degree type
Doctor of Philosophy (PhD)
Graduate group
Genomics and Computational Biology
Discipline
Bioinformatics
Biology
Genetics and Genomics
Subject
Cancer
Genetics
Machine Learning
QTL
RNA Splicing
Statistics
Copyright date
01/01/2024
Author
Wang, David
Abstract

This dissertation is divided into two distinct parts, unified by a focus on developing novel statistical methods to analyze splicing data. In Chapter 2, we develop methods for subtype discovery in heterogeneous cancers. Identification of cancer subtypes characterized by actionable genetic lesions is a pivotal step in developing treatments and improving clinical care. However, in heterogeneous diseases such as Acute Myeloid Leukemia (AML), subtype discovery can be challenging because mutation burden, which has traditionally been prioritized for this task, is low. Recent studies pointing to splicing aberrations in AML motivate splicing-based detection of cancer subtypes. We developed an unsupervised machine learning algorithm called CHESSBOARD to identify “tiles”, each defined by a subset of splicing events and patient samples, that represent disease subtypes. The model allows for a flexible number of tiles, accounts for uncertainty in splicing quantification, and is able to model missing values as additional signals. We first apply CHESSBOARD to synthetic data to assess its domain-specific modeling advantages, followed by analysis of several leukemia datasets. We show that the detected subtypes are reproducible in independent studies, investigate their possible regulatory drivers, and probe their relation to known AML mutations. Finally, we demonstrate the potential clinical utility of CHESSBOARD by supplementing mutation-based diagnostic assays with discovered splicing profiles to improve drug response correlation.

In Chapter 3, we develop improved methods for discovery of splicing quantitative trait loci (sQTLs). Identification and characterization of sQTLs has emerged as a critical component in understanding the function of noncoding genetic variants implicated in disease. However, a significant number of sQTLs remain undiscovered due to limitations in both splicing quantification and statistical methods. Here we present an sQTL mapping framework that identifies thousands of novel variants that have been recurrently omitted in recent studies. Our method combines three components: event- and transcript-level quantifications that identify variants associated with a more comprehensive set of splicing phenotypes, a regression model tailored for splicing data, and a hypothesis weighting method that leverages splicing-specific covariates to improve sGene discovery power while controlling exact FWER. Using GTEx as a case study, we show that existing pipelines fail to report over 25% of sQTLs by comparison. We also introduce several techniques to improve downstream variant prioritization, including multivariate fine mapping, effect size inference, and visualization tools. Finally, in an application to GTEx data, we show that our pipeline discovers novel intron retention associated variants in the Alzheimer’s disease gene CASS4 and variants in NAGNAG motifs. Furthermore, we show that newly discovered sQTLs co-localize with GWAS variants in the GWAS catalog for neurodegenerative disease. The newly discovered sQTLs thus explain additional GWAS signal compared to existing approaches, providing novel insight into the functional role of genetic variants in splicing regulation.

Advisor
Barash, Yoseph
Date of degree
2024