Assembling The Building Blocks For A Unified Splicing Code
DNA contains the code for the functioning of biological systems. The central dogma of molecular biology describes the processing of this code: DNA is transcribed to RNA, which is translated to proteins. In eukaryotes, DNA is transcribed to pre-mRNA. Subsequently, a single pre-mRNA can encode different proteins by selectively including or excluding protein-coding and non-coding RNA fragments from the final mRNA product through a process called alternative splicing. Alternative splicing is a widespread mechanism of tissue-specific regulation, and its misregulation has been implicated in diseases such as cancer. This motivates experimental and computational streams of research to understand splicing. The first involves a myriad of techniques, such as RNA-Seq to quantify splicing and CLIP-Seq to identify putative targets of splicing regulators known as RNA-binding proteins (RBPs). The second involves predictive models, also known as splicing codes, which infer regulatory mechanisms and predict splicing outcomes directly from genomic sequences. In this dissertation, we try to gain a better understanding of splicing regulation through splicing codes, using deep learning to build predictive models of experimental splicing quantification. Four challenges arise when developing splicing code models: integrating heterogeneous sources of experimental data, modeling the transcriptomic state of tissues and the effect of regulatory perturbations, modeling complex splicing variations from noisy measurements, and interpreting these models to generate high-confidence hypotheses for experimental follow-up. We overcome these challenges to develop a unified splicing code framework. First, we integrate functional (RNA-Seq) and binding (CLIP-Seq) data for splicing and predict precise splicing outcomes for cassette splicing events in different tissues and upon RBP knockdown, despite noisy and sparse experimental observations.
Then, we develop a new method, enhanced integrated gradients, to interpret splicing code models and generate reliable experimental hypotheses. Subsequently, we model the transcriptomic states of tissues through an autoencoder model of RBPs using thousands of RNA tissue samples, and simulate pan-tissue RBP knockdowns to understand the role of RBPs in transcriptomic regulation. Lastly, we combine these elements with new splicing features, a vocabulary for complex splicing events, and convolutional neural networks (CNNs) to move towards a unified splicing code.