Human Mutation/Substitution Rate: Variability, Modeling and Applications
Mutation generates genetic variation, and selection in turn purges deleterious variants from the population. Understanding both is critical for discovering the causal genes and variants behind diseases and for making inferences about evolutionary processes. The human mutation rate varies significantly across the genome, yet most studies have considered only the nucleotides immediately flanking the polymorphic site when modeling patterns of variability. The impact of the larger sequence context has not been fully clarified, even though it substantially influences rates of mutation. In the first part of this thesis, I develop a novel statistical framework and, using data from the 1000 Genomes Project, demonstrate that a larger heptanucleotide sequence context explains >81% of the variability in substitution probabilities, discovering novel mutation-promoting motifs at ApT dinucleotides, CAAT, and TACG sequences. My approach also reveals previously undocumented variability in C-to-T substitutions at CpG sites that is not immediately explained by differential methylation intensity. Building on this framework, I model the selective forces acting on the coding genome and develop statistical scores that measure intolerance to functional variants at the gene or amino-acid level. I demonstrate the clinical utility of such intolerance scores in identifying genes associated with multiple human diseases, including autism. Next, I apply these lessons of mutation rate variability to develop an algorithm that detects sub-genic enrichment of de novo germline mutations in the RB1 gene of bilateral retinoblastoma (RB) probands, to further elucidate disease biology. I demonstrate that previously noted ‘hotspots’ of nonsense mutations in RB1 are compatible with the elevated mutation rates expected at CpG sites, refuting a specific mechanism in RB pathogenesis.
I also find enrichment of splice-site donor mutations at exons 6 and 12 but depletion at exon 5, indicative of previously unappreciated heterogeneity in penetrance within this class of substitution. Finally, I generate more accurate and informative estimates of the de novo germline mutation rate in humans, and develop a toolkit to simulate, distribute and interpret mutations in human diseases. Overall, my research uncovers novel variability in the human mutation rate and provides a systematic framework for analyzing mutational data, with applications ranging from causal gene discovery to the elucidation of specific disease mechanisms.
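
To make the idea of a heptanucleotide sequence context concrete, the following is a minimal sketch of how substitution proportions could be tabulated per 7-mer context. It is an illustration only and not the statistical framework developed in this thesis: the function name `heptamer_substitution_rates`, the input format (a reference string plus 0-based `(position, alt)` pairs) and the simple count-based estimate are all assumptions made for the example, and strand collapsing and model fitting are deliberately omitted.

```python
from collections import Counter


def heptamer_substitution_rates(reference, variants, k=7):
    """Tabulate observed substitution proportions per k-mer context.

    reference: string of A/C/G/T (hypothetical input format).
    variants:  iterable of (0-based position, alternate base) pairs.
    Returns a dict mapping (context, alt) -> substitutions / context occurrences.
    """
    flank = k // 2  # 3 bases on each side of the central site when k = 7

    # Count how often each k-mer context occurs in the reference.
    context_counts = Counter()
    for i in range(flank, len(reference) - flank):
        context_counts[reference[i - flank:i + flank + 1]] += 1

    # Count substitutions by the context centered on the variant site.
    substitution_counts = Counter()
    for pos, alt in variants:
        if pos < flank or pos >= len(reference) - flank:
            continue  # no full context available near the sequence ends
        if alt == reference[pos]:
            continue  # not a substitution
        context = reference[pos - flank:pos + flank + 1]
        substitution_counts[(context, alt)] += 1

    return {
        (context, alt): n / context_counts[context]
        for (context, alt), n in substitution_counts.items()
    }


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    rates = heptamer_substitution_rates("ACGTACGTACG", [(3, "C"), (0, "G")])
    for (context, alt), rate in rates.items():
        print(context, "->", alt, rate)
```

In this toy setting each proportion is simply the number of observed substitutions in a context divided by the number of times that context occurs; comparing such proportions across contexts is the kind of variability that a larger sequence context is meant to capture.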