Modeling mutation rate variation over time using Bayesian sequence context mutational trees
Discipline
Statistics and Probability
Subject
Computational Biology
Genomics
Mutation Rate
Population Genetics
Abstract
Germline mutation is the mechanism by which genetic variation in a population is created, so mutation rate models are fundamental to population genetic inference. Assuming a single, global mutation probability at every site in the genome has been shown to be an oversimplification, because rates of single-nucleotide variation are highly variable across the genome. Sequence context mutation models are a natural framework for capturing this variability by taking into account the local nucleotide context. Models that include wider sequence context windows generally perform better. However, this comes at the cost of increased sparsity, because the data are partitioned into smaller bins corresponding to every possible sequence context, which risks overfitting. Here I propose Baymer, a regularized Bayesian hierarchical tree model that dynamically captures sequence context-dependent mutability. The method is robust to sparse data settings because it regularizes parameters where data are limited. Baymer also emits uncertainty estimates that are crucial for model comparison, allowing nuanced evaluation of differences in parameter estimates. I demonstrate Baymer in three applications: first, identifying differences in polymorphism probabilities between continental populations in the 1000 Genomes Phase 3 dataset; second, examining, in a sparse data setting, the use of polymorphism models as a proxy for de novo mutation probabilities as a function of variant age, sequence context window size, and demographic history; and third, comparing model concordance between species across the tree of life. I discover new polymorphism probability differences between populations while validating the method by recapitulating known signatures. I find a shared context-dependent mutation rate architecture underlying these models, enabling a transfer-learning-inspired strategy for modeling germline mutations. Finally, I generate 9-mer mutation models in 13 diverse species and demonstrate, for the first time, a relationship between Baymer model similarity and time to the most recent common ancestor (TMRCA) in species beyond mammals. In summary, Baymer is an accurate polymorphism probability estimation algorithm that automatically adapts to data sparsity at different sequence context levels, thereby making efficient use of available data and opening the door to novel discovery.
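
The sparsity trade-off described in the abstract can be made concrete with a small sketch. The Python below is not Baymer and does not reproduce its model; it only illustrates, under simplifying assumptions, why wider context windows thin out the data and how a child context's estimate can be pulled toward its parent's when observations are scarce. The function names (`context_count`, `kmer_contexts`, `shrunk_rate`) and the `strength` parameter are hypothetical, and the shrinkage rule is a generic Beta-binomial posterior mean chosen for illustration.

```python
from collections import Counter


def context_count(k: int) -> int:
    """Number of distinct k-mer contexts over the alphabet {A, C, G, T}."""
    return 4 ** k


def kmer_contexts(seq: str, k: int) -> Counter:
    """Count every k-mer window in seq (k must be odd so a central site exists)."""
    if k % 2 == 0:
        raise ValueError("k must be odd so the window has a central site")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shrunk_rate(mutated: int, total: int, parent_rate: float,
                strength: float = 10.0) -> float:
    """Toy partial pooling (an assumption, not Baymer's estimator).

    Posterior mean under a Beta prior centered on the parent context's rate:
    with little data the estimate stays near parent_rate, and with a lot of
    data it approaches the observed proportion mutated / total.
    """
    return (mutated + strength * parent_rate) / (total + strength)


# Each two-base widening of the window multiplies the number of bins by 16.
for k in (3, 5, 7, 9):
    print(k, context_count(k))

# A context with no observations falls back to its parent's rate.
print(shrunk_rate(mutated=0, total=0, parent_rate=0.01))
```

With a fixed amount of sequence data, the number of bins grows as 4^k while the observations per bin shrink accordingly, which is the setting in which some form of regularization toward shorter contexts becomes useful.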