Capturing complex patterns of association in genetic data: a rule based machine learning approach to survival analysis
Discipline
Genetics and Genomics
Subject
GWAS
Heterogeneity
Learning classifier system
Machine learning
Survival analysis
Abstract
Genetic heterogeneity, epistasis, and other complex genetic architectures underlie disease risk, survival, and other outcomes. Traditional statistical methods rely heavily on assumptions and struggle to detect these complex patterns. This dissertation applied unfamiliar approaches to familiar problems in genetic epidemiology to demonstrate the advantages of machine learning. In Aim 1, we applied Relief-based algorithms as a ranking scheme for enrichment analysis to generate hypotheses about the role of epistasis in conotruncal heart defects. We identified key pathways in the secondary heart field and cardiac neural crest cells that play a role in the development of the cardiac outflow tract. For Aim 2, we developed a method to specifically address genetic heterogeneity in survival data, avoiding the constraints of popular methods such as the Cox proportional hazards model. Our novel survival learning classifier system (survival-LCS) fully accounts for right-censored observations, handles multiple feature types and missing data, and makes no assumptions about baseline hazard or survival distributions. Learning classifier systems (LCSs) are a type of rule-based machine learning algorithm uniquely suited to heterogeneous problem domains but, to date, have not been adapted for survival analysis. While most methods seek to develop a single model that represents all the data, LCSs evolve a generalizable and interpretable population of rules that flexibly models underlying interactions and heterogeneity. As proof of concept, we evaluated the survival-LCS on simulated genetic survival datasets of increasing complexity. The four genetic models included main-effect, epistatic, additive, and heterogeneous models, simulated across a range of censoring proportions, minor allele frequencies, and numbers of features. The results of this sensitivity analysis demonstrated the ability of the survival-LCS to identify complex patterns of association in survival data.
Using integrated Brier scores as a performance metric, we showed that the survival-LCS can also reliably predict survival times and distributions, which is potentially useful for clinical applications such as informing self-controls in single-arm clinical trials. Finally, in Aim 3, we applied the survival-LCS to GWAS data from a neuroblastoma cohort. This work introduces new approaches that are accessible to epidemiologists and others and provides a path forward for future methods development.
Advisor
Shen, Li