Capturing complex patterns of association in genetic data: a rule based machine learning approach to survival analysis

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology and Biostatistics
Discipline
Life Sciences
Genetics and Genomics
Subject
Epistasis
GWAS
Heterogeneity
Learning classifier system
Machine learning
Survival analysis
Copyright date
2022
Author
Woodward, Alexa Abigail
Abstract

Genetic heterogeneity, epistasis, and other complex genetic architectures underlie disease risk, survival, and other outcomes. Traditional statistical methods rely heavily on assumptions and struggle to detect these complex patterns. This dissertation applied unfamiliar approaches to familiar problems in genetic epidemiology to demonstrate the advantages of machine learning. In Aim 1, we applied Relief-based algorithms as a ranking scheme for enrichment analysis to generate hypotheses about the role of epistasis in conotruncal heart defects. We identified key pathways in the secondary heart field and cardiac neural crest cells that play a role in the development of the cardiac outflow tract. For Aim 2, we developed a method that specifically addresses genetic heterogeneity in survival data while avoiding the constraints of popular methods such as the Cox proportional hazards model. Our novel survival learning classifier system (survival-LCS) fully accounts for right-censored observations, handles multiple feature types and missing data, and makes no assumptions about baseline hazard or survival distributions. Learning classifier systems (LCSs) are a type of rule-based machine learning algorithm uniquely suited to heterogeneous problem domains but, to date, have not been adapted for survival analysis. While most methods seek to develop a single model that represents all the data, LCSs evolve a generalizable and interpretable population of rules that flexibly models underlying interactions and heterogeneity. As proof of concept, we evaluated the survival-LCS on simulated genetic survival datasets of increasing complexity. The four genetic models (main effect, epistatic, additive, and heterogeneous) were simulated across a range of censoring proportions, minor allele frequencies, and numbers of features. The results of this sensitivity analysis demonstrated the ability of the survival-LCS to identify complex patterns of association in survival data.
Using integrated Brier scores as a performance metric, we showed that the survival-LCS can also reliably predict survival times and distributions, which is potentially useful for clinical applications such as informing self-controls in single-arm clinical trials. Finally, in Aim 3, we applied the survival-LCS to GWAS data from a neuroblastoma cohort. This work introduces new approaches that are accessible to epidemiologists and others and provides a path forward for future methods development.

Advisor
Moore, Jason H.
Shen, Li
Date of degree
2022