NAVIGATING HETEROGENEITY TO LEARN FROM LARGE-SCALE CANCER DATA: OPTIMIZATION, REDUNDANCY, AND GENERALIZATION
Degree type
Graduate group
Discipline
Subject
Funder
Grant number
License
Copyright date
Distributor
Related resources
Author
Contributor
Abstract
In the pursuit of molecular characterization of diverse cancers, collaborative efforts have generated large publicly available datasets, which combine various data types and data sources. Simultaneously, machine learning has rapidly gravitated toward models with many parameters that can be trained on broad sets of data, and subsequently fine-tuned to a wide variety of tasks. Computational oncology sits squarely at the intersection between these advances. However, the structure of most cancer datasets is uniquely heterogeneous, relative to other fields and data types in which large models have proven successful. In this dissertation, we first study aspects of machine learning model tuning in cancer, showing that the choice of optimizer used to fit models on cancer transcriptomics datasets can have pronounced effects on model selection. We then explore two aspects of heterogeneity inherent to public cancer datasets that affect machine learning modeling choices. We first show that most -omics types available in the TCGA Pan-Cancer Atlas can capture information relevant to cancer function, but somewhat less intuitively, when multiple -omics types are combined there is considerable redundancyand model performance does not generally improve. Next, we study model generalization across biological contexts in cancer transcriptomics and its implications on model selection, finding that cross-validation performance on holdout data is a sufficient selection criterion, and criteria that incorporate model sparsity or simplicity do not tend to improve generalization performance. Overall, our results show that the particularities of large cancer genomics datasets must be considered for applications of machine learning to be successful in this domain. These findings suggest hurdles to, but also opportunities for, machine learning models integrating pan-cancer and pan-omics data for biological and clinical insights.
Advisor
Ritchie, Marylyn, D