Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Genomics & Computational Biology

First Advisor

Casey S. Greene


Gene expression measurements capture downstream biological responses to molecular perturbations. This systems biology perspective can be investigated using both supervised and unsupervised machine learning approaches to rapidly derive insight, including cell type and pathway signatures, from transcriptomic compendia. Machine learning applied to transcriptomic compendia can aid in biological discovery, hypothesis generation, and precision medicine. We introduce these topics and discuss their impact in Chapter 1. In Chapters 2-4, we describe and extend a supervised learning approach to detect aberrant gene and pathway activity in cancer. We apply this approach to identify patient tumors, cell lines, and patient-derived xenograft models with TP53 loss of function, Ras signaling activation, and NF1 loss. This approach facilitates the discovery of phenocopying variants and potential hidden responders to specific therapies. In Chapters 5-6, we focus on deriving transcriptomic signatures using unsupervised learning. We show that unsupervised learning can identify disease subtypes and can be used to develop gene expression signatures without the need to specify labels a priori. In Chapter 5, we assess the reproducibility of high-grade serous ovarian cancer (HGSC) gene expression subtypes across populations and clustering algorithms. In Chapter 6, we train a variational autoencoder on patient tumors and use latent space arithmetic to identify the gene signatures that best distinguish HGSC subtypes. Lastly, in Chapter 7, we develop an approach to rapidly interpret compressed features engineered by unsupervised learning algorithms. We train a series of unsupervised models across a wide range of latent space dimensions and develop a network-based method for interpreting these compressed gene expression features. Using this approach, we observe that modifying the hidden layer dimensionality impacts the identification of specific gene set and cell-type activation patterns in cancer and normal tissue. Machine learning models scale to large genomic datasets and have provided state-of-the-art results in a variety of biomedical domains. However, model interpretation is critical to build knowledge and to generate hypotheses.
