Discovering Pathway And Cell Type Signatures In Transcriptomic Compendia With Machine Learning

Way, Gregory Philip

Files

Way_upenngdas_0175C_13580.pdf (46.57 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Genomics & Computational Biology

Subject

Cancer
Gene Expression
Machine Learning
Biology
Computer Sciences
Genetics

Copyright date

2019-08-27T20:19:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/30271

Abstract

Gene expression measurements capture downstream biological responses to molecular perturbations. This systems biology perspective can be investigated using both supervised and unsupervised machine learning approaches to rapidly derive insight, including cell type and pathway signatures, from transcriptomic compendia. Machine learning applied to transcriptomic compendia can aid in biological discovery, hypothesis generation, and precision medicine. We introduce these topics and discuss their impact in Chapter 1. In Chapters 2-4, we describe and extend a supervised learning approach to detect aberrant gene and pathway activity in cancer. We apply this approach to identify patient tumors, cell lines, and patient-derived xenograft models with TP53 loss of function, Ras signaling activation, and NF1 loss. This approach facilitates the discovery of phenocopying variants and potential hidden responders to specific therapies. In Chapters 5-6, we focus on deriving transcriptomic signatures using unsupervised learning. We show that unsupervised learning can identify disease subtypes and can be used to develop gene expression signatures without the need to specify labels a priori. In Chapter 5, we assess the reproducibility of high-grade serous ovarian cancer (HGSC) gene expression subtypes across populations and clustering algorithms. In Chapter 6, we train a variational autoencoder on patient tumors and use latent space arithmetic to identify gene signatures most distinguishing HGSC subtypes. Lastly, in Chapter 7, we develop an approach to rapidly interpret compressed features engineered in unsupervised learning algorithms. We train a series of unsupervised models across a wide range of latent space dimensions and develop a network-based method for interpreting these compressed gene expression features. Using this approach, we observe that modifying the hidden layer dimensionality impacts the identification of specific gene set and cell-type activation patterns in cancer and normal tissue. Machine learning models scale to large genomic datasets and have provided state-of-the-art results in a variety of biomedical domains. However, model interpretation is critical to build knowledge and to generate hypotheses.

Advisor

Casey S. Greene

Date of degree

2019-01-01

Collection

Dissertations and Theses