Date of Award
Doctor of Philosophy (PhD)
Modern machine learning methods have been widely applied in genomics and metagenomics data analysis. This dissertation develops two new machine learning methods for modeling censored survival data and a deep learning method for predicting biosynthetic gene clusters in bacterial genomes. Analysis of censored survival data using high-dimensional genomic data plays an important role in modern biomedical and clinical research. Although much research has modeled survival data using proportional hazards models or survival probabilities, these methods rely heavily on the proportional hazards assumption. This dissertation focuses instead on methods for modeling mean survival time and restricted mean survival time (RMST). Methods for mean survival time regression are very limited, especially in high-dimensional settings. Two new methods for modeling the mean survival time are proposed and developed: methods for statistical inference in high-dimensional Tobit models under random and fixed censoring, and methods for estimation and inference of the heterogeneous RMST using random forests. Extensive simulations and analyses of several real data sets show that the proposed methods perform better than existing methods in estimating and predicting restricted mean survival time.
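For readers unfamiliar with the quantity being modeled, the restricted mean survival time up to a horizon tau is E[min(T, tau)], which equals the area under the survival curve from 0 to tau. The sketch below is a minimal illustration only: it computes a plug-in estimate from a Kaplan-Meier curve, which is a standard nonparametric approach and not the Tobit or random forest methods developed in the dissertation. The function name `km_rmst` and the toy data are assumptions made for this example.

```python
import numpy as np


def km_rmst(times, events, tau):
    """Area under the Kaplan-Meier survival curve on [0, tau].

    times  : observed follow-up times (event or censoring time)
    events : 1 if the event was observed, 0 if the time is censored
    tau    : restriction horizon
    NOTE: hypothetical helper for illustration, not the dissertation's method.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)

    # The Kaplan-Meier curve only drops at distinct observed event times.
    event_times = np.unique(times[events & (times <= tau)])

    surv = 1.0    # current value of the survival estimate
    rmst = 0.0    # accumulated area under the step function
    prev_t = 0.0
    for t in event_times:
        rmst += surv * (t - prev_t)               # area of the flat step before t
        at_risk = np.sum(times >= t)              # subjects still under observation
        deaths = np.sum((times == t) & events)    # events occurring exactly at t
        surv *= 1.0 - deaths / at_risk            # Kaplan-Meier product-limit update
        prev_t = t
    rmst += surv * (tau - prev_t)                 # final step up to the horizon
    return rmst


# Toy data: the second subject is censored at time 2.
print(km_rmst([1, 2, 3, 4], [1, 0, 1, 1], tau=4))
```

Without censoring and with tau beyond the largest observed time, this estimate reduces to the sample mean of the survival times, which is a convenient sanity check.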
The second part of this dissertation presents a deep learning method, DeepMBGC, for predicting biosynthetic gene clusters (BGCs) from sequencing data of known bacterial genomes. BGCs in bacterial genomes code for important small molecules and secondary metabolites. Based on the currently validated BGCs, the protein domain (Pfam) similarity network, and protein domain functions, we develop a deep learning method, DeepMBGC, for predicting BGCs and their types. DeepMBGC is the first model that effectively incorporates Pfam domain biological function information, Pfam domain clan information, and Pfam domain similarity networks from the EMBL database in BGC prediction. DeepMBGC uses a long short-term memory (LSTM) recurrent neural network (RNN) architecture to model sequence structure and a convolutional neural network (CNN) architecture for Pfam function encoding, and it incorporates a novel data augmentation method to overcome the limited number of true BGC cases. DeepMBGC can also be used to predict BGC classes. We show that DeepMBGC leads to reduced false positive rates in BGC identification and an improved ability to extrapolate and identify novel BGCs compared to existing machine learning methods. Specifically, DeepMBGC has higher F1 scores in both Pfam domain-level and BGC-level classification. We apply DeepMBGC to 5666 RefSeq bacterial genomes and predict a total of 161,026 BGCs, an average of 28.4 BGCs per genome. We summarize all the predicted BGCs, their functional classes, and their distributions across bacterial genomes.
Liu, Mingyang, "New Machine Learning Methods For Genomics And Metagenomics Applications" (2020). Publicly Accessible Penn Dissertations. 4063.